Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-09-26 Thread Huang, Ying
Christoph Hellwig  writes:

> Hi Ying,
>
>> Any update on this regression?
>
> Not really.  We've optimized everything we could in XFS without
> dropping the architecture that we really want to move to.  Now we're
> waiting for some MM behavior to be fixed that this uncovered.  But
> in the end we'll probably be stuck with a slight regression in this
> artificial workload.

I see.  Thanks for the update.  Please keep me posted.

Best Regards,
Huang, Ying


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-09-26 Thread Christoph Hellwig
Hi Ying,

> Any update on this regression?

Not really.  We've optimized everything we could in XFS without
dropping the architecture that we really want to move to.  Now we're
waiting for some MM behavior to be fixed that this uncovered.  But
in the end we'll probably be stuck with a slight regression in this
artificial workload.


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-09-25 Thread Huang, Ying
Hi, Christoph,

"Huang, Ying"  writes:

> Hi, Christoph,
>
> "Huang, Ying"  writes:
>
>> Christoph Hellwig  writes:
>>
>>> Snipping the long context:
>>>
>>> I think there are three observations here:
>>>
>>>  (1) removing the mark_page_accessed (which is the only significant
>>>  change in the parent commit)  hurts the
>>>  aim7/1BRD_48G-xfs-disk_rr-3000-performance/ivb44 test.
>>>  I'd still rather stick to the filemap version and let the
>>>  VM people sort it out.  How do the numbers for this test
>>>  look for XFS vs say ext4 and btrfs?
>>>  (2) lots of additional spinlock contention in the new case.  A quick
>>>  check shows that I fat-fingered my rewrite so that we do
>>>  the xfs_inode_set_eofblocks_tag call now for the pure lookup
>>>  case, and pretty much all new cycles come from that.
>>>  (3) Boy, are those xfs_inode_set_eofblocks_tag calls expensive, and
>>>  we're already doing way too many even without my little bug above.
>>>
>>> So I've force pushed a new version of the iomap-fixes branch with
>>> (2) fixed, and also a little patch to make xfs_inode_set_eofblocks_tag a
>>> lot less expensive slotted in before that.  Would be good to see
>>> the numbers with that.
>>
>> For the original reported regression, the test result is as follow,
>>
>> =
>> compiler/cpufreq_governor/debug-setup/disk/fs/kconfig/load/rootfs/tbox_group/test/testcase:
>>   
>> gcc-6/performance/profile/1BRD_48G/xfs/x86_64-rhel/3000/debian-x86_64-2015-02-07.cgz/ivb44/disk_wrt/aim7
>>
>> commit: 
>>   f0c6bcba74ac51cb77aadb33ad35cb2dc1ad1506 (parent of first bad commit)
>>   68a9f5e7007c1afa2cf6830b690a90d0187c0684 (first bad commit)
>>   99091700659f4df965e138b38b4fa26a29b7eade (base of your fixes branch)
>>   bf4dc6e4ecc2a3d042029319bc8cd4204c185610 (head of your fixes branch)
>>
>> f0c6bcba74ac51cb 68a9f5e7007c1afa2cf6830b69 99091700659f4df965e138b38b bf4dc6e4ecc2a3d042029319bc
>> ---------------- -------------------------- -------------------------- --------------------------
>>          %stddev       %change %stddev          %change %stddev          %change %stddev
>>     484435 ±  0%   -13.3%  420004 ±  0%     -17.0%  402250 ±  0%     -15.6%  408998 ±  0%  aim7.jobs-per-min
>
> It appears the original reported regression hasn't been resolved by your
> commit.  Could you take a look at the test results and the perf data?

Any update on this regression?

Best Regards,
Huang, Ying

>>
>> And the perf data is as follow,
>>
>>   "perf-profile.func.cycles-pp.intel_idle": 20.25,
>>   "perf-profile.func.cycles-pp.memset_erms": 11.72,
>>   "perf-profile.func.cycles-pp.copy_user_enhanced_fast_string": 8.37,
>>   "perf-profile.func.cycles-pp.__block_commit_write.isra.21": 3.49,
>>   "perf-profile.func.cycles-pp.block_write_end": 1.77,
>>   "perf-profile.func.cycles-pp.native_queued_spin_lock_slowpath": 1.63,
>>   "perf-profile.func.cycles-pp.unlock_page": 1.58,
>>   "perf-profile.func.cycles-pp.___might_sleep": 1.56,
>>   "perf-profile.func.cycles-pp.__block_write_begin_int": 1.33,
>>   "perf-profile.func.cycles-pp.iov_iter_copy_from_user_atomic": 1.23,
>>   "perf-profile.func.cycles-pp.up_write": 1.21,
>>   "perf-profile.func.cycles-pp.__mark_inode_dirty": 1.18,
>>   "perf-profile.func.cycles-pp.down_write": 1.06,
>>   "perf-profile.func.cycles-pp.mark_buffer_dirty": 0.94,
>>   "perf-profile.func.cycles-pp.generic_write_end": 0.92,
>>   "perf-profile.func.cycles-pp.__radix_tree_lookup": 0.91,
>>   "perf-profile.func.cycles-pp._raw_spin_lock": 0.81,
>>   "perf-profile.func.cycles-pp.entry_SYSCALL_64_fastpath": 0.79,
>>   "perf-profile.func.cycles-pp.__might_sleep": 0.79,
>>   "perf-profile.func.cycles-pp.xfs_file_iomap_begin_delay.isra.9": 0.7,
>>   "perf-profile.func.cycles-pp.__list_del_entry": 0.7,
>>   "perf-profile.func.cycles-pp.vfs_write": 0.69,
>>   "perf-profile.func.cycles-pp.drop_buffers": 0.68,
>>   "perf-profile.func.cycles-pp.xfs_file_write_iter": 0.67,
>>   "perf-profile.func.cycles-pp.rwsem_spin_on_owner": 0.67,
>>
>> Best Regards,
>> Huang, Ying
>> ___
>> LKP mailing list
>> l...@lists.01.org
>> https://lists.01.org/mailman/listinfo/lkp
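The %change columns in the comparison above are relative to the parent commit's throughput. A quick sketch of the arithmetic, using the mean jobs-per-min values from the table (abbreviated commit hashes for readability):

```python
# Reproduce the %change columns of the aim7.jobs-per-min comparison.
# Values are the mean jobs-per-min for each commit, taken from the table.
parent = 484435  # f0c6bcba74ac (parent of first bad commit)
results = {
    "68a9f5e7 (first bad commit)":      420004,
    "99091700 (base of fixes branch)":  402250,
    "bf4dc6e4 (head of fixes branch)":  408998,
}

def pct_change(new, base):
    """Percentage change of `new` relative to `base`."""
    return (new - base) / base * 100.0

for name, jobs in results.items():
    print(f"{name}: {pct_change(jobs, parent):+.1f}%")
```

This reproduces the -13.3%, -17.0% and -15.6% figures reported above.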


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-09-06 Thread Huang, Ying
Mel Gorman  writes:

> On Fri, Sep 02, 2016 at 09:32:58AM +1000, Dave Chinner wrote:
>> On Fri, Aug 19, 2016 at 04:08:34PM +0100, Mel Gorman wrote:
>> > On Thu, Aug 18, 2016 at 05:11:11PM +1000, Dave Chinner wrote:
>> > > On Thu, Aug 18, 2016 at 01:45:17AM +0100, Mel Gorman wrote:
>> > > > On Wed, Aug 17, 2016 at 04:49:07PM +0100, Mel Gorman wrote:
>> > > > > > Yes, we could try to batch the locking like DaveC already suggested
>> > > > > > (ie we could move the locking to the caller, and then make
>> > > > > > shrink_page_list() just try to keep the lock held for a few pages 
>> > > > > > if
>> > > > > > the mapping doesn't change), and that might result in fewer crazy
>> > > > > > cacheline ping-pongs overall. But that feels like exactly the wrong
>> > > > > > kind of workaround.
>> > > > > > 
>> > > > > 
>> > > > > Even if such batching was implemented, it would be very specific to 
>> > > > > the
>> > > > > case of a single large file filling LRUs on multiple nodes.
>> > > > > 
>> > > > 
>> > > > The latest Jason Bourne movie was sufficiently bad that I spent time
>> > > > thinking how the tree_lock could be batched during reclaim. It's not
>> > > > straight-forward but this prototype did not blow up on UMA and may be
>> > > > worth considering if Dave can test either approach has a positive 
>> > > > impact.
>> > > 
>> > > SO, I just did a couple of tests. I'll call the two patches "sleepy"
>> > > for the contention backoff patch and "bourney" for the Jason Bourne
>> > > inspired batching patch. This is an average of 3 runs, overwriting
>> > > a 47GB file on a machine with 16GB RAM:
>> > > 
>> > >  IO throughput   wall time __pv_queued_spin_lock_slowpath
>> > > vanilla  470MB/s 1m42s   25-30%
>> > > sleepy   295MB/s 2m43s   <1%
>> > > bourney  425MB/s 1m53s   25-30%
>> > > 
>> > 
>> > This is another blunt-force patch that
>> 
>> Sorry for taking so long to get back to this - had a bunch of other
>> stuff to do (e.g. XFS metadata CRCs have found their first compiler
>> bug) and haven't had time to test this.
>> 
>
> No problem. Thanks for getting back to me.
>
>> The blunt force approach seems to work ok:
>> 
>
> Ok, good to know. Unfortunately I found that it's not a universal win. For
> the swapping-to-fast-storage case (simulated with ramdisk), the batching is
> a bigger gain *except* in the single threaded case. Stalling kswapd in the
> "blunt force approach" severely regressed a streaming anonymous reader
> for all thread counts so it's not the right answer.
>
> I'm working on a series during spare time that tries to balance all the
> issues for both swapcache and filecache on different workloads but right
> now, the complexity is high and it's still "win some, lose some".
>
> As an aside for the LKP people using ramdisk for swap -- ramdisk considers
> itself to be rotational storage. It takes the paths that are optimised to
> minimise seeks but it's quite slow. When tree_lock contention is reduced,
> the workload is dominated by scan_swap_map. It's a one-line fix and I have
> a patch for it but it only really matters if ramdisk is being used as a
> simulator for swapping to fast storage.

We (LKP people) use drivers/nvdimm/pmem.c instead of drivers/block/brd.c
as the ramdisk, which considers itself to be non-rotational storage.

We also have a series to optimize other locks in the swap path, for
example batching swap space allocation and freeing, etc.  If your
batched removal of pages from the swap cache can be merged, that will
help us a lot!

Best Regards,
Huang, Ying
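The batching Ying mentions for swap space allocation can be sketched as grabbing several swap slots per acquisition of the device lock instead of one. This is only an illustrative user-space model (the `SwapDevice` class and batch size of 64 are made up, not the actual kernel series); the lock-acquisition counter stands in for the cacheline contention being avoided:

```python
from threading import Lock

# Toy model of batched swap-slot allocation: instead of taking the
# swap-device lock once per slot, grab a batch of slots per acquisition.

class SwapDevice:
    def __init__(self, nr_slots):
        self.free = list(range(nr_slots))
        self.lock = Lock()
        self.lock_acquisitions = 0

    def alloc(self, n=1):
        """Allocate up to n free slots under a single lock acquisition."""
        with self.lock:
            self.lock_acquisitions += 1
            batch, self.free = self.free[:n], self.free[n:]
            return batch

dev_single = SwapDevice(1024)
for _ in range(1024):
    dev_single.alloc(1)

dev_batched = SwapDevice(1024)
got = []
while len(got) < 1024:
    got.extend(dev_batched.alloc(64))

print(dev_single.lock_acquisitions, dev_batched.lock_acquisitions)  # 1024 vs 16
```

Allocating 1024 slots takes 1024 lock round-trips one-at-a-time but only 16 in batches of 64, which is the whole point of the series.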


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-09-06 Thread Mel Gorman
On Fri, Sep 02, 2016 at 09:32:58AM +1000, Dave Chinner wrote:
> On Fri, Aug 19, 2016 at 04:08:34PM +0100, Mel Gorman wrote:
> > On Thu, Aug 18, 2016 at 05:11:11PM +1000, Dave Chinner wrote:
> > > On Thu, Aug 18, 2016 at 01:45:17AM +0100, Mel Gorman wrote:
> > > > On Wed, Aug 17, 2016 at 04:49:07PM +0100, Mel Gorman wrote:
> > > > > > Yes, we could try to batch the locking like DaveC already suggested
> > > > > > (ie we could move the locking to the caller, and then make
> > > > > > shrink_page_list() just try to keep the lock held for a few pages if
> > > > > > the mapping doesn't change), and that might result in fewer crazy
> > > > > > cacheline ping-pongs overall. But that feels like exactly the wrong
> > > > > > kind of workaround.
> > > > > > 
> > > > > 
> > > > > Even if such batching was implemented, it would be very specific to 
> > > > > the
> > > > > case of a single large file filling LRUs on multiple nodes.
> > > > > 
> > > > 
> > > > The latest Jason Bourne movie was sufficiently bad that I spent time
> > > > thinking how the tree_lock could be batched during reclaim. It's not
> > > > straight-forward but this prototype did not blow up on UMA and may be
> > > > worth considering if Dave can test either approach has a positive 
> > > > impact.
> > > 
> > > SO, I just did a couple of tests. I'll call the two patches "sleepy"
> > > for the contention backoff patch and "bourney" for the Jason Bourne
> > > inspired batching patch. This is an average of 3 runs, overwriting
> > > a 47GB file on a machine with 16GB RAM:
> > > 
> > >   IO throughput   wall time __pv_queued_spin_lock_slowpath
> > > vanilla   470MB/s 1m42s   25-30%
> > > sleepy295MB/s 2m43s   <1%
> > > bourney   425MB/s 1m53s   25-30%
> > > 
> > 
> > This is another blunt-force patch that
> 
> Sorry for taking so long to get back to this - had a bunch of other
> stuff to do (e.g. XFS metadata CRCs have found their first compiler
> bug) and haven't had time to test this.
> 

No problem. Thanks for getting back to me.

> The blunt force approach seems to work ok:
> 

Ok, good to know. Unfortunately I found that it's not a universal win. For
the swapping-to-fast-storage case (simulated with ramdisk), the batching is
a bigger gain *except* in the single threaded case. Stalling kswapd in the
"blunt force approach" severely regressed a streaming anonymous reader
for all thread counts so it's not the right answer.

I'm working on a series during spare time that tries to balance all the
issues for both swapcache and filecache on different workloads but right
now, the complexity is high and it's still "win some, lose some".

As an aside for the LKP people using ramdisk for swap -- ramdisk considers
itself to be rotational storage. It takes the paths that are optimised to
minimise seeks but it's quite slow. When tree_lock contention is reduced,
the workload is dominated by scan_swap_map. It's a one-line fix and I have
a patch for it but it only really matters if ramdisk is being used as a
simulator for swapping to fast storage.

-- 
Mel Gorman
SUSE Labs
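The batching idea under discussion here, keeping a mapping's tree_lock held across a run of consecutive pages that share the same mapping rather than re-taking it per page, can be modelled in user-space terms as follows (a sketch of the concept only; the `Mapping` class and counters are illustrative, not kernel code):

```python
from itertools import groupby
from threading import Lock

# Toy model of reclaim taking mapping->tree_lock once per page vs.
# holding it across consecutive pages that share the same mapping.
# A "page" is represented here just by the mapping it belongs to.

class Mapping:
    def __init__(self):
        self.tree_lock = Lock()
        self.acquisitions = 0

def reclaim_per_page(page_mappings):
    for m in page_mappings:
        with m.tree_lock:            # one lock round-trip per page
            m.acquisitions += 1

def reclaim_batched(page_mappings):
    # groupby yields runs of consecutive pages with the same mapping.
    for m, run in groupby(page_mappings):
        with m.tree_lock:            # one round-trip per run
            m.acquisitions += 1
            for _ in run:
                pass                 # each page would be unlinked here

big_file = Mapping()
lru = [big_file] * 1000              # one large file filling the LRU
reclaim_per_page(lru)
per_page = big_file.acquisitions     # 1000

big_file.acquisitions = 0
reclaim_batched(lru)
batched = big_file.acquisitions      # 1
print(per_page, batched)
```

For the single-large-file case that triggered this regression, the batched walk collapses a thousand lock round-trips into one; the kernel version is harder because pages can be relocked, truncated, or migrated between lookups.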


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-09-01 Thread Dave Chinner
On Fri, Aug 19, 2016 at 04:08:34PM +0100, Mel Gorman wrote:
> On Thu, Aug 18, 2016 at 05:11:11PM +1000, Dave Chinner wrote:
> > On Thu, Aug 18, 2016 at 01:45:17AM +0100, Mel Gorman wrote:
> > > On Wed, Aug 17, 2016 at 04:49:07PM +0100, Mel Gorman wrote:
> > > > > Yes, we could try to batch the locking like DaveC already suggested
> > > > > (ie we could move the locking to the caller, and then make
> > > > > shrink_page_list() just try to keep the lock held for a few pages if
> > > > > the mapping doesn't change), and that might result in fewer crazy
> > > > > cacheline ping-pongs overall. But that feels like exactly the wrong
> > > > > kind of workaround.
> > > > > 
> > > > 
> > > > Even if such batching was implemented, it would be very specific to the
> > > > case of a single large file filling LRUs on multiple nodes.
> > > > 
> > > 
> > > The latest Jason Bourne movie was sufficiently bad that I spent time
> > > thinking how the tree_lock could be batched during reclaim. It's not
> > > straight-forward but this prototype did not blow up on UMA and may be
> > > worth considering if Dave can test either approach has a positive impact.
> > 
> > SO, I just did a couple of tests. I'll call the two patches "sleepy"
> > for the contention backoff patch and "bourney" for the Jason Bourne
> > inspired batching patch. This is an average of 3 runs, overwriting
> > a 47GB file on a machine with 16GB RAM:
> > 
> > IO throughput   wall time __pv_queued_spin_lock_slowpath
> > vanilla 470MB/s 1m42s   25-30%
> > sleepy  295MB/s 2m43s   <1%
> > bourney 425MB/s 1m53s   25-30%
> > 
> 
> This is another blunt-force patch that

Sorry for taking so long to get back to this - had a bunch of other
stuff to do (e.g. XFS metadata CRCs have found their first compiler
bug) and haven't had time to test this.

The blunt force approach seems to work ok:

            IO throughput   wall time   __pv_queued_spin_lock_slowpath
vanilla     470MB/s         1m42s       25-30%
sleepy      295MB/s         2m43s       <1%
bourney     425MB/s         1m53s       25-30%
blunt       470MB/s         1m41s       ~2%

Performance is pretty much the same as the vanilla kernel - maybe
a little bit faster if we consider median rather than mean results.

A snapshot profile from 'perf top -U' looks like:

  11.31%  [kernel]  [k] copy_user_generic_string
   3.59%  [kernel]  [k] get_page_from_freelist
   3.22%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
   2.80%  [kernel]  [k] __block_commit_write.isra.29
   2.14%  [kernel]  [k] __pv_queued_spin_lock_slowpath
   1.99%  [kernel]  [k] _raw_spin_lock
   1.98%  [kernel]  [k] wake_all_kswapds
   1.92%  [kernel]  [k] _raw_spin_lock_irqsave
   1.90%  [kernel]  [k] node_dirty_ok
   1.69%  [kernel]  [k] __wake_up_bit
   1.57%  [kernel]  [k] ___might_sleep
   1.49%  [kernel]  [k] __might_sleep
   1.24%  [kernel]  [k] __radix_tree_lookup
   1.18%  [kernel]  [k] kmem_cache_alloc
   1.13%  [kernel]  [k] update_fast_ctr
   1.11%  [kernel]  [k] radix_tree_tag_set
   1.08%  [kernel]  [k] clear_page_dirty_for_io
   1.06%  [kernel]  [k] down_write
   1.06%  [kernel]  [k] up_write
   1.01%  [kernel]  [k] unlock_page
   0.99%  [kernel]  [k] xfs_log_commit_cil
   0.97%  [kernel]  [k] __inc_node_state
   0.95%  [kernel]  [k] __memset
   0.89%  [kernel]  [k] xfs_do_writepage
   0.89%  [kernel]  [k] __list_del_entry
   0.87%  [kernel]  [k] __vfs_write
   0.85%  [kernel]  [k] xfs_inode_item_format
   0.84%  [kernel]  [k] shrink_page_list
   0.82%  [kernel]  [k] kmem_cache_free
   0.79%  [kernel]  [k] radix_tree_tag_clear
   0.78%  [kernel]  [k] _raw_spin_lock_irq
   0.77%  [kernel]  [k] _raw_spin_unlock_irqrestore
   0.76%  [kernel]  [k] node_page_state
   0.72%  [kernel]  [k] xfs_count_page_state
   0.68%  [kernel]  [k] xfs_file_aio_write_checks
   0.65%  [kernel]  [k] wakeup_kswapd

There's still a lot of time in locking, but it's no longer obviously
being spent spinning on lock contention. We seem to be spending a lot of
time trying to wake kswapds now - the context switch rate of the
workload is only 400-500/s, so there aren't a lot of sleeps and
wakeups actually occurring.

Regardless, throughput and locking behaviour seem to be a lot
better than the other patches...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-25 Thread Mel Gorman
On Wed, Aug 24, 2016 at 08:40:37AM -0700, Huang, Ying wrote:
> Mel Gorman  writes:
> 
> > On Wed, Aug 17, 2016 at 04:49:07PM +0100, Mel Gorman wrote:
> >> > Yes, we could try to batch the locking like DaveC already suggested
> >> > (ie we could move the locking to the caller, and then make
> >> > shrink_page_list() just try to keep the lock held for a few pages if
> >> > the mapping doesn't change), and that might result in fewer crazy
> >> > cacheline ping-pongs overall. But that feels like exactly the wrong
> >> > kind of workaround.
> >> > 
> >> 
> >> Even if such batching was implemented, it would be very specific to the
> >> case of a single large file filling LRUs on multiple nodes.
> >> 
> >
> > The latest Jason Bourne movie was sufficiently bad that I spent time
> > thinking how the tree_lock could be batched during reclaim. It's not
> > straight-forward but this prototype did not blow up on UMA and may be
> > worth considering if Dave can test either approach has a positive impact.
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 374d95d04178..926110219cd9 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -621,19 +621,39 @@ static pageout_t pageout(struct page *page, struct 
> > address_space *mapping,
> > return PAGE_CLEAN;
> >  }
> 
> We found this patch helps a lot for swap-out performance, where there is
> usually only one mapping for all swap pages.

Yeah, I expected it would be an unconditional win on swapping. I just
did not concentrate on it very much as it was not the problem at hand.

> In our 16-process
> sequential swap write test case for a ramdisk on a Xeon E5 v3 machine,
> the swap out throughput improved 40.4%, from ~0.97GB/s to ~1.36GB/s.

Ok, so the main benefit would be for ultra-fast storage. I doubt it's noticeable
on slow disks.

> What's your plan for this patch?  If it can be merged soon, that will be
> great!
> 

Until this mail, no plan. I'm still waiting to hear if Dave's test case
has improved with the latest prototype for reducing contention.

> I found some issues in the original patch when working with the swap cache.
> Below are my fixes to make it work for the swap cache.
> 

Thanks for the fix. I'm going offline today for a few days but I added a
todo item to finish this patch at some point. I won't be rushing it but
it'll get done eventually.

-- 
Mel Gorman
SUSE Labs


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-24 Thread Huang, Ying
Hi, Mel,

Mel Gorman  writes:

> On Wed, Aug 17, 2016 at 04:49:07PM +0100, Mel Gorman wrote:
>> > Yes, we could try to batch the locking like DaveC already suggested
>> > (ie we could move the locking to the caller, and then make
>> > shrink_page_list() just try to keep the lock held for a few pages if
>> > the mapping doesn't change), and that might result in fewer crazy
>> > cacheline ping-pongs overall. But that feels like exactly the wrong
>> > kind of workaround.
>> > 
>> 
>> Even if such batching was implemented, it would be very specific to the
>> case of a single large file filling LRUs on multiple nodes.
>> 
>
> The latest Jason Bourne movie was sufficiently bad that I spent time
> thinking how the tree_lock could be batched during reclaim. It's not
> straight-forward but this prototype did not blow up on UMA and may be
> worth considering if Dave can test either approach has a positive impact.
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 374d95d04178..926110219cd9 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -621,19 +621,39 @@ static pageout_t pageout(struct page *page, struct 
> address_space *mapping,
>   return PAGE_CLEAN;
>  }

We found this patch helps a lot for swap-out performance, where there is
usually only one mapping for all swap pages.  In our 16-process
sequential swap write test case for a ramdisk on a Xeon E5 v3 machine,
the swap out throughput improved 40.4%, from ~0.97GB/s to ~1.36GB/s.
What's your plan for this patch?  If it can be merged soon, that will be
great!

I found some issues in the original patch when working with the swap
cache.  Below are my fixes to make it work for the swap cache.

Best Regards,
Huang, Ying

>

diff --git a/mm/vmscan.c b/mm/vmscan.c
index ac5fbff..dcaf295 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -623,22 +623,28 @@ static pageout_t pageout(struct page *page, struct 
address_space *mapping,
 
 static void finalise_remove_mapping(struct list_head *swapcache,
struct list_head *filecache,
+   struct list_head *free_pages,
void (*freepage)(struct page *))
 {
struct page *page;
 
while (!list_empty(swapcache)) {
-   swp_entry_t swap = { .val = page_private(page) };
+   swp_entry_t swap;
page = lru_to_page(swapcache);
list_del(&page->lru);
+   swap.val = page_private(page);
swapcache_free(swap);
set_page_private(page, 0);
+   if (free_pages)
+   list_add(&page->lru, free_pages);
}
 
while (!list_empty(filecache)) {
-   page = lru_to_page(swapcache);
+   page = lru_to_page(filecache);
list_del(&page->lru);
freepage(page);
+   if (free_pages)
+   list_add(&page->lru, free_pages);
}
 }
 
@@ -649,7 +655,8 @@ static void finalise_remove_mapping(struct list_head 
*swapcache,
 static int __remove_mapping_page(struct address_space *mapping,
 struct page *page, bool reclaimed,
 struct list_head *swapcache,
-struct list_head *filecache)
+struct list_head *filecache,
+struct list_head *free_pages)
 {
BUG_ON(!PageLocked(page));
BUG_ON(mapping != page_mapping(page));
@@ -722,6 +729,8 @@ static int __remove_mapping_page(struct address_space 
*mapping,
__delete_from_page_cache(page, shadow);
if (freepage)
list_add(&page->lru, filecache);
+   else if (free_pages)
+   list_add(&page->lru, free_pages);
}
 
return 1;
@@ -747,7 +756,7 @@ int remove_mapping(struct address_space *mapping, struct 
page *page)
spin_lock_irqsave(&mapping->tree_lock, flags);
freepage = mapping->a_ops->freepage;
 
-   if (__remove_mapping_page(mapping, page, false, &swapcache, 
&filecache)) {
+   if (__remove_mapping_page(mapping, page, false, &swapcache, &filecache, 
NULL)) {
/*
 * Unfreezing the refcount with 1 rather than 2 effectively
 * drops the pagecache ref for us without requiring another
@@ -757,7 +766,7 @@ int remove_mapping(struct address_space *mapping, struct 
page *page)
ret = 1;
}
spin_unlock_irqrestore(&mapping->tree_lock, flags);
-   finalise_remove_mapping(&swapcache, &filecache, freepage);
+   finalise_remove_mapping(&swapcache, &filecache, NULL, freepage);
return ret;
 }
 
@@ -776,29 +785,28 @@ static void remove_mapping_list(struct list_head 
*mapping_list,
page = lru_to_page(mapping_list);
list_del(&page->lru);
 
- 
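The shape of Ying's fix above (the patch is truncated by the archive) is a two-phase removal: pages are unlinked onto the swapcache/filecache lists while tree_lock is held, and the expensive free callbacks run afterwards with the lock dropped. In user-space terms the pattern looks like this (a sketch only; the dict-backed cache and names are illustrative, not the kernel code):

```python
from threading import Lock

# Two-phase removal: phase 1 unlinks entries under the tree_lock-like
# lock into a deferred list; phase 2 ("finalise") runs the free
# callbacks with the lock already dropped, mirroring the split between
# __remove_mapping_page() and finalise_remove_mapping().

tree_lock = Lock()
cache = {"a": 1, "b": 2, "c": 3}
freed = []

def remove_mapping_batch(keys):
    deferred = []                # stands in for the swapcache/filecache lists
    with tree_lock:              # keep the critical section short
        for k in keys:
            deferred.append((k, cache.pop(k)))
    for k, v in deferred:        # finalise: freepage() outside the lock
        freed.append(k)

remove_mapping_batch(["a", "c"])
print(sorted(freed), cache)      # ['a', 'c'] {'b': 2}
```

Keeping the callbacks out of the critical section is what makes the batching pay off: the lock hold time stays bounded no matter how expensive freeing is.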

Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-22 Thread Huang, Ying
Hi, Christoph,

"Huang, Ying"  writes:

> Christoph Hellwig  writes:
>
>> Snipping the long context:
>>
>> I think there are three observations here:
>>
>>  (1) removing the mark_page_accessed (which is the only significant
>>  change in the parent commit)  hurts the
>>  aim7/1BRD_48G-xfs-disk_rr-3000-performance/ivb44 test.
>>  I'd still rather stick to the filemap version and let the
>>  VM people sort it out.  How do the numbers for this test
>>  look for XFS vs say ext4 and btrfs?
>>  (2) lots of additional spinlock contention in the new case.  A quick
>>  check shows that I fat-fingered my rewrite so that we do
>>  the xfs_inode_set_eofblocks_tag call now for the pure lookup
>>  case, and pretty much all new cycles come from that.
>>  (3) Boy, are those xfs_inode_set_eofblocks_tag calls expensive, and
>>  we're already doing way too many even without my little bug above.
>>
>> So I've force pushed a new version of the iomap-fixes branch with
>> (2) fixed, and also a little patch to make xfs_inode_set_eofblocks_tag a
>> lot less expensive slotted in before that.  Would be good to see
>> the numbers with that.
>
> For the original reported regression, the test result is as follow,
>
> =
> compiler/cpufreq_governor/debug-setup/disk/fs/kconfig/load/rootfs/tbox_group/test/testcase:
>   
> gcc-6/performance/profile/1BRD_48G/xfs/x86_64-rhel/3000/debian-x86_64-2015-02-07.cgz/ivb44/disk_wrt/aim7
>
> commit: 
>   f0c6bcba74ac51cb77aadb33ad35cb2dc1ad1506 (parent of first bad commit)
>   68a9f5e7007c1afa2cf6830b690a90d0187c0684 (first bad commit)
>   99091700659f4df965e138b38b4fa26a29b7eade (base of your fixes branch)
>   bf4dc6e4ecc2a3d042029319bc8cd4204c185610 (head of your fixes branch)
>
> f0c6bcba74ac51cb 68a9f5e7007c1afa2cf6830b69 99091700659f4df965e138b38b bf4dc6e4ecc2a3d042029319bc
> ---------------- -------------------------- -------------------------- --------------------------
>          %stddev       %change %stddev          %change %stddev          %change %stddev
>     484435 ±  0%   -13.3%  420004 ±  0%     -17.0%  402250 ±  0%     -15.6%  408998 ±  0%  aim7.jobs-per-min

It appears the original reported regression hasn't been resolved by your
commit.  Could you take a look at the test results and the perf data?

Best Regards,
Huang, Ying

>
> And the perf data is as follow,
>
>   "perf-profile.func.cycles-pp.intel_idle": 20.25,
>   "perf-profile.func.cycles-pp.memset_erms": 11.72,
>   "perf-profile.func.cycles-pp.copy_user_enhanced_fast_string": 8.37,
>   "perf-profile.func.cycles-pp.__block_commit_write.isra.21": 3.49,
>   "perf-profile.func.cycles-pp.block_write_end": 1.77,
>   "perf-profile.func.cycles-pp.native_queued_spin_lock_slowpath": 1.63,
>   "perf-profile.func.cycles-pp.unlock_page": 1.58,
>   "perf-profile.func.cycles-pp.___might_sleep": 1.56,
>   "perf-profile.func.cycles-pp.__block_write_begin_int": 1.33,
>   "perf-profile.func.cycles-pp.iov_iter_copy_from_user_atomic": 1.23,
>   "perf-profile.func.cycles-pp.up_write": 1.21,
>   "perf-profile.func.cycles-pp.__mark_inode_dirty": 1.18,
>   "perf-profile.func.cycles-pp.down_write": 1.06,
>   "perf-profile.func.cycles-pp.mark_buffer_dirty": 0.94,
>   "perf-profile.func.cycles-pp.generic_write_end": 0.92,
>   "perf-profile.func.cycles-pp.__radix_tree_lookup": 0.91,
>   "perf-profile.func.cycles-pp._raw_spin_lock": 0.81,
>   "perf-profile.func.cycles-pp.entry_SYSCALL_64_fastpath": 0.79,
>   "perf-profile.func.cycles-pp.__might_sleep": 0.79,
>   "perf-profile.func.cycles-pp.xfs_file_iomap_begin_delay.isra.9": 0.7,
>   "perf-profile.func.cycles-pp.__list_del_entry": 0.7,
>   "perf-profile.func.cycles-pp.vfs_write": 0.69,
>   "perf-profile.func.cycles-pp.drop_buffers": 0.68,
>   "perf-profile.func.cycles-pp.xfs_file_write_iter": 0.67,
>   "perf-profile.func.cycles-pp.rwsem_spin_on_owner": 0.67,
>
> Best Regards,
> Huang, Ying
> ___
> LKP mailing list
> l...@lists.01.org
> https://lists.01.org/mailman/listinfo/lkp


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-20 Thread Mel Gorman
On Sat, Aug 20, 2016 at 09:48:39AM +1000, Dave Chinner wrote:
> On Fri, Aug 19, 2016 at 11:49:46AM +0100, Mel Gorman wrote:
> > On Thu, Aug 18, 2016 at 03:25:40PM -0700, Linus Torvalds wrote:
> > > It *could* be as simple/stupid as just saying "let's allocate the page
> > > cache for new pages from the current node" - and if the process that
> > > dirties pages just stays around on one single node, that might already
> > > be sufficient.
> > > 
> > > So just for testing purposes, you could try changing that
> > > 
> > > return alloc_pages(gfp, 0);
> > > 
> > > in __page_cache_alloc() into something like
> > > 
> > > return alloc_pages_node(cpu_to_node(raw_smp_processor_id())), 
> > > gfp, 0);
> > > 
> > > or something.
> > > 
> > 
> > The test would be interesting but I believe that keeping heavy writers
> > on one node will force them to stall early on dirty balancing even if
> > there is plenty of free memory on other nodes.
> 
> Well, it depends on the speed of the storage. The higher the speed
> of the storage, the less we care about stalling on dirty pages
> during reclaim. i.e. faster storage == shorter stalls. We really
> should stop thinking we need to optimise reclaim purely for the
> benefit of slow disks.  500MB/s write speed with latencies of
> under a couple of milliseconds is common hardware these days. pcie
> based storage (e.g. m2, nvme) is rapidly becoming commonplace and
> they can easily do 1-2GB/s write speeds.
> 

I partially agree. I've been of the opinion for a long time that a
dirty_time limit would be desirable: cap the amount of dirty data by
the time required to sync it, and pick a default like 5 seconds. It's
non-trivial as the write speed of all BDIs would have to be estimated,
and on rotary storage the estimate would be unreliable.

A short-term practical idea would be to distribute pages for writing
only when the dirty limit is almost reached on a given node. For fast
storage, the distribution may never happen.

Neither idea would actually impact the current problem though unless it
was combined with discarding clean cache aggressively if the underlying
storage is fast. Hence, it would still be nice if the contention problem
could be mitigated. Did that last patch help any?

-- 
Mel Gorman
SUSE Labs
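The dirty_time idea reduces to simple per-BDI arithmetic: allow only as many dirty bytes as the device can sync within a target time. A sketch of that calculation (the 5-second default is the one suggested above; the bandwidth figures and function names are illustrative):

```python
# Per-BDI time-based dirty limit: dirty bytes allowed = estimated
# writeback bandwidth * target seconds to sync.  The hard part, as
# noted above, is estimating the bandwidth reliably, especially on
# rotary storage.

TARGET_SYNC_SECONDS = 5
MB = 1 << 20

def dirty_limit_bytes(write_bw_bytes_per_sec, target=TARGET_SYNC_SECONDS):
    return write_bw_bytes_per_sec * target

print(dirty_limit_bytes(100 * MB) // MB)    # ~100MB/s disk ->   500 MB dirty allowed
print(dirty_limit_bytes(2048 * MB) // MB)   # ~2GB/s nvme   -> 10240 MB dirty allowed
```

Note how the limit naturally scales with the storage rather than with RAM size, which is exactly the property percentage-of-memory limits lack.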


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-19 Thread Linus Torvalds
On Fri, Aug 19, 2016 at 4:48 PM, Dave Chinner  wrote:
>
> Well, it depends on the speed of the storage. The higher the speed
> of the storage, the less we care about stalling on dirty pages
> during reclaim

Actually, that's largely true independently of the speed of the storage, I feel.

On really fast storage, you might as well push it out, and buffering
lots of dirty memory is pointless. And on really slow storage, buffering
lots of dirty memory is absolutely *horrible* from a latency
standpoint.

So I don't think this is about fast-vs-slow disks.

I think a lot of our "let's aggressively buffer dirty data" is
entirely historical. When you had 16MB of RAM in a workstation,
aggressively using half of it for writeback caches meant that you
could do things like untar source trees without waiting for IO.

But when you have 16GB of RAM in a workstation, and terabytes of RAM
in multi-node big machines, it's kind of silly to talk about
"percentages of memory available" for dirty data. I think it's likely
silly to even see "one node worth of memory" as being some limiter.

So I think we should try to avoid stalling on dirty pages during
reclaim by simply aiming to have fewer dirty pages in the first place.
Not because the stall is shorter on a fast disk, but because we just
shouldn't use that much memory for dirty data.

 Linus
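Linus's point about "percentages of memory" being historical can be made concrete: the same dirty ratio that was harmless at 16MB of RAM implies many minutes of writeback on a big machine behind a slow disk. A rough sketch (the 20% ratio and 100MB/s disk speed are illustrative numbers, not kernel defaults being asserted):

```python
# How long does it take to drain a "percentage of memory" dirty limit?
# The same 20% ratio scales from trivial to absurd as RAM grows.

MB, GB = 1 << 20, 1 << 30

def drain_seconds(ram_bytes, dirty_ratio, write_bw_bytes_per_sec):
    return ram_bytes * dirty_ratio / write_bw_bytes_per_sec

disk_bw = 100 * MB          # ~100MB/s spinning disk

for ram in (16 * MB, 16 * GB, 1024 * GB):
    secs = drain_seconds(ram, 0.20, disk_bw)
    print(f"{ram / GB:8.3f} GB RAM -> {secs:8.2f}s to drain dirty data")
```

At 16MB of RAM the buffered data drains in a fraction of a second; at a terabyte it is over half an hour, which is why an absolute or time-based cap makes more sense than a percentage.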


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-19 Thread Dave Chinner
On Fri, Aug 19, 2016 at 11:49:46AM +0100, Mel Gorman wrote:
> On Thu, Aug 18, 2016 at 03:25:40PM -0700, Linus Torvalds wrote:
> > It *could* be as simple/stupid as just saying "let's allocate the page
> > cache for new pages from the current node" - and if the process that
> > dirties pages just stays around on one single node, that might already
> > be sufficient.
> > 
> > So just for testing purposes, you could try changing that
> > 
> > return alloc_pages(gfp, 0);
> > 
> > in __page_cache_alloc() into something like
> > 
> > return alloc_pages_node(cpu_to_node(raw_smp_processor_id())), gfp, 
> > 0);
> > 
> > or something.
> > 
> 
> The test would be interesting but I believe that keeping heavy writers
> on one node will force them to stall early on dirty balancing even if
> there is plenty of free memory on other nodes.

Well, it depends on the speed of the storage. The higher the speed
of the storage, the less we care about stalling on dirty pages
during reclaim. i.e. faster storage == shorter stalls. We really
should stop thinking we need to optimise reclaim purely for the
benefit of slow disks.  500MB/s write speed with latencies of
under a couple of milliseconds is common hardware these days. pcie
based storage (e.g. m2, nvme) is rapidly becoming commonplace and
they can easily do 1-2GB/s write speeds.

The fast storage devices that are arriving need to be treated
more like a fast network device (e.g. a pci-e 4x nvme SSD has the
throughput of 2x10GbE devices). We have to consider if buffering
streaming data in the page cache for any longer than it takes to get
the data to userspace or to disk is worth the cost of reclaiming it
from the page cache.

Really, the question that needs to be answered is this: if we can
pull data from the storage at similar speeds and latencies as we can
from the page cache, then *why are we caching that data*?

We've already made that "don't cache for fast storage" decision in
the case of pmem - the DAX IO path is slowly moving towards making
full use of the mapping infrastructure for all its tracking
requirements. pcie based storage is a bit slower than pmem, but
the principle is the same - the storage is sufficiently fast that
caching only really makes sense for data that is really hot...

I think the underlying principle here is that the faster the backing
device, the less we should cache and buffer the device in the OS. I
suspect a good initial approximation of "stickiness" for the page
cache would the speed of writeback as measured by the BDI underlying
the mapping.
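Dave's "stickiness scaled by writeback speed" idea could be sketched as below. This is purely illustrative userspace C, not kernel code: the function name, the 100MB/s reference point, and the linear scaling rule are all invented here; the only point is that faster backing devices get a smaller retention weight, so their clean pages are reclaimed sooner.

```c
/*
 * Toy model: derive a page cache "stickiness" weight (1..100) from a
 * BDI's measured write bandwidth in MB/s.  A slow spinning disk keeps
 * full stickiness; a fast pcie device gets a small weight.  All names
 * and constants are hypothetical.
 */
static unsigned int cache_stickiness(unsigned long write_bw_mbs)
{
	unsigned long ref_bw = 100;	/* ~spinning-disk write speed */
	unsigned long w;

	if (write_bw_mbs == 0)
		return 100;		/* unknown bandwidth: keep caching */
	w = (ref_bw * 100) / write_bw_mbs;
	return w > 100 ? 100 : (w < 1 ? 1 : w);
}
```

Under this toy rule the 500MB/s SATA SSD in the thread would get a weight of 20 and a 2GB/s nvme device a weight of 5.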

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-19 Thread Mel Gorman
On Thu, Aug 18, 2016 at 05:11:11PM +1000, Dave Chinner wrote:
> On Thu, Aug 18, 2016 at 01:45:17AM +0100, Mel Gorman wrote:
> > On Wed, Aug 17, 2016 at 04:49:07PM +0100, Mel Gorman wrote:
> > > > Yes, we could try to batch the locking like DaveC already suggested
> > > > (ie we could move the locking to the caller, and then make
> > > > shrink_page_list() just try to keep the lock held for a few pages if
> > > > the mapping doesn't change), and that might result in fewer crazy
> > > > cacheline ping-pongs overall. But that feels like exactly the wrong
> > > > kind of workaround.
> > > > 
> > > 
> > > Even if such batching was implemented, it would be very specific to the
> > > case of a single large file filling LRUs on multiple nodes.
> > > 
> > 
> > The latest Jason Bourne movie was sufficiently bad that I spent time
> > thinking how the tree_lock could be batched during reclaim. It's not
> > straight-forward but this prototype did not blow up on UMA and may be
> > worth considering if Dave can test whether either approach has a positive impact.
> 
> So, I just did a couple of tests. I'll call the two patches "sleepy"
> for the contention backoff patch and "bourney" for the Jason Bourne
> inspired batching patch. This is an average of 3 runs, overwriting
> a 47GB file on a machine with 16GB RAM:
> 
>   IO throughput   wall time __pv_queued_spin_lock_slowpath
> vanilla   470MB/s 1m42s   25-30%
> sleepy295MB/s 2m43s   <1%
> bourney   425MB/s 1m53s   25-30%
> 

This is another blunt-force patch that

a) stalls all but one kswapd instance when tree_lock contention occurs
b) marks a pgdat congested when tree_lock contention is encountered
   which may cause direct reclaimers to wait_iff_congested until
   kswapd finishes balancing the node

I tested this on a KVM instance running on a 4-socket box. The vCPUs
were bound to pCPUs and the memory nodes in the KVM mapped to physical
memory nodes. Without the patch 3% of kswapd cycles were spent on
locking. With the patch, the cycle count was 0.23%

xfs_io contention was reduced from 0.63% to 0.39% which is not perfect.
It can be reduced by stalling all kswapd instances but then xfs_io direct
reclaims and throughput drops.

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index d572b78b65e1..f6d3e886f405 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -532,6 +532,7 @@ enum pgdat_flags {
 * many pages under writeback
 */
PGDAT_RECLAIM_LOCKED,   /* prevents concurrent reclaim */
+   PGDAT_CONTENDED,/* kswapd contending on tree_lock */
 };
 
 static inline unsigned long zone_end_pfn(const struct zone *zone)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 374d95d04178..64ca2148755c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -621,19 +621,43 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
return PAGE_CLEAN;
 }
 
+static atomic_t kswapd_contended = ATOMIC_INIT(0);
+
 /*
  * Same as remove_mapping, but if the page is removed from the mapping, it
  * gets returned with a refcount of 0.
  */
 static int __remove_mapping(struct address_space *mapping, struct page *page,
-   bool reclaimed)
+   bool reclaimed, unsigned long *nr_contended)
 {
unsigned long flags;
 
BUG_ON(!PageLocked(page));
BUG_ON(mapping != page_mapping(page));
 
-   spin_lock_irqsave(&mapping->tree_lock, flags);
+   if (!nr_contended || !current_is_kswapd())
+   spin_lock_irqsave(&mapping->tree_lock, flags);
+   else {
+   /* Account for trylock contentions in kswapd */
+   if (!spin_trylock_irqsave(&mapping->tree_lock, flags)) {
+   pg_data_t *pgdat = page_pgdat(page);
+   int nr_kswapd;
+
+   /* Account for contended pages and contended kswapds */
+   (*nr_contended)++;
+   if (!test_and_set_bit(PGDAT_CONTENDED, &pgdat->flags))
+   nr_kswapd = atomic_inc_return(&kswapd_contended);
+   else
+   nr_kswapd = atomic_read(&kswapd_contended);
+   BUG_ON(nr_kswapd > nr_online_nodes || nr_kswapd < 0);
+
+   /* Stall kswapd if multiple kswapds are contending */
+   if (nr_kswapd > 1)
+   congestion_wait(BLK_RW_ASYNC, HZ/10);
+
+   spin_lock_irqsave(&mapping->tree_lock, flags);
+   }
+   }
/*
 * The non racy check for a busy page.
 *
@@ -719,7 +743,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
  */
 int remove_mapping(struct address_space *mapping, struct page *page)
 {

Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-19 Thread Mel Gorman
On Thu, Aug 18, 2016 at 03:25:40PM -0700, Linus Torvalds wrote:
> >> In fact, looking at the __page_cache_alloc(), we already have that
> >> "spread pages out" logic. I'm assuming Dave doesn't actually have that
> >> bit set (I don't think it's the default), but I'm also envisioning
> >> that maybe we could extend on that notion, and try to spread out
> >> allocations in general, but keep page allocations from one particular
> >> mapping within one node.
> >
> > CONFIG_CPUSETS=y
> >
> > But I don't have any cpusets configured (unless systemd is doing
> > something wacky under the covers) so the page spread bit should not
> > be set.
> 
> Yeah, but even when it's not set we just do a generic alloc_pages(),
> which is just going to fill up all nodes. Not perhaps quite as "spread
> out", but there's obviously no attempt to try to be node-aware either.
> 

There is a slight difference. Reads should fill the nodes in turn but
dirty pages (__GFP_WRITE) get distributed to balance the number of dirty
pages on each node to avoid hitting dirty balance limits prematurely.

Yesterday I tried a patch that avoids distributing to remote nodes close
to the high watermark to avoid waking remote kswapd instances. It added a
lot of overhead to the fast path (3%) which hurts every writer but did not
reduce contention enough on the special case of writing a single large file.

As an aside, the dirty distribution check itself is very expensive so I
prototyped something that does the expensive calculations on a vmstat
update. Not sure if it'll work but it's a side issue.

> So _if_ we come up with some reasonable way to say "let's keep the
> pages of this mapping together", we could try to do it in that
> numa-aware __page_cache_alloc().
> 
> It *could* be as simple/stupid as just saying "let's allocate the page
> cache for new pages from the current node" - and if the process that
> dirties pages just stays around on one single node, that might already
> be sufficient.
> 
> So just for testing purposes, you could try changing that
> 
> return alloc_pages(gfp, 0);
> 
> in __page_cache_alloc() into something like
> 
> return alloc_pages_node(cpu_to_node(raw_smp_processor_id()), gfp, 0);
> 
> or something.
> 

The test would be interesting but I believe that keeping heavy writers
on one node will force them to stall early on dirty balancing even if
there is plenty of free memory on other nodes.

-- 
Mel Gorman
SUSE Labs


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-19 Thread Michal Hocko
On Thu 18-08-16 15:25:40, Linus Torvalds wrote:
[...]
> So just for testing purposes, you could try changing that
> 
> return alloc_pages(gfp, 0);
> 
> in __page_cache_alloc() into something like
> 
> return alloc_pages_node(cpu_to_node(raw_smp_processor_id()), gfp, 0);

That would break mempolicies AFAICS. Anyway, I might be missing
something (the mempolicy code has a strange sense of aesthetics) but the
normal case without any explicit mempolicy should use default_policy
which is MPOL_PREFERRED and MPOL_F_LOCAL which means numa_node_id() so
the local node. So the above two should do the same thing unless I am
missing something.
-- 
Michal Hocko
SUSE Labs


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-18 Thread Dave Chinner
On Thu, Aug 18, 2016 at 10:55:01AM -0700, Linus Torvalds wrote:
> On Thu, Aug 18, 2016 at 6:24 AM, Mel Gorman  
> wrote:
> > On Thu, Aug 18, 2016 at 05:11:11PM +1000, Dave Chinner wrote:
> >> FWIW, I just remembered about /proc/sys/vm/zone_reclaim_mode.
> >>
> >
> > That is a terrifying "fix" for this problem. It just happens to work
> > because there is no spillover to other nodes so only one kswapd instance
> > is potentially active.
> 
> Well, it may be a terrifying fix, but it does bring up an intriguing
> notion: maybe what we should think about is to make the actual page
> cache allocations be more "node-sticky" for a particular mapping? Not
> some hard node binding, but if we were to make a single mapping *tend*
> to allocate pages primarily within the same node, that would have the
> kind of secondary advantage that it would avoid the cross-node mapping
> locking.

For streaming or use-once IO it makes a lot of sense to restrict the
locality of the page cache. The faster the IO device, the less dirty
page buffering we need to maintain full device bandwidth. And the
larger the machine the greater the effect of global page cache
pollution on the other applications is.

> Think of it as a gentler "guiding" fix to the spinlock contention
> issue than a hard hammer.
> 
> And trying to (at least initially) keep the allocations of one
> particular file to one particular node sounds like it could have other
> locality advantages too.
> 
> In fact, looking at the __page_cache_alloc(), we already have that
> "spread pages out" logic. I'm assuming Dave doesn't actually have that
> bit set (I don't think it's the default), but I'm also envisioning
> that maybe we could extend on that notion, and try to spread out
> allocations in general, but keep page allocations from one particular
> mapping within one node.

CONFIG_CPUSETS=y

But I don't have any cpusets configured (unless systemd is doing
something wacky under the covers) so the page spread bit should not
be set.

> The fact that zone_reclaim_mode really improves on Dave's numbers
> *that* dramatically does seem to imply that there is something to be
> said for this.
> 
> We do *not* want to limit the whole page cache to a particular node -
> that sounds very unreasonable in general. But limiting any particular
> file mapping (by default - I'm sure there are things like databases
> that just want their one DB file to take over all of memory) to a
> single node sounds much less unreasonable.
> 
> What do you guys think? Worth exploring?

The problem is that whenever we turn this sort of behaviour on, some
> benchmark regresses because it no longer holds its working set in
the page cache, leading to the change being immediately reverted.
Enterprise java benchmarks ring a bell, for some reason.

Hence my comment above about needing it to be tied into specific
"use-once-only" page cache behaviours. I know we have working set
estimation, fadvise modes and things like readahead that help track
sequential and use-once access patterns, but I'm not sure how we can
tie that all together

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-18 Thread Linus Torvalds
On Thu, Aug 18, 2016 at 2:19 PM, Dave Chinner  wrote:
>
> For streaming or use-once IO it makes a lot of sense to restrict the
> locality of the page cache. The faster the IO device, the less dirty
> page buffering we need to maintain full device bandwidth. And the
> larger the machine the greater the effect of global page cache
> pollution on the other applications is.

Yes. But I agree with you that it might be very hard to actually get
something that does a good job automagically.

>> In fact, looking at the __page_cache_alloc(), we already have that
>> "spread pages out" logic. I'm assuming Dave doesn't actually have that
>> bit set (I don't think it's the default), but I'm also envisioning
>> that maybe we could extend on that notion, and try to spread out
>> allocations in general, but keep page allocations from one particular
>> mapping within one node.
>
> CONFIG_CPUSETS=y
>
> But I don't have any cpusets configured (unless systemd is doing
> something wacky under the covers) so the page spread bit should not
> be set.

Yeah, but even when it's not set we just do a generic alloc_pages(),
which is just going to fill up all nodes. Not perhaps quite as "spread
out", but there's obviously no attempt to try to be node-aware either.

So _if_ we come up with some reasonable way to say "let's keep the
pages of this mapping together", we could try to do it in that
numa-aware __page_cache_alloc().

It *could* be as simple/stupid as just saying "let's allocate the page
cache for new pages from the current node" - and if the process that
dirties pages just stays around on one single node, that might already
be sufficient.

So just for testing purposes, you could try changing that

return alloc_pages(gfp, 0);

in __page_cache_alloc() into something like

return alloc_pages_node(cpu_to_node(raw_smp_processor_id()), gfp, 0);

or something.

>> The fact that zone_reclaim_mode really improves on Dave's numbers
>> *that* dramatically does seem to imply that there is something to be
>> said for this.
>>
>> We do *not* want to limit the whole page cache to a particular node -
>> that sounds very unreasonable in general. But limiting any particular
>> file mapping (by default - I'm sure there are things like databases
>> that just want their one DB file to take over all of memory) to a
>> single node sounds much less unreasonable.
>>
>> What do you guys think? Worth exploring?
>
> The problem is that whenever we turn this sort of behaviour on, some
> benchmark regresses because it no longer holds its working set in
> the page cache, leading to the change being immediately reverted.
> Enterprise java benchmarks ring a bell, for some reason.

Yeah. It might be ok if we limit the new behavior to just new pages
that get allocated for writing, which is where we want to limit the
page cache more anyway (we already have all those dirty limits etc).

But from a testing standpoint, you can probably try the above
"alloc_pages_node()" hack and see if it even makes a difference. It
might not work, and the dirtier might be moving around too much etc.

 Linus
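The "allocate page cache on the current node" suggestion can be modelled in userspace as a toy, just to make the intent concrete. Everything below — the node count, toy_cpu_to_node, the per-node counters — is invented for illustration; the real change would sit inside __page_cache_alloc().

```c
#define NR_NODES 4

/*
 * Toy stand-in for cpu_to_node(): map a CPU id onto one of NR_NODES
 * fake NUMA nodes.  Purely illustrative.
 */
static int toy_cpu_to_node(int cpu)
{
	return cpu % NR_NODES;
}

/*
 * Model of node-local page cache allocation: a process that stays on
 * one CPU puts every new page on that CPU's node, so a single mapping's
 * pages (and its tree_lock traffic) stay on one node instead of being
 * spread across all of them.
 */
static void toy_alloc_page_local(int cpu, unsigned long pages_per_node[NR_NODES])
{
	pages_per_node[toy_cpu_to_node(cpu)]++;
}
```

In this model a writer pinned to one CPU fills exactly one node's counter, which is the behaviour the zone_reclaim_mode experiment later in the thread approximates by force.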


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-18 Thread Linus Torvalds
On Thu, Aug 18, 2016 at 6:24 AM, Mel Gorman  wrote:
> On Thu, Aug 18, 2016 at 05:11:11PM +1000, Dave Chinner wrote:
>> FWIW, I just remembered about /proc/sys/vm/zone_reclaim_mode.
>>
>
> That is a terrifying "fix" for this problem. It just happens to work
> because there is no spillover to other nodes so only one kswapd instance
> is potentially active.

Well, it may be a terrifying fix, but it does bring up an intriguing
notion: maybe what we should think about is to make the actual page
cache allocations be more "node-sticky" for a particular mapping? Not
some hard node binding, but if we were to make a single mapping *tend*
to allocate pages primarily within the same node, that would have the
kind of secondary advantage that it would avoid the cross-node mapping
locking.

Think of it as a gentler "guiding" fix to the spinlock contention
issue than a hard hammer.

And trying to (at least initially) keep the allocations of one
particular file to one particular node sounds like it could have other
locality advantages too.

In fact, looking at the __page_cache_alloc(), we already have that
"spread pages out" logic. I'm assuming Dave doesn't actually have that
bit set (I don't think it's the default), but I'm also envisioning
that maybe we could extend on that notion, and try to spread out
allocations in general, but keep page allocations from one particular
mapping within one node.

The fact that zone_reclaim_mode really improves on Dave's numbers
*that* dramatically does seem to imply that there is something to be
said for this.

We do *not* want to limit the whole page cache to a particular node -
that sounds very unreasonable in general. But limiting any particular
file mapping (by default - I'm sure there are things like databases
that just want their one DB file to take over all of memory) to a
single node sounds much less unreasonable.

What do you guys think? Worth exploring?

Linus


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-18 Thread Mel Gorman
On Thu, Aug 18, 2016 at 05:11:11PM +1000, Dave Chinner wrote:
> On Thu, Aug 18, 2016 at 01:45:17AM +0100, Mel Gorman wrote:
> > On Wed, Aug 17, 2016 at 04:49:07PM +0100, Mel Gorman wrote:
> > > > Yes, we could try to batch the locking like DaveC already suggested
> > > > (ie we could move the locking to the caller, and then make
> > > > shrink_page_list() just try to keep the lock held for a few pages if
> > > > the mapping doesn't change), and that might result in fewer crazy
> > > > cacheline ping-pongs overall. But that feels like exactly the wrong
> > > > kind of workaround.
> > > > 
> > > 
> > > Even if such batching was implemented, it would be very specific to the
> > > case of a single large file filling LRUs on multiple nodes.
> > > 
> > 
> > The latest Jason Bourne movie was sufficiently bad that I spent time
> > thinking how the tree_lock could be batched during reclaim. It's not
> > straight-forward but this prototype did not blow up on UMA and may be
> > worth considering if Dave can test whether either approach has a positive impact.
> 
> So, I just did a couple of tests. I'll call the two patches "sleepy"
> for the contention backoff patch and "bourney" for the Jason Bourne
> inspired batching patch. This is an average of 3 runs, overwriting
> a 47GB file on a machine with 16GB RAM:
> 
>   IO throughput   wall time __pv_queued_spin_lock_slowpath
> vanilla   470MB/s 1m42s   25-30%
> sleepy295MB/s 2m43s   <1%
> bourney   425MB/s 1m53s   25-30%
> 

Thanks. I updated the tests today and reran them trying to reproduce what
you saw but I'm simply not seeing it on bare metal with a spinning disk.

xfsio Throughput
                       4.8.0-rc2           4.8.0-rc2           4.8.0-rc2
                         vanilla              sleepy             bourney
Min      tput    147.4450 (  0.00%)  147.2580 (  0.13%)  147.3900 (  0.04%)
Hmean    tput    147.5853 (  0.00%)  147.5101 (  0.05%)  147.6121 ( -0.02%)
Stddev   tput      0.1041 (  0.00%)    0.1785 (-71.47%)    0.2036 (-95.63%)
CoeffVar tput      0.0705 (  0.00%)    0.1210 (-71.56%)    0.1379 (-95.59%)
Max      tput    147.6940 (  0.00%)  147.6420 (  0.04%)  147.8820 ( -0.13%)

I'm currently setting up a KVM instance that may fare better. Due to
quirks of where machines are, I have to setup the KVM instance on real
NUMA hardware but maybe that'll make the problem even more obvious.

> The overall CPU usage of sleepy was much lower than the others, but
> it was also much slower. Too much sleeping and not enough reclaim
> work being done, I think.
> 

Looks like it. On my initial test, there was barely any sleeping.

> As for bourney, it's not immediately clear as to why it's nearly as
> bad as the movie. At worst I would have expected it to have no
> noticeable impact, but maybe we are delaying freeing of pages too
> long and so stalling allocation of new pages? It also doesn't do
> much to reduce contention, especially considering the reduction in
> throughput.
> 
> On a hunch that the batch list isn't all one mapping, I sorted it.
> Patch is below if you're curious.
> 

The fact that sorting makes such a difference makes me think that it's
the wrong direction. It's far too specific to this test case and does
nothing to throttle a reclaimer. It's also fairly complex and I expected
that normal users of remove_mapping such as truncation would take a hit.

The hit of bouncing the lock around just hurts too much.

> FWIW, I just remembered about /proc/sys/vm/zone_reclaim_mode.
> 

That is a terrifying "fix" for this problem. It just happens to work
because there is no spillover to other nodes so only one kswapd instance
is potentially active.

> Anyway, I've burnt enough erase cycles on this SSD for today
> 

I'll continue looking at getting KVM up and running and then consider
other possibilities for throttling.

-- 
Mel Gorman
SUSE Labs


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-18 Thread Dave Chinner
On Thu, Aug 18, 2016 at 01:45:17AM +0100, Mel Gorman wrote:
> On Wed, Aug 17, 2016 at 04:49:07PM +0100, Mel Gorman wrote:
> > > Yes, we could try to batch the locking like DaveC already suggested
> > > (ie we could move the locking to the caller, and then make
> > > shrink_page_list() just try to keep the lock held for a few pages if
> > > the mapping doesn't change), and that might result in fewer crazy
> > > cacheline ping-pongs overall. But that feels like exactly the wrong
> > > kind of workaround.
> > > 
> > 
> > Even if such batching was implemented, it would be very specific to the
> > case of a single large file filling LRUs on multiple nodes.
> > 
> 
> The latest Jason Bourne movie was sufficiently bad that I spent time
> thinking how the tree_lock could be batched during reclaim. It's not
> straight-forward but this prototype did not blow up on UMA and may be
> worth considering if Dave can test either approach has a positive impact.

So, I just did a couple of tests. I'll call the two patches "sleepy"
for the contention backoff patch and "bourney" for the Jason Bourne
inspired batching patch. This is an average of 3 runs, overwriting
a 47GB file on a machine with 16GB RAM:

IO throughput   wall time __pv_queued_spin_lock_slowpath
vanilla 470MB/s 1m42s   25-30%
sleepy  295MB/s 2m43s   <1%
bourney 425MB/s 1m53s   25-30%

The overall CPU usage of sleepy was much lower than the others, but
it was also much slower. Too much sleeping and not enough reclaim
work being done, I think.

As for bourney, it's not immediately clear as to why it's nearly as
bad as the movie. At worst I would have expected it to have no
noticeable impact, but maybe we are delaying freeing of pages too
long and so stalling allocation of new pages? It also doesn't do
much to reduce contention, especially considering the reduction in
throughput.

On a hunch that the batch list isn't all one mapping, I sorted it.
Patch is below if you're curious.

IO throughput   wall time __pv_queued_spin_lock_slowpath
vanilla 470MB/s 1m42s   25-30%
sleepy  295MB/s 2m43s   <1%
bourney 425MB/s 1m53s   25-30%
sorted-bourney  465MB/s 1m43s   20%

The number of reclaim batches (from multiple runs) where the sorting
of the lists would have done anything is counted by list swaps (ls)
being > 1.

# grep " c " /var/log/syslog |grep -v "ls 1" |wc -l
7429
# grep " c " /var/log/syslog |grep "ls 1" |wc -l
1061767

IOWs in 1.07 million batches of pages reclaimed, only ~0.695% of
batches switched to a different mapping tree lock more than once.
From those numbers I would not have expected sorting the page list
to have any measurable impact on performance. However, performance
seems very sensitive to the number of times the mapping tree lock
is bounced around.
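The ~0.695% figure is just the first grep count over the total number of batches; it can be recomputed directly:

```c
/*
 * Recompute the quoted fraction from the two grep counts above:
 * batches that switched mapping tree_lock more than once ("ls" > 1)
 * versus those that did not.
 */
static double multi_mapping_batch_pct(unsigned long multi, unsigned long single)
{
	return 100.0 * (double)multi / (double)(multi + single);
}
```

multi_mapping_batch_pct(7429, 1061767) gives ~0.6948, matching the ~0.695% in the text.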

FWIW, I just remembered about /proc/sys/vm/zone_reclaim_mode.

IO throughput   wall time __pv_queued_spin_lock_slowpath
vanilla 470MB/s 1m42s   25-30%
zr=1470MB/s 1m42s   2-3%

So isolating the page cache usage to a single node maintains
performance and shows a significant reduction in pressure on the
mapping tree lock. Same as a single node system, I'd guess.
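For reference, the zr=1 run above corresponds to enabling zone reclaim at runtime (root required; 0 restores the default):

```shell
# Enable node-local reclaim, i.e. the zr=1 configuration above.
echo 1 > /proc/sys/vm/zone_reclaim_mode
# Equivalently:
sysctl -w vm.zone_reclaim_mode=1
```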

Anyway, I've burnt enough erase cycles on this SSD for today

-Dave.

---
 mm/vmscan.c | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9261102..5cf1bd6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -56,6 +56,8 @@
 
 #include "internal.h"
 
+#include <linux/list_sort.h>
+
 #define CREATE_TRACE_POINTS
 #include <trace/events/vmscan.h>
 
@@ -761,6 +763,17 @@ int remove_mapping(struct address_space *mapping, struct page *page)
return ret;
 }
 
+static int mapping_cmp(void *priv, struct list_head *a, struct list_head *b)
+{
+   struct address_space *ma = container_of(a, struct page, lru)->mapping;
+   struct address_space *mb = container_of(b, struct page, lru)->mapping;
+
+   if (ma == mb)
+   return 0;
+   if (ma > mb)
+   return 1;
+   return -1;
+}
 static void remove_mapping_list(struct list_head *mapping_list,
struct list_head *free_pages,
struct list_head *ret_pages)
@@ -771,12 +784,17 @@ static void remove_mapping_list(struct list_head *mapping_list,
LIST_HEAD(swapcache);
LIST_HEAD(filecache);
struct page *page;
+   int c = 0, ls = 0;
+
+   list_sort(NULL, mapping_list, mapping_cmp);
 
while (!list_empty(mapping_list)) {
+   c++;
page = lru_to_page(mapping_list);
list_del(&page->lru);
 
if (!mapping || page->mapping != mapping) {
+   ls++;
if (mapping) {
spin_unlock_irqrestore(

Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-17 Thread Dave Chinner
On Wed, Aug 17, 2016 at 04:49:07PM +0100, Mel Gorman wrote:
> On Tue, Aug 16, 2016 at 10:47:36AM -0700, Linus Torvalds wrote:
> > I've always preferred to see direct reclaim as the primary model for
> > reclaim, partly in order to throttle the actual "bad" process, but
> > also because "kswapd uses lots of CPU time" is such a nasty thing to
> > even begin guessing about.
> > 
> 
> While I agree that bugs with high CPU usage from kswapd are a pain,
> I'm reluctant to move towards direct reclaim being the primary mode. The
> stalls can be severe and there is no guarantee that the process punished
> is the process responsible. I'm basing this assumption on observations
> of severe performance regressions when I accidentally broke kswapd during
> the development of node-lru.
> 
> > So I have to admit to liking that "make kswapd sleep a bit if it's
> > just looping" logic that got removed in that commit.
> > 
> 
> It's primarily the direct reclaimer that is affected by that patch.
> 
> > And looking at DaveC's numbers, it really feels like it's not even
> > what we do inside the locked region that is the problem. Sure,
> > __delete_from_page_cache() (which is most of it) is at 1.86% of CPU
> > time (when including all the things it calls), but that still isn't
> > all that much. Especially when compared to just:
> > 
> >0.78%  [kernel]  [k] _raw_spin_unlock_irqrestore
> > 
> 
> The profile is shocking for such a basic workload. I automated what Dave
> described with xfs_io except that the file size is 2*RAM. The filesystem
> is sized to be roughly the same size as the file to minimise variances
> due to block layout. A call-graph profile collected on bare metal UMA with
> numa=fake=4 and paravirt spinlocks showed
> 
>  1.40% 0.16%  kswapd1  [kernel.vmlinux][k] _raw_spin_lock_irqsave
>  1.36% 0.16%  kswapd2  [kernel.vmlinux][k] _raw_spin_lock_irqsave
>  1.21% 0.12%  kswapd0  [kernel.vmlinux][k] _raw_spin_lock_irqsave
>  1.12% 0.13%  kswapd3  [kernel.vmlinux][k] _raw_spin_lock_irqsave
>  0.81% 0.45%  xfs_io   [kernel.vmlinux][k] _raw_spin_lock_irqsave
> 
> Those contention figures are not great but they are not terrible either. The
> vmstats told me there was no direct reclaim activity so either my theory
> is wrong or this machine is not reproducing the same problem Dave is seeing.

No, that's roughly the same un-normalised CPU percentage I am seeing
in spinlock contention, i.e. take away the idle CPU in the profile
(probably upwards of 80% if it's a 16p machine), and instead look at
that figure as a percentage of total CPU used by the workload. Then
you'll see that it's 30-40% of the total CPU consumed by the workload.
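Sanity-checking that normalisation against the profile Mel quoted: the five _raw_spin_lock_irqsave lines sum to ~5.9% absolute, and against roughly 20% non-idle CPU that is ~30% of consumed cycles. This is a back-of-envelope check, not a remeasurement; the 20% figure is an assumption taken from "probably upwards of 80% idle".

```c
/*
 * Normalise an absolute profile percentage against the fraction of CPU
 * actually consumed by the workload (both in percent of total CPU).
 */
static double normalised_pct(double absolute_pct, double cpu_used_pct)
{
	return 100.0 * absolute_pct / cpu_used_pct;
}
```

With 1.40+1.36+1.21+1.12+0.81 = 5.90 absolute and 20% CPU used, this gives 29.5%, consistent with the 25-30% reported elsewhere in the thread.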

> I have partial results from a 2-socket and 4-socket machine. 2-socket spends
> roughly 1.8% in _raw_spin_lock_irqsave and 4-socket spends roughly 3%,
> both with no direct reclaim. Clearly the problem gets worse the more NUMA
> nodes there are but not to the same extent Dave reports.
> 
> I believe potential reasons why I do not see the same problem as Dave are;
> 
> 1. Different memory sizes changing timing
> 2. Dave has fast storage and I'm using a spinning disk

This particular machine is using an abused 3 year old SATA SSD that still
runs at 500MB/s on sequential writes. This is "cheap desktop"
capability these days and is nowhere near what I'd call "fast".

> 3. Lock contention problems are magnified inside KVM
> 
> I think 3 is a good possibility if contended locks result in expensive
> exiting and reentery of the guest. I have a vague recollection that a
> spinning vcpu exits the guest but I did not confirm that.

I don't think anything like that has been implemented in the pv
spinlocks yet. They just spin right now - it's the same lock
implementation as the host. Also, Context switch rates measured on
the host are not significantly higher than what is measured in the
guest, so there doesn't appear to be any extra scheduling on the
host side occurring.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-17 Thread Mel Gorman
On Wed, Aug 17, 2016 at 04:49:07PM +0100, Mel Gorman wrote:
> > Yes, we could try to batch the locking like DaveC already suggested
> > (ie we could move the locking to the caller, and then make
> > shrink_page_list() just try to keep the lock held for a few pages if
> > the mapping doesn't change), and that might result in fewer crazy
> > cacheline ping-pongs overall. But that feels like exactly the wrong
> > kind of workaround.
> > 
> 
> Even if such batching was implemented, it would be very specific to the
> case of a single large file filling LRUs on multiple nodes.
> 

The latest Jason Bourne movie was sufficiently bad that I spent time
thinking how the tree_lock could be batched during reclaim. It's not
straight-forward but this prototype did not blow up on UMA and may be
worth considering if Dave can test whether either approach has a positive impact.

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 374d95d04178..926110219cd9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -621,19 +621,39 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
return PAGE_CLEAN;
 }
 
+static void finalise_remove_mapping(struct list_head *swapcache,
+   struct list_head *filecache,
+   void (*freepage)(struct page *))
+{
+   struct page *page;
+
+   while (!list_empty(swapcache)) {
+   swp_entry_t swap;
+
+   page = lru_to_page(swapcache);
+   swap.val = page_private(page);
+   list_del(&page->lru);
+   swapcache_free(swap);
+   set_page_private(page, 0);
+   }
+
+   while (!list_empty(filecache)) {
+   page = lru_to_page(filecache);
+   list_del(&page->lru);
+   freepage(page);
+   }
+}
+
 /*
  * Same as remove_mapping, but if the page is removed from the mapping, it
  * gets returned with a refcount of 0.
  */
-static int __remove_mapping(struct address_space *mapping, struct page *page,
-   bool reclaimed)
+static int __remove_mapping_page(struct address_space *mapping,
+struct page *page, bool reclaimed,
+struct list_head *swapcache,
+struct list_head *filecache)
 {
-   unsigned long flags;
-
BUG_ON(!PageLocked(page));
BUG_ON(mapping != page_mapping(page));
 
-   spin_lock_irqsave(&mapping->tree_lock, flags);
/*
 * The non racy check for a busy page.
 *
@@ -668,16 +688,18 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
}
 
if (PageSwapCache(page)) {
-   swp_entry_t swap = { .val = page_private(page) };
+   unsigned long swapval = page_private(page);
+   swp_entry_t swap = { .val = swapval };
mem_cgroup_swapout(page, swap);
__delete_from_swap_cache(page);
-   spin_unlock_irqrestore(&mapping->tree_lock, flags);
-   swapcache_free(swap);
+   set_page_private(page, swapval);
+   list_add(&page->lru, swapcache);
} else {
-   void (*freepage)(struct page *);
void *shadow = NULL;
+   void (*freepage)(struct page *);
 
freepage = mapping->a_ops->freepage;
+
/*
 * Remember a shadow entry for reclaimed file cache in
 * order to detect refaults, thus thrashing, later on.
@@ -698,16 +720,13 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
!mapping_exiting(mapping) && !dax_mapping(mapping))
shadow = workingset_eviction(mapping, page);
__delete_from_page_cache(page, shadow);
-   spin_unlock_irqrestore(&mapping->tree_lock, flags);
-
-   if (freepage != NULL)
-   freepage(page);
+   if (freepage)
+   list_add(&page->lru, filecache);
}
 
return 1;
 
 cannot_free:
-   spin_unlock_irqrestore(&mapping->tree_lock, flags);
return 0;
 }
 
@@ -719,16 +738,68 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
  */
 int remove_mapping(struct address_space *mapping, struct page *page)
 {
-   if (__remove_mapping(mapping, page, false)) {
+   unsigned long flags;
+   LIST_HEAD(swapcache);
+   LIST_HEAD(filecache);
+   void (*freepage)(struct page *);
+   int ret = 0;
+
+   spin_lock_irqsave(&mapping->tree_lock, flags);
+   freepage = mapping->a_ops->freepage;
+
+   if (__remove_mapping_page(mapping, page, false, &swapcache, &filecache)) {
/*
 * Unfreezing the refcount with 1 rather than 2 effectively
 * drops the pagecache ref for us without requiring another
 * atomic operation.
   

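The shape of the prototype above — detach pages from the mapping while the lock is held, queue them on a local list, and run the freepage callbacks only after the lock is dropped — can be sketched as a small userspace model. Everything below (struct names, helpers) is illustrative stand-in code, not the kernel API:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-ins: "page" carries only what the sketch needs. */
struct page {
	int freed;              /* set by the deferred freepage callback */
	struct page *next;      /* stand-in for the page->lru list linkage */
};

struct batch {
	struct page *filecache; /* pages whose freepage call is deferred */
};

/* "Under the lock": detach only, defer the expensive callback. */
static void remove_mapping_page(struct batch *b, struct page *p)
{
	p->next = b->filecache;
	b->filecache = p;
}

/* "After the lock is dropped": run all deferred callbacks in one pass. */
static void finalise_remove_mapping(struct batch *b,
				    void (*freepage)(struct page *))
{
	while (b->filecache) {
		struct page *p = b->filecache;
		b->filecache = p->next;
		freepage(p);
	}
}

static void mark_freed(struct page *p) { p->freed = 1; }

/* Batch n pages (n <= 16), finalise, and count how many were freed. */
int batched_free_count(int n)
{
	struct page pages[16] = { 0 };
	struct batch b = { .filecache = NULL };
	int i, freed = 0;

	for (i = 0; i < n; i++)
		remove_mapping_page(&b, &pages[i]); /* "locked" section */
	finalise_remove_mapping(&b, mark_freed);    /* "unlocked" section */

	for (i = 0; i < n; i++)
		freed += pages[i].freed;
	return freed;
}
```

The point of the batching is only visible in the split: the per-page work inside the lock shrinks to a couple of pointer stores, and the callback cost moves outside the contended section.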
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-17 Thread Michal Hocko
On Wed 17-08-16 17:48:25, Michal Hocko wrote:
[...]
> I will try to catch up with the rest of the email thread but from a
> quick glance it just feels like we are doing more work under the
> lock.

Hmm, so it doesn't seem to be more work in __remove_mapping as pointed
out in http://lkml.kernel.org/r/20160816220250.GI16044@dastard

As Mel already pointed out the LRU will be basically single mapping for
this workload so any subtle change in timing might make a difference.
I was looking through 4.6..4.7 and one thing that has changed is the
inactive vs. active LRU size ratio. See 59dc76b0d4df ("mm: vmscan:
reduce size of inactive file list"). The machine has quite a lot of
memory and so the LRUs will be large as well so I guess this could have
changed the timing somehow, but it feels like a wild guess so I would be
careful to blame this commit...
-- 
Michal Hocko
SUSE Labs


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-17 Thread Peter Zijlstra
On Mon, Aug 15, 2016 at 07:03:00AM +0200, Ingo Molnar wrote:
> 
> * Linus Torvalds  wrote:
> 
> > Make sure you actually use "perf record -e cycles:pp" or something
> > that uses PEBS to get real profiles using CPU performance counters.
> 
> Btw., 'perf record -e cycles:pp' is the default now for modern versions
> of perf tooling (on most x86 systems) - if you do 'perf record' it will
> just use the most precise profiling mode available on that particular
> CPU model.
> 
> If unsure you can check the event that was used, via:
> 
>   triton:~> perf report --stdio 2>&1 | grep '# Samples'
>   # Samples: 27K of event 'cycles:pp'

Problem here is that Dave is using a KVM thingy. Getting hardware
counters in a guest is somewhat tricky but doable, but PEBS does not
virtualize.




Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-17 Thread Mel Gorman
On Tue, Aug 16, 2016 at 10:47:36AM -0700, Linus Torvalds wrote:
> I've always preferred to see direct reclaim as the primary model for
> reclaim, partly in order to throttle the actual "bad" process, but
> also because "kswapd uses lots of CPU time" is such a nasty thing to
> even begin guessing about.
> 

While I agree that bugs with high CPU usage from kswapd are a pain,
I'm reluctant to move towards direct reclaim being the primary mode. The
stalls can be severe and there is no guarantee that the process punished
is the process responsible. I'm basing this assumption on observations
of severe performance regressions when I accidentally broke kswapd during
the development of node-lru.

> So I have to admit to liking that "make kswapd sleep a bit if it's
> just looping" logic that got removed in that commit.
> 

It's primarily the direct reclaimer that is affected by that patch.

> And looking at DaveC's numbers, it really feels like it's not even
> what we do inside the locked region that is the problem. Sure,
> __delete_from_page_cache() (which is most of it) is at 1.86% of CPU
> time (when including all the things it calls), but that still isn't
> all that much. Especially when compared to just:
> 
>0.78%  [kernel]  [k] _raw_spin_unlock_irqrestore
> 

The profile is shocking for such a basic workload. I automated what Dave
described with xfs_io except that the file size is 2*RAM. The filesystem
is sized to be roughly the same size as the file to minimise variances
due to block layout. A call-graph profile collected on bare metal UMA with
numa=fake=4 and paravirt spinlocks showed

 1.40% 0.16%  kswapd1  [kernel.vmlinux][k] _raw_spin_lock_irqsave
 1.36% 0.16%  kswapd2  [kernel.vmlinux][k] _raw_spin_lock_irqsave
 1.21% 0.12%  kswapd0  [kernel.vmlinux][k] _raw_spin_lock_irqsave
 1.12% 0.13%  kswapd3  [kernel.vmlinux][k] _raw_spin_lock_irqsave
 0.81% 0.45%  xfs_io   [kernel.vmlinux][k] _raw_spin_lock_irqsave

Those contention figures are not great but they are not terrible either. The
vmstats told me there was no direct reclaim activity so either my theory
is wrong or this machine is not reproducing the same problem Dave is seeing.

I have partial results from a 2-socket and 4-socket machine. 2-socket spends
roughly 1.8% in _raw_spin_lock_irqsave and 4-socket spends roughly 3%,
both with no direct reclaim. Clearly the problem gets worse the more NUMA
nodes there are but not to the same extent Dave reports.

I believe potential reasons why I do not see the same problem as Dave are;

1. Different memory sizes changing timing
2. Dave has fast storage and I'm using a spinning disk
3. Lock contention problems are magnified inside KVM

I think 3 is a good possibility if contended locks result in expensive
exiting and reentry of the guest. I have a vague recollection that a
spinning vcpu exits the guest but I did not confirm that. I can setup a
KVM instance and run the tests but it'll take a few hours and possibly
will be pushed out until tomorrow.

> So I'm more and more getting the feeling that it's not what we do
> inside the lock that is problematic. I started out blaming memcg
> accounting or something, but none of the numbers seem to back that up.
> So it's primarily really just the fact that kswapd is simply hammering
> on that lock way too much.
> 

Agreed.

> So yeah, I'm blaming kswapd itself doing something wrong. It's not a
> problem in a single-node environment (since there's only one), but
> with multiple nodes it clearly just devolves.
> 
> Yes, we could try to batch the locking like DaveC already suggested
> (ie we could move the locking to the caller, and then make
> shrink_page_list() just try to keep the lock held for a few pages if
> the mapping doesn't change), and that might result in fewer crazy
> cacheline ping-pongs overall. But that feels like exactly the wrong
> kind of workaround.
> 

Even if such batching was implemented, it would be very specific to the
case of a single large file filling LRUs on multiple nodes.

> I'd much rather re-instate some "if kswapd is just spinning on CPU
> time and not actually improving IO parallelism, kswapd should just get
> the hell out" logic.
> 

I'm having trouble right now thinking of a good way of identifying when
kswapd should give up and force direct reclaim to take a hit.

I'd like to pass something else by the wtf-o-meter. I had a prototype
patch lying around that replaced a congestion_wait if too many LRU pages
were isolated with a waitqueue for an unrelated theoretical problem. It's
the bulk of the patch below but can be trivially extended for the case of
tree_lock contention.

The interesting part is the change to __remove_mapping. It stalls a
reclaimer (direct or kswapd) if the lock is contended for either a
timeout or a local reclaimer finishing some reclaim. This stalls for a
"real" reason 

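The stall rule described above — a reclaimer that finds the lock contended sleeps until either a timeout expires or another reclaimer finishes some reclaim — might be modelled, very roughly and without the real waitqueue machinery, as:

```c
#include <assert.h>

/* Hypothetical sketch of the throttling decision only; the kernel
 * version would use a waitqueue and wait_event_timeout(). */
struct reclaim_wq {
	int pending_wakeups;  /* reclaimers that finished since last check */
};

/* A stalled reclaimer wakes early if someone finished reclaim,
 * otherwise it "sleeps" until the timeout expires. Returns the
 * number of simulated ticks actually slept. */
static int stall_until(struct reclaim_wq *wq, int timeout_ticks)
{
	int slept = 0;

	while (slept < timeout_ticks && wq->pending_wakeups == 0)
		slept++;                /* stand-in for schedule() */
	if (wq->pending_wakeups > 0)
		wq->pending_wakeups--;  /* consumed a wakeup */
	return slept;
}

/* Called by a reclaimer that made progress: release one waiter. */
static void reclaim_finished(struct reclaim_wq *wq)
{
	wq->pending_wakeups++;
}

/* Encode both outcomes: full timeout with no wakeups, immediate
 * return once a wakeup is pending. */
int demo(void)
{
	struct reclaim_wq wq = { 0 };
	int timed_out, woken;

	timed_out = stall_until(&wq, 10); /* no wakeups: sleeps 10 ticks */
	reclaim_finished(&wq);
	woken = stall_until(&wq, 10);     /* wakeup pending: sleeps 0 */
	return timed_out * 100 + woken;
}
```

A waiter released this way stalls for a "real" reason — another reclaimer actually made progress — rather than napping behind a blind congestion_wait().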
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-17 Thread Michal Hocko
On Tue 16-08-16 10:47:36, Linus Torvalds wrote:
> Mel,
>  thanks for taking a look. Your theory sounds more complete than mine,
> and since Dave is able to see the problem with 4.7, it would be nice
> to hear about the 4.6 behavior and commit ede37713737 in particular.
> 
> That one seems more likely to affect contention than the zone/node one
> I found during the merge window anyway, since it actually removes a
> sleep in kswapd during congestion.

Hmm, the patch removes a short sleep from wait_iff_congested for
kworkers but that cannot affect kswapd context. Then it removes
wait_iff_congested from should_reclaim_retry but that is not kswapd path
and the sleep was added in the same merge window so it wasn't in 4.6 so
it shouldn't make any difference as well.

So I am not really sure how it could make any difference.

I will try to catch up with the rest of the email thread but from a
quick glance it just feels like we are doing more work under the
lock.

-- 
Michal Hocko
SUSE Labs


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-16 Thread Linus Torvalds
On Tue, Aug 16, 2016 at 3:02 PM, Dave Chinner  wrote:
>>
>> What does your profile show for when you actually dig into
>> __remove_mapping() itself?, Looking at your flat profile, I'm assuming
>> you get
>
> -   22.26% 0.93%  [kernel]  [k] __remove_mapping
>- 3.86% __remove_mapping
>   - 18.35% _raw_spin_lock_irqsave
>__pv_queued_spin_lock_slowpath
> 1.32% __delete_from_page_cache
>   - 0.92% _raw_spin_unlock_irqrestore
>__raw_callee_save___pv_queued_spin_unlock

Ok, that's all very consistent with my profiles, except - obviously -
for the crazy spinlock thing.

One difference is that your unlock has that PV unlock thing - on raw
hardware it's just a single store. But I don't think I saw the
unlock_slowpath in there.

There's nothing really expensive going on there that I can tell.

> And the instruction level profile:

Yup. The bulk is in the cmpxchg and a cache miss (it just shows up in
the instruction after it: you can use "cycles:pp" to get perf to
actually try to fix up the blame to the instruction that _causes_
things rather than the instruction following, but in this case it's
all trivial).

> It's the same code AFAICT, except the pv version jumps straight to
> the "queue" case.

Yes. Your profile looks perfectly fine. Most of the profile is right
after the 'pause', which you'd expect.

From a quick look, it seems like only about 2/3rd of the time is
actually spent in the "pause" loop, but the control flow is complex
enough that maybe I didn't follow it right. The native case is
simpler. But since I suspect that it's not so much about the
spinlocked region being too costly, but just about locking too damn
much), that 2/3rds actually makes sense: it's not that it's
necessarily spinning waiting for the lock all that long in any
individual case, it's just that the spin_lock code is called so much.

So I still kind of just blame kswapd, rather than any new expense. It
would be interesting to hear if Mel is right about that kswapd
sleeping change between 4.6 and 4.7..

   Linus


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-16 Thread Dave Chinner
On Mon, Aug 15, 2016 at 06:51:42PM -0700, Linus Torvalds wrote:
> Anyway, including the direct reclaim call paths gets
> __remove_mapping() a bit higher, and _raw_spin_lock_irqsave climbs to
> 0.26%. But perhaps more importantly, looking at what __remove_mapping
> actually *does* (apart from the spinlock) gives us:
> 
>  - inside remove_mapping itself (0.11% on its own - flat cost, no
> child accounting)
> 
> 48.50 │   lock   cmpxchg %edx,0x1c(%rbx)
> 
> so that's about 0.05%
> 
>  - 0.40% __delete_from_page_cache (0.22%
> radix_tree_replace_clear_tags, 0.13%__radix_tree_lookup)
> 
>  - 0.06% workingset_eviction()
> 
> so I'm not actually seeing anything *new* expensive in there. The
> __delete_from_page_cache() overhead may have changed a bit with the
> tagged tree changes, but this doesn't look like memcg.
> 
> But we clearly have very different situations.
> 
> What does your profile show for when you actually dig into
> __remove_mapping() itself?, Looking at your flat profile, I'm assuming
> you get

-   22.26% 0.93%  [kernel]  [k] __remove_mapping
   - 3.86% __remove_mapping
  - 18.35% _raw_spin_lock_irqsave
   __pv_queued_spin_lock_slowpath
1.32% __delete_from_page_cache
  - 0.92% _raw_spin_unlock_irqrestore
   __raw_callee_save___pv_queued_spin_unlock

And the instruction level profile:

.
   │   xor    %ecx,%ecx
   │   mov    %rax,%r15
  0.39 │   mov    $0x2,%eax
   │   lock   cmpxchg %ecx,0x1c(%rbx)
 32.56 │   cmp    $0x2,%eax
   │ ↓ jne    12e
   │   mov    0x20(%rbx),%rax
   │   lea    -0x1(%rax),%rdx
  0.39 │   test   $0x1,%al
   │   cmove  %rbx,%rdx
   │   mov    (%rdx),%rax
  0.39 │   test   $0x10,%al
   │ ↓ jne    127
   │   mov    (%rbx),%rcx
   │   shr    $0xf,%rcx
   │   and    $0x1,%ecx
   │ ↓ jne    14a
   │   mov    0x68(%r14),%rax
 36.03 │   xor    %esi,%esi
   │   test   %r13b,%r13b
   │   mov    0x50(%rax),%rdx
  1.16 │ ↓ jne    e8
  0.96 │ a9:   mov    %rbx,%rdi
.

Indicates most time on the cmpxchg for the page ref followed by the
grabbing on the ->freepage op vector:

freepage = mapping->a_ops->freepage;

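For readers following the annotation: the hot `lock cmpxchg %ecx,0x1c(%rbx)` is the refcount freeze in __remove_mapping — atomically swap the page refcount from the expected value to zero, failing if anyone else still holds a reference. A hypothetical single-threaded model of that semantic (no atomics, illustration only):

```c
#include <assert.h>

struct page_model {
	int refcount;
};

/* Freeze succeeds (refcount -> 0) only if refcount == expected; the
 * kernel does this with an atomic cmpxchg, modeled here sequentially. */
static int page_ref_freeze_model(struct page_model *p, int expected)
{
	if (p->refcount != expected)
		return 0;        /* someone else has a reference: cannot free */
	p->refcount = 0;         /* frozen: safe to remove from the mapping */
	return 1;
}

/* On the cannot_free path the kernel unfreezes back to the old count. */
static void page_ref_unfreeze_model(struct page_model *p, int count)
{
	p->refcount = count;
}

/* Encode both outcomes: freeze of an uncontended page succeeds,
 * freeze of a page with an extra reference fails. */
int demo(void)
{
	struct page_model idle = { .refcount = 2 }; /* cache + isolation refs */
	struct page_model busy = { .refcount = 3 }; /* one extra reference */
	int ok = page_ref_freeze_model(&idle, 2);   /* 1: frozen, removable */
	int fail = page_ref_freeze_model(&busy, 2); /* 0: must keep the page */

	page_ref_unfreeze_model(&idle, 2); /* how cannot_free would back out */
	return ok * 10 + fail + idle.refcount;
}
```

The profiled cost is not this logic itself but the cache-line transfer the `lock cmpxchg` forces when many CPUs hammer pages of the same mapping.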
> I come back to wondering whether maybe you're hitting some PV-lock problem.
> 
> I know queued_spin_lock_slowpath() is ok. I'm not entirely sure
> __pv_queued_spin_lock_slowpath() is.

It's the same code AFAICT, except the pv version jumps straight to
the "queue" case.

> So I'd love to see you try the non-PV case, but I also think it might
> be interesting to see what the instruction profile for
> __pv_queued_spin_lock_slowpath() itself is. They share a lot of code
> (there's some interesting #include games going on to make
> queued_spin_lock_slowpath() actually *be*
> __pv_queued_spin_lock_slowpath() with some magic hooks), but there
> might be issues.

  0.03 │   data16 data16 data16 xchg %ax,%ax
   │   push   %rbp
  0.00 │   mov    %rsp,%rbp
  0.01 │   push   %r15
   │   push   %r14
   │   push   %r13
  0.01 │   push   %r12
   │   mov    $0x18740,%r12
   │   push   %rbx
   │   mov    %rdi,%rbx
   │   sub    $0x10,%rsp
   │   add    %gs:0x7ef0d0e0(%rip),%r12
   │   movslq 0xc(%r12),%rax
  0.02 │   mov    %gs:0x7ef0d0db(%rip),%r15d
   │   add    $0x1,%r15d
   │   shl    $0x12,%r15d
   │   lea    0x1(%rax),%edx
  0.01 │   mov    %edx,0xc(%r12)
   │   mov    %eax,%edx
   │   shl    $0x4,%rax
   │   add    %rax,%r12
   │   shl    $0x10,%edx
   │   movq   $0x0,(%r12)
  0.02 │   or     %edx,%r15d
   │   mov    %gs:0x7ef0d0ad(%rip),%eax
  0.00 │   movl   $0x0,0x8(%r12)
  0.01 │   mov    %eax,0x40(%r12)
   │   movb   $0x0,0x44(%r12)
   │   mov    (%rdi),%eax
  0.88 │   test   %ax,%ax
   │ ↓ jne    8f
  0.02 │   mov    $0x1,%edx
   │   lock   cmpxchg %dl,(%rdi)
  0.38 │   test   %al,%al
   │ ↓ je     14a
  0.02 │ 8f:   mov    %r15d,%eax
   │   shr    $0x10,%eax
   │   xchg   %ax,0x2(%rbx)
  2.07 │   shl    $0x10,%eax
   │   test   %eax,%eax
   │ ↓ jne    171
   │   movq   $0x0,-0x30(%rbp)
  0.02 │ ac:   movzbl 0x44(%r12),%eax
  0.97 │   mov    $0x1,%r13d
   │   mov    $0x100,%r14d
   │   cmp    $0x2,%al
   │   sete   %al
   │   movzbl %al,%eax
   │   mov    %rax,-0x38(%rbp)
  0.00 │ ca:   movb   $0x0,0x44(%r12)
  0.00 │   mov    $0x8000,%edx
   │   movb   $0x1,0x1(%rbx)
   │ ↓ jmp    e6
  0.04 │ db:   pause
  8.04 │   sub    $0x1,%edx
   │ ↓ je     229
   │ e6:   movzbl (%rbx),%eax
  7.54 │   test   %al,%al
   │ ↑ jne    db
  0.10 │   mov    %r14d,%eax
  0.06 │   lock   cmpxchg %r13w,(%rbx)
  

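The `pause` loop dominating the samples (`mov $0x8000,%edx` ... `db: pause; sub $0x1,%edx`) is the PV spinlock's bounded spin: wait up to a threshold of iterations for the lock byte to clear, and only then take the expensive path that halts the vCPU. A hypothetical, non-atomic model of just that bounded-spin decision:

```c
#include <assert.h>

#define SPIN_THRESHOLD 0x8000  /* mirrors the 0x8000 loaded into %edx */

struct spinlock_model {
	unsigned char locked;
};

/* Spin up to SPIN_THRESHOLD iterations waiting for the byte to clear.
 * release_after simulates the holder unlocking at that iteration.
 * Returns the iterations spent spinning, or -1 once the threshold
 * expires (where the real PV code would hypercall to halt the vCPU). */
static int bounded_spin(struct spinlock_model *l, int release_after)
{
	int spins;

	for (spins = 0; spins < SPIN_THRESHOLD; spins++) {
		if (spins == release_after)
			l->locked = 0;  /* simulated unlock by the holder */
		if (!l->locked)
			return spins;   /* observed free: go try cmpxchg */
	}
	return -1;                      /* give up spinning: halt/yield */
}

/* Encode both outcomes: lock released while spinning vs. never. */
int demo(void)
{
	struct spinlock_model fast = { .locked = 1 };
	struct spinlock_model slow = { .locked = 1 };
	int a = bounded_spin(&fast, 100);                /* released early */
	int b = bounded_spin(&slow, SPIN_THRESHOLD + 1); /* never released */

	return (a == 100) && (b == -1);
}
```

Under contention the cost is mostly these pause iterations plus the vCPU exit when the threshold trips, which fits the point made elsewhere in the thread: the problem is how often the lock is taken, not the cost of any single acquisition.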
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-16 Thread Linus Torvalds
Mel,
 thanks for taking a look. Your theory sounds more complete than mine,
and since Dave is able to see the problem with 4.7, it would be nice
to hear about the 4.6 behavior and commit ede37713737 in particular.

That one seems more likely to affect contention than the zone/node one
I found during the merge window anyway, since it actually removes a
sleep in kswapd during congestion.

I've always preferred to see direct reclaim as the primary model for
reclaim, partly in order to throttle the actual "bad" process, but
also because "kswapd uses lots of CPU time" is such a nasty thing to
even begin guessing about.

So I have to admit to liking that "make kswapd sleep a bit if it's
just looping" logic that got removed in that commit.

And looking at DaveC's numbers, it really feels like it's not even
what we do inside the locked region that is the problem. Sure,
__delete_from_page_cache() (which is most of it) is at 1.86% of CPU
time (when including all the things it calls), but that still isn't
all that much. Especially when compared to just:

   0.78%  [kernel]  [k] _raw_spin_unlock_irqrestore

from his flat profile. That's not some spinning wait, that's just
releasing the lock with a single write (and the popf, but while that's
an expensive instruction, it's just tens of cpu cycles).

So I'm more and more getting the feeling that it's not what we do
inside the lock that is problematic. I started out blaming memcg
accounting or something, but none of the numbers seem to back that up.
So it's primarily really just the fact that kswapd is simply hammering
on that lock way too much.

So yeah, I'm blaming kswapd itself doing something wrong. It's not a
problem in a single-node environment (since there's only one), but
with multiple nodes it clearly just devolves.

Yes, we could try to batch the locking like DaveC already suggested
(ie we could move the locking to the caller, and then make
shrink_page_list() just try to keep the lock held for a few pages if
the mapping doesn't change), and that might result in fewer crazy
cacheline ping-pongs overall. But that feels like exactly the wrong
kind of workaround.

I'd much rather re-instate some "if kswapd is just spinning on CPU
time and not actually improving IO parallelism, kswapd should just get
the hell out" logic.

Adding Michal Hocko to the participant list too, I think he's one of
the gang in this area. Who else should be made aware of this thread?
Minchan? Vladimir?

[ I'm assuming the new people can look up this thread on lkml. Note to
new people: the subject line (and about 75% of the posts) are about an
unrelated AIM7 regression, but there's this sub-thread about nasty
lock contention on mapping->tree_lock within that bigger context ]

  Linus

On Tue, Aug 16, 2016 at 8:05 AM, Mel Gorman  wrote:
>
> However, historically there have been multiple indirect throttling mechanism
> that were branded as congestion control but basically said "I don't know
> what's going on so it's nap time". Many of these have been removed over
> time and the last major one was ede37713737 ("mm: throttle on IO only when
> there are too many dirty and writeback pages").
>
> Before that commit, a process that entered direct reclaim and failed to make
> progress would sleep before retrying. It's possible that sleep was enough
> to reduce contention by temporarily stalling the writer and letting reclaim
> make progress. After that commit, it may only do a cond_resched() check
> and go back to allocating/reclaiming as quickly as possible. This active
> writer may be enough to increase contention. If so, it also means it
> stops kswapd making forward progress, leading to more direct reclaim and
> more contention.
>
> It's not a perfect theory and assumes;
>
> 1. The writer is direct reclaiming
> 2. The writer was previously failing to __remove_mapping
> 3. The writer calling congestion_wait due to __remove_mapping failing
>was enough to allow kswapd or writeback to make enough progress to
>avoid contention
> 4. The writer staying awake allocating and dirtying pages is keeping all
>the kswapd instances awake and writeback continually active and
>increasing the contention overall.
>
> If it was possible to trigger this problem in 4.7 then it would also be
> worth checking 4.6. If 4.6 is immune, check that before and after commit
> ede37713737.
>
> --
> Mel Gorman
> SUSE Labs


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-16 Thread Mel Gorman
On Mon, Aug 15, 2016 at 04:48:36PM -0700, Linus Torvalds wrote:
> On Mon, Aug 15, 2016 at 4:20 PM, Linus Torvalds
>  wrote:
> >
> > None of this code is all that new, which is annoying. This must have
> > gone on forever,
> 
> ... ooh.
> 
> Wait, I take that back.
> 
> We actually have some very recent changes that I didn't even think
> about that went into this very merge window.
> 
> In particular, I wonder if it's all (or at least partly) due to the
> new per-node LRU lists.
> 
> So in shrink_page_list(), when kswapd is encountering a page that is
> under page writeback due to page reclaim, it does:
> 
> if (current_is_kswapd() &&
> PageReclaim(page) &&
> test_bit(PGDAT_WRITEBACK, &pgdat->flags)) {
> nr_immediate++;
> goto keep_locked;
> 

I have a limited view of the full topic as I've been in meetings all day
and have another 3 hours to go. I'll set time aside tomorrow to look closer
but there is a theory at the end of the mail.

Node-lru does alter what locks are contended and affects the timing of some
issues but this spot feels like a bad fit. That logic controls whether kswapd
will stall due to dirty/writeback pages reaching the tail of the LRU too
quickly. It can affect lru_lock contention that may be worse with node-lru,
particularly on single-node machines but a workload of a streaming writer
is unlikely to hit that unless the underlying storage is extremely slow.

Another alteration of node-lru potentially affects when buffer heads get
stripped but that's also a poor fit.

I'm not willing to rule out node-lru because it may be wishful thinking
but it feels unlikely.

> which basically ignores that page and puts it back on the LRU list.
> 
> But that "is this node under writeback" is new - it now does that per
> node, and it *used* to do it per zone (so it _used_ to test "is this
> zone under writeback").
> 

Superficially, a small high zone would affect the timing of when a zone
got marked congested and triggered a sleep. Sleeping avoids new pages being
allocated/dirtied and may reduce contention. However, quick sleeps due to
small zones was offset by the fair zone allocation policy and is still
offset by GFP_WRITE distributing dirty pages on different zones. The
timing of when sleeps occur due to excessive dirty pages at the tail of
the LRU should be roughly similar with either zone-lru or node-lru.

> All the mapping pages used to be in the same zone, so I think it
> effectively single-threaded the kswapd reclaim for one mapping under
> reclaim writeback. But in your cases, you have multiple nodes...
> 
> Ok, that's a lot of hand-wavy new-age crystal healing thinking.
> 
> Really, I haven't looked at it more than "this is one thing that has
> changed recently, I wonder if it changes the patterns and could
> explain much higher spin_lock contention on the mapping->tree_lock".
> 
> I'm adding Mel Gorman and his band of miscreants to the cc, so that
> they can tell me that I'm full of shit, and completely missed on what
> that zone->node change actually ends up meaning.
> 
> Mel? The issue is that Dave Chinner is seeing some nasty spinlock
> contention on "mapping->tree_lock":
> 

Band Of Miscreants may be the new name for the MM track at LSF/MM.  In the
meantime let's try some hand waving;

A single-threaded file write on a 4-node system is going to have 4 kswapd
instances, writeback and potentially the writer itself all reclaiming.
Given the workload, it's likely that almost all pages have the same
mapping. As they are contending on __remove_mapping, the pages must be
clean when the attempt to reclaim was made and buffers stripped.

The throttling mechanisms for kswapd and direct reclaim rely on either
too many pages being isolated (unlikely to fire in this case) or too many
dirty/writeback pages reaching the end of the LRU. There is not a direct
throttling mechanism for excessive lock contention

However, historically there have been multiple indirect throttling mechanism
that were branded as congestion control but basically said "I don't know
what's going on so it's nap time". Many of these have been removed over
time and the last major one was ede37713737 ("mm: throttle on IO only when
there are too many dirty and writeback pages").

Before that commit, a process that entered direct reclaim and failed to make
progress would sleep before retrying. It's possible that sleep was enough
to reduce contention by temporarily stalling the writer and letting reclaim
make progress. After that commit, it may only do a cond_resched() check
and go back to allocating/reclaiming as quickly as possible. This active
writer may be enough to increase contention. If so, it also means it
stops kswapd making forward progress, leading to more direct reclaim and
more contention.

It's not a perfect theory and assumes;

1. The writer is direct reclaiming
2. The writer was p

Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-16 Thread Fengguang Wu

On Sun, Aug 14, 2016 at 06:17:24PM +0200, Christoph Hellwig wrote:

Snipping the long contest:

I think there are three observations here:

(1) removing the mark_page_accessed (which is the only significant
change in the parent commit)  hurts the
aim7/1BRD_48G-xfs-disk_rr-3000-performance/ivb44 test.
I'd still rather stick to the filemap version and let the
VM people sort it out.  How do the numbers for this test
look for XFS vs say ext4 and btrfs?


Here is a basic comparison of the 3 filesystems based on 99091700
("Merge tag 'nfs-for-4.8-2' of
git://git.linux-nfs.org/projects/trondmy/linux-nfs").

% compare -a -g 99091700659f4df965e138b38b4fa26a29b7eade -d fs xfs ext4 btrfs

xfsext4   btrfs  
testcase/testparams/testbox
  --  --  
---
%stddev  change %stddev  change %stddev
\  |\  |\
   193335 -27% 141400-100%  8
GEO-MEAN aim7.jobs-per-min
   267649 ±  3%   -51% 130085
aim7/1BRD_48G-disk_cp-3000-performance/ivb44
   485217 ±  3%   402%2434088 ±  3%   350%2184471 ±  4%  
aim7/1BRD_48G-disk_rd-9000-performance/ivb44
   360286 -64% 130351
aim7/1BRD_48G-disk_rr-3000-performance/ivb44
   338114 -78%  73280
aim7/1BRD_48G-disk_rw-3000-performance/ivb44
60130 ±  5%   361% 277035
aim7/1BRD_48G-disk_src-3000-performance/ivb44
   403144 -68% 127584
aim7/1BRD_48G-disk_wrt-3000-performance/ivb44
26327 -60%  10571
aim7/1BRD_48G-sync_disk_rw-600-performance/ivb44

xfsext4   btrfs
  --  --
 2652 -96%118 -82%468
GEO-MEAN fsmark.files_per_sec
  393 ±  4%-6%368 ±  3%10%433 ±  5%  
fsmark/1x-1t-1BRD_48G-4M-40G-NoSync-performance/ivb44
  200  -4%191  -7%185 ±  6%  
fsmark/1x-1t-1BRD_48G-4M-40G-fsyncBeforeClose-performance/ivb44
 1583 ±  3%   -29%   1130 -31%   1088
fsmark/1x-64t-1BRD_48G-4M-40G-fsyncBeforeClose-performance/ivb44
21363  59%  33958
fsmark/8-1SSD-16-9B-48G-fsyncBeforeClose-16d-256fpd-performance/lkp-hsw-ep4
11033 -17%   9117
fsmark/8-1SSD-4-8K-24G-fsyncBeforeClose-16d-256fpd-performance/lkp-hsw-ep4
11833  12%  13234
fsmark/8-1SSD-4-9B-16G-fsyncBeforeClose-16d-256fpd-performance/lkp-hsw-ep4

xfsext4   btrfs
  --  --
  2381976-100%   6598 -96% 100973
GEO-MEAN fsmark.app_overhead
   564520 ±  7%21% 681192 ±  3%63% 919364 ±  3%  
fsmark/1x-64t-1BRD_48G-4M-40G-NoSync-performance/ivb44
   860074 ±  5%   112%1820590 ± 14%47%1262443 ±  3%  
fsmark/1x-64t-1BRD_48G-4M-40G-fsyncBeforeClose-performance/ivb44
 12232633 -18%   10085199
fsmark/8-1SSD-16-9B-48G-fsyncBeforeClose-16d-256fpd-performance/lkp-hsw-ep4
  3143334 -11%2784178
fsmark/8-1SSD-4-8K-24G-fsyncBeforeClose-16d-256fpd-performance/lkp-hsw-ep4
  4107347 -21%3248210
fsmark/8-1SSD-4-9B-16G-fsyncBeforeClose-16d-256fpd-performance/lkp-hsw-ep4

Thanks,
Fengguang
---

Some less important numbers.

xfsext4   btrfs
  --  --
 1314 222%   4225-100%  2
GEO-MEAN aim7.time.system_time
 1491 ±  6%   302%   6004
aim7/1BRD_48G-disk_cp-3000-performance/ivb44
 4786 ±  3%   -89%502 ±  7%   -87%632 ±  7%  
aim7/1BRD_48G-disk_rd-9000-performance/ivb44
  756 689%   5971
aim7/1BRD_48G-disk_rr-3000-performance/ivb44
  8911146%  11108
aim7/1BRD_48G-disk_rw-3000-performance/ivb44
  940 ±  5%70%   1598
aim7/1BRD_48G-disk_src-3000-performance/ivb44
  599 925%   61

Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-16 Thread Fengguang Wu

On Tue, Aug 16, 2016 at 07:22:40AM +1000, Dave Chinner wrote:

On Mon, Aug 15, 2016 at 10:14:55PM +0800, Fengguang Wu wrote:

Hi Christoph,

On Sun, Aug 14, 2016 at 06:17:24PM +0200, Christoph Hellwig wrote:
>Snipping the long contest:
>
>I think there are three observations here:
>
>(1) removing the mark_page_accessed (which is the only significant
>change in the parent commit)  hurts the
>aim7/1BRD_48G-xfs-disk_rr-3000-performance/ivb44 test.
>I'd still rather stick to the filemap version and let the
>VM people sort it out.  How do the numbers for this test
>look for XFS vs say ext4 and btrfs?
>(2) lots of additional spinlock contention in the new case.  A quick
>check shows that I fat-fingered my rewrite so that we do
>the xfs_inode_set_eofblocks_tag call now for the pure lookup
>case, and pretty much all new cycles come from that.
>(3) Boy, are those xfs_inode_set_eofblocks_tag calls expensive, and
>we're already doing way too many even without my little bug above.
>
>So I've force pushed a new version of the iomap-fixes branch with
>(2) fixed, and also a little patch to xfs_inode_set_eofblocks_tag a
>lot less expensive slotted in before that.  Would be good to see
>the numbers with that.

The aim7 1BRD tests finished and there are ups and downs, with overall
performance remaining flat.

99091700659f4df9  74a242ad94d13436a1644c0b45  bf4dc6e4ecc2a3d042029319bc  
testcase/testparams/testbox
  --  --  
---


What do these commits refer to, please? They mean nothing without
the commit names.

/me goes searching. Ok:

99091700659 is the top of Linus' tree
74a242ad94d is 


That's the below one's parent commit, 74a242ad94d ("xfs: make
xfs_inode_set_eofblocks_tag cheaper for the common case").

Typically we'll compare a commit with its parent commit, and/or
the branch's base commit, which is normally on the mainline kernel.


bf4dc6e4ecc is the latest in Christoph's tree (because it's
mentioned below)


%stddev %change %stddev %change %stddev
\  |\  |
\ 159926  157324  158574
GEO-MEAN aim7.jobs-per-min
70897   5%  74137   4%  73775
aim7/1BRD_48G-xfs-creat-clo-1500-performance/ivb44
   485217 ±  3%492431  477533
aim7/1BRD_48G-xfs-disk_rd-9000-performance/ivb44
   360451 -19% 292980 -17% 299377
aim7/1BRD_48G-xfs-disk_rr-3000-performance/ivb44


So, why does random read go backwards by 20%? The iomap IO path
patches we are testing only affect the write path, so this
doesn't make a whole lot of sense.


   338114  338410   5% 354078
aim7/1BRD_48G-xfs-disk_rw-3000-performance/ivb44
60130 ±  5% 4%  62438   5%  62923
aim7/1BRD_48G-xfs-disk_src-3000-performance/ivb44
   403144  397790  410648
aim7/1BRD_48G-xfs-disk_wrt-3000-performance/ivb44


And this is the test the original regression was reported for:

gcc-6/performance/profile/1BRD_48G/xfs/x86_64-rhel/3000/debian-x86_64-2015-02-07.cgz/ivb44/disk_wrt/aim7

And that shows no improvement at all. The original regression was:

484435 ±  0% -13.3% 420004 ±  0%  aim7.jobs-per-min

So it's still 15% down on the original performance which, again,
doesn't make a whole lot of sense given the improvement in so many
other tests I've run.


Yes, the same performance with 4.8-rc1 means the regression is still
not recovered compared to the parent of the originally reported
first-bad-commit, f0c6bcba74ac51cb ("xfs: reorder zeroing and flushing
sequence in truncate"), which is on 4.7-rc1.


     26327                26534               26128         aim7/1BRD_48G-xfs-sync_disk_rw-600-performance/ivb44

The new commit bf4dc6e ("xfs: rewrite and optimize the delalloc write
path") improves the aim7/1BRD_48G-xfs-disk_rw-3000-performance/ivb44
case by 5%. Here are the detailed numbers:

aim7/1BRD_48G-xfs-disk_rw-3000-performance/ivb44


Not important at all. We need the results for the disk_wrt regression
we are chasing (disk_wrt-3000) so we can see how the code change
affected behaviour.


Yeah, it may not be relevant to this case study; however, it should
help evaluate the patch in a more complete way.


Here are the detailed numbers for the slowed down case:

aim7/1BRD_48G-xfs-disk_rr-3000-performance/ivb44

99091700659f4df9  bf4dc6e4ecc2a3d042029319bc
  --
%stddev  change %stddev
\  |\
    360451       -17%   299377        aim7.jobs-per-min
     12806       481%    74447        aim7.time.involuntary_context_switches

[...]

     19377       459%

Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-15 Thread Linus Torvalds
On Mon, Aug 15, 2016 at 5:19 PM, Dave Chinner  wrote:
>
>> None of this code is all that new, which is annoying. This must have
>> gone on forever,
>
> Yes, it has been. Just worse than I've notice before, probably
> because of all the stuff put under the tree lock in the past couple
> of years.

So this is where a good profile can matter.

Particularly if it's all about kswapd, and all the contention is just
from __remove_mapping(), what should matter is the "all the stuff"
added *there* and absolutely nowhere else.

Sadly (well, not for me), in my profiles I have

 --3.37%--kswapd
   |
--3.36%--shrink_node
  |
  |--2.88%--shrink_node_memcg
  |  |
  |   --2.87%--shrink_inactive_list
  | |
  | |--2.55%--shrink_page_list
  | |  |
  | |  |--0.84%--__remove_mapping
  | |  |  |
  | |  |  |--0.37%--__delete_from_page_cache
  | |  |  |  |
  | |  |  |   --0.21%--radix_tree_replace_clear_tags
  | |  |  | |
  | |  |  |  --0.12%--__radix_tree_lookup
  | |  |  |
  | |  |   --0.23%--_raw_spin_lock_irqsave
  | |  | |
  | |  |  --0.11%--queued_spin_lock_slowpath
  | |  |
   


which is rather different from your 22% spin-lock overhead.

Anyway, including the direct reclaim call paths gets
__remove_mapping() a bit higher, and _raw_spin_lock_irqsave climbs to
0.26%. But perhaps more importantly, looking at what __remove_mapping
actually *does* (apart from the spinlock) gives us:

 - inside remove_mapping itself (0.11% on its own - flat cost, no
child accounting)

48.50 │   lock   cmpxchg %edx,0x1c(%rbx)

so that's about 0.05%

 - 0.40% __delete_from_page_cache (0.22%
radix_tree_replace_clear_tags, 0.13% __radix_tree_lookup)

 - 0.06% workingset_eviction()

so I'm not actually seeing anything *new* expensive in there. The
__delete_from_page_cache() overhead may have changed a bit with the
tagged tree changes, but this doesn't look like memcg.

But we clearly have very different situations.

What does your profile show when you actually dig into
__remove_mapping() itself? Looking at your flat profile, I'm assuming
you get

   1.31%  [kernel]  [k] __radix_tree_lookup
   1.22%  [kernel]  [k] radix_tree_tag_set
   1.14%  [kernel]  [k] __remove_mapping

which is higher (but part of why my percentages are lower is that I
have that "50% CPU used for encryption" on my machine).

But I'm not seeing anything I'd attribute to "all the stuff added".
For example, originally I would have blamed memcg, but that's not
actually in this path at all.

I come back to wondering whether maybe you're hitting some PV-lock problem.

I know queued_spin_lock_slowpath() is ok. I'm not entirely sure
__pv_queued_spin_lock_slowpath() is.

So I'd love to see you try the non-PV case, but I also think it might
be interesting to see what the instruction profile for
__pv_queued_spin_lock_slowpath() itself is. They share a lot of code
(there's some interesting #include games going on to make
queued_spin_lock_slowpath() actually *be*
__pv_queued_spin_lock_slowpath() with some magic hooks), but there
might be issues.

For example, if you run a virtual 16-core system on a physical machine
that then doesn't consistently give 16 cores to the virtual machine,
you'll get no end of hiccups.

Because as mentioned, we've had bugs ("performance anomalies") there before.

   Linus


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-15 Thread Linus Torvalds
On Mon, Aug 15, 2016 at 5:38 PM, Dave Chinner  wrote:
>
> Same in 4.7 (flat profile numbers climbed higher after this
> snapshot was taken, as can be seen by the callgraph numbers):

Ok, so it's not the zone-vs-node thing.  It's just that nobody has
looked at that load in recent times.

Where "recent" may be years, of course.

   Linus


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-15 Thread Linus Torvalds
On Mon, Aug 15, 2016 at 5:17 PM, Dave Chinner  wrote:
>
> Read the code, Linus?

I am. It's how I came up with my current pet theory.

But I don't actually have enough sane numbers to make it much more
than a cute pet theory. It *might* explain why you see tons of kswapd
time and bad lock contention where it didn't use to exist, but ..

I can't recreate the problem, and your old profiles were bad enough
that they aren't really worth looking at.

> Except they *aren't broken*. They are simply *less accurate* than
> they could be.

They are so much less accurate that quite frankly, there's no point in
looking at them outside of "there is contention on the lock".

And considering that the numbers didn't even change when you had
spinlock debugging on, it's not the lock itself that causes this, I'm
pretty sure.

Because when you have normal contention due to the *locking* itself
being the problem, it tends to absolutely _explode_ with the debugging
spinlocks, because the lock itself becomes much more expensive.
Usually super-linearly.

But that wasn't the case here. The numbers stayed constant.

So yeah, I started looking at bigger behavioral issues, which is why I
zeroed in on that zone-vs-node change. But it might be a completely
broken theory. For example, if you still have the contention when
running plain 4.7, that theory was clearly complete BS.

And this is where "less accurate" means that they are almost entirely useless.

More detail needed. It might not be in the profiles themselves, of
course. There might be other much more informative sources if you can
come up with anything...

   Linus


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-15 Thread Dave Chinner
On Mon, Aug 15, 2016 at 04:48:36PM -0700, Linus Torvalds wrote:
> On Mon, Aug 15, 2016 at 4:20 PM, Linus Torvalds
>  wrote:
> >
> > None of this code is all that new, which is annoying. This must have
> > gone on forever,
> 
> ... ooh.
> 
> Wait, I take that back.
> 
> We actually have some very recent changes that I didn't even think
> about that went into this very merge window.

> Mel? The issue is that Dave Chinner is seeing some nasty spinlock
> contention on "mapping->tree_lock":
> 
> >   31.18%  [kernel]  [k] __pv_queued_spin_lock_slowpath
> 
> and one of the main paths is this:
> 
> >- 30.29% kswapd
> >   - 30.23% shrink_node
> >  - 30.07% shrink_node_memcg.isra.75
> > - 30.15% shrink_inactive_list
> >- 29.49% shrink_page_list
> >   - 22.79% __remove_mapping
> >  - 22.27% _raw_spin_lock_irqsave
> >   __pv_queued_spin_lock_slowpath
> 
> so there's something ridiculously bad going on with a fairly simple benchmark.
> 
> Dave's benchmark is literally just a "write a new 48GB file in
> single-page chunks on a 4-node machine". Nothing odd - not rewriting
> files, not seeking around, no nothing.
> 
> You can probably recreate it with a silly
> 
>   dd bs=4096 count=$((12*1024*1024)) if=/dev/zero of=bigfile
> 
> although Dave actually had something rather fancier, I think.

16p, 16GB RAM, fake_numa=4. Overwrite a 47GB file on a 48GB
filesystem:

# mkfs.xfs -f -d size=48g /dev/vdc
# mount /dev/vdc /mnt/scratch
# xfs_io -f -c "pwrite 0 47g" /mnt/scratch/fooey

Wait for memory to fill and reclaim to kick in, then look at the
profile. If you run it a second time, reclaim kicks in straight
away.

It's not the new code in 4.8 - it reproduces on 4.7 just fine, and
probably will reproduce all the way back to when the memcg-aware
writeback code was added

-Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-15 Thread Dave Chinner
On Mon, Aug 15, 2016 at 04:20:55PM -0700, Linus Torvalds wrote:
> On Mon, Aug 15, 2016 at 3:42 PM, Dave Chinner  wrote:
> >
> >   31.18%  [kernel]  [k] __pv_queued_spin_lock_slowpath
> >9.90%  [kernel]  [k] copy_user_generic_string
> >3.65%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
> >2.62%  [kernel]  [k] __block_commit_write.isra.29
> >2.26%  [kernel]  [k] _raw_spin_lock_irqsave
> >1.72%  [kernel]  [k] _raw_spin_lock
> 
> Ok, this is more like it.
> 
> I'd still like to see it on raw hardware, just to see if we may have a
> bug in the PV code. Because that code has been buggy before. I
> *thought* we fixed it, but ...
> 
> In fact, you don't even need to do it outside of virtualization, but
> with paravirt disabled (so that it runs the native non-pv locking in
> the virtual machine).
> 
> >36.60% 0.00%  [kernel][k] kswapd
> >- 30.29% kswapd
> >   - 30.23% shrink_node
> >  - 30.07% shrink_node_memcg.isra.75
> > - 30.15% shrink_inactive_list
> >- 29.49% shrink_page_list
> >   - 22.79% __remove_mapping
> >  - 22.27% _raw_spin_lock_irqsave
> >   __pv_queued_spin_lock_slowpath
> 
> How I dislike the way perf shows the call graph data... Just last week
> I was talking to Arnaldo about how to better visualize the cost of
> spinlocks, because the normal way "perf" shows costs is so nasty.

Do not change it - it's the way call graph profiles have been
presented for the past 20 years. I hate it when long-standing
conventions are changed because one person doesn't like them and
everyone else has to relearn skills they haven't had to think about
for years.

> What happens is that you see that 36% of CPU time is attributed to
> kswapd, and then you can drill down and see where that 36% comes from.
> So far so good, and that's what perf does fairly well.
> 
> But then when you find the spinlock, you actually want to go the other
> way, and instead ask it to show "who were the callers to this routine
> and what were the percentages", so that you can then see whether (for
> example) it's just that __remove_mapping() use that contends with
> itself, or whether it's contending with the page additions or
> whatever..

Um, perf already does that:

-   31.55%31.55%  [kernel][k] __pv_queued_spin_lock_slowpath
   - 19.83% ret_from_fork
  - kthread
 - 18.55% kswapd
  shrink_node
  shrink_node_memcg.isra.75
  shrink_inactive_list
   1.76% worker_thread
  process_one_work
  wb_workfn 
  wb_writeback
  __writeback_inodes_wb
  writeback_sb_inodes 
  __writeback_single_inode
  do_writepages
  xfs_vm_writepages
  write_cache_pages
  xfs_do_writepage
   + 5.95% __libc_pwrite

I have that right here because *it's a view of the profile I've
already looked at*. I didn't post it because, well, it's shorter to
simply say "contention comes from kswapd".

> So what I'd like to see (and this is where it becomes *so* much more
> useful to be able to recreate it myself so that I can play with the
> perf data several different ways) is to see what the profile looks
> like in that spinlocked region.

Boot your machine with "fake_numa=4", and play till your heart is
content. That's all I do with my test VMs to make them exercise NUMA
paths.

> None of this code is all that new, which is annoying. This must have
> gone on forever,

Yes, it has been. Just worse than I've notice before, probably
because of all the stuff put under the tree lock in the past couple
of years.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-15 Thread Dave Chinner
On Mon, Aug 15, 2016 at 05:15:47PM -0700, Linus Torvalds wrote:
> DaveC - does the spinlock contention go away if you just go back to
> 4.7? If so, I think it's the new zone thing. But it would be good to
> verify - maybe it's something entirely different and it goes back much
> further.

Same in 4.7 (flat profile numbers climbed higher after this
snapshot was taken, as can be seen by the callgraph numbers):

  29.47%  [kernel]  [k] __pv_queued_spin_lock_slowpath
  11.59%  [kernel]  [k] copy_user_generic_string
   3.13%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
   2.87%  [kernel]  [k] __block_commit_write.isra.29
   2.02%  [kernel]  [k] _raw_spin_lock_irqsave
   1.77%  [kernel]  [k] get_page_from_freelist
   1.36%  [kernel]  [k] __wake_up_bit 
   1.31%  [kernel]  [k] __radix_tree_lookup
   1.22%  [kernel]  [k] radix_tree_tag_set
   1.16%  [kernel]  [k] clear_page_dirty_for_io
   1.14%  [kernel]  [k] __remove_mapping
   1.14%  [kernel]  [k] _raw_spin_lock
   1.00%  [kernel]  [k] zone_dirty_ok
   0.95%  [kernel]  [k] radix_tree_tag_clear
   0.90%  [kernel]  [k] generic_write_end
   0.89%  [kernel]  [k] __delete_from_page_cache
   0.87%  [kernel]  [k] unlock_page
   0.86%  [kernel]  [k] cancel_dirty_page
   0.81%  [kernel]  [k] up_write
   0.80%  [kernel]  [k] ___might_sleep
   0.77%  [kernel]  [k] _raw_spin_unlock_irqrestore
   0.75%  [kernel]  [k] generic_perform_write
   0.72%  [kernel]  [k] xfs_do_writepage  
   0.69%  [kernel]  [k] down_write
   0.63%  [kernel]  [k] shrink_page_list  
   0.63%  [kernel]  [k] __xfs_get_blocks  
   0.61%  [kernel]  [k] __test_set_page_writeback
   0.59%  [kernel]  [k] free_hot_cold_page
   0.57%  [kernel]  [k] write_cache_pages
   0.56%  [kernel]  [k] __radix_tree_create
   0.55%  [kernel]  [k] __list_add
   0.53%  [kernel]  [k] page_mapping
   0.53%  [kernel]  [k] drop_buffers
   0.51%  [kernel]  [k] xfs_vm_releasepage
   0.51%  [kernel]  [k] free_pcppages_bulk
   0.50%  [kernel]  [k] __list_del_entry  


   38.07%38.07%  [kernel][k] __pv_queued_spin_lock_slowpath
   - 25.52% ret_from_fork
  - kthread
 - 24.36% kswapd
  shrink_zone
  shrink_zone_memcg.isra.73
  shrink_inactive_list
 - 3.21% worker_thread
  process_one_work
  wb_workfn
  wb_writeback
  __writeback_inodes_wb
  writeback_sb_inodes
  __writeback_single_inode
  do_writepages
  xfs_vm_writepages
  write_cache_pages
   - 10.06% __libc_pwrite
entry_SYSCALL_64_fastpath
sys_pwrite64
vfs_write
__vfs_write
xfs_file_write_iter
xfs_file_buffered_aio_write
  - generic_perform_write
 - 5.51% xfs_vm_write_begin
- 4.94% grab_cache_page_write_begin
 pagecache_get_page
  0.57% __block_write_begin
 create_page_buffers
 create_empty_buffers
 _raw_spin_lock
 __pv_queued_spin_lock_slowpath
 - 4.88% xfs_vm_write_end
  generic_write_end
  block_write_end
  __block_commit_write.isra.29
  mark_buffer_dirty
  __set_page_dirty

-Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-15 Thread Dave Chinner
On Mon, Aug 15, 2016 at 04:01:00PM -0700, Linus Torvalds wrote:
> On Mon, Aug 15, 2016 at 3:22 PM, Dave Chinner  wrote:
> >
> > Right, but that does not make the profile data useless,
> 
> Yes it does. Because it basically hides everything that happens inside
> the lock, which is what causes the contention in the first place.

Read the code, Linus?

> So stop making inane and stupid arguments, Dave.

We know what happens inside the lock, and we know exactly how much
it is supposed to cost. And it isn't anywhere near as much as the
profiles indicate the function that contains the lock is costing.

Occam's Razor leads to only one conclusion, like it or not.

> Your profiles are shit. Deal with it, or accept that nobody is ever
> going to bother working on them because your profiles don't give
> useful information.
> 
> I see that you actually fixed your profiles, but quite frankly, the
> amount of pure unadulterated crap you posted in this email is worth
> reacting negatively to.

I'm happy to be told that I'm wrong *when I'm wrong*, but you always
say "read the code to understand a problem" rather than depending on
potentially unreliable tools and debug information that is gathered.

Yet when I do that using partial profile information, your reaction
is to tell me I am "full of shit" because my information isn't 100%
reliable? Really, Linus?

> You generally make so much sense that it's shocking to see you then
> make these crazy excuses for your completely broken profiles.

Except they *aren't broken*. They are simply *less accurate* than
they could be. That does not invalidate the profile nor does it mean
that the insight it gives us into the functioning of the code is
wrong.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-15 Thread Dave Chinner
On Mon, Aug 15, 2016 at 10:22:43AM -0700, Huang, Ying wrote:
> Hi, Chinner,
> 
> Dave Chinner  writes:
> 
> > On Wed, Aug 10, 2016 at 06:00:24PM -0700, Linus Torvalds wrote:
> >> On Wed, Aug 10, 2016 at 5:33 PM, Huang, Ying  wrote:
> >> >
> >> > Here it is,
> >> 
> >> Thanks.
> >> 
> >> Appended is a munged "after" list, with the "before" values in
> >> parenthesis. It actually looks fairly similar.
> >> 
> >> The biggest difference is that we have "mark_page_accessed()" show up
> >> after, and not before. There was also a lot of LRU noise in the
> >> non-profile data. I wonder if that is the reason here: the old model
> >> of using generic_perform_write/block_page_mkwrite didn't mark the
> >> pages accessed, and now with iomap_file_buffered_write() they get
> >> marked as active and that screws up the LRU list, and makes us not
> >> flush out the dirty pages well (because they are seen as active and
> >> not good for writeback), and then you get bad memory use.
> >> 
> >> I'm not seeing anything that looks like locking-related.
> >
> > Not in that profile. I've been doing some local testing inside a
> > 4-node fake-numa 16p/16GB RAM VM to see what I can find.
> 
> You run the test in a virtual machine, I think that is why your perf
> data looks strange (high value of _raw_spin_unlock_irqrestore).
> 
> To setup KVM to use perf, you may refer to,
> 
> https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Virtualization_Tuning_and_Optimization_Guide/sect-Virtualization_Tuning_Optimization_Guide-Monitoring_Tools-vPMU.html
> https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Virtualization_Administration_Guide/sect-perf-mon.html
> 
> I haven't tested them.  You may Google to find more information.  Or the
> perf/kvm people can give you more information.

Thanks, "-cpu host" on the qemu command line works. I hate magic,
undocumented(*) features that are necessary to make basic stuff work. 

-Dave.

(*) yeah, try working out that this capability even exists from the
qemu/kvm man page.

-- 
Dave Chinner
da...@fromorbit.com


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-15 Thread Linus Torvalds
On Mon, Aug 15, 2016 at 4:20 PM, Linus Torvalds
 wrote:
>
> But I'll try to see what happens
> on my profile, even if I can't recreate the contention itself, just
> trying to see what happens inside of that region.

Yeah, since I run my machines on encrypted disks, my profile shows 60%
kthread, but that's just because 55% is crypto.

I only have 5% in kswapd. And the spinlock doesn't even show up for me
(but "__delete_from_page_cache()" does, which doesn't look
unreasonable).

And while the biggest reason the spinlock doesn't show up is likely
simply my single-node "everything is on one die", I still think the
lower kswapd CPU use might be partly due to the node-vs-zone thing.

For me, with just one node, the new

test_bit(PGDAT_WRITEBACK, &pgdat->flags)) {

ends up being very similar to what we used to have before, ie

test_bit(ZONE_WRITEBACK, &zone->flags)) {

but on a multi-node machine it would be rather different.

So I might never see contention anyway.

The basic logic in shrink_page_list() goes back to commit 283aba9f9e0
("mm: vmscan: block kswapd if it is encountering pages under
writeback") but it has been messed around with a lot (and something
else existed there before - we've always had some "throttle kswapd so
that it doesn't use insane amounts of CPU time").

DaveC - does the spinlock contention go away if you just go back to
4.7? If so, I think it's the new zone thing. But it would be good to
verify - maybe it's something entirely different and it goes back much
further.

Mel - I may be barking up entirely the wrong tree, but it would be
good if you could take a look just in case this is actually it.

  Linus


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-15 Thread Linus Torvalds
On Mon, Aug 15, 2016 at 4:20 PM, Linus Torvalds
 wrote:
>
> None of this code is all that new, which is annoying. This must have
> gone on forever,

... ooh.

Wait, I take that back.

We actually have some very recent changes that I didn't even think
about that went into this very merge window.

In particular, I wonder if it's all (or at least partly) due to the
new per-node LRU lists.

So in shrink_page_list(), when kswapd is encountering a page that is
under page writeback due to page reclaim, it does:

        if (current_is_kswapd() &&
            PageReclaim(page) &&
            test_bit(PGDAT_WRITEBACK, &pgdat->flags)) {
                nr_immediate++;
                goto keep_locked;
        }

which basically ignores that page and puts it back on the LRU list.

But that "is this node under writeback" is new - it now does that per
node, and it *used* to do it per zone (so it _used_ to test "is this
zone under writeback").

All the mapping pages used to be in the same zone, so I think it
effectively single-threaded the kswapd reclaim for one mapping under
reclaim writeback. But in your cases, you have multiple nodes...

Ok, that's a lot of hand-wavy new-age crystal healing thinking.

Really, I haven't looked at it more than "this is one thing that has
changed recently, I wonder if it changes the patterns and could
explain much higher spin_lock contention on the mapping->tree_lock".

I'm adding Mel Gorman and his band of miscreants to the cc, so that
they can tell me that I'm full of shit, and completely missed on what
that zone->node change actually ends up meaning.

Mel? The issue is that Dave Chinner is seeing some nasty spinlock
contention on "mapping->tree_lock":

>   31.18%  [kernel]  [k] __pv_queued_spin_lock_slowpath

and one of the main paths is this:

>- 30.29% kswapd
>   - 30.23% shrink_node
>  - 30.07% shrink_node_memcg.isra.75
> - 30.15% shrink_inactive_list
>- 29.49% shrink_page_list
>   - 22.79% __remove_mapping
>  - 22.27% _raw_spin_lock_irqsave
>   __pv_queued_spin_lock_slowpath

so there's something ridiculously bad going on with a fairly simple benchmark.

Dave's benchmark is literally just a "write a new 48GB file in
single-page chunks on a 4-node machine". Nothing odd - not rewriting
files, not seeking around, no nothing.

You can probably recreate it with a silly

  dd bs=4096 count=$((12*1024*1024)) if=/dev/zero of=bigfile

although Dave actually had something rather fancier, I think.

 Linus


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-15 Thread Linus Torvalds
On Mon, Aug 15, 2016 at 3:42 PM, Dave Chinner  wrote:
>
>   31.18%  [kernel]  [k] __pv_queued_spin_lock_slowpath
>9.90%  [kernel]  [k] copy_user_generic_string
>3.65%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
>2.62%  [kernel]  [k] __block_commit_write.isra.29
>2.26%  [kernel]  [k] _raw_spin_lock_irqsave
>1.72%  [kernel]  [k] _raw_spin_lock

Ok, this is more like it.

I'd still like to see it on raw hardware, just to see if we may have a
bug in the PV code. Because that code has been buggy before. I
*thought* we fixed it, but ...

In fact, you don't even need to do it outside of virtualization, but
with paravirt disabled (so that it runs the native non-pv locking in
the virtual machine).

>36.60% 0.00%  [kernel][k] kswapd
>- 30.29% kswapd
>   - 30.23% shrink_node
>  - 30.07% shrink_node_memcg.isra.75
> - 30.15% shrink_inactive_list
>- 29.49% shrink_page_list
>   - 22.79% __remove_mapping
>  - 22.27% _raw_spin_lock_irqsave
>   __pv_queued_spin_lock_slowpath

How I dislike the way perf shows the call graph data... Just last week
I was talking to Arnaldo about how to better visualize the cost of
spinlocks, because the normal way "perf" shows costs is so nasty.

What happens is that you see that 36% of CPU time is attributed to
kswapd, and then you can drill down and see where that 36% comes from.
So far so good, and that's what perf does fairly well.

But then when you find the spinlock, you actually want to go the other
way, and instead ask it to show "who were the callers to this routine
and what were the percentages", so that you can then see whether (for
example) it's just that __remove_mapping() use that contends with
itself, or whether it's contending with the page additions or
whatever..

And perf makes that unnecessarily much too hard to see.

So what I'd like to see (and this is where it becomes *so* much more
useful to be able to recreate it myself so that I can play with the
perf data several different ways) is to see what the profile looks
like in that spinlocked region.

Hmm. I guess you could just send me the "perf.data" and "vmlinux"
files, and I can look at it that way. But I'll try to see what happens
on my profile, even if I can't recreate the contention itself, just
trying to see what happens inside of that region.

None of this code is all that new, which is annoying. This must have
gone on forever,

   Linus


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-15 Thread Linus Torvalds
On Mon, Aug 15, 2016 at 3:22 PM, Dave Chinner  wrote:
>
> Right, but that does not make the profile data useless,

Yes it does. Because it basically hides everything that happens inside
the lock, which is what causes the contention in the first place.

So stop making inane and stupid arguments, Dave.

Your profiles are shit. Deal with it, or accept that nobody is ever
going to bother working on them because your profiles don't give
useful information.

I see that you actually fixed your profiles, but quite frankly, the
amount of pure unadulterated crap you posted in this email is worth
reacting negatively to.

You generally make so much sense that it's shocking to see you then
make these crazy excuses for your completely broken profiles.

 Linus


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-15 Thread Dave Chinner
On Tue, Aug 16, 2016 at 08:22:11AM +1000, Dave Chinner wrote:
> On Sun, Aug 14, 2016 at 10:12:20PM -0700, Linus Torvalds wrote:
> > On Aug 14, 2016 10:00 PM, "Dave Chinner"  wrote:
> > >
> > > > What does it say if you annotate that _raw_spin_unlock_irqrestore()
> > function?
> > > 
> > > │
> > > │ Disassembly of section load0:
> > > │
> > > │ 81e628b0 <_raw_spin_unlock_irqrestore>:
> > > │   nop
> > > │   push   %rbp
> > > │   mov    %rsp,%rbp
> > > │   movb   $0x0,(%rdi)
> > > │   nop
> > > │   mov    %rsi,%rdi
> > > │   push   %rdi
> > > │   popfq
> >  99.35 │   nop
> > 
> > Yeah, that's a good disassembly of a non-debug spin unlock, and the symbols
> > are fine, but the profile is not valid. That's an interrupt point, right
> > after the popf that enables interrupts again.
> > 
> > I don't know why 'perf' isn't working on your machine, but it clearly
> > isn't.
> > 
> > Has it ever worked on that machine?
> 
> It's working the same as it's worked since I started using it many
> years ago.
> 
> > What cpu is it?
> 
> Intel(R) Xeon(R) CPU E5-4620 0 @ 2.20GHz
> 
> > Are you running in some
> > virtualized environment without performance counters, perhaps?
> 
> I've mentioned a couple of times in this thread that I'm testing
> inside a VM. It's the same VM I've been running performance tests in
> since early 2010. Nobody has complained that the profiles I've
> posted are useless before, and not once in all that time have they
> been wrong in indicating a spinning lock contention point.
> 
> i.e. In previous cases where I've measured double digit CPU usage
> numbers in a spin_unlock variant, it's always been a result of
> spinlock contention. And fixing the algorithmic problem that lead to
> the spinlock showing up in the profile in the first place has always
> substantially improved performance and scalability.
> 
> As such, I'm always going to treat a locking profile like that as
> contention because even if it isn't contending *on my machine*,
> that amount of work being done under a spinning lock is /way too
> much/ and it *will* cause contention problems with larger machines.

And, so, after helpfully being pointed at the magic kvm "-cpu host"
flag to enable access to the performance counters from the guest
(using "-e cycles", because more precise counters aren't available),
the profile looks like this:

  31.18%  [kernel]  [k] __pv_queued_spin_lock_slowpath
   9.90%  [kernel]  [k] copy_user_generic_string
   3.65%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
   2.62%  [kernel]  [k] __block_commit_write.isra.29
   2.26%  [kernel]  [k] _raw_spin_lock_irqsave
   1.72%  [kernel]  [k] _raw_spin_lock
   1.33%  [kernel]  [k] __wake_up_bit
   1.20%  [kernel]  [k] __radix_tree_lookup
   1.19%  [kernel]  [k] __remove_mapping  
   1.12%  [kernel]  [k] __delete_from_page_cache
   0.97%  [kernel]  [k] xfs_do_writepage  
   0.91%  [kernel]  [k] get_page_from_freelist
   0.90%  [kernel]  [k] up_write  
   0.88%  [kernel]  [k] clear_page_dirty_for_io
   0.83%  [kernel]  [k] radix_tree_tag_set
   0.81%  [kernel]  [k] radix_tree_tag_clear
   0.80%  [kernel]  [k] down_write
   0.78%  [kernel]  [k] _raw_spin_unlock_irqrestore
   0.77%  [kernel]  [k] shrink_page_list
   0.76%  [kernel]  [k] ___might_sleep
   0.76%  [kernel]  [k] unlock_page
   0.74%  [kernel]  [k] __list_del_entry
   0.67%  [kernel]  [k] __add_to_page_cache_locked
   0.65%  [kernel]  [k] node_dirty_ok
   0.61%  [kernel]  [k] __rmqueue
   0.61%  [kernel]  [k] __block_write_begin_int
   0.61%  [kernel]  [k] cancel_dirty_page
   0.61%  [kernel]  [k] __test_set_page_writeback
   0.59%  [kernel]  [k] page_mapping
   0.57%  [kernel]  [k] __list_add
   0.56%  [kernel]  [k] free_pcppages_bulk
   0.54%  [kernel]  [k] _raw_spin_lock_irq
   0.54%  [kernel]  [k] generic_write_end
   0.51%  [kernel]  [k] drop_buffers

The call graph should be familiar by now:

   36.60% 0.00%  [kernel][k] kswapd
   - 30.29% kswapd  
  - 30.23% shrink_node
 - 30.07% shrink_node_memcg.isra.75
- 30.15% shrink_inactive_list
   - 29.49% shrink_page_list
  - 22.79% __remove_mapping
 - 22.27% _raw_spin_lock_irqsave
  __pv_queued_spin_lock_slowpath
 + 1.86% __delete_from_page_cache
 + 1.27% _raw_spin_unlock_irqrestore
  + 4.31% try_to_release_page
  + 1.21% free_hot_cold_page_list
0.56% page_evictable
 0.77% isolate_lru_pages.isra.72

That sure looks like spin lock contention to me.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-15 Thread Dave Chinner
On Sun, Aug 14, 2016 at 10:12:20PM -0700, Linus Torvalds wrote:
> On Aug 14, 2016 10:00 PM, "Dave Chinner"  wrote:
> >
> > > What does it say if you annotate that _raw_spin_unlock_irqrestore()
> function?
> > 
> > │
> > │ Disassembly of section load0:
> > │
> > │ 81e628b0 <_raw_spin_unlock_irqrestore>:
> > │   nop
> > │   push   %rbp
> > │   mov    %rsp,%rbp
> > │   movb   $0x0,(%rdi)
> > │   nop
> > │   mov    %rsi,%rdi
> > │   push   %rdi
> > │   popfq
> >  99.35 │   nop
> 
> Yeah, that's a good disassembly of a non-debug spin unlock, and the symbols
> are fine, but the profile is not valid. That's an interrupt point, right
> after the popf that enables interrupts again.
> 
> I don't know why 'perf' isn't working on your machine, but it clearly
> isn't.
> 
> Has it ever worked on that machine?

It's working the same as it's worked since I started using it many
years ago.

> What cpu is it?

Intel(R) Xeon(R) CPU E5-4620 0 @ 2.20GHz

> Are you running in some
> virtualized environment without performance counters, perhaps?

I've mentioned a couple of times in this thread that I'm testing
inside a VM. It's the same VM I've been running performance tests in
since early 2010. Nobody has complained that the profiles I've
posted are useless before, and not once in all that time have they
been wrong in indicating a spinning lock contention point.

i.e. In previous cases where I've measured double digit CPU usage
numbers in a spin_unlock variant, it's always been a result of
spinlock contention. And fixing the algorithmic problem that lead to
the spinlock showing up in the profile in the first place has always
substantially improved performance and scalability.

As such, I'm always going to treat a locking profile like that as
contention because even if it isn't contending *on my machine*,
that amount of work being done under a spinning lock is /way too
much/ and it *will* cause contention problems with larger machines.

> It's not actually the unlock that is expensive, and there is no contention
> on the lock (if there had been, the numbers would have been entirely
> different for the debug case, which makes locking an order of magnitude
> more expensive). All the cost of everything that happened while interrupts
> were disabled is just accounted to the instruction after they were enabled
> again.

Right, but that does not make the profile data useless, nor should
you shoot the messenger because they weren't supplied with
information you think should have been in the message. The message
still says that the majority of the overhead is in
__remove_mapping(), and it's an excessive amount of work being done
inside the tree_lock with interrupts disabled.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-15 Thread Dave Chinner
On Mon, Aug 15, 2016 at 10:14:55PM +0800, Fengguang Wu wrote:
> Hi Christoph,
> 
> On Sun, Aug 14, 2016 at 06:17:24PM +0200, Christoph Hellwig wrote:
> >Snipping the long context:
> >
> >I think there are three observations here:
> >
> >(1) removing the mark_page_accessed (which is the only significant
> >change in the parent commit)  hurts the
> >aim7/1BRD_48G-xfs-disk_rr-3000-performance/ivb44 test.
> >I'd still rather stick to the filemap version and let the
> >VM people sort it out.  How do the numbers for this test
> >look for XFS vs say ext4 and btrfs?
> >(2) lots of additional spinlock contention in the new case.  A quick
> >check shows that I fat-fingered my rewrite so that we do
> >the xfs_inode_set_eofblocks_tag call now for the pure lookup
> >case, and pretty much all new cycles come from that.
> >(3) Boy, are those xfs_inode_set_eofblocks_tag calls expensive, and
> >we're already doing way too many even without my little bug above.
> >
> >So I've force pushed a new version of the iomap-fixes branch with
> >(2) fixed, and also a little patch to xfs_inode_set_eofblocks_tag a
> >lot less expensive slotted in before that.  Would be good to see
> >the numbers with that.
> 
> The aim7 1BRD tests finished and there are ups and downs, with overall
> performance remain flat.
> 
> 99091700659f4df9  74a242ad94d13436a1644c0b45  bf4dc6e4ecc2a3d042029319bc  
> testcase/testparams/testbox
>   --  --  
> ---

What do these commits refer to, please? They mean nothing without
the commit names.

/me goes searching. Ok:

99091700659 is the top of Linus' tree
74a242ad94d is 
bf4dc6e4ecc is the latest in Christoph's tree (because it's
mentioned below)

> %stddev %change %stddev %change %stddev
> \  |\  |
> \ 159926  157324  158574
> GEO-MEAN aim7.jobs-per-min
> 70897   5%  74137   4%  73775
> aim7/1BRD_48G-xfs-creat-clo-1500-performance/ivb44
>485217 ±  3%492431  477533
> aim7/1BRD_48G-xfs-disk_rd-9000-performance/ivb44
>360451 -19% 292980 -17% 299377
> aim7/1BRD_48G-xfs-disk_rr-3000-performance/ivb44

So, why does random read go backwards by 20%? The iomap IO path
patches we are testing only affect the write path, so this
doesn't make a whole lot of sense.

>338114  338410   5% 354078
> aim7/1BRD_48G-xfs-disk_rw-3000-performance/ivb44
> 60130 ±  5% 4%  62438   5%  62923
> aim7/1BRD_48G-xfs-disk_src-3000-performance/ivb44
>403144  397790  410648
> aim7/1BRD_48G-xfs-disk_wrt-3000-performance/ivb44

And this is the test the original regression was reported for:

gcc-6/performance/profile/1BRD_48G/xfs/x86_64-rhel/3000/debian-x86_64-2015-02-07.cgz/ivb44/disk_wrt/aim7

And that shows no improvement at all. The original regression was:

484435 ±  0% -13.3% 420004 ±  0%  aim7.jobs-per-min

So it's still 15% down on the original performance which, again,
doesn't make a whole lot of sense given the improvement in so many
other tests I've run.

> 26327   26534   26128
> aim7/1BRD_48G-xfs-sync_disk_rw-600-performance/ivb44
> 
> The new commit bf4dc6e ("xfs: rewrite and optimize the delalloc write
> path") improves the aim7/1BRD_48G-xfs-disk_rw-3000-performance/ivb44
> case by 5%. Here are the detailed numbers:
> 
> aim7/1BRD_48G-xfs-disk_rw-3000-performance/ivb44

Not important at all. We need the results for the disk_wrt regression
we are chasing (disk_wrt-3000) so we can see how the code change
affected behaviour.

> Here are the detailed numbers for the slowed down case:
> 
> aim7/1BRD_48G-xfs-disk_rr-3000-performance/ivb44
> 
> 99091700659f4df9  bf4dc6e4ecc2a3d042029319bc
>   --
> %stddev  change %stddev
> \  |\
>360451 -17% 299377aim7.jobs-per-min
> 12806 481%  74447
> aim7.time.involuntary_context_switches
.
> 19377 459% 108364vmstat.system.cs
.
>   487 ± 89%  3e+04  26448 ± 57%  
> latency_stats.max.down.xfs_buf_lock._xfs_buf_find.xfs_buf_get_map.xfs_buf_read_map.xfs_trans_read_buf_map.xfs_read_agf.xfs_alloc_read_agf.xfs_alloc_fix_freelist.xfs_free_extent_fix_freelist.xfs_free_extent.xfs_trans_free_extent
>  1823 ± 82%  2e+061913796 ± 38%  
> latency_stats.sum.down.xfs_buf_lock._xfs_buf_find.xfs_buf_get_map.xfs_buf_read_map.xfs_trans_read_buf_map.xfs_read_agf.xfs_alloc_read_agf.xfs_alloc_fix_freelist.xfs_free_ext

Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-15 Thread Huang, Ying
Christoph Hellwig  writes:

> Snipping the long context:
>
> I think there are three observations here:
>
>  (1) removing the mark_page_accessed (which is the only significant
>  change in the parent commit)  hurts the
>  aim7/1BRD_48G-xfs-disk_rr-3000-performance/ivb44 test.
>  I'd still rather stick to the filemap version and let the
>  VM people sort it out.  How do the numbers for this test
>  look for XFS vs say ext4 and btrfs?
>  (2) lots of additional spinlock contention in the new case.  A quick
>  check shows that I fat-fingered my rewrite so that we do
>  the xfs_inode_set_eofblocks_tag call now for the pure lookup
>  case, and pretty much all new cycles come from that.
>  (3) Boy, are those xfs_inode_set_eofblocks_tag calls expensive, and
>  we're already doing way too many even without my little bug above.
>
> So I've force pushed a new version of the iomap-fixes branch with
> (2) fixed, and also a little patch to make xfs_inode_set_eofblocks_tag a
> lot less expensive slotted in before that.  Would be good to see
> the numbers with that.

For the original reported regression, the test result is as follow,

=
compiler/cpufreq_governor/debug-setup/disk/fs/kconfig/load/rootfs/tbox_group/test/testcase:
  
gcc-6/performance/profile/1BRD_48G/xfs/x86_64-rhel/3000/debian-x86_64-2015-02-07.cgz/ivb44/disk_wrt/aim7

commit: 
  f0c6bcba74ac51cb77aadb33ad35cb2dc1ad1506 (parent of first bad commit)
  68a9f5e7007c1afa2cf6830b690a90d0187c0684 (first bad commit)
  99091700659f4df965e138b38b4fa26a29b7eade (base of your fixes branch)
  bf4dc6e4ecc2a3d042029319bc8cd4204c185610 (head of your fixes branch)

f0c6bcba74ac51cb 68a9f5e7007c1afa2cf6830b69 99091700659f4df965e138b38b 
bf4dc6e4ecc2a3d042029319bc 
 -- -- 
-- 
 %stddev %change %stddev %change %stddev 
%change %stddev
 \  |\  |\  
|\  
484435 ±  0% -13.3% 420004 ±  0% -17.0% 402250 ±  0% 
-15.6% 408998 ±  0%  aim7.jobs-per-min


And the perf data is as follow,

  "perf-profile.func.cycles-pp.intel_idle": 20.25,
  "perf-profile.func.cycles-pp.memset_erms": 11.72,
  "perf-profile.func.cycles-pp.copy_user_enhanced_fast_string": 8.37,
  "perf-profile.func.cycles-pp.__block_commit_write.isra.21": 3.49,
  "perf-profile.func.cycles-pp.block_write_end": 1.77,
  "perf-profile.func.cycles-pp.native_queued_spin_lock_slowpath": 1.63,
  "perf-profile.func.cycles-pp.unlock_page": 1.58,
  "perf-profile.func.cycles-pp.___might_sleep": 1.56,
  "perf-profile.func.cycles-pp.__block_write_begin_int": 1.33,
  "perf-profile.func.cycles-pp.iov_iter_copy_from_user_atomic": 1.23,
  "perf-profile.func.cycles-pp.up_write": 1.21,
  "perf-profile.func.cycles-pp.__mark_inode_dirty": 1.18,
  "perf-profile.func.cycles-pp.down_write": 1.06,
  "perf-profile.func.cycles-pp.mark_buffer_dirty": 0.94,
  "perf-profile.func.cycles-pp.generic_write_end": 0.92,
  "perf-profile.func.cycles-pp.__radix_tree_lookup": 0.91,
  "perf-profile.func.cycles-pp._raw_spin_lock": 0.81,
  "perf-profile.func.cycles-pp.entry_SYSCALL_64_fastpath": 0.79,
  "perf-profile.func.cycles-pp.__might_sleep": 0.79,
  "perf-profile.func.cycles-pp.xfs_file_iomap_begin_delay.isra.9": 0.7,
  "perf-profile.func.cycles-pp.__list_del_entry": 0.7,
  "perf-profile.func.cycles-pp.vfs_write": 0.69,
  "perf-profile.func.cycles-pp.drop_buffers": 0.68,
  "perf-profile.func.cycles-pp.xfs_file_write_iter": 0.67,
  "perf-profile.func.cycles-pp.rwsem_spin_on_owner": 0.67,

Best Regards,
Huang, Ying


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-15 Thread Huang, Ying
Hi, Chinner,

Dave Chinner  writes:

> On Wed, Aug 10, 2016 at 06:00:24PM -0700, Linus Torvalds wrote:
>> On Wed, Aug 10, 2016 at 5:33 PM, Huang, Ying  wrote:
>> >
>> > Here it is,
>> 
>> Thanks.
>> 
>> Appended is a munged "after" list, with the "before" values in
>> parenthesis. It actually looks fairly similar.
>> 
>> The biggest difference is that we have "mark_page_accessed()" show up
>> after, and not before. There was also a lot of LRU noise in the
>> non-profile data. I wonder if that is the reason here: the old model
>> of using generic_perform_write/block_page_mkwrite didn't mark the
>> pages accessed, and now with iomap_file_buffered_write() they get
>> marked as active and that screws up the LRU list, and makes us not
>> flush out the dirty pages well (because they are seen as active and
>> not good for writeback), and then you get bad memory use.
>> 
>> I'm not seeing anything that looks like locking-related.
>
> Not in that profile. I've been doing some local testing inside a
> 4-node fake-numa 16p/16GB RAM VM to see what I can find.

You run the test in a virtual machine, I think that is why your perf
data looks strange (high value of _raw_spin_unlock_irqrestore).

To setup KVM to use perf, you may refer to,

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Virtualization_Tuning_and_Optimization_Guide/sect-Virtualization_Tuning_Optimization_Guide-Monitoring_Tools-vPMU.html
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Virtualization_Administration_Guide/sect-perf-mon.html

I haven't tested them.  You may Google to find more information.  Or the
perf/kvm people can give you more information.

> I'm yet to work out how I can trigger a profile like the one that
> was reported (I really need to see the event traces), but in the
> mean time I found this
>
> Doing a large sequential single threaded buffered write using a 4k
> buffer (so single page per syscall to make the XFS IO path allocator
> behave the same way as in 4.7), I'm seeing a CPU profile that
> indicates we have a potential mapping->tree_lock issue:
>
> # xfs_io -f -c "truncate 0" -c "pwrite 0 47g" /mnt/scratch/fooey
> wrote 50465865728/50465865728 bytes at offset 0
> 47.000 GiB, 12320768 ops; 0:01:36.00 (499.418 MiB/sec and 127850.9132 ops/sec)
>
> 
>
>   24.15%  [kernel]  [k] _raw_spin_unlock_irqrestore
>9.67%  [kernel]  [k] copy_user_generic_string
>5.64%  [kernel]  [k] _raw_spin_unlock_irq
>3.34%  [kernel]  [k] get_page_from_freelist
>2.57%  [kernel]  [k] mark_page_accessed
>2.45%  [kernel]  [k] do_raw_spin_lock
>1.83%  [kernel]  [k] shrink_page_list
>1.70%  [kernel]  [k] free_hot_cold_page
>1.26%  [kernel]  [k] xfs_do_writepage

Best Regards,
Huang, Ying


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-15 Thread Fengguang Wu

Hi Christoph,

On Sun, Aug 14, 2016 at 06:17:24PM +0200, Christoph Hellwig wrote:

Snipping the long context:

I think there are three observations here:

(1) removing the mark_page_accessed (which is the only significant
change in the parent commit)  hurts the
aim7/1BRD_48G-xfs-disk_rr-3000-performance/ivb44 test.
I'd still rather stick to the filemap version and let the
VM people sort it out.  How do the numbers for this test
look for XFS vs say ext4 and btrfs?
(2) lots of additional spinlock contention in the new case.  A quick
check shows that I fat-fingered my rewrite so that we do
the xfs_inode_set_eofblocks_tag call now for the pure lookup
case, and pretty much all new cycles come from that.
(3) Boy, are those xfs_inode_set_eofblocks_tag calls expensive, and
we're already doing way too many even without my little bug above.

So I've force pushed a new version of the iomap-fixes branch with
(2) fixed, and also a little patch to xfs_inode_set_eofblocks_tag a
lot less expensive slotted in before that.  Would be good to see
the numbers with that.


The aim7 1BRD tests finished and there are ups and downs, with overall
performance remain flat.

99091700659f4df9  74a242ad94d13436a1644c0b45  bf4dc6e4ecc2a3d042029319bc  
testcase/testparams/testbox
  --  --  
---
%stddev %change %stddev %change %stddev
\  |\  |\  
   159926  157324  158574GEO-MEAN aim7.jobs-per-min

70897   5%  74137   4%  73775
aim7/1BRD_48G-xfs-creat-clo-1500-performance/ivb44
   485217 ±  3%492431  477533
aim7/1BRD_48G-xfs-disk_rd-9000-performance/ivb44
   360451 -19% 292980 -17% 299377
aim7/1BRD_48G-xfs-disk_rr-3000-performance/ivb44
   338114  338410   5% 354078
aim7/1BRD_48G-xfs-disk_rw-3000-performance/ivb44
60130 ±  5% 4%  62438   5%  62923
aim7/1BRD_48G-xfs-disk_src-3000-performance/ivb44
   403144  397790  410648
aim7/1BRD_48G-xfs-disk_wrt-3000-performance/ivb44
26327   26534   26128
aim7/1BRD_48G-xfs-sync_disk_rw-600-performance/ivb44

The new commit bf4dc6e ("xfs: rewrite and optimize the delalloc write
path") improves the aim7/1BRD_48G-xfs-disk_rw-3000-performance/ivb44
case by 5%. Here are the detailed numbers:

aim7/1BRD_48G-xfs-disk_rw-3000-performance/ivb44

74a242ad94d13436  bf4dc6e4ecc2a3d042029319bc
  --
%stddev %change %stddev
\  |\
   338410   5% 354078aim7.jobs-per-min
   404390   8% 435117
aim7.time.voluntary_context_switches
 2502  -4%   2396aim7.time.maximum_resident_set_size
15018  -9%  13701
aim7.time.involuntary_context_switches
  900 -11%801aim7.time.system_time
17432  11%  19365vmstat.system.cs
47736 ± 19%   -24%  36087
interrupts.CAL:Function_call_interrupts
  2129646  31%2790638proc-vmstat.pgalloc_dma32
   379503  13% 429384numa-meminfo.node0.Dirty
15018  -9%  13701time.involuntary_context_switches
  900 -11%801time.system_time
 1560  10%   1716slabinfo.mnt_cache.active_objs
 1560  10%   1716slabinfo.mnt_cache.num_objs
61.53   -4  57.45 ±  4%  
perf-profile.cycles-pp.intel_idle.cpuidle_enter_state.cpuidle_enter.call_cpuidle.cpu_startup_entry
61.63   -4  57.55 ±  4%  
perf-profile.func.cycles-pp.intel_idle
  1007188 ± 16%   156%2577911 ±  6%  numa-numastat.node0.numa_miss
  9662857 ±  4%   -13%8420159 ±  3%  numa-numastat.node0.numa_foreign
  1008220 ± 16%   155%2570630 ±  6%  numa-numastat.node1.numa_foreign
  9664033 ±  4%   -13%8413184 ±  3%  numa-numastat.node1.numa_miss
 26519887 ±  3%18%   31322674cpuidle.C1-IVT.time
   122238  16% 142383cpuidle.C1-IVT.usage
46548  11%  51645cpuidle.C1E-IVT.usage
 17253419  13%   19567582cpuidle.C3-IVT.time
86847  13%  98333cpuidle.C3-IVT.usage
   482033 ± 12%   108%1000665 ±  8%  numa-vmstat.node0.numa_miss
94689  14% 107744
numa-vmstat.node0.nr_zone_write_pending
94677  14% 107718numa-vmstat.node0.nr_dirty
  3156643 ±  3

Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-15 Thread Fengguang Wu

On Fri, Aug 12, 2016 at 11:03:33AM -0700, Linus Torvalds wrote:

On Thu, Aug 11, 2016 at 8:56 PM, Dave Chinner  wrote:

On Thu, Aug 11, 2016 at 07:27:52PM -0700, Linus Torvalds wrote:


I don't recall having ever seen the mapping tree_lock as a contention
point before, but it's not like I've tried that load either. So it
might be a regression (going back a long way, I suspect), or just an unusual
load that nobody has traditionally tested much.

Single-threaded big file write one page at a time, was it?


Yup. On a 4 node NUMA system.


Ok, I can't see any real contention on my single-node workstation
(running ext4 too, so there may be filesystem differences), but I
guess that shouldn't surprise me. The cacheline bouncing just isn't
expensive enough when it all stays on-die.

I can see the tree_lock in my profiles (just not very high), and at
least for ext4 the main caller ssems to be
__set_page_dirty_nobuffers().

And yes, looking at that, the biggest cost by _far_ inside the
spinlock seems to be the accounting.

Which doesn't even have to be inside the mapping lock, as far as I can
tell, and as far as comments go.

So a stupid patch to just move the dirty page accounting to outside
the spinlock might help a lot.

Does this attached patch help your contention numbers?
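[Editor's note: the attached patch itself is not quoted in this archive. The reordering Linus describes would amount to something along these lines — a hedged sketch using that era's kernel function names, not the actual diff:]

```
/* before: accounting and tree tagging both under mapping->tree_lock,
 * with interrupts disabled the whole time */
spin_lock_irqsave(&mapping->tree_lock, flags);
account_page_dirtied(page, mapping);	/* stat/memcg/writeback counters */
radix_tree_tag_set(&mapping->page_tree,
		   page_index(page), PAGECACHE_TAG_DIRTY);
spin_unlock_irqrestore(&mapping->tree_lock, flags);

/* after: only the radix tree tagging really needs the tree_lock,
 * so the accounting moves out in front of it */
account_page_dirtied(page, mapping);
spin_lock_irqsave(&mapping->tree_lock, flags);
radix_tree_tag_set(&mapping->page_tree,
		   page_index(page), PAGECACHE_TAG_DIRTY);
spin_unlock_irqrestore(&mapping->tree_lock, flags);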


Hi Linus,

The 1BRD tests finished and there are no conclusive changes.

The overall aim7 jobs-per-min slightly decreases and
the overall fsmark files_per_sec slightly increases,
however both changes are small (less than 1%), which is kind of
expected.

NUMA test results should be available tomorrow.

99091700659f4df9  1b5f2eb4a752e1fa7102f37545  testcase/testparams/testbox
  --  ---
%stddev %change %stddev
\  |\
71443   71286GEO-MEAN aim7.jobs-per-min
  972 961
aim7/1BRD_48G-btrfs-creat-clo-4-performance/ivb44
52205   51525
aim7/1BRD_48G-btrfs-disk_cp-1500-performance/ivb44
  2184471 ±  4%-6%2051740 ±  3%  
aim7/1BRD_48G-btrfs-disk_rd-9000-performance/ivb44
47049   46630 ±  3%  
aim7/1BRD_48G-btrfs-disk_rr-1500-performance/ivb44
24932  -4%  23812
aim7/1BRD_48G-btrfs-disk_rw-1500-performance/ivb44
 58845856
aim7/1BRD_48G-btrfs-disk_src-500-performance/ivb44
51430   51286
aim7/1BRD_48G-btrfs-disk_wrt-1500-performance/ivb44
  218 220
aim7/1BRD_48G-btrfs-sync_disk_rw-10-performance/ivb44
22777   23199
aim7/1BRD_48G-ext4-creat-clo-1000-performance/ivb44
   130085  128991
aim7/1BRD_48G-ext4-disk_cp-3000-performance/ivb44
  2434088 ±  3%-8%2232211 ±  4%  
aim7/1BRD_48G-ext4-disk_rd-9000-performance/ivb44
   130351  128977
aim7/1BRD_48G-ext4-disk_rr-3000-performance/ivb44
73280   74044
aim7/1BRD_48G-ext4-disk_rw-3000-performance/ivb44
   277035  -3% 268057
aim7/1BRD_48G-ext4-disk_src-3000-performance/ivb44
   127584   4% 132639
aim7/1BRD_48G-ext4-disk_wrt-3000-performance/ivb44
10571   10659
aim7/1BRD_48G-ext4-sync_disk_rw-600-performance/ivb44
36924 ±  7% 36327
aim7/1BRD_48G-f2fs-creat-clo-1500-performance/ivb44
   117238  119130
aim7/1BRD_48G-f2fs-disk_cp-3000-performance/ivb44
  2340512 ±  5%   2352619 ± 10%  
aim7/1BRD_48G-f2fs-disk_rd-9000-performance/ivb44
   107506 ±  9% 7% 114869
aim7/1BRD_48G-f2fs-disk_rr-3000-performance/ivb44
   105642  106835
aim7/1BRD_48G-f2fs-disk_rw-3000-performance/ivb44
26900 ±  3% 26442 ±  3%  
aim7/1BRD_48G-f2fs-disk_src-3000-performance/ivb44
   117124 ±  3%117678
aim7/1BRD_48G-f2fs-disk_wrt-3000-performance/ivb44
 36893616
aim7/1BRD_48G-f2fs-sync_disk_rw-600-performance/ivb44
70897   72758
aim7/1BRD_48G-xfs-creat-clo-1500-performance/ivb44
   267649 ±  3%270867
aim7/1BRD_48G-xfs-disk_cp-3000-performance/ivb44
   485217 ±  3%489403
aim7/1BRD_48G-xfs-disk_rd-9000-performance/ivb44
   360451  359042
aim7/1BRD_48G-xfs-disk_rr-3000-performance/ivb44
   338114  336838
aim7/1BRD_48G-xfs-disk_rw-3000-performance/ivb44
60130 ±  5% 4%  62663
aim7/1BRD_48G-xfs-disk_src-3000-performance/ivb44
   403144  401476
aim7/1BRD_48G-xfs-disk_wrt-3000-performance/ivb44
26327   26513
aim7/1BRD_48G-xfs-sync_disk_rw-600-performance/ivb

Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-14 Thread Ingo Molnar

* Linus Torvalds  wrote:

> Make sure you actually use "perf record -e cycles:pp" or something
> that uses PEBS to get real profiles using CPU performance counters.

Btw., 'perf record -e cycles:pp' is the default now for modern versions
of perf tooling (on most x86 systems) - if you do 'perf record' it will
just use the most precise profiling mode available on that particular
CPU model.

If unsure you can check the event that was used, via:

  triton:~> perf report --stdio 2>&1 | grep '# Samples'
  # Samples: 27K of event 'cycles:pp'

Thanks,

Ingo


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-14 Thread Dave Chinner
On Sun, Aug 14, 2016 at 07:53:40PM -0700, Linus Torvalds wrote:
> On Sun, Aug 14, 2016 at 7:28 PM, Dave Chinner  wrote:
> >>
> >> Maybe your symbol table came from a old kernel, and functions moved
> >> around enough that the profile attributions ended up bogus.
> >
> > No, I don't think so. I don't install symbol tables on my test VMs,
> > I let /proc/kallsyms do that work for me. From an strace of 'perf
> > top -U -g":
> 
> Ok. But something is definitely wrong with your profile.
> 
> What does it say if you annotate that _raw_spin_unlock_irqrestore() function?


_raw_spin_unlock_irqrestore  /proc/kcore
       │
       │
       │
       │ Disassembly of section load0:
       │
       │ 81e628b0 <_raw_spin_unlock_irqrestore>:
       │   nop
       │   push   %rbp
       │   mov    %rsp,%rbp
       │   movb   $0x0,(%rdi)
       │   nop
       │   mov    %rsi,%rdi
       │   push   %rdi
       │   popfq
 99.35 │   nop
       │   decl   %gs:0x7e1a9bc7(%rip)
  0.65 │ ↓ je     25
       │   pop    %rbp
       │ ← retq
       │ 25:  callq  0x81002000
       │   pop    %rbp
       │ ← retq


> I guarantee you that no, it's not spending 41% of time in
> spin_unlock_irqrestore. That just isn't a valid profile. There's
> something seriously wrong somewhere.
> 
> The fact that you then get the same profile when you turn _off_
> CONFIG_DEBUG_SPINLOCK only proves there is something going on that is
> pure garbage.
> 
> I suspect that what you did was to edit the .config file, remove
> DEBUG_SPINLOCK, and then do "make oldconfig" again.

Yes.

> And it got turned on again,

No. I'm not that stupid - I checked:

$ grep SPINLOCK .config
CONFIG_ARCH_USE_QUEUED_SPINLOCKS=y
CONFIG_QUEUED_SPINLOCKS=y
CONFIG_PARAVIRT_SPINLOCKS=y
# CONFIG_DEBUG_SPINLOCK is not set
$

> because you have one of the lock debugging
> options on that force spinlock debuggin on again:

>  - DEBUG_WW_MUTEX_SLOWPATH
>  - DEBUG_LOCK_ALLOC
>  - PROVE_LOCKING

None of which are set:

$ grep 'DEBUG\|PROVE' .config |grep -v '#'
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_DEBUG_RODATA=y
CONFIG_SLUB_DEBUG=y
CONFIG_HAVE_DMA_API_DEBUG=y
CONFIG_X86_DEBUGCTLMSR=y
CONFIG_PM_DEBUG=y
CONFIG_PM_SLEEP_DEBUG=y
CONFIG_DEBUG_DEVRES=y
CONFIG_PNP_DEBUG_MESSAGES=y
CONFIG_XFS_DEBUG=y
CONFIG_OCFS2_DEBUG_MASKLOG=y
CONFIG_CIFS_DEBUG=y
CONFIG_DEBUG_INFO=y
CONFIG_DEBUG_FS=y
CONFIG_DEBUG_KERNEL=y
CONFIG_HAVE_DEBUG_KMEMLEAK=y
CONFIG_DEBUG_STACK_USAGE=y
CONFIG_HAVE_DEBUG_STACKOVERFLOW=y
CONFIG_SCHED_DEBUG=y
CONFIG_DEBUG_MUTEXES=y
CONFIG_DEBUG_ATOMIC_SLEEP=y
CONFIG_DEBUG_BUGVERBOSE=y
CONFIG_DEBUG_LIST=y
CONFIG_FAULT_INJECTION_DEBUG_FS=y
CONFIG_ARCH_HAS_DEBUG_STRICT_USER_COPY_CHECKS=y
CONFIG_DEBUG_BOOT_PARAMS=y
$

> [ Light goes on ]
> 
> Oh, no, I can see another possibility: you're not doing proper CPU
> profiles, you're doing some timer-irq profile, and the reason you get
> 41% on spin_unlock_irq_restore() is that that is where the interrupts
> are enabled again.
> 
> Timer-interrupt based profiles are not useful either.

I've just been using whatever perf defaults to. Defaults are
supposed to be useful - if they aren't then perf needs to be fixed.
perf top reports this by default:

Samples: 118K of event 'cpu-clock', Event count (approx.): 793748915
Overhead  Shared O  Symbol

  34.48%  [kernel]  [k] _raw_spin_unlock_irqrestore
   7.89%  [kernel]  [k] copy_user_generic_string
   5.08%  [kernel]  [k] _raw_spin_unlock_irq
...

> Make sure you actually use "perf record -e cycles:pp" or something
> that uses PEBS to get real profiles using CPU performance counters.

WTF is PEBS? I'm not a CPU nerd, and I certainly don't expect to
have to learn all the intricacies of hardware performance counters
just to profile the kernel in a correct and sane manner. That's what
the *perf defaults* are supposed to do.


Anyway: `perf top -U -e cycles:pp`:

Samples: 301K of event 'cpu-clock:ppH', Event count (approx.): 69364814
Overhead  Shared O  Symbol

  30.89%  [kernel]  [k] _raw_spin_unlock_irqrestore
   7.04%  [kernel]  [k] _raw_spin_unlock_irq
   4.08%  [kernel]  [k] copy_user_generic_string

Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-14 Thread Linus Torvalds
On Sun, Aug 14, 2016 at 7:28 PM, Dave Chinner  wrote:
>>
>> Maybe your symbol table came from a old kernel, and functions moved
>> around enough that the profile attributions ended up bogus.
>
> No, I don't think so. I don't install symbol tables on my test VMs,
> I let /proc/kallsyms do that work for me. From an strace of 'perf
> top -U -g":

Ok. But something is definitely wrong with your profile.

What does it say if you annotate that _raw_spin_unlock_irqrestore() function?

I guarantee you that no, it's not spending 41% of time in
spin_unlock_irqrestore. That just isn't a valid profile. There's
something seriously wrong somewhere.

The fact that you then get the same profile when you turn _off_
CONFIG_DEBUG_SPINLOCK only proves there is something going on that is
pure garbage.

I suspect that what you did was to edit the .config file, remove
DEBUG_SPINLOCK, and then do "make oldconfig" again.

And it got turned on again, because you have one of the lock debugging
options on that force spinlock debuggin on again:

 - DEBUG_WW_MUTEX_SLOWPATH
 - DEBUG_LOCK_ALLOC
 - PROVE_LOCKING

all of which would make any profiles entirely pointless.

[ Light goes on ]

Oh, no, I can see another possibility: you're not doing proper CPU
profiles, you're doing some timer-irq profile, and the reason you get
41% on spin_unlock_irq_restore() is that that is where the interrupts
are enabled again.

Timer-interrupt based profiles are not useful either.

Make sure you actually use "perf record -e cycles:pp" or something
that uses PEBS to get real profiles using CPU performance counters.

Because right now the profile data is worthless.

Linus


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-14 Thread Dave Chinner
On Sun, Aug 14, 2016 at 06:37:33PM -0700, Linus Torvalds wrote:
> On Sun, Aug 14, 2016 at 5:48 PM, Dave Chinner  wrote:
> >>
> >> Does this attached patch help your contention numbers?
> >
> > No. If anything, it makes it worse. Without the patch, I was
> > measuring 36-37% in _raw_spin_unlock_irqrestore. With the patch, it
> > is 42-43%. Write throughput is the same at ~505MB/s.
> 
> Not helping any I can see, but I don't see how it could hurt...
> 
> Did you perhaps test it together with the other patches that improved
> xfs performance? If other things improve, then I'd expect the
> contention to get worse.
> 
> Not that it matters. Clearly that patch isn't even a stop-gap solution.

Tried it with and without. Same result.

> > There's a couple of interesting things showing up in the profile:
> >
> >   41.64%  [kernel]  [k] _raw_spin_unlock_irqrestore
> 
> Actually, you didn't point this one out, but *this* is the real kicker.
> 
> There's no way a *unlock* should show up that high. It's not spinning.
> It's doing a single store and a pushq/popfq sequence.
> 
> Sure, it's going to take a cross-node cachemiss in the presence of
> contention, but even then it should never be more expensive than the
> locking side - which will *also* do the node changes.
> 
> So there's something really odd in your profile. I don't think that's valid.
> 
> Maybe your symbol table came from a old kernel, and functions moved
> around enough that the profile attributions ended up bogus.

No, I don't think so. I don't install symbol tables on my test VMs,
I let /proc/kallsyms do that work for me. From an strace of 'perf
top -U -g":

18916 open("vmlinux", O_RDONLY) = -1 ENOENT (No such file or directory)
18916 open("/boot/vmlinux", O_RDONLY)   = -1 ENOENT (No such file or directory)
18916 open("/boot/vmlinux-4.8.0-rc1-dgc+", O_RDONLY) = -1 ENOENT (No such file 
or directory)
18916 open("/usr/lib/debug/boot/vmlinux-4.8.0-rc1-dgc+", O_RDONLY) = -1 ENOENT 
(No such file or directory)
18916 open("/lib/modules/4.8.0-rc1-dgc+/build/vmlinux", O_RDONLY) = -1 ENOENT 
(No such file or directory)
18916 open("/usr/lib/debug/lib/modules/4.8.0-rc1-dgc+/vmlinux", O_RDONLY) = -1 
ENOENT (No such file or directory)
18916 open("/usr/lib/debug/boot/vmlinux-4.8.0-rc1-dgc+.debug", O_RDONLY) = -1 
ENOENT (No such file or directory)
18916 open("/root/.debug/.build-id/63/aab665ce90bd81763b90ff2cf103d8e8e823bc", 
O_RDONLY) = -1 ENOENT (No such file or directory)
18916 open("/sys/kernel/notes", O_RDONLY) = 56
18916 read(56, "\4\0\0\0\24\0\0\0\3\0\0\0", 12) = 12
18916 read(56, "GNU\0", 4)  = 4
18916 read(56, "c\252\266e\316\220\275\201v;\220\377,\361\3\330\350\350#\274", 
20) = 20
18916 close(56) = 0
18916 
open("/root/.debug/[kernel.kcore]/63aab665ce90bd81763b90ff2cf103d8e8e823bc", 
O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = -1 ENOENT (No such file or 
directory)
18916 open("/proc/kallsyms", O_RDONLY)  = 56
18916 fstat(56, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
18916 read(56, " A irq_stack_uni"..., 1024) = 1024
18916 read(56, "a\nb8c0 A rsp_scratch"..., 1024) = 1024
18916 read(56, "000c6e0 A cmci_storm_state\n0"..., 1024) = 1024
18916 read(56, "00ccd8 A sd_llc_id\n0"..., 1024) = 1024

You can see that perf is pulling the symbol table from the running
kernel, so I don't think there's a symbol mismatch here at all.

> I suspect it's actually supposed to be _raw_spin_lock_irqrestore()
> which is right next to that function. Although I'd actually expect
> that if it's lock contention, you should see the contention mostly in
> queued_spin_lock_slowpath().
> 
> Unless you have spinlock debugging turned on, in which case your
> contention is all from *that*. That's possible, of course.

$ grep SPINLOCK .config
CONFIG_ARCH_USE_QUEUED_SPINLOCKS=y
CONFIG_QUEUED_SPINLOCKS=y
CONFIG_PARAVIRT_SPINLOCKS=y
CONFIG_DEBUG_SPINLOCK=y
$

So, turn off CONFIG_DEBUG_SPINLOCK, and:

  41.06%  [kernel]  [k] _raw_spin_unlock_irqrestore
   7.68%  [kernel]  [k] copy_user_generic_string
   4.52%  [kernel]  [k] _raw_spin_unlock_irq
   2.78%  [kernel]  [k] _raw_spin_lock
   2.30%  [kernel]  [k] get_page_from_freelist
   2.21%  [kernel]  [k] native_queued_spin_lock_slowpath
   2.16%  [kernel]  [k] __slab_free
   2.12%  [kernel]  [k] __block_commit_write.isra.29
   1.55%  [kernel]  [k] __list_add
   1.49%  [kernel]  [k] shrink_page_list
   1.23%  [kernel]  [k] free_hot_cold_page
   1.14%  [kernel]  [k] __wake_up_bit
   1.01%  [kernel]  [k] try_to_release_page
   1.00%  [kernel]  [k] page_evictable
   0.90%  [kernel]  [k] cancel_dirty_page
   0.80%  [kernel]  [k] unlock_page
   0.80%  [kernel]  [k] up_write
   0.73%  [kernel]  [k] ___might_sleep
   0.68%  [kernel]  [k] clear_page_dirty_for_io
   0.64%  [kernel]  [k] __radix_tree_lookup
   0.61%  [kernel]  [k] __block_write_begin_int
   0.60%  [kernel]  [k] xfs_do_writepage
   0.59%  [kernel]  [k] node_dirty_ok
   0.55%  [kernel]  [k] down_write

Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-14 Thread Linus Torvalds
On Sun, Aug 14, 2016 at 5:48 PM, Dave Chinner  wrote:
>>
>> Does this attached patch help your contention numbers?
>
> No. If anything, it makes it worse. Without the patch, I was
> measuring 36-37% in _raw_spin_unlock_irqrestore. With the patch, it
> is 42-43%. Write throughput is the same at ~505MB/s.

Not helping any I can see, but I don't see how it could hurt...

Did you perhaps test it together with the other patches that improved
xfs performance? If other things improve, then I'd expect the
contention to get worse.

Not that it matters. Clearly that patch isn't even a stop-gap solution.

> There's a couple of interesting things showing up in the profile:
>
>   41.64%  [kernel]  [k] _raw_spin_unlock_irqrestore

Actually, you didn't point this one out, but *this* is the real kicker.

There's no way a *unlock* should show up that high. It's not spinning.
It's doing a single store and a pushq/popfq sequence.

Sure, it's going to take a cross-node cachemiss in the presence of
contention, but even then it should never be more expensive than the
locking side - which will *also* do the node changes.

So there's something really odd in your profile. I don't think that's valid.

Maybe your symbol table came from a old kernel, and functions moved
around enough that the profile attributions ended up bogus.

I suspect it's actually supposed to be _raw_spin_lock_irqrestore()
which is right next to that function. Although I'd actually expect
that if it's lock contention, you should see the contention mostly in
queued_spin_lock_slowpath().

Unless you have spinlock debugging turned on, in which case your
contention is all from *that*. That's possible, of course.

>7.92%  [kernel]  [k] copy_user_generic_string
>5.87%  [kernel]  [k] _raw_spin_unlock_irq
>3.18%  [kernel]  [k] do_raw_spin_lock
>2.51%  [kernel]  [k] cancel_dirty_page <<<
...
> Why are we even calling into cancel_dirty_page() if the page isn't
> dirty? xfs_vm_release_page() won't let dirty pages through to
> try_to_free_buffers(), so all this is just pure overhead for XFS.

See above: there's something screwy with your profile, you should
check that first. Maybe it's not actually cancel_dirty_page() but
something close-by.

(Although I don't see anything closeby normally, so even if the
spin_unlock_irq is bogus, I think *that* part may be incorrect.

Anyway, the reason you'd get cancel_dirty_page() is either due to
truncate, or due to try_to_free_buffers() having dropped the buffers
successfully because the filesystem had already written them out, but
the page is still marked dirty.

> FWIW, this is not under the mapping->tree_lock, but the profile shows
> that reclaiming bufferheads is roughly 20% of all the work kswapd is
> doing.

Well, that may not actually be wrong. That's the most expensive part
of reclaiming memory.

But please double-check your profile, because something is seriously
wrong in it.

Linus


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-14 Thread Dave Chinner
On Fri, Aug 12, 2016 at 11:03:33AM -0700, Linus Torvalds wrote:
> On Thu, Aug 11, 2016 at 8:56 PM, Dave Chinner  wrote:
> > On Thu, Aug 11, 2016 at 07:27:52PM -0700, Linus Torvalds wrote:
> >>
> >> I don't recall having ever seen the mapping tree_lock as a contention
> >> point before, but it's not like I've tried that load either. So it
> >> might be a regression (going back a long way, I suspect), or just an unusual
> >> load that nobody has traditionally tested much.
> >>
> >> Single-threaded big file write one page at a time, was it?
> >
> > Yup. On a 4 node NUMA system.
> 
> Ok, I can't see any real contention on my single-node workstation
> (running ext4 too, so there may be filesystem differences), but I
> guess that shouldn't surprise me. The cacheline bouncing just isn't
> expensive enough when it all stays on-die.
> 
> I can see the tree_lock in my profiles (just not very high), and at
> least for ext4 the main caller seems to be
> __set_page_dirty_nobuffers().
> 
> And yes, looking at that, the biggest cost by _far_ inside the
> spinlock seems to be the accounting.
> 
> Which doesn't even have to be inside the mapping lock, as far as I can
> tell, and as far as comments go.
> 
> So a stupid patch to just move the dirty page accounting to outside
> the spinlock might help a lot.
> 
> Does this attached patch help your contention numbers?

No. If anything, it makes it worse. Without the patch, I was
measuring 36-37% in _raw_spin_unlock_irqrestore. With the patch, it
is 42-43%. Write throughput is the same at ~505MB/s.

There's a couple of interesting things showing up in the profile:

  41.64%  [kernel]  [k] _raw_spin_unlock_irqrestore
   7.92%  [kernel]  [k] copy_user_generic_string
   5.87%  [kernel]  [k] _raw_spin_unlock_irq
   3.18%  [kernel]  [k] do_raw_spin_lock
   2.51%  [kernel]  [k] cancel_dirty_page <<<
   2.35%  [kernel]  [k] get_page_from_freelist
   2.22%  [kernel]  [k] shrink_page_list
   2.04%  [kernel]  [k] __block_commit_write.isra.30
   1.40%  [kernel]  [k] xfs_vm_releasepage
   1.21%  [kernel]  [k] free_hot_cold_page
   1.17%  [kernel]  [k] delay_tsc
   1.12%  [kernel]  [k] __wake_up_bit
   0.92%  [kernel]  [k] __slab_free
   0.91%  [kernel]  [k] clear_page_dirty_for_io
   0.82%  [kernel]  [k] __radix_tree_lookup
   0.76%  [kernel]  [k] node_dirty_ok
   0.68%  [kernel]  [k] page_evictable
   0.63%  [kernel]  [k] do_raw_spin_unlock
   0.62%  [kernel]  [k] up_write
.

Why are we even calling into cancel_dirty_page() if the page isn't
dirty? xfs_vm_release_page() won't let dirty pages through to
try_to_free_buffers(), so all this is just pure overhead for XFS.


FWIW, this is not under the mapping->tree_lock, but the profile shows
that reclaiming bufferheads is roughly 20% of all the work kswapd is
doing. If we take away the mapping->tree_lock contention from the
usage totals, it's actually closer to 50% of the real work kswapd is
doing. The call graph profile is pretty grim:

-   41.89% 0.00%  [kernel]  [k] kswapd
   - kswapd
  - 41.84% shrink_node
 - 41.61% shrink_node_memcg.isra.75
- 41.50% shrink_inactive_list
   - 40.21% shrink_page_list
  - 26.47% __remove_mapping
   26.30% _raw_spin_unlock_irqrestore
  + 9.03% try_to_release_page
 - 8.82% try_to_release_page
- 8.80% xfs_vm_releasepage
 7.55% try_to_free_buffers
  + 1.56% free_hot_cold_page_list
0.73% page_evictable
 0.88% _raw_spin_unlock_irq

I guess now that the iomap code is in and we no longer really depend
on bufferheads in the writeback path, it's time to take the next
step in removing bufferheads from XFS altogether...

> I realize that this has nothing to do with the AIM7 regression (the
> spinlock just isn't high enough in that profile), but your contention
> numbers just aren't right, and updating accounting statistics inside a
> critical spinlock when not needed is just wrong.

Yup, but the above profile shows that the lock contention is mainly
coming from concurrent access in memory reclaim so I don't think the
accounting has anything to do with it.

Perhaps some kind of per-mapping reclaim batching to reduce the
__remove_mapping() locking overhead is in order here. Especially as
this problem will only get worse on larger NUMA machines...
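[Editor's note: the batching idea above can be sketched in a few lines of
userspace C. This is a hypothetical illustration only -- the struct and
function names are invented, not kernel API -- but it shows why grouping
page removals by mapping matters: N lock round-trips (and therefore N
cross-node cacheline bounces on tree_lock) collapse into one per batch.]

```c
#include <assert.h>

/*
 * Hypothetical sketch of per-mapping reclaim batching.  A toy "mapping"
 * just counts how often its tree_lock would be taken; the names are
 * made up for illustration and do not match the kernel's data structures.
 */
struct toy_mapping {
	int lock_acquisitions;	/* stands in for spin_lock_irqsave(&tree_lock) */
	int pages_removed;	/* stands in for __delete_from_page_cache()   */
};

/* Current behaviour: __remove_mapping() takes tree_lock once per page. */
void remove_pages_unbatched(struct toy_mapping *m, int npages)
{
	int i;

	for (i = 0; i < npages; i++) {
		m->lock_acquisitions++;	/* lock   */
		m->pages_removed++;	/* remove */
					/* unlock */
	}
}

/* Batched: pages sharing a mapping are all removed under one lock hold. */
void remove_pages_batched(struct toy_mapping *m, int npages)
{
	int i;

	m->lock_acquisitions++;		/* lock once for the whole batch */
	for (i = 0; i < npages; i++)
		m->pages_removed++;
					/* unlock */
}
```

For a shrink_page_list() pass that isolates, say, 32 pages of the same
file, the unbatched path bounces the tree_lock cacheline 32 times across
nodes while the batched one does it once -- which is where the win on a
4-node machine would come from.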

Cheers,

Dave.

-- 
Dave Chinner
da...@fromorbit.com


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-14 Thread Fengguang Wu

Hi Christoph,

On Sun, Aug 14, 2016 at 06:17:24PM +0200, Christoph Hellwig wrote:

Snipping the long context:

I think there are three observations here:

(1) removing the mark_page_accessed (which is the only significant
change in the parent commit)  hurts the
aim7/1BRD_48G-xfs-disk_rr-3000-performance/ivb44 test.
I'd still rather stick to the filemap version and let the
VM people sort it out.  How do the numbers for this test
look for XFS vs say ext4 and btrfs?


We'll be able to compare between filesystems when the tests for Linus'
patch finish.


(2) lots of additional spinlock contention in the new case.  A quick
check shows that I fat-fingered my rewrite so that we do
the xfs_inode_set_eofblocks_tag call now for the pure lookup
case, and pretty much all new cycles come from that.
(3) Boy, are those xfs_inode_set_eofblocks_tag calls expensive, and
we're already doing way too many even without my little bug above.

So I've force pushed a new version of the iomap-fixes branch with
(2) fixed, and also a little patch to xfs_inode_set_eofblocks_tag a
lot less expensive slotted in before that.  Would be good to see
the numbers with that.


I just queued these jobs. The commented-out ones will be submitted as
the 2nd stage when the 1st-round quick tests finish.

queue=(
   queue
   -q vip
   --repeat-to 3
   fs=xfs
   perf-profile.delay=1
   -b hch-vfs/iomap-fixes
   -k bf4dc6e4ecc2a3d042029319bc8cd4204c185610
   -k 74a242ad94d13436a1644c0b4586700e39871491
   -k 99091700659f4df965e138b38b4fa26a29b7eade
)

"${queue[@]}" -t ivb44 aim7-fs-1brd.yaml
"${queue[@]}" -t ivb44 fsmark-generic-1brd.yaml
"${queue[@]}" -t ivb43 fsmark-stress-journal-1brd.yaml
"${queue[@]}" -t lkp-hsx02 fsmark-generic-brd-raid.yaml
"${queue[@]}" -t lkp-hsw-ep4 fsmark-1ssd-nvme-small.yaml
#"${queue[@]}" -t ivb43 fsmark-stress-journal-1hdd.yaml
#"${queue[@]}" -t ivb44 dd-write-1hdd.yaml fsmark-generic-1hdd.yaml

Thanks,
Fengguang


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-14 Thread Dave Chinner
On Sun, Aug 14, 2016 at 06:17:24PM +0200, Christoph Hellwig wrote:
> Snipping the long context:
> 
> I think there are three observations here:
> 
>  (1) removing the mark_page_accessed (which is the only significant
>  change in the parent commit)  hurts the
>  aim7/1BRD_48G-xfs-disk_rr-3000-performance/ivb44 test.
>  I'd still rather stick to the filemap version and let the
>  VM people sort it out.  How do the numbers for this test
>  look for XFS vs say ext4 and btrfs?
>  (2) lots of additional spinlock contention in the new case.  A quick
>  check shows that I fat-fingered my rewrite so that we do
>  the xfs_inode_set_eofblocks_tag call now for the pure lookup
>  case, and pretty much all new cycles come from that.
>  (3) Boy, are those xfs_inode_set_eofblocks_tag calls expensive, and
>  we're already doing way too many even without my little bug above.
> 
> So I've force pushed a new version of the iomap-fixes branch with
> (2) fixed, and also a little patch to xfs_inode_set_eofblocks_tag a
> lot less expensive slotted in before that.  Would be good to see
> the numbers with that.

With this new set of fixes, the 1byte write test runs ~30% faster on
my test machine (130k writes/s vs 100k writes/s), and the 1k write
on the pmem device runs about 10% faster (660MB/s vs 590MB/s).
dbench numbers on the pmem device also go through the roof (they
didn't show any regression to begin with) - 50% faster at 16 clients
on a 16AG filesystem (5700MB/s vs 3800MB/s).

The 10Mx4k file create fsmark workload I run (on the sparse 500TB
XFS filesystem backed by a pair of SSDs)  is giving the highest
throughput *and* the lowest std dev I've ever recorded
(55014.8+/-1.3e+04 files/s) and that shows in the runtime which also
drops from 3m57s to 3m22s.

So regardless of what aim7 results we get from these changes, I'll
be merging them pending review and further testing...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-14 Thread Christoph Hellwig
Snipping the long context:

I think there are three observations here:

 (1) removing the mark_page_accessed (which is the only significant
 change in the parent commit)  hurts the
 aim7/1BRD_48G-xfs-disk_rr-3000-performance/ivb44 test.
 I'd still rather stick to the filemap version and let the
 VM people sort it out.  How do the numbers for this test
 look for XFS vs say ext4 and btrfs?
 (2) lots of additional spinlock contention in the new case.  A quick
 check shows that I fat-fingered my rewrite so that we do
 the xfs_inode_set_eofblocks_tag call now for the pure lookup
 case, and pretty much all new cycles come from that.
 (3) Boy, are those xfs_inode_set_eofblocks_tag calls expensive, and
 we're already doing way too many even without my little bug above.

So I've force pushed a new version of the iomap-fixes branch with
(2) fixed, and also a little patch to xfs_inode_set_eofblocks_tag a
lot less expensive slotted in before that.  Would be good to see
the numbers with that.


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-14 Thread Fengguang Wu

Hi Christoph,

On Sun, Aug 14, 2016 at 06:51:28AM +0800, Fengguang Wu wrote:

Hi Christoph,

On Sun, Aug 14, 2016 at 12:15:08AM +0200, Christoph Hellwig wrote:

Hi Fengguang,

feel free to try this git tree:

  git://git.infradead.org/users/hch/vfs.git iomap-fixes


I just queued some test jobs for it.

% queue -q vip -t ivb44 -b hch-vfs/iomap-fixes aim7-fs-1brd.yaml fs=xfs -r3 -k 
fe9c2c81ed073878768785a985295cbacc349e42 -k 
ca2edab2e1d8f30dda874b7f717c2d4664991e9b -k 
99091700659f4df965e138b38b4fa26a29b7eade

That job file can be found here:

   
https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/tree/jobs/aim7-fs-1brd.yaml

It specifies a matrix of the below atom tests:

   wfg /c/lkp-tests% split-job jobs/aim7-fs-1brd.yaml -s 'fs: xfs'

   jobs/aim7-fs-1brd.yaml => 
./aim7-fs-1brd-1BRD_48G-xfs-disk_src-3000-performance.yaml
   jobs/aim7-fs-1brd.yaml => 
./aim7-fs-1brd-1BRD_48G-xfs-disk_rr-3000-performance.yaml
   jobs/aim7-fs-1brd.yaml => 
./aim7-fs-1brd-1BRD_48G-xfs-disk_rw-3000-performance.yaml
   jobs/aim7-fs-1brd.yaml => 
./aim7-fs-1brd-1BRD_48G-xfs-disk_cp-3000-performance.yaml
   jobs/aim7-fs-1brd.yaml => 
./aim7-fs-1brd-1BRD_48G-xfs-disk_wrt-3000-performance.yaml
   jobs/aim7-fs-1brd.yaml => 
./aim7-fs-1brd-1BRD_48G-xfs-sync_disk_rw-600-performance.yaml
   jobs/aim7-fs-1brd.yaml => 
./aim7-fs-1brd-1BRD_48G-xfs-creat-clo-1500-performance.yaml
   jobs/aim7-fs-1brd.yaml => 
./aim7-fs-1brd-1BRD_48G-xfs-disk_rd-9000-performance.yaml


I got some results now. The several finished aim7 tests have some
performance regressions for commit fe9c2c81 ("xfs: rewrite and
optimize the delalloc write path") comparing to its parent commit
ca2edab2e and their base mainline commit 990917006 ("Merge tag
'nfs-for-4.8-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs").

wfg@inn ~% compare -g aim7 -ai 99091700659f4df965e138b38b4fa26a29b7eade 
ca2edab2e1d8f30dda874b7f717c2d4664991e9b 
fe9c2c81ed073878768785a985295cbacc349e42
tests: 4
   60  perf-index  fe9c2c81ed073878768785a985295cbacc349e42
   97 power-index  fe9c2c81ed073878768785a985295cbacc349e42

99091700659f4df9  ca2edab2e1d8f30dda874b7f71  fe9c2c81ed073878768785a985  testcase/testparams/testbox
----------------  --------------------------  --------------------------  ---------------------------
         %stddev       %change %stddev            %change %stddev
   270459             272267 ±  3%                 -48% 139834 ±  3%      aim7/1BRD_48G-xfs-disk_cp-3000-performance/ivb44
   473257             468546                         5% 497512            aim7/1BRD_48G-xfs-disk_rd-9000-performance/ivb44
   360578       -18%  296589                       -60% 144974            aim7/1BRD_48G-xfs-disk_rr-3000-performance/ivb44
   358701        -6%  335712                       -40% 216057            GEO-MEAN aim7.jobs-per-min

99091700659f4df9  ca2edab2e1d8f30dda874b7f71  fe9c2c81ed073878768785a985
----------------  --------------------------  --------------------------
    48.48              48.15                    36%   65.85               aim7/1BRD_48G-xfs-disk_cp-3000-performance/ivb44
    89.50              89.76                          88.75               aim7/1BRD_48G-xfs-disk_rd-9000-performance/ivb44
    35.78        23%   43.93                    76%   63.09               aim7/1BRD_48G-xfs-disk_rr-3000-performance/ivb44
    53.75         7%   57.48                    33%   71.71               GEO-MEAN turbostat.%Busy

99091700659f4df9  ca2edab2e1d8f30dda874b7f71  fe9c2c81ed073878768785a985
----------------  --------------------------  --------------------------
     1439               1431                   36%    1964                aim7/1BRD_48G-xfs-disk_cp-3000-performance/ivb44
     2671               2674                          2650                aim7/1BRD_48G-xfs-disk_rd-9000-performance/ivb44
     1057        23%    1303                   78%    1883                aim7/1BRD_48G-xfs-disk_rr-3000-performance/ivb44
     1595         7%    1708                   34%    2139                GEO-MEAN turbostat.Avg_MHz

99091700659f4df9  ca2edab2e1d8f30dda874b7f71  fe9c2c81ed073878768785a985
----------------  --------------------------  --------------------------
      167                167                    6%     177                aim7/1BRD_48G-xfs-disk_cp-3000-performance/ivb44
      175                175                           176                aim7/1BRD_48G-xfs-disk_rd-9000-performance/ivb44
      150         8%     162                   19%     178                aim7/1BRD_48G-xfs-disk_rr-3000-performance/ivb44
      164                168                    8%     177                GEO-MEAN turbostat.PkgWatt

99091700659f4df9  ca2edab2e1d8f30dda874b7f71  fe9c2c81ed073878768785a985
----------------  --------------------------  --------------------------
    10.27              10.43

Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-14 Thread Dave Chinner
On Sat, Aug 13, 2016 at 02:30:54AM +0200, Christoph Hellwig wrote:
> On Fri, Aug 12, 2016 at 08:02:08PM +1000, Dave Chinner wrote:
> > Which says "no change". Oh well, back to the drawing board...
> 
> I don't see how it would change things much - for all relevant calculations
> we convert to block units first anyway.

There was definitely an off-by-one in the code, which meant that for
1-byte writes it never triggered speculative prealloc, so it was
doing the past-EOF real block check for every write. With it also
passing less than a block size, when the > XFS_ISIZE check passed,
3 out of every 4 want_preallocate checks were landing on an already
allocated block, too, so it was doing 3x as many lookups as needed
for 1k writes on a 4k block size filesystem. Amongst other things...
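[Editor's note: the class of bug Dave describes fits in two lines.  The
predicate below is hypothetical -- the real check lives in the
xfs_iomap_write_delay path and is more involved -- but it illustrates how
a strict '>' comparison makes small sequential appends, which land exactly
at EOF, never qualify for speculative preallocation.]

```c
#include <assert.h>

typedef unsigned long long toy_off_t;

/*
 * Hypothetical illustration only -- not the actual XFS predicate.
 * A 1-byte append writes at offset == isize, so a strict '>' never
 * fires for it and speculative prealloc is never triggered.
 */
int want_prealloc_buggy(toy_off_t write_off, toy_off_t isize)
{
	return write_off > isize;	/* off-by-one: misses appends at EOF */
}

int want_prealloc_fixed(toy_off_t write_off, toy_off_t isize)
{
	return write_off >= isize;	/* appends landing at EOF now qualify */
}
```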

> But the whole xfs_iomap_write_delay is a giant mess anyway.  For a usual
> call we do at least four lookups in the extent btree, which seems rather
> costly.  Especially given that the low-level xfs_bmap_search_extents
> interface would give us all required information in one single call.

I noticed, though I was looking for a smaller, targeted fix rather
than rewriting the whole thing. Don't get me wrong, I think it needs
a rewrite to be efficient for the iomap infrastructure, just didn't
want to do that as a regression fix if a 1-liner might be
sufficient...

> Below is a patch I hacked up this morning to do just that.  It passes
> xfstests, but I've not done any real benchmarking with it.  If the
> reduced lookup overhead in it doesn't help enough we'll need some
> sort of look-aside cache for the information, but I hope that we
> can avoid that.  And yes, it's a rather large patch - but the old
> path was so entangled that I couldn't come up with something lighter.

I'll run some tests on it. If it does solve the regression, I'm
going to hold it back until we get a decent amount of review and
test coverage on it, though...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-14 Thread Fengguang Wu

Hi Christoph,

On Sun, Aug 14, 2016 at 12:15:08AM +0200, Christoph Hellwig wrote:

Hi Fengguang,

feel free to try this git tree:

  git://git.infradead.org/users/hch/vfs.git iomap-fixes


I just queued some test jobs for it.

% queue -q vip -t ivb44 -b hch-vfs/iomap-fixes aim7-fs-1brd.yaml fs=xfs -r3 -k 
fe9c2c81ed073878768785a985295cbacc349e42 -k 
ca2edab2e1d8f30dda874b7f717c2d4664991e9b -k 
99091700659f4df965e138b38b4fa26a29b7eade

That job file can be found here:

   
https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/tree/jobs/aim7-fs-1brd.yaml

It specifies a matrix of the below atom tests:

   wfg /c/lkp-tests% split-job jobs/aim7-fs-1brd.yaml -s 'fs: xfs'

   jobs/aim7-fs-1brd.yaml => 
./aim7-fs-1brd-1BRD_48G-xfs-disk_src-3000-performance.yaml
   jobs/aim7-fs-1brd.yaml => 
./aim7-fs-1brd-1BRD_48G-xfs-disk_rr-3000-performance.yaml
   jobs/aim7-fs-1brd.yaml => 
./aim7-fs-1brd-1BRD_48G-xfs-disk_rw-3000-performance.yaml
   jobs/aim7-fs-1brd.yaml => 
./aim7-fs-1brd-1BRD_48G-xfs-disk_cp-3000-performance.yaml
   jobs/aim7-fs-1brd.yaml => 
./aim7-fs-1brd-1BRD_48G-xfs-disk_wrt-3000-performance.yaml
   jobs/aim7-fs-1brd.yaml => 
./aim7-fs-1brd-1BRD_48G-xfs-sync_disk_rw-600-performance.yaml
   jobs/aim7-fs-1brd.yaml => 
./aim7-fs-1brd-1BRD_48G-xfs-creat-clo-1500-performance.yaml
   jobs/aim7-fs-1brd.yaml => 
./aim7-fs-1brd-1BRD_48G-xfs-disk_rd-9000-performance.yaml

If you see other suitable tests for this patch, feel free to drop me a
hint.  I've queued these jobs to the other machines to make them run in
parallel.

% queue -q vip -t ivb43 -b hch-vfs/iomap-fixes fsmark-stress-journal-1hdd.yaml 
fsmark-stress-journal-1brd.yaml fs=xfs -r3 -k 
fe9c2c81ed073878768785a985295cbacc349e42 -k 
ca2edab2e1d8f30dda874b7f717c2d4664991e9b -k 
99091700659f4df965e138b38b4fa26a29b7eade

% queue -q vip -t ivb44 -b hch-vfs/iomap-fixes fsmark-generic-1brd.yaml 
dd-write-1hdd.yaml  fsmark-generic-1hdd.yaml  fs=xfs -r3 -k 
fe9c2c81ed073878768785a985295cbacc349e42 -k 
ca2edab2e1d8f30dda874b7f717c2d4664991e9b -k 
99091700659f4df965e138b38b4fa26a29b7eade

% queue -q vip -t lkp-hsx02 -b hch-vfs/iomap-fixes fsmark-generic-brd-raid.yaml 
fs=xfs -r3 -k fe9c2c81ed073878768785a985295cbacc349e42 -k 
ca2edab2e1d8f30dda874b7f717c2d4664991e9b -k 
99091700659f4df965e138b38b4fa26a29b7eade

% queue -q vip -t lkp-hsw-ep4 -b hch-vfs/iomap-fixes 
fsmark-1ssd-nvme-small.yaml fs=xfs -r3 -k 
fe9c2c81ed073878768785a985295cbacc349e42 -k 
ca2edab2e1d8f30dda874b7f717c2d4664991e9b -k 
99091700659f4df965e138b38b4fa26a29b7eade

Thanks,
Fengguang


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-14 Thread Christoph Hellwig
Hi Fengguang,

feel free to try this git tree:

   git://git.infradead.org/users/hch/vfs.git iomap-fixes



Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-14 Thread Fengguang Wu

Hi Linus,

On Fri, Aug 12, 2016 at 11:03:33AM -0700, Linus Torvalds wrote:

On Thu, Aug 11, 2016 at 8:56 PM, Dave Chinner  wrote:

On Thu, Aug 11, 2016 at 07:27:52PM -0700, Linus Torvalds wrote:


I don't recall having ever seen the mapping tree_lock as a contention
point before, but it's not like I've tried that load either. So it
might be a regression (going back a long way, I suspect), or just an unusual
load that nobody has traditionally tested much.

Single-threaded big file write one page at a time, was it?


Yup. On a 4 node NUMA system.


Ok, I can't see any real contention on my single-node workstation
(running ext4 too, so there may be filesystem differences), but I
guess that shouldn't surprise me. The cacheline bouncing just isn't
expensive enough when it all stays on-die.

I can see the tree_lock in my profiles (just not very high), and at
least for ext4 the main caller seems to be
__set_page_dirty_nobuffers().

And yes, looking at that, the biggest cost by _far_ inside the
spinlock seems to be the accounting.

Which doesn't even have to be inside the mapping lock, as far as I can
tell, and as far as comments go.

So a stupid patch to just move the dirty page accounting to outside
the spinlock might help a lot.

Does this attached patch help your contention numbers?

Adding a few people who get blamed for account_page_dirtied() and
inode_attach_wb() just to make sure that nobody expected the
mapping_lock spinlock to be held when calling account_page_dirtied().

I realize that this has nothing to do with the AIM7 regression (the
spinlock just isn't high enough in that profile), but your contention
numbers just aren't right, and updating accounting statistics inside a
critical spinlock when not needed is just wrong.


I'm testing this patch on top of 9909170065 ("Merge tag 'nfs-for-4.8-2'
of git://git.linux-nfs.org/projects/trondmy/linux-nfs").

The BRD (RAM-backed block device, drivers/block/brd.c) tests enable
pretty fast IO. And the fsmark-generic-brd-raid.yaml on lkp-hsx02 will
simulate 8 RAID disks on a 4-node NUMA machine.

queue -q vip -t ivb44 -b wfg/account_page_dirtied-linus aim7-fs-1brd.yaml  -R3 
-k 1b5f2eb4a752e1fa7102f37545f92e64fabd0cf8 -k 
99091700659f4df965e138b38b4fa26a29b7eade

queue -q vip -t ivb43 -b wfg/account_page_dirtied-linus 
fsmark-stress-journal-1hdd.yaml fsmark-stress-journal-1brd.yaml  -R3 -k 
1b5f2eb4a752e1fa7102f37545f92e64fabd0cf8 -k 
99091700659f4df965e138b38b4fa26a29b7eade

queue -q vip -t ivb44 -b wfg/account_page_dirtied-linus 
fsmark-generic-1brd.yaml dd-write-1hdd.yaml  fsmark-generic-1hdd.yaml   -R3 -k 
1b5f2eb4a752e1fa7102f37545f92e64fabd0cf8 -k 
99091700659f4df965e138b38b4fa26a29b7eade

queue -q vip -t lkp-hsx02 -b wfg/account_page_dirtied-linus 
fsmark-generic-brd-raid.yaml  -R3 -k 1b5f2eb4a752e1fa7102f37545f92e64fabd0cf8 
-k 99091700659f4df965e138b38b4fa26a29b7eade

queue -q vip -t lkp-hsw-ep4 -b wfg/account_page_dirtied-linus 
fsmark-1ssd-nvme-small.yaml  -R3 -k 1b5f2eb4a752e1fa7102f37545f92e64fabd0cf8 -k 
99091700659f4df965e138b38b4fa26a29b7eade

queue -q vip -t lkp-hsw-ep4 -b wfg/account_page_dirtied-linus 
fsmark-1ssd-nvme-small.yaml  -R3 -k 1b5f2eb4a752e1fa7102f37545f92e64fabd0cf8 -k 
99091700659f4df965e138b38b4fa26a29b7eade

Thanks,
Fengguang


fs/buffer.c | 5 -
fs/xfs/xfs_aops.c   | 5 -
mm/page-writeback.c | 2 +-
3 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 9c8eb9b6db6a..f79a9d241589 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -628,15 +628,18 @@ static void __set_page_dirty(struct page *page, struct 
address_space *mapping,
 int warn)
{
unsigned long flags;
+   bool account = false;

spin_lock_irqsave(&mapping->tree_lock, flags);
if (page->mapping) { /* Race with truncate? */
WARN_ON_ONCE(warn && !PageUptodate(page));
-   account_page_dirtied(page, mapping);
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
+   account = true;
}
spin_unlock_irqrestore(&mapping->tree_lock, flags);
+   if (account)
+   account_page_dirtied(page, mapping);
}

/*
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 7575cfc3ad15..59169c36765e 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1490,15 +1490,18 @@ xfs_vm_set_page_dirty(
if (newly_dirty) {
/* sigh - __set_page_dirty() is static, so copy it here, too */
unsigned long flags;
+   bool account = false;

spin_lock_irqsave(&mapping->tree_lock, flags);
if (page->mapping) { /* Race with truncate? */
WARN_ON_ONCE(!PageUptodate(page));
-   account_page_dirtied(page, mapping);
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_

Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-14 Thread Fengguang Wu

Hi Christoph,

On Sat, Aug 13, 2016 at 11:48:25PM +0200, Christoph Hellwig wrote:

On Sat, Aug 13, 2016 at 02:30:54AM +0200, Christoph Hellwig wrote:

Below is a patch I hacked up this morning to do just that.  It passes
xfstests, but I've not done any real benchmarking with it.  If the
reduced lookup overhead in it doesn't help enough we'll need some
sort of look-aside cache for the information, but I hope that we
can avoid that.  And yes, it's a rather large patch - but the old
path was so entangled that I couldn't come up with something lighter.


Hi Fengguang or Xiaolong,

any chance to add this thread to a lkp run?


Sure. To which base should I apply it? Or if you already pushed the
git tree, I'll test your commit directly.

Thanks,
Fengguang


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-14 Thread Christoph Hellwig
On Sat, Aug 13, 2016 at 02:30:54AM +0200, Christoph Hellwig wrote:
> Below is a patch I hacked up this morning to do just that.  It passes
> xfstests, but I've not done any real benchmarking with it.  If the
> reduced lookup overhead in it doesn't help enough we'll need some
> sort of look-aside cache for the information, but I hope that we
> can avoid that.  And yes, it's a rather large patch - but the old
> path was so entangled that I couldn't come up with something lighter.

Hi Fengguang or Xiaolong,

any chance to add this thread to a lkp run?  I've played around with
Dave's simplified xfs_io run, and while the end result for 1k block
size looks pretty similar in terms of execution time and throughput
the profiles look much better.  For 512 byte or 1 byte tests the
tests completes a lot faster too.

Here is the perf report output for a 1k block size run, the first
item directly related to the block mapping shows up is
xfs_file_iomap_begin_delay at .75%.  Although I'm a bit worried
about up_read/down_read showing up so much.  While we take the ilock
and iolock a lot, they should be mostly uncontended for such a
single-threaded write, so the overhead seems a bit worrisome.

(FYI, the tree this was tested on also has the mark_page_accessed
and pagefault_disable fixes applied)

# To display the perf.data header info, please use --header/--header-only 
options.
#
# Samples: 7K of event 'cpu-clock'
# Event count (approx.): 190925
#
# Overhead   Command  Shared Object 
Symbol
#     .  
.
#
37.71%   swapper  [kernel.kallsyms]  [k] native_safe_halt   
  
 9.85%  kworker/u8:5  [kernel.kallsyms]  [k] __copy_user_nocache
  
 2.83%xfs_io  [kernel.kallsyms]  [k] copy_user_generic_string   
  
 2.33%xfs_io  [kernel.kallsyms]  [k] __memset   
  
 2.23%xfs_io  [kernel.kallsyms]  [k] __block_commit_write.isra.34   
  
 1.73%xfs_io  [kernel.kallsyms]  [k] down_write 
  
 1.64%xfs_io  [kernel.kallsyms]  [k] up_write   
  
 1.39%xfs_io  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
  
 1.23%xfs_io  [kernel.kallsyms]  [k] entry_SYSCALL_64_fastpath  
  
 1.18%xfs_io  [kernel.kallsyms]  [k] __mark_inode_dirty 
  
 1.18%xfs_io  [kernel.kallsyms]  [k] _raw_spin_lock 
  
 1.15%  kworker/u8:5  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
  
 1.13%xfs_io  [kernel.kallsyms]  [k] __block_write_begin_int
  
 1.10%xfs_io  [kernel.kallsyms]  [k] mark_buffer_dirty  
  
 1.07%xfs_io  [kernel.kallsyms]  [k] __radix_tree_lookup
  
 1.01%   kworker/0:2  [kernel.kallsyms]  [k] end_buffer_async_write 
  
 0.97%xfs_io  [kernel.kallsyms]  [k] unlock_page
  
 0.92%   kworker/0:2  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
  
 0.89%xfs_io  [kernel.kallsyms]  [k] iov_iter_copy_from_user_atomic 
  
 0.84%xfs_io  [kernel.kallsyms]  [k] generic_write_end  
  
 0.80%xfs_io  [kernel.kallsyms]  [k] get_page_from_freelist 
  
 0.80%xfs_io  [kernel.kallsyms]  [k] xfs_perag_put  
  
 0.79%xfs_io  [kernel.kallsyms]  [k] __add_to_page_cache_locked 
  
 0.75%xfs_io  libc-2.19.so   [.] __libc_pwrite  
  
 0.72%xfs_io  [kernel.kallsyms]  [k] 
xfs_file_iomap_begin_delay.isra.5
 0.71%xfs_io  [kernel.kallsyms]  [k] iomap_write_actor  
  
 0.67%xfs_io  [kernel.kallsyms]  [k] pagecache_get_page 
  
 0.64%xfs_io  [kernel.kallsyms]  [k] 
balance_dirty_pages_ratelimited  
 0.64%xfs_io  [kernel.kallsyms]  [k] vfs_write  
  
 0.63%  kworker/u8:5  [kernel.kallsyms]  [k] clear_page_dirty_for_io
  
 0.62%xfs_io  [kernel.kallsyms]  [k] xfs_file_write_iter
  
 0.60%xfs_io  [kernel.kallsyms]  [k] __vfs_write
  
 0.55%xfs_io  [kernel.kallsyms]  [k] page_waitqueue 
  
 0.54%xfs_io  [kernel.kallsyms]  [k] xfs_perag_get  
  
 0.52%xfs_io  [kernel.kallsyms]  [k] __wake_up_bit  
  
 0.52%xfs_io  [kernel.kallsyms]  [k] radix_tree_tag_set 
  
 0.50%  kworker/u8:5  [kernel.kallsyms]  [k] xfs_do_writepage   
  
 0.50%xfs_io  [kernel.kallsyms]  [k] iov_iter_advance   
  
 0.47%xfs_io  [kernel.kallsyms]  [k] kmem_cache_alloc   
  
 0.46%xfs_io  [kernel.kallsyms]  [k] xfs_file_buffered_aio_write
  
 0.46%x

Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-12 Thread Christoph Hellwig
On Fri, Aug 12, 2016 at 08:02:08PM +1000, Dave Chinner wrote:
> Which says "no change". Oh well, back to the drawing board...

I don't see how it would change things much - for all relevant calculations
we convert to block units first anyway.

But the whole xfs_iomap_write_delay is a giant mess anyway.  For a usual
call we do at least four lookups in the extent btree, which seems rather
costly.  Especially given that the low-level xfs_bmap_search_extents
interface would give us all required information in one single call.

Below is a patch I hacked up this morning to do just that.  It passes
xfstests, but I've not done any real benchmarking with it.  If the
reduced lookup overhead in it doesn't help enough we'll need some
sort of look-aside cache for the information, but I hope that we
can avoid that.  And yes, it's a rather large patch - but the old
path was so entangled that I couldn't come up with something lighter.

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index b060bca..614803b 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -1388,7 +1388,7 @@ xfs_bmap_search_multi_extents(
  * Else, *lastxp will be set to the index of the found
  * entry; *gotp will contain the entry.
  */
-STATIC xfs_bmbt_rec_host_t * /* pointer to found extent entry */
+xfs_bmbt_rec_host_t * /* pointer to found extent entry */
 xfs_bmap_search_extents(
xfs_inode_t *ip,/* incore inode pointer */
xfs_fileoff_t   bno,/* block number searched for */
@@ -4074,7 +4074,7 @@ xfs_bmapi_read(
return 0;
 }
 
-STATIC int
+int
 xfs_bmapi_reserve_delalloc(
struct xfs_inode*ip,
xfs_fileoff_t   aoff,
@@ -4170,91 +4170,6 @@ out_unreserve_quota:
return error;
 }
 
-/*
- * Map file blocks to filesystem blocks, adding delayed allocations as needed.
- */
-int
-xfs_bmapi_delay(
-   struct xfs_inode*ip,/* incore inode */
-   xfs_fileoff_t   bno,/* starting file offs. mapped */
-   xfs_filblks_t   len,/* length to map in file */
-   struct xfs_bmbt_irec*mval,  /* output: map values */
-   int *nmap,  /* i/o: mval size/count */
-   int flags)  /* XFS_BMAPI_... */
-{
-   struct xfs_mount*mp = ip->i_mount;
-   struct xfs_ifork*ifp = XFS_IFORK_PTR(ip, XFS_DATA_FORK);
-   struct xfs_bmbt_irecgot;/* current file extent record */
-   struct xfs_bmbt_irecprev;   /* previous file extent record */
-   xfs_fileoff_t   obno;   /* old block number (offset) */
-   xfs_fileoff_t   end;/* end of mapped file region */
-   xfs_extnum_tlastx;  /* last useful extent number */
-   int eof;/* we've hit the end of extents */
-   int n = 0;  /* current extent index */
-   int error = 0;
-
-   ASSERT(*nmap >= 1);
-   ASSERT(*nmap <= XFS_BMAP_MAX_NMAP);
-   ASSERT(!(flags & ~XFS_BMAPI_ENTIRE));
-   ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
-
-   if (unlikely(XFS_TEST_ERROR(
-   (XFS_IFORK_FORMAT(ip, XFS_DATA_FORK) != XFS_DINODE_FMT_EXTENTS &&
-XFS_IFORK_FORMAT(ip, XFS_DATA_FORK) != XFS_DINODE_FMT_BTREE),
-mp, XFS_ERRTAG_BMAPIFORMAT, XFS_RANDOM_BMAPIFORMAT))) {
-   XFS_ERROR_REPORT("xfs_bmapi_delay", XFS_ERRLEVEL_LOW, mp);
-   return -EFSCORRUPTED;
-   }
-
-   if (XFS_FORCED_SHUTDOWN(mp))
-   return -EIO;
-
-   XFS_STATS_INC(mp, xs_blk_mapw);
-
-   if (!(ifp->if_flags & XFS_IFEXTENTS)) {
-   error = xfs_iread_extents(NULL, ip, XFS_DATA_FORK);
-   if (error)
-   return error;
-   }
-
-   xfs_bmap_search_extents(ip, bno, XFS_DATA_FORK, &eof, &lastx, &got, &prev);
-   end = bno + len;
-   obno = bno;
-
-   while (bno < end && n < *nmap) {
-   if (eof || got.br_startoff > bno) {
-   error = xfs_bmapi_reserve_delalloc(ip, bno, len, &got,
-  &prev, &lastx, eof);
-   if (error) {
-   if (n == 0) {
-   *nmap = 0;
-   return error;
-   }
-   break;
-   }
-   }
-
-   /* set up the extent map to return. */
-   xfs_bmapi_trim_map(mval, &got, &bno, len, obno, end, n, flags);
-   xfs_bmapi_update_map(&mval, &bno, &len, obno, end, &n, flags);
-
-   /* If we're done, stop now. */
-   if (bno >= end || n >= *nmap)
-   break;
-
-   /* Else go on to the next record. */
-   prev = got;
-   if (++

Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-12 Thread Linus Torvalds
On Thu, Aug 11, 2016 at 8:56 PM, Dave Chinner  wrote:
> On Thu, Aug 11, 2016 at 07:27:52PM -0700, Linus Torvalds wrote:
>>
>> I don't recall having ever seen the mapping tree_lock as a contention
>> point before, but it's not like I've tried that load either. So it
>> might be a regression (going back long, I suspect), or just an unusual
>> load that nobody has traditionally tested much.
>>
>> Single-threaded big file write one page at a time, was it?
>
> Yup. On a 4 node NUMA system.

Ok, I can't see any real contention on my single-node workstation
(running ext4 too, so there may be filesystem differences), but I
guess that shouldn't surprise me. The cacheline bouncing just isn't
expensive enough when it all stays on-die.

I can see the tree_lock in my profiles (just not very high), and at
least for ext4 the main caller seems to be
__set_page_dirty_nobuffers().

And yes, looking at that, the biggest cost by _far_ inside the
spinlock seems to be the accounting.

Which doesn't even have to be inside the mapping lock, as far as I can
tell, and as far as comments go.

So a stupid patch to just move the dirty page accounting to outside
the spinlock might help a lot.

Does this attached patch help your contention numbers?

Adding a few people who get blamed for account_page_dirtied() and
inode_attach_wb() just to make sure that nobody expected the
mapping_lock spinlock to be held when calling account_page_dirtied().

I realize that this has nothing to do with the AIM7 regression (the
spinlock just isn't high enough in that profile), but your contention
numbers just aren't right, and updating accounting statistics inside a
critical spinlock when not needed is just wrong.

Linus
 fs/buffer.c | 5 -
 fs/xfs/xfs_aops.c   | 5 -
 mm/page-writeback.c | 2 +-
 3 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 9c8eb9b6db6a..f79a9d241589 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -628,15 +628,18 @@ static void __set_page_dirty(struct page *page, struct address_space *mapping,
 int warn)
 {
unsigned long flags;
+   bool account = false;
 
spin_lock_irqsave(&mapping->tree_lock, flags);
if (page->mapping) {/* Race with truncate? */
WARN_ON_ONCE(warn && !PageUptodate(page));
-   account_page_dirtied(page, mapping);
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
+   account = true;
}
spin_unlock_irqrestore(&mapping->tree_lock, flags);
+   if (account)
+   account_page_dirtied(page, mapping);
 }
 
 /*
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 7575cfc3ad15..59169c36765e 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1490,15 +1490,18 @@ xfs_vm_set_page_dirty(
if (newly_dirty) {
/* sigh - __set_page_dirty() is static, so copy it here, too */
unsigned long flags;
+   bool account = false;
 
spin_lock_irqsave(&mapping->tree_lock, flags);
if (page->mapping) {/* Race with truncate? */
WARN_ON_ONCE(!PageUptodate(page));
-   account_page_dirtied(page, mapping);
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
+   account = true;
}
spin_unlock_irqrestore(&mapping->tree_lock, flags);
+   if (account)
+   account_page_dirtied(page, mapping);
}
unlock_page_memcg(page);
if (newly_dirty)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index f4cd7d8005c9..9a6a6b99acfe 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2517,10 +2517,10 @@ int __set_page_dirty_nobuffers(struct page *page)
spin_lock_irqsave(&mapping->tree_lock, flags);
BUG_ON(page_mapping(page) != mapping);
WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
-   account_page_dirtied(page, mapping);
radix_tree_tag_set(&mapping->page_tree, page_index(page),
   PAGECACHE_TAG_DIRTY);
spin_unlock_irqrestore(&mapping->tree_lock, flags);
+   account_page_dirtied(page, mapping);
unlock_page_memcg(page);
 
if (mapping->host) {


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-12 Thread Dave Chinner
On Fri, Aug 12, 2016 at 04:51:24PM +0800, Ye Xiaolong wrote:
> On 08/12, Ye Xiaolong wrote:
> >On 08/12, Dave Chinner wrote:
> 
> [snip]
> 
> >>lkp-folk: the patch I've just tested it attached below - can you
> >>feed that through your test and see if it fixes the regression?
> >>
> >
> >Hi, Dave
> >
> >I am verifying your fix patch in lkp environment now, will send the
> >result once I get it.
> >
> 
> Here is the test result.

Which says "no change". Oh well, back to the drawing board...

Can you send me the aim7 config file and command line you are using
for the test?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-12 Thread Ye Xiaolong
On 08/12, Ye Xiaolong wrote:
>On 08/12, Dave Chinner wrote:

[snip]

>>lkp-folk: the patch I've just tested it attached below - can you
>>feed that through your test and see if it fixes the regression?
>>
>
>Hi, Dave
>
>I am verifying your fix patch in lkp environment now, will send the
>result once I get it.
>

Here is the test result.

commit 636b594f38278080db93f2d67d11d31700924f5d
Author: Dave Chinner 
AuthorDate: Fri Aug 12 14:23:44 2016 +0800
Commit: Xiaolong Ye 
CommitDate: Fri Aug 12 14:23:44 2016 +0800

When a write occurs that extends the file, we check to see if we
need to preallocate more delalloc space.  When we do sub-page
writes, the new iomap write path passes a sub-block write length to
the block mapping code.  xfs_iomap_write_delay does not expect to be
passed byte counts smaller than one filesystem block, so it ends up
checking the BMBT for blocks beyond EOF on every write,
regardless of whether we need to or not. This causes a regression in
aim7 benchmarks as it is full of sub-page writes.

To fix this, clamp the minimum length of a mapping request coming
through xfs_file_iomap_begin() to one filesystem block. This ensures
we are passing the same length to xfs_iomap_write_delay() as we did
when calling through the get_blocks path. This substantially reduces
the amount of lookup load being placed on the BMBT during sub-block
write loads.

Signed-off-by: Dave Chinner 
---
 fs/xfs/xfs_iomap.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 620fc91..5eaace0 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1015,10 +1015,15 @@ xfs_file_iomap_begin(
 * number pulled out of thin air as a best guess for initial
 * testing.
 *
+* xfs_iomap_write_delay() only works if the length passed in is
+* >= one filesystem block. Hence we need to clamp the minimum
+* length we map, too.
+*
 * Note that the values needs to be less than 32-bits wide until
 * the lower level functions are updated.
 */
length = min_t(loff_t, length, 1024 * PAGE_SIZE);
+   length = max_t(loff_t, length, (1 << inode->i_blkbits));
if (xfs_get_extsz_hint(ip)) {
/*
 * xfs_iomap_write_direct() expects the shared lock. It

f0c6bcba74ac51cb  68a9f5e7007c1afa2cf6830b69  636b594f38278080db93f2d67d  
  --  --  
 %stddev %change %stddev %change %stddev
 \  |\  |\  
 484435 ±  0% -13.3% 420004 ±  0% -14.0% 416777 ±  0%  aim7.jobs-per-min
   6491 ±  3% +30.8%   8491 ±  0% +35.7%   8806 ±  1%  aim7.time.involuntary_context_switches
    376 ±  0% +28.4%    484 ±  0% +29.6%    488 ±  0%  aim7.time.system_time
 430512 ±  0% -20.1% 343838 ±  0% -19.7% 345708 ±  0%  aim7.time.voluntary_context_switches
  37.37 ±  0% +15.3%  43.09 ±  0% +16.1%  43.41 ±  0%  aim7.time.elapsed_time
  37.37 ±  0% +15.3%  43.09 ±  0% +16.1%  43.41 ±  0%  aim7.time.elapsed_time.max
 155184 ±  1%  -2.1% 151864 ±  1%  -2.7% 150937 ±  1%  aim7.time.minor_page_faults
      0 ±  0%  +Inf% 215412 ±141%  +Inf% 334416 ± 75%  latency_stats.sum.wait_on_page_bit.__migration_entry_wait.migration_entry_wait.handle_pte_fault.handle_mm_fault.__do_page_fault.do_page_fault.page_fault
  24772 ±  0% -28.6%  17675 ±  0% -26.7%  18149 ±  2%  vmstat.system.cs
  26816 ±  8% +10.2%  29542 ±  1% +13.3%  30370 ±  1%  interrupts.CAL:Function_call_interrupts
 125122 ± 10% -10.7% 111758 ± 12% -11.1% 111223 ± 11%  softirqs.SCHED
   3906 ±  0% +28.8%   5032 ±  2% +29.1%   5045 ±  1%  proc-vmstat.nr_active_file
   3444 ±  5% +41.8%   4884 ±  0% +25.0%   4304 ± 11%  proc-vmstat.nr_shmem
   4092 ± 14% +61.2%   6595 ±  1% +40.0%   5728 ± 15%  proc-vmstat.pgactivate
  15627 ±  0% +27.7%  19956 ±  1% +27.4%  19902 ±  0%  meminfo.Active(file)
  16103 ±  3% +14.3%  18405 ±  8% +11.2%  17900 ±  1%  meminfo.AnonHugePages
  13777 ±  5% +43.1%  19709 ±  0% +25.0%  17220 ± 11%  meminfo.Shmem
1724300 ± 27% -40.5% 1025538 ±  1% -41.3% 1012868 ±  0%  sched_debug.cfs_rq:/.load.max
1724300 ± 27% -40.5% 1025538 ±  1% -41.3% 1012868 ±  0%  sched_debug.cpu.load.max
  37.37 ±  0% +15.3%  43.09 ±  0% +16.1%  43.41 ±  0%  time.elapsed_time
 37.37 ±  0% +15.3%  43.09 ±  0% +16.1%

Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-11 Thread Ye Xiaolong
On 08/12, Dave Chinner wrote:
>On Thu, Aug 11, 2016 at 10:02:39PM -0700, Linus Torvalds wrote:
>> On Thu, Aug 11, 2016 at 9:16 PM, Dave Chinner  wrote:
>> >
>> > That's why running aim7 as your "does the filesystem scale"
>> > benchmark is somewhat irrelevant to scaling applications on high
>> > performance systems these days
>> 
>> Yes, don't get me wrong - I'm not at all trying to say that AIM7 is a
>> good benchmark. It's just that I think what it happens to test is
>> still meaningful, even if it's not necessarily in any way some kind of
>> "high performance IO" thing.
>> 
>> There are probably lots of other more important loads, I just reacted
>> to Christoph seeming to argue that the AIM7 behavior was _so_ broken
>> that we shouldn't even care. It's not _that_ broken, it's just not
>> about high-performance IO streaming, it happens to test something else
>> entirely.
>
>Right - I admit that my first reaction once I worked out what the
>problem was is exactly what Christoph said. But after looking at it
>further, regardless of how crappy the benchmark is, it is a
>regression
>
>> We've actually had AIM7 occasionally find other issues just because
>> some of the things it does is so odd.
>
>*nod*
>
>> And let's face it, user programs doing odd and not very efficient
>> things should be considered par for the course. We're never going to
>> get rid of insane user programs, so we might as well fix the
>> performance problems even when we say "that's just stupid".
>
>Yup, that's what I'm doing :/
>
>It looks like the underlying cause is that the old block mapping
>code only fed filesystem block size lengths into
>xfs_iomap_write_delay(), whereas the iomap code is feeding the
>(capped) write() length into it. Hence xfs_iomap_write_delay() is
>not detecting the need for speculative preallocation correctly on
>these sub-block writes. The profile looks better for the 1 byte
>write - I've combined the old and new for comparison below:
>
>   4.22%   __block_commit_write.isra.30
>   3.80%   up_write
>   3.74%   xfs_bmapi_read
>   3.65%   ___might_sleep
>   3.55%   down_write
>   3.20%   entry_SYSCALL_64_fastpath
>   3.02%   mark_buffer_dirty
>   2.78%   __mark_inode_dirty
>   2.78%   unlock_page
>   2.59%   xfs_break_layouts
>   2.47%   xfs_iext_bno_to_ext
>   2.38%   __block_write_begin_int
>   2.22%   find_get_entry
>   2.17%   xfs_file_write_iter
>   2.16%   __radix_tree_lookup
>   2.13%   iomap_write_actor
>   2.04%   xfs_bmap_search_extents
>   1.98%   __might_sleep
>   1.84%   xfs_file_buffered_aio_write
>   1.76%   iomap_apply
>   1.71%   generic_write_end
>   1.68%   vfs_write
>   1.66%   iov_iter_copy_from_user_atomic
>   1.56%   xfs_bmap_search_multi_extents
>   1.55%   __vfs_write
>   1.52%   pagecache_get_page
>   1.46%   xfs_bmapi_update_map
>   1.33%   xfs_iunlock
>   1.32%   xfs_iomap_write_delay
>   1.29%   xfs_file_iomap_begin
>   1.29%   do_raw_spin_lock
>   1.29%   __xfs_bmbt_get_all
>   1.21%   iov_iter_advance
>   1.20%   xfs_file_aio_write_checks
>   1.14%   xfs_ilock
>   1.11%   balance_dirty_pages_ratelimited
>   1.10%   xfs_bmapi_trim_map
>   1.06%   xfs_iomap_eof_want_preallocate
>   1.00%   xfs_bmapi_delay
>
>Comparison of common functions:
>
>OldNew function
>4.50%  3.74%   xfs_bmapi_read
>3.64%  4.22%   __block_commit_write.isra.30
>3.55%  2.16%   __radix_tree_lookup
>3.46%  3.80%   up_write
>3.43%  3.65%   ___might_sleep
>3.09%  3.20%   entry_SYSCALL_64_fastpath
>3.01%  2.47%   xfs_iext_bno_to_ext
>3.01%  2.22%   find_get_entry
>2.98%  3.55%   down_write
>2.71%  3.02%   mark_buffer_dirty
>2.52%  2.78%   __mark_inode_dirty
>2.38%  2.78%   unlock_page
>2.14%  2.59%   xfs_break_layouts
>2.07%  1.46%   xfs_bmapi_update_map
>2.06%  2.04%   xfs_bmap_search_extents
>2.04%  1.32%   xfs_iomap_write_delay
>2.00%  0.38%   generic_write_checks
>1.96%  1.56%   xfs_bmap_search_multi_extents
>1.90%  1.29%   __xfs_bmbt_get_all
>1.89%  1.11%   balance_dirty_pages_ratelimited
>1.82%  0.28%   wait_for_stable_page
>1.76%  2.17%   xfs_file_write_iter
>1.68%  1.06%   xfs_iomap_eof_want_preallocate
>1.68%  1.00%   xfs_bmapi_delay
>1.67%  2.13%   iomap_write_actor
>1.60%  1.84%   xfs_file_buffered_aio_write
>1.56%  1.98%   __might_sleep
>1.48%  1.29%   do_raw_spin_lock
>1.44%  1.71%   generic_write_e

Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-11 Thread Dave Chinner
On Thu, Aug 11, 2016 at 10:02:39PM -0700, Linus Torvalds wrote:
> On Thu, Aug 11, 2016 at 9:16 PM, Dave Chinner  wrote:
> >
> > That's why running aim7 as your "does the filesystem scale"
> > benchmark is somewhat irrelevant to scaling applications on high
> > performance systems these days
> 
> Yes, don't get me wrong - I'm not at all trying to say that AIM7 is a
> good benchmark. It's just that I think what it happens to test is
> still meaningful, even if it's not necessarily in any way some kind of
> "high performance IO" thing.
> 
> There are probably lots of other more important loads, I just reacted
> to Christoph seeming to argue that the AIM7 behavior was _so_ broken
> that we shouldn't even care. It's not _that_ broken, it's just not
> about high-performance IO streaming, it happens to test something else
> entirely.

Right - I admit that my first reaction once I worked out what the
problem was is exactly what Christoph said. But after looking at it
further, regardless of how crappy the benchmark is, it is a
regression

> We've actually had AIM7 occasionally find other issues just because
> some of the things it does is so odd.

*nod*

> And let's face it, user programs doing odd and not very efficient
> things should be considered par for the course. We're never going to
> get rid of insane user programs, so we might as well fix the
> performance problems even when we say "that's just stupid".

Yup, that's what I'm doing :/

It looks like the underlying cause is that the old block mapping
code only fed filesystem block size lengths into
xfs_iomap_write_delay(), whereas the iomap code is feeding the
(capped) write() length into it. Hence xfs_iomap_write_delay() is
not detecting the need for speculative preallocation correctly on
these sub-block writes. The profile looks better for the 1 byte
write - I've combined the old and new for comparison below:

4.22%   __block_commit_write.isra.30
3.80%   up_write
3.74%   xfs_bmapi_read
3.65%   ___might_sleep
3.55%   down_write
3.20%   entry_SYSCALL_64_fastpath
3.02%   mark_buffer_dirty
2.78%   __mark_inode_dirty
2.78%   unlock_page
2.59%   xfs_break_layouts
2.47%   xfs_iext_bno_to_ext
2.38%   __block_write_begin_int
2.22%   find_get_entry
2.17%   xfs_file_write_iter
2.16%   __radix_tree_lookup
2.13%   iomap_write_actor
2.04%   xfs_bmap_search_extents
1.98%   __might_sleep
1.84%   xfs_file_buffered_aio_write
1.76%   iomap_apply
1.71%   generic_write_end
1.68%   vfs_write
1.66%   iov_iter_copy_from_user_atomic
1.56%   xfs_bmap_search_multi_extents
1.55%   __vfs_write
1.52%   pagecache_get_page
1.46%   xfs_bmapi_update_map
1.33%   xfs_iunlock
1.32%   xfs_iomap_write_delay
1.29%   xfs_file_iomap_begin
1.29%   do_raw_spin_lock
1.29%   __xfs_bmbt_get_all
1.21%   iov_iter_advance
1.20%   xfs_file_aio_write_checks
1.14%   xfs_ilock
1.11%   balance_dirty_pages_ratelimited
1.10%   xfs_bmapi_trim_map
1.06%   xfs_iomap_eof_want_preallocate
1.00%   xfs_bmapi_delay

Comparison of common functions:

Old New function
4.50%   3.74%   xfs_bmapi_read
3.64%   4.22%   __block_commit_write.isra.30
3.55%   2.16%   __radix_tree_lookup
3.46%   3.80%   up_write
3.43%   3.65%   ___might_sleep
3.09%   3.20%   entry_SYSCALL_64_fastpath
3.01%   2.47%   xfs_iext_bno_to_ext
3.01%   2.22%   find_get_entry
2.98%   3.55%   down_write
2.71%   3.02%   mark_buffer_dirty
2.52%   2.78%   __mark_inode_dirty
2.38%   2.78%   unlock_page
2.14%   2.59%   xfs_break_layouts
2.07%   1.46%   xfs_bmapi_update_map
2.06%   2.04%   xfs_bmap_search_extents
2.04%   1.32%   xfs_iomap_write_delay
2.00%   0.38%   generic_write_checks
1.96%   1.56%   xfs_bmap_search_multi_extents
1.90%   1.29%   __xfs_bmbt_get_all
1.89%   1.11%   balance_dirty_pages_ratelimited
1.82%   0.28%   wait_for_stable_page
1.76%   2.17%   xfs_file_write_iter
1.68%   1.06%   xfs_iomap_eof_want_preallocate
1.68%   1.00%   xfs_bmapi_delay
1.67%   2.13%   iomap_write_actor
1.60%   1.84%   xfs_file_buffered_aio_write
1.56%   1.98%   __might_sleep
1.48%   1.29%   do_raw_spin_lock
1.44%   1.71%   generic_write_end
1.41%   1.52%   pagecache_get_page
1.38%   1.10%   xfs_bm

Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-11 Thread Linus Torvalds
On Thu, Aug 11, 2016 at 9:16 PM, Dave Chinner  wrote:
>
> That's why running aim7 as your "does the filesystem scale"
> benchmark is somewhat irrelevant to scaling applications on high
> performance systems these days

Yes, don't get me wrong - I'm not at all trying to say that AIM7 is a
good benchmark. It's just that I think what it happens to test is
still meaningful, even if it's not necessarily in any way some kind of
"high performance IO" thing.

There are probably lots of other more important loads, I just reacted
to Christoph seeming to argue that the AIM7 behavior was _so_ broken
that we shouldn't even care. It's not _that_ broken, it's just not
about high-performance IO streaming, it happens to test something else
entirely.

We've actually had AIM7 occasionally find other issues just because
some of the things it does is so odd.

Iirc it has a fork test that doesn't execve (very unusual - you'd
generally use threads if you care about performance), and that has
shown issues with our anon_vma scaling before anything else did.

I also seem to remember some odd pty open/close/ioctl subtest that
showed problems with some of the last remnants of the old BKL (the
test probably actually tested something else, but ended up choking on
the odd tty things).

So in general, I'm not a fan of AIM as a benchmark, but it actually
_has_ found lots of real issues because it tends to do things that
kernel developers think are insane.

And let's face it, user programs doing odd and not very efficient
things should be considered par for the course. We're never going to
get rid of insane user programs, so we might as well fix the
performance problems even when we say "that's just stupid".

Linus


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-11 Thread Dave Chinner
On Thu, Aug 11, 2016 at 08:20:53PM -0700, Linus Torvalds wrote:
> On Thu, Aug 11, 2016 at 7:52 PM, Christoph Hellwig  wrote:
> >
> > I can look at that, but indeed optimizing this patch seems a bit
> > stupid.
> 
> The "write less than a full block to the end of the file" is actually
> a reasonably common case.
> 
> It may not make for a great filesystem benchmark, but it also isn't
> actually insane. People who do logging in user space do this all the
> time, for example. And it is *not* stupid in that context. Not at all.
> 
> It's never going to be the *main* thing you do (unless you're AIM),
> but I do think it's worth fixing.
> 
> And AIM7 remains one of those odd benchmarks that people use. I'm not
> quite sure why, but I really do think that the normal "append smaller
> chunks to the end of the file" should absolutely not be dismissed as
> stupid.

Yes, I agree that there are reasons for making sub-block IO work
well (which is why I'm looking to try to fix it), but that doesn't
mean the benchmark is sane. aim7 is, technically, a "scalability
benchmark". As such, expecting tiny writes to scale to moving large
amounts of data is the "stupid" thing it does. If you scale up the
amount of data you need to move, then you need to scale up the
efficiency of moving that data. Case in point - writing 1GB of data
in 1kb chunks to XFS on a local /dev/pmem1 runs at ~600MB/s, whilst
moving it in 1MB chunks runs at 1.9GB/s. aim7 doesn't actually
stress the scalability of the hardware, because inefficiencies in
its implementation prevent it from getting to those limits.

That's what aim7 misses - as speeds and capabilities go up, the way
code needs to be written to make efficient use of the hardware also
changes. e.g. High throughput logging solutions don't write every
incoming log event immediately - they aggregate them into larger
buffers and then write those, knowing that they can support much
higher logging rates by doing this

That's why running aim7 as your "does the filesystem scale"
benchmark is somewhat irrelevant to scaling applications on high
performance systems these days - users with fast storage will be
expecting to see that 1.9GB/s throughput from their app, not
600MB/s

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-11 Thread Dave Chinner
On Thu, Aug 11, 2016 at 07:27:52PM -0700, Linus Torvalds wrote:
> On Thu, Aug 11, 2016 at 5:54 PM, Dave Chinner  wrote:
> >
> > So, removing mark_page_accessed() made the spinlock contention
> > *worse*.
> >
> >   36.51%  [kernel]  [k] _raw_spin_unlock_irqrestore
> >6.27%  [kernel]  [k] copy_user_generic_string
> >3.73%  [kernel]  [k] _raw_spin_unlock_irq
> >3.55%  [kernel]  [k] get_page_from_freelist
> >1.97%  [kernel]  [k] do_raw_spin_lock
> >1.72%  [kernel]  [k] __block_commit_write.isra.30
> 
> I don't recall having ever seen the mapping tree_lock as a contention
> point before, but it's not like I've tried that load either. So it
> might be a regression (going back long, I suspect), or just an unusual
> load that nobody has traditionally tested much.
> 
> Single-threaded big file write one page at a time, was it?

Yup. On a 4 node NUMA system.

So when memory reclaim kicks in, there's a write process, a
writeback kworker and 4 kswapd kthreads all banging on the
mapping->tree_lock. There's an awful lot of concurrency happening
behind the scenes of that single user process writing to a file...

> The mapping tree lock has been around forever (it used to be a rw-lock
> long long ago), but I wonder if we might have moved more stuff into it
> (memory accounting comes to mind) causing much worse contention or
> something.

Yeah, there is now a crapton of accounting updated in
account_page_dirtied under the tree lock - memcg, writeback, node,
zone, task, etc. And there's a *lot* of code that
__delete_from_page_cache() can execute under the tree lock.

> Hmm. Just for fun, I googled "tree_lock contention". It's shown up
> before - back in 2006, and it was you hitting it back then too.

Of course! That, however, would have been when I was playing with
real big SGI machines, not a tiddly little 16p VM :P

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-11 Thread Linus Torvalds
On Thu, Aug 11, 2016 at 7:52 PM, Christoph Hellwig  wrote:
>
> I can look at that, but indeed optimizing this patch seems a bit
> stupid.

The "write less than a full block to the end of the file" is actually
a reasonably common case.

It may not make for a great filesystem benchmark, but it also isn't
actually insane. People who do logging in user space do this all the
time, for example. And it is *not* stupid in that context. Not at all.

It's never going to be the *main* thing you do (unless you're AIM),
but I do think it's worth fixing.

And AIM7 remains one of those odd benchmarks that people use. I'm not
quite sure why, but I really do think that the normal "append smaller
chunks to the end of the file" should absolutely not be dismissed as
stupid.

Linus


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-11 Thread Christoph Hellwig
On Fri, Aug 12, 2016 at 12:23:29PM +1000, Dave Chinner wrote:
> Christoph, maybe there's something we can do to only trigger
> speculative prealloc growth checks if the new file size crosses the end of
> the currently allocated block at the EOF. That would chop out a fair
> chunk of the xfs_bmapi_read calls being done in this workload. I'm
> not sure how much effort we should spend optimising this slow path,
> though

I can look at that, but indeed optimizing this patch seems a bit
stupid.  The other thing we could do is to optimize xfs_bmapi_read - even
if it shouldn't be called this often it seems like it should waste a whole
lot less CPU cycles.


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-11 Thread Linus Torvalds
On Thu, Aug 11, 2016 at 7:23 PM, Dave Chinner  wrote:
>
> And, as usual, that's the answer. Here's the reproducer:
>
> # sudo mkfs.xfs -f -m crc=0 /dev/pmem1
> # sudo mount -o noatime /dev/pmem1 /mnt/scratch
> # sudo xfs_io -f -c "pwrite 0 512m -b 1" /mnt/scratch/fooey

Heh. Ok, so 1 byte or 1kB at a time is pretty much the same thing, yeah.

And I guess that also explains why the system call entry showed up so
high in the profiles.

I'l take another look at tree_lock tomorrow, but it sounds like this
particular AIM regression is now effectively a solved (or at least
known) issue. Thanks,

  Linus


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-11 Thread Linus Torvalds
On Thu, Aug 11, 2016 at 5:54 PM, Dave Chinner  wrote:
>
> So, removing mark_page_accessed() made the spinlock contention
> *worse*.
>
>   36.51%  [kernel]  [k] _raw_spin_unlock_irqrestore
>6.27%  [kernel]  [k] copy_user_generic_string
>3.73%  [kernel]  [k] _raw_spin_unlock_irq
>3.55%  [kernel]  [k] get_page_from_freelist
>1.97%  [kernel]  [k] do_raw_spin_lock
>1.72%  [kernel]  [k] __block_commit_write.isra.30

I don't recall having ever seen the mapping tree_lock as a contention
point before, but it's not like I've tried that load either. So it
might be a regression (going back long, I suspect), or just an unusual
load that nobody has traditionally tested much.

Single-threaded big file write one page at a time, was it?

The mapping tree lock has been around forever (it used to be a rw-lock
long long ago), but I wonder if we might have moved more stuff into it
(memory accounting comes to mind) causing much worse contention or
something.

Hmm. Just for fun, I googled "tree_lock contention". It's shown up
before - back in 2006, and it was you hitting it back then too.

There was an even older one (related to AIM7, interesting) which was
what caused the tree_lock to become a rw-lock back in 2005 (but then
Nick Piggin made it a spinlock again in 2008).

So it's not unheard of, but it certainly hasn't been a big issue.

That's the only obvious ones I found (apart from some btrfs issues,
but btrfs has a completely different notion of tree locking, so those
are not about the same thing).

 Linus


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-11 Thread Dave Chinner
On Fri, Aug 12, 2016 at 10:54:42AM +1000, Dave Chinner wrote:
> I'm now going to test Christoph's theory that this is an "overwrite
> doing lots of block mapping" issue. More on that to follow.

Ok, so going back to the profiles, I can say it's not an overwrite
issue, because there is delayed allocation showing up in the
profile. Lots of it. Which led me to think "maybe the benchmark is
just completely *dumb*".

And, as usual, that's the answer. Here's the reproducer:

# sudo mkfs.xfs -f -m crc=0 /dev/pmem1
# sudo mount -o noatime /dev/pmem1 /mnt/scratch
# sudo xfs_io -f -c "pwrite 0 512m -b 1" /mnt/scratch/fooey

And here's the profile:

   4.50%  [kernel]  [k] xfs_bmapi_read
   3.64%  [kernel]  [k] __block_commit_write.isra.30
   3.55%  [kernel]  [k] __radix_tree_lookup
   3.46%  [kernel]  [k] up_write
   3.43%  [kernel]  [k] ___might_sleep
   3.09%  [kernel]  [k] entry_SYSCALL_64_fastpath
   3.01%  [kernel]  [k] xfs_iext_bno_to_ext
   3.01%  [kernel]  [k] find_get_entry
   2.98%  [kernel]  [k] down_write
   2.71%  [kernel]  [k] mark_buffer_dirty
   2.52%  [kernel]  [k] __mark_inode_dirty
   2.38%  [kernel]  [k] unlock_page
   2.14%  [kernel]  [k] xfs_break_layouts
   2.07%  [kernel]  [k] xfs_bmapi_update_map
   2.06%  [kernel]  [k] xfs_bmap_search_extents
   2.04%  [kernel]  [k] xfs_iomap_write_delay
   2.00%  [kernel]  [k] generic_write_checks
   1.96%  [kernel]  [k] xfs_bmap_search_multi_extents
   1.90%  [kernel]  [k] __xfs_bmbt_get_all
   1.89%  [kernel]  [k] balance_dirty_pages_ratelimited
   1.82%  [kernel]  [k] wait_for_stable_page
   1.76%  [kernel]  [k] xfs_file_write_iter
   1.68%  [kernel]  [k] xfs_iomap_eof_want_preallocate
   1.68%  [kernel]  [k] xfs_bmapi_delay
   1.67%  [kernel]  [k] iomap_write_actor
   1.60%  [kernel]  [k] xfs_file_buffered_aio_write
   1.56%  [kernel]  [k] __might_sleep
   1.48%  [kernel]  [k] do_raw_spin_lock
   1.44%  [kernel]  [k] generic_write_end
   1.41%  [kernel]  [k] pagecache_get_page
   1.38%  [kernel]  [k] xfs_bmapi_trim_map
   1.21%  [kernel]  [k] __block_write_begin_int
   1.17%  [kernel]  [k] vfs_write
   1.17%  [kernel]  [k] xfs_file_iomap_begin
   1.17%  [kernel]  [k] xfs_bmbt_get_startoff
   1.14%  [kernel]  [k] iomap_apply
   1.08%  [kernel]  [k] xfs_iunlock
   1.08%  [kernel]  [k] iov_iter_copy_from_user_atomic
   0.97%  [kernel]  [k] xfs_file_aio_write_checks
   0.96%  [kernel]  [k] xfs_ilock
.

Yeah, I'm doing a sequential write in *1 byte pwrite() calls*.

Ok, so the benchmark isn't /quite/ that abysmally stupid. It's
still, ah, extremely challenged:

if (NBUFSIZE != 1024) { /* enforce known block size */
fprintf(stderr, "NBUFSIZE changed to %d\n", NBUFSIZE);
exit(1);
}

i.e. it's hard-coded to do all its "disk" IO in 1k block sizes.
Every read, every write, every file copy, etc. is done with a
1024-byte buffer. There are lots of loops that look like:

while (--n) {
write(fd, nbuf, sizeof nbuf)
}

where n is the file size specified in the job file. Those loops are
what is generating the profile we see: repeated partial page writes
that extend the file.

IOWs, the benchmark is doing exactly what we document in the fstat()
man page *not to do*, as it will cause inefficient IO patterns:

   The st_blksize field gives the "preferred" blocksize for
   efficient filesystem I/O.  (Writing to a file in smaller
   chunks may cause an inefficient  read-modify-rewrite.)

The smallest we ever set st_blksize to is PAGE_SIZE, so the
benchmark is running well known and documented (at least 10 years
ago) slow paths through the IO stack.  I'm very tempted now simply
to say that the aim7 disk benchmark is showing its age and as such
the results are not actually reflective of what typical applications
will see.

Christoph, maybe there's something we can do to only trigger
speculative prealloc growth checks if the new file size crosses the end of
the currently allocated block at the EOF. That would chop out a fair
chunk of the xfs_bmapi_read calls being done in this workload. I'm
not sure how much effort we should spend optimising this slow path,
though.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-11 Thread Dave Chinner
On Thu, Aug 11, 2016 at 11:16:12AM +1000, Dave Chinner wrote:
> On Wed, Aug 10, 2016 at 05:33:20PM -0700, Huang, Ying wrote:
> We need to know what is happening that is different - there's a good
> chance the mapping trace events will tell us. Huang, can you get
> a raw event trace from the test?
> 
> I need to see these events:
> 
>   xfs_file*
>   xfs_iomap*
>   xfs_get_block*
> 

lkp-folks, can I please get these traces run and sent to me? I don't
have the time or patience to try to get aim7 running on my machines
- the build is full of hard-coded paths and libraries that aren't
provided by modern distros (e.g. it requires a static libaio.a!) and
it fails at the configure stage complaining that:

configure: error: C compiler cannot create executables

Which is a complete load of BS.

Hence I can't make progress until I have some way of understanding
what the IO pattern is that is generating the profile being
measured. So far I'm unable to do that with any of the tools I been
trying, hence I need the traces to work out what I'm missing...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-11 Thread Dave Chinner
On Thu, Aug 11, 2016 at 09:55:33AM -0700, Linus Torvalds wrote:
> On Thu, Aug 11, 2016 at 8:57 AM, Christoph Hellwig  wrote:
> >
> > The one liner below (not tested yet) to simply remove it should fix that
> > up.  I also noticed we have a spurious pagefault_disable/enable, I
> > need to dig into the history of that first, though.
> 
> Hopefully the pagefault_disable/enable doesn't matter for this case.
> 
> Can we get this one-liner tested with the kernel robot for comparison?
> I really think a messed-up LRU list could cause bad IO patterns, and
> end up keeping dirty pages around that should be streaming out to disk
> and re-used, so causing memory pressure etc for no good reason.
> 
> I think the mapping->tree_lock issue that Dave sees is interesting
> too, but the kswapd activity (and the extra locking it causes) could
> also be a symptom of the same thing - memory pressure due to just
> putting pages in the active file that simply shouldn't be there.

So, removing mark_page_accessed() made the spinlock contention
*worse*.

  36.51%  [kernel]  [k] _raw_spin_unlock_irqrestore
   6.27%  [kernel]  [k] copy_user_generic_string
   3.73%  [kernel]  [k] _raw_spin_unlock_irq
   3.55%  [kernel]  [k] get_page_from_freelist
   1.97%  [kernel]  [k] do_raw_spin_lock  
   1.72%  [kernel]  [k] __block_commit_write.isra.30
   1.44%  [kernel]  [k] __wake_up_bit
   1.41%  [kernel]  [k] shrink_page_list  
   1.24%  [kernel]  [k] __radix_tree_lookup
   1.03%  [kernel]  [k] xfs_log_commit_cil
   0.99%  [kernel]  [k] free_hot_cold_page
   0.96%  [kernel]  [k] end_buffer_async_write
   0.95%  [kernel]  [k] delay_tsc
   0.94%  [kernel]  [k] ___might_sleep
   0.93%  [kernel]  [k] kmem_cache_alloc
   0.90%  [kernel]  [k] unlock_page
   0.82%  [kernel]  [k] kmem_cache_free
   0.74%  [kernel]  [k] up_write
   0.72%  [kernel]  [k] node_dirty_ok
   0.66%  [kernel]  [k] clear_page_dirty_for_io
   0.65%  [kernel]  [k] __mark_inode_dirty
   0.64%  [kernel]  [k] __block_write_begin_int
   0.58%  [kernel]  [k] xfs_inode_item_format
   0.57%  [kernel]  [k] __memset
   0.57%  [kernel]  [k] cancel_dirty_page
   0.56%  [kernel]  [k] down_write
   0.54%  [kernel]  [k] page_evictable
   0.53%  [kernel]  [k] page_mapping
   0.52%  [kernel]  [k] __slab_free
   0.49%  [kernel]  [k] xfs_do_writepage
   0.49%  [kernel]  [k] drop_buffers

-   41.82%41.82%  [kernel][k] _raw_spin_unlock_irqrestore
   - 35.93% ret_from_fork
  - kthread
 - 29.76% kswapd
  shrink_node
  shrink_node_memcg.isra.75
  shrink_inactive_list
  shrink_page_list
  __remove_mapping
  _raw_spin_unlock_irqrestore
 - 7.13% worker_thread
- process_one_work
   - 4.40% wb_workfn
wb_writeback
__writeback_inodes_wb
writeback_sb_inodes
__writeback_single_inode
do_writepages
xfs_vm_writepages
write_cache_pages
xfs_do_writepage
   - 2.71% xfs_end_io
xfs_destroy_ioend
end_buffer_async_write
end_page_writeback
test_clear_page_writeback
_raw_spin_unlock_irqrestore
   + 4.88% __libc_pwrite

The kswapd contention has jumped from 20% to 30% of the CPU time
in the profiles. I can't see how changing what LRU the page is on
will improve the contention problem - at it's sources it's a N:1
problem where the writing process and N kswapd worker threads are
all trying to access the same lock concurrently.

This is not the AIM7 problem we are looking for - what this test
demonstrates is a fundamental page cache scalability issue at the
design level - the mapping->tree_lock is a global serialisation
point.

I'm now going to test Christoph's theory that this is an "overwrite
doing lots of block mapping" issue. More on that to follow.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-11 Thread Linus Torvalds
On Thu, Aug 11, 2016 at 3:16 PM, Al Viro  wrote:
>
> Huh?  The very first thing it does is
> char *kaddr = kmap_atomic(page), *p = kaddr + offset;
>
> If _that_ does not disable pagefaults, we are very deep in shit.

Right you are - it does, even with highmem disabled. Never mind, those
pagefault_disabled/enabled() calls are clearly bogus.

 Linus


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-11 Thread Al Viro
On Thu, Aug 11, 2016 at 01:35:00PM -0700, Linus Torvalds wrote:

> The thing is, iov_iter_copy_from_user_atomic() doesn't itself enforce
> non-blocking user accesses, it depends on the caller blocking page
> faults.

Huh?  The very first thing it does is
char *kaddr = kmap_atomic(page), *p = kaddr + offset;

If _that_ does not disable pagefaults, we are very deep in shit.  AFAICS,
all instances of that sucker do disable those, including the non-highmem
default (it's
static inline void *kmap_atomic(struct page *page)
{
preempt_disable();
pagefault_disable();
return page_address(page);
}
)


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-11 Thread Christoph Hellwig
I'll need to dig into what AIM7 actually does in this benchmark, which
isn't too easy as I'm on a business trip currently, but from the list
below it looks like it keeps overwriting and overwriting a file that's
already been allocated.  This is a pretty stupid workload, but fortunately
it should also be able to be optimized by skipping the actual block
lookup, which is what the old buffer.c code happens to do.


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-11 Thread Linus Torvalds
On Thu, Aug 11, 2016 at 2:16 PM, Huang, Ying  wrote:
>
> Test result is as follow,

Thanks. No change.

> raw perf data:

I redid my munging, with the old (good) percentages in parenthesis:

  intel_idle:   17.66   (16.88)
  copy_user_enhanced_fast_string:3.25(3.94)
  memset_erms:   2.56(3.26)
  xfs_bmapi_read:2.28
  ___might_sleep:2.09(2.33)
  __block_commit_write.isra.24:  2.07(2.47)
  xfs_iext_bno_to_ext:   1.79
  __block_write_begin_int:   1.74(1.56)
  up_write:  1.72(1.61)
  unlock_page:   1.69(1.69)
  down_write:1.59(1.55)
  __mark_inode_dirty:1.54(1.88)
  xfs_bmap_search_extents:   1.33
  xfs_iomap_write_delay: 1.23
  mark_buffer_dirty: 1.21(1.53)
  __radix_tree_lookup:   1.2 (1.32)
  xfs_bmap_search_multi_extents: 1.18
  xfs_iomap_eof_want_preallocate.constprop.8:1.17
  entry_SYSCALL_64_fastpath: 1.15(1.47)
  __might_sleep: 1.14(1.26)
  _raw_spin_lock:0.97(1.17)
  vfs_write: 0.94(1.14)
  xfs_bmapi_delay:   0.93
  iomap_write_actor: 0.9
  pagecache_get_page:0.89(1.03)
  xfs_file_write_iter:   0.86(1.03)
  xfs_file_iomap_begin:  0.81
  iov_iter_copy_from_user_atomic:0.78(0.87)
  iomap_apply:   0.77
  generic_write_end: 0.74(1.36)
  xfs_file_buffered_aio_write:   0.72(0.84)
  find_get_entry:0.69(0.79)
  __vfs_write:   0.67(0.87)

and it's worth noting a few things:

 - most of the old percentages are bigger, but that's natural: the
load used to take longer, and the more efficient (old) case thus has
higher percent values. That doesn't mean it was slower, quite the
reverse.

 - the main exception is intel_idle, so we do have more idle time.

But the *big* difference is all the functions that didn't use to show
up at all, and have no previous percent values:

  xfs_bmapi_read: 2.28
  xfs_iext_bno_to_ext: 1.79
  xfs_bmap_search_extents: 1.33
  xfs_iomap_write_delay: 1.23
  xfs_bmap_search_multi_extents: 1.18
  xfs_iomap_eof_want_preallocate.constprop.8: 1.17
  xfs_bmapi_delay: 0.93
  iomap_write_actor: 0.9
  xfs_file_iomap_begin: 0.81
  iomap_apply: 0.77

and I think this really can explain the regression. That all adds up
to 12% or so of "new overhead". Which is fairly close to the
regression.

(Ok, that is playing fast and loose with percentages, but I think it
might be "close enough" in practice).

So for some reason the new code doesn't do a lot more per-page
operations (the unlock_page() etc costs are fairly similar), but it
has a *much* more expensive footprint in the xfs_bmap/iomap
functions.

The old code had almost no XFS footprint at all, and didn't need to
look up block mappings etc, and worked almost entirely with the vfs
caches (so used the block numbers in the buffers etc).

And I know that DaveC often complains about vfs overhead, but the fact
is, the VFS layer is optimized to hell and back and does really really
well. Having to call down to filesystem routines (for block mappings
etc) is when performance goes down. I think this is an example of
that.

And hey, maybe I'm just misreading things, or reading too much into
those profiles. But it does look like that commit
68a9f5e7007c1afa2cf6830b690a90d0187c0684 ends up causing more xfs bmap
activity.

 Linus


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-11 Thread Huang, Ying
Christoph Hellwig  writes:

> On Thu, Aug 11, 2016 at 12:51:31PM -0700, Linus Torvalds wrote:
>> Ok. It does seem to also reset the active file page counts back, so
>> that part did seem to be related, but yeah, from a performance
>> standpoint that was clearly not a major issue.
>> 
>> Let's hope Dave can figure out something based on his numbers, because
>> I'm out of ideas. Or maybe it's the pagefault-atomic thing that
>> Christoph was looking at.
>
> I can't really think of any reason why the pagefault_disable() would
> significantly change performance.  Anyway, the patch for that is below
> (on top of the previous mark_page_accessed() one), so feel free to
> re-run the test with it.  It would also be nice to see the profiles
> with the two patches applied.
>
> commit 43106eea246074acc4bb7d12fdb91f58002f52ed
> Author: Christoph Hellwig 
> Date:   Thu Aug 11 10:41:40 2016 -0700
>
> fs: remove superfluous pagefault_disable from iomap_write_actor
> 
> No idea where this really came from, generic_perform_write only briefly
> did a pagefault_disable when trying a different prefault scheme.
> 
> Signed-off-by: Christoph Hellwig 
>
> diff --git a/fs/iomap.c b/fs/iomap.c
> index f39c318..74712e2 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -194,9 +194,7 @@ again:
>   if (mapping_writably_mapped(inode->i_mapping))
>   flush_dcache_page(page);
>  
> - pagefault_disable();
>   copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
> - pagefault_enable();
>  
>   flush_dcache_page(page);
>  

Test result is as follow,

commit e129b86bfacc8bb517b843fca41d0d179de7a4ca
Author: Christoph Hellwig 
Date:   Thu Aug 11 10:41:40 2016 -0700

fs: remove superfluous pagefault_disable from iomap_write_actor

No idea where this really came from, generic_perform_write only briefly
did a pagefault_disable when trying a different prefault scheme.

Signed-off-by: Christoph Hellwig 

diff --git a/fs/iomap.c b/fs/iomap.c
index f39c318..74712e2 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -194,9 +194,7 @@ again:
if (mapping_writably_mapped(inode->i_mapping))
flush_dcache_page(page);
 
-   pagefault_disable();
copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
-   pagefault_enable();
 
flush_dcache_page(page);
 

=
compiler/cpufreq_governor/debug-setup/disk/fs/kconfig/load/rootfs/tbox_group/test/testcase:
  
gcc-6/performance/profile/1BRD_48G/xfs/x86_64-rhel/3000/debian-x86_64-2015-02-07.cgz/ivb44/disk_wrt/aim7

commit: 
  f0c6bcba74ac51cb77aadb33ad35cb2dc1ad1506
  68a9f5e7007c1afa2cf6830b690a90d0187c0684
  e129b86bfacc8bb517b843fca41d0d179de7a4ca

f0c6bcba74ac51cb 68a9f5e7007c1afa2cf6830b69 e129b86bfacc8bb517b843fca4 
 -- -- 
 %stddev %change %stddev %change %stddev
 \  |\  |\  
484435 ±  0% -13.3% 420004 ±  0% -11.6% 428323 ±  0%  
aim7.jobs-per-min
 37.37 ±  0% +15.3%  43.09 ±  0% +13.0%  42.24 ±  0%  
aim7.time.elapsed_time
 37.37 ±  0% +15.3%  43.09 ±  0% +13.0%  42.24 ±  0%  
aim7.time.elapsed_time.max
  6491 ±  3% +30.8%   8491 ±  0% +19.9%   7781 ±  4%  
aim7.time.involuntary_context_switches
  2378 ±  0%  +1.1%   2404 ±  0%  +9.2%   2598 ±  0%  
aim7.time.maximum_resident_set_size
376.89 ±  0% +28.4% 484.11 ±  0% +22.8% 462.92 ±  0%  
aim7.time.system_time
430512 ±  0% -20.1% 343838 ±  0% -17.3% 356032 ±  0%  
aim7.time.voluntary_context_switches
 26816 ±  8% +10.2%  29542 ±  1% +13.5%  30428 ±  0%  
interrupts.CAL:Function_call_interrupts
  1016 ±  8%+527.7%   6381 ± 59%+483.5%   5932 ± 82%  
latency_stats.sum.down.xfs_buf_lock._xfs_buf_find.xfs_buf_get_map.xfs_trans_get_buf_map.xfs_da_get_buf.xfs_dir3_data_init.xfs_dir2_sf_to_block.xfs_dir2_sf_addname.xfs_dir_createname.xfs_create.xfs_generic_create
125122 ± 10% -10.7% 111758 ± 12%  +8.2% 135440 ±  3%  
softirqs.SCHED
410707 ± 12%  +5.2% 432155 ± 11% +24.9% 512838 ±  4%  
softirqs.TIMER
 24772 ±  0% -28.6%  17675 ±  0% -24.1%  18813 ±  1%  
vmstat.system.cs
 53477 ±  2%  +5.6%  56453 ±  0%  +6.1%  56716 ±  0%  
vmstat.system.in
 43469 ±  0%  +5.3%  45792 ±  1% +25.0%  54343 ± 16%  
proc-vmstat.nr_active_anon
  3906 ±  0% +28.8%   5032 ±  2% -48.8%   2000 ± 96%  
proc-vmstat.nr_active_file
919.33 ±  5% +14.8%   1055 ±  8% +17.8%   1083 ± 10%  
proc-vmstat.nr_dirty
  2270 ±  0% 

Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-11 Thread Linus Torvalds
On Thu, Aug 11, 2016 at 1:00 PM, Christoph Hellwig  wrote:
>
> I can't really think of any reason why the pagefault_disable() would
> significantly change performance.

No, you're right, we prefault the page anyway.

And quite frankly, looking at it, I think the pagefault_disable/enable
is actually *correct*.

The thing is, iov_iter_copy_from_user_atomic() doesn't itself enforce
non-blocking user accesses, it depends on the caller blocking page
faults.

And the reason we want to block page faults is to make sure we don't
deadlock on the page we just locked for writing.

So looking at it, I think the pagefault_disable/enable is actually
needed, and it may in fact be a bug that mm/filemap.c doesn't do that.

However, see commit 00a3d660cbac ("Revert "fs: do not prefault
sys_write() user buffer pages"") about why mm/filemap.c doesn't do the
pagefault_disable().

But depending on the prefaulting actually guaranteeing that we don't
deadlock sounds like a nasty race in theory (ie somebody does mmap
tricks in another thread in between the pre-faulting and the final
copying).
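The prefault-then-atomic-copy dance being discussed has roughly this shape (a simplified pseudocode sketch of the 4.8-era write loop, not the verbatim mm/filemap.c code; names approximate):

```c
/* Simplified sketch; error handling and write_begin/write_end omitted. */
do {
	/* 1. Prefault the user buffer while holding no page lock, so a
	 *    later fault can't deadlock on the page we're about to lock. */
	if (iov_iter_fault_in_readable(i, bytes)) {
		status = -EFAULT;
		break;
	}

	/* 2. Lock the destination page cache page. */
	page = grab_cache_page_write_begin(mapping, index, flags);

	/* 3. Copy with page faults disabled (kmap_atomic() implies this);
	 *    a fault here produces a short copy instead of sleeping. */
	copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);

	unlock_page(page);

	/* 4. A short copy means the prefaulted page went away in between
	 *    (e.g. another thread's mmap games); shrink the request and
	 *    retry rather than deadlock on the locked page. */
	if (unlikely(copied == 0))
		bytes = min(bytes, iov_iter_single_seg_count(i));
	else
		iov_iter_advance(i, copied);
} while (iov_iter_count(i));
```

Step 1 is exactly the prefaulting whose race window Linus is pointing at: nothing pins the user page between the prefault and step 3, so the short-copy retry in step 4 is what actually guarantees forward progress.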

   Linus


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-11 Thread Christoph Hellwig
On Thu, Aug 11, 2016 at 12:51:31PM -0700, Linus Torvalds wrote:
> Ok. It does seem to also reset the active file page counts back, so
> that part did seem to be related, but yeah, from a performance
> standpoint that was clearly not a major issue.
> 
> Let's hope Dave can figure out something based on his numbers, because
> I'm out of ideas. Or maybe it's the pagefault-atomic thing that
> Christoph was looking at.

I can't really think of any reason why the pagefault_disable() would
significantly change performance.  Anyway, the patch for that is below
(on top of the previous mark_page_accessed() one), so feel free to
re-run the test with it.  It would also be nice to see the profiles
with the two patches applied.

commit 43106eea246074acc4bb7d12fdb91f58002f52ed
Author: Christoph Hellwig 
Date:   Thu Aug 11 10:41:40 2016 -0700

fs: remove superfluous pagefault_disable from iomap_write_actor

No idea where this really came from, generic_perform_write only briefly
did a pagefault_disable when trying a different prefault scheme.

Signed-off-by: Christoph Hellwig 

diff --git a/fs/iomap.c b/fs/iomap.c
index f39c318..74712e2 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -194,9 +194,7 @@ again:
if (mapping_writably_mapped(inode->i_mapping))
flush_dcache_page(page);
 
-   pagefault_disable();
copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
-   pagefault_enable();
 
flush_dcache_page(page);
 


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-11 Thread Linus Torvalds
On Thu, Aug 11, 2016 at 10:51 AM, Huang, Ying  wrote:
>
> Here is the test result for the debug patch.  It appears that the aim7
> score is a little better, but the regression is not recovered.

Ok. It does seem to also reset the active file page counts back, so
that part did seem to be related, but yeah, from a performance
standpoint that was clearly not a major issue.

Let's hope Dave can figure out something based on his numbers, because
I'm out of ideas. Or maybe it's the pagefault-atomic thing that
Christoph was looking at.

Linus


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-11 Thread Huang, Ying
Linus Torvalds  writes:

> On Thu, Aug 11, 2016 at 8:57 AM, Christoph Hellwig  wrote:
>>
>> The one liner below (not tested yet) to simply remove it should fix that
>> up.  I also noticed we have a spurious pagefault_disable/enable, I
>> need to dig into the history of that first, though.
>
> Hopefully the pagefault_disable/enable doesn't matter for this case.
>
> Can we get this one-liner tested with the kernel robot for comparison?
> I really think a messed-up LRU list could cause bad IO patterns, and
> end up keeping dirty pages around that should be streaming out to disk
> and re-used, so causing memory pressure etc for no good reason.
>
> I think the mapping->tree_lock issue that Dave sees is interesting
> too, but the kswapd activity (and the extra locking it causes) could
> also be a symptom of the same thing - memory pressure due to just
> putting pages in the active file that simply shouldn't be there.
>
> So the trivial oneliner _might_ just explain things. It would be
> really nice if the regression turns out to be due to something so
> easily fixed.
>

Here is the test result for the debug patch.  It appears that the aim7
score is a little better, but the regression is not recovered.

commit 5c70fdfdf82723e47ac180e36a7638ca06ea19d8
Author: Christoph Hellwig 
Date:   Thu Aug 11 17:57:21 2016 +0200

dbg fix 68a9f5e700: aim7.jobs-per-min -13.6% regression

On Wed, Aug 10, 2016 at 06:00:24PM -0700, Linus Torvalds wrote:
> The biggest difference is that we have "mark_page_accessed()" show up
> after, and not before. There was also a lot of LRU noise in the
> non-profile data. I wonder if that is the reason here: the old model
> of using generic_perform_write/block_page_mkwrite didn't mark the
> pages accessed, and now with iomap_file_buffered_write() they get
> marked as active and that screws up the LRU list, and makes us not
> flush out the dirty pages well (because they are seen as active and
> not good for writeback), and then you get bad memory use.

And that's actually a "bug" in the new code - mostly because I failed
to pick up changes to the core code happening after we 'forked' it,
in this case commit 2457ae ("mm: non-atomically mark page accessed
during page cache allocation where possible").

The one liner below (not tested yet) to simply remove it should fix that
up.  I also noticed we have a spurious pagefault_disable/enable, I
need to dig into the history of that first, though.

diff --git a/fs/iomap.c b/fs/iomap.c
index 48141b8..f39c318 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -199,7 +199,6 @@ again:
pagefault_enable();
 
flush_dcache_page(page);
-   mark_page_accessed(page);
 
status = iomap_write_end(inode, pos, bytes, copied, page);
if (unlikely(status < 0))


=
compiler/cpufreq_governor/disk/fs/kconfig/load/rootfs/tbox_group/test/testcase:
  
gcc-6/performance/1BRD_48G/xfs/x86_64-rhel/3000/debian-x86_64-2015-02-07.cgz/ivb44/disk_wrt/aim7

commit: 
  f0c6bcba74ac51cb77aadb33ad35cb2dc1ad1506
  68a9f5e7007c1afa2cf6830b690a90d0187c0684
  5c70fdfdf82723e47ac180e36a7638ca06ea19d8

f0c6bcba74ac51cb 68a9f5e7007c1afa2cf6830b69 5c70fdfdf82723e47ac180e36a 
 -- -- 
 %stddev %change %stddev %change %stddev
 \  |\  |\  
486586 ±  0% -13.6% 420523 ±  0% -11.2% 432165 ±  0%  
aim7.jobs-per-min
 37.23 ±  0% +15.5%  43.02 ±  0% +12.4%  41.87 ±  0%  
aim7.time.elapsed_time
 37.23 ±  0% +15.5%  43.02 ±  0% +12.4%  41.87 ±  0%  
aim7.time.elapsed_time.max
  6424 ±  1% +31.3%   8437 ±  1% +24.4%   7992 ±  3%  
aim7.time.involuntary_context_switches
  2489 ±  3%  -2.5%   2425 ±  1%  -1.0%   2465 ±  3%  
aim7.time.maximum_resident_set_size
151288 ±  0%  +3.0% 155764 ±  0%  +2.0% 154252 ±  1%  
aim7.time.minor_page_faults
376.31 ±  0% +28.4% 483.23 ±  0% +22.5% 460.96 ±  1%  
aim7.time.system_time
429058 ±  0% -20.1% 343013 ±  0% -16.7% 357231 ±  0%  
aim7.time.voluntary_context_switches
 16014 ±  0% +28.3%  20545 ±  1%  -4.0%  15369 ±  0%  
meminfo.Active(file)
127154 ±  9% -15.5% 107424 ± 10%  -4.4% 121505 ± 11%  
softirqs.SCHED
 24561 ±  0% -27.1%  17895 ±  1% -22.5%  19033 ±  1%  
vmstat.system.cs
 47289 ±  0%  +1.4%  47938 ±  0%  +1.1%  47807 ±  0%  
vmstat.system.in
  4003 ±  0% +27.8%   5117 ±  1%  -4.0%   3842 ±  0%  
proc-vmstat.nr_active_file
979.25 ±  0% +59.1%   1558 ±  5% -11.8% 864.00 ±  1%  
proc-vmstat.pgactivate
 

Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-11 Thread Linus Torvalds
On Thu, Aug 11, 2016 at 8:57 AM, Christoph Hellwig  wrote:
>
> The one liner below (not tested yet) to simply remove it should fix that
> up.  I also noticed we have a spurious pagefault_disable/enable, I
> need to dig into the history of that first, though.

Hopefully the pagefault_disable/enable doesn't matter for this case.

Can we get this one-liner tested with the kernel robot for comparison?
I really think a messed-up LRU list could cause bad IO patterns, and
end up keeping dirty pages around that should be streaming out to disk
and re-used, so causing memory pressure etc for no good reason.

I think the mapping->tree_lock issue that Dave sees is interesting
too, but the kswapd activity (and the extra locking it causes) could
also be a symptom of the same thing - memory pressure due to just
putting pages in the active file that simply shouldn't be there.

So the trivial oneliner _might_ just explain things. It would be
really nice if the regression turns out to be due to something so
easily fixed.

Linus


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-11 Thread Christoph Hellwig
On Wed, Aug 10, 2016 at 06:00:24PM -0700, Linus Torvalds wrote:
> The biggest difference is that we have "mark_page_accessed()" show up
> after, and not before. There was also a lot of LRU noise in the
> non-profile data. I wonder if that is the reason here: the old model
> of using generic_perform_write/block_page_mkwrite didn't mark the
> pages accessed, and now with iomap_file_buffered_write() they get
> marked as active and that screws up the LRU list, and makes us not
> flush out the dirty pages well (because they are seen as active and
> not good for writeback), and then you get bad memory use.

And that's actually a "bug" in the new code - mostly because I failed
to pick up changes to the core code happening after we 'forked' it,
in this case commit 2457ae ("mm: non-atomically mark page accessed during page
cache allocation where possible").

The one liner below (not tested yet) to simply remove it should fix that
up.  I also noticed we have a spurious pagefault_disable/enable, I
need to dig into the history of that first, though.

diff --git a/fs/iomap.c b/fs/iomap.c
index 48141b8..f39c318 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -199,7 +199,6 @@ again:
pagefault_enable();
 
flush_dcache_page(page);
-   mark_page_accessed(page);
 
status = iomap_write_end(inode, pos, bytes, copied, page);
if (unlikely(status < 0))


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-10 Thread Dave Chinner
On Wed, Aug 10, 2016 at 06:00:24PM -0700, Linus Torvalds wrote:
> On Wed, Aug 10, 2016 at 5:33 PM, Huang, Ying  wrote:
> >
> > Here it is,
> 
> Thanks.
> 
> Appended is a munged "after" list, with the "before" values in
> parenthesis. It actually looks fairly similar.
> 
> The biggest difference is that we have "mark_page_accessed()" show up
> after, and not before. There was also a lot of LRU noise in the
> non-profile data. I wonder if that is the reason here: the old model
> of using generic_perform_write/block_page_mkwrite didn't mark the
> pages accessed, and now with iomap_file_buffered_write() they get
> marked as active and that screws up the LRU list, and makes us not
> flush out the dirty pages well (because they are seen as active and
> not good for writeback), and then you get bad memory use.
> 
> I'm not seeing anything that looks like locking-related.

Not in that profile. I've been doing some local testing inside a
4-node fake-numa 16p/16GB RAM VM to see what I can find.
I'm yet to work out how I can trigger a profile like the one that
was reported (I really need to see the event traces), but in the
meantime I found this:

Doing a large sequential single threaded buffered write using a 4k
buffer (so single page per syscall to make the XFS IO path allocator
behave the same way as in 4.7), I'm seeing a CPU profile that
indicates we have a potential mapping->tree_lock issue:

# xfs_io -f -c "truncate 0" -c "pwrite 0 47g" /mnt/scratch/fooey
wrote 50465865728/50465865728 bytes at offset 0
47.000 GiB, 12320768 ops; 0:01:36.00 (499.418 MiB/sec and 127850.9132 ops/sec)



  24.15%  [kernel]  [k] _raw_spin_unlock_irqrestore
   9.67%  [kernel]  [k] copy_user_generic_string
   5.64%  [kernel]  [k] _raw_spin_unlock_irq
   3.34%  [kernel]  [k] get_page_from_freelist
   2.57%  [kernel]  [k] mark_page_accessed
   2.45%  [kernel]  [k] do_raw_spin_lock
   1.83%  [kernel]  [k] shrink_page_list
   1.70%  [kernel]  [k] free_hot_cold_page
   1.26%  [kernel]  [k] xfs_do_writepage
   1.21%  [kernel]  [k] __radix_tree_lookup
   1.20%  [kernel]  [k] __wake_up_bit
   0.99%  [kernel]  [k] __block_write_begin_int
   0.95%  [kernel]  [k] find_get_pages_tag
   0.92%  [kernel]  [k] cancel_dirty_page
   0.89%  [kernel]  [k] unlock_page
   0.87%  [kernel]  [k] clear_page_dirty_for_io
   0.85%  [kernel]  [k] xfs_bmap_worst_indlen
   0.84%  [kernel]  [k] xfs_file_buffered_aio_write
   0.81%  [kernel]  [k] delay_tsc
   0.78%  [kernel]  [k] node_dirty_ok
   0.77%  [kernel]  [k] up_write
   0.74%  [kernel]  [k] ___might_sleep
   0.73%  [kernel]  [k] xfs_bmap_add_extent_hole_delay
   0.72%  [kernel]  [k] __fget_light
   0.67%  [kernel]  [k] add_to_page_cache_lru
   0.67%  [kernel]  [k] __slab_free
   0.63%  [kernel]  [k] drop_buffers
   0.59%  [kernel]  [k] down_write
   0.59%  [kernel]  [k] kmem_cache_alloc
   0.58%  [kernel]  [k] iomap_write_actor
   0.53%  [kernel]  [k] page_mapping
   0.52%  [kernel]  [k] entry_SYSCALL_64_fastpath
   0.52%  [kernel]  [k] __mark_inode_dirty
   0.51%  [kernel]  [k] __block_commit_write.isra.30
   0.51%  [kernel]  [k] xfs_file_write_iter
   0.49%  [kernel]  [k] mark_buffer_async_write
   0.47%  [kernel]  [k] balance_dirty_pages_ratelimited
   0.47%  [kernel]  [k] xfs_count_page_state
   0.47%  [kernel]  [k] page_evictable
   0.46%  [kernel]  [k] xfs_vm_releasepage
   0.46%  [kernel]  [k] xfs_iomap_write_delay
   0.46%  [kernel]  [k] do_raw_spin_unlock
   0.44%  [kernel]  [k] xfs_file_iomap_begin
   0.44%  [kernel]  [k] xfs_map_at_offset
   0.42%  [kernel]  [k] xfs_iext_bno_to_ext

There's very little XFS showing up near the top of the profile;
it's all page cache, writeback and spin lock traffic. This is a
dead give-away as to the lock being contended:

-   33.30% 0.01%  [kernel][k] kswapd
   - 4.67% kswapd
  - 4.69% shrink_node
 - 4.77% shrink_node_memcg.isra.75
- 7.38% shrink_inactive_list
   - 6.70% shrink_page_list
  - 20.02% __remove_mapping
   19.90% _raw_spin_unlock_irqrestore

I don't think that this is the same as what aim7 is triggering, as
there are no XFS write() path allocation functions near the top of the
profile to speak of. Still, I don't recall seeing this before...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-10 Thread Dave Chinner
On Thu, Aug 11, 2016 at 10:36:59AM +0800, Ye Xiaolong wrote:
> On 08/11, Dave Chinner wrote:
> >On Thu, Aug 11, 2016 at 11:16:12AM +1000, Dave Chinner wrote:
> >> I need to see these events:
> >> 
> >>xfs_file*
> >>xfs_iomap*
> >>xfs_get_block*
> >> 
> >> For both kernels. An example trace from 4.8-rc1 running the command
> >> `xfs_io -f -c 'pwrite 0 512k -b 128k' /mnt/scratch/fooey` doing an
> >> overwrite and extend of the existing file ends up looking like:
> >> 
> >> $ sudo trace-cmd start -e xfs_iomap\* -e xfs_file\* -e xfs_get_blocks\*
> >> $ sudo cat /sys/kernel/tracing/trace_pipe
> >><...>-2946  [001]  253971.750304: xfs_file_ioctl: dev 
> >> 253:32 ino 0x84
> >>   xfs_io-2946  [001]  253971.750938: xfs_file_buffered_write: 
> >> dev 253:32 ino 0x84 size 0x4 offset 0x0 count 0x2
> >>   xfs_io-2946  [001]  253971.750961: xfs_iomap_found: dev 
> >> 253:32 ino 0x84 size 0x4 offset 0x0 count 131072 type invalid startoff 
> >> 0x0 startblock 24 blockcount 0x60
> >>   xfs_io-2946  [001]  253971.751114: xfs_file_buffered_write: 
> >> dev 253:32 ino 0x84 size 0x4 offset 0x2 count 0x2
> >>   xfs_io-2946  [001]  253971.751128: xfs_iomap_found: dev 
> >> 253:32 ino 0x84 size 0x4 offset 0x2 count 131072 type invalid 
> >> startoff 0x0 startblock 24 blockcount 0x60
> >>   xfs_io-2946  [001]  253971.751234: xfs_file_buffered_write: 
> >> dev 253:32 ino 0x84 size 0x4 offset 0x4 count 0x2
> >>   xfs_io-2946  [001]  253971.751236: xfs_iomap_found: dev 
> >> 253:32 ino 0x84 size 0x4 offset 0x4 count 131072 type invalid 
> >> startoff 0x0 startblock 24 blockcount 0x60
> >>   xfs_io-2946  [001]  253971.751381: xfs_file_buffered_write: 
> >> dev 253:32 ino 0x84 size 0x4 offset 0x6 count 0x2
> >>   xfs_io-2946  [001]  253971.751415: xfs_iomap_prealloc_size: 
> >> dev 253:32 ino 0x84 prealloc blocks 128 shift 0 m_writeio_blocks 16
> >>   xfs_io-2946  [001]  253971.751425: xfs_iomap_alloc: dev 
> >> 253:32 ino 0x84 size 0x4 offset 0x6 count 131072 type invalid 
> >> startoff 0x60 startblock -1 blockcount 0x90
> >> 
> >> That's the output I need for the complete test - you'll need to use
> >> a better recording mechanism that this (e.g. trace-cmd record,
> >> trace-cmd report) because it will generate a lot of events. Compress
> >> the two report files (they'll be large) and send them to me offlist.
> >
> >Can you also send me the output of xfs_info on the filesystem you
> >are testing?
> 
> Hi, Dave
> 
> Here is the xfs_info output:
> 
> # xfs_info /fs/ram0/
> meta-data=/dev/ram0              isize=256    agcount=4, agsize=3145728 blks
>          =                       sectsz=4096  attr=2, projid32bit=1
>          =                       crc=0        finobt=0
> data     =                       bsize=4096   blocks=12582912, imaxpct=25
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0 ftype=0
> log      =internal               bsize=4096   blocks=6144, version=2
>          =                       sectsz=4096  sunit=1 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0

OK, nothing unusual there. One thing that I did just think of - how
close to ENOSPC does this test get? i.e. are we hitting the "we're
almost out of free space" slow paths on this test?
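For what it's worth, the "how close to ENOSPC" question can be answered with a one-liner. A sketch (the mount point is an assumption; substitute the test's scratch mount, e.g. /fs/ram0):

```shell
# Report how full the scratch filesystem is; XFS enters its low-free-space
# slow paths as the filesystem approaches ENOSPC. MNT defaults to / so the
# sketch runs anywhere.
MNT="${MNT:-/}"
USED_PCT=$(df -P "$MNT" | awk 'NR==2 {sub(/%/, "", $5); print $5}')
echo "$MNT is ${USED_PCT}% full"
if [ "$USED_PCT" -ge 90 ]; then
    echo "close to ENOSPC: low-free-space slow paths are likely in play"
fi
```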

Cheers,

dave.
> 
> Thanks,
> Xiaolong
> >
> >Cheers,
> >
> >Dave.
> >-- 
> >Dave Chinner
> >da...@fromorbit.com
> >___
> >LKP mailing list
> >l...@lists.01.org
> >https://lists.01.org/mailman/listinfo/lkp
> 

-- 
Dave Chinner
da...@fromorbit.com


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-10 Thread Ye Xiaolong
On 08/11, Dave Chinner wrote:
>On Thu, Aug 11, 2016 at 11:16:12AM +1000, Dave Chinner wrote:
>> I need to see these events:
>> 
>>  xfs_file*
>>  xfs_iomap*
>>  xfs_get_block*
>> 
>> For both kernels. An example trace from 4.8-rc1 running the command
>> `xfs_io -f -c 'pwrite 0 512k -b 128k' /mnt/scratch/fooey` doing an
>> overwrite and extend of the existing file ends up looking like:
>> 
>> $ sudo trace-cmd start -e xfs_iomap\* -e xfs_file\* -e xfs_get_blocks\*
>> $ sudo cat /sys/kernel/tracing/trace_pipe
>><...>-2946  [001]  253971.750304: xfs_file_ioctl: dev 253:32 
>> ino 0x84
>>   xfs_io-2946  [001]  253971.750938: xfs_file_buffered_write: 
>> dev 253:32 ino 0x84 size 0x4 offset 0x0 count 0x2
>>   xfs_io-2946  [001]  253971.750961: xfs_iomap_found: dev 253:32 
>> ino 0x84 size 0x4 offset 0x0 count 131072 type invalid startoff 0x0 
>> startblock 24 blockcount 0x60
>>   xfs_io-2946  [001]  253971.751114: xfs_file_buffered_write: 
>> dev 253:32 ino 0x84 size 0x4 offset 0x2 count 0x2
>>   xfs_io-2946  [001]  253971.751128: xfs_iomap_found: dev 253:32 
>> ino 0x84 size 0x4 offset 0x2 count 131072 type invalid startoff 0x0 
>> startblock 24 blockcount 0x60
>>   xfs_io-2946  [001]  253971.751234: xfs_file_buffered_write: 
>> dev 253:32 ino 0x84 size 0x4 offset 0x4 count 0x2
>>   xfs_io-2946  [001]  253971.751236: xfs_iomap_found: dev 253:32 
>> ino 0x84 size 0x4 offset 0x4 count 131072 type invalid startoff 0x0 
>> startblock 24 blockcount 0x60
>>   xfs_io-2946  [001]  253971.751381: xfs_file_buffered_write: 
>> dev 253:32 ino 0x84 size 0x4 offset 0x6 count 0x2
>>   xfs_io-2946  [001]  253971.751415: xfs_iomap_prealloc_size: 
>> dev 253:32 ino 0x84 prealloc blocks 128 shift 0 m_writeio_blocks 16
>>   xfs_io-2946  [001]  253971.751425: xfs_iomap_alloc: dev 253:32 
>> ino 0x84 size 0x4 offset 0x6 count 131072 type invalid startoff 0x60 
>> startblock -1 blockcount 0x90
>> 
>> That's the output I need for the complete test - you'll need to use
>> a better recording mechanism that this (e.g. trace-cmd record,
>> trace-cmd report) because it will generate a lot of events. Compress
>> the two report files (they'll be large) and send them to me offlist.
>
>Can you also send me the output of xfs_info on the filesystem you
>are testing?

Hi, Dave

Here is the xfs_info output:

# xfs_info /fs/ram0/
meta-data=/dev/ram0              isize=256    agcount=4, agsize=3145728 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=0        finobt=0
data     =                       bsize=4096   blocks=12582912, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=0
log      =internal               bsize=4096   blocks=6144, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

Thanks,
Xiaolong
>
>Cheers,
>
>Dave.
>-- 
>Dave Chinner
>da...@fromorbit.com
>___
>LKP mailing list
>l...@lists.01.org
>https://lists.01.org/mailman/listinfo/lkp


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-10 Thread Dave Chinner
On Thu, Aug 11, 2016 at 11:16:12AM +1000, Dave Chinner wrote:
> I need to see these events:
> 
>   xfs_file*
>   xfs_iomap*
>   xfs_get_block*
> 
> For both kernels. An example trace from 4.8-rc1 running the command
> `xfs_io -f -c 'pwrite 0 512k -b 128k' /mnt/scratch/fooey` doing an
> overwrite and extend of the existing file ends up looking like:
> 
> $ sudo trace-cmd start -e xfs_iomap\* -e xfs_file\* -e xfs_get_blocks\*
> $ sudo cat /sys/kernel/tracing/trace_pipe
><...>-2946  [001]  253971.750304: xfs_file_ioctl: dev 253:32 
> ino 0x84
>   xfs_io-2946  [001]  253971.750938: xfs_file_buffered_write: dev 
> 253:32 ino 0x84 size 0x4 offset 0x0 count 0x2
>   xfs_io-2946  [001]  253971.750961: xfs_iomap_found: dev 253:32 
> ino 0x84 size 0x4 offset 0x0 count 131072 type invalid startoff 0x0 
> startblock 24 blockcount 0x60
>   xfs_io-2946  [001]  253971.751114: xfs_file_buffered_write: dev 
> 253:32 ino 0x84 size 0x4 offset 0x2 count 0x2
>   xfs_io-2946  [001]  253971.751128: xfs_iomap_found: dev 253:32 
> ino 0x84 size 0x4 offset 0x2 count 131072 type invalid startoff 0x0 
> startblock 24 blockcount 0x60
>   xfs_io-2946  [001]  253971.751234: xfs_file_buffered_write: dev 
> 253:32 ino 0x84 size 0x4 offset 0x4 count 0x2
>   xfs_io-2946  [001]  253971.751236: xfs_iomap_found: dev 253:32 
> ino 0x84 size 0x4 offset 0x4 count 131072 type invalid startoff 0x0 
> startblock 24 blockcount 0x60
>   xfs_io-2946  [001]  253971.751381: xfs_file_buffered_write: dev 
> 253:32 ino 0x84 size 0x4 offset 0x6 count 0x2
>   xfs_io-2946  [001]  253971.751415: xfs_iomap_prealloc_size: dev 
> 253:32 ino 0x84 prealloc blocks 128 shift 0 m_writeio_blocks 16
>   xfs_io-2946  [001]  253971.751425: xfs_iomap_alloc: dev 253:32 
> ino 0x84 size 0x4 offset 0x6 count 131072 type invalid startoff 0x60 
> startblock -1 blockcount 0x90
> 
> That's the output I need for the complete test - you'll need to use
> a better recording mechanism that this (e.g. trace-cmd record,
> trace-cmd report) because it will generate a lot of events. Compress
> the two report files (they'll be large) and send them to me offlist.

Can you also send me the output of xfs_info on the filesystem you
are testing?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
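The record/report workflow suggested in this message can be sketched as below. The output file names are assumptions, and the commands are echoed rather than executed so the sketch runs without root or a tracing-enabled kernel; drop the echoes to record for real:

```shell
# Record the xfs_iomap*/xfs_file*/xfs_get_blocks* events while the workload
# runs, then turn the binary trace into a text report and compress it.
REC="trace-cmd record -e 'xfs_iomap*' -e 'xfs_file*' -e 'xfs_get_blocks*' -o trace-48rc1.dat"
REP="trace-cmd report -i trace-48rc1.dat > trace-48rc1.txt"
echo "$REC"                    # capture events to trace-48rc1.dat
echo "$REP"                    # decode the binary trace to text
echo "xz -9 trace-48rc1.txt"   # compress before sending offlist
```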


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-10 Thread Dave Chinner
On Wed, Aug 10, 2016 at 05:33:20PM -0700, Huang, Ying wrote:
> Linus Torvalds  writes:
> 
> > On Wed, Aug 10, 2016 at 5:11 PM, Huang, Ying  wrote:
> >>
> >> Here is the comparison result with perf-profile data.
> >
> > Heh. The diff is actually harder to read than just showing A/B
> > state. The fact that the call chain shows up as part of the symbol
> > makes it even more so.
> >
> > For example:
> >
> >>   0.00 ± -1%  +Inf%   1.68 ±  1%  
> >> perf-profile.cycles-pp.__add_to_page_cache_locked.add_to_page_cache_lru.pagecache_get_page.grab_cache_page_write_begin.iomap_write_begin
> >>   1.80 ±  1%-100.0%   0.00 ± -1%  
> >> perf-profile.cycles-pp.__add_to_page_cache_locked.add_to_page_cache_lru.pagecache_get_page.grab_cache_page_write_begin.xfs_vm_write_begin
> >
> > Ok, so it went from 1.8% to 1.68%, and isn't actually that big of a
> > change, but it shows up as a big change because the caller changed
> > from xfs_vm_write_begin to iomap_write_begin.
> >
> > There's a few other cases of that too.
> >
> > So I think it would actually be easier to just see "what 20 functions
> > were the hottest" (or maybe 50) before and after separately (just
> > sorted by cycles), without the diff part. Because the diff is really
> > hard to read.
> 
> Here it is,
> 
> Before:
> 
>   "perf-profile.func.cycles-pp.intel_idle": 16.88,
>   "perf-profile.func.cycles-pp.copy_user_enhanced_fast_string": 3.94,
>   "perf-profile.func.cycles-pp.memset_erms": 3.26,
>   "perf-profile.func.cycles-pp.__block_commit_write.isra.24": 2.47,
>   "perf-profile.func.cycles-pp.___might_sleep": 2.33,
>   "perf-profile.func.cycles-pp.__mark_inode_dirty": 1.88,
>   "perf-profile.func.cycles-pp.unlock_page": 1.69,
>   "perf-profile.func.cycles-pp.up_write": 1.61,
>   "perf-profile.func.cycles-pp.__block_write_begin_int": 1.56,
>   "perf-profile.func.cycles-pp.down_write": 1.55,
>   "perf-profile.func.cycles-pp.mark_buffer_dirty": 1.53,
>   "perf-profile.func.cycles-pp.entry_SYSCALL_64_fastpath": 1.47,
>   "perf-profile.func.cycles-pp.generic_write_end": 1.36,
>   "perf-profile.func.cycles-pp.generic_perform_write": 1.33,
>   "perf-profile.func.cycles-pp.__radix_tree_lookup": 1.32,
>   "perf-profile.func.cycles-pp.__might_sleep": 1.26,
>   "perf-profile.func.cycles-pp._raw_spin_lock": 1.17,
>   "perf-profile.func.cycles-pp.vfs_write": 1.14,
>   "perf-profile.func.cycles-pp.__xfs_get_blocks": 1.07,

Ok, so that is the old block mapping call in the buffered IO path.
I don't see any of the functions it calls in the profile;
specifically xfs_bmapi_read(), and xfs_iomap_write_delay(), so it
appears the extent mapping and allocation overhead on the old code
totals somewhere under 2-3% of the entire CPU usage.

>   "perf-profile.func.cycles-pp.xfs_file_write_iter": 1.03,
>   "perf-profile.func.cycles-pp.pagecache_get_page": 1.03,
>   "perf-profile.func.cycles-pp.native_queued_spin_lock_slowpath": 0.98,
>   "perf-profile.func.cycles-pp.get_page_from_freelist": 0.94,
>   "perf-profile.func.cycles-pp.rwsem_spin_on_owner": 0.94,
>   "perf-profile.func.cycles-pp.__vfs_write": 0.87,
>   "perf-profile.func.cycles-pp.iov_iter_copy_from_user_atomic": 0.87,
>   "perf-profile.func.cycles-pp.xfs_file_buffered_aio_write": 0.84,
>   "perf-profile.func.cycles-pp.find_get_entry": 0.79,
>   "perf-profile.func.cycles-pp._raw_spin_lock_irqsave": 0.78,
> 
> 
> After:
> 
>   "perf-profile.func.cycles-pp.intel_idle": 16.82,
>   "perf-profile.func.cycles-pp.copy_user_enhanced_fast_string": 3.27,
>   "perf-profile.func.cycles-pp.memset_erms": 2.6,
>   "perf-profile.func.cycles-pp.xfs_bmapi_read": 2.24,

Straight away - that's at least 3x more block mapping lookup overhead
with the iomap code.

>   "perf-profile.func.cycles-pp.___might_sleep": 2.04,
>   "perf-profile.func.cycles-pp.mark_page_accessed": 1.93,
>   "perf-profile.func.cycles-pp.__block_write_begin_int": 1.78,
>   "perf-profile.func.cycles-pp.up_write": 1.72,
>   "perf-profile.func.cycles-pp.xfs_iext_bno_to_ext": 1.7,

Plus this child.

>   "perf-profile.func.cycles-pp.__block_commit_write.isra.24": 1.65,
>   "perf-profile.func.cycles-pp.down_write": 1.51,
>   "perf-profile.func.cycles-pp.__mark_inode_dirty": 1.51,
>   "perf-profile.func.cycles-pp.unlock_page": 1.43,
>   "perf-profile.func.cycles-pp.xfs_bmap_search_multi_extents": 1.25,
>   "perf-profile.func.cycles-pp.xfs_bmap_search_extents": 1.23,

And these two.

>   "perf-profile.func.cycles-pp.mark_buffer_dirty": 1.21,
>   "perf-profile.func.cycles-pp.xfs_iomap_write_delay": 1.19,
>   "perf-profile.func.cycles-pp.xfs_iomap_eof_want_preallocate.constprop.8": 1.15,

And these two.

So, essentially the old code had maybe 2-3% cpu usage overhead in
the block mapping path on this workload, but the new code is, for
some reason, showing at least 8-9% CPU usage overhead. That, right
now, makes no sense at all to me as we should be doing - at worst -
exactly the same number of block mapping calls as the old code.
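Summing the block-mapping functions from the "after" profile quoted earlier bears this out:

```python
# Tally the block-mapping/allocation functions from the "after" profile
# (percent of total cycles, new iomap write path) quoted in this thread.
after_mapping = {
    "xfs_bmapi_read": 2.24,
    "xfs_iext_bno_to_ext": 1.70,
    "xfs_bmap_search_multi_extents": 1.25,
    "xfs_bmap_search_extents": 1.23,
    "xfs_iomap_write_delay": 1.19,
    "xfs_iomap_eof_want_preallocate": 1.15,
}
total = sum(after_mapping.values())
print(f"new write path block mapping: {total:.2f}% of cycles")  # 8.76%, vs 2-3% before
```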

We need to know what is happening...

Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-10 Thread Linus Torvalds
On Wed, Aug 10, 2016 at 5:33 PM, Huang, Ying  wrote:
>
> Here it is,

Thanks.

Appended is a munged "after" list, with the "before" values in
parenthesis. It actually looks fairly similar.

The biggest difference is that we have "mark_page_accessed()" show up
after, and not before. There was also a lot of LRU noise in the
non-profile data. I wonder if that is the reason here: the old model
of using generic_perform_write/block_page_mkwrite didn't mark the
pages accessed, and now with iomap_file_buffered_write() they get
marked as active and that screws up the LRU list, and makes us not
flush out the dirty pages well (because they are seen as active and
not good for writeback), and then you get bad memory use.
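One way to test that hypothesis (a sketch, assuming a Linux /proc/meminfo) is to sample the file LRU split around the workload; if the iomap path marks freshly written pages accessed, Active(file) should grow much faster than it did with the old write path:

```shell
# Print the Active(file)/Inactive(file) split from /proc/meminfo so the
# two samples can be compared around the write-heavy run.
sample_file_lru() {
    awk '/^(Active|Inactive)\(file\)/ {printf "%s %s kB  ", $1, $2} END {print ""}' /proc/meminfo
}
sample_file_lru    # before the aim7 / xfs_io run
# ... run the write workload here ...
sample_file_lru    # after: compare the Active(file) growth
```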

I'm not seeing anything that looks like locking-related.

And I may well have screwed up that list munging. I should have
automated it more than I did.

Dave, Christoph?

Linus

---

   intel_idle                                   16.82   (16.88)
   copy_user_enhanced_fast_string                3.27   (3.94)
   memset_erms                                   2.6    (3.26)
   xfs_bmapi_read                                2.24
   ___might_sleep                                2.04   (2.33)
   mark_page_accessed                            1.93
   __block_write_begin_int                       1.78   (1.56)
   up_write                                      1.72   (1.61)
   xfs_iext_bno_to_ext                           1.7
   __block_commit_write.isra.24                  1.65   (2.47)
   down_write                                    1.51   (1.55)
   __mark_inode_dirty                            1.51   (1.88)
   unlock_page                                   1.43   (1.69)
   xfs_bmap_search_multi_extents                 1.25
   xfs_bmap_search_extents                       1.23
   mark_buffer_dirty                             1.21   (1.53)
   xfs_iomap_write_delay                         1.19
   xfs_iomap_eof_want_preallocate.constprop.8    1.15
   iomap_write_actor                             1.14
   __might_sleep                                 1.12   (1.26)
   __radix_tree_lookup                           1.08   (1.32)
   entry_SYSCALL_64_fastpath                     1.07   (1.47)
   pagecache_get_page                            0.95   (1.03)
   _raw_spin_lock                                0.95   (1.17)
   xfs_bmapi_delay                               0.93
   vfs_write                                     0.92   (1.14)
   xfs_file_write_iter                           0.86
