Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-09-26 Thread Huang, Ying
Christoph Hellwig  writes:

> Hi Ying,
>
>> Any update to this regression?
>
> Not really.  We've optimized everything we could in XFS without
> dropping the architecture that we really want to move to.  Now we're
> waiting for some MM behavior to be fixed that this uncovered.  But
> in the end we will probably be stuck with a slight regression in this
> artificial workload.

I see.  Thanks for the update.  Please keep me posted.

Best Regards,
Huang, Ying


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-09-26 Thread Christoph Hellwig
Hi Ying,

> Any update to this regression?

Not really.  We've optimized everything we could in XFS without
dropping the architecture that we really want to move to.  Now we're
waiting for some MM behavior to be fixed that this uncovered.  But
in the end we will probably be stuck with a slight regression in this
artificial workload.


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-09-26 Thread Huang, Ying
Hi, Christoph,

"Huang, Ying"  writes:

> Hi, Christoph,
>
> "Huang, Ying"  writes:
>
>> Christoph Hellwig  writes:
>>
>>> Snipping the long contest:
>>>
>>> I think there are three observations here:
>>>
>>>  (1) removing the mark_page_accessed (which is the only significant
>>>  change in the parent commit)  hurts the
>>>  aim7/1BRD_48G-xfs-disk_rr-3000-performance/ivb44 test.
>>>  I'd still rather stick to the filemap version and let the
>>>  VM people sort it out.  How do the numbers for this test
>>>  look for XFS vs say ext4 and btrfs?
>>>  (2) lots of additional spinlock contention in the new case.  A quick
>>>  check shows that I fat-fingered my rewrite so that we do
>>>  the xfs_inode_set_eofblocks_tag call now for the pure lookup
>>>  case, and pretty much all new cycles come from that.
>>>  (3) Boy, are those xfs_inode_set_eofblocks_tag calls expensive, and
>>>  we're already doing way too many even without my little bug above.
>>>
>>> So I've force pushed a new version of the iomap-fixes branch with
>>> (2) fixed, and also a little patch to make xfs_inode_set_eofblocks_tag a
>>> lot less expensive slotted in before that.  Would be good to see
>>> the numbers with that.
>>
>> For the original reported regression, the test result is as follows,
>>
>> =
>> compiler/cpufreq_governor/debug-setup/disk/fs/kconfig/load/rootfs/tbox_group/test/testcase:
>>   
>> gcc-6/performance/profile/1BRD_48G/xfs/x86_64-rhel/3000/debian-x86_64-2015-02-07.cgz/ivb44/disk_wrt/aim7
>>
>> commit: 
>>   f0c6bcba74ac51cb77aadb33ad35cb2dc1ad1506 (parent of first bad commit)
>>   68a9f5e7007c1afa2cf6830b690a90d0187c0684 (first bad commit)
>>   99091700659f4df965e138b38b4fa26a29b7eade (base of your fixes branch)
>>   bf4dc6e4ecc2a3d042029319bc8cd4204c185610 (head of your fixes branch)
>>
>> f0c6bcba74ac51cb            68a9f5e7007c1afa2cf6830b69   99091700659f4df965e138b38b   bf4dc6e4ecc2a3d042029319bc
>> --------------------------  ---------------------------  ---------------------------  --------------------------
>>          %stddev                %change      %stddev        %change      %stddev         %change      %stddev
>>      484435 ±  0%               -13.3%  420004 ±  0%        -17.0%  402250 ±  0%         -15.6%  408998 ±  0%    aim7.jobs-per-min
>
> It appears the original reported regression hasn't been resolved by your
> commit.  Could you take a look at the test results and the perf data?

Any update to this regression?

Best Regards,
Huang, Ying

>>
>> And the perf data is as follows,
>>
>>   "perf-profile.func.cycles-pp.intel_idle": 20.25,
>>   "perf-profile.func.cycles-pp.memset_erms": 11.72,
>>   "perf-profile.func.cycles-pp.copy_user_enhanced_fast_string": 8.37,
>>   "perf-profile.func.cycles-pp.__block_commit_write.isra.21": 3.49,
>>   "perf-profile.func.cycles-pp.block_write_end": 1.77,
>>   "perf-profile.func.cycles-pp.native_queued_spin_lock_slowpath": 1.63,
>>   "perf-profile.func.cycles-pp.unlock_page": 1.58,
>>   "perf-profile.func.cycles-pp.___might_sleep": 1.56,
>>   "perf-profile.func.cycles-pp.__block_write_begin_int": 1.33,
>>   "perf-profile.func.cycles-pp.iov_iter_copy_from_user_atomic": 1.23,
>>   "perf-profile.func.cycles-pp.up_write": 1.21,
>>   "perf-profile.func.cycles-pp.__mark_inode_dirty": 1.18,
>>   "perf-profile.func.cycles-pp.down_write": 1.06,
>>   "perf-profile.func.cycles-pp.mark_buffer_dirty": 0.94,
>>   "perf-profile.func.cycles-pp.generic_write_end": 0.92,
>>   "perf-profile.func.cycles-pp.__radix_tree_lookup": 0.91,
>>   "perf-profile.func.cycles-pp._raw_spin_lock": 0.81,
>>   "perf-profile.func.cycles-pp.entry_SYSCALL_64_fastpath": 0.79,
>>   "perf-profile.func.cycles-pp.__might_sleep": 0.79,
>>   "perf-profile.func.cycles-pp.xfs_file_iomap_begin_delay.isra.9": 0.7,
>>   "perf-profile.func.cycles-pp.__list_del_entry": 0.7,
>>   "perf-profile.func.cycles-pp.vfs_write": 0.69,
>>   "perf-profile.func.cycles-pp.drop_buffers": 0.68,
>>   "perf-profile.func.cycles-pp.xfs_file_write_iter": 0.67,
>>   "perf-profile.func.cycles-pp.rwsem_spin_on_owner": 0.67,
>>
>> Best Regards,
>> Huang, Ying
>> ___
>> LKP mailing list
>> l...@lists.01.org
>> https://lists.01.org/mailman/listinfo/lkp


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-09-06 Thread Huang, Ying
Mel Gorman  writes:

> On Fri, Sep 02, 2016 at 09:32:58AM +1000, Dave Chinner wrote:
>> On Fri, Aug 19, 2016 at 04:08:34PM +0100, Mel Gorman wrote:
>> > On Thu, Aug 18, 2016 at 05:11:11PM +1000, Dave Chinner wrote:
>> > > On Thu, Aug 18, 2016 at 01:45:17AM +0100, Mel Gorman wrote:
>> > > > On Wed, Aug 17, 2016 at 04:49:07PM +0100, Mel Gorman wrote:
>> > > > > > Yes, we could try to batch the locking like DaveC already suggested
>> > > > > > (ie we could move the locking to the caller, and then make
>> > > > > > shrink_page_list() just try to keep the lock held for a few pages 
>> > > > > > if
>> > > > > > the mapping doesn't change), and that might result in fewer crazy
>> > > > > > cacheline ping-pongs overall. But that feels like exactly the wrong
>> > > > > > kind of workaround.
>> > > > > > 
>> > > > > 
>> > > > > Even if such batching was implemented, it would be very specific to 
>> > > > > the
>> > > > > case of a single large file filling LRUs on multiple nodes.
>> > > > > 
>> > > > 
>> > > > The latest Jason Bourne movie was sufficiently bad that I spent time
>> > > > thinking how the tree_lock could be batched during reclaim. It's not
>> > > > straight-forward but this prototype did not blow up on UMA and may be
>> > > > worth considering if Dave can test either approach has a positive 
>> > > > impact.
>> > > 
>> > > SO, I just did a couple of tests. I'll call the two patches "sleepy"
>> > > for the contention backoff patch and "bourney" for the Jason Bourne
>> > > inspired batching patch. This is an average of 3 runs, overwriting
>> > > a 47GB file on a machine with 16GB RAM:
>> > > 
>> > >  IO throughput   wall time __pv_queued_spin_lock_slowpath
>> > > vanilla  470MB/s 1m42s   25-30%
>> > > sleepy   295MB/s 2m43s   <1%
>> > > bourney  425MB/s 1m53s   25-30%
>> > > 
>> > 
>> > This is another blunt-force patch that
>> 
>> Sorry for taking so long to get back to this - had a bunch of other
>> stuff to do (e.g. XFS metadata CRCs have found their first compiler
>> bug) and haven't had time to test this.
>> 
>
> No problem. Thanks for getting back to me.
>
>> The blunt force approach seems to work ok:
>> 
>
> Ok, good to know. Unfortunately I found that it's not a universal win. For
> the swapping-to-fast-storage case (simulated with ramdisk), the batching is
> a bigger gain *except* in the single threaded case. Stalling kswapd in the
> "blunt force approach" severely regressed a streaming anonymous reader
> for all thread counts so it's not the right answer.
>
> I'm working on a series during spare time that tries to balance all the
> issues for either swapcache and filecache on different workloads but right
> now, the complexity is high and it's still "win some, lose some".
>
> As an aside for the LKP people using ramdisk for swap -- ramdisk considers
> itself to be rotational storage. It takes the paths that are optimised to
> minimise seeks but it's quite slow. When tree_lock contention is reduced,
> workload is dominated by scan_swap_map. It's a one-line fix and I have
> a patch for it but it only really matters if ramdisk is being used as a
> simulator for swapping to fast storage.

We (LKP people) use drivers/nvdimm/pmem.c instead of drivers/block/brd.c
as the ramdisk, which considers itself to be non-rotational storage.

And we have a series to optimize other locks in the swap path too, for
example batching swap space allocation and freeing, etc.  If your
solution for batching the removal of pages from the swap cache can be
merged, that will help us a lot!
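
For illustration only (this is not that series; the structure and the
get_swap_pages_batch() helper below are hypothetical), the allocation side
of such batching could look roughly like a per-CPU cache of swap slots that
refills under the swap locks once per batch instead of once per page:

#define SWAP_SLOT_BATCH 64

struct swap_slot_cache {
        int             nr;                      /* slots left in this CPU's cache */
        swp_entry_t     slots[SWAP_SLOT_BATCH];  /* pre-allocated swap entries */
};
static DEFINE_PER_CPU(struct swap_slot_cache, swap_slot_cache);

static swp_entry_t get_swap_page_batched(void)
{
        struct swap_slot_cache *cache = get_cpu_ptr(&swap_slot_cache);
        swp_entry_t entry = (swp_entry_t) { .val = 0 };

        /* the refill takes the swap locks once per SWAP_SLOT_BATCH pages */
        if (!cache->nr)
                cache->nr = get_swap_pages_batch(cache->slots, SWAP_SLOT_BATCH);
        if (cache->nr)
                entry = cache->slots[--cache->nr];
        put_cpu_ptr(&swap_slot_cache);
        return entry;
}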

Best Regards,
Huang, Ying


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-09-06 Thread Mel Gorman
On Fri, Sep 02, 2016 at 09:32:58AM +1000, Dave Chinner wrote:
> On Fri, Aug 19, 2016 at 04:08:34PM +0100, Mel Gorman wrote:
> > On Thu, Aug 18, 2016 at 05:11:11PM +1000, Dave Chinner wrote:
> > > On Thu, Aug 18, 2016 at 01:45:17AM +0100, Mel Gorman wrote:
> > > > On Wed, Aug 17, 2016 at 04:49:07PM +0100, Mel Gorman wrote:
> > > > > > Yes, we could try to batch the locking like DaveC already suggested
> > > > > > (ie we could move the locking to the caller, and then make
> > > > > > shrink_page_list() just try to keep the lock held for a few pages if
> > > > > > the mapping doesn't change), and that might result in fewer crazy
> > > > > > cacheline ping-pongs overall. But that feels like exactly the wrong
> > > > > > kind of workaround.
> > > > > > 
> > > > > 
> > > > > Even if such batching was implemented, it would be very specific to 
> > > > > the
> > > > > case of a single large file filling LRUs on multiple nodes.
> > > > > 
> > > > 
> > > > The latest Jason Bourne movie was sufficiently bad that I spent time
> > > > thinking how the tree_lock could be batched during reclaim. It's not
> > > > straight-forward but this prototype did not blow up on UMA and may be
> > > > worth considering if Dave can test either approach has a positive 
> > > > impact.
> > > 
> > > SO, I just did a couple of tests. I'll call the two patches "sleepy"
> > > for the contention backoff patch and "bourney" for the Jason Bourne
> > > inspired batching patch. This is an average of 3 runs, overwriting
> > > a 47GB file on a machine with 16GB RAM:
> > > 
> > >   IO throughput   wall time __pv_queued_spin_lock_slowpath
> > > vanilla   470MB/s 1m42s   25-30%
> > > sleepy295MB/s 2m43s   <1%
> > > bourney   425MB/s 1m53s   25-30%
> > > 
> > 
> > This is another blunt-force patch that
> 
> Sorry for taking so long to get back to this - had a bunch of other
> stuff to do (e.g. XFS metadata CRCs have found their first compiler
> bug) and haven't had time to test this.
> 

No problem. Thanks for getting back to me.

> The blunt force approach seems to work ok:
> 

Ok, good to know. Unfortunately I found that it's not a universal win. For
the swapping-to-fast-storage case (simulated with ramdisk), the batching is
a bigger gain *except* in the single threaded case. Stalling kswapd in the
"blunt force approach" severely regressed a streaming anonymous reader
for all thread counts so it's not the right answer.

I'm working on a series during spare time that tries to balance all the
issues for either swapcache and filecache on different workloads but right
now, the complexity is high and it's still "win some, lose some".

As an aside for the LKP people using ramdisk for swap -- ramdisk considers
itself to be rotational storage. It takes the paths that are optimised to
minimise seeks but it's quite slow. When tree_lock contention is reduced,
workload is dominated by scan_swap_map. It's a one-line fix and I have
a patch for it but it only really matters if ramdisk is being used as a
simulator for swapping to fast storage.
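
The one-liner itself isn't included in the thread; presumably it amounts to
marking the brd queue non-rotational, so that swapon sets SWP_SOLIDSTATE and
swap allocation takes the SSD-style cluster path rather than the
seek-minimising one.  A hypothetical sketch (function and field names here
are assumptions, not the actual patch):

        /* in brd's queue setup, advertise the ramdisk as non-rotational
         * so the swap allocator stops optimising for seeks */
        queue_flag_set_unlocked(QUEUE_FLAG_NONROT, brd->brd_queue);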

-- 
Mel Gorman
SUSE Labs


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-09-01 Thread Dave Chinner
On Fri, Aug 19, 2016 at 04:08:34PM +0100, Mel Gorman wrote:
> On Thu, Aug 18, 2016 at 05:11:11PM +1000, Dave Chinner wrote:
> > On Thu, Aug 18, 2016 at 01:45:17AM +0100, Mel Gorman wrote:
> > > On Wed, Aug 17, 2016 at 04:49:07PM +0100, Mel Gorman wrote:
> > > > > Yes, we could try to batch the locking like DaveC already suggested
> > > > > (ie we could move the locking to the caller, and then make
> > > > > shrink_page_list() just try to keep the lock held for a few pages if
> > > > > the mapping doesn't change), and that might result in fewer crazy
> > > > > cacheline ping-pongs overall. But that feels like exactly the wrong
> > > > > kind of workaround.
> > > > > 
> > > > 
> > > > Even if such batching was implemented, it would be very specific to the
> > > > case of a single large file filling LRUs on multiple nodes.
> > > > 
> > > 
> > > The latest Jason Bourne movie was sufficiently bad that I spent time
> > > thinking how the tree_lock could be batched during reclaim. It's not
> > > straight-forward but this prototype did not blow up on UMA and may be
> > > worth considering if Dave can test either approach has a positive impact.
> > 
> > SO, I just did a couple of tests. I'll call the two patches "sleepy"
> > for the contention backoff patch and "bourney" for the Jason Bourne
> > inspired batching patch. This is an average of 3 runs, overwriting
> > a 47GB file on a machine with 16GB RAM:
> > 
> > IO throughput   wall time __pv_queued_spin_lock_slowpath
> > vanilla 470MB/s 1m42s   25-30%
> > sleepy  295MB/s 2m43s   <1%
> > bourney 425MB/s 1m53s   25-30%
> > 
> 
> This is another blunt-force patch that

Sorry for taking so long to get back to this - had a bunch of other
stuff to do (e.g. XFS metadata CRCs have found their first compiler
bug) and haven't had time to test this.

The blunt force approach seems to work ok:

IO throughput   wall time __pv_queued_spin_lock_slowpath
vanilla 470MB/s 1m42s   25-30%
sleepy  295MB/s 2m43s   <1%
bourney 425MB/s 1m53s   25-30%
blunt   470MB/s 1m41s   ~2%

Performance is pretty much the same as the vanilla kernel - maybe
a little bit faster if we consider median rather than mean results.

A snapshot profile from 'perf top -U' looks like:

  11.31%  [kernel]  [k] copy_user_generic_string
   3.59%  [kernel]  [k] get_page_from_freelist
   3.22%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
   2.80%  [kernel]  [k] __block_commit_write.isra.29
   2.14%  [kernel]  [k] __pv_queued_spin_lock_slowpath
   1.99%  [kernel]  [k] _raw_spin_lock
   1.98%  [kernel]  [k] wake_all_kswapds
   1.92%  [kernel]  [k] _raw_spin_lock_irqsave
   1.90%  [kernel]  [k] node_dirty_ok
   1.69%  [kernel]  [k] __wake_up_bit
   1.57%  [kernel]  [k] ___might_sleep
   1.49%  [kernel]  [k] __might_sleep
   1.24%  [kernel]  [k] __radix_tree_lookup
   1.18%  [kernel]  [k] kmem_cache_alloc
   1.13%  [kernel]  [k] update_fast_ctr
   1.11%  [kernel]  [k] radix_tree_tag_set
   1.08%  [kernel]  [k] clear_page_dirty_for_io
   1.06%  [kernel]  [k] down_write
   1.06%  [kernel]  [k] up_write
   1.01%  [kernel]  [k] unlock_page
   0.99%  [kernel]  [k] xfs_log_commit_cil
   0.97%  [kernel]  [k] __inc_node_state
   0.95%  [kernel]  [k] __memset
   0.89%  [kernel]  [k] xfs_do_writepage
   0.89%  [kernel]  [k] __list_del_entry
   0.87%  [kernel]  [k] __vfs_write
   0.85%  [kernel]  [k] xfs_inode_item_format
   0.84%  [kernel]  [k] shrink_page_list
   0.82%  [kernel]  [k] kmem_cache_free
   0.79%  [kernel]  [k] radix_tree_tag_clear
   0.78%  [kernel]  [k] _raw_spin_lock_irq
   0.77%  [kernel]  [k] _raw_spin_unlock_irqrestore
   0.76%  [kernel]  [k] node_page_state
   0.72%  [kernel]  [k] xfs_count_page_state
   0.68%  [kernel]  [k] xfs_file_aio_write_checks
   0.65%  [kernel]  [k] wakeup_kswapd

There's still a lot of time in locking, but it's no longer obviously
being spent by spinning contention. We seem to be spending a lot of
time trying to wake kswapds now - the context switch rate of the
workload is only 400-500/s, so there aren't a lot of sleeps and
wakeups actually occurring.

Regardless, throughput and locking behaviour seem to be a lot
better than the other patches...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-25 Thread Mel Gorman
On Wed, Aug 24, 2016 at 08:40:37AM -0700, Huang, Ying wrote:
> Mel Gorman  writes:
> 
> > On Wed, Aug 17, 2016 at 04:49:07PM +0100, Mel Gorman wrote:
> >> > Yes, we could try to batch the locking like DaveC already suggested
> >> > (ie we could move the locking to the caller, and then make
> >> > shrink_page_list() just try to keep the lock held for a few pages if
> >> > the mapping doesn't change), and that might result in fewer crazy
> >> > cacheline ping-pongs overall. But that feels like exactly the wrong
> >> > kind of workaround.
> >> > 
> >> 
> >> Even if such batching was implemented, it would be very specific to the
> >> case of a single large file filling LRUs on multiple nodes.
> >> 
> >
> > The latest Jason Bourne movie was sufficiently bad that I spent time
> > thinking how the tree_lock could be batched during reclaim. It's not
> > straight-forward but this prototype did not blow up on UMA and may be
> > worth considering if Dave can test either approach has a positive impact.
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 374d95d04178..926110219cd9 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -621,19 +621,39 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
> > return PAGE_CLEAN;
> >  }
> 
> We found this patch helps swap out performance a lot, where there is
> usually only one mapping for all swap pages.

Yeah, I expected it would be an unconditional win on swapping. I just
did not concentrate on it very much as it was not the problem at hand.

> In our 16 processes
> sequential swap write test case for a ramdisk on a Xeon E5 v3 machine,
> the swap out throughput improved 40.4%, from ~0.97GB/s to ~1.36GB/s.

Ok, so the main benefit would be for ultra-fast storage. I doubt it's noticeable
on slow disks.

> What's your plan for this patch?  If it can be merged soon, that will be
> great!
> 

Until this mail, no plan. I'm still waiting to hear if Dave's test case
has improved with the latest prototype for reducing contention.

> I found some issues in the original patch when working with the swap cache.  Below
> are my fixes to make it work for swap cache.
> 

Thanks for the fix. I'm going offline today for a few days but I added a
todo item to finish this patch at some point. I won't be rushing it but
it'll get done eventually.

-- 
Mel Gorman
SUSE Labs


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-24 Thread Huang, Ying
Hi, Mel,

Mel Gorman  writes:

> On Wed, Aug 17, 2016 at 04:49:07PM +0100, Mel Gorman wrote:
>> > Yes, we could try to batch the locking like DaveC already suggested
>> > (ie we could move the locking to the caller, and then make
>> > shrink_page_list() just try to keep the lock held for a few pages if
>> > the mapping doesn't change), and that might result in fewer crazy
>> > cacheline ping-pongs overall. But that feels like exactly the wrong
>> > kind of workaround.
>> > 
>> 
>> Even if such batching was implemented, it would be very specific to the
>> case of a single large file filling LRUs on multiple nodes.
>> 
>
> The latest Jason Bourne movie was sufficiently bad that I spent time
> thinking how the tree_lock could be batched during reclaim. It's not
> straight-forward but this prototype did not blow up on UMA and may be
> worth considering if Dave can test either approach has a positive impact.
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 374d95d04178..926110219cd9 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -621,19 +621,39 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
>   return PAGE_CLEAN;
>  }

We found this patch helps swap out performance a lot, where there is
usually only one mapping for all swap pages.  In our 16 processes
sequential swap write test case for a ramdisk on a Xeon E5 v3 machine,
the swap out throughput improved 40.4%, from ~0.97GB/s to ~1.36GB/s.
What's your plan for this patch?  If it can be merged soon, that will be
great!

I found some issues in the original patch when working with the swap cache.  Below
are my fixes to make it work for swap cache.

Best Regards,
Huang, Ying

>

diff --git a/mm/vmscan.c b/mm/vmscan.c
index ac5fbff..dcaf295 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -623,22 +623,28 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
 
 static void finalise_remove_mapping(struct list_head *swapcache,
                                     struct list_head *filecache,
+                                    struct list_head *free_pages,
                                     void (*freepage)(struct page *))
 {
         struct page *page;
 
         while (!list_empty(swapcache)) {
-                swp_entry_t swap = { .val = page_private(page) };
+                swp_entry_t swap;
                 page = lru_to_page(swapcache);
                 list_del(&page->lru);
+                swap.val = page_private(page);
                 swapcache_free(swap);
                 set_page_private(page, 0);
+                if (free_pages)
+                        list_add(&page->lru, free_pages);
         }
 
         while (!list_empty(filecache)) {
-                page = lru_to_page(swapcache);
+                page = lru_to_page(filecache);
                 list_del(&page->lru);
                 freepage(page);
+                if (free_pages)
+                        list_add(&page->lru, free_pages);
         }
 }
 
@@ -649,7 +655,8 @@ static void finalise_remove_mapping(struct list_head *swapcache,
 static int __remove_mapping_page(struct address_space *mapping,
                                  struct page *page, bool reclaimed,
                                  struct list_head *swapcache,
-                                 struct list_head *filecache)
+                                 struct list_head *filecache,
+                                 struct list_head *free_pages)
 {
         BUG_ON(!PageLocked(page));
         BUG_ON(mapping != page_mapping(page));
@@ -722,6 +729,8 @@ static int __remove_mapping_page(struct address_space *mapping,
                 __delete_from_page_cache(page, shadow);
                 if (freepage)
                         list_add(&page->lru, filecache);
+                else if (free_pages)
+                        list_add(&page->lru, free_pages);
         }
 
         return 1;
@@ -747,7 +756,7 @@ int remove_mapping(struct address_space *mapping, struct page *page)
         spin_lock_irqsave(&mapping->tree_lock, flags);
         freepage = mapping->a_ops->freepage;
 
-        if (__remove_mapping_page(mapping, page, false, &swapcache, &filecache)) {
+        if (__remove_mapping_page(mapping, page, false, &swapcache, &filecache, NULL)) {
                 /*
                  * Unfreezing the refcount with 1 rather than 2 effectively
                  * drops the pagecache ref for us without requiring another
@@ -757,7 +766,7 @@ int remove_mapping(struct address_space *mapping, struct page *page)
                 ret = 1;
         }
         spin_unlock_irqrestore(&mapping->tree_lock, flags);
-        finalise_remove_mapping(&swapcache, &filecache, freepage);
+        finalise_remove_mapping(&swapcache, &filecache, NULL, freepage);
         return ret;
 }
 
@@ -776,29 +785,28 @@ static void remove_mapping_list(struct list_head *mapping_list,
                 page = lru_to_page(mapping_list);
                 list_del(&page->lru);
 
-                if (!mapping || page->mapping != mapping) {
+                if (!mapping || page_mapping(page) != mapping) {

Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-22 Thread Huang, Ying
Hi, Christoph,

"Huang, Ying"  writes:

> Christoph Hellwig  writes:
>
>> Snipping the long contest:
>>
>> I think there are three observations here:
>>
>>  (1) removing the mark_page_accessed (which is the only significant
>>  change in the parent commit)  hurts the
>>  aim7/1BRD_48G-xfs-disk_rr-3000-performance/ivb44 test.
>>  I'd still rather stick to the filemap version and let the
>>  VM people sort it out.  How do the numbers for this test
>>  look for XFS vs say ext4 and btrfs?
>>  (2) lots of additional spinlock contention in the new case.  A quick
>>  check shows that I fat-fingered my rewrite so that we do
>>  the xfs_inode_set_eofblocks_tag call now for the pure lookup
>>  case, and pretty much all new cycles come from that.
>>  (3) Boy, are those xfs_inode_set_eofblocks_tag calls expensive, and
>>  we're already doing way too many even without my little bug above.
>>
>> So I've force pushed a new version of the iomap-fixes branch with
>> (2) fixed, and also a little patch to make xfs_inode_set_eofblocks_tag a
>> lot less expensive slotted in before that.  Would be good to see
>> the numbers with that.
>
> For the original reported regression, the test result is as follows,
>
> =
> compiler/cpufreq_governor/debug-setup/disk/fs/kconfig/load/rootfs/tbox_group/test/testcase:
>   
> gcc-6/performance/profile/1BRD_48G/xfs/x86_64-rhel/3000/debian-x86_64-2015-02-07.cgz/ivb44/disk_wrt/aim7
>
> commit: 
>   f0c6bcba74ac51cb77aadb33ad35cb2dc1ad1506 (parent of first bad commit)
>   68a9f5e7007c1afa2cf6830b690a90d0187c0684 (first bad commit)
>   99091700659f4df965e138b38b4fa26a29b7eade (base of your fixes branch)
>   bf4dc6e4ecc2a3d042029319bc8cd4204c185610 (head of your fixes branch)
>
> f0c6bcba74ac51cb            68a9f5e7007c1afa2cf6830b69   99091700659f4df965e138b38b   bf4dc6e4ecc2a3d042029319bc
> --------------------------  ---------------------------  ---------------------------  --------------------------
>          %stddev                %change      %stddev        %change      %stddev         %change      %stddev
>      484435 ±  0%               -13.3%  420004 ±  0%        -17.0%  402250 ±  0%         -15.6%  408998 ±  0%    aim7.jobs-per-min

It appears the original reported regression hasn't been resolved by your
commit.  Could you take a look at the test results and the perf data?

Best Regards,
Huang, Ying

>
> And the perf data is as follows,
>
>   "perf-profile.func.cycles-pp.intel_idle": 20.25,
>   "perf-profile.func.cycles-pp.memset_erms": 11.72,
>   "perf-profile.func.cycles-pp.copy_user_enhanced_fast_string": 8.37,
>   "perf-profile.func.cycles-pp.__block_commit_write.isra.21": 3.49,
>   "perf-profile.func.cycles-pp.block_write_end": 1.77,
>   "perf-profile.func.cycles-pp.native_queued_spin_lock_slowpath": 1.63,
>   "perf-profile.func.cycles-pp.unlock_page": 1.58,
>   "perf-profile.func.cycles-pp.___might_sleep": 1.56,
>   "perf-profile.func.cycles-pp.__block_write_begin_int": 1.33,
>   "perf-profile.func.cycles-pp.iov_iter_copy_from_user_atomic": 1.23,
>   "perf-profile.func.cycles-pp.up_write": 1.21,
>   "perf-profile.func.cycles-pp.__mark_inode_dirty": 1.18,
>   "perf-profile.func.cycles-pp.down_write": 1.06,
>   "perf-profile.func.cycles-pp.mark_buffer_dirty": 0.94,
>   "perf-profile.func.cycles-pp.generic_write_end": 0.92,
>   "perf-profile.func.cycles-pp.__radix_tree_lookup": 0.91,
>   "perf-profile.func.cycles-pp._raw_spin_lock": 0.81,
>   "perf-profile.func.cycles-pp.entry_SYSCALL_64_fastpath": 0.79,
>   "perf-profile.func.cycles-pp.__might_sleep": 0.79,
>   "perf-profile.func.cycles-pp.xfs_file_iomap_begin_delay.isra.9": 0.7,
>   "perf-profile.func.cycles-pp.__list_del_entry": 0.7,
>   "perf-profile.func.cycles-pp.vfs_write": 0.69,
>   "perf-profile.func.cycles-pp.drop_buffers": 0.68,
>   "perf-profile.func.cycles-pp.xfs_file_write_iter": 0.67,
>   "perf-profile.func.cycles-pp.rwsem_spin_on_owner": 0.67,
>
> Best Regards,
> Huang, Ying
> ___
> LKP mailing list
> l...@lists.01.org
> https://lists.01.org/mailman/listinfo/lkp


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-20 Thread Mel Gorman
On Sat, Aug 20, 2016 at 09:48:39AM +1000, Dave Chinner wrote:
> On Fri, Aug 19, 2016 at 11:49:46AM +0100, Mel Gorman wrote:
> > On Thu, Aug 18, 2016 at 03:25:40PM -0700, Linus Torvalds wrote:
> > > It *could* be as simple/stupid as just saying "let's allocate the page
> > > cache for new pages from the current node" - and if the process that
> > > dirties pages just stays around on one single node, that might already
> > > be sufficient.
> > > 
> > > So just for testing purposes, you could try changing that
> > > 
> > > return alloc_pages(gfp, 0);
> > > 
> > > in __page_cache_alloc() into something like
> > > 
> > > return alloc_pages_node(cpu_to_node(raw_smp_processor_id()), gfp, 0);
> > > 
> > > or something.
> > > 
> > 
> > The test would be interesting but I believe that keeping heavy writers
> > on one node will force them to stall early on dirty balancing even if
> > there is plenty of free memory on other nodes.
> 
> Well, it depends on the speed of the storage. The higher the speed
> of the storage, the less we care about stalling on dirty pages
> during reclaim. i.e. faster storage == shorter stalls. We really
> should stop thinking we need to optimise reclaim purely for the
> benefit of slow disks.  500MB/s write speed with latencies of
> under a couple of milliseconds is common hardware these days. pcie
> based storage (e.g. m2, nvme) is rapidly becoming commonplace and
> they can easily do 1-2GB/s write speeds.
> 

I partially agree. I've been of the opinion for a long time that a dirty_time
limit would be desirable: cap the amount of dirty data by the time required
to sync it, with a default of something like 5 seconds. It's non-trivial, as
the write speed of all BDIs would have to be estimated, and on rotating
storage the estimate would be unreliable.
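At the 500MB/s mentioned in the quote above, for example, a 5-second budget
would cap dirty data at roughly 2.5GB, no matter how much RAM the machine has.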

A short-term practical idea would be to distribute pages for writing
only when the dirty limit is almost reached on a given node. For fast
storage, the distribution may never happen.
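
For reference, the testing hack Linus suggests above, combined with that
idea, would look roughly like the sketch below in __page_cache_alloc()
(simplified: the real function also handles cpuset memory spreading;
node_dirty_ok() is the existing helper visible in the profiles, and the way
it gates the fallback here is illustrative only, not a proposed patch):

struct page *__page_cache_alloc(gfp_t gfp)
{
        int nid = numa_node_id();

        /*
         * Prefer the local node for new page cache so a single writer
         * does not spread dirty pages across every node's LRU ...
         */
        if (node_dirty_ok(NODE_DATA(nid)))
                return alloc_pages_node(nid, gfp, 0);

        /*
         * ... but once the local node is close to its dirty limits,
         * fall back to the default policy so the writer is not
         * throttled while other nodes still have clean, free memory.
         */
        return alloc_pages(gfp, 0);
}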

Neither idea would actually impact the current problem though unless it
was combined with discarding clean cache aggressively if the underlying
storage is fast. Hence, it would still be nice if the contention problem
could be mitigated. Did that last patch help any?

-- 
Mel Gorman
SUSE Labs


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-19 Thread Linus Torvalds
On Fri, Aug 19, 2016 at 4:48 PM, Dave Chinner  wrote:
>
> Well, it depends on the speed of the storage. The higher the speed
> of the storage, the less we care about stalling on dirty pages
> during reclaim

Actually, that's largely true independently of the speed of the storage, I feel.

On really fast storage, you might as well push it out, and buffering
lots of dirty memory is pointless. And on really slow storage, buffering
lots of dirty memory is absolutely *horrible* from a latency
standpoint.

So I don't think this is about fast-vs-slow disks.

I think a lot of our "let's aggressively buffer dirty data" is
entirely historical. When you had 16MB of RAM in a workstation,
aggressively using half of it for writeback caches meant that you
could do things like untar source trees without waiting for IO.

But when you have 16GB of RAM in a workstation, and terabytes of RAM
in multi-node big machines, it's kind of silly to talk about
"percentages of memory available" for dirty data. I think it's likely
silly to even see "one node worth of memory" as being some limiter.

So I think we should try to avoid stalling on dirty pages during
reclaim by simply aiming to have fewer dirty pages in the first place.
Not because the stall is shorter on a fast disk, but because we just
shouldn't use that much memory for dirty data.

 Linus


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-19 Thread Dave Chinner
On Fri, Aug 19, 2016 at 11:49:46AM +0100, Mel Gorman wrote:
> On Thu, Aug 18, 2016 at 03:25:40PM -0700, Linus Torvalds wrote:
> > It *could* be as simple/stupid as just saying "let's allocate the page
> > cache for new pages from the current node" - and if the process that
> > dirties pages just stays around on one single node, that might already
> > be sufficient.
> > 
> > So just for testing purposes, you could try changing that
> > 
> > return alloc_pages(gfp, 0);
> > 
> > in __page_cache_alloc() into something like
> > 
> > return alloc_pages_node(cpu_to_node(raw_smp_processor_id())), gfp, 
> > 0);
> > 
> > or something.
> > 
> 
> The test would be interesting but I believe that keeping heavy writers
> on one node will force them to stall early on dirty balancing even if
> there is plenty of free memory on other nodes.

Well, it depends on the speed of the storage. The higher the speed
of the storage, the less we care about stalling on dirty pages
during reclaim. i.e. faster storage == shorter stalls. We really
should stop thinking we need to optimise reclaim purely for the
benefit of slow disks.  500MB/s write speed with latencies of
under a couple of milliseconds is common hardware these days. pcie
based storage (e.g. m2, nvme) is rapidly becoming commonplace and
they can easily do 1-2GB/s write speeds.

The fast storage devices that are arriving need to be treated
more like a fast network device (e.g. a pci-e 4x nvme SSD has the
throughput of 2x10GbE devices). We have to consider if buffering
streaming data in the page cache for any longer than it takes to get
the data to userspace or to disk is worth the cost of reclaiming it
from the page cache.

Really, the question that needs to be answered is this: if we can
pull data from the storage at similar speeds and latencies as we can
from the page cache, then *why are we caching that data*?

We've already made that "don't cache for fast storage" decision in
the case of pmem - the DAX IO path is slowly moving towards making
full use of the mapping infrastructure for all its tracking
requirements. pcie based storage is a bit slower than pmem, but
the principle is the same - the storage is sufficiently fast that
caching only really makes sense for data that is really hot...

I think the underlying principle here is that the faster the backing
device, the less we should cache and buffer the device in the OS. I
suspect a good initial approximation of "stickiness" for the page
cache would be the speed of writeback as measured by the BDI underlying
the mapping
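
A minimal sketch of that heuristic, assuming a per-BDI write bandwidth
estimate is available as a plain number; the reference bandwidth, the clamp
values and the function name are invented for illustration:

#include <stdio.h>

/*
 * Toy "stickiness" heuristic: the faster the backing device writes back,
 * the shorter the time its streaming pages are worth keeping in cache.
 */
static unsigned int cache_retention_secs(unsigned long long bdi_write_bw_bps)
{
        const unsigned long long slow_bw = 100ULL << 20;   /* spinning disk */
        unsigned long long secs;

        if (!bdi_write_bw_bps)
                return 30;                     /* unknown device: keep pages */
        secs = 30ULL * slow_bw / bdi_write_bw_bps;
        if (secs < 1)
                secs = 1;                      /* pmem/nvme class: barely cache */
        if (secs > 30)
                secs = 30;
        return (unsigned int)secs;
}

int main(void)
{
        printf("100MB/s disk: retain ~%us\n", cache_retention_secs(100ULL << 20));
        printf("500MB/s SSD : retain ~%us\n", cache_retention_secs(500ULL << 20));
        printf("2GB/s NVMe  : retain ~%us\n", cache_retention_secs(2048ULL << 20));
        return 0;
}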

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-19 Thread Mel Gorman
On Thu, Aug 18, 2016 at 05:11:11PM +1000, Dave Chinner wrote:
> On Thu, Aug 18, 2016 at 01:45:17AM +0100, Mel Gorman wrote:
> > On Wed, Aug 17, 2016 at 04:49:07PM +0100, Mel Gorman wrote:
> > > > Yes, we could try to batch the locking like DaveC already suggested
> > > > (ie we could move the locking to the caller, and then make
> > > > shrink_page_list() just try to keep the lock held for a few pages if
> > > > the mapping doesn't change), and that might result in fewer crazy
> > > > cacheline ping-pongs overall. But that feels like exactly the wrong
> > > > kind of workaround.
> > > > 
> > > 
> > > Even if such batching was implemented, it would be very specific to the
> > > case of a single large file filling LRUs on multiple nodes.
> > > 
> > 
> > The latest Jason Bourne movie was sufficiently bad that I spent time
> > thinking how the tree_lock could be batched during reclaim. It's not
> > straight-forward but this prototype did not blow up on UMA and may be
> > worth considering if Dave can test either approach has a positive impact.
> 
> SO, I just did a couple of tests. I'll call the two patches "sleepy"
> for the contention backoff patch and "bourney" for the Jason Bourne
> inspired batching patch. This is an average of 3 runs, overwriting
> a 47GB file on a machine with 16GB RAM:
> 
>   IO throughput   wall time __pv_queued_spin_lock_slowpath
> vanilla   470MB/s 1m42s   25-30%
> sleepy295MB/s 2m43s   <1%
> bourney   425MB/s 1m53s   25-30%
> 

This is another blunt-force patch that

a) stalls all but one kswapd instance when tree_lock contention occurs
b) marks a pgdat congested when tree_lock contention is encountered
   which may cause direct reclaimers to wait_iff_congested until
   kswapd finishes balancing the node

I tested this on a KVM instance running on a 4-socket box. The vCPUs
were bound to pCPUs and the memory nodes in the KVM mapped to physical
memory nodes. Without the patch 3% of kswapd cycles were spent on
locking. With the patch, the cycle count was 0.23%

xfs_io contention was reduced from 0.63% to 0.39% which is not perfect.
It can be reduced by stalling all kswapd instances but then xfs_io direct
reclaims and throughput drops.

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index d572b78b65e1..f6d3e886f405 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -532,6 +532,7 @@ enum pgdat_flags {
 * many pages under writeback
 */
PGDAT_RECLAIM_LOCKED,   /* prevents concurrent reclaim */
+   PGDAT_CONTENDED,/* kswapd contending on tree_lock */
 };
 
 static inline unsigned long zone_end_pfn(const struct zone *zone)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 374d95d04178..64ca2148755c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -621,19 +621,43 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
return PAGE_CLEAN;
 }
 
+static atomic_t kswapd_contended = ATOMIC_INIT(0);
+
 /*
  * Same as remove_mapping, but if the page is removed from the mapping, it
  * gets returned with a refcount of 0.
  */
 static int __remove_mapping(struct address_space *mapping, struct page *page,
-   bool reclaimed)
+   bool reclaimed, unsigned long *nr_contended)
 {
unsigned long flags;
 
BUG_ON(!PageLocked(page));
BUG_ON(mapping != page_mapping(page));
 
-        spin_lock_irqsave(&mapping->tree_lock, flags);
+        if (!nr_contended || !current_is_kswapd())
+                spin_lock_irqsave(&mapping->tree_lock, flags);
+        else {
+                /* Account for trylock contentions in kswapd */
+                if (!spin_trylock_irqsave(&mapping->tree_lock, flags)) {
+                        pg_data_t *pgdat = page_pgdat(page);
+                        int nr_kswapd;
+
+                        /* Account for contended pages and contended kswapds */
+                        (*nr_contended)++;
+                        if (!test_and_set_bit(PGDAT_CONTENDED, &pgdat->flags))
+                                nr_kswapd = atomic_inc_return(&kswapd_contended);
+                        else
+                                nr_kswapd = atomic_read(&kswapd_contended);
+                        BUG_ON(nr_kswapd > nr_online_nodes || nr_kswapd < 0);
+
+                        /* Stall kswapd if multiple kswapds are contending */
+                        if (nr_kswapd > 1)
+                                congestion_wait(BLK_RW_ASYNC, HZ/10);
+
+                        spin_lock_irqsave(&mapping->tree_lock, flags);
+                }
+        }
/*
 * The non racy check for a busy page.
 *
@@ -719,7 +743,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
  */
 int remove_mapping(struct address_space *mapping, struct page *page)
 {
-   if (__remove_mapping(mapping, page, false)) {
+ 

Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-19 Thread Mel Gorman
On Thu, Aug 18, 2016 at 03:25:40PM -0700, Linus Torvalds wrote:
> >> In fact, looking at the __page_cache_alloc(), we already have that
> >> "spread pages out" logic. I'm assuming Dave doesn't actually have that
> >> bit set (I don't think it's the default), but I'm also envisioning
> >> that maybe we could extend on that notion, and try to spread out
> >> allocations in general, but keep page allocations from one particular
> >> mapping within one node.
> >
> > CONFIG_CPUSETS=y
> >
> > But I don't have any cpusets configured (unless systemd is doing
> > something wacky under the covers) so the page spread bit should not
> > be set.
> 
> Yeah, but even when it's not set we just do a generic alloc_pages(),
> which is just going to fill up all nodes. Not perhaps quite as "spread
> out", but there's obviously no attempt to try to be node-aware either.
> 

There is a slight difference. Reads should fill the nodes in turn but
dirty pages (__GFP_WRITE) get distributed to balance the number of dirty
pages on each node to avoid hitting dirty balance limits prematurely.
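
The shape of that decision, modelled in user space with made-up per-node
numbers (this is not the kernel's dirty balancing code, just an illustration
of "keep reads local, spread writes by dirty headroom"):

#include <stdio.h>

#define NR_NODES 4

/* Invented per-node state for illustration */
static unsigned long node_dirty[NR_NODES]       = { 900, 100, 50, 20 };
static unsigned long node_dirty_limit[NR_NODES] = { 1000, 1000, 1000, 1000 };

/* Reads: simply prefer the local node. */
static int pick_node_for_read(int local_node)
{
        return local_node;
}

/*
 * Writes (__GFP_WRITE in the kernel): spread to whichever node has the
 * most dirty headroom, so no single node hits its dirty limit early.
 */
static int pick_node_for_write(int local_node)
{
        int nid, best = local_node;
        unsigned long best_room = node_dirty_limit[local_node] - node_dirty[local_node];

        for (nid = 0; nid < NR_NODES; nid++) {
                unsigned long room = node_dirty_limit[nid] - node_dirty[nid];

                if (room > best_room) {
                        best = nid;
                        best_room = room;
                }
        }
        return best;
}

int main(void)
{
        printf("read  from a CPU on node 0 -> node %d\n", pick_node_for_read(0));
        printf("write from a CPU on node 0 -> node %d\n", pick_node_for_write(0));
        return 0;
}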

Yesterday I tried a patch that avoids distributing to remote nodes close
to the high watermark to avoid waking remote kswapd instances. It added a
lot of overhead to the fast path (3%) which hurts every writer but did not
reduce contention enough on the special case of writing a single large file.

As an aside, the dirty distribution check itself is very expensive so I
prototyped something that does the expensive calculations on a vmstat
update. Not sure if it'll work but it's a side issue.

> So _if_ we come up with some reasonable way to say "let's keep the
> pages of this mapping together", we could try to do it in that
> numa-aware __page_cache_alloc().
> 
> It *could* be as simple/stupid as just saying "let's allocate the page
> cache for new pages from the current node" - and if the process that
> dirties pages just stays around on one single node, that might already
> be sufficient.
> 
> So just for testing purposes, you could try changing that
> 
> return alloc_pages(gfp, 0);
> 
> in __page_cache_alloc() into something like
> 
> return alloc_pages_node(cpu_to_node(raw_smp_processor_id())), gfp, 0);
> 
> or something.
> 

The test would be interesting but I believe that keeping heavy writers
on one node will force them to stall early on dirty balancing even if
there is plenty of free memory on other nodes.

-- 
Mel Gorman
SUSE Labs


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-19 Thread Michal Hocko
On Thu 18-08-16 15:25:40, Linus Torvalds wrote:
[...]
> So just for testing purposes, you could try changing that
> 
> return alloc_pages(gfp, 0);
> 
> in __page_cache_alloc() into something like
> 
> return alloc_pages_node(cpu_to_node(raw_smp_processor_id())), gfp, 0);

That would break mempolicies AFAICS. Anyway, I might be missing
something (the mempolicy code has a strange sense of aesthetics), but the
normal case without any explicit mempolicy should use default_policy,
which is MPOL_PREFERRED with MPOL_F_LOCAL, which means numa_node_id(), so
the local node. So the above two should do the same thing unless I am
missing something.
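
A simplified user-space model of that default-policy case; the struct layout
and helper below are stand-ins rather than the kernel's mempolicy code, and
numa_node_id() is represented by a parameter:

#include <stdio.h>

enum mpol_mode { MPOL_DEFAULT, MPOL_PREFERRED, MPOL_BIND, MPOL_INTERLEAVE };

#define MPOL_F_LOCAL 1          /* "use the local node" */

struct mempolicy {
        enum mpol_mode mode;
        int flags;
        int preferred_node;
};

/* The task-has-no-policy default: MPOL_PREFERRED + MPOL_F_LOCAL. */
static const struct mempolicy default_policy = {
        .mode  = MPOL_PREFERRED,
        .flags = MPOL_F_LOCAL,
};

static int policy_node(const struct mempolicy *pol, int local_node)
{
        if (pol->mode == MPOL_PREFERRED)
                return (pol->flags & MPOL_F_LOCAL) ? local_node : pol->preferred_node;
        return local_node;      /* other modes elided in this sketch */
}

int main(void)
{
        int local_node = 1;     /* stands in for numa_node_id() */

        /* alloc_pages(gfp, 0) with no task policy ... */
        printf("default policy -> node %d\n", policy_node(&default_policy, local_node));
        /* ... picks the same node alloc_pages_node(local_node, gfp, 0) would. */
        return 0;
}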
-- 
Michal Hocko
SUSE Labs


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-18 Thread Dave Chinner
On Thu, Aug 18, 2016 at 10:55:01AM -0700, Linus Torvalds wrote:
> On Thu, Aug 18, 2016 at 6:24 AM, Mel Gorman  
> wrote:
> > On Thu, Aug 18, 2016 at 05:11:11PM +1000, Dave Chinner wrote:
> >> FWIW, I just remembered about /proc/sys/vm/zone_reclaim_mode.
> >>
> >
> > That is a terrifying "fix" for this problem. It just happens to work
> > because there is no spillover to other nodes so only one kswapd instance
> > is potentially active.
> 
> Well, it may be a terrifying fix, but it does bring up an intriguing
> notion: maybe what we should think about is to make the actual page
> cache allocations be more "node-sticky" for a particular mapping? Not
> some hard node binding, but if we were to make a single mapping *tend*
> to allocate pages primarily within the same node, that would have the
> kind of secondary advantage that it would avoid the cross-node mapping
> locking.

For streaming or use-once IO it makes a lot of sense to restrict the
locality of the page cache. The faster the IO device, the less dirty
page buffering we need to maintain full device bandwidth. And the
larger the machine the greater the effect of global page cache
pollution on the other applications is.

> Think of it as a gentler "guiding" fix to the spinlock contention
> issue than a hard hammer.
> 
> And trying to (at least initially) keep the allocations of one
> particular file to one particular node sounds like it could have other
> locality advantages too.
> 
> In fact, looking at the __page_cache_alloc(), we already have that
> "spread pages out" logic. I'm assuming Dave doesn't actually have that
> bit set (I don't think it's the default), but I'm also envisioning
> that maybe we could extend on that notion, and try to spread out
> allocations in general, but keep page allocations from one particular
> mapping within one node.

CONFIG_CPUSETS=y

But I don't have any cpusets configured (unless systemd is doing
something wacky under the covers) so the page spread bit should not
be set.

> The fact that zone_reclaim_mode really improves on Dave's numbers
> *that* dramatically does seem to imply that there is something to be
> said for this.
> 
> We do *not* want to limit the whole page cache to a particular node -
> that sounds very unreasonable in general. But limiting any particular
> file mapping (by default - I'm sure there are things like databases
> that just want their one DB file to take over all of memory) to a
> single node sounds much less unreasonable.
> 
> What do you guys think? Worth exploring?

The problem is that whenever we turn this sort of behaviour on, some
benchmark regresses because it no longer holds its working set in
the page cache, leading to the change being immediately reverted.
Enterprise java benchmarks ring a bell, for some reason.

Hence my comment above about needing it to be tied into specific
"use-once-only" page cache behaviours. I know we have working set
estimation, fadvise modes and things like readahead that help track
sequential and use-once access patterns, but I'm not sure how we can
tie that all together

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-18 Thread Linus Torvalds
On Thu, Aug 18, 2016 at 2:19 PM, Dave Chinner  wrote:
>
> For streaming or use-once IO it makes a lot of sense to restrict the
> locality of the page cache. The faster the IO device, the less dirty
> page buffering we need to maintain full device bandwidth. And the
> larger the machine the greater the effect of global page cache
> pollution on the other applications is.

Yes. But I agree with you that it might be very hard to actually get
something that does a good job automagically.

>> In fact, looking at the __page_cache_alloc(), we already have that
>> "spread pages out" logic. I'm assuming Dave doesn't actually have that
>> bit set (I don't think it's the default), but I'm also envisioning
>> that maybe we could extend on that notion, and try to spread out
>> allocations in general, but keep page allocations from one particular
>> mapping within one node.
>
> CONFIG_CPUSETS=y
>
> But I don't have any cpusets configured (unless systemd is doing
> something wacky under the covers) so the page spread bit should not
> be set.

Yeah, but even when it's not set we just do a generic alloc_pages(),
which is just going to fill up all nodes. Not perhaps quite as "spread
out", but there's obviously no attempt to try to be node-aware either.

So _if_ we come up with some reasonable way to say "let's keep the
pages of this mapping together", we could try to do it in that
numa-aware __page_cache_alloc().

It *could* be as simple/stupid as just saying "let's allocate the page
cache for new pages from the current node" - and if the process that
dirties pages just stays around on one single node, that might already
be sufficient.

So just for testing purposes, you could try changing that

return alloc_pages(gfp, 0);

in __page_cache_alloc() into something like

return alloc_pages_node(cpu_to_node(raw_smp_processor_id())), gfp, 0);

or something.
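
A user-space mock-up of that idea, with cpu_to_node() stubbed by an invented
table (the real CPU-to-node topology lives in the kernel or libnuma); it only
illustrates the node choice, not __page_cache_alloc() itself:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Stub topology: which NUMA node each CPU belongs to (invented mapping). */
static const int cpu_to_node_tbl[] = { 0, 0, 0, 0, 1, 1, 1, 1 };

static int cpu_to_node(int cpu)
{
        int ncpus = (int)(sizeof(cpu_to_node_tbl) / sizeof(cpu_to_node_tbl[0]));

        if (cpu < 0 || cpu >= ncpus)
                return 0;
        return cpu_to_node_tbl[cpu];
}

/*
 * Model of the suggested change: instead of an unconstrained allocation,
 * new page cache pages are taken from the node the dirtying CPU runs on.
 */
static int page_cache_alloc_node(void)
{
        return cpu_to_node(sched_getcpu());
}

int main(void)
{
        printf("would allocate page cache on node %d\n", page_cache_alloc_node());
        return 0;
}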

>> The fact that zone_reclaim_mode really improves on Dave's numbers
>> *that* dramatically does seem to imply that there is something to be
>> said for this.
>>
>> We do *not* want to limit the whole page cache to a particular node -
>> that sounds very unreasonable in general. But limiting any particular
>> file mapping (by default - I'm sure there are things like databases
>> that just want their one DB file to take over all of memory) to a
>> single node sounds much less unreasonable.
>>
>> What do you guys think? Worth exploring?
>
> The problem is that whenever we turn this sort of behaviour on, some
> benchmark regresses because it no longer holds its working set in
> the page cache, leading to the change being immediately reverted.
> Enterprise java benchmarks ring a bell, for some reason.

Yeah. It might be ok if we limit the new behavior to just new pages
that get allocated for writing, which is where we want to limit the
page cache more anyway (we already have all those dirty limits etc).

But from a testing standpoint, you can probably try the above
"alloc_pages_node()" hack and see if it even makes a difference. It
might not work, and the dirtier might be moving around too much etc.

 Linus


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-18 Thread Linus Torvalds
On Thu, Aug 18, 2016 at 6:24 AM, Mel Gorman  wrote:
> On Thu, Aug 18, 2016 at 05:11:11PM +1000, Dave Chinner wrote:
>> FWIW, I just remembered about /proc/sys/vm/zone_reclaim_mode.
>>
>
> That is a terrifying "fix" for this problem. It just happens to work
> because there is no spillover to other nodes so only one kswapd instance
> is potentially active.

Well, it may be a terrifying fix, but it does bring up an intriguing
notion: maybe what we should think about is to make the actual page
cache allocations be more "node-sticky" for a particular mapping? Not
some hard node binding, but if we were to make a single mapping *tend*
to allocate pages primarily within the same node, that would have the
kind of secondary advantage that it would avoid the cross-node mapping
locking.

Think of it as a gentler "guiding" fix to the spinlock contention
issue than a hard hammer.

And trying to (at least initially) keep the allocations of one
particular file to one particular node sounds like it could have other
locality advantages too.

In fact, looking at the __page_cache_alloc(), we already have that
"spread pages out" logic. I'm assuming Dave doesn't actually have that
bit set (I don't think it's the default), but I'm also envisioning
that maybe we could extend on that notion, and try to spread out
allocations in general, but keep page allocations from one particular
mapping within one node.

The fact that zone_reclaim_mode really improves on Dave's numbers
*that* dramatically does seem to imply that there is something to be
said for this.

We do *not* want to limit the whole page cache to a particular node -
that sounds very unreasonable in general. But limiting any particular
file mapping (by default - I'm sure there are things like databases
that just want their one DB file to take over all of memory) to a
single node sounds much less unreasonable.

What do you guys think? Worth exploring?

Linus


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-18 Thread Mel Gorman
On Thu, Aug 18, 2016 at 05:11:11PM +1000, Dave Chinner wrote:
> On Thu, Aug 18, 2016 at 01:45:17AM +0100, Mel Gorman wrote:
> > On Wed, Aug 17, 2016 at 04:49:07PM +0100, Mel Gorman wrote:
> > > > Yes, we could try to batch the locking like DaveC already suggested
> > > > (ie we could move the locking to the caller, and then make
> > > > shrink_page_list() just try to keep the lock held for a few pages if
> > > > the mapping doesn't change), and that might result in fewer crazy
> > > > cacheline ping-pongs overall. But that feels like exactly the wrong
> > > > kind of workaround.
> > > > 
> > > 
> > > Even if such batching was implemented, it would be very specific to the
> > > case of a single large file filling LRUs on multiple nodes.
> > > 
> > 
> > The latest Jason Bourne movie was sufficiently bad that I spent time
> > thinking how the tree_lock could be batched during reclaim. It's not
> > straight-forward but this prototype did not blow up on UMA and may be
> > worth considering if Dave can test either approach has a positive impact.
> 
> SO, I just did a couple of tests. I'll call the two patches "sleepy"
> for the contention backoff patch and "bourney" for the Jason Bourne
> inspired batching patch. This is an average of 3 runs, overwriting
> a 47GB file on a machine with 16GB RAM:
> 
>   IO throughput   wall time __pv_queued_spin_lock_slowpath
> vanilla   470MB/s 1m42s   25-30%
> sleepy295MB/s 2m43s   <1%
> bourney   425MB/s 1m53s   25-30%
> 

Thanks. I updated the tests today and reran them trying to reproduce what
you saw but I'm simply not seeing it on bare metal with a spinning disk.

xfsio Throughput
                           4.8.0-rc2            4.8.0-rc2            4.8.0-rc2
                             vanilla               sleepy              bourney
Min      tput      147.4450 (  0.00%)   147.2580 (  0.13%)   147.3900 (  0.04%)
Hmean    tput      147.5853 (  0.00%)   147.5101 (  0.05%)   147.6121 ( -0.02%)
Stddev   tput        0.1041 (  0.00%)     0.1785 (-71.47%)     0.2036 (-95.63%)
CoeffVar tput        0.0705 (  0.00%)     0.1210 (-71.56%)     0.1379 (-95.59%)
Max      tput      147.6940 (  0.00%)   147.6420 (  0.04%)   147.8820 ( -0.13%)

I'm currently setting up a KVM instance that may fare better. Due to
quirks of where machines are, I have to set up the KVM instance on real
NUMA hardware but maybe that'll make the problem even more obvious.

> The overall CPU usage of sleepy was much lower than the others, but
> it was also much slower. Too much sleeping and not enough reclaim
> work being done, I think.
> 

Looks like it. On my initial test, there was barely any sleeping.

> As for bourney, it's not immediately clear as to why it's nearly as
> bad as the movie. At worst I would have expected it to have no
> noticeable impact, but maybe we are delaying freeing of pages too
> long and so stalling allocation of new pages? It also doesn't do
> much to reduce contention, especially considering the reduction in
> throughput.
> 
> On a hunch that the batch list isn't all one mapping, I sorted it.
> Patch is below if you're curious.
> 

The fact that sorting makes such a difference makes me think that it's
the wrong direction. It's far too specific to this test case and does
nothing to throttle a reclaimer. It's also fairly complex and I expected
that normal users of remove_mapping such as truncation would take a hit.

The hit of bouncing the lock around just hurts too much.

> FWIW, I just remembered about /proc/sys/vm/zone_reclaim_mode.
> 

That is a terrifying "fix" for this problem. It just happens to work
because there is no spillover to other nodes so only one kswapd instance
is potentially active.

> Anyway, I've burnt enough erase cycles on this SSD for today
> 

I'll continue looking at getting KVM up and running and then consider
other possibilities for throttling.

-- 
Mel Gorman
SUSE Labs


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-18 Thread Dave Chinner
On Thu, Aug 18, 2016 at 01:45:17AM +0100, Mel Gorman wrote:
> On Wed, Aug 17, 2016 at 04:49:07PM +0100, Mel Gorman wrote:
> > > Yes, we could try to batch the locking like DaveC already suggested
> > > (ie we could move the locking to the caller, and then make
> > > shrink_page_list() just try to keep the lock held for a few pages if
> > > the mapping doesn't change), and that might result in fewer crazy
> > > cacheline ping-pongs overall. But that feels like exactly the wrong
> > > kind of workaround.
> > > 
> > 
> > Even if such batching was implemented, it would be very specific to the
> > case of a single large file filling LRUs on multiple nodes.
> > 
> 
> The latest Jason Bourne movie was sufficiently bad that I spent time
> thinking how the tree_lock could be batched during reclaim. It's not
> straight-forward but this prototype did not blow up on UMA and may be
> worth considering if Dave can test either approach has a positive impact.

SO, I just did a couple of tests. I'll call the two patches "sleepy"
for the contention backoff patch and "bourney" for the Jason Bourne
inspired batching patch. This is an average of 3 runs, overwriting
a 47GB file on a machine with 16GB RAM:

IO throughput   wall time __pv_queued_spin_lock_slowpath
vanilla 470MB/s 1m42s   25-30%
sleepy  295MB/s 2m43s   <1%
bourney 425MB/s 1m53s   25-30%

The overall CPU usage of sleepy was much lower than the others, but
it was also much slower. Too much sleeping and not enough reclaim
work being done, I think.

As for bourney, it's not immediately clear as to why it's nearly as
bad as the movie. At worst I would have expected it to have no
noticeable impact, but maybe we are delaying freeing of pages too
long and so stalling allocation of new pages? It also doesn't do
much to reduce contention, especially considering the reduction in
throughput.

On a hunch that the batch list isn't all one mapping, I sorted it.
Patch is below if you're curious.

IO throughput   wall time __pv_queued_spin_lock_slowpath
vanilla 470MB/s 1m42s   25-30%
sleepy  295MB/s 2m43s   <1%
bourney 425MB/s 1m53s   25-30%
sorted-bourney  465MB/s 1m43s   20%

The number of reclaim batches (from multiple runs) where the sorting
of the lists would have done anything is counted by list swaps (ls)
being > 1.

# grep " c " /var/log/syslog |grep -v "ls 1" |wc -l
7429
# grep " c " /var/log/syslog |grep "ls 1" |wc -l
1061767

IOWs in 1.07 million batches of pages reclaimed, only ~0.695% of
batches switched to a different mapping tree lock more than once.
From those numbers I would not have expected sorting the page list
to have any measurable impact on performance. However, performance
seems very sensitive to the number of times the mapping tree lock
is bounced around.
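
That sensitivity to lock hand-offs can be modelled in user space: walk a
batch of pages tagged with their mapping and count how often the lock would
change hands, then sort the batch by mapping pointer and count again (all
values below are invented for illustration):

#include <stdio.h>
#include <stdlib.h>

struct fake_page { void *mapping; };

/* Count how many times consecutive pages belong to different mappings. */
static int count_lock_switches(struct fake_page *pages, int n)
{
        int i, switches = 0;

        for (i = 1; i < n; i++)
                if (pages[i].mapping != pages[i - 1].mapping)
                        switches++;
        return switches;
}

static int cmp_mapping(const void *a, const void *b)
{
        const struct fake_page *pa = a, *pb = b;

        if (pa->mapping == pb->mapping)
                return 0;
        return pa->mapping > pb->mapping ? 1 : -1;
}

int main(void)
{
        /* Three fake mappings, pages interleaved as reclaim might see them. */
        int m1, m2, m3;
        struct fake_page batch[] = {
                { &m1 }, { &m2 }, { &m1 }, { &m3 }, { &m2 }, { &m1 }, { &m3 }, { &m2 },
        };
        int n = (int)(sizeof(batch) / sizeof(batch[0]));

        printf("unsorted: %d lock hand-offs\n", count_lock_switches(batch, n));
        qsort(batch, n, sizeof(batch[0]), cmp_mapping);
        printf("sorted  : %d lock hand-offs\n", count_lock_switches(batch, n));
        return 0;
}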

FWIW, I just remembered about /proc/sys/vm/zone_reclaim_mode.

IO throughput   wall time __pv_queued_spin_lock_slowpath
vanilla 470MB/s 1m42s   25-30%
zr=1    470MB/s 1m42s   2-3%

So isolating the page cache usage to a single node maintains
performance and shows a significant reduction in pressure on the
mapping tree lock. Same as a single node system, I'd guess.
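
For reference, the knob flipped here is the standard sysctl; it can be set
with echo 1 > /proc/sys/vm/zone_reclaim_mode, or from a trivial program
(needs root, and the file only exists on NUMA-capable kernels):

#include <stdio.h>

int main(void)
{
        FILE *f = fopen("/proc/sys/vm/zone_reclaim_mode", "w");

        if (!f) {
                perror("zone_reclaim_mode");
                return 1;
        }
        /* 1 == try to reclaim within the local node before going off-node */
        fprintf(f, "1\n");
        fclose(f);
        return 0;
}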

Anyway, I've burnt enough erase cycles on this SSD for today

-Dave.

---
 mm/vmscan.c | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9261102..5cf1bd6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -56,6 +56,8 @@
 
 #include "internal.h"
 
+#include <linux/list_sort.h>
+
 #define CREATE_TRACE_POINTS
 #include <trace/events/vmscan.h>
 
@@ -761,6 +763,17 @@ int remove_mapping(struct address_space *mapping, struct page *page)
return ret;
 }
 
+static int mapping_cmp(void *priv, struct list_head *a, struct list_head *b)
+{
+   struct address_space *ma = container_of(a, struct page, lru)->mapping;
+   struct address_space *mb = container_of(b, struct page, lru)->mapping;
+
+   if (ma == mb)
+   return 0;
+   if (ma > mb)
+   return 1;
+   return -1;
+}
 static void remove_mapping_list(struct list_head *mapping_list,
struct list_head *free_pages,
struct list_head *ret_pages)
@@ -771,12 +784,17 @@ static void remove_mapping_list(struct list_head *mapping_list,
LIST_HEAD(swapcache);
LIST_HEAD(filecache);
struct page *page;
+   int c = 0, ls = 0;
+
+   list_sort(NULL, mapping_list, mapping_cmp);
 
while (!list_empty(mapping_list)) {
+   c++;
page = lru_to_page(mapping_list);
list_del(&page->lru);
 
if (!mapping || page->mapping != mapping) {
+   ls++;
if (mapping) {

Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-17 Thread Dave Chinner
On Wed, Aug 17, 2016 at 04:49:07PM +0100, Mel Gorman wrote:
> On Tue, Aug 16, 2016 at 10:47:36AM -0700, Linus Torvalds wrote:
> > I've always preferred to see direct reclaim as the primary model for
> > reclaim, partly in order to throttle the actual "bad" process, but
> > also because "kswapd uses lots of CPU time" is such a nasty thing to
> > even begin guessing about.
> > 
> 
> While I agree that bugs with high CPU usage from kswapd are a pain,
> I'm reluctant to move towards direct reclaim being the primary mode. The
> stalls can be severe and there is no guarantee that the process punished
> is the process responsible. I'm basing this assumption on observations
> of severe performance regressions when I accidentally broke kswapd during
> the development of node-lru.
> 
> > So I have to admit to liking that "make kswapd sleep a bit if it's
> > just looping" logic that got removed in that commit.
> > 
> 
> It's primarily the direct reclaimer that is affected by that patch.
> 
> > And looking at DaveC's numbers, it really feels like it's not even
> > what we do inside the locked region that is the problem. Sure,
> > __delete_from_page_cache() (which is most of it) is at 1.86% of CPU
> > time (when including all the things it calls), but that still isn't
> > all that much. Especially when compared to just:
> > 
> >0.78%  [kernel]  [k] _raw_spin_unlock_irqrestore
> > 
> 
> The profile is shocking for such a basic workload. I automated what Dave
> described with xfs_io except that the file size is 2*RAM. The filesystem
> is sized to be roughly the same size as the file to minimise variances
> due to block layout. A call-graph profile collected on bare metal UMA with
> numa=fake=4 and paravirt spinlocks showed
> 
>  1.40%  0.16%  kswapd1  [kernel.vmlinux]  [k] _raw_spin_lock_irqsave
>  1.36%  0.16%  kswapd2  [kernel.vmlinux]  [k] _raw_spin_lock_irqsave
>  1.21%  0.12%  kswapd0  [kernel.vmlinux]  [k] _raw_spin_lock_irqsave
>  1.12%  0.13%  kswapd3  [kernel.vmlinux]  [k] _raw_spin_lock_irqsave
>  0.81%  0.45%  xfs_io   [kernel.vmlinux]  [k] _raw_spin_lock_irqsave
> 
> Those contention figures are not great but they are not terrible either. The
> vmstats told me there was no direct reclaim activity so either my theory
> is wrong or this machine is not reproducing the same problem Dave is seeing.

No, that's roughly the same un-normalised CPU percentage I am seeing
in spinlock contention. i.e. take away the idle CPU in the profile
(probably upwards of 80% if it's a 16p machine), and instead look at
that figure as a percentage of total CPU used by the workload. Then
you'll see that it's 30-40% of the total CPU consumed by the workload.
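
Roughly, that normalisation is just (numbers below illustrative, taken loosely from the profile above):

/* Illustrative only: express a flat-profile percentage as a share of the
 * CPU the workload actually consumed rather than of all CPU time. */
#include <stdio.h>

int main(void)
{
	double lock_pct_of_total = 1.4 * 4 + 0.8;	/* four kswapds plus xfs_io */
	double idle_pct = 80.0;				/* mostly idle 16p machine */
	double busy_pct = 100.0 - idle_pct;

	printf("~%.0f%% of the CPU the workload actually used\n",
	       100.0 * lock_pct_of_total / busy_pct);
	return 0;
}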

> I have partial results from a 2-socket and 4-socket machine. 2-socket spends
> roughly 1.8% in _raw_spin_lock_irqsave and 4-socket spends roughly 3%,
> both with no direct reclaim. Clearly the problem gets worse the more NUMA
> nodes there are but not to the same extent Dave reports.
> 
> I believe potential reasons why I do not see the same problem as Dave are;
> 
> 1. Different memory sizes changing timing
> 2. Dave has fast storage and I'm using a spinning disk

This particular machine is using an abused 3 year old SATA SSD that still
runs at 500MB/s on sequential writes. This is "cheap desktop"
capability these days and is nowhere near what I'd call "fast".

> 3. Lock contention problems are magnified inside KVM
> 
> I think 3 is a good possibility if contended locks result in expensive
> exiting and re-entry of the guest. I have a vague recollection that a
> spinning vcpu exits the guest but I did not confirm that.

I don't think anything like that has been implemented in the pv
spinlocks yet. They just spin right now - it's the same lock
implementation as the host. Also, context switch rates measured on
the host are not significantly higher than what is measured in the
guest, so there doesn't appear to be any extra scheduling on the
host side occurring.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-17 Thread Mel Gorman
On Wed, Aug 17, 2016 at 04:49:07PM +0100, Mel Gorman wrote:
> > Yes, we could try to batch the locking like DaveC already suggested
> > (ie we could move the locking to the caller, and then make
> > shrink_page_list() just try to keep the lock held for a few pages if
> > the mapping doesn't change), and that might result in fewer crazy
> > cacheline ping-pongs overall. But that feels like exactly the wrong
> > kind of workaround.
> > 
> 
> Even if such batching was implemented, it would be very specific to the
> case of a single large file filling LRUs on multiple nodes.
> 

The latest Jason Bourne movie was sufficiently bad that I spent time
thinking how the tree_lock could be batched during reclaim. It's not
straightforward, but this prototype did not blow up on UMA and may be
worth considering if Dave can test whether either approach has a positive impact.

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 374d95d04178..926110219cd9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -621,19 +621,39 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
return PAGE_CLEAN;
 }
 
+static void finalise_remove_mapping(struct list_head *swapcache,
+   struct list_head *filecache,
+   void (*freepage)(struct page *))
+{
+   struct page *page;
+
+   while (!list_empty(swapcache)) {
+           swp_entry_t swap;
+
+           page = lru_to_page(swapcache);
+           swap.val = page_private(page);
+           list_del(&page->lru);
+           swapcache_free(swap);
+           set_page_private(page, 0);
+   }
+
+   while (!list_empty(filecache)) {
+           page = lru_to_page(filecache);
+           list_del(&page->lru);
+           freepage(page);
+   }
+}
+
 /*
  * Same as remove_mapping, but if the page is removed from the mapping, it
  * gets returned with a refcount of 0.
  */
-static int __remove_mapping(struct address_space *mapping, struct page *page,
-   bool reclaimed)
+static int __remove_mapping_page(struct address_space *mapping,
+struct page *page, bool reclaimed,
+struct list_head *swapcache,
+struct list_head *filecache)
 {
-   unsigned long flags;
-
BUG_ON(!PageLocked(page));
BUG_ON(mapping != page_mapping(page));
 
-   spin_lock_irqsave(&mapping->tree_lock, flags);
/*
 * The non racy check for a busy page.
 *
@@ -668,16 +688,18 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
}
 
if (PageSwapCache(page)) {
-   swp_entry_t swap = { .val = page_private(page) };
+   unsigned long swapval = page_private(page);
+   swp_entry_t swap = { .val = swapval };
mem_cgroup_swapout(page, swap);
__delete_from_swap_cache(page);
-   spin_unlock_irqrestore(&mapping->tree_lock, flags);
-   swapcache_free(swap);
+   set_page_private(page, swapval);
+   list_add(&page->lru, swapcache);
} else {
-   void (*freepage)(struct page *);
void *shadow = NULL;
+   void (*freepage)(struct page *);
 
freepage = mapping->a_ops->freepage;
+
/*
 * Remember a shadow entry for reclaimed file cache in
 * order to detect refaults, thus thrashing, later on.
@@ -698,16 +720,13 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
!mapping_exiting(mapping) && !dax_mapping(mapping))
shadow = workingset_eviction(mapping, page);
__delete_from_page_cache(page, shadow);
-   spin_unlock_irqrestore(&mapping->tree_lock, flags);
-
-   if (freepage != NULL)
-   freepage(page);
+   if (freepage)
+   list_add(&page->lru, filecache);
}
 
return 1;
 
 cannot_free:
-   spin_unlock_irqrestore(&mapping->tree_lock, flags);
return 0;
 }
 
@@ -719,16 +738,68 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
  */
 int remove_mapping(struct address_space *mapping, struct page *page)
 {
-   if (__remove_mapping(mapping, page, false)) {
+   unsigned long flags;
+   LIST_HEAD(swapcache);
+   LIST_HEAD(filecache);
+   void (*freepage)(struct page *);
+   int ret = 0;
+
+   spin_lock_irqsave(&mapping->tree_lock, flags);
+   freepage = mapping->a_ops->freepage;
+
+   if (__remove_mapping_page(mapping, page, false, &swapcache, &filecache)) {
/*
 * Unfreezing the refcount with 1 rather than 2 effectively
 * drops the pagecache ref for us without requiring another
 * atomic operation.
 */
page_ref_unfreeze(page, 1);
-   return 1;
+ 
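
The core trick above is to detach everything under the lock and run the expensive per-page callbacks only after dropping it; a self-contained userspace analogue of that pattern (illustrative only, not the kernel code):

/* Illustrative only: move items to a private list under the lock, then run
 * the per-item callbacks (the "freepage" analogue) after unlocking. */
#include <stdio.h>
#include <pthread.h>

struct node { struct node *next; int id; };

static pthread_mutex_t tree_lock = PTHREAD_MUTEX_INITIALIZER;
static struct node *shared;		/* list protected by tree_lock */

static void free_cb(struct node *n)	/* expensive work, needs no lock */
{
	printf("freeing %d outside the lock\n", n->id);
}

static void reclaim_batch(void)
{
	struct node *batch, *n;

	pthread_mutex_lock(&tree_lock);
	batch = shared;				/* detach the whole list ... */
	shared = NULL;
	pthread_mutex_unlock(&tree_lock);	/* ... and drop the lock early */

	for (n = batch; n; n = n->next)
		free_cb(n);
}

int main(void)
{
	static struct node a = { NULL, 1 }, b = { &a, 2 };

	shared = &b;
	reclaim_batch();
	return 0;
}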

Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-17 Thread Michal Hocko
On Wed 17-08-16 17:48:25, Michal Hocko wrote:
[...]
> I will try to catch up with the rest of the email thread but from a
> quick glance it just feels like we are doing more work under the
> lock.

Hmm, so it doesn't seem to be more work in __remove_mapping as pointed
out in http://lkml.kernel.org/r/20160816220250.GI16044@dastard

As Mel already pointed out the LRU will be basically single mapping for
this workload so any subtle change in timing might make a difference.
I was looking through 4.6..4.7 and one thing that has changed is the
inactive vs. active LRU size ratio. See 59dc76b0d4df ("mm: vmscan:
reduce size of inactive file list"). The machine has quite a lot of
memory and so the LRUs will be large as well so I guess this could have
change the timing somehow, but it feels like a wild guess so I would be
careful to blame this commit...
-- 
Michal Hocko
SUSE Labs
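
For reference, my reading of 59dc76b0d4df is that the inactive file target now scales with the square root of the file LRU size; a rough userspace rendering of that balance check (the exact formula here is my recollection of the commit, treat it as an assumption):

/* Rough rendering of the post-59dc76b0d4df inactive/active file balance,
 * assuming 4k pages; the formula is recalled, not quoted from this thread. */
#include <stdio.h>

static unsigned long isqrt(unsigned long x)
{
	unsigned long r = 0;

	while ((r + 1) * (r + 1) <= x)
		r++;
	return r;
}

/* Returns non-zero if the inactive file list is considered too small. */
static int inactive_file_is_low(unsigned long inactive, unsigned long active)
{
	unsigned long gb = (inactive + active) >> (30 - 12);	/* pages -> GB */
	unsigned long ratio = gb ? isqrt(10 * gb) : 1;

	return inactive * ratio < active;
}

int main(void)
{
	/* ~64GB of file pages: the target works out to roughly 25:1. */
	printf("low? %d\n", inactive_file_is_low(512UL << 10, 15872UL << 10));
	printf("low? %d\n", inactive_file_is_low(4UL << 20, 12UL << 20));
	return 0;
}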


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-17 Thread Peter Zijlstra
On Mon, Aug 15, 2016 at 07:03:00AM +0200, Ingo Molnar wrote:
> 
> * Linus Torvalds  wrote:
> 
> > Make sure you actually use "perf record -e cycles:pp" or something
> > that uses PEBS to get real profiles using CPU performance counters.
> 
> Btw., 'perf record -e cycles:pp' is the default now for modern versions
> of perf tooling (on most x86 systems) - if you do 'perf record' it will
> just use the most precise profiling mode available on that particular
> CPU model.
> 
> If unsure you can check the event that was used, via:
> 
>   triton:~> perf report --stdio 2>&1 | grep '# Samples'
>   # Samples: 27K of event 'cycles:pp'

Problem here is that Dave is using a KVM thingy. Getting hardware
counters in a guest is somewhat tricky but doable, but PEBS does not
virtualize.




Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-17 Thread Mel Gorman
On Tue, Aug 16, 2016 at 10:47:36AM -0700, Linus Torvalds wrote:
> I've always preferred to see direct reclaim as the primary model for
> reclaim, partly in order to throttle the actual "bad" process, but
> also because "kswapd uses lots of CPU time" is such a nasty thing to
> even begin guessing about.
> 

While I agree that bugs with high CPU usage from kswapd are a pain,
I'm reluctant to move towards direct reclaim being the primary mode. The
stalls can be severe and there is no guarantee that the process punished
is the process responsible. I'm basing this assumption on observations
of severe performance regressions when I accidentally broke kswapd during
the development of node-lru.

> So I have to admit to liking that "make kswapd sleep a bit if it's
> just looping" logic that got removed in that commit.
> 

It's primarily the direct reclaimer that is affected by that patch.

> And looking at DaveC's numbers, it really feels like it's not even
> what we do inside the locked region that is the problem. Sure,
> __delete_from_page_cache() (which is most of it) is at 1.86% of CPU
> time (when including all the things it calls), but that still isn't
> all that much. Especially when compared to just:
> 
>0.78%  [kernel]  [k] _raw_spin_unlock_irqrestore
> 

The profile is shocking for such a basic workload. I automated what Dave
described with xfs_io except that the file size is 2*RAM. The filesystem
is sized to be roughly the same size as the file to minimise variances
due to block layout. A call-graph profile collected on bare metal UMA with
numa=fake=4 and paravirt spinlocks showed

 1.40%  0.16%  kswapd1  [kernel.vmlinux]  [k] _raw_spin_lock_irqsave
 1.36%  0.16%  kswapd2  [kernel.vmlinux]  [k] _raw_spin_lock_irqsave
 1.21%  0.12%  kswapd0  [kernel.vmlinux]  [k] _raw_spin_lock_irqsave
 1.12%  0.13%  kswapd3  [kernel.vmlinux]  [k] _raw_spin_lock_irqsave
 0.81%  0.45%  xfs_io   [kernel.vmlinux]  [k] _raw_spin_lock_irqsave

Those contention figures are not great but they are not terrible either. The
vmstats told me there was no direct reclaim activity so either my theory
is wrong or this machine is not reproducing the same problem Dave is seeing.

I have partial results from a 2-socket and 4-socket machine. 2-socket spends
roughly 1.8% in _raw_spin_lock_irqsave and 4-socket spends roughly 3%,
both with no direct reclaim. Clearly the problem gets worse the more NUMA
nodes there are but not to the same extent Dave reports.

I believe potential reasons why I do not see the same problem as Dave are;

1. Different memory sizes changing timing
2. Dave has fast storage and I'm using a spinning disk
3. Lock contention problems are magnified inside KVM

I think 3 is a good possibility if contended locks result in expensive
exiting and reentery of the guest. I have a vague recollection that a
spinning vcpu exits the guest but I did not confirm that. I can setup a
KVM instance and run the tests but it'll take a few hours and possibly
will be pushed out until tomorrow.

> So I'm more and more getting the feeling that it's not what we do
> inside the lock that is problematic. I started out blaming memcg
> accounting or something, but none of the numbers seem to back that up.
> So it's primarily really just the fact that kswapd is simply hammering
> on that lock way too much.
> 

Agreed.

> So yeah, I'm blaming kswapd itself doing something wrong. It's not a
> problem in a single-node environment (since there's only one), but
> with multiple nodes it clearly just devolves.
> 
> Yes, we could try to batch the locking like DaveC already suggested
> (ie we could move the locking to the caller, and then make
> shrink_page_list() just try to keep the lock held for a few pages if
> the mapping doesn't change), and that might result in fewer crazy
> cacheline ping-pongs overall. But that feels like exactly the wrong
> kind of workaround.
> 

Even if such batching was implemented, it would be very specific to the
case of a single large file filling LRUs on multiple nodes.

> I'd much rather re-instate some "if kswapd is just spinning on CPU
> time and not actually improving IO parallelism, kswapd should just get
> the hell out" logic.
> 

I'm having trouble right now thinking of a good way of identifying when
kswapd should give up and force direct reclaim to take a hit.

I'd like to pass something else by the wtf-o-meter. I had a prototype
patch lying around that replaced the congestion_wait issued when too many
LRU pages are isolated with a waitqueue, for an unrelated theoretical
problem. It's the bulk of the patch below but can be trivially extended
for the case of tree_lock contention.
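
A rough userspace analogue of that waitqueue idea, purely illustrative: on contention, wait for either a timeout or a signal from a thread that made progress, then retry:

/* Illustrative only: trylock, and on contention wait for a timeout or a
 * "somebody made progress" wakeup before retrying. */
#include <errno.h>
#include <pthread.h>
#include <time.h>

static pthread_mutex_t tree_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t wq_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t progress = PTHREAD_COND_INITIALIZER;

static void lock_or_backoff(void)
{
	while (pthread_mutex_trylock(&tree_lock) == EBUSY) {
		struct timespec ts;

		clock_gettime(CLOCK_REALTIME, &ts);
		ts.tv_nsec += 10 * 1000 * 1000;		/* back off ~10ms */
		if (ts.tv_nsec >= 1000000000L) {
			ts.tv_sec++;
			ts.tv_nsec -= 1000000000L;
		}
		pthread_mutex_lock(&wq_lock);
		pthread_cond_timedwait(&progress, &wq_lock, &ts);
		pthread_mutex_unlock(&wq_lock);
	}
}

static void unlock_and_kick(void)
{
	pthread_mutex_unlock(&tree_lock);
	pthread_cond_broadcast(&progress);	/* wake any throttled reclaimers */
}

int main(void)
{
	lock_or_backoff();
	unlock_and_kick();
	return 0;
}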

The interesting part is the change to __remove_mapping. It stalls a
reclaimer (direct or kswapd) if the lock is contended for either a
timeout or a local reclaimer finishing some reclaim. This stalls for a
"real" reason 

Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-17 Thread Michal Hocko
On Tue 16-08-16 10:47:36, Linus Torvalds wrote:
> Mel,
>  thanks for taking a look. Your theory sounds more complete than mine,
> and since Dave is able to see the problem with 4.7, it would be nice
> to hear about the 4.6 behavior and commit ede37713737 in particular.
> 
> That one seems more likely to affect contention than the zone/node one
> I found during the merge window anyway, since it actually removes a
> sleep in kswapd during congestion.

Hmm, the patch removes a short sleep from wait_iff_congested for
kworkers but that cannot affect kswapd context. Then it removes
wait_iff_congested from should_reclaim_retry but that is not kswapd path
and the sleep was added in the same merge window so it wasn't in 4.6 so
it shouldn't make any difference as well.

So I am not really sure how it could make any difference.

I will try to catch up with the rest of the email thread but from a
quick glance it just feels like we are doing more work under the
lock.

-- 
Michal Hocko
SUSE Labs


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-16 Thread Linus Torvalds
On Tue, Aug 16, 2016 at 3:02 PM, Dave Chinner  wrote:
>>
>> What does your profile show for when you actually dig into
>> __remove_mapping() itself?, Looking at your flat profile, I'm assuming
>> you get
>
> -   22.26%  0.93%  [kernel]  [k] __remove_mapping
>    - 3.86% __remove_mapping
>       - 18.35% _raw_spin_lock_irqsave
>            __pv_queued_spin_lock_slowpath
>         1.32% __delete_from_page_cache
>       - 0.92% _raw_spin_unlock_irqrestore
>            __raw_callee_save___pv_queued_spin_unlock

Ok, that's all very consistent with my profiles, except - obviously -
for the crazy spinlock thing.

One difference is that your unlock has that PV unlock thing - on raw
hardware it's just a single store. But I don't think I saw the
unlock_slowpath in there.

There's nothing really expensive going on there that I can tell.

> And the instruction level profile:

Yup. The bulk is in the cmpxchg and a cache miss (it just shows up in
the instruction after it: you can use "cycles:pp" to get perf to
actually try to fix up the blame to the instruction that _causes_
things rather than the instruction following, but in this case it's
all trivial).

> It's the same code AFAICT, except the pv version jumps straight to
> the "queue" case.

Yes. Your profile looks perfectly fine. Most of the profile is right
after the 'pause', which you'd expect.

From a quick look, it seems like only about 2/3rd of the time is
actually spent in the "pause" loop, but the control flow is complex
enough that maybe I didn't follow it right. The native case is
simpler. But since I suspect that it's not so much about the
spinlocked region being too costly, but just about locking too damn
much, that 2/3rds actually makes sense: it's not that it's
necessarily spinning waiting for the lock all that long in any
individual case, it's just that the spin_lock code is called so much.

So I still kind of just blame kswapd, rather than any new expense. It
would be interesting to hear if Mel is right about that kswapd
sleeping change between 4.6 and 4.7..

   Linus


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-16 Thread Dave Chinner
On Mon, Aug 15, 2016 at 06:51:42PM -0700, Linus Torvalds wrote:
> Anyway, including the direct reclaim call paths gets
> __remove_mapping() a bit higher, and _raw_spin_lock_irqsave climbs to
> 0.26%. But perhaps more importantly, looking at what __remove_mapping
> actually *does* (apart from the spinlock) gives us:
> 
>  - inside remove_mapping itself (0.11% on its own - flat cost, no
> child accounting)
> 
> 48.50 │   lock   cmpxchg %edx,0x1c(%rbx)
> 
> so that's about 0.05%
> 
>  - 0.40% __delete_from_page_cache (0.22%
> radix_tree_replace_clear_tags, 0.13%__radix_tree_lookup)
> 
>  - 0.06% workingset_eviction()
> 
> so I'm not actually seeing anything *new* expensive in there. The
> __delete_from_page_cache() overhead may have changed a bit with the
> tagged tree changes, but this doesn't look like memcg.
> 
> But we clearly have very different situations.
> 
> What does your profile show for when you actually dig into
> __remove_mapping() itself?, Looking at your flat profile, I'm assuming
> you get

-   22.26%  0.93%  [kernel]  [k] __remove_mapping
   - 3.86% __remove_mapping
      - 18.35% _raw_spin_lock_irqsave
           __pv_queued_spin_lock_slowpath
        1.32% __delete_from_page_cache
      - 0.92% _raw_spin_unlock_irqrestore
           __raw_callee_save___pv_queued_spin_unlock

And the instruction level profile:

.
       │   xor    %ecx,%ecx
       │   mov    %rax,%r15
  0.39 │   mov    $0x2,%eax
       │   lock   cmpxchg %ecx,0x1c(%rbx)
 32.56 │   cmp    $0x2,%eax
       │   jne    12e
       │   mov    0x20(%rbx),%rax
       │   lea    -0x1(%rax),%rdx
  0.39 │   test   $0x1,%al
       │   cmove  %rbx,%rdx
       │   mov    (%rdx),%rax
  0.39 │   test   $0x10,%al
       │   jne    127
       │   mov    (%rbx),%rcx
       │   shr    $0xf,%rcx
       │   and    $0x1,%ecx
       │   jne    14a
       │   mov    0x68(%r14),%rax
 36.03 │   xor    %esi,%esi
       │   test   %r13b,%r13b
       │   mov    0x50(%rax),%rdx
  1.16 │   jne    e8
  0.96 │ a9:   mov    %rbx,%rdi
.

Indicates most time is spent on the cmpxchg for the page ref, followed by
grabbing the ->freepage op vector:

freepage = mapping->a_ops->freepage;
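
That cmpxchg is the attempt to freeze the page refcount (page_ref_freeze(page, 2) in __remove_mapping(), as far as I can tell); a minimal userspace sketch of the same freeze pattern, using C11 atomics rather than the kernel helpers:

/* Illustrative only: the "freeze the refcount iff it is exactly N" pattern
 * that shows up as a single lock cmpxchg on the page refcount. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct fake_page { atomic_int refcount; };

static bool ref_freeze(struct fake_page *p, int expected)
{
	int old = expected;

	/* Succeeds only if nobody else holds or is taking a reference. */
	return atomic_compare_exchange_strong(&p->refcount, &old, 0);
}

int main(void)
{
	struct fake_page page = { .refcount = 2 };

	printf("freeze: %s\n", ref_freeze(&page, 2) ? "ok" : "busy");
	printf("freeze again: %s\n", ref_freeze(&page, 2) ? "ok" : "busy");
	return 0;
}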

> I come back to wondering whether maybe you're hitting some PV-lock problem.
> 
> I know queued_spin_lock_slowpath() is ok. I'm not entirely sure
> __pv_queued_spin_lock_slowpath() is.

It's the same code AFAICT, except the pv version jumps straight to
the "queue" case.

> So I'd love to see you try the non-PV case, but I also think it might
> be interesting to see what the instruction profile for
> __pv_queued_spin_lock_slowpath() itself is. They share a lot of code
> (there's some interesting #include games going on to make
> queued_spin_lock_slowpath() actually *be*
> __pv_queued_spin_lock_slowpath() with some magic hooks), but there
> might be issues.

  0.03 │   data16 data16 data16 xchg %ax,%ax
       │   push   %rbp
  0.00 │   mov    %rsp,%rbp
  0.01 │   push   %r15
       │   push   %r14
       │   push   %r13
  0.01 │   push   %r12
       │   mov    $0x18740,%r12
       │   push   %rbx
       │   mov    %rdi,%rbx
       │   sub    $0x10,%rsp
       │   add    %gs:0x7ef0d0e0(%rip),%r12
       │   movslq 0xc(%r12),%rax
  0.02 │   mov    %gs:0x7ef0d0db(%rip),%r15d
       │   add    $0x1,%r15d
       │   shl    $0x12,%r15d
       │   lea    0x1(%rax),%edx
  0.01 │   mov    %edx,0xc(%r12)
       │   mov    %eax,%edx
       │   shl    $0x4,%rax
       │   add    %rax,%r12
       │   shl    $0x10,%edx
       │   movq   $0x0,(%r12)
  0.02 │   or     %edx,%r15d
       │   mov    %gs:0x7ef0d0ad(%rip),%eax
  0.00 │   movl   $0x0,0x8(%r12)
  0.01 │   mov    %eax,0x40(%r12)
       │   movb   $0x0,0x44(%r12)
       │   mov    (%rdi),%eax
  0.88 │   test   %ax,%ax
       │   jne    8f
  0.02 │   mov    $0x1,%edx
       │   lock   cmpxchg %dl,(%rdi)
  0.38 │   test   %al,%al
       │   je     14a
  0.02 │ 8f:   mov    %r15d,%eax
       │   shr    $0x10,%eax
       │   xchg   %ax,0x2(%rbx)
  2.07 │   shl    $0x10,%eax
       │   test   %eax,%eax
       │   jne    171
       │   movq   $0x0,-0x30(%rbp)
  0.02 │ ac:   movzbl 0x44(%r12),%eax
  0.97 │   mov    $0x1,%r13d
       │   mov    $0x100,%r14d
       │   cmp    $0x2,%al
       │   sete   %al
       │   movzbl %al,%eax
       │   mov    %rax,-0x38(%rbp)
  0.00 │ ca:   movb   $0x0,0x44(%r12)
  0.00 │   mov    $0x8000,%edx
       │   movb   $0x1,0x1(%rbx)
       │   jmp    e6
  0.04 │ db:   pause
  8.04 │   sub    $0x1,%edx
       │   je     229
       │ e6:   movzbl (%rbx),%eax
  7.54 │   test   %al,%al
       │   jne    db
  0.10 │   mov    %r14d,%eax
  0.06 │   lock   cmpxchg %r13w,(%rbx)
  

Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-16 Thread Linus Torvalds
Mel,
 thanks for taking a look. Your theory sounds more complete than mine,
and since Dave is able to see the problem with 4.7, it would be nice
to hear about the 4.6 behavior and commit ede37713737 in particular.

That one seems more likely to affect contention than the zone/node one
I found during the merge window anyway, since it actually removes a
sleep in kswapd during congestion.

I've always preferred to see direct reclaim as the primary model for
reclaim, partly in order to throttle the actual "bad" process, but
also because "kswapd uses lots of CPU time" is such a nasty thing to
even begin guessing about.

So I have to admit to liking that "make kswapd sleep a bit if it's
just looping" logic that got removed in that commit.

And looking at DaveC's numbers, it really feels like it's not even
what we do inside the locked region that is the problem. Sure,
__delete_from_page_cache() (which is most of it) is at 1.86% of CPU
time (when including all the things it calls), but that still isn't
all that much. Especially when compared to just:

   0.78%  [kernel]  [k] _raw_spin_unlock_irqrestore

from his flat profile. That's not some spinning wait, that's just
releasing the lock with a single write (and the popf, but while that's
an expensive instruction, it's just tens of cpu cycles).

So I'm more and more getting the feeling that it's not what we do
inside the lock that is problematic. I started out blaming memcg
accounting or something, but none of the numbers seem to back that up.
So it's primarily really just the fact that kswapd is simply hammering
on that lock way too much.

So yeah, I'm blaming kswapd itself doing something wrong. It's not a
problem in a single-node environment (since there's only one), but
with multiple nodes it clearly just devolves.

Yes, we could try to batch the locking like DaveC already suggested
(ie we could move the locking to the caller, and then make
shrink_page_list() just try to keep the lock held for a few pages if
the mapping doesn't change), and that might result in fewer crazy
cacheline ping-pongs overall. But that feels like exactly the wrong
kind of workaround.

I'd much rather re-instate some "if kswapd is just spinning on CPU
time and not actually improving IO parallelism, kswapd should just get
the hell out" logic.
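
One crude way to put a number on "spinning without making progress" would be a reclaim-efficiency check along these lines (thresholds and names entirely invented, just to make the idea concrete):

/* Illustrative only: back off when kswapd reclaims little of what it scans. */
#include <stdbool.h>
#include <stdio.h>

static bool kswapd_should_back_off(unsigned long scanned, unsigned long reclaimed)
{
	if (scanned < 1024)		/* too little data to judge */
		return false;
	/* Reclaimed less than 1/8th of what was scanned: mostly churning. */
	return reclaimed * 8 < scanned;
}

int main(void)
{
	printf("%d\n", kswapd_should_back_off(32768, 31000));	/* 0: productive */
	printf("%d\n", kswapd_should_back_off(32768, 500));	/* 1: spinning */
	return 0;
}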

Adding Michal Hocko to the participant list too, I think he's one of
the gang in this area. Who else should be made aware of this thread?
Minchan? Vladimir?

[ I'm assuming the new people can look up this thread on lkml. Note to
new people: the subject line (and about 75% of the posts) are about an
unrelated AIM7 regression, but there's this sub-thread about nasty
lock contention on mapping->tree_lock within that bigger context ]

  Linus

On Tue, Aug 16, 2016 at 8:05 AM, Mel Gorman  wrote:
>
> However, historically there have been multiple indirect throttling mechanisms
> that were branded as congestion control but basically said "I don't know
> what's going on so it's nap time". Many of these have been removed over
> time and the last major one was ede37713737 ("mm: throttle on IO only when
> there are too many dirty and writeback pages").
>
> Before that commit, a process that entered direct reclaim and failed to make
> progress would sleep before retrying. It's possible that sleep was enough
> to reduce contention by temporarily stalling the writer and letting reclaim
> make progress. After that commit, it may only do a cond_resched() check
> and go back to allocating/reclaiming as quickly as possible. This active
> writer may be enough to increase contention. If so, it also means it
> stops kswapd making forward progress, leading to more direct reclaim and
> more contention.
>
> It's not a perfect theory and assumes;
>
> 1. The writer is direct reclaiming
> 2. The writer was previously failing to __remove_mapping
> 3. The writer calling congestion_wait due to __remove_mapping failing
>was enough to allow kswapd or writeback to make enough progress to
>avoid contention
> 4. The writer staying awake allocating and dirtying pages is keeping all
>the kswapd instances awake and writeback continually active and
>increasing the contention overall.
>
> If it was possible to trigger this problem in 4.7 then it would also be
> worth checking 4.6. If 4.6 is immune, check that before and after commit
> ede37713737.
>
> --
> Mel Gorman
> SUSE Labs


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-16 Thread Mel Gorman
On Mon, Aug 15, 2016 at 04:48:36PM -0700, Linus Torvalds wrote:
> On Mon, Aug 15, 2016 at 4:20 PM, Linus Torvalds
>  wrote:
> >
> > None of this code is all that new, which is annoying. This must have
> > gone on forever,
> 
> ... ooh.
> 
> Wait, I take that back.
> 
> We actually have some very recent changes that I didn't even think
> about that went into this very merge window.
> 
> In particular, I wonder if it's all (or at least partly) due to the
> new per-node LRU lists.
> 
> So in shrink_page_list(), when kswapd is encountering a page that is
> under page writeback due to page reclaim, it does:
> 
> if (current_is_kswapd() &&
> PageReclaim(page) &&
> test_bit(PGDAT_WRITEBACK, &pgdat->flags)) {
> nr_immediate++;
> goto keep_locked;
> 

I have a limited view of the full topic as I've been in meetings all day
and have another 3 hours to go. I'll set time aside tomorrow to look closer
but there is a theory at the end of the mail.

Node-lru does alter what locks are contended and affects the timing of some
issues but this spot feels like a bad fit. That logic controls whether kswapd
will stall due to dirty/writeback pages reaching the tail of the LRU too
quickly. It can affect lru_lock contention that may be worse with node-lru,
particularly on single-node machines but a workload of a streaming writer
is unlikely to hit that unless the underlying storage is extremely slow.

Another alteration of node-lru potentially affects when buffer heads get
stripped but that's also a poor fit.

I'm not willing to rule out node-lru because it may be wishful thinking
but it feels unlikely.

> which basically ignores that page and puts it back on the LRU list.
> 
> But that "is this node under writeback" is new - it now does that per
> node, and it *used* to do it per zone (so it _used_ to test "is this
> zone under writeback").
> 

Superficially, a small high zone would affect the timing of when a zone
got marked congested and triggered a sleep. Sleeping avoids new pages being
allocated/dirtied and may reduce contention. However, quick sleeps due to
small zones were offset by the fair zone allocation policy and are still
offset by GFP_WRITE distributing dirty pages on different zones. The
timing of when sleeps occur due to excessive dirty pages at the tail of
the LRU should be roughly similar with either zone-lru or node-lru.

> All the mapping pages used to be in the same zone, so I think it
> effectively single-threaded the kswapd reclaim for one mapping under
> reclaim writeback. But in your cases, you have multiple nodes...
> 
> Ok, that's a lot of hand-wavy new-age crystal healing thinking.
> 
> Really, I haven't looked at it more than "this is one thing that has
> changed recently, I wonder if it changes the patterns and could
> explain much higher spin_lock contention on the mapping->tree_lock".
> 
> I'm adding Mel Gorman and his band of miscreants to the cc, so that
> they can tell me that I'm full of shit, and completely missed on what
> that zone->node change actually ends up meaning.
> 
> Mel? The issue is that Dave Chinner is seeing some nasty spinlock
> contention on "mapping->tree_lock":
> 

Band Of Miscreants may be the new name for the MM track at LSF/MM.  In the
meantime let's try some hand waving:

A single-threaded file write on a 4-node system is going to have 4 kswapd
instances, writeback and potentially the writer itself all reclaiming.
Given the workload, it's likely that almost all pages have the same
mapping. As they are contending on __remove_mapping, the pages must be
clean when the attempt to reclaim was made and buffers stripped.

The throttling mechanisms for kswapd and direct reclaim rely on either
too many pages being isolated (unlikely to fire in this case) or too many
dirty/writeback pages reaching the end of the LRU. There is not a direct
throttling mechanism for excessive lock contention.

However, historically there have been multiple indirect throttling mechanisms
that were branded as congestion control but basically said "I don't know
what's going on so it's nap time". Many of these have been removed over
time and the last major one was ede37713737 ("mm: throttle on IO only when
there are too many dirty and writeback pages").

Before that commit, a process that entered direct reclaim and failed to make
progress would sleep before retrying. It's possible that sleep was enough
to reduce contention by temporarily stalling the writer and letting reclaim
make progress. After that commit, it may only do a cond_resched() check
and go back to allocating/reclaiming as quickly as possible. This active
writer may be enough to increase contention. If so, it also means it
stops kswapd making forward progress, leading to more direct reclaim and
more contention.

It's not a perfect theory and assumes;

1. The writer is direct 

Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-16 Thread Fengguang Wu

On Sun, Aug 14, 2016 at 06:17:24PM +0200, Christoph Hellwig wrote:

Snipping the long contest:

I think there are three observations here:

(1) removing the mark_page_accessed (which is the only significant
change in the parent commit)  hurts the
aim7/1BRD_48G-xfs-disk_rr-3000-performance/ivb44 test.
I'd still rather stick to the filemap version and let the
VM people sort it out.  How do the numbers for this test
look for XFS vs say ext4 and btrfs?


Here is a basic comparison of the 3 filesystems based on 99091700 ("
Merge tag 'nfs-for-4.8-2' of 
git://git.linux-nfs.org/projects/trondmy/linux-nfs").

% compare -a -g 99091700659f4df965e138b38b4fa26a29b7eade -d fs xfs ext4 btrfs

xfsext4   btrfs  
testcase/testparams/testbox
  --  --  
---
%stddev  change %stddev  change %stddev
\  |\  |\
   193335 -27% 141400-100%  8
GEO-MEAN aim7.jobs-per-min
   267649 ±  3%   -51% 130085
aim7/1BRD_48G-disk_cp-3000-performance/ivb44
   485217 ±  3%   402%2434088 ±  3%   350%2184471 ±  4%  
aim7/1BRD_48G-disk_rd-9000-performance/ivb44
   360286 -64% 130351
aim7/1BRD_48G-disk_rr-3000-performance/ivb44
   338114 -78%  73280
aim7/1BRD_48G-disk_rw-3000-performance/ivb44
60130 ±  5%   361% 277035
aim7/1BRD_48G-disk_src-3000-performance/ivb44
   403144 -68% 127584
aim7/1BRD_48G-disk_wrt-3000-performance/ivb44
26327 -60%  10571
aim7/1BRD_48G-sync_disk_rw-600-performance/ivb44

xfsext4   btrfs
  --  --
 2652 -96%118 -82%468
GEO-MEAN fsmark.files_per_sec
  393 ±  4%-6%368 ±  3%10%433 ±  5%  
fsmark/1x-1t-1BRD_48G-4M-40G-NoSync-performance/ivb44
  200  -4%191  -7%185 ±  6%  
fsmark/1x-1t-1BRD_48G-4M-40G-fsyncBeforeClose-performance/ivb44
 1583 ±  3%   -29%   1130 -31%   1088
fsmark/1x-64t-1BRD_48G-4M-40G-fsyncBeforeClose-performance/ivb44
21363  59%  33958
fsmark/8-1SSD-16-9B-48G-fsyncBeforeClose-16d-256fpd-performance/lkp-hsw-ep4
11033 -17%   9117
fsmark/8-1SSD-4-8K-24G-fsyncBeforeClose-16d-256fpd-performance/lkp-hsw-ep4
11833  12%  13234
fsmark/8-1SSD-4-9B-16G-fsyncBeforeClose-16d-256fpd-performance/lkp-hsw-ep4

xfsext4   btrfs
  --  --
  2381976-100%   6598 -96% 100973
GEO-MEAN fsmark.app_overhead
   564520 ±  7%21% 681192 ±  3%63% 919364 ±  3%  
fsmark/1x-64t-1BRD_48G-4M-40G-NoSync-performance/ivb44
   860074 ±  5%   112%1820590 ± 14%47%1262443 ±  3%  
fsmark/1x-64t-1BRD_48G-4M-40G-fsyncBeforeClose-performance/ivb44
 12232633 -18%   10085199
fsmark/8-1SSD-16-9B-48G-fsyncBeforeClose-16d-256fpd-performance/lkp-hsw-ep4
  3143334 -11%2784178
fsmark/8-1SSD-4-8K-24G-fsyncBeforeClose-16d-256fpd-performance/lkp-hsw-ep4
  4107347 -21%3248210
fsmark/8-1SSD-4-9B-16G-fsyncBeforeClose-16d-256fpd-performance/lkp-hsw-ep4

Thanks,
Fengguang
---

Some less important numbers.

xfsext4   btrfs
  --  --
 1314 222%   4225-100%  2
GEO-MEAN aim7.time.system_time
 1491 ±  6%   302%   6004
aim7/1BRD_48G-disk_cp-3000-performance/ivb44
 4786 ±  3%   -89%502 ±  7%   -87%632 ±  7%  
aim7/1BRD_48G-disk_rd-9000-performance/ivb44
  756 689%   5971
aim7/1BRD_48G-disk_rr-3000-performance/ivb44
  8911146%  11108
aim7/1BRD_48G-disk_rw-3000-performance/ivb44
  940 ±  5%70%   1598
aim7/1BRD_48G-disk_src-3000-performance/ivb44
  599 925%   

Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-16 Thread Fengguang Wu

On Tue, Aug 16, 2016 at 07:22:40AM +1000, Dave Chinner wrote:

On Mon, Aug 15, 2016 at 10:14:55PM +0800, Fengguang Wu wrote:

Hi Christoph,

On Sun, Aug 14, 2016 at 06:17:24PM +0200, Christoph Hellwig wrote:
>Snipping the long contest:
>
>I think there are three observations here:
>
>(1) removing the mark_page_accessed (which is the only significant
>change in the parent commit)  hurts the
>aim7/1BRD_48G-xfs-disk_rr-3000-performance/ivb44 test.
>I'd still rather stick to the filemap version and let the
>VM people sort it out.  How do the numbers for this test
>look for XFS vs say ext4 and btrfs?
>(2) lots of additional spinlock contention in the new case.  A quick
>check shows that I fat-fingered my rewrite so that we do
>the xfs_inode_set_eofblocks_tag call now for the pure lookup
>case, and pretty much all new cycles come from that.
>(3) Boy, are those xfs_inode_set_eofblocks_tag calls expensive, and
>we're already doing way too many even without my little bug above.
>
>So I've force pushed a new version of the iomap-fixes branch with
>(2) fixed, and also a little patch to xfs_inode_set_eofblocks_tag a
>lot less expensive slotted in before that.  Would be good to see
>the numbers with that.

The aim7 1BRD tests finished and there are ups and downs, with overall
performance remaining flat.

99091700659f4df9  74a242ad94d13436a1644c0b45  bf4dc6e4ecc2a3d042029319bc  
testcase/testparams/testbox
  --  --  
---


What do these commits refer to, please? They mean nothing without
the commit names

/me goes searching. Ok:

99091700659 is the top of Linus' tree
74a242ad94d is 


That's the below one's parent commit, 74a242ad94d ("xfs: make
xfs_inode_set_eofblocks_tag cheaper for the common case").

Typically we'll compare a commit with its parent commit, and/or
the branch's base commit, which is normally on mainline kernel.


bf4dc6e4ecc is the latest in Christoph's tree (because it's
mentioned below)


%stddev %change %stddev %change %stddev
\  |\  |
\ 159926  157324  158574
GEO-MEAN aim7.jobs-per-min
70897   5%  74137   4%  73775
aim7/1BRD_48G-xfs-creat-clo-1500-performance/ivb44
   485217 ±  3%492431  477533
aim7/1BRD_48G-xfs-disk_rd-9000-performance/ivb44
   360451 -19% 292980 -17% 299377
aim7/1BRD_48G-xfs-disk_rr-3000-performance/ivb44


So, why does random read go backwards by 20%? The iomap IO path
patches we are testing only affect the write path, so this
doesn't make a whole lot of sense.


   338114  338410   5% 354078
aim7/1BRD_48G-xfs-disk_rw-3000-performance/ivb44
60130 ±  5% 4%  62438   5%  62923
aim7/1BRD_48G-xfs-disk_src-3000-performance/ivb44
   403144  397790  410648
aim7/1BRD_48G-xfs-disk_wrt-3000-performance/ivb44


And this is the test the original regression was reported for:

gcc-6/performance/profile/1BRD_48G/xfs/x86_64-rhel/3000/debian-x86_64-2015-02-07.cgz/ivb44/disk_wrt/aim7

And that shows no improvement at all. The original regression was:

484435 ±  0% -13.3% 420004 ±  0%  aim7.jobs-per-min

So it's still 15% down on the original performance which, again,
doesn't make a whole lot of sense given the improvement in so many
other tests I've run


Yes, the same performance with 4.8-rc1 means the regression has still not
recovered relative to the parent of the originally reported first-bad
commit, f0c6bcba74ac51cb ("xfs: reorder zeroing and flushing sequence in
truncate"), which is on 4.7-rc1.


26327   26534   26128
aim7/1BRD_48G-xfs-sync_disk_rw-600-performance/ivb44

The new commit bf4dc6e ("xfs: rewrite and optimize the delalloc write
path") improves the aim7/1BRD_48G-xfs-disk_rw-3000-performance/ivb44
case by 5%. Here are the detailed numbers:

aim7/1BRD_48G-xfs-disk_rw-3000-performance/ivb44


Not important at all. We need the results for the disk_wrt regression
we are chasing (disk_wrt-3000) so we can see how the code change
affected behaviour.


Yeah, it may not be relevant to this case study; however, it should help
evaluate the patch in a more complete way.


Here are the detailed numbers for the slowed down case:

aim7/1BRD_48G-xfs-disk_rr-3000-performance/ivb44

99091700659f4df9  bf4dc6e4ecc2a3d042029319bc
  --
%stddev  change %stddev
\  |\
   360451 -17% 299377    aim7.jobs-per-min
12806 481%  74447
aim7.time.involuntary_context_switches

.

19377 459% 

Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-15 Thread Linus Torvalds
On Mon, Aug 15, 2016 at 5:19 PM, Dave Chinner  wrote:
>
>> None of this code is all that new, which is annoying. This must have
>> gone on forever,
>
> Yes, it has been. Just worse than I've noticed before, probably
> because of all the stuff put under the tree lock in the past couple
> of years.

So this is where a good profile can matter.

Particularly if it's all about kswapd, and all the contention is just
from __remove_mapping(), what should matter is the "all the stuff"
added *there* and absolutely nowhere else.

Sadly (well, not for me), in my profiles I have

 --3.37%--kswapd
   |
--3.36%--shrink_node
  |
  |--2.88%--shrink_node_memcg
  |  |
  |   --2.87%--shrink_inactive_list
  | |
  | |--2.55%--shrink_page_list
  | |  |
  | |  |--0.84%--__remove_mapping
  | |  |  |
  | |  |  |--0.37%--__delete_from_page_cache
  | |  |  |  |
  | |  |  |   --0.21%--radix_tree_replace_clear_tags
  | |  |  | |
  | |  |  |  --0.12%--__radix_tree_lookup
  | |  |  |
  | |  |   --0.23%--_raw_spin_lock_irqsave
  | |  | |
  | |  |  --0.11%--queued_spin_lock_slowpath
  | |  |
   


which is rather different from your 22% spin-lock overhead.

Anyway, including the direct reclaim call paths gets
__remove_mapping() a bit higher, and _raw_spin_lock_irqsave climbs to
0.26%. But perhaps more importantly, looking at what __remove_mapping
actually *does* (apart from the spinlock) gives us:

 - inside remove_mapping itself (0.11% on its own - flat cost, no
child accounting)

48.50 │   lock   cmpxchg %edx,0x1c(%rbx)

so that's about 0.05%

 - 0.40% __delete_from_page_cache (0.22%
radix_tree_replace_clear_tags, 0.13%__radix_tree_lookup)

 - 0.06% workingset_eviction()

so I'm not actually seeing anything *new* expensive in there. The
__delete_from_page_cache() overhead may have changed a bit with the
tagged tree changes, but this doesn't look like memcg.

But we clearly have very different situations.

What does your profile show when you actually dig into
__remove_mapping() itself? Looking at your flat profile, I'm assuming
you get

   1.31%  [kernel]  [k] __radix_tree_lookup
   1.22%  [kernel]  [k] radix_tree_tag_set
   1.14%  [kernel]  [k] __remove_mapping

which is higher (but part of why my percentages are lower is that I
have that "50% CPU used for encryption" on my machine).

But I'm not seeing anything I'd attribute to "all the stuff added".
For example, originally I would have blamed memcg, but that's not
actually in this path at all.

I come back to wondering whether maybe you're hitting some PV-lock problem.

I know queued_spin_lock_slowpath() is ok. I'm not entirely sure
__pv_queued_spin_lock_slowpath() is.

So I'd love to see you try the non-PV case, but I also think it might
be interesting to see what the instruction profile for
__pv_queued_spin_lock_slowpath() itself is. They share a lot of code
(there's some interesting #include games going on to make
queued_spin_lock_slowpath() actually *be*
__pv_queued_spin_lock_slowpath() with some magic hooks), but there
might be issues.
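
A sketch of both experiments, assuming the guest kernel tree and perf are
at hand in the guest; the -j16 and the 30-second sampling window are
arbitrary:

# non-PV comparison: rebuild the guest kernel without paravirt spinlocks
scripts/config --disable PARAVIRT_SPINLOCKS
make olddefconfig && make -j16
# instruction-level profile of the slowpath while the workload runs
perf record -a -g -- sleep 30
perf annotate __pv_queued_spin_lock_slowpath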

For example, if you run a virtual 16-core system on a physical machine
that then doesn't consistently give 16 cores to the virtual machine,
you'll get no end of hiccups.

Because as mentioned, we've had bugs ("performance anomalies") there before.

   Linus


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-15 Thread Linus Torvalds
On Mon, Aug 15, 2016 at 5:38 PM, Dave Chinner  wrote:
>
> Same in 4.7 (flat profile numbers climbed higher after this
> snapshot was taken, as can be seen by the callgraph numbers):

Ok, so it's not the zone-vs-node thing.  It's just that nobody has
looked at that load in recent times.

Where "recent" may be years, of course.

   Linus


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-15 Thread Linus Torvalds
On Mon, Aug 15, 2016 at 5:17 PM, Dave Chinner  wrote:
>
> Read the code, Linus?

I am. It's how I came up with my current pet theory.

But I don't actually have enough sane numbers to make it much more
than a cute pet theory. It *might* explain why you see tons of kswap
time and bad lock contention where it didn't use to exist, but ..

I can't recreate the problem, and your old profiles were bad enough
that they aren't really worth looking at.

> Except they *aren't broken*. They are simply *less accurate* than
> they could be.

They are so much less accurate that quite frankly, there's no point in
looking at them outside of "there is contention on the lock".

And considering that the numbers didn't even change when you had
spinlock debugging on, it's not the lock itself that causes this, I'm
pretty sure.

Because when you have normal contention due to the *locking* itself
being the problem, it tends to absolutely _explode_ with the debugging
spinlocks, because the lock itself becomes much more expensive.
Usually super-linearly.

But that wasn't the case here. The numbers stayed constant.

So yeah, I started looking at bigger behavioral issues, which is why I
zeroed in on that zone-vs-node change. But it might be a completely
broken theory. For example, if you still have the contention when
running plain 4.7, that theory was clearly complete BS.

And this is where "less accurate" means that they are almost entirely useless.

More detail needed. It might not be in the profiles themselves, of
course. There might be other much more informative sources if you can
come up with anything...

   Linus


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-15 Thread Dave Chinner
On Mon, Aug 15, 2016 at 04:48:36PM -0700, Linus Torvalds wrote:
> On Mon, Aug 15, 2016 at 4:20 PM, Linus Torvalds
>  wrote:
> >
> > None of this code is all that new, which is annoying. This must have
> > gone on forever,
> 
> ... ooh.
> 
> Wait, I take that back.
> 
> We actually have some very recent changes that I didn't even think
> about that went into this very merge window.

> Mel? The issue is that Dave Chinner is seeing some nasty spinlock
> contention on "mapping->tree_lock":
> 
> >   31.18%  [kernel]  [k] __pv_queued_spin_lock_slowpath
> 
> and one of the main paths is this:
> 
> >- 30.29% kswapd
> >   - 30.23% shrink_node
> >  - 30.07% shrink_node_memcg.isra.75
> > - 30.15% shrink_inactive_list
> >- 29.49% shrink_page_list
> >   - 22.79% __remove_mapping
> >  - 22.27% _raw_spin_lock_irqsave
> >   __pv_queued_spin_lock_slowpath
> 
> so there's something ridiculously bad going on with a fairly simple benchmark.
> 
> Dave's benchmark is literally just a "write a new 48GB file in
> single-page chunks on a 4-node machine". Nothing odd - not rewriting
> files, not seeking around, no nothing.
> 
> You can probably recreate it with a silly
> 
>   dd bs=4096 count=$((12*1024*1024)) if=/dev/zero of=bigfile
> 
> although Dave actually had something rather fancier, I think.

16p, 16GB RAM, fake_numa=4. Overwrite a 47GB file on a 48GB
filesystem:

# mkfs.xfs -f -d size=48g /dev/vdc
# mount /dev/vdc /mnt/scratch
# xfs_io -f -c "pwrite 0 47g" /mnt/scratch/fooey

Wait for memory to fill and reclaim to kick in, then look at the
profile. If you run it a second time, reclaim kicks in straight
away.
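
For reference, one way to grab the reclaim-phase profile while the
overwrite runs (system-wide, with call graphs); the 60-second window and
the output file name are arbitrary:

perf record -a -g -o reclaim.data -- sleep 60
perf report -i reclaim.data --no-children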

It's not the new code in 4.8 - it reproduces on 4.7 just fine, and
probably will reproduce all the way back to when the memcg-aware
writeback code was added

-Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-15 Thread Dave Chinner
On Mon, Aug 15, 2016 at 04:20:55PM -0700, Linus Torvalds wrote:
> On Mon, Aug 15, 2016 at 3:42 PM, Dave Chinner  wrote:
> >
> >   31.18%  [kernel]  [k] __pv_queued_spin_lock_slowpath
> >9.90%  [kernel]  [k] copy_user_generic_string
> >3.65%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
> >2.62%  [kernel]  [k] __block_commit_write.isra.29
> >2.26%  [kernel]  [k] _raw_spin_lock_irqsave
> >1.72%  [kernel]  [k] _raw_spin_lock
> 
> Ok, this is more like it.
> 
> I'd still like to see it on raw hardware, just to see if we may have a
> bug in the PV code. Because that code has been buggy before. I
> *thought* we fixed it, but ...
> 
> In fact, you don't even need to do it outside of virtualization, but
> with paravirt disabled (so that it runs the native non-pv locking in
> the virtual machine).
> 
> >36.60% 0.00%  [kernel][k] kswapd
> >- 30.29% kswapd
> >   - 30.23% shrink_node
> >  - 30.07% shrink_node_memcg.isra.75
> > - 30.15% shrink_inactive_list
> >- 29.49% shrink_page_list
> >   - 22.79% __remove_mapping
> >  - 22.27% _raw_spin_lock_irqsave
> >   __pv_queued_spin_lock_slowpath
> 
> How I dislike the way perf shows the call graph data... Just last week
> I was talking to Arnaldo about how to better visualize the cost of
> spinlocks, because the normal way "perf" shows costs is so nasty.

Do not change it - it's the way call graph profiles have been
presented for the past 20 years. I hate it when long standing
conventions are changed because one person doesn't like them and
everyone else has to relearn skills the haven't had to think about
for years

> What happens is that you see that 36% of CPU time is attributed to
> kswapd, and then you can drill down and see where that 36% comes from.
> So far so good, and that's what perf does fairly well.
> 
> But then when you find the spinlock, you actually want to go the other
> way, and instead ask it to show "who were the callers to this routine
> and what were the percentages", so that you can then see whether (for
> example) it's just that __remove_mapping() use that contends with
> itself, or whether it's contending with the page additions or
> whatever..

Um, perf already does that:

-   31.55%  31.55%  [kernel]  [k] __pv_queued_spin_lock_slowpath
   - 19.83% ret_from_fork
  - kthread
 - 18.55% kswapd
  shrink_node
  shrink_node_memcg.isra.75
  shrink_inactive_list
   1.76% worker_thread
  process_one_work
  wb_workfn 
  wb_writeback
  __writeback_inodes_wb
  writeback_sb_inodes 
  __writeback_single_inode
  do_writepages
  xfs_vm_writepages
  write_cache_pages
  xfs_do_writepage
   + 5.95% __libc_pwrite

I have that right here because *it's a view of the profile I've
already looked at*. I didn't post it because, well, it's shorter to
simply say "contention is from in kswapd".
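
For what it's worth, the ordering of the expanded call chains can be
picked explicitly, assuming a perf new enough to accept the order field
of -g (0.5 is just the display threshold):

perf report -g graph,0.5,caller
perf report -g graph,0.5,callee
# and --no-children falls back to the classic self-overhead sort
perf report --no-children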

> So what I'd like to see (and this is where it becomes *so* much more
> useful to be able to recreate it myself so that I can play with the
> perf data several different ways) is to see what the profile looks
> like in that spinlocked region.

Boot your machine with "fake_numa=4", and play till your heart is
content. That's all I do with my test VMs to make them exercise NUMA
paths.

> None of this code is all that new, which is annoying. This must have
> gone on forever,

Yes, it has been. Just worse than I've noticed before, probably
because of all the stuff put under the tree lock in the past couple
of years.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-15 Thread Dave Chinner
On Mon, Aug 15, 2016 at 05:15:47PM -0700, Linus Torvalds wrote:
> DaveC - does the spinlock contention go away if you just go back to
> 4.7? If so, I think it's the new zone thing. But it would be good to
> verify - maybe it's something entirely different and it goes back much
> further.

Same in 4.7 (flat profile numbers climbed higher after this
snapshot was taken, as can be seen by the callgraph numbers):

  29.47%  [kernel]  [k] __pv_queued_spin_lock_slowpath
  11.59%  [kernel]  [k] copy_user_generic_string
   3.13%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
   2.87%  [kernel]  [k] __block_commit_write.isra.29
   2.02%  [kernel]  [k] _raw_spin_lock_irqsave
   1.77%  [kernel]  [k] get_page_from_freelist
   1.36%  [kernel]  [k] __wake_up_bit 
   1.31%  [kernel]  [k] __radix_tree_lookup
   1.22%  [kernel]  [k] radix_tree_tag_set
   1.16%  [kernel]  [k] clear_page_dirty_for_io
   1.14%  [kernel]  [k] __remove_mapping
   1.14%  [kernel]  [k] _raw_spin_lock
   1.00%  [kernel]  [k] zone_dirty_ok
   0.95%  [kernel]  [k] radix_tree_tag_clear
   0.90%  [kernel]  [k] generic_write_end
   0.89%  [kernel]  [k] __delete_from_page_cache
   0.87%  [kernel]  [k] unlock_page
   0.86%  [kernel]  [k] cancel_dirty_page
   0.81%  [kernel]  [k] up_write
   0.80%  [kernel]  [k] ___might_sleep
   0.77%  [kernel]  [k] _raw_spin_unlock_irqrestore
   0.75%  [kernel]  [k] generic_perform_write
   0.72%  [kernel]  [k] xfs_do_writepage  
   0.69%  [kernel]  [k] down_write
   0.63%  [kernel]  [k] shrink_page_list  
   0.63%  [kernel]  [k] __xfs_get_blocks  
   0.61%  [kernel]  [k] __test_set_page_writeback
   0.59%  [kernel]  [k] free_hot_cold_page
   0.57%  [kernel]  [k] write_cache_pages
   0.56%  [kernel]  [k] __radix_tree_create
   0.55%  [kernel]  [k] __list_add
   0.53%  [kernel]  [k] page_mapping
   0.53%  [kernel]  [k] drop_buffers
   0.51%  [kernel]  [k] xfs_vm_releasepage
   0.51%  [kernel]  [k] free_pcppages_bulk
   0.50%  [kernel]  [k] __list_del_entry  


   38.07%    38.07%  [kernel]    [k] __pv_queued_spin_lock_slowpath
   - 25.52% ret_from_fork
  - kthread
 - 24.36% kswapd
  shrink_zone
  shrink_zone_memcg.isra.73
  shrink_inactive_list
 - 3.21% worker_thread
  process_one_work
  wb_workfn
  wb_writeback
  __writeback_inodes_wb
  writeback_sb_inodes
  __writeback_single_inode
  do_writepages
  xfs_vm_writepages
  write_cache_pages
   - 10.06% __libc_pwrite
entry_SYSCALL_64_fastpath
sys_pwrite64
vfs_write
__vfs_write
xfs_file_write_iter
xfs_file_buffered_aio_write
  - generic_perform_write
 - 5.51% xfs_vm_write_begin
- 4.94% grab_cache_page_write_begin
 pagecache_get_page
  0.57% __block_write_begin
 create_page_buffers
 create_empty_buffers
 _raw_spin_lock
 __pv_queued_spin_lock_slowpath
 - 4.88% xfs_vm_write_end
  generic_write_end
  block_write_end
  __block_commit_write.isra.29
  mark_buffer_dirty
  __set_page_dirty

-Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-15 Thread Dave Chinner
On Mon, Aug 15, 2016 at 04:01:00PM -0700, Linus Torvalds wrote:
> On Mon, Aug 15, 2016 at 3:22 PM, Dave Chinner  wrote:
> >
> > Right, but that does not make the profile data useless,
> 
> Yes it does. Because it basically hides everything that happens inside
> the lock, which is what causes the contention in the first place.

Read the code, Linus?

> So stop making inane and stupid arguments, Dave.

We know what happens inside the lock, and we know exactly how much
it is supposed to cost. And it isn't anywhere near as much as the
profiles indicate the function that contains the lock is costing.

Occam's Razor leads to only one conclusion, like it or not

> Your profiles are shit. Deal with it, or accept that nobody is ever
> going to bother working on them because your profiles don't give
> useful information.
> 
> I see that you actually fixed your profiles, but quite frankly, the
> amount of pure unadulterated crap you posted in this email is worth
> reacting negatively to.

I'm happy to be told that I'm wrong *when I'm wrong*, but you always
say "read the code to understand a problem" rather than depending on
potentially unreliable tools and debug information that is gathered.

Yet when I do that using partial profile information, your reaction
is to tell me I am "full of shit" because my information isn't 100%
reliable? Really, Linus?

> You generally make so much sense that it's shocking to see you then
> make these crazy excuses for your completely broken profiles.

Except they *aren't broken*. They are simply *less accurate* than
they could be. That does not invalidate the profile nor does it mean
that the insight it gives us into the functioning of the code is
wrong.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-15 Thread Dave Chinner
On Mon, Aug 15, 2016 at 10:22:43AM -0700, Huang, Ying wrote:
> Hi, Chinner,
> 
> Dave Chinner  writes:
> 
> > On Wed, Aug 10, 2016 at 06:00:24PM -0700, Linus Torvalds wrote:
> >> On Wed, Aug 10, 2016 at 5:33 PM, Huang, Ying  wrote:
> >> >
> >> > Here it is,
> >> 
> >> Thanks.
> >> 
> >> Appended is a munged "after" list, with the "before" values in
> >> parenthesis. It actually looks fairly similar.
> >> 
> >> The biggest difference is that we have "mark_page_accessed()" show up
> >> after, and not before. There was also a lot of LRU noise in the
> >> non-profile data. I wonder if that is the reason here: the old model
> >> of using generic_perform_write/block_page_mkwrite didn't mark the
> >> pages accessed, and now with iomap_file_buffered_write() they get
> >> marked as active and that screws up the LRU list, and makes us not
> >> flush out the dirty pages well (because they are seen as active and
> >> not good for writeback), and then you get bad memory use.
> >> 
> >> I'm not seeing anything that looks like locking-related.
> >
> > Not in that profile. I've been doing some local testing inside a
> > 4-node fake-numa 16p/16GB RAM VM to see what I can find.
> 
> You run the test in a virtual machine, I think that is why your perf
> data looks strange (high value of _raw_spin_unlock_irqrestore).
> 
> To setup KVM to use perf, you may refer to,
> 
> https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Virtualization_Tuning_and_Optimization_Guide/sect-Virtualization_Tuning_Optimization_Guide-Monitoring_Tools-vPMU.html
> https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Virtualization_Administration_Guide/sect-perf-mon.html
> 
> I haven't tested them.  You may Google to find more information.  Or the
> perf/kvm people can give you more information.

Thanks, "-cpu host" on the qemu command line works. I hate magic,
undocumented(*) features that are necessary to make basic stuff work. 

-Dave.

(*) yeah, try working out from the qemu/kvm man page that this
capability even exists.

-- 
Dave Chinner
da...@fromorbit.com


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-15 Thread Linus Torvalds
On Mon, Aug 15, 2016 at 4:20 PM, Linus Torvalds
 wrote:
>
> But I'll try to see what happens
> on my profile, even if I can't recreate the contention itself, just
> trying to see what happens inside of that region.

Yeah, since I run my machines on encrypted disks, my profile shows 60%
kthread, but that's just because 55% is crypto.

I only have 5% in kswapd. And the spinlock doesn't even show up for me
(but "__delete_from_page_cache()" does, which doesn't look
unreasonable).

And while the biggest reason the spinlock doesn't show up is likely
simply my single-node "everything is on one die", I still think the
lower kswapd CPU use might be partly due to the node-vs-zone thing.

For me, with just one node, the new

test_bit(PGDAT_WRITEBACK, &pgdat->flags)) {

ends up being very similar to what we used to have before, ie

test_bit(ZONE_WRITEBACK, &zone->flags)) {

but on a multi-node machine it would be rather different.

So I might never see contention anyway.

The basic logic in shrink_page_list() goes back to commit 283aba9f9e0
("mm: vmscan: block kswapd if it is encountering pages under
writeback") but it has been messed around with a lot (and something
else existed there before - we've always had some "throttle kswapd so
that it doesn't use insane amounts of CPU time").

DaveC - does the spinlock contention go away if you just go back to
4.7? If so, I think it's the new zone thing. But it would be good to
verify - maybe it's something entirely different and it goes back much
further.

Mel - I may be barking up entirely the wrong tree, but it would be
good if you could take a look just in case this is actually it.

  Linus


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-15 Thread Linus Torvalds
On Mon, Aug 15, 2016 at 4:20 PM, Linus Torvalds
 wrote:
>
> None of this code is all that new, which is annoying. This must have
> gone on forever,

... ooh.

Wait, I take that back.

We actually have some very recent changes that I didn't even think
about that went into this very merge window.

In particular, I wonder if it's all (or at least partly) due to the
new per-node LRU lists.

So in shrink_page_list(), when kswapd is encountering a page that is
under page writeback due to page reclaim, it does:

    if (current_is_kswapd() &&
        PageReclaim(page) &&
        test_bit(PGDAT_WRITEBACK, &pgdat->flags)) {
            nr_immediate++;
            goto keep_locked;

which basically ignores that page and puts it back on the LRU list.

But that "is this node under writeback" is new - it now does that per
node, and it *used* to do it per zone (so it _used_ to test "is this
zone under writeback").

All the mapping pages used to be in the same zone, so I think it
effectively single-threaded the kswapd reclaim for one mapping under
reclaim writeback. But in your cases, you have multiple nodes...

Ok, that's a lot of hand-wavy new-age crystal healing thinking.

Really, I haven't looked at it more than "this is one thing that has
changed recently, I wonder if it changes the patterns and could
explain much higher spin_lock contention on the mapping->tree_lock".

I'm adding Mel Gorman and his band of miscreants to the cc, so that
they can tell me that I'm full of shit, and completely missed on what
that zone->node change actually ends up meaning.

Mel? The issue is that Dave Chinner is seeing some nasty spinlock
contention on "mapping->tree_lock":

>   31.18%  [kernel]  [k] __pv_queued_spin_lock_slowpath

and one of the main paths is this:

>- 30.29% kswapd
>   - 30.23% shrink_node
>  - 30.07% shrink_node_memcg.isra.75
> - 30.15% shrink_inactive_list
>- 29.49% shrink_page_list
>   - 22.79% __remove_mapping
>  - 22.27% _raw_spin_lock_irqsave
>   __pv_queued_spin_lock_slowpath

so there's something ridiculously bad going on with a fairly simple benchmark.

Dave's benchmark is literally just a "write a new 48GB file in
single-page chunks on a 4-node machine". Nothing odd - not rewriting
files, not seeking around, no nothing.

You can probably recreate it with a silly

  dd bs=4096 count=$((12*1024*1024)) if=/dev/zero of=bigfile

although Dave actually had something rather fancier, I think.

 Linus


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-15 Thread Linus Torvalds
On Mon, Aug 15, 2016 at 3:42 PM, Dave Chinner  wrote:
>
>   31.18%  [kernel]  [k] __pv_queued_spin_lock_slowpath
>9.90%  [kernel]  [k] copy_user_generic_string
>3.65%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
>2.62%  [kernel]  [k] __block_commit_write.isra.29
>2.26%  [kernel]  [k] _raw_spin_lock_irqsave
>1.72%  [kernel]  [k] _raw_spin_lock

Ok, this is more like it.

I'd still like to see it on raw hardware, just to see if we may have a
bug in the PV code. Because that code has been buggy before. I
*thought* we fixed it, but ...

In fact, you don't even need to do it outside of virtualization, but
with paravirt disabled (so that it runs the native non-pv locking in
the virtual machine).

>36.60% 0.00%  [kernel][k] kswapd
>- 30.29% kswapd
>   - 30.23% shrink_node
>  - 30.07% shrink_node_memcg.isra.75
> - 30.15% shrink_inactive_list
>- 29.49% shrink_page_list
>   - 22.79% __remove_mapping
>  - 22.27% _raw_spin_lock_irqsave
>   __pv_queued_spin_lock_slowpath

How I dislike the way perf shows the call graph data... Just last week
I was talking to Arnaldo about how to better visualize the cost of
spinlocks, because the normal way "perf" shows costs is so nasty.

What happens is that you see that 36% of CPU time is attributed to
kswapd, and then you can drill down and see where that 36% comes from.
So far so good, and that's what perf does fairly well.

But then when you find the spinlock, you actually want to go the other
way, and instead ask it to show "who were the callers to this routine
and what were the percentages", so that you can then see whether (for
example) it's just that __remove_mapping() use that contends with
itself, or whether it's contending with the page additions or
whatever..

And perf makes that unnecessarily much too hard to see.

So what I'd like to see (and this is where it becomes *so* much more
useful to be able to recreate it myself so that I can play with the
perf data several different ways) is to see what the profile looks
like in that spinlocked region.

Hmm. I guess you could just send me the "perf.data" and "vmlinux"
files, and I can look at it that way. But I'll try to see what happens
on my profile, even if I can't recreate the contention itself, just
trying to see what happens inside of that region.

None of this code is all that new, which is annoying. This must have
gone on forever,

   Linus


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-15 Thread Linus Torvalds
On Mon, Aug 15, 2016 at 3:22 PM, Dave Chinner  wrote:
>
> Right, but that does not make the profile data useless,

Yes it does. Because it basically hides everything that happens inside
the lock, which is what causes the contention in the first place.

So stop making inane and stupid arguments, Dave.

Your profiles are shit. Deal with it, or accept that nobody is ever
going to bother working on them because your profiles don't give
useful information.

I see that you actually fixed your profiles, but quite frankly, the
amount of pure unadulterated crap you posted in this email is worth
reacting negatively to.

You generally make so much sense that it's shocking to see you then
make these crazy excuses for your completely broken profiles.

 Linus


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-15 Thread Dave Chinner
On Tue, Aug 16, 2016 at 08:22:11AM +1000, Dave Chinner wrote:
> On Sun, Aug 14, 2016 at 10:12:20PM -0700, Linus Torvalds wrote:
> > On Aug 14, 2016 10:00 PM, "Dave Chinner"  wrote:
> > >
> > > > What does it say if you annotate that _raw_spin_unlock_irqrestore()
> > function?
> > > 
> > > │
> > > │ Disassembly of section load0:
> > > │
> > > │ 81e628b0 <_raw_spin_unlock_irqrestore>:
> > > │   nop
> > > │   push   %rbp
> > > │   mov    %rsp,%rbp
> > > │   movb   $0x0,(%rdi)
> > > │   nop
> > > │   mov    %rsi,%rdi
> > > │   push   %rdi
> > > │   popfq
> > >  99.35 │   nop
> > 
> > Yeah, that's a good disassembly of a non-debug spin unlock, and the symbols
> > are fine, but the profile is not valid. That's an interrupt point, right
> > after the popf that enables interrupts again.
> > 
> > I don't know why 'perf' isn't working on your machine, but it clearly
> > isn't.
> > 
> > Has it ever worked on that machine?
> 
> It's working the same as it's worked since I started using it many
> years ago.
> 
> > What cpu is it?
> 
> Intel(R) Xeon(R) CPU E5-4620 0 @ 2.20GHz
> 
> > Are you running in some
> > virtualized environment without performance counters, perhaps?
> 
> I've mentioned a couple of times in this thread that I'm testing
> inside a VM. It's the same VM I've been running performance tests in
> since early 2010. Nobody has complained that the profiles I've
> posted are useless before, and not once in all that time have they
> been wrong in indicating a spinning lock contention point.
> 
> i.e. In previous cases where I've measured double digit CPU usage
> numbers in a spin_unlock variant, it's always been a result of
> spinlock contention. And fixing the algorithmic problem that led to
> the spinlock showing up in the profile in the first place has always
> substantially improved performance and scalability.
> 
> As such, I'm always going to treat a locking profile like that as
> contention because even if it isn't contending *on my machine*,
> that amount of work being done under a spinning lock is /way too
> much/ and it *will* cause contention problems with larger machines.

And, so, after helpfully being pointed at the magic kvm "-cpu host"
flag to enable access to the performance counters from the guest
(using "-e cycles", because more precise counters aren't available),
the profile looks like this:

  31.18%  [kernel]  [k] __pv_queued_spin_lock_slowpath
   9.90%  [kernel]  [k] copy_user_generic_string
   3.65%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
   2.62%  [kernel]  [k] __block_commit_write.isra.29
   2.26%  [kernel]  [k] _raw_spin_lock_irqsave
   1.72%  [kernel]  [k] _raw_spin_lock
   1.33%  [kernel]  [k] __wake_up_bit
   1.20%  [kernel]  [k] __radix_tree_lookup
   1.19%  [kernel]  [k] __remove_mapping  
   1.12%  [kernel]  [k] __delete_from_page_cache
   0.97%  [kernel]  [k] xfs_do_writepage  
   0.91%  [kernel]  [k] get_page_from_freelist
   0.90%  [kernel]  [k] up_write  
   0.88%  [kernel]  [k] clear_page_dirty_for_io
   0.83%  [kernel]  [k] radix_tree_tag_set
   0.81%  [kernel]  [k] radix_tree_tag_clear
   0.80%  [kernel]  [k] down_write
   0.78%  [kernel]  [k] _raw_spin_unlock_irqrestore
   0.77%  [kernel]  [k] shrink_page_list
   0.76%  [kernel]  [k] ___might_sleep
   0.76%  [kernel]  [k] unlock_page
   0.74%  [kernel]  [k] __list_del_entry
   0.67%  [kernel]  [k] __add_to_page_cache_locked
   0.65%  [kernel]  [k] node_dirty_ok
   0.61%  [kernel]  [k] __rmqueue
   0.61%  [kernel]  [k] __block_write_begin_int
   0.61%  [kernel]  [k] cancel_dirty_page
   0.61%  [kernel]  [k] __test_set_page_writeback
   0.59%  [kernel]  [k] page_mapping
   0.57%  [kernel]  [k] __list_add
   0.56%  [kernel]  [k] free_pcppages_bulk
   0.54%  [kernel]  [k] _raw_spin_lock_irq
   0.54%  [kernel]  [k] generic_write_end
   0.51%  [kernel]  [k] drop_buffers

The call graph should be familiar by now:

   36.60% 0.00%  [kernel][k] kswapd
   - 30.29% kswapd  
  - 30.23% shrink_node
 - 30.07% shrink_node_memcg.isra.75
- 30.15% shrink_inactive_list
   - 29.49% shrink_page_list
  - 22.79% __remove_mapping
 - 22.27% _raw_spin_lock_irqsave
  __pv_queued_spin_lock_slowpath
 + 1.86% __delete_from_page_cache
 + 1.27% _raw_spin_unlock_irqrestore
  + 4.31% try_to_release_page
  + 1.21% free_hot_cold_page_list
0.56% page_evictable
 0.77% isolate_lru_pages.isra.72

That sure looks like spin lock contention to me

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-15 Thread Dave Chinner
On Sun, Aug 14, 2016 at 10:12:20PM -0700, Linus Torvalds wrote:
> On Aug 14, 2016 10:00 PM, "Dave Chinner"  wrote:
> >
> > > What does it say if you annotate that _raw_spin_unlock_irqrestore()
> function?
> > 
> > │
> > │ Disassembly of section load0:
> > │
> > │ 81e628b0 <_raw_spin_unlock_irqrestore>:
> > │   nop
> > │   push   %rbp
> > │   mov    %rsp,%rbp
> > │   movb   $0x0,(%rdi)
> > │   nop
> > │   mov    %rsi,%rdi
> > │   push   %rdi
> > │   popfq
> >  99.35 │   nop
> 
> Yeah, that's a good disassembly of a non-debug spin unlock, and the symbols
> are fine, but the profile is not valid. That's an interrupt point, right
> after the popf that enables interrupts again.
> 
> I don't know why 'perf' isn't working on your machine, but it clearly
> isn't.
> 
> Has it ever worked on that machine?

It's working the same as it's worked since I started using it many
years ago.

> What cpu is it?

Intel(R) Xeon(R) CPU E5-4620 0 @ 2.20GHz

> Are you running in some
> virtualized environment without performance counters, perhaps?

I've mentioned a couple of times in this thread that I'm testing
inside a VM. It's the same VM I've been running performance tests in
since early 2010. Nobody has complained that the profiles I've
posted are useless before, and not once in all that time have they
been wrong in indicating a spinning lock contention point.

i.e. In previous cases where I've measured double digit CPU usage
numbers in a spin_unlock variant, it's always been a result of
spinlock contention. And fixing the algorithmic problem that led to
the spinlock showing up in the profile in the first place has always
substantially improved performance and scalability.

As such, I'm always going to treat a locking profile like that as
contention because even if it isn't contending *on my machine*,
that amount of work being done under a spinning lock is /way too
much/ and it *will* cause contention problems with larger machines.

> It's not actually the unlock that is expensive, and there is no contention
> on the lock (if there had been, the numbers would have been entirely
> different for the debug case, which makes locking an order of magnitude
> more expensive). All the cost of everything that happened while interrupts
> were disabled is just accounted to the instruction after they were enabled
> again.

Right, but that does not make the profile data useless, nor should you
shoot the messenger because they weren't supplied with
information you think should have been in the message. The message
still says that the majority of the overhead is in
__remove_mapping(), and it's an excessive amount of work being done
inside the tree_lock with interrupts disabled
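
To make that concrete, the shape of the code in question in
__remove_mapping() is roughly the following (a simplified sketch, not
the exact kernel source):

    spin_lock_irqsave(&mapping->tree_lock, flags); /* irqs off from here */
    ...
    __delete_from_page_cache(page, shadow);        /* radix tree + accounting */
    spin_unlock_irqrestore(&mapping->tree_lock, flags);
    /*
     * With interrupt-driven sampling (no vPMU in the guest), no sample
     * can fire inside the region above, so all of its cost gets charged
     * to the first instruction after interrupts are re-enabled.
     */

Either way the profile is pointing at the same thing: a lot of cycles
are being spent between taking and dropping mapping->tree_lock.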

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-15 Thread Dave Chinner
On Mon, Aug 15, 2016 at 10:14:55PM +0800, Fengguang Wu wrote:
> Hi Christoph,
> 
> On Sun, Aug 14, 2016 at 06:17:24PM +0200, Christoph Hellwig wrote:
> >Snipping the long contest:
> >
> >I think there are three observations here:
> >
> >(1) removing the mark_page_accessed (which is the only significant
> >change in the parent commit)  hurts the
> >aim7/1BRD_48G-xfs-disk_rr-3000-performance/ivb44 test.
> >I'd still rather stick to the filemap version and let the
> >VM people sort it out.  How do the numbers for this test
> >look for XFS vs say ext4 and btrfs?
> >(2) lots of additional spinlock contention in the new case.  A quick
> >check shows that I fat-fingered my rewrite so that we do
> >the xfs_inode_set_eofblocks_tag call now for the pure lookup
> >case, and pretty much all new cycles come from that.
> >(3) Boy, are those xfs_inode_set_eofblocks_tag calls expensive, and
> >we're already doing way too many even without my little bug above.
> >
> >So I've force pushed a new version of the iomap-fixes branch with
> >(2) fixed, and also a little patch to make xfs_inode_set_eofblocks_tag a
> >lot less expensive slotted in before that.  Would be good to see
> >the numbers with that.
> 
> The aim7 1BRD tests finished and there are ups and downs, with overall
> performance remaining flat.
> 
> 99091700659f4df9  74a242ad94d13436a1644c0b45  bf4dc6e4ecc2a3d042029319bc  
> testcase/testparams/testbox
>   --  --  
> ---

What do these commits refer to, please? They mean nothing without
the commit names

/me goes searching. Ok:

99091700659 is the top of Linus' tree
74a242ad94d is 
bf4dc6e4ecc is the latest in Christoph's tree (because it's
mentioned below)

> %stddev %change %stddev %change %stddev
> \  |\  |
> \ 159926  157324  158574
> GEO-MEAN aim7.jobs-per-min
> 70897   5%  74137   4%  73775
> aim7/1BRD_48G-xfs-creat-clo-1500-performance/ivb44
>485217 ±  3%492431  477533
> aim7/1BRD_48G-xfs-disk_rd-9000-performance/ivb44
>360451 -19% 292980 -17% 299377
> aim7/1BRD_48G-xfs-disk_rr-3000-performance/ivb44

So, why does random read go backwards by 20%? The iomap IO path
patches we are testing only affect the write path, so this
doesn't make a whole lot of sense.

>338114  338410   5% 354078
> aim7/1BRD_48G-xfs-disk_rw-3000-performance/ivb44
> 60130 ±  5% 4%  62438   5%  62923
> aim7/1BRD_48G-xfs-disk_src-3000-performance/ivb44
>403144  397790  410648
> aim7/1BRD_48G-xfs-disk_wrt-3000-performance/ivb44

And this is the test the original regression was reported for:

gcc-6/performance/profile/1BRD_48G/xfs/x86_64-rhel/3000/debian-x86_64-2015-02-07.cgz/ivb44/disk_wrt/aim7

And that shows no improvement at all. The original regression was:

484435 ±  0% -13.3% 420004 ±  0%  aim7.jobs-per-min

So it's still 15% down on the original performance, which, again,
doesn't make a whole lot of sense given the improvement in so many
other tests I've run
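
(For concreteness: taking the originally reported 484435 jobs/min for
the pre-regression commit f0c6bcba and the 410648 disk_wrt figure for
bf4dc6e in the table above, and assuming the two runs are directly
comparable as the comparison above does, (484435 - 410648) / 484435 is
roughly 15.2%, which is where the "still 15% down" figure comes from.)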

> 26327   26534   26128
> aim7/1BRD_48G-xfs-sync_disk_rw-600-performance/ivb44
> 
> The new commit bf4dc6e ("xfs: rewrite and optimize the delalloc write
> path") improves the aim7/1BRD_48G-xfs-disk_rw-3000-performance/ivb44
> case by 5%. Here are the detailed numbers:
> 
> aim7/1BRD_48G-xfs-disk_rw-3000-performance/ivb44

Not important at all. We need the results for the disk_wrt regression
we are chasing (disk_wrt-3000) so we can see how the code change
affected behaviour.

> Here are the detailed numbers for the slowed down case:
> 
> aim7/1BRD_48G-xfs-disk_rr-3000-performance/ivb44
> 
> 99091700659f4df9  bf4dc6e4ecc2a3d042029319bc
>   --
> %stddev  change %stddev
> \  |\
>360451 -17% 299377aim7.jobs-per-min
> 12806 481%  74447
> aim7.time.involuntary_context_switches
.
> 19377 459% 108364vmstat.system.cs
.
>   487 ± 89%  3e+04  26448 ± 57%  
> latency_stats.max.down.xfs_buf_lock._xfs_buf_find.xfs_buf_get_map.xfs_buf_read_map.xfs_trans_read_buf_map.xfs_read_agf.xfs_alloc_read_agf.xfs_alloc_fix_freelist.xfs_free_extent_fix_freelist.xfs_free_extent.xfs_trans_free_extent
>  1823 ± 82%  2e+061913796 ± 38%  
> 


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-15 Thread Huang, Ying
Christoph Hellwig  writes:

> Snipping the long contest:
>
> I think there are three observations here:
>
>  (1) removing the mark_page_accessed (which is the only significant
>  change in the parent commit)  hurts the
>  aim7/1BRD_48G-xfs-disk_rr-3000-performance/ivb44 test.
>  I'd still rather stick to the filemap version and let the
>  VM people sort it out.  How do the numbers for this test
>  look for XFS vs say ext4 and btrfs?
>  (2) lots of additional spinlock contention in the new case.  A quick
>  check shows that I fat-fingered my rewrite so that we do
>  the xfs_inode_set_eofblocks_tag call now for the pure lookup
>  case, and pretty much all new cycles come from that.
>  (3) Boy, are those xfs_inode_set_eofblocks_tag calls expensive, and
>  we're already doing way too many even without my little bug above.
>
> So I've force pushed a new version of the iomap-fixes branch with
> (2) fixed, and also a little patch to make xfs_inode_set_eofblocks_tag a
> lot less expensive slotted in before that.  Would be good to see
> the numbers with that.

For the original reported regression, the test result is as follow,

=
compiler/cpufreq_governor/debug-setup/disk/fs/kconfig/load/rootfs/tbox_group/test/testcase:
  
gcc-6/performance/profile/1BRD_48G/xfs/x86_64-rhel/3000/debian-x86_64-2015-02-07.cgz/ivb44/disk_wrt/aim7

commit: 
  f0c6bcba74ac51cb77aadb33ad35cb2dc1ad1506 (parent of first bad commit)
  68a9f5e7007c1afa2cf6830b690a90d0187c0684 (first bad commit)
  99091700659f4df965e138b38b4fa26a29b7eade (base of your fixes branch)
  bf4dc6e4ecc2a3d042029319bc8cd4204c185610 (head of your fixes branch)

f0c6bcba74ac51cb 68a9f5e7007c1afa2cf6830b69 99091700659f4df965e138b38b 
bf4dc6e4ecc2a3d042029319bc 
 -- -- 
-- 
 %stddev %change %stddev %change %stddev 
%change %stddev
 \  |\  |\  
|\  
484435 ±  0% -13.3% 420004 ±  0% -17.0% 402250 ±  0% 
-15.6% 408998 ±  0%  aim7.jobs-per-min


And the perf data is as follow,

  "perf-profile.func.cycles-pp.intel_idle": 20.25,
  "perf-profile.func.cycles-pp.memset_erms": 11.72,
  "perf-profile.func.cycles-pp.copy_user_enhanced_fast_string": 8.37,
  "perf-profile.func.cycles-pp.__block_commit_write.isra.21": 3.49,
  "perf-profile.func.cycles-pp.block_write_end": 1.77,
  "perf-profile.func.cycles-pp.native_queued_spin_lock_slowpath": 1.63,
  "perf-profile.func.cycles-pp.unlock_page": 1.58,
  "perf-profile.func.cycles-pp.___might_sleep": 1.56,
  "perf-profile.func.cycles-pp.__block_write_begin_int": 1.33,
  "perf-profile.func.cycles-pp.iov_iter_copy_from_user_atomic": 1.23,
  "perf-profile.func.cycles-pp.up_write": 1.21,
  "perf-profile.func.cycles-pp.__mark_inode_dirty": 1.18,
  "perf-profile.func.cycles-pp.down_write": 1.06,
  "perf-profile.func.cycles-pp.mark_buffer_dirty": 0.94,
  "perf-profile.func.cycles-pp.generic_write_end": 0.92,
  "perf-profile.func.cycles-pp.__radix_tree_lookup": 0.91,
  "perf-profile.func.cycles-pp._raw_spin_lock": 0.81,
  "perf-profile.func.cycles-pp.entry_SYSCALL_64_fastpath": 0.79,
  "perf-profile.func.cycles-pp.__might_sleep": 0.79,
  "perf-profile.func.cycles-pp.xfs_file_iomap_begin_delay.isra.9": 0.7,
  "perf-profile.func.cycles-pp.__list_del_entry": 0.7,
  "perf-profile.func.cycles-pp.vfs_write": 0.69,
  "perf-profile.func.cycles-pp.drop_buffers": 0.68,
  "perf-profile.func.cycles-pp.xfs_file_write_iter": 0.67,
  "perf-profile.func.cycles-pp.rwsem_spin_on_owner": 0.67,

Best Regards,
Huang, Ying


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-15 Thread Huang, Ying
Hi, Chinner,

Dave Chinner  writes:

> On Wed, Aug 10, 2016 at 06:00:24PM -0700, Linus Torvalds wrote:
>> On Wed, Aug 10, 2016 at 5:33 PM, Huang, Ying  wrote:
>> >
>> > Here it is,
>> 
>> Thanks.
>> 
>> Appended is a munged "after" list, with the "before" values in
>> parenthesis. It actually looks fairly similar.
>> 
>> The biggest difference is that we have "mark_page_accessed()" show up
>> after, and not before. There was also a lot of LRU noise in the
>> non-profile data. I wonder if that is the reason here: the old model
>> of using generic_perform_write/block_page_mkwrite didn't mark the
>> pages accessed, and now with iomap_file_buffered_write() they get
>> marked as active and that screws up the LRU list, and makes us not
>> flush out the dirty pages well (because they are seen as active and
>> not good for writeback), and then you get bad memory use.
>> 
>> I'm not seeing anything that looks like locking-related.
>
> Not in that profile. I've been doing some local testing inside a
> 4-node fake-numa 16p/16GB RAM VM to see what I can find.

You run the test in a virtual machine, I think that is why your perf
data looks strange (high value of _raw_spin_unlock_irqrestore).

To setup KVM to use perf, you may refer to,

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Virtualization_Tuning_and_Optimization_Guide/sect-Virtualization_Tuning_Optimization_Guide-Monitoring_Tools-vPMU.html
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Virtualization_Administration_Guide/sect-perf-mon.html

I haven't tested them.  You may Google to find more information.  Or the
perf/kvm people can give you more information.

> I'm yet to work out how I can trigger a profile like the one that
> was reported (I really need to see the event traces), but in the
> mean time I found this
>
> Doing a large sequential single threaded buffered write using a 4k
> buffer (so single page per syscall to make the XFS IO path allocator
> behave the same way as in 4.7), I'm seeing a CPU profile that
> indicates we have a potential mapping->tree_lock issue:
>
> # xfs_io -f -c "truncate 0" -c "pwrite 0 47g" /mnt/scratch/fooey
> wrote 50465865728/50465865728 bytes at offset 0
> 47.000 GiB, 12320768 ops; 0:01:36.00 (499.418 MiB/sec and 127850.9132 ops/sec)
>
> 
>
>   24.15%  [kernel]  [k] _raw_spin_unlock_irqrestore
>9.67%  [kernel]  [k] copy_user_generic_string
>5.64%  [kernel]  [k] _raw_spin_unlock_irq
>3.34%  [kernel]  [k] get_page_from_freelist
>2.57%  [kernel]  [k] mark_page_accessed
>2.45%  [kernel]  [k] do_raw_spin_lock
>1.83%  [kernel]  [k] shrink_page_list
>1.70%  [kernel]  [k] free_hot_cold_page
>1.26%  [kernel]  [k] xfs_do_writepage

Best Regards,
Huang, Ying


Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

2016-08-15 Thread Fengguang Wu

Hi Christoph,

On Sun, Aug 14, 2016 at 06:17:24PM +0200, Christoph Hellwig wrote:

Snipping the long contest:

I think there are three observations here:

(1) removing the mark_page_accessed (which is the only significant
change in the parent commit)  hurts the
aim7/1BRD_48G-xfs-disk_rr-3000-performance/ivb44 test.
I'd still rather stick to the filemap version and let the
VM people sort it out.  How do the numbers for this test
look for XFS vs say ext4 and btrfs?
(2) lots of additional spinlock contention in the new case.  A quick
check shows that I fat-fingered my rewrite so that we do
the xfs_inode_set_eofblocks_tag call now for the pure lookup
case, and pretty much all new cycles come from that.
(3) Boy, are those xfs_inode_set_eofblocks_tag calls expensive, and
we're already doing way too many even without my little bug above.

So I've force pushed a new version of the iomap-fixes branch with
(2) fixed, and also a little patch to make xfs_inode_set_eofblocks_tag a
lot less expensive slotted in before that.  Would be good to see
the numbers with that.


The aim7 1BRD tests finished and there are ups and downs, with overall
performance remaining flat.

99091700659f4df9  74a242ad94d13436a1644c0b45  bf4dc6e4ecc2a3d042029319bc  
testcase/testparams/testbox
  --  --  
---
%stddev %change %stddev %change %stddev
\  |\  |\  
   159926  157324  158574GEO-MEAN aim7.jobs-per-min

70897   5%  74137   4%  73775
aim7/1BRD_48G-xfs-creat-clo-1500-performance/ivb44
   485217 ±  3%492431  477533
aim7/1BRD_48G-xfs-disk_rd-9000-performance/ivb44
   360451 -19% 292980 -17% 299377
aim7/1BRD_48G-xfs-disk_rr-3000-performance/ivb44
   338114  338410   5% 354078
aim7/1BRD_48G-xfs-disk_rw-3000-performance/ivb44
60130 ±  5% 4%  62438   5%  62923
aim7/1BRD_48G-xfs-disk_src-3000-performance/ivb44
   403144  397790  410648
aim7/1BRD_48G-xfs-disk_wrt-3000-performance/ivb44
26327   26534   26128
aim7/1BRD_48G-xfs-sync_disk_rw-600-performance/ivb44

The new commit bf4dc6e ("xfs: rewrite and optimize the delalloc write
path") improves the aim7/1BRD_48G-xfs-disk_rw-3000-performance/ivb44
case by 5%. Here are the detailed numbers:

aim7/1BRD_48G-xfs-disk_rw-3000-performance/ivb44

74a242ad94d13436  bf4dc6e4ecc2a3d042029319bc
----------------  --------------------------
         %stddev      %change %stddev
             \            |      \
    338410                5%    354078      aim7.jobs-per-min
    404390                8%    435117      aim7.time.voluntary_context_switches
      2502               -4%      2396      aim7.time.maximum_resident_set_size
     15018               -9%     13701      aim7.time.involuntary_context_switches
       900              -11%       801      aim7.time.system_time
     17432               11%     19365      vmstat.system.cs
     47736 ± 19%        -24%     36087      interrupts.CAL:Function_call_interrupts
   2129646               31%   2790638      proc-vmstat.pgalloc_dma32
    379503               13%    429384      numa-meminfo.node0.Dirty
     15018               -9%     13701      time.involuntary_context_switches
       900              -11%       801      time.system_time
      1560               10%      1716      slabinfo.mnt_cache.active_objs
      1560               10%      1716      slabinfo.mnt_cache.num_objs
     61.53               -4       57.45 ±  4%  perf-profile.cycles-pp.intel_idle.cpuidle_enter_state.cpuidle_enter.call_cpuidle.cpu_startup_entry
     61.63               -4       57.55 ±  4%  perf-profile.func.cycles-pp.intel_idle
   1007188 ± 16%        156%   2577911 ±  6%  numa-numastat.node0.numa_miss
   9662857 ±  4%        -13%   8420159 ±  3%  numa-numastat.node0.numa_foreign
   1008220 ± 16%        155%   2570630 ±  6%  numa-numastat.node1.numa_foreign
   9664033 ±  4%        -13%   8413184 ±  3%  numa-numastat.node1.numa_miss
  26519887 ±  3%         18%  31322674      cpuidle.C1-IVT.time
    122238               16%    142383      cpuidle.C1-IVT.usage
     46548               11%     51645      cpuidle.C1E-IVT.usage
  17253419               13%  19567582      cpuidle.C3-IVT.time
     86847               13%     98333      cpuidle.C3-IVT.usage
    482033 ± 12%        108%   1000665 ±  8%  numa-vmstat.node0.numa_miss
     94689               14%    107744      numa-vmstat.node0.nr_zone_write_pending
     94677               14%    107718      numa-vmstat.node0.nr_dirty
  3156643 ±  
