Re: [RFC PATCH] mm, compaction: allow compaction for GFP_NOFS requests

2016-10-18 Thread Michal Hocko
On Tue 18-10-16 17:24:46, Dave Chinner wrote:
> On Mon, Oct 17, 2016 at 10:22:56AM +0200, Michal Hocko wrote:
> > On Mon 17-10-16 07:49:59, Dave Chinner wrote:
> > > On Thu, Oct 13, 2016 at 01:04:56PM +0200, Michal Hocko wrote:
> > > > On Thu 13-10-16 09:39:47, Michal Hocko wrote:
> > > > > On Thu 13-10-16 11:29:24, Dave Chinner wrote:
> > > > > > On Fri, Oct 07, 2016 at 03:18:14PM +0200, Michal Hocko wrote:
> > > > > [...]
> > > > > > > Unpatched kernel:
> > > > > > > #   Version 3.3, 16 thread(s) starting at Fri Oct  7 09:55:05 
> > > > > > > 2016
> > > > > > > #   Sync method: NO SYNC: Test does not issue sync() or 
> > > > > > > fsync() calls.
> > > > > > > #   Directories:  Time based hash between directories across 
> > > > > > > 1 subdirectories with 180 seconds per subdirectory.
> > > > > > > #   File names: 40 bytes long, (16 initial bytes of time 
> > > > > > > stamp with 24 random bytes at end of name)
> > > > > > > #   Files info: size 0 bytes, written with an IO size of 
> > > > > > > 16384 bytes per write
> > > > > > > #   App overhead is time in microseconds spent in the test 
> > > > > > > not doing file writing related system calls.
> > > > > > > #
> > > > > > > FSUse%  Count  Size  Files/sec  App Overhead
> > > > > > >  1  1600   4300.1 20745838
> > > > > > >  3  3200   4239.9 23849857
> > > > > > >  5  4800   4243.4 25939543
> > > > > > >  6  6400   4248.4 19514050
> > > > > > >  8  8000   4262.1 20796169
> > > > > > >  9  9600   4257.6 21288675
> > > > > > > 11 11200   4259.7 19375120
> > > > > > > 13 12800   4220.7 22734141
> > > > > > > 14 14400   4238.5 31936458
> > > > > > > 16 16000   4231.5 23409901
> > > > > > > 18 17600   4045.3 23577700
> > > > > > > 19 19200   2783.4 58299526
> > > > > > > 21 20800   2678.2 40616302
> > > > > > > 23 22400   2693.5 83973996
> > > > > > > Ctrl+C because it just took too long.
> > > > > > 
> > > > > > Try running it on a larger filesystem, or configure the fs with more
> > > > > > AGs and a larger log (i.e. mkfs.xfs -f -dagcount=24 -l size=512m
> > > > > > ). That will speed up modifications and increase concurrency.
> > > > > > This test should be able to run 5-10x faster than this (it
> > > > > > sustains 55,000 files/s @ 300MB/s write on my test fs on a cheap
> > > > > > SSD).
> > > > > 
> > > > > Will add more memory to the machine. Will report back on that.
> > > > 
> > > > Increasing the memory to 1G didn't help. So I've tried to add
> > > > -dagcount=24 -l size=512m and that didn't help much either. I am at 5k
> > > > files/s, so nowhere close to your 55k. I thought this was more about the
> > > > CPU count than about the amount of memory. So I've tried a larger machine
> > > > with 24 CPUs (no dagcount etc...); this one doesn't have fast storage,
> > > > so I've backed the fs image with a ramdisk, but even then I am getting
> > > > very similar results. No idea what is wrong with my kvm setup.
> > > 
> > > What's the backing storage? I use an image file in an XFS
> > > filesystem, configured with virtio,cache=none so its concurrency
> > > model matches that of a real disk...
> > 
> > I am using a qcow qemu image, exported to qemu by the
> > -drive file=storage.img,if=ide,index=1,cache=none
> > parameter.
> 
> storage.img is on what type of filesystem?

ext3 on the host system

> Only XFS will give you
> proper IO concurrency with direct IO, and you really need to use a
> raw image file rather than qcow2. If you're not using the special
> capabilities of qcow2 (e.g. snapshots), there's no reason to use
> it...

OK, I will try with the raw image as soon as I have some more time
(hopefully this week).
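
Something along these lines, i.e. a raw file attached through virtio with
cache=none as Dave suggests (the converted file name is only a placeholder):

  # convert the existing qcow2 image to a raw file on the host fs
  qemu-img convert -O raw storage.img storage-raw.img
  # and attach it via virtio instead of ide, keeping cache=none
  -drive file=storage-raw.img,if=virtio,format=raw,cache=none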

Thanks
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH] mm, compaction: allow compaction for GFP_NOFS requests

2016-10-18 Thread Dave Chinner
On Mon, Oct 17, 2016 at 10:22:56AM +0200, Michal Hocko wrote:
> On Mon 17-10-16 07:49:59, Dave Chinner wrote:
> > On Thu, Oct 13, 2016 at 01:04:56PM +0200, Michal Hocko wrote:
> > > On Thu 13-10-16 09:39:47, Michal Hocko wrote:
> > > > On Thu 13-10-16 11:29:24, Dave Chinner wrote:
> > > > > On Fri, Oct 07, 2016 at 03:18:14PM +0200, Michal Hocko wrote:
> > > > [...]
> > > > > > Unpatched kernel:
> > > > > > #   Version 3.3, 16 thread(s) starting at Fri Oct  7 09:55:05 
> > > > > > 2016
> > > > > > #   Sync method: NO SYNC: Test does not issue sync() or fsync() 
> > > > > > calls.
> > > > > > #   Directories:  Time based hash between directories across 
> > > > > > 1 subdirectories with 180 seconds per subdirectory.
> > > > > > #   File names: 40 bytes long, (16 initial bytes of time stamp 
> > > > > > with 24 random bytes at end of name)
> > > > > > #   Files info: size 0 bytes, written with an IO size of 16384 
> > > > > > bytes per write
> > > > > > #   App overhead is time in microseconds spent in the test not 
> > > > > > doing file writing related system calls.
> > > > > > #
> > > > > > FSUse%  Count  Size  Files/sec  App Overhead
> > > > > >  1  1600   4300.1 20745838
> > > > > >  3  3200   4239.9 23849857
> > > > > >  5  4800   4243.4 25939543
> > > > > >  6  6400   4248.4 19514050
> > > > > >  8  8000   4262.1 20796169
> > > > > >  9  9600   4257.6 21288675
> > > > > > 11 11200   4259.7 19375120
> > > > > > 13 12800   4220.7 22734141
> > > > > > 14 14400   4238.5 31936458
> > > > > > 16 16000   4231.5 23409901
> > > > > > 18 17600   4045.3 23577700
> > > > > > 19 19200   2783.4 58299526
> > > > > > 21 20800   2678.2 40616302
> > > > > > 23 22400   2693.5 83973996
> > > > > > Ctrl+C because it just took too long.
> > > > > 
> > > > > Try running it on a larger filesystem, or configure the fs with more
> > > > > AGs and a larger log (i.e. mkfs.xfs -f -dagcount=24 -l size=512m
> > > > > ). That will speed up modifications and increase concurrency.
> > > > > This test should be able to run 5-10x faster than this (it
> > > > > sustains 55,000 files/s @ 300MB/s write on my test fs on a cheap
> > > > > SSD).
> > > > 
> > > > Will add more memory to the machine. Will report back on that.
> > > 
> > > Increasing the memory to 1G didn't help. So I've tried to add
> > > -dagcount=24 -l size=512m and that didn't help much either. I am at 5k
> > > files/s, so nowhere close to your 55k. I thought this was more about the
> > > CPU count than about the amount of memory. So I've tried a larger machine
> > > with 24 CPUs (no dagcount etc...); this one doesn't have fast storage,
> > > so I've backed the fs image with a ramdisk, but even then I am getting
> > > very similar results. No idea what is wrong with my kvm setup.
> > 
> > What's the backing storage? I use an image file in an XFS
> > filesystem, configured with virtio,cache=none so its concurrency
> > model matches that of a real disk...
> 
> I am using a qcow qemu image, exported to qemu by the
> -drive file=storage.img,if=ide,index=1,cache=none
> parameter.

storage.img is on what type of filesystem? Only XFS will give you
proper IO concurrency with direct IO, and you really need to use a
raw image file rather than qcow2. If you're not using the special
capabilities of qcow2 (e.g. snapshots), there's no reason to use
it...
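
A quick way to check whether any qcow2 features are actually in use is to
inspect the image metadata, e.g.:

  qemu-img info storage.img   # reports the file format and any internal snapshots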

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH] mm, compaction: allow compaction for GFP_NOFS requests

2016-10-17 Thread Michal Hocko
On Mon 17-10-16 07:49:59, Dave Chinner wrote:
> On Thu, Oct 13, 2016 at 01:04:56PM +0200, Michal Hocko wrote:
> > On Thu 13-10-16 09:39:47, Michal Hocko wrote:
> > > On Thu 13-10-16 11:29:24, Dave Chinner wrote:
> > > > On Fri, Oct 07, 2016 at 03:18:14PM +0200, Michal Hocko wrote:
> > > [...]
> > > > > Unpatched kernel:
> > > > > #   Version 3.3, 16 thread(s) starting at Fri Oct  7 09:55:05 2016
> > > > > #   Sync method: NO SYNC: Test does not issue sync() or fsync() 
> > > > > calls.
> > > > > #   Directories:  Time based hash between directories across 
> > > > > 1 subdirectories with 180 seconds per subdirectory.
> > > > > #   File names: 40 bytes long, (16 initial bytes of time stamp 
> > > > > with 24 random bytes at end of name)
> > > > > #   Files info: size 0 bytes, written with an IO size of 16384 
> > > > > bytes per write
> > > > > #   App overhead is time in microseconds spent in the test not 
> > > > > doing file writing related system calls.
> > > > > #
> > > > > FSUse%  Count  Size  Files/sec  App Overhead
> > > > >  1  1600   4300.1 20745838
> > > > >  3  3200   4239.9 23849857
> > > > >  5  4800   4243.4 25939543
> > > > >  6  6400   4248.4 19514050
> > > > >  8  8000   4262.1 20796169
> > > > >  9  9600   4257.6 21288675
> > > > > 11 11200   4259.7 19375120
> > > > > 13 12800   4220.7 22734141
> > > > > 14 14400   4238.5 31936458
> > > > > 16 16000   4231.5 23409901
> > > > > 18 17600   4045.3 23577700
> > > > > 19 19200   2783.4 58299526
> > > > > 21 20800   2678.2 40616302
> > > > > 23 22400   2693.5 83973996
> > > > > Ctrl+C because it just took too long.
> > > > 
> > > > Try running it on a larger filesystem, or configure the fs with more
> > > > AGs and a larger log (i.e. mkfs.xfs -f -dagcount=24 -l size=512m
> > > > ). That will speed up modifications and increase concurrency.
> > > > This test should be able to run 5-10x faster than this (it
> > > > sustains 55,000 files/s @ 300MB/s write on my test fs on a cheap
> > > > SSD).
> > > 
> > > Will add more memory to the machine. Will report back on that.
> > 
> > Increasing the memory to 1G didn't help. So I've tried to add
> > -dagcount=24 -l size=512m and that didn't help much either. I am at 5k
> > files/s, so nowhere close to your 55k. I thought this was more about the
> > CPU count than about the amount of memory. So I've tried a larger machine
> > with 24 CPUs (no dagcount etc...); this one doesn't have fast storage,
> > so I've backed the fs image with a ramdisk, but even then I am getting
> > very similar results. No idea what is wrong with my kvm setup.
> 
> What's the backing storage? I use an image file in an XFS
> filesystem, configured with virtio,cache=none so its concurrency
> model matches that of a real disk...

I am using a qcow qemu image, exported to qemu by the
-drive file=storage.img,if=ide,index=1,cache=none
parameter.

-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH] mm, compaction: allow compaction for GFP_NOFS requests

2016-10-16 Thread Dave Chinner
On Thu, Oct 13, 2016 at 01:04:56PM +0200, Michal Hocko wrote:
> On Thu 13-10-16 09:39:47, Michal Hocko wrote:
> > On Thu 13-10-16 11:29:24, Dave Chinner wrote:
> > > On Fri, Oct 07, 2016 at 03:18:14PM +0200, Michal Hocko wrote:
> > [...]
> > > > Unpatched kernel:
> > > > #   Version 3.3, 16 thread(s) starting at Fri Oct  7 09:55:05 2016
> > > > #   Sync method: NO SYNC: Test does not issue sync() or fsync() 
> > > > calls.
> > > > #   Directories:  Time based hash between directories across 1 
> > > > subdirectories with 180 seconds per subdirectory.
> > > > #   File names: 40 bytes long, (16 initial bytes of time stamp with 
> > > > 24 random bytes at end of name)
> > > > #   Files info: size 0 bytes, written with an IO size of 16384 
> > > > bytes per write
> > > > #   App overhead is time in microseconds spent in the test not 
> > > > doing file writing related system calls.
> > > > #
> > > > FSUse%  Count  Size  Files/sec  App Overhead
> > > >  1  1600   4300.1 20745838
> > > >  3  3200   4239.9 23849857
> > > >  5  4800   4243.4 25939543
> > > >  6  6400   4248.4 19514050
> > > >  8  8000   4262.1 20796169
> > > >  9  9600   4257.6 21288675
> > > > 11 11200   4259.7 19375120
> > > > 13 12800   4220.7 22734141
> > > > 14 14400   4238.5 31936458
> > > > 16 16000   4231.5 23409901
> > > > 18 17600   4045.3 23577700
> > > > 19 19200   2783.4 58299526
> > > > 21 20800   2678.2 40616302
> > > > 23 22400   2693.5 83973996
> > > > Ctrl+C because it just took too long.
> > > 
> > > Try running it on a larger filesystem, or configure the fs with more
> > > AGs and a larger log (i.e. mkfs.xfs -f -dagcount=24 -l size=512m
> > > ). That will speed up modifications and increase concurrency.
> > > This test should be able to run 5-10x faster than this (it
> > > sustains 55,000 files/s @ 300MB/s write on my test fs on a cheap
> > > SSD).
> > 
> > Will add more memory to the machine. Will report back on that.
> 
> Increasing the memory to 1G didn't help. So I've tried to add
> -dagcount=24 -l size=512m and that didn't help much either. I am at 5k
> files/s, so nowhere close to your 55k. I thought this was more about the
> CPU count than about the amount of memory. So I've tried a larger machine
> with 24 CPUs (no dagcount etc...); this one doesn't have fast storage,
> so I've backed the fs image with a ramdisk, but even then I am getting
> very similar results. No idea what is wrong with my kvm setup.

What's the backing storage? I use an image file in an XFS
filesystem, configured with virtio,cache=none so its concurrency
model matches that of a real disk...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH] mm, compaction: allow compaction for GFP_NOFS requests

2016-10-13 Thread Michal Hocko
On Thu 13-10-16 09:39:47, Michal Hocko wrote:
> On Thu 13-10-16 11:29:24, Dave Chinner wrote:
> > On Fri, Oct 07, 2016 at 03:18:14PM +0200, Michal Hocko wrote:
> [...]
> > > Unpatched kernel:
> > > #   Version 3.3, 16 thread(s) starting at Fri Oct  7 09:55:05 2016
> > > #   Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
> > > #   Directories:  Time based hash between directories across 1 
> > > subdirectories with 180 seconds per subdirectory.
> > > #   File names: 40 bytes long, (16 initial bytes of time stamp with 
> > > 24 random bytes at end of name)
> > > #   Files info: size 0 bytes, written with an IO size of 16384 bytes 
> > > per write
> > > #   App overhead is time in microseconds spent in the test not doing 
> > > file writing related system calls.
> > > #
> > > FSUse%  Count  Size  Files/sec  App Overhead
> > >  1  1600   4300.1 20745838
> > >  3  3200   4239.9 23849857
> > >  5  4800   4243.4 25939543
> > >  6  6400   4248.4 19514050
> > >  8  8000   4262.1 20796169
> > >  9  9600   4257.6 21288675
> > > 11 11200   4259.7 19375120
> > > 13 12800   4220.7 22734141
> > > 14 14400   4238.5 31936458
> > > 16 16000   4231.5 23409901
> > > 18 17600   4045.3 23577700
> > > 19 19200   2783.4 58299526
> > > 21 20800   2678.2 40616302
> > > 23 22400   2693.5 83973996
> > > Ctrl+C because it just took too long.
> > 
> > Try running it on a larger filesystem, or configure the fs with more
> > AGs and a larger log (i.e. mkfs.xfs -f -dagcount=24 -l size=512m
> > ). That will speed up modifications and increase concurrency.
> > This test should be able to run 5-10x faster than this (it
> > sustains 55,000 files/s @ 300MB/s write on my test fs on a cheap
> > SSD).
> 
> Will add more memory to the machine. Will report back on that.

Increasing the memory to 1G didn't help. So I've tried to add
-dagcount=24 -l size=512m and that didn't help much either. I am at 5k
files/s, so nowhere close to your 55k. I thought this was more about the
CPU count than about the amount of memory. So I've tried a larger machine
with 24 CPUs (no dagcount etc...); this one doesn't have fast storage,
so I've backed the fs image with a ramdisk, but even then I am getting
very similar results. No idea what is wrong with my kvm setup.
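
For reference, the full mkfs invocation corresponding to the above would be
roughly the following (keeping Dave's -n size=64k from the original recipe;
the device name is only a placeholder):

  mkfs.xfs -f -n size=64k -d agcount=24 -l size=512m /dev/vdb
  mount /dev/vdb /mnt/scratch
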
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH] mm, compaction: allow compaction for GFP_NOFS requests

2016-10-13 Thread Michal Hocko
On Thu 13-10-16 11:29:24, Dave Chinner wrote:
> On Fri, Oct 07, 2016 at 03:18:14PM +0200, Michal Hocko wrote:
[...]
> > Unpatched kernel:
> > #   Version 3.3, 16 thread(s) starting at Fri Oct  7 09:55:05 2016
> > #   Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
> > #   Directories:  Time based hash between directories across 1 
> > subdirectories with 180 seconds per subdirectory.
> > #   File names: 40 bytes long, (16 initial bytes of time stamp with 24 
> > random bytes at end of name)
> > #   Files info: size 0 bytes, written with an IO size of 16384 bytes 
> > per write
> > #   App overhead is time in microseconds spent in the test not doing 
> > file writing related system calls.
> > #
> > FSUse%  Count  Size  Files/sec  App Overhead
> >  1  1600   4300.1 20745838
> >  3  3200   4239.9 23849857
> >  5  4800   4243.4 25939543
> >  6  6400   4248.4 19514050
> >  8  8000   4262.1 20796169
> >  9  9600   4257.6 21288675
> > 11 11200   4259.7 19375120
> > 13 12800   4220.7 22734141
> > 14 14400   4238.5 31936458
> > 16 16000   4231.5 23409901
> > 18 17600   4045.3 23577700
> > 19 19200   2783.4 58299526
> > 21 20800   2678.2 40616302
> > 23 22400   2693.5 83973996
> > Ctrl+C because it just took too long.
> 
> Try running it on a larger filesystem, or configure the fs with more
> AGs and a larger log (i.e. mkfs.xfs -f -dagcount=24 -l size=512m
> ). That will speed up modifications and increase concurrency.
> This test should be able to run 5-10x faster than this (it
> sustains 55,000 files/s @ 300MB/s write on my test fs on a cheap
> SSD).

Will add more memory to the machine. Will report back on that.
 
> > while it doesn't seem to drop the Files/sec numbers starting with
> > Count=1920. I also see only a single
> > 
> > [ 3063.815003] XFS: fs_mark(3272) possible memory allocation deadlock size 
> > 65624 in kmem_alloc (mode:0x2408240)
> 
> Remember that this is emitted only after /100/ consecutive
> allocation failures. So the fact it is still being emitted means
> that the situation is not drastically better

Yes, but we should also consider that with this particular workload,
which doesn't have a lot of anonymous memory, there is simply not all
that much to migrate, so we eventually have to wait for reclaim
to free up fs-bound memory. This patch should provide some relief, but
it is not a general remedy.

> > Unpatched kernel
> > all orders
> > begin:44.718798 end:5774.618736 allocs:15019288
> > order > 0 
> > begin:44.718798 end:5773.587195 allocs:10438610
> > 
> > Patched kernel
> > all orders
> > begin:64.612804 end:5794.193619 allocs:16110081 [107.2%]
> > order > 0
> > begin:64.612804 end:5794.193619 allocs:11741492 [112.5%]
> > 
> > which would suggest that diving into compaction, rather than backing
> > off and waiting for kcompactd to do the work for us, was indeed a
> > better strategy and helped the throughput.
> 
> Well, without a success/failure ratio being calculated it's hard to
> tell what improvement it made. Did it increase the success rate, or
> reduce failure latency so retries happened faster?

I have just noticed that the tracepoint also reports allocation failures
(page==(null) and pfn==0), so I can actually calculate that. Note that
only order > 3 allocations can fail with the current page allocator, so I
have filtered only those:

Unpatched
begin:44.718798 end:5773.587195 allocs:6162244 fail:145

Patched
begin:64.612804 end:5794.193574 allocs:6869496 fail:104

So the success rate is slightly higher, but that difference is negligible;
we do, however, manage to perform ~10% more allocations, so I assume this
helped the throughput and in turn allowed memory to be recycled better.
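
Something along these lines over the raw trace log is enough to get the
numbers above (the log file name is illustrative; a failed attempt is
recognized by the page=(null)/pfn=0 fields mentioned earlier):

  # count order > 3 mm_page_alloc events and how many of them failed
  awk '/mm_page_alloc:/ {
         order = -1
         for (i = 1; i <= NF; i++)
                 if ($i ~ /^order=/) { split($i, a, "="); order = a[2] + 0 }
         if (order > 3) {
                 allocs++
                 if ($0 ~ /page=\(null\)/) fail++
         }
  } END { printf "allocs:%d fail:%d\n", allocs, fail }' trace-page-alloc.log
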
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH] mm, compaction: allow compaction for GFP_NOFS requests

2016-10-12 Thread Dave Chinner
On Fri, Oct 07, 2016 at 03:18:14PM +0200, Michal Hocko wrote:
> On Thu 06-10-16 13:11:42, Dave Chinner wrote:
> > On Wed, Oct 05, 2016 at 01:38:45PM +0200, Michal Hocko wrote:
> > > On Wed 05-10-16 07:32:02, Dave Chinner wrote:
> > > > On Tue, Oct 04, 2016 at 10:12:15AM +0200, Michal Hocko wrote:
> > > > > From: Michal Hocko 
> > > > > 
> > > > > compaction has been disabled for GFP_NOFS and GFP_NOIO requests since
> > > > > the direct compaction was introduced by 56de7263fcf3 ("mm: compaction:
> > > > > direct compact when a high-order allocation fails"). The main reason
> > > > > is that the migration of page cache pages might recurse back to fs/io
> > > > > layer and we could potentially deadlock. This is overly conservative
> > > > > because all the anonymous memory is migrateable in the GFP_NOFS 
> > > > > context
> > > > > just fine.  This might be a large portion of the memory in many/most
> > > > > workloads.
> > > > > 
> > > > > Remove the GFP_NOFS restriction and make sure that we skip all fs 
> > > > > pages
> > > > > (those with a mapping) while isolating pages to be migrated. We cannot
> > > > > consider clean fs pages because they might need a metadata update so
> > > > > only isolate pages without any mapping for nofs requests.
> > > > > 
> > > > > The effect of this patch will be probably very limited in many/most
> > > > > workloads because higher order GFP_NOFS requests are quite rare,
> > > > 
> > > > You say they are rare only because you don't know how to trigger
> > > > them easily.  :/
> > > 
> > > true
> > > 
> > > > Try this:
> > > > 
> > > > # mkfs.xfs -f -n size=64k 
> > > > # mount  /mnt/scratch
> > > > # time ./fs_mark  -D  1  -S0  -n  10  -s  0  -L  32 \
> > > > -d  /mnt/scratch/0  -d  /mnt/scratch/1 \
> > > > -d  /mnt/scratch/2  -d  /mnt/scratch/3 \
> > > > -d  /mnt/scratch/4  -d  /mnt/scratch/5 \
> > > > -d  /mnt/scratch/6  -d  /mnt/scratch/7 \
> > > > -d  /mnt/scratch/8  -d  /mnt/scratch/9 \
> > > > -d  /mnt/scratch/10  -d  /mnt/scratch/11 \
> > > > -d  /mnt/scratch/12  -d  /mnt/scratch/13 \
> > > > -d  /mnt/scratch/14  -d  /mnt/scratch/15
> > > 
> > > Does this simulate a standard or usual fs workload/configuration?  I am
> > 
> > Unfortunately, there was an era of cargo cult configuration tweaks
> > in the Ceph community that has resulted in a large number of
> > production machines with XFS filesystems configured this way. And a
> > lot of them store large numbers of small files and run under
> > significant sustained memory pressure.
> 
> I see
> 
> > I am slowly working towards getting rid of these high order allocations
> > and replacing them with the equivalent number of single page
> > allocations, but I haven't got that (complex) change working yet.
> 
> Definitely a good plan!
> 
> Anyway, I was playing with this in my virtual machine (4 CPUs, 512MB of
> RAM split into two NUMA nodes). Started on a freshly created fs after
> boot, no other load in the guest. The performance numbers should be
> taken with a grain of salt, though, because the host has 4 CPUs as well
> and it wasn't completely idle, but they should be good enough to give us
> at least some picture. This is what fs_mark told me:
> Unpatched kernel:
> #   Version 3.3, 16 thread(s) starting at Fri Oct  7 09:55:05 2016
> #   Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
> #   Directories:  Time based hash between directories across 1 
> subdirectories with 180 seconds per subdirectory.
> #   File names: 40 bytes long, (16 initial bytes of time stamp with 24 
> random bytes at end of name)
> #   Files info: size 0 bytes, written with an IO size of 16384 bytes per 
> write
> #   App overhead is time in microseconds spent in the test not doing file 
> writing related system calls.
> #
> FSUse%  Count  Size  Files/sec  App Overhead
>  1  1600   4300.1 20745838
>  3  3200   4239.9 23849857
>  5  4800   4243.4 25939543
>  6  6400   4248.4 19514050
>  8  8000   4262.1 20796169
>  9  9600   4257.6 21288675
> 11 11200   4259.7 19375120
> 13 12800   4220.7 22734141
> 14 14400   4238.5 31936458
> 16 16000   4231.5 23409901
> 18 17600   4045.3 23577700
> 19 19200   2783.4 58299526
> 21 20800   2678.2 40616302
> 23 22400   2693.5 83973996
> Ctrl+C because it just took too long.

Try running it on a larger filesystem, or configure the fs with more
AGs and a larger log (i.e. mkfs.xfs -f -dagcount=24 

Re: [RFC PATCH] mm, compaction: allow compaction for GFP_NOFS requests

2016-10-10 Thread Vlastimil Babka

On 10/07/2016 11:21 AM, Michal Hocko wrote:

On Fri 07-10-16 10:15:07, Vlastimil Babka wrote:

On 10/07/2016 08:50 AM, Michal Hocko wrote:

On Fri 07-10-16 07:27:37, Vlastimil Babka wrote:

[...]

But make sure you don't break kcompactd and manual compaction from /proc, as
they don't currently set cc->gfp_mask. Looks like until now it was only used
to determine direct compactor's migratetype which is irrelevant in those
contexts.


OK, I see. This is really subtle. One way to go would be to provide a
fake gfp_mask for them. How does the following look to you?


Looks OK. I'll have to think about the kcompactd case, as gfp mask implying
unmovable migratetype might restrict it without good reason. But that would
be a separate patch anyway; yours doesn't change that (empty gfp_mask also
means unmovable migratetype), and that's good.


OK, I see. A follow-up patch would be really trivial AFAICS: just add
__GFP_MOVABLE to the mask. But I am not familiar enough with all these
details to propose a patch with a full description.


Hm, actually the migratetype only matters for async compaction, and 
kcompactd uses sync_light, so __GFP_MOVABLE will have no effect right now.
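
As an aside, the "manual compaction from /proc" mentioned above can be
exercised like this (node0 is just an example; the per-node knob needs
CONFIG_NUMA):

  echo 1 > /proc/sys/vm/compact_memory             # compact all zones on all nodes
  echo 1 > /sys/devices/system/node/node0/compact  # compact a single node
  grep ^compact_ /proc/vmstat                      # compact_stall/fail/success etc.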


Re: [RFC PATCH] mm, compaction: allow compaction for GFP_NOFS requests

2016-10-07 Thread Michal Hocko
On Thu 06-10-16 13:11:42, Dave Chinner wrote:
> On Wed, Oct 05, 2016 at 01:38:45PM +0200, Michal Hocko wrote:
> > On Wed 05-10-16 07:32:02, Dave Chinner wrote:
> > > On Tue, Oct 04, 2016 at 10:12:15AM +0200, Michal Hocko wrote:
> > > > From: Michal Hocko 
> > > > 
> > > > compaction has been disabled for GFP_NOFS and GFP_NOIO requests since
> > > > the direct compaction was introduced by 56de7263fcf3 ("mm: compaction:
> > > > direct compact when a high-order allocation fails"). The main reason
> > > > is that the migration of page cache pages might recurse back to fs/io
> > > > layer and we could potentially deadlock. This is overly conservative
> > > > because all the anonymous memory is migrateable in the GFP_NOFS context
> > > > just fine.  This might be a large portion of the memory in many/most
> > > > workloads.
> > > > 
> > > > Remove the GFP_NOFS restriction and make sure that we skip all fs pages
> > > > (those with a mapping) while isolating pages to be migrated. We cannot
> > > > consider clean fs pages because they might need a metadata update so
> > > > only isolate pages without any mapping for nofs requests.
> > > > 
> > > > The effect of this patch will be probably very limited in many/most
> > > > workloads because higher order GFP_NOFS requests are quite rare,
> > > 
> > > You say they are rare only because you don't know how to trigger
> > > them easily.  :/
> > 
> > true
> > 
> > > Try this:
> > > 
> > > # mkfs.xfs -f -n size=64k 
> > > # mount  /mnt/scratch
> > > # time ./fs_mark  -D  1  -S0  -n  10  -s  0  -L  32 \
> > > -d  /mnt/scratch/0  -d  /mnt/scratch/1 \
> > > -d  /mnt/scratch/2  -d  /mnt/scratch/3 \
> > > -d  /mnt/scratch/4  -d  /mnt/scratch/5 \
> > > -d  /mnt/scratch/6  -d  /mnt/scratch/7 \
> > > -d  /mnt/scratch/8  -d  /mnt/scratch/9 \
> > > -d  /mnt/scratch/10  -d  /mnt/scratch/11 \
> > > -d  /mnt/scratch/12  -d  /mnt/scratch/13 \
> > > -d  /mnt/scratch/14  -d  /mnt/scratch/15
> > 
> > Does this simulate a standard or usual fs workload/configuration?  I am
> 
> Unfortunately, there was an era of cargo cult configuration tweaks
> in the Ceph community that has resulted in a large number of
> production machines with XFS filesystems configured this way. And a
> lot of them store large numbers of small files and run under
> significant sustained memory pressure.

I see

> I am slowly working towards getting rid of these high order allocations
> and replacing them with the equivalent number of single page
> allocations, but I haven't got that (complex) change working yet.

Definitely a good plan!

Anyway, I was playing with this in my virtual machine (4 CPUs, 512MB of
RAM split into two NUMA nodes). Started on a freshly created fs after
boot, no other load in the guest. The performance numbers should be
taken with a grain of salt, though, because the host has 4 CPUs as well
and it wasn't completely idle, but they should be good enough to give us
at least some picture. This is what fs_mark told me:
Unpatched kernel:
#   Version 3.3, 16 thread(s) starting at Fri Oct  7 09:55:05 2016
#   Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
#   Directories:  Time based hash between directories across 1 
subdirectories with 180 seconds per subdirectory.
#   File names: 40 bytes long, (16 initial bytes of time stamp with 24 
random bytes at end of name)
#   Files info: size 0 bytes, written with an IO size of 16384 bytes per 
write
#   App overhead is time in microseconds spent in the test not doing file 
writing related system calls.
#
FSUse%Count SizeFiles/sec App Overhead
 1  1600   4300.1 20745838
 3  3200   4239.9 23849857
 5  4800   4243.4 25939543
 6  6400   4248.4 19514050
 8  8000   4262.1 20796169
 9  9600   4257.6 21288675
11 11200   4259.7 19375120
13 12800   4220.7 22734141
14 14400   4238.5 31936458
16 16000   4231.5 23409901
18 17600   4045.3 23577700
19 19200   2783.4 58299526
21 20800   2678.2 40616302
23 22400   2693.5 83973996
Ctrl+C because it just took too long.

For me it was much more interesting to see this in the log:
[ 2304.372647] XFS: fs_mark(3289) possible memory allocation deadlock size 
65624 in kmem_alloc (mode:0x2408240)
[ 2304.443323] XFS: fs_mark(3285) possible memory allocation deadlock size 
65728 in kmem_alloc (mode:0x2408240)
[ 4796.772477] XFS: fs_mark(3424) possible memory allocation deadlock size 
46936 in kmem_alloc 

Re: [RFC PATCH] mm, compaction: allow compaction for GFP_NOFS requests

2016-10-07 Thread Michal Hocko
On Fri 07-10-16 10:15:07, Vlastimil Babka wrote:
> On 10/07/2016 08:50 AM, Michal Hocko wrote:
> > On Fri 07-10-16 07:27:37, Vlastimil Babka wrote:
[...]
> > > But make sure you don't break kcompactd and manual compaction from /proc, as
> > > they don't currently set cc->gfp_mask. Looks like until now it was only used
> > > to determine direct compactor's migratetype which is irrelevant in those
> > > contexts.
> > 
> > OK, I see. This is really subtle. One way to go would be to provide a
> > fake gfp_mask for them. How does the following look to you?
> 
> Looks OK. I'll have to think about the kcompactd case, as gfp mask implying
> unmovable migratetype might restrict it without good reason. But that would
> be separate patch anyway, yours doesn't change that (empty gfp_mask also
> means unmovable migratetype) and that's good.

OK, I see. A follow-up patch would be really trivial AFAICS. Just add
__GFP_MOVABLE to the mask. But I am not familiar enough with all these
details to propose a patch with a full description.
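
A rough sketch of what that follow-up could look like (illustration only, not a patch from this thread), reusing the compact_control initializers from the patch quoted above so that the implied migratetype becomes MIGRATE_MOVABLE instead of MIGRATE_UNMOVABLE:

	struct compact_control cc = {
		.mode = MIGRATE_SYNC,
		.ignore_skip_hint = true,
		.whole_zone = true,
		/* fake mask widened so the derived migratetype is MOVABLE */
		.gfp_mask = GFP_KERNEL | __GFP_MOVABLE,
	};

and the same one-liner for the kcompactd_do_work() initializer.
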
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH] mm, compaction: allow compaction for GFP_NOFS requests

2016-10-07 Thread Vlastimil Babka

On 10/07/2016 08:50 AM, Michal Hocko wrote:

On Fri 07-10-16 07:27:37, Vlastimil Babka wrote:
[...]

diff --git a/mm/compaction.c b/mm/compaction.c
index badb92bf14b4..07254a73ee32 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -834,6 +834,13 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
page_count(page) > page_mapcount(page))
goto isolate_fail;

+   /*
+* Only allow to migrate anonymous pages in GFP_NOFS context
+* because those do not depend on fs locks.
+*/
+   if (!(cc->gfp_mask & __GFP_FS) && page_mapping(page))
+   goto isolate_fail;


Unless page can acquire a page_mapping between this check and migration, I
don't see a problem with allowing this.


It can become swapcache but I guess this should be OK. We do not
allow getting here with GFP_NOIO, and migrating swapcache pages in NOFS
mode should be OK AFAICS.


But make sure you don't break kcompactd and manual compaction from /proc, as
they don't currently set cc->gfp_mask. Looks like until now it was only used
to determine direct compactor's migratetype which is irrelevant in those
contexts.


OK, I see. This is really subtle. One way to go would be to provide a
fake gfp_mask for them. How does the following look to you?


Looks OK. I'll have to think about the kcompactd case, as gfp mask 
implying unmovable migratetype might restrict it without good reason. 
But that would be separate patch anyway, yours doesn't change that 
(empty gfp_mask also means unmovable migratetype) and that's good.
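
For reference, the migratetype he refers to is derived from the gfp mask roughly like this (paraphrased from include/linux/gfp.h of that era, from memory); only the mobility bits matter, so an empty mask and plain GFP_KERNEL both end up as MIGRATE_UNMOVABLE:

	/* GFP_MOVABLE_MASK is (__GFP_RECLAIMABLE | __GFP_MOVABLE) */
	static inline int gfpflags_to_migratetype(const gfp_t gfp_flags)
	{
		if (unlikely(page_group_by_mobility_disabled))
			return MIGRATE_UNMOVABLE;

		/* Group based on mobility */
		return (gfp_flags & GFP_MOVABLE_MASK) >> GFP_MOVABLE_SHIFT;
	}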



---
diff --git a/mm/compaction.c b/mm/compaction.c
index 557c165b63ad..d1d90e96ef4b 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1779,6 +1779,7 @@ static void compact_node(int nid)
.mode = MIGRATE_SYNC,
.ignore_skip_hint = true,
.whole_zone = true,
+   .gfp_mask = GFP_KERNEL,
};


@@ -1904,6 +1905,7 @@ static void kcompactd_do_work(pg_data_t *pgdat)
.classzone_idx = pgdat->kcompactd_classzone_idx,
.mode = MIGRATE_SYNC_LIGHT,
.ignore_skip_hint = true,
+   .gfp_mask = GFP_KERNEL,

};
trace_mm_compaction_kcompactd_wake(pgdat->node_id, cc.order,





Re: [RFC PATCH] mm, compaction: allow compaction for GFP_NOFS requests

2016-10-07 Thread Michal Hocko
On Fri 07-10-16 07:27:37, Vlastimil Babka wrote:
[...]
> > diff --git a/mm/compaction.c b/mm/compaction.c
> > index badb92bf14b4..07254a73ee32 100644
> > --- a/mm/compaction.c
> > +++ b/mm/compaction.c
> > @@ -834,6 +834,13 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
> > page_count(page) > page_mapcount(page))
> > goto isolate_fail;
> > 
> > +   /*
> > +* Only allow to migrate anonymous pages in GFP_NOFS context
> > +* because those do not depend on fs locks.
> > +*/
> > +   if (!(cc->gfp_mask & __GFP_FS) && page_mapping(page))
> > +   goto isolate_fail;
> 
> Unless page can acquire a page_mapping between this check and migration, I
> don't see a problem with allowing this.

It can become swapcache but I guess this should be OK. We do not
allow getting here with GFP_NOIO, and migrating swapcache pages in NOFS
mode should be OK AFAICS.
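
For reference, the way an anonymous page "acquires" a mapping here is by being added to the swap cache: once that happens, page_mapping() stops returning NULL and hands back the swap address_space instead. A paraphrased sketch of page_mapping() (mm/util.c of that era, from memory):

	struct address_space *page_mapping(struct page *page)
	{
		struct address_space *mapping;

		page = compound_head(page);

		/* This happens if someone calls flush_dcache_page on slab page */
		if (unlikely(PageSlab(page)))
			return NULL;

		if (unlikely(PageSwapCache(page))) {
			swp_entry_t entry;

			entry.val = page_private(page);
			return swap_address_space(entry);
		}

		mapping = page->mapping;
		if ((unsigned long)mapping & PAGE_MAPPING_ANON)
			return NULL;

		return (void *)((unsigned long)mapping & ~PAGE_MAPPING_FLAGS);
	}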

> But make sure you don't break kcompactd and manual compaction from /proc, as
> they don't currently set cc->gfp_mask. Looks like until now it was only used
> to determine direct compactor's migratetype which is irrelevant in those
> contexts.

OK, I see. This is really subtle. One way to go would be to provide a
fake gfp_mask for them. How does the following look to you?
---
diff --git a/mm/compaction.c b/mm/compaction.c
index 557c165b63ad..d1d90e96ef4b 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1779,6 +1779,7 @@ static void compact_node(int nid)
.mode = MIGRATE_SYNC,
.ignore_skip_hint = true,
.whole_zone = true,
+   .gfp_mask = GFP_KERNEL,
};
 
 
@@ -1904,6 +1905,7 @@ static void kcompactd_do_work(pg_data_t *pgdat)
.classzone_idx = pgdat->kcompactd_classzone_idx,
.mode = MIGRATE_SYNC_LIGHT,
.ignore_skip_hint = true,
+   .gfp_mask = GFP_KERNEL,
 
};
trace_mm_compaction_kcompactd_wake(pgdat->node_id, cc.order,
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH] mm, compaction: allow compaction for GFP_NOFS requests

2016-10-06 Thread Vlastimil Babka

On 10/04/2016 10:12 AM, Michal Hocko wrote:

From: Michal Hocko 

compaction has been disabled for GFP_NOFS and GFP_NOIO requests since
the direct compaction was introduced by 56de7263fcf3 ("mm: compaction:
direct compact when a high-order allocation fails"). The main reason
is that the migration of page cache pages might recurse back to fs/io
layer and we could potentially deadlock. This is overly conservative
because all the anonymous memory is migrateable in the GFP_NOFS context
just fine.  This might be a large portion of the memory in many/most
workloads.

Remove the GFP_NOFS restriction and make sure that we skip all fs pages
(those with a mapping) while isolating pages to be migrated. We cannot
consider clean fs pages because they might need a metadata update so
only isolate pages without any mapping for nofs requests.

The effect of this patch will be probably very limited in many/most
workloads because higher order GFP_NOFS requests are quite rare,
although different configurations might lead to very different results
as GFP_NOFS usage is rather unleashed (e.g. I had hard time to trigger
any with my setup). But still there shouldn't be any strong reason to
completely back off and do nothing in that context. In the worst case
we just skip parts of the block with fs pages. This might be still
sufficient to make a progress for small orders.

Signed-off-by: Michal Hocko 
---

Hi,
I am sending this as an RFC because I am not completely sure this a) is
really worth it and b) it is 100% correct. I couldn't find any problems
when staring into the code but as mentioned in the changelog I wasn't
really able to trigger high order GFP_NOFS requests in my setup.

Thoughts?

 mm/compaction.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index badb92bf14b4..07254a73ee32 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -834,6 +834,13 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
page_count(page) > page_mapcount(page))
goto isolate_fail;

+   /*
+* Only allow to migrate anonymous pages in GFP_NOFS context
+* because those do not depend on fs locks.
+*/
+   if (!(cc->gfp_mask & __GFP_FS) && page_mapping(page))
+   goto isolate_fail;


Unless page can acquire a page_mapping between this check and migration, 
I don't see a problem with allowing this.


But make sure you don't break kcompactd and manual compaction from 
/proc, as they don't currently set cc->gfp_mask. Looks like until now it 
was only used to determine direct compactor's migratetype which is 
irrelevant in those contexts.



+
/* If we already hold the lock, we can skip some rechecking */
if (!locked) {
locked = compact_trylock_irqsave(zone_lru_lock(zone),
@@ -1696,14 +1703,16 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
unsigned int alloc_flags, const struct alloc_context *ac,
enum compact_priority prio)
 {
-   int may_enter_fs = gfp_mask & __GFP_FS;
int may_perform_io = gfp_mask & __GFP_IO;
struct zoneref *z;
struct zone *zone;
enum compact_result rc = COMPACT_SKIPPED;

-   /* Check if the GFP flags allow compaction */
-   if (!may_enter_fs || !may_perform_io)
+   /*
+* Check if the GFP flags allow compaction - GFP_NOIO is really
+* tricky context because the migration might require IO and
+*/
+   if (!may_perform_io)
return COMPACT_SKIPPED;

trace_mm_compaction_try_to_compact_pages(order, gfp_mask, prio);





Re: [RFC PATCH] mm, compaction: allow compaction for GFP_NOFS requests

2016-10-05 Thread Dave Chinner
On Wed, Oct 05, 2016 at 01:38:45PM +0200, Michal Hocko wrote:
> On Wed 05-10-16 07:32:02, Dave Chinner wrote:
> > On Tue, Oct 04, 2016 at 10:12:15AM +0200, Michal Hocko wrote:
> > > From: Michal Hocko 
> > > 
> > > compaction has been disabled for GFP_NOFS and GFP_NOIO requests since
> > > the direct compaction was introduced by 56de7263fcf3 ("mm: compaction:
> > > direct compact when a high-order allocation fails"). The main reason
> > > is that the migration of page cache pages might recurse back to fs/io
> > > layer and we could potentially deadlock. This is overly conservative
> > > because all the anonymous memory is migrateable in the GFP_NOFS context
> > > just fine.  This might be a large portion of the memory in many/most
> > > workloads.
> > > 
> > > Remove the GFP_NOFS restriction and make sure that we skip all fs pages
> > > (those with a mapping) while isolating pages to be migrated. We cannot
> > > consider clean fs pages because they might need a metadata update so
> > > only isolate pages without any mapping for nofs requests.
> > > 
> > > The effect of this patch will be probably very limited in many/most
> > > workloads because higher order GFP_NOFS requests are quite rare,
> > 
> > You say they are rare only because you don't know how to trigger
> > them easily.  :/
> 
> true
> 
> > Try this:
> > 
> > # mkfs.xfs -f -n size=64k 
> > # mount  /mnt/scratch
> > # time ./fs_mark  -D  1  -S0  -n  10  -s  0  -L  32 \
> > -d  /mnt/scratch/0  -d  /mnt/scratch/1 \
> > -d  /mnt/scratch/2  -d  /mnt/scratch/3 \
> > -d  /mnt/scratch/4  -d  /mnt/scratch/5 \
> > -d  /mnt/scratch/6  -d  /mnt/scratch/7 \
> > -d  /mnt/scratch/8  -d  /mnt/scratch/9 \
> > -d  /mnt/scratch/10  -d  /mnt/scratch/11 \
> > -d  /mnt/scratch/12  -d  /mnt/scratch/13 \
> > -d  /mnt/scratch/14  -d  /mnt/scratch/15
> 
> Does this simulate a standard or usual fs workload/configuration?  I am

Unfortunately, there was an era of cargo cult configuration tweaks
in the Ceph community that has resulted in a large number of
production machines with XFS filesystems configured this way. And a
lot of them store large numbers of small files and run under
significant sustained memory pressure.

I am slowly working towards getting rid of these high order allocations
and replacing them with the equivalent number of single page
allocations, but I haven't got that (complex) change working yet.

> not questioning that higher order NOFS allocations exist -
> that's why I came up with the patch in the first place ;). My observation
> was that they are so rare that the visible effect of this patch might be
> quite low or even hard to notice.

Yup, it's a valid observation that would hold true for the majority
of users.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH] mm, compaction: allow compaction for GFP_NOFS requests

2016-10-05 Thread Michal Hocko
On Wed 05-10-16 07:32:02, Dave Chinner wrote:
> On Tue, Oct 04, 2016 at 10:12:15AM +0200, Michal Hocko wrote:
> > From: Michal Hocko 
> > 
> > compaction has been disabled for GFP_NOFS and GFP_NOIO requests since
> > the direct compaction was introduced by 56de7263fcf3 ("mm: compaction:
> > direct compact when a high-order allocation fails"). The main reason
> > is that the migration of page cache pages might recurse back to fs/io
> > layer and we could potentially deadlock. This is overly conservative
> > because all the anonymous memory is migrateable in the GFP_NOFS context
> > just fine.  This might be a large portion of the memory in many/most
> > workloads.
> > 
> > Remove the GFP_NOFS restriction and make sure that we skip all fs pages
> > (those with a mapping) while isolating pages to be migrated. We cannot
> > consider clean fs pages because they might need a metadata update so
> > only isolate pages without any mapping for nofs requests.
> > 
> > The effect of this patch will be probably very limited in many/most
> > workloads because higher order GFP_NOFS requests are quite rare,
> 
> You say they are rare only because you don't know how to trigger
> them easily.  :/

true

> Try this:
> 
> # mkfs.xfs -f -n size=64k 
> # mount  /mnt/scratch
> # time ./fs_mark  -D  1  -S0  -n  10  -s  0  -L  32 \
> -d  /mnt/scratch/0  -d  /mnt/scratch/1 \
> -d  /mnt/scratch/2  -d  /mnt/scratch/3 \
> -d  /mnt/scratch/4  -d  /mnt/scratch/5 \
> -d  /mnt/scratch/6  -d  /mnt/scratch/7 \
> -d  /mnt/scratch/8  -d  /mnt/scratch/9 \
> -d  /mnt/scratch/10  -d  /mnt/scratch/11 \
> -d  /mnt/scratch/12  -d  /mnt/scratch/13 \
> -d  /mnt/scratch/14  -d  /mnt/scratch/15

Does this simulate a standard or usual fs workload/configuration?  I am
not questioning that higher order NOFS allocations exist -
that's why I came up with the patch in the first place ;). My observation
was that they are so rare that the visible effect of this patch might be
quite low or even hard to notice.

Anyway, thanks for a _useful_ testcase to play with! Let's see what
numbers I get from this.

-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH] mm, compaction: allow compaction for GFP_NOFS requests

2016-10-04 Thread Dave Chinner
On Tue, Oct 04, 2016 at 10:12:15AM +0200, Michal Hocko wrote:
> From: Michal Hocko 
> 
> compaction has been disabled for GFP_NOFS and GFP_NOIO requests since
> the direct compaction was introduced by 56de7263fcf3 ("mm: compaction:
> direct compact when a high-order allocation fails"). The main reason
> is that the migration of page cache pages might recurse back to fs/io
> layer and we could potentially deadlock. This is overly conservative
> because all the anonymous memory is migrateable in the GFP_NOFS context
> just fine.  This might be a large portion of the memory in many/most
> workloads.
> 
> Remove the GFP_NOFS restriction and make sure that we skip all fs pages
> (those with a mapping) while isolating pages to be migrated. We cannot
> consider clean fs pages because they might need a metadata update so
> only isolate pages without any mapping for nofs requests.
> 
> The effect of this patch will be probably very limited in many/most
> workloads because higher order GFP_NOFS requests are quite rare,

You say they are rare only because you don't know how to trigger
them easily.  :/

Try this:

# mkfs.xfs -f -n size=64k 
# mount  /mnt/scratch
# time ./fs_mark  -D  1  -S0  -n  10  -s  0  -L  32 \
-d  /mnt/scratch/0  -d  /mnt/scratch/1 \
-d  /mnt/scratch/2  -d  /mnt/scratch/3 \
-d  /mnt/scratch/4  -d  /mnt/scratch/5 \
-d  /mnt/scratch/6  -d  /mnt/scratch/7 \
-d  /mnt/scratch/8  -d  /mnt/scratch/9 \
-d  /mnt/scratch/10  -d  /mnt/scratch/11 \
-d  /mnt/scratch/12  -d  /mnt/scratch/13 \
-d  /mnt/scratch/14  -d  /mnt/scratch/15

As soon as tail pushing on the journal starts (a few seconds in,
most likely), you'll start to see lots of 65kB allocations being
requested in GFP_NOFS context by the xfs-cil-worker context doing
journal checkpoint formatting

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


[RFC PATCH] mm, compaction: allow compaction for GFP_NOFS requests

2016-10-04 Thread Michal Hocko
From: Michal Hocko 

compaction has been disabled for GFP_NOFS and GFP_NOIO requests since
the direct compaction was introduced by 56de7263fcf3 ("mm: compaction:
direct compact when a high-order allocation fails"). The main reason
is that the migration of page cache pages might recurse back to fs/io
layer and we could potentially deadlock. This is overly conservative
because all the anonymous memory is migrateable in the GFP_NOFS context
just fine.  This might be a large portion of the memory in many/most
workloads.

Remove the GFP_NOFS restriction and make sure that we skip all fs pages
(those with a mapping) while isolating pages to be migrated. We cannot
consider clean fs pages because they might need a metadata update so
only isolate pages without any mapping for nofs requests.

The effect of this patch will probably be very limited in many/most
workloads because higher order GFP_NOFS requests are quite rare,
although different configurations might lead to very different results
as GFP_NOFS usage is rather unleashed (e.g. I had a hard time triggering
any with my setup). But still there shouldn't be any strong reason to
completely back off and do nothing in that context. In the worst case
we just skip parts of the block with fs pages. This might still be
sufficient to make progress for small orders.

Signed-off-by: Michal Hocko 
---

Hi,
I am sending this as an RFC because I am not completely sure this is a)
really worth it and b) 100% correct. I couldn't find any problems
when staring into the code but as mentioned in the changelog I wasn't
really able to trigger high order GFP_NOFS requests in my setup.

Thoughts?

 mm/compaction.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index badb92bf14b4..07254a73ee32 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -834,6 +834,13 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
page_count(page) > page_mapcount(page))
goto isolate_fail;
 
+   /*
+* Only allow to migrate anonymous pages in GFP_NOFS context
+* because those do not depend on fs locks.
+*/
+   if (!(cc->gfp_mask & __GFP_FS) && page_mapping(page))
+   goto isolate_fail;
+
/* If we already hold the lock, we can skip some rechecking */
if (!locked) {
locked = compact_trylock_irqsave(zone_lru_lock(zone),
@@ -1696,14 +1703,16 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
unsigned int alloc_flags, const struct alloc_context *ac,
enum compact_priority prio)
 {
-   int may_enter_fs = gfp_mask & __GFP_FS;
int may_perform_io = gfp_mask & __GFP_IO;
struct zoneref *z;
struct zone *zone;
enum compact_result rc = COMPACT_SKIPPED;
 
-   /* Check if the GFP flags allow compaction */
-   if (!may_enter_fs || !may_perform_io)
+   /*
+* Check if the GFP flags allow compaction - GFP_NOIO is really
+* tricky context because the migration might require IO and
+*/
+   if (!may_perform_io)
return COMPACT_SKIPPED;
 
trace_mm_compaction_try_to_compact_pages(order, gfp_mask, prio);
-- 
2.9.3


