Re: [RFC 0/3] reduce latency of direct async compaction

2015-12-09 Thread Aaron Lu
On 12/10/2015 12:35 PM, Joonsoo Kim wrote:
> On Wed, Dec 09, 2015 at 01:40:06PM +0800, Aaron Lu wrote:
>> On Wed, Dec 09, 2015 at 09:33:53AM +0900, Joonsoo Kim wrote:
>>> On Tue, Dec 08, 2015 at 04:52:42PM +0800, Aaron Lu wrote:
>>>> On Tue, Dec 08, 2015 at 03:51:16PM +0900, Joonsoo Kim wrote:
>>>>> I added a workaround for this problem in isolate_freepages(). Please test
>>>>> the following one.
>>>>
>>>> Still no luck and the error is about the same:
>>>
>>> There is a mistake... Could you add parentheses around
>>> cc->free_pfn & ~(pageblock_nr_pages-1), like the following?
>>>
>>> cc->free_pfn == (cc->free_pfn & ~(pageblock_nr_pages-1))
>>
>> Oh right, of course.
>>
>> Good news, the result is much better now:
>> $ cat {0..8}/swap
>> cmdline: /lkp/aaron/src/bin/usemem 100064603136
>> 100064603136 transferred in 72 seconds, throughput: 1325 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 100072049664
>> 100072049664 transferred in 74 seconds, throughput: 1289 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 100070246400
>> 100070246400 transferred in 92 seconds, throughput: 1037 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 100069545984
>> 100069545984 transferred in 81 seconds, throughput: 1178 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 100058895360
>> 100058895360 transferred in 78 seconds, throughput: 1223 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 100066074624
>> 100066074624 transferred in 94 seconds, throughput: 1015 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 100062855168
>> 100062855168 transferred in 77 seconds, throughput: 1239 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 100060990464
>> 100060990464 transferred in 73 seconds, throughput: 1307 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 100064996352
>> 100064996352 transferred in 84 seconds, throughput: 1136 MB/s
>> Max: 1325 MB/s
>> Min: 1015 MB/s
>> Avg: 1194 MB/s
> 
> Nice result! Thanks for testing.
> I will make a properly formatted patch soon.

Thanks for the nice work.

> 
> So, is your concern solved? I think that the performance of

I think so.

> always-always on this test case can't match the performance of
> always-never, because the migration cost of making hugepages is
> additionally charged to the always-always case. In exchange, it will have
> more hugepage mappings, which may result in better performance in some
> situations. I guess that is the intention of that option.

OK, I see.

Regards,
Aaron


Re: [RFC 0/3] reduce latency of direct async compaction

2015-12-09 Thread Joonsoo Kim
On Wed, Dec 09, 2015 at 01:40:06PM +0800, Aaron Lu wrote:
> On Wed, Dec 09, 2015 at 09:33:53AM +0900, Joonsoo Kim wrote:
> > On Tue, Dec 08, 2015 at 04:52:42PM +0800, Aaron Lu wrote:
> > > On Tue, Dec 08, 2015 at 03:51:16PM +0900, Joonsoo Kim wrote:
> > > > I added a workaround for this problem in isolate_freepages(). Please test
> > > > the following one.
> > > 
> > > Still no luck and the error is about the same:
> > 
> > There is a mistake... Could you add parentheses around
> > cc->free_pfn & ~(pageblock_nr_pages-1), like the following?
> > 
> > cc->free_pfn == (cc->free_pfn & ~(pageblock_nr_pages-1))
> 
> Oh right, of course.
> 
> Good news, the result is much better now:
> $ cat {0..8}/swap
> cmdline: /lkp/aaron/src/bin/usemem 100064603136
> 100064603136 transferred in 72 seconds, throughput: 1325 MB/s
> cmdline: /lkp/aaron/src/bin/usemem 100072049664
> 100072049664 transferred in 74 seconds, throughput: 1289 MB/s
> cmdline: /lkp/aaron/src/bin/usemem 100070246400
> 100070246400 transferred in 92 seconds, throughput: 1037 MB/s
> cmdline: /lkp/aaron/src/bin/usemem 100069545984
> 100069545984 transferred in 81 seconds, throughput: 1178 MB/s
> cmdline: /lkp/aaron/src/bin/usemem 100058895360
> 100058895360 transferred in 78 seconds, throughput: 1223 MB/s
> cmdline: /lkp/aaron/src/bin/usemem 100066074624
> 100066074624 transferred in 94 seconds, throughput: 1015 MB/s
> cmdline: /lkp/aaron/src/bin/usemem 100062855168
> 100062855168 transferred in 77 seconds, throughput: 1239 MB/s
> cmdline: /lkp/aaron/src/bin/usemem 100060990464
> 100060990464 transferred in 73 seconds, throughput: 1307 MB/s
> cmdline: /lkp/aaron/src/bin/usemem 100064996352
> 100064996352 transferred in 84 seconds, throughput: 1136 MB/s
> Max: 1325 MB/s
> Min: 1015 MB/s
> Avg: 1194 MB/s

Nice result! Thanks for testing.
I will make a properly formatted patch soon.

So, is your concern solved? I think that the performance of
always-always on this test case can't match the performance of
always-never, because the migration cost of making hugepages is
additionally charged to the always-always case. In exchange, it will have
more hugepage mappings, which may result in better performance in some
situations. I guess that is the intention of that option.
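
[Editor's note: "always-always" and "always-never" here presumably name the
(enabled, defrag) pair of transparent hugepage sysfs knobs that the test
matrix varies; the paths below are the v4.4-era ones, and this small C
reader is only an illustration added for this digest, not part of the
thread:]

#include <stdio.h>

/* Print one THP knob, e.g. "always [madvise] never". */
static void show(const char *path)
{
        char buf[128];
        FILE *f = fopen(path, "r");

        if (f) {
                if (fgets(buf, sizeof(buf), f))
                        printf("%s: %s", path, buf);
                fclose(f);
        }
}

int main(void)
{
        /* "always-always": both knobs set to always.
         * "always-never": enabled=always, defrag=never. */
        show("/sys/kernel/mm/transparent_hugepage/enabled");
        show("/sys/kernel/mm/transparent_hugepage/defrag");
        return 0;
}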

Thanks.


Re: [RFC 0/3] reduce latency of direct async compaction

2015-12-08 Thread Aaron Lu
On Wed, Dec 09, 2015 at 09:33:53AM +0900, Joonsoo Kim wrote:
> On Tue, Dec 08, 2015 at 04:52:42PM +0800, Aaron Lu wrote:
> > On Tue, Dec 08, 2015 at 03:51:16PM +0900, Joonsoo Kim wrote:
> > > I added a workaround for this problem in isolate_freepages(). Please test
> > > the following one.
> > 
> > Still no luck and the error is about the same:
> 
> There is a mistake... Could you add parentheses around
> cc->free_pfn & ~(pageblock_nr_pages-1), like the following?
> 
> cc->free_pfn == (cc->free_pfn & ~(pageblock_nr_pages-1))

Oh right, of course.

Good news, the result is much better now:
$ cat {0..8}/swap
cmdline: /lkp/aaron/src/bin/usemem 100064603136
100064603136 transferred in 72 seconds, throughput: 1325 MB/s
cmdline: /lkp/aaron/src/bin/usemem 100072049664
100072049664 transferred in 74 seconds, throughput: 1289 MB/s
cmdline: /lkp/aaron/src/bin/usemem 100070246400
100070246400 transferred in 92 seconds, throughput: 1037 MB/s
cmdline: /lkp/aaron/src/bin/usemem 100069545984
100069545984 transferred in 81 seconds, throughput: 1178 MB/s
cmdline: /lkp/aaron/src/bin/usemem 100058895360
100058895360 transferred in 78 seconds, throughput: 1223 MB/s
cmdline: /lkp/aaron/src/bin/usemem 100066074624
100066074624 transferred in 94 seconds, throughput: 1015 MB/s
cmdline: /lkp/aaron/src/bin/usemem 100062855168
100062855168 transferred in 77 seconds, throughput: 1239 MB/s
cmdline: /lkp/aaron/src/bin/usemem 100060990464
100060990464 transferred in 73 seconds, throughput: 1307 MB/s
cmdline: /lkp/aaron/src/bin/usemem 100064996352
100064996352 transferred in 84 seconds, throughput: 1136 MB/s
Max: 1325 MB/s
Min: 1015 MB/s
Avg: 1194 MB/s

The base result for reference:
$ cat {0..8}/swap
cmdline: /lkp/aaron/src/bin/usemem 10622592
10622592 transferred in 103 seconds, throughput: 925 MB/s
cmdline: /lkp/aaron/src/bin/usemem 9559680
9559680 transferred in 92 seconds, throughput: 1036 MB/s
cmdline: /lkp/aaron/src/bin/usemem 6171264
6171264 transferred in 92 seconds, throughput: 1036 MB/s
cmdline: /lkp/aaron/src/bin/usemem 15663744
15663744 transferred in 150 seconds, throughput: 635 MB/s
cmdline: /lkp/aaron/src/bin/usemem 12966528
12966528 transferred in 87 seconds, throughput: 1096 MB/s
cmdline: /lkp/aaron/src/bin/usemem 5784192
5784192 transferred in 131 seconds, throughput: 727 MB/s
cmdline: /lkp/aaron/src/bin/usemem 13731456
13731456 transferred in 97 seconds, throughput: 983 MB/s
cmdline: /lkp/aaron/src/bin/usemem 16440960
16440960 transferred in 109 seconds, throughput: 874 MB/s
cmdline: /lkp/aaron/src/bin/usemem 8813184
8813184 transferred in 122 seconds, throughput: 781 MB/s
Max: 1096 MB/s
Min: 635 MB/s
Avg: 899 MB/s


Re: [RFC 0/3] reduce latency of direct async compaction

2015-12-08 Thread Joonsoo Kim
On Tue, Dec 08, 2015 at 04:52:42PM +0800, Aaron Lu wrote:
> On Tue, Dec 08, 2015 at 03:51:16PM +0900, Joonsoo Kim wrote:
> > On Tue, Dec 08, 2015 at 01:14:39PM +0800, Aaron Lu wrote:
> > > On Tue, Dec 08, 2015 at 09:41:18AM +0900, Joonsoo Kim wrote:
> > > > On Mon, Dec 07, 2015 at 04:59:56PM +0800, Aaron Lu wrote:
> > > > > On Mon, Dec 07, 2015 at 04:35:24PM +0900, Joonsoo Kim wrote:
> > > > > > It looks like overhead still remains. I guess that the migration scanner
> > > > > > would call pageblock_pfn_to_page() over a more extended range, so
> > > > > > overhead still remains.
> > > > > > 
> > > > > > I have an idea to solve this problem. Aaron, could you test the following
> > > > > > patch on top of base? It tries to skip calling pageblock_pfn_to_page()
> > > > > 
> > > > > It doesn't apply on top of 25364a9e54fb8296837061bf684b76d20eec01fb
> > > > > cleanly, so I made some changes to make it apply and the result is:
> > > > > https://github.com/aaronlu/linux/commit/cb8d05829190b806ad3948ff9b9e08c8ba1daf63
> > > > 
> > > > Yes, that's okay. I made it on my working branch, but it will not cause
> > > > any problem other than applying.
> > > > 
> > > > > 
> > > > > A problem occurred right after the test started:
> > > > > [   58.080962] BUG: unable to handle kernel paging request at ea008218
> > > > > [   58.089124] IP: [] compaction_alloc+0xf9/0x270
> > > > > [   58.096109] PGD 107ffd6067 PUD 207f7d5067 PMD 0
> > > > > [   58.101569] Oops:  [#1] SMP 
> > > > 
> > > > I made a mistake. Please test the following patch. It is also made
> > > > on my working branch, so you need to resolve a conflict, but it would be
> > > > trivial.
> > > > 
> > > > I inserted some logs to check whether the zone is contiguous or not.
> > > > Please check that the normal zone is set to contiguous after testing.
> > > 
> > > Yes it is contiguous, but unfortunately, the problem remains:
> > > [   56.536930] check_zone_contiguous: Normal
> > > [   56.543467] check_zone_contiguous: Normal: contiguous
> > > [   56.549640] BUG: unable to handle kernel paging request at ea008218
> > > [   56.557717] IP: [] compaction_alloc+0xf9/0x270
> > > [   56.564719] PGD 107ffd6067 PUD 207f7d5067 PMD 0
> > > 
> > 
> > Maybe I found the reason: cc->free_pfn can be initialized to an invalid pfn
> > that isn't checked, so the optimized pageblock_pfn_to_page() causes the BUG().
> > 
> > I added a workaround for this problem in isolate_freepages(). Please test
> > the following one.
> 
> Still no luck and the error is about the same:

There is a mistake... Could you add parentheses around
cc->free_pfn & ~(pageblock_nr_pages-1), like the following?

cc->free_pfn == (cc->free_pfn & ~(pageblock_nr_pages-1))
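
[Editor's note: the parentheses matter because C's == binds tighter than &.
A self-contained demo with made-up values; only the two expressions come
from the thread:]

#include <stdio.h>

int main(void)
{
        unsigned long pageblock_nr_pages = 512; /* e.g. 2MB blocks of 4KB pages */
        unsigned long free_pfn = 512;           /* a pageblock-aligned pfn */

        /*
         * Without parentheses this parses as
         *   (free_pfn == free_pfn) & ~(pageblock_nr_pages - 1)
         * which is 1 & ~0x1ff, i.e. always 0 -- so the alignment check
         * never fired, which is why the oops persisted.
         */
        int buggy = free_pfn == free_pfn & ~(pageblock_nr_pages - 1);

        /* With parentheses it really tests pageblock alignment. */
        int fixed = free_pfn == (free_pfn & ~(pageblock_nr_pages - 1));

        printf("buggy=%d fixed=%d\n", buggy, fixed); /* buggy=0 fixed=1 */
        return 0;
}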

Thanks.

> 
> [   64.727792] check_zone_contiguous: Normal
> [   64.733950] check_zone_contiguous: Normal: contiguous
> [   64.741610] BUG: unable to handle kernel paging request at ea008218
> [   64.749708] IP: [] compaction_alloc+0xf9/0x270
> [   64.756806] PGD 107ffd6067 PUD 207f7d5067 PMD 0 
> [   64.762302] Oops:  [#1] SMP 
> [   64.766294] Modules linked in: scsi_debug rpcsec_gss_krb5 auth_rpcgss 
> nfsv4 dns_resolver netconsole sg sd_mod x86_pkg_temp_thermal coretemp 
> kvm_intel kvm mgag200 irqbypass crct10dif_pclmul ttm crc32_pclmul 
> crc32c_intel drm_kms_helper ahci syscopyarea sysfillrect sysimgblt snd_pcm 
> libahci fb_sys_fops snd_timer snd sb_edac aesni_intel soundcore lrw drm 
> gf128mul pcspkr edac_core ipmi_devintf glue_helper ablk_helper cryptd libata 
> ipmi_si shpchp wmi ipmi_msghandler acpi_power_meter acpi_pad
> [   64.816579] CPU: 19 PID: 1526 Comm: usemem Not tainted 
> 4.4.0-rc3-00025-gf60ea5f #1
> [   64.825419] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS 
> SE5C610.86B.01.01.0008.021120151325 02/11/2015
> [   64.837483] task: 88168a0aca80 ti: 88168a564000 
> task.ti:88168a564000
> [   64.846264] RIP: 0010:[]  [] 
> compaction_alloc+0xf9/0x270
> [   64.856147] RSP: :88168a567940  EFLAGS: 00010286
> [   64.862520] RAX: 88207ffdcd80 RBX: 88168a567ac0 RCX: 
> 88207ffdcd80
> [   64.870944] RDX: 0208 RSI: 88168a567ac0 RDI: 
> 88168a567ac0
> [   64.879377] RBP: 88168a567990 R08: ea008200 R09: 
> 
> [   64.887813] R10:  R11: 0001ae88 R12: 
> ea008200
> [   64.896254] R13: ea0059f20780 R14: 0208 R15: 
> 0208
> [   64.904704] FS:  7f2d4e6e8700() GS:88203444() 
> knlGS:
> [   64.914232] CS:  0010 DS:  ES:  CR0: 80050033
> [   64.921151] CR2: ea008218 CR3: 002015771000 CR4: 
> 001406e0
> [   64.929635] Stack:
> [   64.932413]  88168a568000 0167ca00 81193196 
> 88207ffdcd80
> [   64.941292]  0208 ea0059f207c0 88168a567ac0 
> ea0059f20780
> [   64.950179]  ea0059f207e0 88207ffdcd80 88168a567a20 
> 811d097e
> [   64.959071] 

Re: [RFC 0/3] reduce latency of direct async compaction

2015-12-08 Thread Aaron Lu
On Tue, Dec 08, 2015 at 03:51:16PM +0900, Joonsoo Kim wrote:
> On Tue, Dec 08, 2015 at 01:14:39PM +0800, Aaron Lu wrote:
> > On Tue, Dec 08, 2015 at 09:41:18AM +0900, Joonsoo Kim wrote:
> > > On Mon, Dec 07, 2015 at 04:59:56PM +0800, Aaron Lu wrote:
> > > > On Mon, Dec 07, 2015 at 04:35:24PM +0900, Joonsoo Kim wrote:
> > > > > It looks like overhead still remains. I guess that the migration scanner
> > > > > would call pageblock_pfn_to_page() over a more extended range, so
> > > > > overhead still remains.
> > > > > 
> > > > > I have an idea to solve this problem. Aaron, could you test the following
> > > > > patch on top of base? It tries to skip calling pageblock_pfn_to_page()
> > > > 
> > > > It doesn't apply on top of 25364a9e54fb8296837061bf684b76d20eec01fb
> > > > cleanly, so I made some changes to make it apply and the result is:
> > > > https://github.com/aaronlu/linux/commit/cb8d05829190b806ad3948ff9b9e08c8ba1daf63
> > > 
> > > Yes, that's okay. I made it on my working branch, but it will not cause
> > > any problem other than applying.
> > > 
> > > > 
> > > > A problem occurred right after the test started:
> > > > [   58.080962] BUG: unable to handle kernel paging request at ea008218
> > > > [   58.089124] IP: [] compaction_alloc+0xf9/0x270
> > > > [   58.096109] PGD 107ffd6067 PUD 207f7d5067 PMD 0
> > > > [   58.101569] Oops:  [#1] SMP 
> > > 
> > > I made a mistake. Please test the following patch. It is also made
> > > on my working branch, so you need to resolve a conflict, but it would be
> > > trivial.
> > > 
> > > I inserted some logs to check whether the zone is contiguous or not.
> > > Please check that the normal zone is set to contiguous after testing.
> > 
> > Yes it is contiguous, but unfortunately, the problem remains:
> > [   56.536930] check_zone_contiguous: Normal
> > [   56.543467] check_zone_contiguous: Normal: contiguous
> > [   56.549640] BUG: unable to handle kernel paging request at ea008218
> > [   56.557717] IP: [] compaction_alloc+0xf9/0x270
> > [   56.564719] PGD 107ffd6067 PUD 207f7d5067 PMD 0
> > 
> 
> Maybe I found the reason: cc->free_pfn can be initialized to an invalid pfn
> that isn't checked, so the optimized pageblock_pfn_to_page() causes the BUG().
> 
> I added a workaround for this problem in isolate_freepages(). Please test
> the following one.

Still no luck and the error is about the same:

[   64.727792] check_zone_contiguous: Normal
[   64.733950] check_zone_contiguous: Normal: contiguous
[   64.741610] BUG: unable to handle kernel paging request at ea008218
[   64.749708] IP: [] compaction_alloc+0xf9/0x270
[   64.756806] PGD 107ffd6067 PUD 207f7d5067 PMD 0 
[   64.762302] Oops:  [#1] SMP 
[   64.766294] Modules linked in: scsi_debug rpcsec_gss_krb5 auth_rpcgss nfsv4 
dns_resolver netconsole sg sd_mod x86_pkg_temp_thermal coretemp kvm_intel kvm 
mgag200 irqbypass crct10dif_pclmul ttm crc32_pclmul crc32c_intel drm_kms_helper 
ahci syscopyarea sysfillrect sysimgblt snd_pcm libahci fb_sys_fops snd_timer 
snd sb_edac aesni_intel soundcore lrw drm gf128mul pcspkr edac_core 
ipmi_devintf glue_helper ablk_helper cryptd libata ipmi_si shpchp wmi 
ipmi_msghandler acpi_power_meter acpi_pad
[   64.816579] CPU: 19 PID: 1526 Comm: usemem Not tainted 
4.4.0-rc3-00025-gf60ea5f #1
[   64.825419] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS 
SE5C610.86B.01.01.0008.021120151325 02/11/2015
[   64.837483] task: 88168a0aca80 ti: 88168a564000 
task.ti:88168a564000
[   64.846264] RIP: 0010:[]  [] 
compaction_alloc+0xf9/0x270
[   64.856147] RSP: :88168a567940  EFLAGS: 00010286
[   64.862520] RAX: 88207ffdcd80 RBX: 88168a567ac0 RCX: 88207ffdcd80
[   64.870944] RDX: 0208 RSI: 88168a567ac0 RDI: 88168a567ac0
[   64.879377] RBP: 88168a567990 R08: ea008200 R09: 
[   64.887813] R10:  R11: 0001ae88 R12: ea008200
[   64.896254] R13: ea0059f20780 R14: 0208 R15: 0208
[   64.904704] FS:  7f2d4e6e8700() GS:88203444() 
knlGS:
[   64.914232] CS:  0010 DS:  ES:  CR0: 80050033
[   64.921151] CR2: ea008218 CR3: 002015771000 CR4: 001406e0
[   64.929635] Stack:
[   64.932413]  88168a568000 0167ca00 81193196 
88207ffdcd80
[   64.941292]  0208 ea0059f207c0 88168a567ac0 
ea0059f20780
[   64.950179]  ea0059f207e0 88207ffdcd80 88168a567a20 
811d097e
[   64.959071] Call Trace:
[   64.962364]  [] ? update_pageblock_skip+0x56/0xa0
[   64.969939]  [] migrate_pages+0x28e/0x7b0
[   64.976728]  [] ? update_pageblock_skip+0xa0/0xa0
[   64.984312]  [] ? __pageblock_pfn_to_page+0xe0/0xe0
[   64.992093]  [] compact_zone+0x38a/0x8e0
[   64.998811]  [] compact_zone_order+0x6d/0x90
[   65.005926]  [] ? get_page_from_freelist+0xd4/0xa20
[   65.013861]  [] try_to_compact_pages+0xec/0x210
[   65.021212]  [] ? 

Re: [RFC 0/3] reduce latency of direct async compaction

2015-12-07 Thread Joonsoo Kim
On Tue, Dec 08, 2015 at 01:14:39PM +0800, Aaron Lu wrote:
> On Tue, Dec 08, 2015 at 09:41:18AM +0900, Joonsoo Kim wrote:
> > On Mon, Dec 07, 2015 at 04:59:56PM +0800, Aaron Lu wrote:
> > > On Mon, Dec 07, 2015 at 04:35:24PM +0900, Joonsoo Kim wrote:
> > > > It looks like overhead still remains. I guess that the migration scanner
> > > > would call pageblock_pfn_to_page() over a more extended range, so
> > > > overhead still remains.
> > > > 
> > > > I have an idea to solve this problem. Aaron, could you test the following
> > > > patch on top of base? It tries to skip calling pageblock_pfn_to_page()
> > > 
> > > It doesn't apply on top of 25364a9e54fb8296837061bf684b76d20eec01fb
> > > cleanly, so I made some changes to make it apply and the result is:
> > > https://github.com/aaronlu/linux/commit/cb8d05829190b806ad3948ff9b9e08c8ba1daf63
> > 
> > Yes, that's okay. I made it on my working branch, but it will not cause
> > any problem other than applying.
> > > 
> > > There is a problem occured right after the test starts:
> > > [   58.080962] BUG: unable to handle kernel paging request at 
> > > ea008218
> > > [   58.089124] IP: [] compaction_alloc+0xf9/0x270
> > > [   58.096109] PGD 107ffd6067 PUD 207f7d5067 PMD 0
> > > [   58.101569] Oops:  [#1] SMP 
> > 
> > I did some mistake. Please test following patch. It is also made
> > on my working branch so you need to resolve conflict but it would be
> > trivial.
> > 
> > I inserted some logs to check whether zone is contiguous or not.
> > Please check that normal zone is set to contiguous after testing.
> 
> Yes it is contiguous, but unfortunately, the problem remains:
> [   56.536930] check_zone_contiguous: Normal
> [   56.543467] check_zone_contiguous: Normal: contiguous
> [   56.549640] BUG: unable to handle kernel paging request at ea008218
> [   56.557717] IP: [] compaction_alloc+0xf9/0x270
> [   56.564719] PGD 107ffd6067 PUD 207f7d5067 PMD 0
> 

Maybe I found the reason: cc->free_pfn can be initialized to an invalid pfn
that isn't checked, so the optimized pageblock_pfn_to_page() causes the BUG().

I added a workaround for this problem in isolate_freepages(). Please test
the following one.
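
[Editor's note: a self-contained userspace sketch of the hazard described
above, with invented names and sizes. Once contiguity is cached, the fast
path skips validation, so a start pfn that was never checked has to be
filtered out first; the real workaround lives in isolate_freepages() and is
not shown verbatim in this thread:]

#include <stdio.h>

#define PAGEBLOCK_NR_PAGES 8UL  /* invented: 8 pfns per pageblock */
#define ZONE_END 32UL           /* invented: zone spans pfns [0, 32) */

static int memmap[ZONE_END];    /* stand-in for the struct page array */
static int zone_contiguous = 1; /* result cached by the contiguity walk */

/* Fast path: when the zone is known contiguous, index the memmap
 * directly -- safe only if pfn itself is trustworthy. */
static int *pageblock_pfn_to_page(unsigned long pfn)
{
        if (zone_contiguous == 1)
                return &memmap[pfn];    /* bogus pfn => wild access */
        return pfn < ZONE_END ? &memmap[pfn] : NULL;
}

int main(void)
{
        unsigned long free_pfn = 35;    /* never validated, out of zone */

        /* Guard in the spirit of the workaround: only trust a start pfn
         * that is pageblock-aligned and inside the zone. */
        if (free_pfn == (free_pfn & ~(PAGEBLOCK_NR_PAGES - 1)) &&
            free_pfn < ZONE_END)
                printf("page at %p\n", (void *)pageblock_pfn_to_page(free_pfn));
        else
                printf("rejecting unchecked free_pfn %lu\n", free_pfn);
        return 0;
}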

Thanks.

-->8---
From 7e954a68fb555a868acc5860627a1ad8dadbe3bf Mon Sep 17 00:00:00 2001
From: Joonsoo Kim 
Date: Mon, 7 Dec 2015 14:51:42 +0900
Subject: [PATCH] mm/compaction: Optimize pageblock_pfn_to_page() for
 contiguous zone

Signed-off-by: Joonsoo Kim 
---
 include/linux/mmzone.h |  1 +
 mm/compaction.c       | 60 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 60 insertions(+), 1 deletion(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index e23a9e7..573f9a9 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -521,6 +521,7 @@ struct zone {
 #endif
 
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
+        int                     contiguous;
         /* Set to true when the PG_migrate_skip bits should be cleared */
         bool                    compact_blockskip_flush;
 #endif
diff --git a/mm/compaction.c b/mm/compaction.c
index de3e1e7..ff5fb04 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -88,7 +88,7 @@ static inline bool migrate_async_suitable(int migratetype)
  * the first and last page of a pageblock and avoid checking each individual
  * page in a pageblock.
  */
-static struct page *pageblock_pfn_to_page(unsigned long start_pfn,
+static struct page *__pageblock_pfn_to_page(unsigned long start_pfn,
                         unsigned long end_pfn, struct zone *zone)
 {
         struct page *start_page;
@@ -114,6 +114,56 @@ static struct page *pageblock_pfn_to_page(unsigned long start_pfn,
         return start_page;
 }
 
+static inline struct page *pageblock_pfn_to_page(unsigned long start_pfn,
+                                       unsigned long end_pfn, struct zone *zone)
+{
+        if (zone->contiguous == 1)
+                return pfn_to_page(start_pfn);
+
+        return __pageblock_pfn_to_page(start_pfn, end_pfn, zone);
+}
+
+static void check_zone_contiguous(struct zone *zone)
+{
+        unsigned long block_start_pfn = zone->zone_start_pfn;
+        unsigned long block_end_pfn;
+        unsigned long pfn;
+
+        /* Already checked */
+        if (zone->contiguous)
+                return;
+
+        printk("%s: %s\n", __func__, zone->name);
+        block_end_pfn = ALIGN(block_start_pfn + 1, pageblock_nr_pages);
+        for (; block_start_pfn < zone_end_pfn(zone);
+                        block_start_pfn = block_end_pfn,
+                        block_end_pfn += pageblock_nr_pages) {
+
+                block_end_pfn = min(block_end_pfn, zone_end_pfn(zone));
+
+                if (!__pageblock_pfn_to_page(block_start_pfn,
+                                block_end_pfn, zone)) {
+                        /* We have hole */
+                        zone->contiguous = -1;
+                        printk("%s: %s: uncontiguous\n", __func__, zone->name);
+                        return;
+

Re: [RFC 0/3] reduce latency of direct async compaction

2015-12-07 Thread Aaron Lu
On Tue, Dec 08, 2015 at 09:41:18AM +0900, Joonsoo Kim wrote:
> On Mon, Dec 07, 2015 at 04:59:56PM +0800, Aaron Lu wrote:
> > On Mon, Dec 07, 2015 at 04:35:24PM +0900, Joonsoo Kim wrote:
> > > It looks like overhead still remains. I guess that the migration scanner
> > > would call pageblock_pfn_to_page() over a more extended range, so
> > > overhead still remains.
> > > 
> > > I have an idea to solve this problem. Aaron, could you test the following patch
> > > on top of base? It tries to skip calling pageblock_pfn_to_page()
> > 
> > It doesn't apply on top of 25364a9e54fb8296837061bf684b76d20eec01fb
> > cleanly, so I made some changes to make it apply and the result is:
> > https://github.com/aaronlu/linux/commit/cb8d05829190b806ad3948ff9b9e08c8ba1daf63
> 
> Yes, that's okay. I made it on my working branch, but it will not cause
> any problem other than applying.
> 
> > 
> > A problem occurred right after the test started:
> > [   58.080962] BUG: unable to handle kernel paging request at ea008218
> > [   58.089124] IP: [] compaction_alloc+0xf9/0x270
> > [   58.096109] PGD 107ffd6067 PUD 207f7d5067 PMD 0
> > [   58.101569] Oops:  [#1] SMP 
> 
> I made a mistake. Please test the following patch. It is also made
> on my working branch, so you need to resolve a conflict, but it would be
> trivial.
> 
> I inserted some logs to check whether the zone is contiguous or not.
> Please check that the normal zone is set to contiguous after testing.

Yes it is contiguous, but unfortunately, the problem remains:
[   56.536930] check_zone_contiguous: Normal
[   56.543467] check_zone_contiguous: Normal: contiguous
[   56.549640] BUG: unable to handle kernel paging request at ea008218
[   56.557717] IP: [] compaction_alloc+0xf9/0x270
[   56.564719] PGD 107ffd6067 PUD 207f7d5067 PMD 0

Full dmesg attached.

Thanks,
Aaron

> 
> Thanks.
> 
> -->8--
> From 4a1a08d8ab3fb165b87ad2ec0a2000ff6892330f Mon Sep 17 00:00:00 2001
> From: Joonsoo Kim 
> Date: Mon, 7 Dec 2015 14:51:42 +0900
> Subject: [PATCH] mm/compaction: Optimize pageblock_pfn_to_page() for
>  contiguous zone
> 
> Signed-off-by: Joonsoo Kim 
> ---
>  include/linux/mmzone.h |  1 +
>  mm/compaction.c       | 54 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 54 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index e23a9e7..573f9a9 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -521,6 +521,7 @@ struct zone {
>  #endif
>  
>  #if defined CONFIG_COMPACTION || defined CONFIG_CMA
> +        int                     contiguous;
>          /* Set to true when the PG_migrate_skip bits should be cleared */
>          bool                    compact_blockskip_flush;
>  #endif
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 67b8d90..cb5c7a2 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -88,7 +88,7 @@ static inline bool migrate_async_suitable(int migratetype)
>   * the first and last page of a pageblock and avoid checking each individual
>   * page in a pageblock.
>   */
> -static struct page *pageblock_pfn_to_page(unsigned long start_pfn,
> +static struct page *__pageblock_pfn_to_page(unsigned long start_pfn,
>                          unsigned long end_pfn, struct zone *zone)
>  {
>          struct page *start_page;
> @@ -114,6 +114,56 @@ static struct page *pageblock_pfn_to_page(unsigned long start_pfn,
>          return start_page;
>  }
>  
> +static inline struct page *pageblock_pfn_to_page(unsigned long start_pfn,
> +                                       unsigned long end_pfn, struct zone *zone)
> +{
> +        if (zone->contiguous == 1)
> +                return pfn_to_page(start_pfn);
> +
> +        return __pageblock_pfn_to_page(start_pfn, end_pfn, zone);
> +}
> +
> +static void check_zone_contiguous(struct zone *zone)
> +{
> +        unsigned long block_start_pfn = zone->zone_start_pfn;
> +        unsigned long block_end_pfn;
> +        unsigned long pfn;
> +
> +        /* Already checked */
> +        if (zone->contiguous)
> +                return;
> +
> +        printk("%s: %s\n", __func__, zone->name);
> +        block_end_pfn = ALIGN(block_start_pfn + 1, pageblock_nr_pages);
> +        for (; block_start_pfn < zone_end_pfn(zone);
> +                        block_start_pfn = block_end_pfn,
> +                        block_end_pfn += pageblock_nr_pages) {
> +
> +                block_end_pfn = min(block_end_pfn, zone_end_pfn(zone));
> +
> +                if (!__pageblock_pfn_to_page(block_start_pfn,
> +                                block_end_pfn, zone)) {
> +                        /* We have hole */
> +                        zone->contiguous = -1;
> +                        printk("%s: %s: uncontiguous\n", __func__, zone->name);
> +                        return;
> +                }
> +
> +                /* Check validity of pfn within pageblock */
> +                for (pfn = block_start_pfn; pfn < block_end_pfn; pfn++) {
> +                        if 

Re: [RFC 0/3] reduce latency of direct async compaction

2015-12-07 Thread Joonsoo Kim
On Mon, Dec 07, 2015 at 04:59:56PM +0800, Aaron Lu wrote:
> On Mon, Dec 07, 2015 at 04:35:24PM +0900, Joonsoo Kim wrote:
> > It looks like overhead still remains. I guess that the migration scanner
> > would call pageblock_pfn_to_page() over a more extended range, so
> > overhead still remains.
> > 
> > I have an idea to solve this problem. Aaron, could you test the following patch
> > on top of base? It tries to skip calling pageblock_pfn_to_page()
> 
> It doesn't apply on top of 25364a9e54fb8296837061bf684b76d20eec01fb
> cleanly, so I made some changes to make it apply and the result is:
> https://github.com/aaronlu/linux/commit/cb8d05829190b806ad3948ff9b9e08c8ba1daf63

Yes, that's okay. I made it on my working branch, but it will not cause
any problem other than applying.

> 
> A problem occurred right after the test started:
> [   58.080962] BUG: unable to handle kernel paging request at ea008218
> [   58.089124] IP: [] compaction_alloc+0xf9/0x270
> [   58.096109] PGD 107ffd6067 PUD 207f7d5067 PMD 0
> [   58.101569] Oops:  [#1] SMP 

I made a mistake. Please test the following patch. It is also made
on my working branch, so you need to resolve a conflict, but it would be
trivial.

I inserted some logs to check whether the zone is contiguous or not.
Please check that the normal zone is set to contiguous after testing.
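
[Editor's note: the patch below caches the walk's verdict in
zone->contiguous as a tri-state: 0 = not yet checked, 1 = contiguous,
-1 = has a hole. A simplified, self-contained model of that logic with
invented data; the real code walks pageblock boundaries and uses
pfn_valid_within(), as shown in the patch:]

#include <stdio.h>

#define NPFNS 16

/* Invented zone map: pfn 9 has no backing page (a "hole"). */
static const int pfn_valid[NPFNS] = {
        1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,
};

static int contiguous;  /* 0 = unchecked, 1 = contiguous, -1 = hole */

static void check_contiguous(void)
{
        if (contiguous)         /* already checked: walk only once */
                return;

        for (int pfn = 0; pfn < NPFNS; pfn++) {
                if (!pfn_valid[pfn]) {
                        contiguous = -1;        /* hole: keep slow path */
                        return;
                }
        }
        contiguous = 1;         /* no hole: fast path is safe */
}

int main(void)
{
        check_contiguous();
        printf("contiguous=%d\n", contiguous);  /* prints contiguous=-1 */
        return 0;
}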

Thanks.

-->8--
From 4a1a08d8ab3fb165b87ad2ec0a2000ff6892330f Mon Sep 17 00:00:00 2001
From: Joonsoo Kim 
Date: Mon, 7 Dec 2015 14:51:42 +0900
Subject: [PATCH] mm/compaction: Optimize pageblock_pfn_to_page() for
 contiguous zone

Signed-off-by: Joonsoo Kim 
---
 include/linux/mmzone.h |  1 +
 mm/compaction.c       | 54 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 54 insertions(+), 1 deletion(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index e23a9e7..573f9a9 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -521,6 +521,7 @@ struct zone {
 #endif
 
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
+        int                     contiguous;
         /* Set to true when the PG_migrate_skip bits should be cleared */
         bool                    compact_blockskip_flush;
 #endif
diff --git a/mm/compaction.c b/mm/compaction.c
index 67b8d90..cb5c7a2 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -88,7 +88,7 @@ static inline bool migrate_async_suitable(int migratetype)
  * the first and last page of a pageblock and avoid checking each individual
  * page in a pageblock.
  */
-static struct page *pageblock_pfn_to_page(unsigned long start_pfn,
+static struct page *__pageblock_pfn_to_page(unsigned long start_pfn,
                         unsigned long end_pfn, struct zone *zone)
 {
         struct page *start_page;
@@ -114,6 +114,56 @@ static struct page *pageblock_pfn_to_page(unsigned long start_pfn,
         return start_page;
 }
 
+static inline struct page *pageblock_pfn_to_page(unsigned long start_pfn,
+                                       unsigned long end_pfn, struct zone *zone)
+{
+        if (zone->contiguous == 1)
+                return pfn_to_page(start_pfn);
+
+        return __pageblock_pfn_to_page(start_pfn, end_pfn, zone);
+}
+
+static void check_zone_contiguous(struct zone *zone)
+{
+        unsigned long block_start_pfn = zone->zone_start_pfn;
+        unsigned long block_end_pfn;
+        unsigned long pfn;
+
+        /* Already checked */
+        if (zone->contiguous)
+                return;
+
+        printk("%s: %s\n", __func__, zone->name);
+        block_end_pfn = ALIGN(block_start_pfn + 1, pageblock_nr_pages);
+        for (; block_start_pfn < zone_end_pfn(zone);
+                        block_start_pfn = block_end_pfn,
+                        block_end_pfn += pageblock_nr_pages) {
+
+                block_end_pfn = min(block_end_pfn, zone_end_pfn(zone));
+
+                if (!__pageblock_pfn_to_page(block_start_pfn,
+                                block_end_pfn, zone)) {
+                        /* We have hole */
+                        zone->contiguous = -1;
+                        printk("%s: %s: uncontiguous\n", __func__, zone->name);
+                        return;
+                }
+
+                /* Check validity of pfn within pageblock */
+                for (pfn = block_start_pfn; pfn < block_end_pfn; pfn++) {
+                        if (!pfn_valid_within(pfn)) {
+                                zone->contiguous = -1;
+                                printk("%s: %s: uncontiguous\n", __func__, zone->name);
+                                return;
+                        }
+                }
+        }
+
+        /* We don't have hole */
+        zone->contiguous = 1;
+        printk("%s: %s: contiguous\n", __func__, zone->name);
+}
+
 #ifdef CONFIG_COMPACTION
 
 /* Do not skip compaction more than 64 times */
@@ -1353,6 +1403,8 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
                 ;
         }
 
+        check_zone_contiguous(zone);
+
         /*
 * Clear pageblock skip if 

Re: [RFC 0/3] reduce latency of direct async compaction

2015-12-07 Thread Joonsoo Kim
On Tue, Dec 08, 2015 at 01:14:39PM +0800, Aaron Lu wrote:
> On Tue, Dec 08, 2015 at 09:41:18AM +0900, Joonsoo Kim wrote:
> > On Mon, Dec 07, 2015 at 04:59:56PM +0800, Aaron Lu wrote:
> > > On Mon, Dec 07, 2015 at 04:35:24PM +0900, Joonsoo Kim wrote:
> > > > It looks like overhead still remain. I guess that migration scanner
> > > > would call pageblock_pfn_to_page() for more extended range so
> > > > overhead still remain.
> > > > 
> > > > I have an idea to solve his problem. Aaron, could you test following 
> > > > patch
> > > > on top of base? It tries to skip calling pageblock_pfn_to_page()
> > > 
> > > It doesn't apply on top of 25364a9e54fb8296837061bf684b76d20eec01fb
> > > cleanly, so I made some changes to make it apply and the result is:
> > > https://github.com/aaronlu/linux/commit/cb8d05829190b806ad3948ff9b9e08c8ba1daf63
> > 
> > Yes, that's okay. I made it on my working branch but it will not result in
> > any problem except applying.
> > 
> > > 
> > > There is a problem occured right after the test starts:
> > > [   58.080962] BUG: unable to handle kernel paging request at 
> > > ea008218
> > > [   58.089124] IP: [] compaction_alloc+0xf9/0x270
> > > [   58.096109] PGD 107ffd6067 PUD 207f7d5067 PMD 0
> > > [   58.101569] Oops:  [#1] SMP 
> > 
> > I did some mistake. Please test following patch. It is also made
> > on my working branch so you need to resolve conflict but it would be
> > trivial.
> > 
> > I inserted some logs to check whether zone is contiguous or not.
> > Please check that normal zone is set to contiguous after testing.
> 
> Yes it is contiguous, but unfortunately, the problem remains:
> [   56.536930] check_zone_contiguous: Normal
> [   56.543467] check_zone_contiguous: Normal: contiguous
> [   56.549640] BUG: unable to handle kernel paging request at ea008218
> [   56.557717] IP: [] compaction_alloc+0xf9/0x270
> [   56.564719] PGD 107ffd6067 PUD 207f7d5067 PMD 0
> 

Maybe, I find the reason. cc->free_pfn can be initialized to invalid pfn
that isn't checked so optimized pageblock_pfn_to_page() causes BUG().

I add work-around for this problem at isolate_freepages(). Please test
following one.

Thanks.

-->8---
>From 7e954a68fb555a868acc5860627a1ad8dadbe3bf Mon Sep 17 00:00:00 2001
From: Joonsoo Kim 
Date: Mon, 7 Dec 2015 14:51:42 +0900
Subject: [PATCH] mm/compaction: Optimize pageblock_pfn_to_page() for
 contiguous zone

Signed-off-by: Joonsoo Kim 
---
 include/linux/mmzone.h |  1 +
 mm/compaction.c| 60 +-
 2 files changed, 60 insertions(+), 1 deletion(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index e23a9e7..573f9a9 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -521,6 +521,7 @@ struct zone {
 #endif
 
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
+   int contiguous;
/* Set to true when the PG_migrate_skip bits should be cleared */
boolcompact_blockskip_flush;
 #endif
diff --git a/mm/compaction.c b/mm/compaction.c
index de3e1e7..ff5fb04 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -88,7 +88,7 @@ static inline bool migrate_async_suitable(int migratetype)
  * the first and last page of a pageblock and avoid checking each individual
  * page in a pageblock.
  */
-static struct page *pageblock_pfn_to_page(unsigned long start_pfn,
+static struct page *__pageblock_pfn_to_page(unsigned long start_pfn,
unsigned long end_pfn, struct zone *zone)
 {
struct page *start_page;
@@ -114,6 +114,56 @@ static struct page *pageblock_pfn_to_page(unsigned long 
start_pfn,
return start_page;
 }
 
+static inline struct page *pageblock_pfn_to_page(unsigned long start_pfn,
+   unsigned long end_pfn, struct zone *zone)
+{
+   if (zone->contiguous == 1)
+   return pfn_to_page(start_pfn);
+
+   return __pageblock_pfn_to_page(start_pfn, end_pfn, zone);
+}
+
+static void check_zone_contiguous(struct zone *zone)
+{
+   unsigned long block_start_pfn = zone->zone_start_pfn;
+   unsigned long block_end_pfn;
+   unsigned long pfn;
+
+   /* Already checked */
+   if (zone->contiguous)
+   return;
+
+   printk("%s: %s\n", __func__, zone->name);
+   block_end_pfn = ALIGN(block_start_pfn + 1, pageblock_nr_pages);
+   for (; block_start_pfn < zone_end_pfn(zone);
+   block_start_pfn = block_end_pfn,
+   block_end_pfn += pageblock_nr_pages) {
+
+   block_end_pfn = min(block_end_pfn, zone_end_pfn(zone));
+
+   if (!__pageblock_pfn_to_page(block_start_pfn,
+   block_end_pfn, zone)) {
+   /* We have hole */
+   zone->contiguous = -1;
+   printk("%s: %s: uncontiguous\n", __func__, 

Re: [RFC 0/3] reduce latency of direct async compaction

2015-12-07 Thread Joonsoo Kim
On Mon, Dec 07, 2015 at 04:59:56PM +0800, Aaron Lu wrote:
> On Mon, Dec 07, 2015 at 04:35:24PM +0900, Joonsoo Kim wrote:
> > It looks like overhead still remain. I guess that migration scanner
> > would call pageblock_pfn_to_page() for more extended range so
> > overhead still remain.
> > 
> > I have an idea to solve his problem. Aaron, could you test following patch
> > on top of base? It tries to skip calling pageblock_pfn_to_page()
> 
> It doesn't apply on top of 25364a9e54fb8296837061bf684b76d20eec01fb
> cleanly, so I made some changes to make it apply and the result is:
> https://github.com/aaronlu/linux/commit/cb8d05829190b806ad3948ff9b9e08c8ba1daf63

Yes, that's okay. I made it on my working branch but it will not result in
any problem except applying.

> 
> There is a problem occured right after the test starts:
> [   58.080962] BUG: unable to handle kernel paging request at ea008218
> [   58.089124] IP: [] compaction_alloc+0xf9/0x270
> [   58.096109] PGD 107ffd6067 PUD 207f7d5067 PMD 0
> [   58.101569] Oops:  [#1] SMP 

I did some mistake. Please test following patch. It is also made
on my working branch so you need to resolve conflict but it would be
trivial.

I inserted some logs to check whether zone is contiguous or not.
Please check that normal zone is set to contiguous after testing.

Thanks.

-->8--
>From 4a1a08d8ab3fb165b87ad2ec0a2000ff6892330f Mon Sep 17 00:00:00 2001
From: Joonsoo Kim 
Date: Mon, 7 Dec 2015 14:51:42 +0900
Subject: [PATCH] mm/compaction: Optimize pageblock_pfn_to_page() for
 contiguous zone

Signed-off-by: Joonsoo Kim 
---
 include/linux/mmzone.h |  1 +
 mm/compaction.c| 54 +-
 2 files changed, 54 insertions(+), 1 deletion(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index e23a9e7..573f9a9 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -521,6 +521,7 @@ struct zone {
 #endif
 
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
+   int contiguous;
/* Set to true when the PG_migrate_skip bits should be cleared */
boolcompact_blockskip_flush;
 #endif
diff --git a/mm/compaction.c b/mm/compaction.c
index 67b8d90..cb5c7a2 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -88,7 +88,7 @@ static inline bool migrate_async_suitable(int migratetype)
  * the first and last page of a pageblock and avoid checking each individual
  * page in a pageblock.
  */
-static struct page *pageblock_pfn_to_page(unsigned long start_pfn,
+static struct page *__pageblock_pfn_to_page(unsigned long start_pfn,
unsigned long end_pfn, struct zone *zone)
 {
struct page *start_page;
@@ -114,6 +114,56 @@ static struct page *pageblock_pfn_to_page(unsigned long 
start_pfn,
return start_page;
 }
 
+static inline struct page *pageblock_pfn_to_page(unsigned long start_pfn,
+   unsigned long end_pfn, struct zone *zone)
+{
+   if (zone->contiguous == 1)
+   return pfn_to_page(start_pfn);
+
+   return __pageblock_pfn_to_page(start_pfn, end_pfn, zone);
+}
+
+static void check_zone_contiguous(struct zone *zone)
+{
+   unsigned long block_start_pfn = zone->zone_start_pfn;
+   unsigned long block_end_pfn;
+   unsigned long pfn;
+
+   /* Already checked */
+   if (zone->contiguous)
+   return;
+
+   printk("%s: %s\n", __func__, zone->name);
+   block_end_pfn = ALIGN(block_start_pfn + 1, pageblock_nr_pages);
+   for (; block_start_pfn < zone_end_pfn(zone);
+   block_start_pfn = block_end_pfn,
+   block_end_pfn += pageblock_nr_pages) {
+
+   block_end_pfn = min(block_end_pfn, zone_end_pfn(zone));
+
+   if (!__pageblock_pfn_to_page(block_start_pfn,
+   block_end_pfn, zone)) {
+   /* We have hole */
+   zone->contiguous = -1;
+   printk("%s: %s: uncontiguous\n", __func__, zone->name);
+   return;
+   }
+
+   /* Check validity of pfn within pageblock */
+   for (pfn = block_start_pfn; pfn < block_end_pfn; pfn++) {
+   if (!pfn_valid_within(pfn)) {
+   zone->contiguous = -1;
+   printk("%s: %s: uncontiguous\n", __func__, zone->name);
+   return;
+   }
+   }
+   }
+
+   /* We don't have hole */
+   zone->contiguous = 1;
+   printk("%s: %s: contiguous\n", __func__, zone->name);
+}
+
 #ifdef CONFIG_COMPACTION
 
 /* Do not skip compaction more than 64 times */
@@ -1353,6 +1403,8 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
;
}
 
+   check_zone_contiguous(zone);
+
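
For clarity, zone->contiguous acts as a tri-state cache here: 0 means not
yet checked, 1 means the zone was verified contiguous, and -1 means a hole
was found. A minimal userspace model of just that caching behaviour (the
zone walk is mocked; nothing below is kernel code):

#include <stdio.h>

/* 0 = not yet checked, 1 = contiguous, -1 = hole found */
struct zone { int contiguous; };

static int zone_has_hole(void)
{
    return 0;                   /* stand-in for the pageblock walk */
}

static void check_zone_contiguous(struct zone *zone)
{
    if (zone->contiguous)       /* already decided; skip the walk */
        return;
    zone->contiguous = zone_has_hole() ? -1 : 1;
}

int main(void)
{
    struct zone z = { 0 };

    check_zone_contiguous(&z);  /* first call pays for the walk */
    check_zone_contiguous(&z);  /* later calls return immediately */
    printf("contiguous = %d\n", z.contiguous);
    return 0;
}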
 

Re: [RFC 0/3] reduce latency of direct async compaction

2015-12-07 Thread Aaron Lu
On Tue, Dec 08, 2015 at 09:41:18AM +0900, Joonsoo Kim wrote:
> On Mon, Dec 07, 2015 at 04:59:56PM +0800, Aaron Lu wrote:
> > On Mon, Dec 07, 2015 at 04:35:24PM +0900, Joonsoo Kim wrote:
> > > It looks like the overhead still remains. I guess the migration scanner
> > > calls pageblock_pfn_to_page() for a more extended range, so the overhead
> > > remains.
> > > 
> > > I have an idea to solve this problem. Aaron, could you test the following
> > > patch on top of base? It tries to skip calling pageblock_pfn_to_page()
> > 
> > It doesn't apply on top of 25364a9e54fb8296837061bf684b76d20eec01fb
> > cleanly, so I made some changes to make it apply and the result is:
> > https://github.com/aaronlu/linux/commit/cb8d05829190b806ad3948ff9b9e08c8ba1daf63
> 
> Yes, that's okay. I made it on my working branch, so it won't apply
> cleanly, but that causes no problem beyond the conflict.
> 
> > 
> > A problem occurred right after the test started:
> > [   58.080962] BUG: unable to handle kernel paging request at ea008218
> > [   58.089124] IP: [] compaction_alloc+0xf9/0x270
> > [   58.096109] PGD 107ffd6067 PUD 207f7d5067 PMD 0
> > [   58.101569] Oops:  [#1] SMP 
> 
> I made a mistake. Please test the following patch. It is also made
> on my working branch, so you will need to resolve a conflict, but it
> should be trivial.
> 
> I inserted some logs to check whether the zone is contiguous or not.
> Please check that the Normal zone is set to contiguous after testing.

Yes it is contiguous, but unfortunately, the problem remains:
[   56.536930] check_zone_contiguous: Normal
[   56.543467] check_zone_contiguous: Normal: contiguous
[   56.549640] BUG: unable to handle kernel paging request at ea008218
[   56.557717] IP: [] compaction_alloc+0xf9/0x270
[   56.564719] PGD 107ffd6067 PUD 207f7d5067 PMD 0

Full dmesg attached.

Thanks,
Aaron

> 
> Thanks.
> 
> -->8--
> From 4a1a08d8ab3fb165b87ad2ec0a2000ff6892330f Mon Sep 17 00:00:00 2001
> From: Joonsoo Kim 
> Date: Mon, 7 Dec 2015 14:51:42 +0900
> Subject: [PATCH] mm/compaction: Optimize pageblock_pfn_to_page() for
>  contiguous zone
> 
> Signed-off-by: Joonsoo Kim 
> ---
>  include/linux/mmzone.h |  1 +
>  mm/compaction.c        | 54 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 54 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index e23a9e7..573f9a9 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -521,6 +521,7 @@ struct zone {
>  #endif
>  
>  #if defined CONFIG_COMPACTION || defined CONFIG_CMA
> +   int contiguous;
> /* Set to true when the PG_migrate_skip bits should be cleared */
> boolcompact_blockskip_flush;
>  #endif
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 67b8d90..cb5c7a2 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -88,7 +88,7 @@ static inline bool migrate_async_suitable(int migratetype)
>   * the first and last page of a pageblock and avoid checking each individual
>   * page in a pageblock.
>   */
> -static struct page *pageblock_pfn_to_page(unsigned long start_pfn,
> +static struct page *__pageblock_pfn_to_page(unsigned long start_pfn,
> unsigned long end_pfn, struct zone *zone)
>  {
> struct page *start_page;
> @@ -114,6 +114,56 @@ static struct page *pageblock_pfn_to_page(unsigned long start_pfn,
> return start_page;
>  }
>  
> +static inline struct page *pageblock_pfn_to_page(unsigned long start_pfn,
> +   unsigned long end_pfn, struct zone *zone)
> +{
> +   if (zone->contiguous == 1)
> +   return pfn_to_page(start_pfn);
> +
> +   return __pageblock_pfn_to_page(start_pfn, end_pfn, zone);
> +}
> +
> +static void check_zone_contiguous(struct zone *zone)
> +{
> +   unsigned long block_start_pfn = zone->zone_start_pfn;
> +   unsigned long block_end_pfn;
> +   unsigned long pfn;
> +
> +   /* Already checked */
> +   if (zone->contiguous)
> +   return;
> +
> +   printk("%s: %s\n", __func__, zone->name);
> +   block_end_pfn = ALIGN(block_start_pfn + 1, pageblock_nr_pages);
> +   for (; block_start_pfn < zone_end_pfn(zone);
> +   block_start_pfn = block_end_pfn,
> +   block_end_pfn += pageblock_nr_pages) {
> +
> +   block_end_pfn = min(block_end_pfn, zone_end_pfn(zone));
> +
> +   if (!__pageblock_pfn_to_page(block_start_pfn,
> +   block_end_pfn, zone)) {
> +   /* We have hole */
> +   zone->contiguous = -1;
> +   printk("%s: %s: uncontiguous\n", __func__, zone->name);
> +   return;
> +   }
> +
> +   /* Check validity of pfn within pageblock */
> +   for (pfn = block_start_pfn; pfn < block_end_pfn; pfn++) {

Re: [RFC 0/3] reduce latency of direct async compaction

2015-12-06 Thread Joonsoo Kim
On Fri, Dec 04, 2015 at 01:34:09PM +0100, Vlastimil Babka wrote:
> On 12/03/2015 12:52 PM, Aaron Lu wrote:
> >On Thu, Dec 03, 2015 at 07:35:08PM +0800, Aaron Lu wrote:
> >>On Thu, Dec 03, 2015 at 10:38:50AM +0100, Vlastimil Babka wrote:
> >>>On 12/03/2015 10:25 AM, Aaron Lu wrote:
> On Thu, Dec 03, 2015 at 09:10:44AM +0100, Vlastimil Babka wrote:
> >>
> >>My bad, I uploaded the wrong data :-/
> >>I uploaded again:
> >>https://drive.google.com/file/d/0B49uX3igf4K4UFI4TEQ3THYta0E
> >>
> >>And I just ran the base tree with trace-cmd and found that its
> >>performance drops significantly (from 1000 MB/s to 6xx MB/s). Does
> >>trace-cmd impact performance that much?
> 
> Yeah it has some overhead depending on how many events it has to
> process. Your workload is quite sensitive to that.
> 
> >>Any suggestions on how to run
> >>the test regarding trace-cmd? I.e., should I always run usemem under
> >>trace-cmd or only when necessary?
> 
> I'd run it with tracing only when the goal is to collect traces, but
> not for any performance comparisons. Also it's not useful to collect
> perf data while also tracing.
> 
> >I just ran the test with the base tree and with this patch series
> >applied (head); I didn't use trace-cmd this time.
> >
> >The throughput for base tree is 963MB/s while the head is 815MB/s, I
> >have attached pagetypeinfo/proc-vmstat/perf-profile for them.
> 
> The compact stats improvements look fine, perhaps better than in my tests:
> 
> base: compact_migrate_scanned 3476360
> head: compact_migrate_scanned 1020827
> 
> - that's the eager skipping of patch 2
> 
> base: compact_free_scanned 5924928
> head: compact_free_scanned 0
>   compact_free_direct 918813
>   compact_free_direct_miss 500308
> 
> As your workload does exclusively async direct compaction through
> THP faults, the traditional free scanner isn't used at all. Direct
> allocations should be much cheaper, although the "miss" ratio (the
> allocations that were from the same pageblock as the one we are
> compacting) is quite high. I should probably look into making
> migration release pages to the tails of the freelists - could be
> that it's grabbing the very pages that were just freed in the
> previous COMPACT_CLUSTER_MAX cycle (modulo pcplist buffering).
> 
> I however find it strange that your original stats (4.3?) differ
> from the base so much:
> 
> compact_migrate_scanned 1982396
> compact_free_scanned 40576943
> 
> That was an order of magnitude more free scanning on 4.3, and half the
> migrate scanning. But your throughput figures in the other mail
> suggested a regression from 4.3 to 4.4, which would be the opposite
> of what the stats say. And anyway, compaction code didn't change
> between 4.3 and 4.4 except changes to tracepoint format...
> 
> moving on...
> base:
> compact_isolated 731304
> compact_stall 10561
> compact_fail 9459
> compact_success 1102
> 
> head:
> compact_isolated 921087
> compact_stall 14451
> compact_fail 12550
> compact_success 1901
> 
> More success in both isolation and compaction results.
> 
> base:
> thp_fault_alloc 45337
> thp_fault_fallback 2349
> 
> head:
> thp_fault_alloc 45564
> thp_fault_fallback 2120
> 
> Somehow the extra compact success didn't fully translate to thp
> alloc success... But given how many of the alloc's didn't even
> involve a compact_stall (two thirds of them), that interpretation
> could also be easily misleading. So, hard to say.
> 
> Looking at the perf profiles...
> base:
> 54.55%  54.55%  :1550  [kernel.kallsyms]  [k]  pageblock_pfn_to_page
> 
> head:
> 40.13%  40.13%  :1551  [kernel.kallsyms]  [k]  pageblock_pfn_to_page
> 
> Since the freepage allocation doesn't hit this code anymore, it
> shows that the bulk was actually from the migration scanner,
> although the perf callgraph and vmstats suggested otherwise.

It looks like the overhead still remains. I guess the migration scanner
calls pageblock_pfn_to_page() for a more extended range, so the overhead
remains.

I have an idea to solve this problem. Aaron, could you test the following
patch on top of base? It tries to skip calling pageblock_pfn_to_page()
if we check that the zone is contiguous at the initialization stage.

Thanks.

>8
From 9c4fbf8f8ed37eb88a04a97908e76ba2437404a2 Mon Sep 17 00:00:00 2001
From: Joonsoo Kim 
Date: Mon, 7 Dec 2015 14:51:42 +0900
Subject: [PATCH] mm/compaction: Optimize pageblock_pfn_to_page() for
 contiguous zone

Signed-off-by: Joonsoo Kim 
---
 include/linux/mmzone.h |  1 +
 mm/compaction.c        | 35 ++++++++++++++++++++++++++++++++++-
 2 files changed, 35 insertions(+), 1 deletion(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index e23a9e7..573f9a9 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -521,6 +521,7 @@ struct zone {
 #endif
 
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
+   int contiguous;
/* Set to true when the PG_migrate_skip bits should be cleared */
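
The archive truncates the patch at this point; the complete copy is quoted
elsewhere in this thread. The heart of the change is the wrapper below,
reproduced from that copy as a sketch of the idea rather than the full diff:

/* Fast path: once the zone is known to be contiguous, every pfn in
 * [start_pfn, end_pfn) has a valid struct page, so the first/last-page
 * checks in __pageblock_pfn_to_page() can be skipped entirely. */
static inline struct page *pageblock_pfn_to_page(unsigned long start_pfn,
        unsigned long end_pfn, struct zone *zone)
{
    if (zone->contiguous == 1)
        return pfn_to_page(start_pfn);

    return __pageblock_pfn_to_page(start_pfn, end_pfn, zone);
}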
 

Re: [RFC 0/3] reduce latency of direct async compaction

2015-12-06 Thread Aaron Lu
On 12/04/2015 08:38 PM, Vlastimil Babka wrote:
> On 12/04/2015 07:25 AM, Aaron Lu wrote:
>> On Thu, Dec 03, 2015 at 09:10:44AM +0100, Vlastimil Babka wrote:
>>> Aaron, could you try this on your testcase?
>>
>> A one-time result isn't stable enough, so I did 9 runs for each commit;
>> here is the result:
>>
>> base: 25364a9e54fb8296837061bf684b76d20eec01fb
>> head: 7433b1009ff5a02e1e9f3444802daba2cf385d27
>> (head = base + this_patch_series)
>>
>> The always-always case (transparent_hugepage set to always and defrag set
>> to always):
>>
>> Result for base:
>> $ cat {0..8}/swap
>> cmdline: /lkp/aaron/src/bin/usemem 10622592
>> 10622592 transferred in 103 seconds, throughput: 925 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 9559680
>> 9559680 transferred in 92 seconds, throughput: 1036 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 6171264
>> 6171264 transferred in 92 seconds, throughput: 1036 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 15663744
>> 15663744 transferred in 150 seconds, throughput: 635 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 12966528
>> 12966528 transferred in 87 seconds, throughput: 1096 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 5784192
>> 5784192 transferred in 131 seconds, throughput: 727 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 13731456
>> 13731456 transferred in 97 seconds, throughput: 983 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 16440960
>> 16440960 transferred in 109 seconds, throughput: 874 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 8813184
>> 8813184 transferred in 122 seconds, throughput: 781 MB/s
>> Max: 1096 MB/s
>> Min: 635 MB/s
>> Avg: 899 MB/s
>>
>> Result for head:
>> $ cat {0..8}/swap
>> cmdline: /lkp/aaron/src/bin/usemem 13163136
>> 13163136 transferred in 105 seconds, throughput: 908 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 8524416
>> 8524416 transferred in 78 seconds, throughput: 1222 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 3646080
>> 3646080 transferred in 108 seconds, throughput: 882 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 8936064
>> 8936064 transferred in 114 seconds, throughput: 836 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 12204672
>> 12204672 transferred in 73 seconds, throughput: 1306 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 8140416
>> 8140416 transferred in 146 seconds, throughput: 653 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 12941952
>> 12941952 transferred in 78 seconds, throughput: 1222 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 6917760
>> 6917760 transferred in 109 seconds, throughput: 874 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 11405952
>> 11405952 transferred in 96 seconds, throughput: 993 MB/s
>> Max: 1306 MB/s
>> Min: 653 MB/s
>> Avg: 988 MB/s
> 
> Ok that looks better than the first results :) The series either helped, 
> or it's just noise. But hopefully not worse.

Well, it looks to be the case :-)

> 
>> Result for v4.3 as a reference:
>> $ cat {0..8}/swap
>> cmdline: /lkp/aaron/src/bin/usemem 12459648
>> 12459648 transferred in 96 seconds, throughput: 993 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 7375488
>> 7375488 transferred in 96 seconds, throughput: 993 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 9028224
>> 9028224 transferred in 107 seconds, throughput: 891 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 10137216
>> 10137216 transferred in 91 seconds, throughput: 1047 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 13835904
>> 13835904 transferred in 80 seconds, throughput: 1192 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 10143360
>> 10143360 transferred in 96 seconds, throughput: 993 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 100020593664
>> 100020593664 transferred in 101 seconds, throughput: 944 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 15805056
>> 15805056 transferred in 87 seconds, throughput: 1096 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 18360960
>> 18360960 transferred in 74 seconds, throughput: 1288 MB/s
>> Max: 1288 MB/s
>> Min: 891 MB/s
>> Avg: 1048 MB/s
> 
> Hard to say if there's actual regression from 4.3 to 4.4, it's too 
> noisy. More iterations could help, but then the eventual bisection would 
> need them too.

What puzzles me most is that once compaction is involved, the results
become nondeterministic, i.e. the result can be as high as 1xxx MB/s or
as low as 6xx MB/s. The always-never case is much better in this regard.

Thanks,
Aaron

> 
>> The always-never case:
>>
>> Result for head:
>> $ cat {0..8}/swap
>> cmdline: /lkp/aaron/src/bin/usemem 13940352
>> 13940352 transferred in 71 seconds, throughput: 1343 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 17411712
>> 17411712 transferred in 62 seconds, throughput: 1538 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 11875968
>> 11875968 transferred in 64 seconds, throughput: 1490 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 13912704
>> 

Re: [RFC 0/3] reduce latency of direct async compaction

2015-12-04 Thread Vlastimil Babka

On 12/04/2015 07:25 AM, Aaron Lu wrote:

On Thu, Dec 03, 2015 at 09:10:44AM +0100, Vlastimil Babka wrote:

Aaron, could you try this on your testcase?


A one-time result isn't stable enough, so I did 9 runs for each commit;
here is the result:

base: 25364a9e54fb8296837061bf684b76d20eec01fb
head: 7433b1009ff5a02e1e9f3444802daba2cf385d27
(head = base + this_patch_series)

The always-always case (transparent_hugepage set to always and defrag set
to always):

Result for base:
$ cat {0..8}/swap
cmdline: /lkp/aaron/src/bin/usemem 10622592
10622592 transferred in 103 seconds, throughput: 925 MB/s
cmdline: /lkp/aaron/src/bin/usemem 9559680
9559680 transferred in 92 seconds, throughput: 1036 MB/s
cmdline: /lkp/aaron/src/bin/usemem 6171264
6171264 transferred in 92 seconds, throughput: 1036 MB/s
cmdline: /lkp/aaron/src/bin/usemem 15663744
15663744 transferred in 150 seconds, throughput: 635 MB/s
cmdline: /lkp/aaron/src/bin/usemem 12966528
12966528 transferred in 87 seconds, throughput: 1096 MB/s
cmdline: /lkp/aaron/src/bin/usemem 5784192
5784192 transferred in 131 seconds, throughput: 727 MB/s
cmdline: /lkp/aaron/src/bin/usemem 13731456
13731456 transferred in 97 seconds, throughput: 983 MB/s
cmdline: /lkp/aaron/src/bin/usemem 16440960
16440960 transferred in 109 seconds, throughput: 874 MB/s
cmdline: /lkp/aaron/src/bin/usemem 8813184
8813184 transferred in 122 seconds, throughput: 781 MB/s
Max: 1096 MB/s
Min: 635 MB/s
Avg: 899 MB/s

Result for head:
$ cat {0..8}/swap
cmdline: /lkp/aaron/src/bin/usemem 13163136
13163136 transferred in 105 seconds, throughput: 908 MB/s
cmdline: /lkp/aaron/src/bin/usemem 8524416
8524416 transferred in 78 seconds, throughput: 1222 MB/s
cmdline: /lkp/aaron/src/bin/usemem 3646080
3646080 transferred in 108 seconds, throughput: 882 MB/s
cmdline: /lkp/aaron/src/bin/usemem 8936064
8936064 transferred in 114 seconds, throughput: 836 MB/s
cmdline: /lkp/aaron/src/bin/usemem 12204672
12204672 transferred in 73 seconds, throughput: 1306 MB/s
cmdline: /lkp/aaron/src/bin/usemem 8140416
8140416 transferred in 146 seconds, throughput: 653 MB/s
cmdline: /lkp/aaron/src/bin/usemem 12941952
12941952 transferred in 78 seconds, throughput: 1222 MB/s
cmdline: /lkp/aaron/src/bin/usemem 6917760
6917760 transferred in 109 seconds, throughput: 874 MB/s
cmdline: /lkp/aaron/src/bin/usemem 11405952
11405952 transferred in 96 seconds, throughput: 993 MB/s
Max: 1306 MB/s
Min: 653 MB/s
Avg: 988 MB/s


Ok that looks better than the first results :) The series either helped, 
or it's just noise. But hopefully not worse.



Result for v4.3 as a reference:
$ cat {0..8}/swap
cmdline: /lkp/aaron/src/bin/usemem 12459648
12459648 transferred in 96 seconds, throughput: 993 MB/s
cmdline: /lkp/aaron/src/bin/usemem 7375488
7375488 transferred in 96 seconds, throughput: 993 MB/s
cmdline: /lkp/aaron/src/bin/usemem 9028224
9028224 transferred in 107 seconds, throughput: 891 MB/s
cmdline: /lkp/aaron/src/bin/usemem 10137216
10137216 transferred in 91 seconds, throughput: 1047 MB/s
cmdline: /lkp/aaron/src/bin/usemem 13835904
13835904 transferred in 80 seconds, throughput: 1192 MB/s
cmdline: /lkp/aaron/src/bin/usemem 10143360
10143360 transferred in 96 seconds, throughput: 993 MB/s
cmdline: /lkp/aaron/src/bin/usemem 100020593664
100020593664 transferred in 101 seconds, throughput: 944 MB/s
cmdline: /lkp/aaron/src/bin/usemem 15805056
15805056 transferred in 87 seconds, throughput: 1096 MB/s
cmdline: /lkp/aaron/src/bin/usemem 18360960
18360960 transferred in 74 seconds, throughput: 1288 MB/s
Max: 1288 MB/s
Min: 891 MB/s
Avg: 1048 MB/s


Hard to say if there's actual regression from 4.3 to 4.4, it's too 
noisy. More iterations could help, but then the eventual bisection would 
need them too.



The always-never case:

Result for head:
$ cat {0..8}/swap
cmdline: /lkp/aaron/src/bin/usemem 13940352
13940352 transferred in 71 seconds, throughput: 1343 MB/s
cmdline: /lkp/aaron/src/bin/usemem 17411712
17411712 transferred in 62 seconds, throughput: 1538 MB/s
cmdline: /lkp/aaron/src/bin/usemem 11875968
11875968 transferred in 64 seconds, throughput: 1490 MB/s
cmdline: /lkp/aaron/src/bin/usemem 13912704
13912704 transferred in 62 seconds, throughput: 1538 MB/s
cmdline: /lkp/aaron/src/bin/usemem 12238464
12238464 transferred in 66 seconds, throughput: 1444 MB/s
cmdline: /lkp/aaron/src/bin/usemem 13670016
13670016 transferred in 65 seconds, throughput: 1467 MB/s
cmdline: /lkp/aaron/src/bin/usemem 8364672
8364672 transferred in 68 seconds, throughput: 1402 MB/s
cmdline: /lkp/aaron/src/bin/usemem 15417984
15417984 transferred in 70 seconds, throughput: 1362 MB/s
cmdline: /lkp/aaron/src/bin/usemem 15304320
15304320 transferred in 64 seconds, throughput: 1490 MB/s
Max: 1538 MB/s

Re: [RFC 0/3] reduce latency of direct async compaction

2015-12-04 Thread Vlastimil Babka

On 12/03/2015 12:52 PM, Aaron Lu wrote:

On Thu, Dec 03, 2015 at 07:35:08PM +0800, Aaron Lu wrote:

On Thu, Dec 03, 2015 at 10:38:50AM +0100, Vlastimil Babka wrote:

On 12/03/2015 10:25 AM, Aaron Lu wrote:

On Thu, Dec 03, 2015 at 09:10:44AM +0100, Vlastimil Babka wrote:


My bad, I uploaded the wrong data :-/
I uploaded again:
https://drive.google.com/file/d/0B49uX3igf4K4UFI4TEQ3THYta0E

And I just ran the base tree with trace-cmd and found that its
performance drops significantly (from 1000 MB/s to 6xx MB/s). Does
trace-cmd impact performance that much?


Yeah it has some overhead depending on how many events it has to 
process. Your workload is quite sensitive to that.



Any suggestions on how to run
the test regarding trace-cmd? I.e., should I always run usemem under
trace-cmd or only when necessary?


I'd run it with tracing only when the goal is to collect traces, but not 
for any performance comparisons. Also it's not useful to collect perf 
data while also tracing.



I just ran the test with the base tree and with this patch series
applied (head); I didn't use trace-cmd this time.

The throughput for base tree is 963MB/s while the head is 815MB/s, I
have attached pagetypeinfo/proc-vmstat/perf-profile for them.


The compact stats improvements look fine, perhaps better than in my tests:

base: compact_migrate_scanned 3476360
head: compact_migrate_scanned 1020827

- that's the eager skipping of patch 2

base: compact_free_scanned 5924928
head: compact_free_scanned 0
  compact_free_direct 918813
  compact_free_direct_miss 500308

As your workload does exclusively async direct compaction through THP 
faults, the traditional free scanner isn't used at all. Direct 
allocations should be much cheaper, although the "miss" ratio (the 
allocations that were from the same pageblock as the one we are 
compacting) is quite high: 500308 of 918813, roughly 54%. I should 
probably look into making migration release pages to the tails of the 
freelists - could be that it's grabbing the very pages that were just 
freed in the previous COMPACT_CLUSTER_MAX cycle (modulo pcplist buffering).


I however find it strange that your original stats (4.3?) differ from 
the base so much:


compact_migrate_scanned 1982396
compact_free_scanned 40576943

That was an order of magnitude more free scanning on 4.3, and half the 
migrate scanning. But your throughput figures in the other mail suggested 
a regression from 4.3 to 4.4, which would be the opposite of what the 
stats say. And anyway, compaction code didn't change between 4.3 and 4.4 
except changes to tracepoint format...


moving on...
base:
compact_isolated 731304
compact_stall 10561
compact_fail 9459
compact_success 1102

head:
compact_isolated 921087
compact_stall 14451
compact_fail 12550
compact_success 1901

More success in both isolation and compaction results.

base:
thp_fault_alloc 45337
thp_fault_fallback 2349

head:
thp_fault_alloc 45564
thp_fault_fallback 2120

Somehow the extra compact success didn't fully translate to thp alloc 
success... But given how many of the allocations didn't even involve a 
compact_stall (two thirds of them), that interpretation could also be 
easily misleading. So, hard to say.


Looking at the perf profiles...
base:
54.55%  54.55%  :1550  [kernel.kallsyms]  [k]  pageblock_pfn_to_page


head:
40.13%  40.13%  :1551  [kernel.kallsyms]  [k]  pageblock_pfn_to_page


Since the freepage allocation doesn't hit this code anymore, it shows 
that the bulk was actually from the migration scanner, although the perf 
callgraph and vmstats suggested otherwise. However, vmstats count only 
when the scanner actually enters the pageblock, and there are numerous 
reasons why it wouldn't... For example the pageblock_skip bitmap. Could 
it make sense to look at the bitmap before doing the pfn_to_page 
translation?
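
A minimal userspace model of that idea, with both the bitmap and the
translation mocked (in the kernel the bit would be the pageblock's
PB_migrate_skip flag, and a pfn-indexed lookup helper would be new code):

#include <stdbool.h>
#include <stdio.h>

#define NR_PAGEBLOCKS 8

static bool skip_bitmap[NR_PAGEBLOCKS];  /* true = scanner should skip */
static int translations;                 /* counts "expensive" lookups */

static void pageblock_pfn_to_page_mock(int block)
{
    (void)block;
    translations++;                      /* stands in for pfn_to_page work */
}

int main(void)
{
    skip_bitmap[2] = skip_bitmap[5] = true;

    for (int block = 0; block < NR_PAGEBLOCKS; block++) {
        if (skip_bitmap[block])          /* cheap bit test, no struct page */
            continue;
        pageblock_pfn_to_page_mock(block);
    }
    printf("translations: %d, skipped: %d\n",
           translations, NR_PAGEBLOCKS - translations);
    return 0;
}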


I don't see much else in the profiles. I guess the remaining problem of 
compaction here is that deferring compaction doesn't trigger for async 
compaction, and this testcase doesn't hit sync compaction at all.



Re: [RFC 0/3] reduce latency of direct async compaction

2015-12-03 Thread Aaron Lu
On Thu, Dec 03, 2015 at 09:10:44AM +0100, Vlastimil Babka wrote:
> Aaron, could you try this on your testcase?

A one-time result isn't stable enough, so I did 9 runs for each commit;
here is the result:

base: 25364a9e54fb8296837061bf684b76d20eec01fb
head: 7433b1009ff5a02e1e9f3444802daba2cf385d27
(head = base + this_patch_series)

The always-always case (transparent_hugepage set to always and defrag set
to always):
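
(These two settings are the standard THP sysfs knobs; a minimal sketch for
putting a machine into this configuration, assuming root privileges:)

#include <stdio.h>

static int write_knob(const char *path, const char *val)
{
    FILE *f = fopen(path, "w");

    if (!f) {
        perror(path);
        return 1;
    }
    fputs(val, f);
    return fclose(f) ? 1 : 0;
}

int main(void)
{
    int err = 0;

    err |= write_knob("/sys/kernel/mm/transparent_hugepage/enabled", "always");
    err |= write_knob("/sys/kernel/mm/transparent_hugepage/defrag", "always");
    return err;
}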

Result for base:
$ cat {0..8}/swap
cmdline: /lkp/aaron/src/bin/usemem 10622592
10622592 transferred in 103 seconds, throughput: 925 MB/s
cmdline: /lkp/aaron/src/bin/usemem 9559680
9559680 transferred in 92 seconds, throughput: 1036 MB/s
cmdline: /lkp/aaron/src/bin/usemem 6171264
6171264 transferred in 92 seconds, throughput: 1036 MB/s
cmdline: /lkp/aaron/src/bin/usemem 15663744
15663744 transferred in 150 seconds, throughput: 635 MB/s
cmdline: /lkp/aaron/src/bin/usemem 12966528
12966528 transferred in 87 seconds, throughput: 1096 MB/s
cmdline: /lkp/aaron/src/bin/usemem 5784192
5784192 transferred in 131 seconds, throughput: 727 MB/s
cmdline: /lkp/aaron/src/bin/usemem 13731456
13731456 transferred in 97 seconds, throughput: 983 MB/s
cmdline: /lkp/aaron/src/bin/usemem 16440960
16440960 transferred in 109 seconds, throughput: 874 MB/s
cmdline: /lkp/aaron/src/bin/usemem 8813184
8813184 transferred in 122 seconds, throughput: 781 MB/s
Max: 1096 MB/s
Min: 635 MB/s
Avg: 899 MB/s

Result for head:
$ cat {0..8}/swap
cmdline: /lkp/aaron/src/bin/usemem 13163136
13163136 transferred in 105 seconds, throughput: 908 MB/s
cmdline: /lkp/aaron/src/bin/usemem 8524416
8524416 transferred in 78 seconds, throughput: 1222 MB/s
cmdline: /lkp/aaron/src/bin/usemem 3646080
3646080 transferred in 108 seconds, throughput: 882 MB/s
cmdline: /lkp/aaron/src/bin/usemem 8936064
8936064 transferred in 114 seconds, throughput: 836 MB/s
cmdline: /lkp/aaron/src/bin/usemem 12204672
12204672 transferred in 73 seconds, throughput: 1306 MB/s
cmdline: /lkp/aaron/src/bin/usemem 8140416
8140416 transferred in 146 seconds, throughput: 653 MB/s
cmdline: /lkp/aaron/src/bin/usemem 12941952
12941952 transferred in 78 seconds, throughput: 1222 MB/s
cmdline: /lkp/aaron/src/bin/usemem 6917760
6917760 transferred in 109 seconds, throughput: 874 MB/s
cmdline: /lkp/aaron/src/bin/usemem 11405952
11405952 transferred in 96 seconds, throughput: 993 MB/s
Max: 1306 MB/s
Min: 653 MB/s
Avg: 988 MB/s

Result for v4.3 as a reference:
$ cat {0..8}/swap
cmdline: /lkp/aaron/src/bin/usemem 12459648
12459648 transferred in 96 seconds, throughput: 993 MB/s
cmdline: /lkp/aaron/src/bin/usemem 7375488
7375488 transferred in 96 seconds, throughput: 993 MB/s
cmdline: /lkp/aaron/src/bin/usemem 9028224
9028224 transferred in 107 seconds, throughput: 891 MB/s
cmdline: /lkp/aaron/src/bin/usemem 10137216
10137216 transferred in 91 seconds, throughput: 1047 MB/s
cmdline: /lkp/aaron/src/bin/usemem 13835904
13835904 transferred in 80 seconds, throughput: 1192 MB/s
cmdline: /lkp/aaron/src/bin/usemem 10143360
10143360 transferred in 96 seconds, throughput: 993 MB/s
cmdline: /lkp/aaron/src/bin/usemem 100020593664
100020593664 transferred in 101 seconds, throughput: 944 MB/s
cmdline: /lkp/aaron/src/bin/usemem 15805056
15805056 transferred in 87 seconds, throughput: 1096 MB/s
cmdline: /lkp/aaron/src/bin/usemem 18360960
18360960 transferred in 74 seconds, throughput: 1288 MB/s
Max: 1288 MB/s
Min: 891 MB/s
Avg: 1048 MB/s

The always-never case:

Result for head:
$ cat {0..8}/swap
cmdline: /lkp/aaron/src/bin/usemem 13940352
13940352 transferred in 71 seconds, throughput: 1343 MB/s
cmdline: /lkp/aaron/src/bin/usemem 17411712
17411712 transferred in 62 seconds, throughput: 1538 MB/s
cmdline: /lkp/aaron/src/bin/usemem 11875968
11875968 transferred in 64 seconds, throughput: 1490 MB/s
cmdline: /lkp/aaron/src/bin/usemem 13912704
13912704 transferred in 62 seconds, throughput: 1538 MB/s
cmdline: /lkp/aaron/src/bin/usemem 12238464
12238464 transferred in 66 seconds, throughput: 1444 MB/s
cmdline: /lkp/aaron/src/bin/usemem 13670016
13670016 transferred in 65 seconds, throughput: 1467 MB/s
cmdline: /lkp/aaron/src/bin/usemem 8364672
8364672 transferred in 68 seconds, throughput: 1402 MB/s
cmdline: /lkp/aaron/src/bin/usemem 15417984
15417984 transferred in 70 seconds, throughput: 1362 MB/s
cmdline: /lkp/aaron/src/bin/usemem 15304320
15304320 transferred in 64 seconds, throughput: 1490 MB/s
Max: 1538 MB/s
Min: 1343 MB/s
Avg: 1452 MB/s


Re: [RFC 0/3] reduce latency of direct async compaction

2015-12-03 Thread Aaron Lu
On Thu, Dec 03, 2015 at 10:38:50AM +0100, Vlastimil Babka wrote:
> On 12/03/2015 10:25 AM, Aaron Lu wrote:
> > On Thu, Dec 03, 2015 at 09:10:44AM +0100, Vlastimil Babka wrote:
> >> Aaron, could you try this on your testcase?
> > 
> > The test result is placed at:
> > https://drive.google.com/file/d/0B49uX3igf4K4enBkdVFScXhFM0U
> > 
> > For some reason, the patches made the performance worse. The base tree is
> > today's Linus git 25364a9e54fb8296837061bf684b76d20eec01fb, and its
> > performance is about 1000 MB/s. After applying this patch series, the
> > performance drops to 720 MB/s.
> > 
> > Please let me know if you need more information, thanks.
> 
> Hm, compaction stats are at 0. The code in the patches isn't even running.
> Can you provide the same data also for the base tree?

My bad, I uploaded the wrong data :-/
I uploaded again:
https://drive.google.com/file/d/0B49uX3igf4K4UFI4TEQ3THYta0E

And I just ran the base tree with trace-cmd and found that its
performance drops significantly (from 1000 MB/s to 6xx MB/s). Does
trace-cmd impact performance that much? Any suggestions on how to run
the test regarding trace-cmd? I.e., should I always run usemem under
trace-cmd, or only when necessary?

Thanks,
Aaron


Re: [RFC 0/3] reduce latency of direct async compaction

2015-12-03 Thread Vlastimil Babka
On 12/03/2015 10:25 AM, Aaron Lu wrote:
> On Thu, Dec 03, 2015 at 09:10:44AM +0100, Vlastimil Babka wrote:
>> Aaron, could you try this on your testcase?
> 
> The test result is placed at:
> https://drive.google.com/file/d/0B49uX3igf4K4enBkdVFScXhFM0U
> 
> For some reason, the patches made the performance worse. The base tree is
> today's Linus git 25364a9e54fb8296837061bf684b76d20eec01fb, and its
> performance is about 1000 MB/s. After applying this patch series, the
> performance drops to 720 MB/s.
> 
> Please let me know if you need more information, thanks.

Hm, compaction stats are at 0. The code in the patches isn't even running.
Can you provide the same data also for the base tree?
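
(The quickest check is to dump the compact_* counters from /proc/vmstat;
a minimal sketch:)

#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("/proc/vmstat", "r");
    char line[128];

    if (!f) {
        perror("/proc/vmstat");
        return 1;
    }
    while (fgets(line, sizeof(line), f))
        if (strncmp(line, "compact_", 8) == 0)
            fputs(line, stdout);    /* compact_stall, compact_fail, ... */
    fclose(f);
    return 0;
}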



Re: [RFC 0/3] reduce latency of direct async compaction

2015-12-03 Thread Aaron Lu
On Thu, Dec 03, 2015 at 09:10:44AM +0100, Vlastimil Babka wrote:
> Aaron, could you try this on your testcase?

The test result is placed at:
https://drive.google.com/file/d/0B49uX3igf4K4enBkdVFScXhFM0U

For some reason, the patches made the performance worse. The base tree is
today's Linus git 25364a9e54fb8296837061bf684b76d20eec01fb, and its
performance is about 1000 MB/s. After applying this patch series, the
performance drops to 720 MB/s.

Please let me know if you need more information, thanks.

