Re: [RFC 0/3] reduce latency of direct async compaction
On 12/10/2015 12:35 PM, Joonsoo Kim wrote:
> On Wed, Dec 09, 2015 at 01:40:06PM +0800, Aaron Lu wrote:
>> On Wed, Dec 09, 2015 at 09:33:53AM +0900, Joonsoo Kim wrote:
>>> On Tue, Dec 08, 2015 at 04:52:42PM +0800, Aaron Lu wrote:
>>>> On Tue, Dec 08, 2015 at 03:51:16PM +0900, Joonsoo Kim wrote:
>>>>> I add work-around for this problem at isolate_freepages(). Please test
>>>>> following one.
>>>>
>>>> Still no luck and the error is about the same:
>>>
>>> There is a mistake... Could you insert () for
>>> cc->free_pfn & ~(pageblock_nr_pages-1) like as following?
>>>
>>> cc->free_pfn == (cc->free_pfn & ~(pageblock_nr_pages-1))
>>
>> Oh right, of course.
>>
>> Good news, the result is much better now:
>> $ cat {0..8}/swap
>> cmdline: /lkp/aaron/src/bin/usemem 100064603136
>> 100064603136 transferred in 72 seconds, throughput: 1325 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 100072049664
>> 100072049664 transferred in 74 seconds, throughput: 1289 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 100070246400
>> 100070246400 transferred in 92 seconds, throughput: 1037 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 100069545984
>> 100069545984 transferred in 81 seconds, throughput: 1178 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 100058895360
>> 100058895360 transferred in 78 seconds, throughput: 1223 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 100066074624
>> 100066074624 transferred in 94 seconds, throughput: 1015 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 100062855168
>> 100062855168 transferred in 77 seconds, throughput: 1239 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 100060990464
>> 100060990464 transferred in 73 seconds, throughput: 1307 MB/s
>> cmdline: /lkp/aaron/src/bin/usemem 100064996352
>> 100064996352 transferred in 84 seconds, throughput: 1136 MB/s
>> Max: 1325 MB/s
>> Min: 1015 MB/s
>> Avg: 1194 MB/s
>
> Nice result! Thanks for testing.
> I will make a proper formatted patch soon.

Thanks for the nice work.

> Then, your concern is solved?

I think so.

> I think that performance of always-always on this test case can't
> follow up performance of always-never because migration cost to make
> hugepage is additionally charged to always-always case. Instead, it
> will have more hugepage mapping and it may result in better
> performance in some situation. I guess that it is intention of that
> option.

OK, I see.

Regards,
Aaron
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC 0/3] reduce latency of direct async compaction
On Wed, Dec 09, 2015 at 01:40:06PM +0800, Aaron Lu wrote:
> On Wed, Dec 09, 2015 at 09:33:53AM +0900, Joonsoo Kim wrote:
>> On Tue, Dec 08, 2015 at 04:52:42PM +0800, Aaron Lu wrote:
>>> On Tue, Dec 08, 2015 at 03:51:16PM +0900, Joonsoo Kim wrote:
>>>> I add work-around for this problem at isolate_freepages(). Please test
>>>> following one.
>>>
>>> Still no luck and the error is about the same:
>>
>> There is a mistake... Could you insert () for
>> cc->free_pfn & ~(pageblock_nr_pages-1) like as following?
>>
>> cc->free_pfn == (cc->free_pfn & ~(pageblock_nr_pages-1))
>
> Oh right, of course.
>
> Good news, the result is much better now:
> $ cat {0..8}/swap
> cmdline: /lkp/aaron/src/bin/usemem 100064603136
> 100064603136 transferred in 72 seconds, throughput: 1325 MB/s
> cmdline: /lkp/aaron/src/bin/usemem 100072049664
> 100072049664 transferred in 74 seconds, throughput: 1289 MB/s
> cmdline: /lkp/aaron/src/bin/usemem 100070246400
> 100070246400 transferred in 92 seconds, throughput: 1037 MB/s
> cmdline: /lkp/aaron/src/bin/usemem 100069545984
> 100069545984 transferred in 81 seconds, throughput: 1178 MB/s
> cmdline: /lkp/aaron/src/bin/usemem 100058895360
> 100058895360 transferred in 78 seconds, throughput: 1223 MB/s
> cmdline: /lkp/aaron/src/bin/usemem 100066074624
> 100066074624 transferred in 94 seconds, throughput: 1015 MB/s
> cmdline: /lkp/aaron/src/bin/usemem 100062855168
> 100062855168 transferred in 77 seconds, throughput: 1239 MB/s
> cmdline: /lkp/aaron/src/bin/usemem 100060990464
> 100060990464 transferred in 73 seconds, throughput: 1307 MB/s
> cmdline: /lkp/aaron/src/bin/usemem 100064996352
> 100064996352 transferred in 84 seconds, throughput: 1136 MB/s
> Max: 1325 MB/s
> Min: 1015 MB/s
> Avg: 1194 MB/s

Nice result! Thanks for testing.
I will make a proper formatted patch soon.

Then, your concern is solved? I think that performance of
always-always on this test case can't follow up performance of
always-never because migration cost to make hugepage is additionally
charged to always-always case. Instead, it will have more hugepage
mapping and it may result in better performance in some situation.
I guess that it is intention of that option.

Thanks.
Re: [RFC 0/3] reduce latency of direct async compaction
On Wed, Dec 09, 2015 at 09:33:53AM +0900, Joonsoo Kim wrote:
> On Tue, Dec 08, 2015 at 04:52:42PM +0800, Aaron Lu wrote:
>> On Tue, Dec 08, 2015 at 03:51:16PM +0900, Joonsoo Kim wrote:
>>> I add work-around for this problem at isolate_freepages(). Please test
>>> following one.
>>
>> Still no luck and the error is about the same:
>
> There is a mistake... Could you insert () for
> cc->free_pfn & ~(pageblock_nr_pages-1) like as following?
>
> cc->free_pfn == (cc->free_pfn & ~(pageblock_nr_pages-1))

Oh right, of course.

Good news, the result is much better now:
$ cat {0..8}/swap
cmdline: /lkp/aaron/src/bin/usemem 100064603136
100064603136 transferred in 72 seconds, throughput: 1325 MB/s
cmdline: /lkp/aaron/src/bin/usemem 100072049664
100072049664 transferred in 74 seconds, throughput: 1289 MB/s
cmdline: /lkp/aaron/src/bin/usemem 100070246400
100070246400 transferred in 92 seconds, throughput: 1037 MB/s
cmdline: /lkp/aaron/src/bin/usemem 100069545984
100069545984 transferred in 81 seconds, throughput: 1178 MB/s
cmdline: /lkp/aaron/src/bin/usemem 100058895360
100058895360 transferred in 78 seconds, throughput: 1223 MB/s
cmdline: /lkp/aaron/src/bin/usemem 100066074624
100066074624 transferred in 94 seconds, throughput: 1015 MB/s
cmdline: /lkp/aaron/src/bin/usemem 100062855168
100062855168 transferred in 77 seconds, throughput: 1239 MB/s
cmdline: /lkp/aaron/src/bin/usemem 100060990464
100060990464 transferred in 73 seconds, throughput: 1307 MB/s
cmdline: /lkp/aaron/src/bin/usemem 100064996352
100064996352 transferred in 84 seconds, throughput: 1136 MB/s
Max: 1325 MB/s
Min: 1015 MB/s
Avg: 1194 MB/s

The base result for reference:
$ cat {0..8}/swap
cmdline: /lkp/aaron/src/bin/usemem 10622592
10622592 transferred in 103 seconds, throughput: 925 MB/s
cmdline: /lkp/aaron/src/bin/usemem 9559680
9559680 transferred in 92 seconds, throughput: 1036 MB/s
cmdline: /lkp/aaron/src/bin/usemem 6171264
6171264 transferred in 92 seconds, throughput: 1036 MB/s
cmdline: /lkp/aaron/src/bin/usemem 15663744
15663744 transferred in 150 seconds, throughput: 635 MB/s
cmdline: /lkp/aaron/src/bin/usemem 12966528
12966528 transferred in 87 seconds, throughput: 1096 MB/s
cmdline: /lkp/aaron/src/bin/usemem 5784192
5784192 transferred in 131 seconds, throughput: 727 MB/s
cmdline: /lkp/aaron/src/bin/usemem 13731456
13731456 transferred in 97 seconds, throughput: 983 MB/s
cmdline: /lkp/aaron/src/bin/usemem 16440960
16440960 transferred in 109 seconds, throughput: 874 MB/s
cmdline: /lkp/aaron/src/bin/usemem 8813184
8813184 transferred in 122 seconds, throughput: 781 MB/s
Max: 1096 MB/s
Min: 635 MB/s
Avg: 899 MB/s
Re: [RFC 0/3] reduce latency of direct async compaction
On Tue, Dec 08, 2015 at 04:52:42PM +0800, Aaron Lu wrote:
> On Tue, Dec 08, 2015 at 03:51:16PM +0900, Joonsoo Kim wrote:
>> On Tue, Dec 08, 2015 at 01:14:39PM +0800, Aaron Lu wrote:
>>> On Tue, Dec 08, 2015 at 09:41:18AM +0900, Joonsoo Kim wrote:
>>>> On Mon, Dec 07, 2015 at 04:59:56PM +0800, Aaron Lu wrote:
>>>>> On Mon, Dec 07, 2015 at 04:35:24PM +0900, Joonsoo Kim wrote:
>>>>>> It looks like overhead still remain. I guess that migration scanner
>>>>>> would call pageblock_pfn_to_page() for more extended range so
>>>>>> overhead still remain.
>>>>>>
>>>>>> I have an idea to solve his problem. Aaron, could you test
>>>>>> following patch on top of base? It tries to skip calling
>>>>>> pageblock_pfn_to_page()
>>>>>
>>>>> It doesn't apply on top of 25364a9e54fb8296837061bf684b76d20eec01fb
>>>>> cleanly, so I made some changes to make it apply and the result is:
>>>>> https://github.com/aaronlu/linux/commit/cb8d05829190b806ad3948ff9b9e08c8ba1daf63
>>>>
>>>> Yes, that's okay. I made it on my working branch but it will not
>>>> result in any problem except applying.
>>>>
>>>>> There is a problem occured right after the test starts:
>>>>> [ 58.080962] BUG: unable to handle kernel paging request at ea008218
>>>>> [ 58.089124] IP: [] compaction_alloc+0xf9/0x270
>>>>> [ 58.096109] PGD 107ffd6067 PUD 207f7d5067 PMD 0
>>>>> [ 58.101569] Oops: [#1] SMP
>>>>
>>>> I did some mistake. Please test following patch. It is also made
>>>> on my working branch so you need to resolve conflict but it would be
>>>> trivial.
>>>>
>>>> I inserted some logs to check whether zone is contiguous or not.
>>>> Please check that normal zone is set to contiguous after testing.
>>>
>>> Yes it is contiguous, but unfortunately, the problem remains:
>>> [ 56.536930] check_zone_contiguous: Normal
>>> [ 56.543467] check_zone_contiguous: Normal: contiguous
>>> [ 56.549640] BUG: unable to handle kernel paging request at ea008218
>>> [ 56.557717] IP: [] compaction_alloc+0xf9/0x270
>>> [ 56.564719] PGD 107ffd6067 PUD 207f7d5067 PMD 0
>>
>> Maybe, I find the reason. cc->free_pfn can be initialized to invalid pfn
>> that isn't checked so optimized pageblock_pfn_to_page() causes BUG().
>>
>> I add work-around for this problem at isolate_freepages(). Please test
>> following one.
>
> Still no luck and the error is about the same:

There is a mistake... Could you insert () for
cc->free_pfn & ~(pageblock_nr_pages-1) like as following?

cc->free_pfn == (cc->free_pfn & ~(pageblock_nr_pages-1))

Thanks.

> [ 64.727792] check_zone_contiguous: Normal
> [ 64.733950] check_zone_contiguous: Normal: contiguous
> [ 64.741610] BUG: unable to handle kernel paging request at ea008218
> [ 64.749708] IP: [] compaction_alloc+0xf9/0x270
> [ 64.756806] PGD 107ffd6067 PUD 207f7d5067 PMD 0
> [ 64.762302] Oops: [#1] SMP
> [ 64.766294] Modules linked in: scsi_debug rpcsec_gss_krb5 auth_rpcgss
> nfsv4 dns_resolver netconsole sg sd_mod x86_pkg_temp_thermal coretemp
> kvm_intel kvm mgag200 irqbypass crct10dif_pclmul ttm crc32_pclmul
> crc32c_intel drm_kms_helper ahci syscopyarea sysfillrect sysimgblt snd_pcm
> libahci fb_sys_fops snd_timer snd sb_edac aesni_intel soundcore lrw drm
> gf128mul pcspkr edac_core ipmi_devintf glue_helper ablk_helper cryptd libata
> ipmi_si shpchp wmi ipmi_msghandler acpi_power_meter acpi_pad
> [ 64.816579] CPU: 19 PID: 1526 Comm: usemem Not tainted
> 4.4.0-rc3-00025-gf60ea5f #1
> [ 64.825419] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS
> SE5C610.86B.01.01.0008.021120151325 02/11/2015
> [ 64.837483] task: 88168a0aca80 ti: 88168a564000 task.ti:88168a564000
> [ 64.846264] RIP: 0010:[] [] compaction_alloc+0xf9/0x270
> [ 64.856147] RSP: :88168a567940 EFLAGS: 00010286
> [ 64.862520] RAX: 88207ffdcd80 RBX: 88168a567ac0 RCX: 88207ffdcd80
> [ 64.870944] RDX: 0208 RSI: 88168a567ac0 RDI: 88168a567ac0
> [ 64.879377] RBP: 88168a567990 R08: ea008200 R09:
> [ 64.887813] R10: R11: 0001ae88 R12: ea008200
> [ 64.896254] R13: ea0059f20780 R14: 0208 R15: 0208
> [ 64.904704] FS: 7f2d4e6e8700() GS:88203444() knlGS:
> [ 64.914232] CS: 0010 DS: ES: CR0: 80050033
> [ 64.921151] CR2: ea008218 CR3: 002015771000 CR4: 001406e0
> [ 64.929635] Stack:
> [ 64.932413] 88168a568000 0167ca00 81193196 88207ffdcd80
> [ 64.941292] 0208 ea0059f207c0 88168a567ac0 ea0059f20780
> [ 64.950179] ea0059f207e0 88207ffdcd80 88168a567a20 811d097e
> [ 64.959071]
Re: [RFC 0/3] reduce latency of direct async compaction
On Tue, Dec 08, 2015 at 03:51:16PM +0900, Joonsoo Kim wrote:
> On Tue, Dec 08, 2015 at 01:14:39PM +0800, Aaron Lu wrote:
>> On Tue, Dec 08, 2015 at 09:41:18AM +0900, Joonsoo Kim wrote:
>>> On Mon, Dec 07, 2015 at 04:59:56PM +0800, Aaron Lu wrote:
>>>> On Mon, Dec 07, 2015 at 04:35:24PM +0900, Joonsoo Kim wrote:
>>>>> It looks like overhead still remain. I guess that migration scanner
>>>>> would call pageblock_pfn_to_page() for more extended range so
>>>>> overhead still remain.
>>>>>
>>>>> I have an idea to solve his problem. Aaron, could you test following
>>>>> patch on top of base? It tries to skip calling pageblock_pfn_to_page()
>>>>
>>>> It doesn't apply on top of 25364a9e54fb8296837061bf684b76d20eec01fb
>>>> cleanly, so I made some changes to make it apply and the result is:
>>>> https://github.com/aaronlu/linux/commit/cb8d05829190b806ad3948ff9b9e08c8ba1daf63
>>>
>>> Yes, that's okay. I made it on my working branch but it will not result in
>>> any problem except applying.
>>>
>>>> There is a problem occured right after the test starts:
>>>> [ 58.080962] BUG: unable to handle kernel paging request at ea008218
>>>> [ 58.089124] IP: [] compaction_alloc+0xf9/0x270
>>>> [ 58.096109] PGD 107ffd6067 PUD 207f7d5067 PMD 0
>>>> [ 58.101569] Oops: [#1] SMP
>>>
>>> I did some mistake. Please test following patch. It is also made
>>> on my working branch so you need to resolve conflict but it would be
>>> trivial.
>>>
>>> I inserted some logs to check whether zone is contiguous or not.
>>> Please check that normal zone is set to contiguous after testing.
>>
>> Yes it is contiguous, but unfortunately, the problem remains:
>> [ 56.536930] check_zone_contiguous: Normal
>> [ 56.543467] check_zone_contiguous: Normal: contiguous
>> [ 56.549640] BUG: unable to handle kernel paging request at ea008218
>> [ 56.557717] IP: [] compaction_alloc+0xf9/0x270
>> [ 56.564719] PGD 107ffd6067 PUD 207f7d5067 PMD 0
>
> Maybe, I find the reason. cc->free_pfn can be initialized to invalid pfn
> that isn't checked so optimized pageblock_pfn_to_page() causes BUG().
>
> I add work-around for this problem at isolate_freepages(). Please test
> following one.

Still no luck and the error is about the same:

[ 64.727792] check_zone_contiguous: Normal
[ 64.733950] check_zone_contiguous: Normal: contiguous
[ 64.741610] BUG: unable to handle kernel paging request at ea008218
[ 64.749708] IP: [] compaction_alloc+0xf9/0x270
[ 64.756806] PGD 107ffd6067 PUD 207f7d5067 PMD 0
[ 64.762302] Oops: [#1] SMP
[ 64.766294] Modules linked in: scsi_debug rpcsec_gss_krb5 auth_rpcgss
nfsv4 dns_resolver netconsole sg sd_mod x86_pkg_temp_thermal coretemp
kvm_intel kvm mgag200 irqbypass crct10dif_pclmul ttm crc32_pclmul
crc32c_intel drm_kms_helper ahci syscopyarea sysfillrect sysimgblt snd_pcm
libahci fb_sys_fops snd_timer snd sb_edac aesni_intel soundcore lrw drm
gf128mul pcspkr edac_core ipmi_devintf glue_helper ablk_helper cryptd libata
ipmi_si shpchp wmi ipmi_msghandler acpi_power_meter acpi_pad
[ 64.816579] CPU: 19 PID: 1526 Comm: usemem Not tainted
4.4.0-rc3-00025-gf60ea5f #1
[ 64.825419] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS
SE5C610.86B.01.01.0008.021120151325 02/11/2015
[ 64.837483] task: 88168a0aca80 ti: 88168a564000 task.ti:88168a564000
[ 64.846264] RIP: 0010:[] [] compaction_alloc+0xf9/0x270
[ 64.856147] RSP: :88168a567940 EFLAGS: 00010286
[ 64.862520] RAX: 88207ffdcd80 RBX: 88168a567ac0 RCX: 88207ffdcd80
[ 64.870944] RDX: 0208 RSI: 88168a567ac0 RDI: 88168a567ac0
[ 64.879377] RBP: 88168a567990 R08: ea008200 R09:
[ 64.887813] R10: R11: 0001ae88 R12: ea008200
[ 64.896254] R13: ea0059f20780 R14: 0208 R15: 0208
[ 64.904704] FS: 7f2d4e6e8700() GS:88203444() knlGS:
[ 64.914232] CS: 0010 DS: ES: CR0: 80050033
[ 64.921151] CR2: ea008218 CR3: 002015771000 CR4: 001406e0
[ 64.929635] Stack:
[ 64.932413] 88168a568000 0167ca00 81193196 88207ffdcd80
[ 64.941292] 0208 ea0059f207c0 88168a567ac0 ea0059f20780
[ 64.950179] ea0059f207e0 88207ffdcd80 88168a567a20 811d097e
[ 64.959071] Call Trace:
[ 64.962364] [] ? update_pageblock_skip+0x56/0xa0
[ 64.969939] [] migrate_pages+0x28e/0x7b0
[ 64.976728] [] ? update_pageblock_skip+0xa0/0xa0
[ 64.984312] [] ? __pageblock_pfn_to_page+0xe0/0xe0
[ 64.992093] [] compact_zone+0x38a/0x8e0
[ 64.998811] [] compact_zone_order+0x6d/0x90
[ 65.005926] [] ? get_page_from_freelist+0xd4/0xa20
[ 65.013861] [] try_to_compact_pages+0xec/0x210
[ 65.021212] [] ?
Re: [RFC 0/3] reduce latency of direct async compaction
On Tue, Dec 08, 2015 at 01:14:39PM +0800, Aaron Lu wrote:
> On Tue, Dec 08, 2015 at 09:41:18AM +0900, Joonsoo Kim wrote:
> > On Mon, Dec 07, 2015 at 04:59:56PM +0800, Aaron Lu wrote:
> > > On Mon, Dec 07, 2015 at 04:35:24PM +0900, Joonsoo Kim wrote:
> > > > It looks like overhead still remain. I guess that migration scanner
> > > > would call pageblock_pfn_to_page() for more extended range so
> > > > overhead still remain.
> > > >
> > > > I have an idea to solve his problem. Aaron, could you test following patch
> > > > on top of base? It tries to skip calling pageblock_pfn_to_page()
> > >
> > > It doesn't apply on top of 25364a9e54fb8296837061bf684b76d20eec01fb
> > > cleanly, so I made some changes to make it apply and the result is:
> > > https://github.com/aaronlu/linux/commit/cb8d05829190b806ad3948ff9b9e08c8ba1daf63
> >
> > Yes, that's okay. I made it on my working branch but it will not result in
> > any problem except applying.
> >
> > > There is a problem occured right after the test starts:
> > > [   58.080962] BUG: unable to handle kernel paging request at ea008218
> > > [   58.089124] IP: [] compaction_alloc+0xf9/0x270
> > > [   58.096109] PGD 107ffd6067 PUD 207f7d5067 PMD 0
> > > [   58.101569] Oops: [#1] SMP
> >
> > I did some mistake. Please test following patch. It is also made
> > on my working branch so you need to resolve conflict but it would be
> > trivial.
> >
> > I inserted some logs to check whether zone is contiguous or not.
> > Please check that normal zone is set to contiguous after testing.
>
> Yes it is contiguous, but unfortunately, the problem remains:
> [   56.536930] check_zone_contiguous: Normal
> [   56.543467] check_zone_contiguous: Normal: contiguous
> [   56.549640] BUG: unable to handle kernel paging request at ea008218
> [   56.557717] IP: [] compaction_alloc+0xf9/0x270
> [   56.564719] PGD 107ffd6067 PUD 207f7d5067 PMD 0

Maybe, I find the reason. cc->free_pfn can be initialized to invalid pfn
that isn't checked so optimized pageblock_pfn_to_page() causes BUG().
I add work-around for this problem at isolate_freepages(). Please test
following one.

Thanks.

-->8---
From 7e954a68fb555a868acc5860627a1ad8dadbe3bf Mon Sep 17 00:00:00 2001
From: Joonsoo Kim
Date: Mon, 7 Dec 2015 14:51:42 +0900
Subject: [PATCH] mm/compaction: Optimize pageblock_pfn_to_page() for
 contiguous zone

Signed-off-by: Joonsoo Kim
---
 include/linux/mmzone.h |  1 +
 mm/compaction.c        | 60 +++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 60 insertions(+), 1 deletion(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index e23a9e7..573f9a9 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -521,6 +521,7 @@ struct zone {
 #endif
 
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
+	int contiguous;
 	/* Set to true when the PG_migrate_skip bits should be cleared */
 	bool			compact_blockskip_flush;
 #endif
diff --git a/mm/compaction.c b/mm/compaction.c
index de3e1e7..ff5fb04 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -88,7 +88,7 @@ static inline bool migrate_async_suitable(int migratetype)
  * the first and last page of a pageblock and avoid checking each individual
  * page in a pageblock.
  */
-static struct page *pageblock_pfn_to_page(unsigned long start_pfn,
+static struct page *__pageblock_pfn_to_page(unsigned long start_pfn,
 			unsigned long end_pfn, struct zone *zone)
 {
 	struct page *start_page;
@@ -114,6 +114,56 @@ static struct page *pageblock_pfn_to_page(unsigned long start_pfn,
 	return start_page;
 }
 
+static inline struct page *pageblock_pfn_to_page(unsigned long start_pfn,
+			unsigned long end_pfn, struct zone *zone)
+{
+	if (zone->contiguous == 1)
+		return pfn_to_page(start_pfn);
+
+	return __pageblock_pfn_to_page(start_pfn, end_pfn, zone);
+}
+
+static void check_zone_contiguous(struct zone *zone)
+{
+	unsigned long block_start_pfn = zone->zone_start_pfn;
+	unsigned long block_end_pfn;
+	unsigned long pfn;
+
+	/* Already checked */
+	if (zone->contiguous)
+		return;
+
+	printk("%s: %s\n", __func__, zone->name);
+	block_end_pfn = ALIGN(block_start_pfn + 1, pageblock_nr_pages);
+	for (; block_start_pfn < zone_end_pfn(zone);
+		block_start_pfn = block_end_pfn,
+		block_end_pfn += pageblock_nr_pages) {
+
+		block_end_pfn = min(block_end_pfn, zone_end_pfn(zone));
+
+		if (!__pageblock_pfn_to_page(block_start_pfn,
+					block_end_pfn, zone)) {
+			/* We have hole */
+			zone->contiguous = -1;
+			printk("%s: %s: uncontiguous\n", __func__, zone->name);
+			return;
Re: [RFC 0/3] reduce latency of direct async compaction
On Tue, Dec 08, 2015 at 09:41:18AM +0900, Joonsoo Kim wrote:
> On Mon, Dec 07, 2015 at 04:59:56PM +0800, Aaron Lu wrote:
> > On Mon, Dec 07, 2015 at 04:35:24PM +0900, Joonsoo Kim wrote:
> > > It looks like overhead still remain. I guess that migration scanner
> > > would call pageblock_pfn_to_page() for more extended range so
> > > overhead still remain.
> > >
> > > I have an idea to solve his problem. Aaron, could you test following patch
> > > on top of base? It tries to skip calling pageblock_pfn_to_page()
> >
> > It doesn't apply on top of 25364a9e54fb8296837061bf684b76d20eec01fb
> > cleanly, so I made some changes to make it apply and the result is:
> > https://github.com/aaronlu/linux/commit/cb8d05829190b806ad3948ff9b9e08c8ba1daf63
>
> Yes, that's okay. I made it on my working branch but it will not result in
> any problem except applying.
>
> > There is a problem occured right after the test starts:
> > [   58.080962] BUG: unable to handle kernel paging request at ea008218
> > [   58.089124] IP: [] compaction_alloc+0xf9/0x270
> > [   58.096109] PGD 107ffd6067 PUD 207f7d5067 PMD 0
> > [   58.101569] Oops: [#1] SMP
>
> I did some mistake. Please test following patch. It is also made
> on my working branch so you need to resolve conflict but it would be
> trivial.
>
> I inserted some logs to check whether zone is contiguous or not.
> Please check that normal zone is set to contiguous after testing.

Yes it is contiguous, but unfortunately, the problem remains:
[   56.536930] check_zone_contiguous: Normal
[   56.543467] check_zone_contiguous: Normal: contiguous
[   56.549640] BUG: unable to handle kernel paging request at ea008218
[   56.557717] IP: [] compaction_alloc+0xf9/0x270
[   56.564719] PGD 107ffd6067 PUD 207f7d5067 PMD 0

Full dmesg attached.

Thanks,
Aaron

>
> Thanks.
Re: [RFC 0/3] reduce latency of direct async compaction
On Mon, Dec 07, 2015 at 04:59:56PM +0800, Aaron Lu wrote:
> On Mon, Dec 07, 2015 at 04:35:24PM +0900, Joonsoo Kim wrote:
> > It looks like overhead still remain. I guess that migration scanner
> > would call pageblock_pfn_to_page() for more extended range so
> > overhead still remain.
> >
> > I have an idea to solve his problem. Aaron, could you test following patch
> > on top of base? It tries to skip calling pageblock_pfn_to_page()
>
> It doesn't apply on top of 25364a9e54fb8296837061bf684b76d20eec01fb
> cleanly, so I made some changes to make it apply and the result is:
> https://github.com/aaronlu/linux/commit/cb8d05829190b806ad3948ff9b9e08c8ba1daf63

Yes, that's okay. I made it on my working branch but it will not result in
any problem except applying.

> There is a problem occured right after the test starts:
> [   58.080962] BUG: unable to handle kernel paging request at ea008218
> [   58.089124] IP: [] compaction_alloc+0xf9/0x270
> [   58.096109] PGD 107ffd6067 PUD 207f7d5067 PMD 0
> [   58.101569] Oops: [#1] SMP

I did some mistake. Please test following patch. It is also made
on my working branch so you need to resolve conflict but it would be
trivial.

I inserted some logs to check whether zone is contiguous or not.
Please check that normal zone is set to contiguous after testing.

Thanks.

-->8--
From 4a1a08d8ab3fb165b87ad2ec0a2000ff6892330f Mon Sep 17 00:00:00 2001
From: Joonsoo Kim
Date: Mon, 7 Dec 2015 14:51:42 +0900
Subject: [PATCH] mm/compaction: Optimize pageblock_pfn_to_page() for
 contiguous zone

Signed-off-by: Joonsoo Kim
---
 include/linux/mmzone.h |  1 +
 mm/compaction.c        | 54 ++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 54 insertions(+), 1 deletion(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index e23a9e7..573f9a9 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -521,6 +521,7 @@ struct zone {
 #endif
 
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
+	int contiguous;
 	/* Set to true when the PG_migrate_skip bits should be cleared */
 	bool			compact_blockskip_flush;
 #endif
diff --git a/mm/compaction.c b/mm/compaction.c
index 67b8d90..cb5c7a2 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -88,7 +88,7 @@ static inline bool migrate_async_suitable(int migratetype)
  * the first and last page of a pageblock and avoid checking each individual
  * page in a pageblock.
  */
-static struct page *pageblock_pfn_to_page(unsigned long start_pfn,
+static struct page *__pageblock_pfn_to_page(unsigned long start_pfn,
 			unsigned long end_pfn, struct zone *zone)
 {
 	struct page *start_page;
@@ -114,6 +114,56 @@ static struct page *pageblock_pfn_to_page(unsigned long start_pfn,
 	return start_page;
 }
 
+static inline struct page *pageblock_pfn_to_page(unsigned long start_pfn,
+			unsigned long end_pfn, struct zone *zone)
+{
+	if (zone->contiguous == 1)
+		return pfn_to_page(start_pfn);
+
+	return __pageblock_pfn_to_page(start_pfn, end_pfn, zone);
+}
+
+static void check_zone_contiguous(struct zone *zone)
+{
+	unsigned long block_start_pfn = zone->zone_start_pfn;
+	unsigned long block_end_pfn;
+	unsigned long pfn;
+
+	/* Already checked */
+	if (zone->contiguous)
+		return;
+
+	printk("%s: %s\n", __func__, zone->name);
+	block_end_pfn = ALIGN(block_start_pfn + 1, pageblock_nr_pages);
+	for (; block_start_pfn < zone_end_pfn(zone);
+		block_start_pfn = block_end_pfn,
+		block_end_pfn += pageblock_nr_pages) {
+
+		block_end_pfn = min(block_end_pfn, zone_end_pfn(zone));
+
+		if (!__pageblock_pfn_to_page(block_start_pfn,
+					block_end_pfn, zone)) {
+			/* We have hole */
+			zone->contiguous = -1;
+			printk("%s: %s: uncontiguous\n", __func__, zone->name);
+			return;
+		}
+
+		/* Check validity of pfn within pageblock */
+		for (pfn = block_start_pfn; pfn < block_end_pfn; pfn++) {
+			if (!pfn_valid_within(pfn)) {
+				zone->contiguous = -1;
+				printk("%s: %s: uncontiguous\n", __func__, zone->name);
+				return;
+			}
+		}
+	}
+
+	/* We don't have hole */
+	zone->contiguous = 1;
+	printk("%s: %s: contiguous\n", __func__, zone->name);
+}
+
 #ifdef CONFIG_COMPACTION
 
 /* Do not skip compaction more than 64 times */
@@ -1353,6 +1403,8 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 		;
 	}
 
+	check_zone_contiguous(zone);
+
 	/*
 	 * Clear pageblock skip if
Re: [RFC 0/3] reduce latency of direct async compaction
On Fri, Dec 04, 2015 at 01:34:09PM +0100, Vlastimil Babka wrote:
> On 12/03/2015 12:52 PM, Aaron Lu wrote:
> > On Thu, Dec 03, 2015 at 07:35:08PM +0800, Aaron Lu wrote:
> >> On Thu, Dec 03, 2015 at 10:38:50AM +0100, Vlastimil Babka wrote:
> >>> On 12/03/2015 10:25 AM, Aaron Lu wrote:
> >>>> On Thu, Dec 03, 2015 at 09:10:44AM +0100, Vlastimil Babka wrote:
> >>
> >> My bad, I uploaded the wrong data :-/
> >> I uploaded again:
> >> https://drive.google.com/file/d/0B49uX3igf4K4UFI4TEQ3THYta0E
> >>
> >> And I just run the base tree with trace-cmd and found that its
> >> performace drops significantly(from 1000MB/s to 6xxMB/s), is it that
> >> trace-cmd will impact performace a lot?
>
> Yeah it has some overhead depending on how many events it has to
> process. Your workload is quite sensitive to that.
>
> >> Any suggestions on how to run
> >> the test regarding trace-cmd? i.e. should I aways run usemem under
> >> trace-cmd or only when necessary?
>
> I'd run it with tracing only when the goal is to collect traces, but
> not for any performance comparisons. Also it's not useful to collect
> perf data while also tracing.
>
> > I just run the test with the base tree and with this patch series
> > applied(head), I didn't use trace-cmd this time.
> >
> > The throughput for base tree is 963MB/s while the head is 815MB/s, I
> > have attached pagetypeinfo/proc-vmstat/perf-profile for them.
>
> The compact stats improvements look fine, perhaps better than in my tests:
>
> base: compact_migrate_scanned 3476360
> head: compact_migrate_scanned 1020827
>
> - that's the eager skipping of patch 2
>
> base: compact_free_scanned 5924928
> head: compact_free_scanned 0
>       compact_free_direct 918813
>       compact_free_direct_miss 500308
>
> As your workload does exclusively async direct compaction through
> THP faults, the traditional free scanner isn't used at all. Direct
> allocations should be much cheaper, although the "miss" ratio (the
> allocations that were from the same pageblock as the one we are
> compacting) is quite high. I should probably look into making
> migration release pages to the tails of the freelists - could be
> that it's grabbing the very pages that were just freed in the
> previous COMPACT_CLUSTER_MAX cycle (modulo pcplist buffering).
>
> I however find it strange that your original stats (4.3?) differ
> from the base so much:
>
> compact_migrate_scanned 1982396
> compact_free_scanned 40576943
>
> That was order of magnitude more free scanned on 4.3, and half the
> migrate scanned. But your throughput figures in the other mail
> suggested a regression from 4.3 to 4.4, which would be the opposite
> of what the stats say. And anyway, compaction code didn't change
> between 4.3 and 4.4 except changes to tracepoint format...
>
> moving on...
> base:
> compact_isolated 731304
> compact_stall 10561
> compact_fail 9459
> compact_success 1102
>
> head:
> compact_isolated 921087
> compact_stall 14451
> compact_fail 12550
> compact_success 1901
>
> More success in both isolation and compaction results.
>
> base:
> thp_fault_alloc 45337
> thp_fault_fallback 2349
>
> head:
> thp_fault_alloc 45564
> thp_fault_fallback 2120
>
> Somehow the extra compact success didn't fully translate to thp
> alloc success... But given how many of the alloc's didn't even
> involve a compact_stall (two thirds of them), that interpretation
> could also be easily misleading. So, hard to say.
>
> Looking at the perf profiles...
> base:
>     54.55%    54.55%  :1550  [kernel.kallsyms]  [k] pageblock_pfn_to_page
>
> head:
>     40.13%    40.13%  :1551  [kernel.kallsyms]  [k] pageblock_pfn_to_page
>
> Since the freepage allocation doesn't hit this code anymore, it
> shows that the bulk was actually from the migration scanner,
> although the perf callgraph and vmstats suggested otherwise.

It looks like overhead still remain. I guess that migration scanner
would call pageblock_pfn_to_page() for more extended range so
overhead still remain.

I have an idea to solve his problem. Aaron, could you test following patch
on top of base? It tries to skip calling pageblock_pfn_to_page()
if we check that zone is contiguous at initialization stage.

Thanks.

>8
From 9c4fbf8f8ed37eb88a04a97908e76ba2437404a2 Mon Sep 17 00:00:00 2001
From: Joonsoo Kim
Date: Mon, 7 Dec 2015 14:51:42 +0900
Subject: [PATCH] mm/compaction: Optimize pageblock_pfn_to_page() for
 contiguous zone

Signed-off-by: Joonsoo Kim
---
 include/linux/mmzone.h |  1 +
 mm/compaction.c        | 35 ++++++++++++++++++++++++++++++++++-
 2 files changed, 35 insertions(+), 1 deletion(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index e23a9e7..573f9a9 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -521,6 +521,7 @@ struct zone {
 #endif
 
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
+	int contiguous;
 	/* Set to true when the PG_migrate_skip bits should be cleared */
Re: [RFC 0/3] reduce latency of direct async compaction
On 12/04/2015 08:38 PM, Vlastimil Babka wrote: > On 12/04/2015 07:25 AM, Aaron Lu wrote: >> On Thu, Dec 03, 2015 at 09:10:44AM +0100, Vlastimil Babka wrote: >>> Aaron, could you try this on your testcase? >> >> One time result isn't stable enough, so I did 9 runs for each commit, >> here is the result: >> >> base: 25364a9e54fb8296837061bf684b76d20eec01fb >> head: 7433b1009ff5a02e1e9f3444802daba2cf385d27 >> (head = base + this_patch_serie) >> >> The always-always case(transparent_hugepage set to always and defrag set >> to always): >> >> Result for base: >> $ cat {0..8}/swap >> cmdline: /lkp/aaron/src/bin/usemem 10622592 >> 10622592 transferred in 103 seconds, throughput: 925 MB/s >> cmdline: /lkp/aaron/src/bin/usemem 9559680 >> 9559680 transferred in 92 seconds, throughput: 1036 MB/s >> cmdline: /lkp/aaron/src/bin/usemem 6171264 >> 6171264 transferred in 92 seconds, throughput: 1036 MB/s >> cmdline: /lkp/aaron/src/bin/usemem 15663744 >> 15663744 transferred in 150 seconds, throughput: 635 MB/s >> cmdline: /lkp/aaron/src/bin/usemem 12966528 >> 12966528 transferred in 87 seconds, throughput: 1096 MB/s >> cmdline: /lkp/aaron/src/bin/usemem 5784192 >> 5784192 transferred in 131 seconds, throughput: 727 MB/s >> cmdline: /lkp/aaron/src/bin/usemem 13731456 >> 13731456 transferred in 97 seconds, throughput: 983 MB/s >> cmdline: /lkp/aaron/src/bin/usemem 16440960 >> 16440960 transferred in 109 seconds, throughput: 874 MB/s >> cmdline: /lkp/aaron/src/bin/usemem 8813184 >> 8813184 transferred in 122 seconds, throughput: 781 MB/s >> Max: 1096 MB/s >> Min: 635 MB/s >> Avg: 899 MB/s >> >> Result for head: >> $ cat {0..8}/swap >> cmdline: /lkp/aaron/src/bin/usemem 13163136 >> 13163136 transferred in 105 seconds, throughput: 908 MB/s >> cmdline: /lkp/aaron/src/bin/usemem 8524416 >> 8524416 transferred in 78 seconds, throughput: 1222 MB/s >> cmdline: /lkp/aaron/src/bin/usemem 3646080 >> 3646080 transferred in 108 seconds, throughput: 882 MB/s >> cmdline: /lkp/aaron/src/bin/usemem 
8936064 >> 8936064 transferred in 114 seconds, throughput: 836 MB/s >> cmdline: /lkp/aaron/src/bin/usemem 12204672 >> 12204672 transferred in 73 seconds, throughput: 1306 MB/s >> cmdline: /lkp/aaron/src/bin/usemem 8140416 >> 8140416 transferred in 146 seconds, throughput: 653 MB/s >> cmdline: /lkp/aaron/src/bin/usemem 12941952 >> 12941952 transferred in 78 seconds, throughput: 1222 MB/s >> cmdline: /lkp/aaron/src/bin/usemem 6917760 >> 6917760 transferred in 109 seconds, throughput: 874 MB/s >> cmdline: /lkp/aaron/src/bin/usemem 11405952 >> 11405952 transferred in 96 seconds, throughput: 993 MB/s >> Max: 1306 MB/s >> Min: 653 MB/s >> Avg: 988 MB/s > > Ok that looks better than the first results :) The series either helped, > or it's just noise. But hopefully not worse. Well, it looks to be the case :-) > >> Result for v4.3 as a reference: >> $ cat {0..8}/swap >> cmdline: /lkp/aaron/src/bin/usemem 12459648 >> 12459648 transferred in 96 seconds, throughput: 993 MB/s >> cmdline: /lkp/aaron/src/bin/usemem 7375488 >> 7375488 transferred in 96 seconds, throughput: 993 MB/s >> cmdline: /lkp/aaron/src/bin/usemem 9028224 >> 9028224 transferred in 107 seconds, throughput: 891 MB/s >> cmdline: /lkp/aaron/src/bin/usemem 10137216 >> 10137216 transferred in 91 seconds, throughput: 1047 MB/s >> cmdline: /lkp/aaron/src/bin/usemem 13835904 >> 13835904 transferred in 80 seconds, throughput: 1192 MB/s >> cmdline: /lkp/aaron/src/bin/usemem 10143360 >> 10143360 transferred in 96 seconds, throughput: 993 MB/s >> cmdline: /lkp/aaron/src/bin/usemem 100020593664 >> 100020593664 transferred in 101 seconds, throughput: 944 MB/s >> cmdline: /lkp/aaron/src/bin/usemem 15805056 >> 15805056 transferred in 87 seconds, throughput: 1096 MB/s >> cmdline: /lkp/aaron/src/bin/usemem 18360960 >> 18360960 transferred in 74 seconds, throughput: 1288 MB/s >> Max: 1288 MB/s >> Min: 891 MB/s >> Avg: 1048 MB/s > > Hard to say if there's actual regression from 4.3 to 4.4, it's too > noisy. 
More iterations could help, but then the eventual bisection would > need them too. One thing puzzles me most is that once compaction is involved, the results become nondeterministic, i.e. the result could be as high as 1xxx MB/s or as low as 6xx MB/s. The always-never case is much better in this regard. Thanks, Aaron > >> The always-never case: >> >> Result for head: >> $ cat {0..8}/swap >> cmdline: /lkp/aaron/src/bin/usemem 13940352 >> 13940352 transferred in 71 seconds, throughput: 1343 MB/s >> cmdline: /lkp/aaron/src/bin/usemem 17411712 >> 17411712 transferred in 62 seconds, throughput: 1538 MB/s >> cmdline: /lkp/aaron/src/bin/usemem 11875968 >> 11875968 transferred in 64 seconds, throughput: 1490 MB/s >> cmdline: /lkp/aaron/src/bin/usemem 13912704 >>
Re: [RFC 0/3] reduce latency of direct async compaction
On Fri, Dec 04, 2015 at 01:34:09PM +0100, Vlastimil Babka wrote: > On 12/03/2015 12:52 PM, Aaron Lu wrote: > >On Thu, Dec 03, 2015 at 07:35:08PM +0800, Aaron Lu wrote: > >>On Thu, Dec 03, 2015 at 10:38:50AM +0100, Vlastimil Babka wrote: > >>>On 12/03/2015 10:25 AM, Aaron Lu wrote: > On Thu, Dec 03, 2015 at 09:10:44AM +0100, Vlastimil Babka wrote: > >> > >>My bad, I uploaded the wrong data :-/ > >>I uploaded again: > >>https://drive.google.com/file/d/0B49uX3igf4K4UFI4TEQ3THYta0E > >> > >>And I just run the base tree with trace-cmd and found that its > >>performace drops significantly(from 1000MB/s to 6xxMB/s), is it that > >>trace-cmd will impact performace a lot? > > Yeah it has some overhead depending on how many events it has to > process. Your workload is quite sensitive to that. > > >>Any suggestions on how to run > >>the test regarding trace-cmd? i.e. should I aways run usemem under > >>trace-cmd or only when necessary? > > I'd run it with tracing only when the goal is to collect traces, but > not for any performance comparisons. Also it's not useful to collect > perf data while also tracing. > > >I just run the test with the base tree and with this patch series > >applied(head), I didn't use trace-cmd this time. > > > >The throughput for base tree is 963MB/s while the head is 815MB/s, I > >have attached pagetypeinfo/proc-vmstat/perf-profile for them. > > The compact stats improvements look fine, perhaps better than in my tests: > > base: compact_migrate_scanned 3476360 > head: compact_migrate_scanned 1020827 > > - that's the eager skipping of patch 2 > > base: compact_free_scanned 5924928 > head: compact_free_scanned 0 > compact_free_direct 918813 > compact_free_direct_miss 500308 > > As your workload does exclusively async direct compaction through > THP faults, the traditional free scanner isn't used at all. 
Direct > allocations should be much cheaper, although the "miss" ratio (the > allocations that were from the same pageblock as the one we are > compacting) is quite high. I should probably look into making > migration release pages to the tails of the freelists - could be > that it's grabbing the very pages that were just freed in the > previous COMPACT_CLUSTER_MAX cycle (modulo pcplist buffering). > > I however find it strange that your original stats (4.3?) differ > from the base so much: > > compact_migrate_scanned 1982396 > compact_free_scanned 40576943 > > That was an order of magnitude more free scanned on 4.3, and half the > migrate scanned. But your throughput figures in the other mail > suggested a regression from 4.3 to 4.4, which would be the opposite > of what the stats say. And anyway, compaction code didn't change > between 4.3 and 4.4 except changes to tracepoint format... > > moving on... > base: > compact_isolated 731304 > compact_stall 10561 > compact_fail 9459 > compact_success 1102 > > head: > compact_isolated 921087 > compact_stall 14451 > compact_fail 12550 > compact_success 1901 > > More success in both isolation and compaction results. > > base: > thp_fault_alloc 45337 > thp_fault_fallback 2349 > > head: > thp_fault_alloc 45564 > thp_fault_fallback 2120 > > Somehow the extra compact success didn't fully translate to thp > alloc success... But given how many of the alloc's didn't even > involve a compact_stall (two thirds of them), that interpretation > could also be easily misleading. So, hard to say. > > Looking at the perf profiles... > base: > 54.55%  54.55%  :1550 [kernel.kallsyms] [k] > pageblock_pfn_to_page > > head: > 40.13%  40.13%  :1551 [kernel.kallsyms] [k] > pageblock_pfn_to_page > > Since the freepage allocation doesn't hit this code anymore, it > shows that the bulk was actually from the migration scanner, > although the perf callgraph and vmstats suggested otherwise. It looks like the overhead still remains.
I guess that the migration scanner would call pageblock_pfn_to_page() for a more extended range, so the overhead still remains. I have an idea to solve this problem. Aaron, could you test the following patch on top of base? It tries to skip calling pageblock_pfn_to_page() if we check that the zone is contiguous at the initialization stage. Thanks.

>8

From 9c4fbf8f8ed37eb88a04a97908e76ba2437404a2 Mon Sep 17 00:00:00 2001
From: Joonsoo Kim
Date: Mon, 7 Dec 2015 14:51:42 +0900
Subject: [PATCH] mm/compaction: Optimize pageblock_pfn_to_page() for contiguous zone

Signed-off-by: Joonsoo Kim
---
 include/linux/mmzone.h |  1 +
 mm/compaction.c        | 35 ++-
 2 files changed, 35 insertions(+), 1 deletion(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index e23a9e7..573f9a9 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -521,6 +521,7 @@ struct zone {
 #endif
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
+	int contiguous;
 	/* Set to true when the PG_migrate_skip bits should be cleared */
Re: [RFC 0/3] reduce latency of direct async compaction
On 12/04/2015 07:25 AM, Aaron Lu wrote: On Thu, Dec 03, 2015 at 09:10:44AM +0100, Vlastimil Babka wrote: Aaron, could you try this on your testcase? One time result isn't stable enough, so I did 9 runs for each commit, here is the result: base: 25364a9e54fb8296837061bf684b76d20eec01fb head: 7433b1009ff5a02e1e9f3444802daba2cf385d27 (head = base + this_patch_serie) The always-always case(transparent_hugepage set to always and defrag set to always): Result for base: $ cat {0..8}/swap cmdline: /lkp/aaron/src/bin/usemem 10622592 10622592 transferred in 103 seconds, throughput: 925 MB/s cmdline: /lkp/aaron/src/bin/usemem 9559680 9559680 transferred in 92 seconds, throughput: 1036 MB/s cmdline: /lkp/aaron/src/bin/usemem 6171264 6171264 transferred in 92 seconds, throughput: 1036 MB/s cmdline: /lkp/aaron/src/bin/usemem 15663744 15663744 transferred in 150 seconds, throughput: 635 MB/s cmdline: /lkp/aaron/src/bin/usemem 12966528 12966528 transferred in 87 seconds, throughput: 1096 MB/s cmdline: /lkp/aaron/src/bin/usemem 5784192 5784192 transferred in 131 seconds, throughput: 727 MB/s cmdline: /lkp/aaron/src/bin/usemem 13731456 13731456 transferred in 97 seconds, throughput: 983 MB/s cmdline: /lkp/aaron/src/bin/usemem 16440960 16440960 transferred in 109 seconds, throughput: 874 MB/s cmdline: /lkp/aaron/src/bin/usemem 8813184 8813184 transferred in 122 seconds, throughput: 781 MB/s Max: 1096 MB/s Min: 635 MB/s Avg: 899 MB/s Result for head: $ cat {0..8}/swap cmdline: /lkp/aaron/src/bin/usemem 13163136 13163136 transferred in 105 seconds, throughput: 908 MB/s cmdline: /lkp/aaron/src/bin/usemem 8524416 8524416 transferred in 78 seconds, throughput: 1222 MB/s cmdline: /lkp/aaron/src/bin/usemem 3646080 3646080 transferred in 108 seconds, throughput: 882 MB/s cmdline: /lkp/aaron/src/bin/usemem 8936064 8936064 transferred in 114 seconds, throughput: 836 MB/s cmdline: /lkp/aaron/src/bin/usemem 12204672 12204672 transferred in 73 seconds, throughput: 1306 MB/s cmdline: 
/lkp/aaron/src/bin/usemem 8140416 8140416 transferred in 146 seconds, throughput: 653 MB/s cmdline: /lkp/aaron/src/bin/usemem 12941952 12941952 transferred in 78 seconds, throughput: 1222 MB/s cmdline: /lkp/aaron/src/bin/usemem 6917760 6917760 transferred in 109 seconds, throughput: 874 MB/s cmdline: /lkp/aaron/src/bin/usemem 11405952 11405952 transferred in 96 seconds, throughput: 993 MB/s Max: 1306 MB/s Min: 653 MB/s Avg: 988 MB/s Ok that looks better than the first results :) The series either helped, or it's just noise. But hopefully not worse. Result for v4.3 as a reference: $ cat {0..8}/swap cmdline: /lkp/aaron/src/bin/usemem 12459648 12459648 transferred in 96 seconds, throughput: 993 MB/s cmdline: /lkp/aaron/src/bin/usemem 7375488 7375488 transferred in 96 seconds, throughput: 993 MB/s cmdline: /lkp/aaron/src/bin/usemem 9028224 9028224 transferred in 107 seconds, throughput: 891 MB/s cmdline: /lkp/aaron/src/bin/usemem 10137216 10137216 transferred in 91 seconds, throughput: 1047 MB/s cmdline: /lkp/aaron/src/bin/usemem 13835904 13835904 transferred in 80 seconds, throughput: 1192 MB/s cmdline: /lkp/aaron/src/bin/usemem 10143360 10143360 transferred in 96 seconds, throughput: 993 MB/s cmdline: /lkp/aaron/src/bin/usemem 100020593664 100020593664 transferred in 101 seconds, throughput: 944 MB/s cmdline: /lkp/aaron/src/bin/usemem 15805056 15805056 transferred in 87 seconds, throughput: 1096 MB/s cmdline: /lkp/aaron/src/bin/usemem 18360960 18360960 transferred in 74 seconds, throughput: 1288 MB/s Max: 1288 MB/s Min: 891 MB/s Avg: 1048 MB/s Hard to say if there's actual regression from 4.3 to 4.4, it's too noisy. More iterations could help, but then the eventual bisection would need them too. 
The always-never case: Result for head: $ cat {0..8}/swap cmdline: /lkp/aaron/src/bin/usemem 13940352 13940352 transferred in 71 seconds, throughput: 1343 MB/s cmdline: /lkp/aaron/src/bin/usemem 17411712 17411712 transferred in 62 seconds, throughput: 1538 MB/s cmdline: /lkp/aaron/src/bin/usemem 11875968 11875968 transferred in 64 seconds, throughput: 1490 MB/s cmdline: /lkp/aaron/src/bin/usemem 13912704 13912704 transferred in 62 seconds, throughput: 1538 MB/s cmdline: /lkp/aaron/src/bin/usemem 12238464 12238464 transferred in 66 seconds, throughput: 1444 MB/s cmdline: /lkp/aaron/src/bin/usemem 13670016 13670016 transferred in 65 seconds, throughput: 1467 MB/s cmdline: /lkp/aaron/src/bin/usemem 8364672 8364672 transferred in 68 seconds, throughput: 1402 MB/s cmdline: /lkp/aaron/src/bin/usemem 15417984 15417984 transferred in 70 seconds, throughput: 1362 MB/s cmdline: /lkp/aaron/src/bin/usemem 15304320 15304320 transferred in 64 seconds, throughput: 1490 MB/s Max: 1538
Re: [RFC 0/3] reduce latency of direct async compaction
On 12/03/2015 12:52 PM, Aaron Lu wrote: On Thu, Dec 03, 2015 at 07:35:08PM +0800, Aaron Lu wrote: On Thu, Dec 03, 2015 at 10:38:50AM +0100, Vlastimil Babka wrote: On 12/03/2015 10:25 AM, Aaron Lu wrote: On Thu, Dec 03, 2015 at 09:10:44AM +0100, Vlastimil Babka wrote: My bad, I uploaded the wrong data :-/ I uploaded again: https://drive.google.com/file/d/0B49uX3igf4K4UFI4TEQ3THYta0E And I just ran the base tree with trace-cmd and found that its performance drops significantly (from 1000MB/s to 6xxMB/s), is it that trace-cmd will impact performance a lot? Yeah it has some overhead depending on how many events it has to process. Your workload is quite sensitive to that. Any suggestions on how to run the test regarding trace-cmd? i.e. should I always run usemem under trace-cmd or only when necessary? I'd run it with tracing only when the goal is to collect traces, but not for any performance comparisons. Also it's not useful to collect perf data while also tracing. I just ran the test with the base tree and with this patch series applied (head), I didn't use trace-cmd this time. The throughput for the base tree is 963MB/s while the head is 815MB/s, I have attached pagetypeinfo/proc-vmstat/perf-profile for them. The compact stats improvements look fine, perhaps better than in my tests: base: compact_migrate_scanned 3476360 head: compact_migrate_scanned 1020827 - that's the eager skipping of patch 2 base: compact_free_scanned 5924928 head: compact_free_scanned 0 compact_free_direct 918813 compact_free_direct_miss 500308 As your workload does exclusively async direct compaction through THP faults, the traditional free scanner isn't used at all. Direct allocations should be much cheaper, although the "miss" ratio (the allocations that were from the same pageblock as the one we are compacting) is quite high.
I should probably look into making migration release pages to the tails of the freelists - could be that it's grabbing the very pages that were just freed in the previous COMPACT_CLUSTER_MAX cycle (modulo pcplist buffering). I however find it strange that your original stats (4.3?) differ from the base so much: compact_migrate_scanned 1982396 compact_free_scanned 40576943 That was an order of magnitude more free scanned on 4.3, and half the migrate scanned. But your throughput figures in the other mail suggested a regression from 4.3 to 4.4, which would be the opposite of what the stats say. And anyway, compaction code didn't change between 4.3 and 4.4 except changes to tracepoint format... moving on... base: compact_isolated 731304 compact_stall 10561 compact_fail 9459 compact_success 1102 head: compact_isolated 921087 compact_stall 14451 compact_fail 12550 compact_success 1901 More success in both isolation and compaction results. base: thp_fault_alloc 45337 thp_fault_fallback 2349 head: thp_fault_alloc 45564 thp_fault_fallback 2120 Somehow the extra compact success didn't fully translate to thp alloc success... But given how many of the alloc's didn't even involve a compact_stall (two thirds of them), that interpretation could also be easily misleading. So, hard to say. Looking at the perf profiles... base: 54.55%  54.55%  :1550 [kernel.kallsyms] [k] pageblock_pfn_to_page head: 40.13%  40.13%  :1551 [kernel.kallsyms] [k] pageblock_pfn_to_page Since the freepage allocation doesn't hit this code anymore, it shows that the bulk was actually from the migration scanner, although the perf callgraph and vmstats suggested otherwise. However, vmstats count only when the scanner actually enters the pageblock, and there are numerous reasons why it wouldn't... For example the pageblock_skip bitmap. Could it make sense to look at the bitmap before doing the pfn_to_page translation? I don't see much else in the profiles.
I guess the remaining problem of compaction here is that deferring compaction doesn't trigger for async compaction, and this testcase doesn't hit sync compaction at all. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC 0/3] reduce latency of direct async compaction
On Thu, Dec 03, 2015 at 09:10:44AM +0100, Vlastimil Babka wrote:
> Aaron, could you try this on your testcase?

A single run isn't stable enough, so I did 9 runs for each commit. Here are the results:

base: 25364a9e54fb8296837061bf684b76d20eec01fb
head: 7433b1009ff5a02e1e9f3444802daba2cf385d27 (head = base + this patch series)

The always-always case (transparent_hugepage set to always and defrag set to always):

Result for base:
$ cat {0..8}/swap
cmdline: /lkp/aaron/src/bin/usemem 10622592
10622592 transferred in 103 seconds, throughput: 925 MB/s
cmdline: /lkp/aaron/src/bin/usemem 9559680
9559680 transferred in 92 seconds, throughput: 1036 MB/s
cmdline: /lkp/aaron/src/bin/usemem 6171264
6171264 transferred in 92 seconds, throughput: 1036 MB/s
cmdline: /lkp/aaron/src/bin/usemem 15663744
15663744 transferred in 150 seconds, throughput: 635 MB/s
cmdline: /lkp/aaron/src/bin/usemem 12966528
12966528 transferred in 87 seconds, throughput: 1096 MB/s
cmdline: /lkp/aaron/src/bin/usemem 5784192
5784192 transferred in 131 seconds, throughput: 727 MB/s
cmdline: /lkp/aaron/src/bin/usemem 13731456
13731456 transferred in 97 seconds, throughput: 983 MB/s
cmdline: /lkp/aaron/src/bin/usemem 16440960
16440960 transferred in 109 seconds, throughput: 874 MB/s
cmdline: /lkp/aaron/src/bin/usemem 8813184
8813184 transferred in 122 seconds, throughput: 781 MB/s
Max: 1096 MB/s
Min: 635 MB/s
Avg: 899 MB/s

Result for head:
$ cat {0..8}/swap
cmdline: /lkp/aaron/src/bin/usemem 13163136
13163136 transferred in 105 seconds, throughput: 908 MB/s
cmdline: /lkp/aaron/src/bin/usemem 8524416
8524416 transferred in 78 seconds, throughput: 1222 MB/s
cmdline: /lkp/aaron/src/bin/usemem 3646080
3646080 transferred in 108 seconds, throughput: 882 MB/s
cmdline: /lkp/aaron/src/bin/usemem 8936064
8936064 transferred in 114 seconds, throughput: 836 MB/s
cmdline: /lkp/aaron/src/bin/usemem 12204672
12204672 transferred in 73 seconds, throughput: 1306 MB/s
cmdline: /lkp/aaron/src/bin/usemem 8140416
8140416 transferred in 146 seconds, throughput: 653 MB/s
cmdline: /lkp/aaron/src/bin/usemem 12941952
12941952 transferred in 78 seconds, throughput: 1222 MB/s
cmdline: /lkp/aaron/src/bin/usemem 6917760
6917760 transferred in 109 seconds, throughput: 874 MB/s
cmdline: /lkp/aaron/src/bin/usemem 11405952
11405952 transferred in 96 seconds, throughput: 993 MB/s
Max: 1306 MB/s
Min: 653 MB/s
Avg: 988 MB/s

Result for v4.3 as a reference:
$ cat {0..8}/swap
cmdline: /lkp/aaron/src/bin/usemem 12459648
12459648 transferred in 96 seconds, throughput: 993 MB/s
cmdline: /lkp/aaron/src/bin/usemem 7375488
7375488 transferred in 96 seconds, throughput: 993 MB/s
cmdline: /lkp/aaron/src/bin/usemem 9028224
9028224 transferred in 107 seconds, throughput: 891 MB/s
cmdline: /lkp/aaron/src/bin/usemem 10137216
10137216 transferred in 91 seconds, throughput: 1047 MB/s
cmdline: /lkp/aaron/src/bin/usemem 13835904
13835904 transferred in 80 seconds, throughput: 1192 MB/s
cmdline: /lkp/aaron/src/bin/usemem 10143360
10143360 transferred in 96 seconds, throughput: 993 MB/s
cmdline: /lkp/aaron/src/bin/usemem 100020593664
100020593664 transferred in 101 seconds, throughput: 944 MB/s
cmdline: /lkp/aaron/src/bin/usemem 15805056
15805056 transferred in 87 seconds, throughput: 1096 MB/s
cmdline: /lkp/aaron/src/bin/usemem 18360960
18360960 transferred in 74 seconds, throughput: 1288 MB/s
Max: 1288 MB/s
Min: 891 MB/s
Avg: 1048 MB/s

The always-never case:

Result for head:
$ cat {0..8}/swap
cmdline: /lkp/aaron/src/bin/usemem 13940352
13940352 transferred in 71 seconds, throughput: 1343 MB/s
cmdline: /lkp/aaron/src/bin/usemem 17411712
17411712 transferred in 62 seconds, throughput: 1538 MB/s
cmdline: /lkp/aaron/src/bin/usemem 11875968
11875968 transferred in 64 seconds, throughput: 1490 MB/s
cmdline: /lkp/aaron/src/bin/usemem 13912704
13912704 transferred in 62 seconds, throughput: 1538 MB/s
cmdline: /lkp/aaron/src/bin/usemem 12238464
12238464 transferred in 66 seconds, throughput: 1444 MB/s
cmdline: /lkp/aaron/src/bin/usemem 13670016
13670016 transferred in 65 seconds, throughput: 1467 MB/s
cmdline: /lkp/aaron/src/bin/usemem 8364672
8364672 transferred in 68 seconds, throughput: 1402 MB/s
cmdline: /lkp/aaron/src/bin/usemem 15417984
15417984 transferred in 70 seconds, throughput: 1362 MB/s
cmdline: /lkp/aaron/src/bin/usemem 15304320
15304320 transferred in 64 seconds, throughput: 1490 MB/s
Max: 1538 MB/s
Min: 1343 MB/s
Avg: 1452 MB/s

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
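[The Max/Min/Avg lines in the results above can be recomputed from the raw `cat {0..8}/swap` output with a short script. This is only a sketch, not the script the test harness actually used; it assumes the `throughput: N MB/s` format shown in the quoted output:]

```python
import re

def summarize(swap_output):
    """Pull the 'throughput: N MB/s' values out of the usemem output
    and return (max, min, integer average), all in MB/s."""
    vals = [int(v) for v in
            re.findall(r"throughput:\s*(\d+)\s*MB/s", swap_output)]
    return max(vals), min(vals), sum(vals) // len(vals)

# Example with the nine base-tree throughputs reported above:
base = " ".join(f"throughput: {v} MB/s" for v in
                (925, 1036, 1036, 635, 1096, 727, 983, 874, 781))
print(summarize(base))  # (1096, 635, 899), matching the reported Max/Min/Avg
```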
Re: [RFC 0/3] reduce latency of direct async compaction
On Thu, Dec 03, 2015 at 10:38:50AM +0100, Vlastimil Babka wrote:
> On 12/03/2015 10:25 AM, Aaron Lu wrote:
> > On Thu, Dec 03, 2015 at 09:10:44AM +0100, Vlastimil Babka wrote:
> >> Aaron, could you try this on your testcase?
> >
> > The test result is placed at:
> > https://drive.google.com/file/d/0B49uX3igf4K4enBkdVFScXhFM0U
> >
> > For some reason, the patches made the performance worse. The base tree is
> > today's Linus git 25364a9e54fb8296837061bf684b76d20eec01fb, and its
> > performance is about 1000 MB/s. After applying this patch series, the
> > performance drops to 720 MB/s.
> >
> > Please let me know if you need more information, thanks.
>
> Hm, compaction stats are at 0. The code in the patches isn't even running.
> Can you provide the same data also for the base tree?

My bad, I uploaded the wrong data :-/ I have uploaded it again:
https://drive.google.com/file/d/0B49uX3igf4K4UFI4TEQ3THYta0E

I also just ran the base tree under trace-cmd and found that its performance drops significantly (from 1000 MB/s to 6xx MB/s). Does trace-cmd impact performance that much? Any suggestions on how to run the test with respect to trace-cmd, i.e. should I always run usemem under trace-cmd, or only when necessary?

Thanks,
Aaron
Re: [RFC 0/3] reduce latency of direct async compaction
On 12/03/2015 10:25 AM, Aaron Lu wrote:
> On Thu, Dec 03, 2015 at 09:10:44AM +0100, Vlastimil Babka wrote:
>> Aaron, could you try this on your testcase?
>
> The test result is placed at:
> https://drive.google.com/file/d/0B49uX3igf4K4enBkdVFScXhFM0U
>
> For some reason, the patches made the performance worse. The base tree is
> today's Linus git 25364a9e54fb8296837061bf684b76d20eec01fb, and its
> performance is about 1000 MB/s. After applying this patch series, the
> performance drops to 720 MB/s.
>
> Please let me know if you need more information, thanks.

Hm, compaction stats are at 0. The code in the patches isn't even running. Can you provide the same data also for the base tree?
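[For reference, the compaction stats in question are the `compact_*` counters exposed in `/proc/vmstat` (one `name value` pair per line; counter names such as `compact_stall` are those of v4.x kernels). A quick way to check whether the patched code ran at all is to diff a snapshot taken before the workload against one taken after; an all-zero delta means direct compaction was never entered. A minimal sketch, not anything posted in this thread:]

```python
def compaction_delta(before, after):
    """Given two /proc/vmstat snapshots as text ('name value' per line),
    return the compact_* counters that changed, with their deltas.
    An empty dict means the compaction code never ran in between."""
    def parse(text):
        return {name: int(val) for name, val in
                (line.split() for line in text.splitlines() if line.strip())}
    b, a = parse(before), parse(after)
    return {k: a[k] - b.get(k, 0)
            for k in a if k.startswith("compact_") and a[k] != b.get(k, 0)}
```

[In practice this amounts to `grep compact /proc/vmstat` before and after the run and comparing the two.]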
Re: [RFC 0/3] reduce latency of direct async compaction
On Thu, Dec 03, 2015 at 09:10:44AM +0100, Vlastimil Babka wrote:
> Aaron, could you try this on your testcase?

The test result is placed at:
https://drive.google.com/file/d/0B49uX3igf4K4enBkdVFScXhFM0U

For some reason, the patches made the performance worse. The base tree is today's Linus git 25364a9e54fb8296837061bf684b76d20eec01fb, and its performance is about 1000 MB/s. After applying this patch series, the performance drops to 720 MB/s.

Please let me know if you need more information, thanks.