Re: [PATCH v3 1/1] mm: introduce put_user_page*(), placeholder versions

2019-03-13 Thread John Hubbard

On 3/13/19 7:49 AM, Ira Weiny wrote:

On Tue, Mar 12, 2019 at 05:38:55PM -0700, John Hubbard wrote:

On 3/12/19 8:30 AM, Ira Weiny wrote:

On Wed, Mar 06, 2019 at 03:54:55PM -0800, john.hubb...@gmail.com wrote:

From: John Hubbard 

Introduces put_user_page(), which simply calls put_page().
This provides a way to update all get_user_pages*() callers,
so that they call put_user_page(), instead of put_page().


So I've been running with these patches for a while but today while ramping up
my testing I hit the following:

[ 1355.557819] [ cut here ]
[ 1355.563436] get_user_pages pin count overflowed


Hi Ira,

Thanks for reporting this. That overflow, at face value, means that we've
used more than the 22 bits worth of gup pin counts, so about 4 million pins
of the same page...


This is my bug in the patches I'm playing with.  Somehow I'm causing more puts
than gets...  I'm not sure how but this is for sure my problem.

Backing off to your patch set, the numbers are good.


Now that's a welcome bit of good news!



Sorry for the noise.

With the testing I've done today I feel comfortable adding

Tested-by: Ira Weiny 

For the main GUP and InfiniBand patches.

Ira



OK, I'll add your tested-by tag to patches 1, 2, 4, 5 (the numbering refers
to the "RFC v2: mm: gup/dma tracking" posting [1]) in my repo [2], and they'll
show up in the next posting. (Patch 3 is already upstream, and patch 6 is
documentation that needs to be rewritten entirely.)

[1] https://lore.kernel.org/r/20190204052135.25784-1-jhubb...@nvidia.com

[2] https://github.com/johnhubbard/linux/tree/gup_dma_core

thanks,
--
John Hubbard
NVIDIA


Re: [PATCH v3 1/1] mm: introduce put_user_page*(), placeholder versions

2019-03-13 Thread Ira Weiny
On Tue, Mar 12, 2019 at 05:38:55PM -0700, John Hubbard wrote:
> On 3/12/19 8:30 AM, Ira Weiny wrote:
> > On Wed, Mar 06, 2019 at 03:54:55PM -0800, john.hubb...@gmail.com wrote:
> > > From: John Hubbard 
> > > 
> > > Introduces put_user_page(), which simply calls put_page().
> > > This provides a way to update all get_user_pages*() callers,
> > > so that they call put_user_page(), instead of put_page().
> > 
> > So I've been running with these patches for a while but today while ramping up
> > my testing I hit the following:
> > 
> > [ 1355.557819] [ cut here ]
> > [ 1355.563436] get_user_pages pin count overflowed
> 
> Hi Ira,
> 
> Thanks for reporting this. That overflow, at face value, means that we've
> used more than the 22 bits worth of gup pin counts, so about 4 million pins
> of the same page...

This is my bug in the patches I'm playing with.  Somehow I'm causing more puts
than gets...  I'm not sure how but this is for sure my problem.

Backing off to your patch set, the numbers are good.

Sorry for the noise.

With the testing I've done today I feel comfortable adding

Tested-by: Ira Weiny 

For the main GUP and InfiniBand patches.

Ira

> 
> > [ 1355.563446] WARNING: CPU: 1 PID: 1740 at mm/gup.c:73 get_gup_pin_page+0xa5/0xb0
> > [ 1355.577391] Modules linked in: ib_isert iscsi_target_mod ib_srpt target_core_mod ib_srp scsi_transport_srp ext4 mbcache jbd2 mlx4_ib opa_vnic rpcrdma sunrpc rdma_ucm ib_iser rdma_cm ib_umad iw_cm libiscsi ib_ipoib scsi_transport_iscsi ib_cm sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm irqbypass snd_hda_codec_realtek ib_uverbs snd_hda_codec_generic crct10dif_pclmul ledtrig_audio snd_hda_intel crc32_pclmul snd_hda_codec snd_hda_core ghash_clmulni_intel snd_hwdep snd_pcm aesni_intel crypto_simd snd_timer ib_core cryptd snd glue_helper dax_pmem soundcore nd_pmem ipmi_si device_dax nd_btt ioatdma nd_e820 ipmi_devintf ipmi_msghandler iTCO_wdt i2c_i801 iTCO_vendor_support libnvdimm pcspkr lpc_ich mei_me mei mfd_core wmi pcc_cpufreq acpi_cpufreq sch_fq_codel xfs libcrc32c mlx4_en sr_mod cdrom sd_mod mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops mlx4_core ttm crc32c_intel igb isci ahci dca libsas firewire_ohci drm i2c_algo_bit libahci scsi_transport_sas
> > [ 1355.577429]  firewire_core crc_itu_t i2c_core libata dm_mod [last unloaded: rdmavt]
> > [ 1355.686703] CPU: 1 PID: 1740 Comm: reg-mr Not tainted 5.0.0+ #10
> > [ 1355.693851] Hardware name: Intel Corporation W2600CR/W2600CR, BIOS SE5C600.86B.02.04.0003.102320141138 10/23/2014
> > [ 1355.705750] RIP: 0010:get_gup_pin_page+0xa5/0xb0
> > [ 1355.711348] Code: e8 40 02 ff ff 80 3d ba a2 fb 00 00 b8 b5 ff ff ff 75 bb 48 c7 c7 48 0a e9 81 89 44 24 04 c6 05 a1 a2 fb 00 01 e8 35 63 e8 ff <0f> 0b 8b 44 24 04 eb 9c 0f 1f 00 66 66 66 66 90 41 57 49 bf 00 00
> > [ 1355.733244] RSP: 0018:c90005a23b30 EFLAGS: 00010286
> > [ 1355.739536] RAX:  RBX: ea001422 RCX: 
> > [ 1355.748005] RDX: 0003 RSI: 827d94a3 RDI: 0246
> > [ 1355.756453] RBP: ea001422 R08: 0002 R09: 00022400
> > [ 1355.764907] R10: 0009ccf0ad0c4203 R11: 0001 R12: 00010207
> > [ 1355.773369] R13: 8884130b7040 R14: fff00fff R15: 000fffe0
> > [ 1355.781836] FS:  7f2680d0d740() GS:88842e84() knlGS:
> > [ 1355.791384] CS:  0010 DS:  ES:  CR0: 80050033
> > [ 1355.798319] CR2: 00589000 CR3: 00040b05e004 CR4: 000606e0
> > [ 1355.806809] Call Trace:
> > [ 1355.810078]  follow_page_pte+0x4f3/0x5c0
> > [ 1355.814987]  __get_user_pages+0x1eb/0x730
> > [ 1355.820020]  get_user_pages+0x3e/0x50
> > [ 1355.824657]  ib_umem_get+0x283/0x500 [ib_uverbs]
> > [ 1355.830340]  ? _cond_resched+0x15/0x30
> > [ 1355.835065]  mlx4_ib_reg_user_mr+0x75/0x1e0 [mlx4_ib]
> > [ 1355.841235]  ib_uverbs_reg_mr+0x10c/0x220 [ib_uverbs]
> > [ 1355.847400]  ib_uverbs_write+0x2f9/0x4d0 [ib_uverbs]
> > [ 1355.853473]  __vfs_write+0x36/0x1b0
> > [ 1355.857904]  ? selinux_file_permission+0xf0/0x130
> > [ 1355.863702]  ? security_file_permission+0x2e/0xe0
> > [ 1355.869503]  vfs_write+0xa5/0x1a0
> > [ 1355.873751]  ksys_write+0x4f/0xb0
> > [ 1355.878009]  do_syscall_64+0x5b/0x180
> > [ 1355.882656]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > [ 1355.62] RIP: 0033:0x7f2680ec3ed8
> > [ 1355.893420] Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 45 78 0d 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 49 89 d4 55
> > [ 1355.915573] RSP: 002b:7ffe65d50bc8 EFLAGS: 0246 ORIG_RAX: 0001
> > [ 1355.924621] RAX: ffda RBX: 7ffe65d50c74 RCX: 

Re: [PATCH v3 1/1] mm: introduce put_user_page*(), placeholder versions

2019-03-12 Thread John Hubbard

On 3/12/19 8:30 AM, Ira Weiny wrote:

On Wed, Mar 06, 2019 at 03:54:55PM -0800, john.hubb...@gmail.com wrote:

From: John Hubbard 

Introduces put_user_page(), which simply calls put_page().
This provides a way to update all get_user_pages*() callers,
so that they call put_user_page(), instead of put_page().


So I've been running with these patches for a while but today while ramping up
my testing I hit the following:

[ 1355.557819] [ cut here ]
[ 1355.563436] get_user_pages pin count overflowed


Hi Ira,

Thanks for reporting this. That overflow, at face value, means that we've
used more than the 22 bits worth of gup pin counts, so about 4 million pins
of the same page...


[ 1355.563446] WARNING: CPU: 1 PID: 1740 at mm/gup.c:73 get_gup_pin_page+0xa5/0xb0
[ 1355.577391] Modules linked in: ib_isert iscsi_target_mod ib_srpt target_core_mod ib_srp scsi_transport_srp ext4 mbcache jbd2 mlx4_ib opa_vnic rpcrdma sunrpc rdma_ucm ib_iser rdma_cm ib_umad iw_cm libiscsi ib_ipoib scsi_transport_iscsi ib_cm sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm irqbypass snd_hda_codec_realtek ib_uverbs snd_hda_codec_generic crct10dif_pclmul ledtrig_audio snd_hda_intel crc32_pclmul snd_hda_codec snd_hda_core ghash_clmulni_intel snd_hwdep snd_pcm aesni_intel crypto_simd snd_timer ib_core cryptd snd glue_helper dax_pmem soundcore nd_pmem ipmi_si device_dax nd_btt ioatdma nd_e820 ipmi_devintf ipmi_msghandler iTCO_wdt i2c_i801 iTCO_vendor_support libnvdimm pcspkr lpc_ich mei_me mei mfd_core wmi pcc_cpufreq acpi_cpufreq sch_fq_codel xfs libcrc32c mlx4_en sr_mod cdrom sd_mod mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops mlx4_core ttm crc32c_intel igb isci ahci dca libsas firewire_ohci drm i2c_algo_bit libahci scsi_transport_sas
[ 1355.577429]  firewire_core crc_itu_t i2c_core libata dm_mod [last unloaded: rdmavt]
[ 1355.686703] CPU: 1 PID: 1740 Comm: reg-mr Not tainted 5.0.0+ #10
[ 1355.693851] Hardware name: Intel Corporation W2600CR/W2600CR, BIOS SE5C600.86B.02.04.0003.102320141138 10/23/2014
[ 1355.705750] RIP: 0010:get_gup_pin_page+0xa5/0xb0
[ 1355.711348] Code: e8 40 02 ff ff 80 3d ba a2 fb 00 00 b8 b5 ff ff ff 75 bb 48 c7 c7 48 0a e9 81 89 44 24 04 c6 05 a1 a2 fb 00 01 e8 35 63 e8 ff <0f> 0b 8b 44 24 04 eb 9c 0f 1f 00 66 66 66 66 90 41 57 49 bf 00 00
[ 1355.733244] RSP: 0018:c90005a23b30 EFLAGS: 00010286
[ 1355.739536] RAX:  RBX: ea001422 RCX: 
[ 1355.748005] RDX: 0003 RSI: 827d94a3 RDI: 0246
[ 1355.756453] RBP: ea001422 R08: 0002 R09: 00022400
[ 1355.764907] R10: 0009ccf0ad0c4203 R11: 0001 R12: 00010207
[ 1355.773369] R13: 8884130b7040 R14: fff00fff R15: 000fffe0
[ 1355.781836] FS:  7f2680d0d740() GS:88842e84() knlGS:
[ 1355.791384] CS:  0010 DS:  ES:  CR0: 80050033
[ 1355.798319] CR2: 00589000 CR3: 00040b05e004 CR4: 000606e0
[ 1355.806809] Call Trace:
[ 1355.810078]  follow_page_pte+0x4f3/0x5c0
[ 1355.814987]  __get_user_pages+0x1eb/0x730
[ 1355.820020]  get_user_pages+0x3e/0x50
[ 1355.824657]  ib_umem_get+0x283/0x500 [ib_uverbs]
[ 1355.830340]  ? _cond_resched+0x15/0x30
[ 1355.835065]  mlx4_ib_reg_user_mr+0x75/0x1e0 [mlx4_ib]
[ 1355.841235]  ib_uverbs_reg_mr+0x10c/0x220 [ib_uverbs]
[ 1355.847400]  ib_uverbs_write+0x2f9/0x4d0 [ib_uverbs]
[ 1355.853473]  __vfs_write+0x36/0x1b0
[ 1355.857904]  ? selinux_file_permission+0xf0/0x130
[ 1355.863702]  ? security_file_permission+0x2e/0xe0
[ 1355.869503]  vfs_write+0xa5/0x1a0
[ 1355.873751]  ksys_write+0x4f/0xb0
[ 1355.878009]  do_syscall_64+0x5b/0x180
[ 1355.882656]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1355.62] RIP: 0033:0x7f2680ec3ed8
[ 1355.893420] Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 45 78 0d 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 49 89 d4 55
[ 1355.915573] RSP: 002b:7ffe65d50bc8 EFLAGS: 0246 ORIG_RAX: 0001
[ 1355.924621] RAX: ffda RBX: 7ffe65d50c74 RCX: 7f2680ec3ed8
[ 1355.933195] RDX: 0030 RSI: 7ffe65d50c80 RDI: 0003
[ 1355.941760] RBP: 0030 R08: 0007 R09: 00581260
[ 1355.950326] R10:  R11: 0246 R12: 00581930
[ 1355.958885] R13: 000c R14: 00581260 R15: 
[ 1355.967430] ---[ end trace bc771ac6189977a2 ]---


I'm not sure what I did to cause this and I'm going to work on a reproducer.  At
the time of the Warning I only had 1 GUP user?!?!?!?!


If there is a get_user_pages() call that lacks a corresponding put_user_pages()
call, then the count could start working its way up, and up. Either that, or a
bug in my patches here, could cause this. The basic counting works correctly
in fio runs on an NVMe 

Re: [PATCH v3 1/1] mm: introduce put_user_page*(), placeholder versions

2019-03-12 Thread Ira Weiny
On Wed, Mar 06, 2019 at 03:54:55PM -0800, john.hubb...@gmail.com wrote:
> From: John Hubbard 
> 
> Introduces put_user_page(), which simply calls put_page().
> This provides a way to update all get_user_pages*() callers,
> so that they call put_user_page(), instead of put_page().

So I've been running with these patches for a while but today while ramping up
my testing I hit the following:

[ 1355.557819] [ cut here ]
[ 1355.563436] get_user_pages pin count overflowed
[ 1355.563446] WARNING: CPU: 1 PID: 1740 at mm/gup.c:73 get_gup_pin_page+0xa5/0xb0
[ 1355.577391] Modules linked in: ib_isert iscsi_target_mod ib_srpt target_core_mod ib_srp scsi_transport_srp ext4 mbcache jbd2 mlx4_ib opa_vnic rpcrdma sunrpc rdma_ucm ib_iser rdma_cm ib_umad iw_cm libiscsi ib_ipoib scsi_transport_iscsi ib_cm sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm irqbypass snd_hda_codec_realtek ib_uverbs snd_hda_codec_generic crct10dif_pclmul ledtrig_audio snd_hda_intel crc32_pclmul snd_hda_codec snd_hda_core ghash_clmulni_intel snd_hwdep snd_pcm aesni_intel crypto_simd snd_timer ib_core cryptd snd glue_helper dax_pmem soundcore nd_pmem ipmi_si device_dax nd_btt ioatdma nd_e820 ipmi_devintf ipmi_msghandler iTCO_wdt i2c_i801 iTCO_vendor_support libnvdimm pcspkr lpc_ich mei_me mei mfd_core wmi pcc_cpufreq acpi_cpufreq sch_fq_codel xfs libcrc32c mlx4_en sr_mod cdrom sd_mod mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops mlx4_core ttm crc32c_intel igb isci ahci dca libsas firewire_ohci drm i2c_algo_bit libahci scsi_transport_sas
[ 1355.577429]  firewire_core crc_itu_t i2c_core libata dm_mod [last unloaded: rdmavt]
[ 1355.686703] CPU: 1 PID: 1740 Comm: reg-mr Not tainted 5.0.0+ #10
[ 1355.693851] Hardware name: Intel Corporation W2600CR/W2600CR, BIOS SE5C600.86B.02.04.0003.102320141138 10/23/2014
[ 1355.705750] RIP: 0010:get_gup_pin_page+0xa5/0xb0
[ 1355.711348] Code: e8 40 02 ff ff 80 3d ba a2 fb 00 00 b8 b5 ff ff ff 75 bb 48 c7 c7 48 0a e9 81 89 44 24 04 c6 05 a1 a2 fb 00 01 e8 35 63 e8 ff <0f> 0b 8b 44 24 04 eb 9c 0f 1f 00 66 66 66 66 90 41 57 49 bf 00 00
[ 1355.733244] RSP: 0018:c90005a23b30 EFLAGS: 00010286
[ 1355.739536] RAX:  RBX: ea001422 RCX: 
[ 1355.748005] RDX: 0003 RSI: 827d94a3 RDI: 0246
[ 1355.756453] RBP: ea001422 R08: 0002 R09: 00022400
[ 1355.764907] R10: 0009ccf0ad0c4203 R11: 0001 R12: 00010207
[ 1355.773369] R13: 8884130b7040 R14: fff00fff R15: 000fffe0
[ 1355.781836] FS:  7f2680d0d740() GS:88842e84() knlGS:
[ 1355.791384] CS:  0010 DS:  ES:  CR0: 80050033
[ 1355.798319] CR2: 00589000 CR3: 00040b05e004 CR4: 000606e0
[ 1355.806809] Call Trace:
[ 1355.810078]  follow_page_pte+0x4f3/0x5c0
[ 1355.814987]  __get_user_pages+0x1eb/0x730
[ 1355.820020]  get_user_pages+0x3e/0x50
[ 1355.824657]  ib_umem_get+0x283/0x500 [ib_uverbs]
[ 1355.830340]  ? _cond_resched+0x15/0x30
[ 1355.835065]  mlx4_ib_reg_user_mr+0x75/0x1e0 [mlx4_ib]
[ 1355.841235]  ib_uverbs_reg_mr+0x10c/0x220 [ib_uverbs]
[ 1355.847400]  ib_uverbs_write+0x2f9/0x4d0 [ib_uverbs]
[ 1355.853473]  __vfs_write+0x36/0x1b0
[ 1355.857904]  ? selinux_file_permission+0xf0/0x130
[ 1355.863702]  ? security_file_permission+0x2e/0xe0
[ 1355.869503]  vfs_write+0xa5/0x1a0
[ 1355.873751]  ksys_write+0x4f/0xb0
[ 1355.878009]  do_syscall_64+0x5b/0x180
[ 1355.882656]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1355.62] RIP: 0033:0x7f2680ec3ed8
[ 1355.893420] Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 45 78 0d 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 49 89 d4 55
[ 1355.915573] RSP: 002b:7ffe65d50bc8 EFLAGS: 0246 ORIG_RAX: 0001
[ 1355.924621] RAX: ffda RBX: 7ffe65d50c74 RCX: 7f2680ec3ed8
[ 1355.933195] RDX: 0030 RSI: 7ffe65d50c80 RDI: 0003
[ 1355.941760] RBP: 0030 R08: 0007 R09: 00581260
[ 1355.950326] R10:  R11: 0246 R12: 00581930
[ 1355.958885] R13: 000c R14: 00581260 R15: 
[ 1355.967430] ---[ end trace bc771ac6189977a2 ]---


I'm not sure what I did to cause this and I'm going to work on a reproducer.  At
the time of the Warning I only had 1 GUP user?!?!?!?!

I'm not using ODP, so I don't think the changes we have discussed there are a
problem.

Ira

> 
> Also introduces put_user_pages(), and a few dirty/locked variations,
> as a replacement for release_pages(), and also as a replacement
> for open-coded loops that release multiple pages.
> These may be used for subsequent performance improvements,
> via batching of pages to be released.
> 
> This is the first step of fixing a problem (also described in [1] and
> [2]) with 

Re: [PATCH v3 1/1] mm: introduce put_user_page*(), placeholder versions

2019-03-08 Thread John Hubbard
On 3/8/19 9:57 AM, Jerome Glisse wrote:
[snip] 
> Just a small comment below that would help my life :)
> 
> Reviewed-by: Jérôme Glisse 
> 

Thanks for the review! 

>> ---
>>  include/linux/mm.h | 24 ++
>>  mm/swap.c  | 82 ++
> 
> Why not put those functions in gup.c instead of swap.c?

Yes, gup.c is better for these. And it passes the various cross compiler and
tinyconfig builds locally, so I think I'm not missing any cases. (The swap.c 
location was an artifact of very early approaches, pre-dating the
put_user_pages() name.) 

[snip]

>>  #define SECTION_IN_PAGE_FLAGS
>>  #endif
>> diff --git a/mm/swap.c b/mm/swap.c
>> index 4d7d37eb3c40..a6b4f693f46d 100644
>> --- a/mm/swap.c
>> +++ b/mm/swap.c
>> @@ -133,6 +133,88 @@ void put_pages_list(struct list_head *pages)
>>  }
>>  EXPORT_SYMBOL(put_pages_list);
>>  
>> +typedef int (*set_dirty_func)(struct page *page);
> 
> set_dirty_func_t would be better, as the rule is for typedefs to append
> _t; it also makes it easier for Coccinelle patches.
> 

Done. I'm posting a v4 in a moment, with both of the above, plus
Christopher's "real filesystems" wording change, and your reviewed-by
tag.


thanks,
-- 
John Hubbard
NVIDIA


Re: [PATCH v3 1/1] mm: introduce put_user_page*(), placeholder versions

2019-03-08 Thread Jerome Glisse
On Wed, Mar 06, 2019 at 03:54:55PM -0800, john.hubb...@gmail.com wrote:
> From: John Hubbard 
> 
> Introduces put_user_page(), which simply calls put_page().
> This provides a way to update all get_user_pages*() callers,
> so that they call put_user_page(), instead of put_page().
> 
> Also introduces put_user_pages(), and a few dirty/locked variations,
> as a replacement for release_pages(), and also as a replacement
> for open-coded loops that release multiple pages.
> These may be used for subsequent performance improvements,
> via batching of pages to be released.
> 
> This is the first step of fixing a problem (also described in [1] and
> [2]) with interactions between get_user_pages ("gup") and filesystems.
> 
> Problem description: let's start with a bug report. Below is what happens
> sometimes, under memory pressure, when a driver pins some pages via gup,
> and then marks those pages dirty, and releases them. Note that the gup
> documentation actually recommends that pattern. The problem is that the
> filesystem may do a writeback while the pages were gup-pinned, and then the
> filesystem believes that the pages are clean. So, when the driver later
> marks the pages as dirty, that conflicts with the filesystem's page
> tracking and results in a BUG(), like this one that I experienced:
> 
> kernel BUG at /build/linux-fQ94TU/linux-4.4.0/fs/ext4/inode.c:1899!
> backtrace:
> ext4_writepage
> __writepage
> write_cache_pages
> ext4_writepages
> do_writepages
> __writeback_single_inode
> writeback_sb_inodes
> __writeback_inodes_wb
> wb_writeback
> wb_workfn
> process_one_work
> worker_thread
> kthread
> ret_from_fork
> 
> ...which is due to the file system asserting that there are still buffer
> heads attached:
> 
> ({  \
> BUG_ON(!PagePrivate(page)); \
> ((struct buffer_head *)page_private(page)); \
> })
> 
> Dave Chinner's description of this is very clear:
> 
> "The fundamental issue is that ->page_mkwrite must be called on every
> write access to a clean file backed page, not just the first one.
> How long the GUP reference lasts is irrelevant, if the page is clean
> and you need to dirty it, you must call ->page_mkwrite before it is
> marked writeable and dirtied. Every. Time."
> 
> This is just one symptom of the larger design problem: filesystems do not
> actually support get_user_pages() being called on their pages, and letting
> hardware write directly to those pages--even though that pattern has been
> going on since about 2005 or so.
> 
> The steps to fix it are:
> 
> 1) (This patch): provide put_user_page*() routines, intended to be used
>for releasing pages that were pinned via get_user_pages*().
> 
> 2) Convert all of the call sites for get_user_pages*(), to
>invoke put_user_page*(), instead of put_page(). This involves dozens of
>call sites, and will take some time.
> 
> 3) After (2) is complete, use get_user_pages*() and put_user_page*() to
>implement tracking of these pages. This tracking will be separate from
>the existing struct page refcounting.
> 
> 4) Use the tracking and identification of these pages, to implement
>special handling (especially in writeback paths) when the pages are
>backed by a filesystem.
> 
> [1] https://lwn.net/Articles/774411/ : "DMA and get_user_pages()"
> [2] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"
> 
> Cc: Al Viro 
> Cc: Christoph Hellwig 
> Cc: Christopher Lameter 
> Cc: Dan Williams 
> Cc: Dave Chinner 
> Cc: Ira Weiny 
> Cc: Jan Kara 
> Cc: Jason Gunthorpe 
> Cc: Jerome Glisse 
> Cc: Matthew Wilcox 
> Cc: Michal Hocko 
> Cc: Mike Rapoport 
> Cc: Ralph Campbell 
> 
> Reviewed-by: Jan Kara 
> Reviewed-by: Mike Rapoport # docs
> Reviewed-by: Ira Weiny 
> Signed-off-by: John Hubbard 

Just a small comment below that would help my life :)

Reviewed-by: Jérôme Glisse 

> ---
>  include/linux/mm.h | 24 ++
>  mm/swap.c  | 82 ++

Why not put those functions in gup.c instead of swap.c?

>  2 files changed, 106 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 80bb6408fe73..809b7397d41e 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -993,6 +993,30 @@ static inline void put_page(struct page *page)
>   __put_page(page);
>  }
>  
> +/**
> + * put_user_page() - release a gup-pinned page
> + * @page:pointer to page to be released
> + *
> + * Pages that were pinned via get_user_pages*() must be released via
> + * either put_user_page(), or one of the put_user_pages*() routines
> + * below. This is so that eventually, pages that are pinned via
> + * get_user_pages*() can be separately tracked and uniquely 

RE: [PATCH v3 1/1] mm: introduce put_user_page*(), placeholder versions

2019-03-08 Thread Weiny, Ira
> Subject: Re: [PATCH v3 1/1] mm: introduce put_user_page*(), placeholder
> versions
> 
> On 3/7/19 6:58 PM, Christopher Lameter wrote:
> > On Wed, 6 Mar 2019, john.hubb...@gmail.com wrote:
> >
> >> Dave Chinner's description of this is very clear:
> >>
> >> "The fundamental issue is that ->page_mkwrite must be called on every
> >> write access to a clean file backed page, not just the first one.
> >> How long the GUP reference lasts is irrelevant, if the page is clean
> >> and you need to dirty it, you must call ->page_mkwrite before it is
> >> marked writeable and dirtied. Every. Time."
> >>
> >> This is just one symptom of the larger design problem: filesystems do
> >> not actually support get_user_pages() being called on their pages,
> >> and letting hardware write directly to those pages--even though that
> >> pattern has been going on since about 2005 or so.
> >
> > Can we distinguish between real filesystems that actually write to a
> > backing device and the special filesystems (like hugetlbfs, shm and
> > friends) that are like anonymous memory and do not require
> > ->page_mkwrite() in the same way as regular filesystems?
> 
> Yes. I'll change the wording in the commit message to say "real filesystems
> that actually write to a backing device", instead of "filesystems". That does
> help, thanks.
> 
> >
> > The use that I have seen in my section of the world has been
> > restricted to RDMA and get_user_pages being limited to anonymous
> > memory and those special filesystems. And if the RDMA memory is of
> > such type then the use in the past and present is safe.
> 
> Agreed.
> 
> >
> > So a logical other approach would be to simply not allow the use of
> > long term get_user_page() on real filesystem pages. I hope this patch
> > supports that?
> 
> This patch neither prevents nor provides that. What this patch does is
> provide a prerequisite for clear identification of pages that have had
> get_user_pages() called on them.
> 
> 
> >
> > It is customary after all that a file read or write operation involves
> > a single file(!) and that what is written either comes from or goes
> > to memory (anonymous or special memory filesystem).
> >
> > If you have an mmapped memory segment with a regular device backed
> > file then you already have one file associated with a memory segment
> > and a filesystem that does take care of synchronizing the contents of
> > the memory segment to a backing device.
> >
> > If you now perform RDMA or device I/O on such a memory segment then
> > you will have *two* different devices interacting with that memory
> > segment. I think that ought not to happen and not be supported out of
> > the box. It will be difficult to handle and the semantics will be hard
> > for users to understand.
> >
> > What could happen is that the filesystem could agree on request to
> > allow third party I/O to go to such a memory segment. But that needs
> > to be well defined and clearly and explicitly handled by some
> > mechanism in user space that has well defined semantics for data
> > integrity for the filesystem as well as the RDMA or device I/O.
> >
> 
> Those discussions are underway. Dave Chinner and others have been talking
> about filesystem leases, for example. The key point here is that we'll still
> need, in any of these approaches, to be able to identify the gup-pinned pages.
> And there are lots (100+) of call sites to change. So I figure we'd better get
> that started.
>

+ 1

I'm exploring patch sets like this.  Having this interface available will, IMO, 
allow for better review of those patches, rather than saying "go over to John's 
tree to get the prerequisite patches".  :-D

Also I think it will be easier for users to get things right by calling 
[get|put]_user_pages() rather than get_user_pages() followed by put_page().

Ira

> thanks,
> --
> John Hubbard
> NVIDIA


Re: [PATCH v3 1/1] mm: introduce put_user_page*(), placeholder versions

2019-03-07 Thread John Hubbard
On 3/7/19 6:58 PM, Christopher Lameter wrote:
> On Wed, 6 Mar 2019, john.hubb...@gmail.com wrote:
> 
>> Dave Chinner's description of this is very clear:
>>
>> "The fundamental issue is that ->page_mkwrite must be called on every
>> write access to a clean file backed page, not just the first one.
>> How long the GUP reference lasts is irrelevant, if the page is clean
>> and you need to dirty it, you must call ->page_mkwrite before it is
>> marked writeable and dirtied. Every. Time."
>>
>> This is just one symptom of the larger design problem: filesystems do not
>> actually support get_user_pages() being called on their pages, and letting
>> hardware write directly to those pages--even though that pattern has been
>> going on since about 2005 or so.
> 
> Can we distinguish between real filesystems that actually write to a
> backing device and the special filesystems (like hugetlbfs, shm and
> friends) that are like anonymous memory and do not require
> ->page_mkwrite() in the same way as regular filesystems?

Yes. I'll change the wording in the commit message to say "real filesystems
that actually write to a backing device", instead of "filesystems". That
does help, thanks.

> 
> The use that I have seen in my section of the world has been restricted to
> RDMA and get_user_pages being limited to anonymous memory and those
> special filesystems. And if the RDMA memory is of such type then the use
> in the past and present is safe.

Agreed.

> 
> So a logical other approach would be to simply not allow the use of
> long term get_user_page() on real filesystem pages. I hope this patch
> supports that?

This patch neither prevents nor provides that. What this patch does is
provide a prerequisite for clear identification of pages that have had
get_user_pages() called on them.


> 
> It is customary after all that a file read or write operation involves a
> single file(!) and that what is written either comes from or goes to
> memory (anonymous or special memory filesystem).
> 
> If you have an mmapped memory segment with a regular device backed file
> then you already have one file associated with a memory segment and a
> filesystem that does take care of synchronizing the contents of the memory
> segment to a backing device.
> 
> If you now perform RDMA or device I/O on such a memory segment then you
> will have *two* different devices interacting with that memory segment. I
> think that ought not to happen and not be supported out of the box. It
> will be difficult to handle and the semantics will be hard for users to
> understand.
> 
> What could happen is that the filesystem could agree on request to allow
> third party I/O to go to such a memory segment. But that needs to be well
> defined and clearly and explicitly handled by some mechanism in user space
> that has well defined semantics for data integrity for the filesystem as
> well as the RDMA or device I/O.
> 

Those discussions are underway. Dave Chinner and others have been talking
about filesystem leases, for example. The key point here is that we'll still
need, in any of these approaches, to be able to identify the gup-pinned
pages. And there are lots (100+) of call sites to change. So I figure we'd
better get that started.

thanks,
-- 
John Hubbard
NVIDIA


Re: [PATCH v3 1/1] mm: introduce put_user_page*(), placeholder versions

2019-03-07 Thread Christopher Lameter
On Wed, 6 Mar 2019, john.hubb...@gmail.com wrote:

> Dave Chinner's description of this is very clear:
>
> "The fundamental issue is that ->page_mkwrite must be called on every
> write access to a clean file backed page, not just the first one.
> How long the GUP reference lasts is irrelevant, if the page is clean
> and you need to dirty it, you must call ->page_mkwrite before it is
> marked writeable and dirtied. Every. Time."
>
> This is just one symptom of the larger design problem: filesystems do not
> actually support get_user_pages() being called on their pages, and letting
hardware write directly to those pages--even though that pattern has been
> going on since about 2005 or so.

Can we distinguish between real filesystems that actually write to a
backing device and the special filesystems (like hugetlbfs, shm and
friends) that are like anonymous memory and do not require
->page_mkwrite() in the same way as regular filesystems?

The use that I have seen in my section of the world has been restricted to
RDMA and get_user_pages being limited to anonymous memory and those
special filesystems. And if the RDMA memory is of such type then the use
in the past and present is safe.

So a logical other approach would be to simply not allow the use of
long term get_user_page() on real filesystem pages. I hope this patch
supports that?

It is customary after all that a file read or write operation involves a
single file(!) and that what is written either comes from or goes to
memory (anonymous or special memory filesystem).

If you have an mmapped memory segment with a regular device backed file
then you already have one file associated with a memory segment and a
filesystem that does take care of synchronizing the contents of the memory
segment to a backing device.

If you now perform RDMA or device I/O on such a memory segment then you
will have *two* different devices interacting with that memory segment. I
think that ought not to happen and not be supported out of the box. It
will be difficult to handle and the semantics will be hard for users to
understand.

What could happen is that the filesystem could agree on request to allow
third party I/O to go to such a memory segment. But that needs to be well
defined and clearly and explicitly handled by some mechanism in user space
that has well defined semantics for data integrity for the filesystem as
well as the RDMA or device I/O.