Re: [git pull] vfs.git get_user_pages_fast() conversion
On Sat, Nov 18, 2017 at 09:45:31PM +, Al Viro wrote: > in __get_user_pages_locked(), or am I missing something subtle there? Andrea? You're not missing anything as far as the logic is concerned. However see the __always_inline, I added "notify_drop" purely to optimize away such branch at build time so that gcc doesn't need to include in every different copy it does of it. "notify_drop" is known at build time and constant true/false, the other variables are not. Either you drop the __always_inline or "notify_drop" makes sense to me to keep to reduce the size of the inline and run faster at runtime. Thanks, Andrea
mthca misuse of get_user_pages() (was Re: [git pull] vfs.git get_user_pages_fast() conversion)
On Fri, Nov 17, 2017 at 09:32:15PM +, Al Viro wrote: [snip] > drivers/infiniband/hw/mthca/mthca_memfree.c:475:ret = > get_user_pages(uaddr & PAGE_MASK, 1, FOLL_WRITE, pages, NULL); [snip] > Itanic > one is almost certainly buggered - we are not holding ->mmap_sem there. So's the mthca one - that caller (mthca_map_user_db()) does not take ->mmap_sem and at least some of its calls are immediately preceded by copy_from_user(). If we don't have ->mmap_sem there, this get_user_pages() is oopsably racy; if we do, those copy_from_user() calls is deadlock-prone. To make things even nastier, mthca_map_user_db() does grab a mutex around the chunk that contains get_user_pages() call: mutex_lock(&db_tab->mutex); several lines prior. And _that_ is taken on a whole lot of codepaths; if any of those happen to be under ->mmap_sem, we can't grab ->mmap_sem just around that get_user_pages(). I've poked around a bit, but I'm really unfamiliar with that code... The questions so far: * can mthca_map_user_db() ever be called under ->mmap_sem? Looks like it shouldn't, but... * do we really need that get_user_pages() under ->mutex? Would something like if (db_tab->page[i].refcount) { ++db_tab->page[i].refcount; goto out; } mutex_unlock(&db_tab->mutex); ret = get_user_pages_fast(uaddr & PAGE_MASK, 1, true, pages); if (ret < 0) return ret; mutex_lock(&db_tab->mutex); if (unlikely(db_tab->page[i].refcount)) { /* lost the race */ put_page(pages[0]); ++db_tab->page[i].refcount; goto out; } in place of the current if (db_tab->page[i].refcount) { ++db_tab->page[i].refcount; goto out; } ret = get_user_pages(uaddr & PAGE_MASK, 1, FOLL_WRITE, pages, NULL); if (ret < 0) goto out; be a usable solution? Comments?
Re: [git pull] vfs.git get_user_pages_fast() conversion
On Fri, Nov 17, 2017 at 1:32 PM, Al Viro wrote: > On Fri, Nov 17, 2017 at 12:50:47PM -0800, Linus Torvalds wrote: > >> Not because the conversion was wrong, but because the original code is >> so broken. >> >> In particular, that "1" that is unchanged in the arguments is correct >> in the conversion, but it was completely wrong in the original, even >> if it happened to work. >> >> it _should_ have been a FOLL_WRITE. Yes, it happens to have that >> value, but it was broken. >> >> (I note that a bit of grepping shows we have the same issue in a stale >> comment in mm/ksm.c). >> >> It would have been nice to see things like this mentioned in the commit >> message. >> >> Because I'm pretty sure you actually _realized_ that as you made the >> conversion, but there's no sign of that in the logs, because the >> commit message just says >> >> atomisp: use get_user_pages_fast() >> >> without mentioning how broken the old case was (even it if happened to work). > > Point... Frankly, my impression from the whole thing is that get_user_pages() > and get_user_pages_unlocked() calling conventions are overcomplicated, > especially > since most of the callers either should be get_user_pages_fast() or happen > to be inside implementations of get_user_pages_fast(). Grepping for the > remaining callers right now yields just this: > > arch/cris/arch-v32/drivers/cryptocop.c:2722:err = > get_user_pages((unsigned long int)(oper.indata + prev_ix), > arch/cris/arch-v32/drivers/cryptocop.c:2736:err = > get_user_pages((unsigned long int)oper.cipher_outdata, > arch/ia64/kernel/err_inject.c:145: ret = get_user_pages(virt_addr, 1, > FOLL_WRITE, NULL, NULL); > arch/x86/mm/mpx.c:550: gup_ret = get_user_pages((unsigned long)addr, > nr_pages, > drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c:646:r = > get_user_pages(userptr, num_pages, flags, p, NULL); > drivers/gpu/drm/radeon/radeon_ttm.c:571:r = > get_user_pages(userptr, num_pages, write ? FOLL_WRITE : 0, > drivers/infiniband/core/umem.c:194: ret = get_user_pages(cur_base, > drivers/infiniband/hw/mthca/mthca_memfree.c:475:ret = > get_user_pages(uaddr & PAGE_MASK, 1, FOLL_WRITE, pages, NUL > L); > drivers/infiniband/hw/qib/qib_user_pages.c:70: ret = > get_user_pages(start_page + got * PAGE_SIZE, > drivers/infiniband/hw/usnic/usnic_uiom.c:146: ret = > get_user_pages(cur_base, > drivers/media/pci/ivtv/ivtv-udma.c:127: err = > get_user_pages_unlocked(user_dma.uaddr, user_dma.page_count, > drivers/media/pci/ivtv/ivtv-yuv.c:78: y_pages = > get_user_pages_unlocked(y_dma.uaddr, > drivers/media/pci/ivtv/ivtv-yuv.c:82: uv_pages = > get_user_pages_unlocked(uv_dma.uaddr, > drivers/media/v4l2-core/videobuf-dma-sg.c:188: err = get_user_pages(data & > PAGE_MASK, dma->nr_pages, > drivers/misc/mic/scif/scif_rma.c:1399: pinned_pages->nr_pages = > get_user_pages( > drivers/misc/sgi-gru/grufault.c:201:if (get_user_pages(vaddr, 1, write ? > FOLL_WRITE : 0, &page, NULL) <= 0) > mm/mempolicy.c:829: err = get_user_pages(addr & PAGE_MASK, 1, 0, &p, > NULL); > virt/kvm/kvm_main.c:1326: return get_user_pages(start, 1, flags, page, > NULL); > virt/kvm/kvm_main.c:1333: rc = get_user_pages(addr, 1, flags, NULL, > NULL); > virt/kvm/kvm_main.c:1395: npages = > get_user_pages_unlocked(addr, 1, page, flags); > > and even those are dubious - e.g. cris ones could bloody well become a pair of > get_user_pages_fast(), without ->mmap_sem being held across both calls. > Itanic > one is almost certainly buggered - we are not holding ->mmap_sem there. > > Hell knows... I wonder if exposing FOLL_... thing outside of mm/gup.c > actually > makes sense. With the users so heavily skewed towards just two combinations > of > flags (0 and FOLL_WRITE)... > > And for get_user_pages() itself it's even more ridiculous - vmalist (the last > argument) is non-NULL in only one caller. Which uses it only to check if all > of the VMAs happen to be hugetlb ones, apparently. One note here, we've discovered that filesystem-dax mappings are broken with respect to DMA and fallocate(PUNCH_HOLE). As a band-aid we're looking to change rdma, video/media, and any other subsystem that tries to pin pages indefinitely to fail in the vma_is_dax() case with a new get_user_pages_longterm() helper [1]. Then the plan is to follow on with new infrastructure to register memory with a file lease. We need an interface for the kernel to notify userspace that the pages it pinned need to be unpinned because the file blocks are being deallocated (where 'file blocks' and 'memory pages' are one in the same in the dax case). [1]: https://lists.01.org/pipermail/linux-nvdimm/2017-November/013295.html
Re: [git pull] vfs.git get_user_pages_fast() conversion
On Fri, Nov 17, 2017 at 09:32:15PM +, Al Viro wrote: > And for get_user_pages() itself it's even more ridiculous - vmalist (the last > argument) is non-NULL in only one caller. Which uses it only to check if all > of the VMAs happen to be hugetlb ones, apparently. > > FWIW, I wanted to trim the users of those two suckers and see what remains. > And then go through those with maintainers of subsystems in question, to > see what is really wanted there. That's for the coming cycle, though... Incidentally, why the hell do we need that notify_drop argument of __get_user_pages_locked()? Its only use is (and has always been) if (notify_drop && lock_dropped && *locked) { /* * We must let the caller know we temporarily dropped the lock * and so the critical section protected by it was lost. */ up_read(&mm->mmap_sem); *locked = 0; } in the very end of __get_user_pages_locked(). There are 4 callers: get_user_pages_locked() and get_user_pages_remote() pass true, get_user_pages() and __get_user_pages_unlocked() - false. get_user_pages() passes NULL for 'locked', so we _never_ get to the body of that if() anyway - lock_dropped is only set if locked != NULL. And in __get_user_pages_unlocked() we have ret = __get_user_pages_locked(tsk, mm, start, nr_pages, pages, NULL, &locked, false, gup_flags | FOLL_TOUCH); if (locked) up_read(&mm->mmap_sem); Suppose we passed true instead of false here. If that if (notify_drop...) still has not triggered, nothing has changed. If it *has* triggered, we had *locked (i.e. locked from the __get_user_pages_unlocked() stack frame) non-zero, so we'd just have dropped ->mmap_sem just before returning to __get_user_pages_unlocked() instead of doing that just after. And set *locked to zero, so that __get_user_pages_unlocked() won't end up dropping it. Looks like we can bloody well get rid of that argument and do just if (lock_dropped && *locked) { /* * We must let the caller know we temporarily dropped the lock * and so the critical section protected by it was lost. */ up_read(&mm->mmap_sem); *locked = 0; } in __get_user_pages_locked(), or am I missing something subtle there? Andrea?
Re: [git pull] vfs.git get_user_pages_fast() conversion
On Fri, Nov 17, 2017 at 12:50:47PM -0800, Linus Torvalds wrote: > Not because the conversion was wrong, but because the original code is > so broken. > > In particular, that "1" that is unchanged in the arguments is correct > in the conversion, but it was completely wrong in the original, even > if it happened to work. > > it _should_ have been a FOLL_WRITE. Yes, it happens to have that > value, but it was broken. > > (I note that a bit of grepping shows we have the same issue in a stale > comment in mm/ksm.c). > > It would have been nice to see things like this mentioned in the commit > message. > > Because I'm pretty sure you actually _realized_ that as you made the > conversion, but there's no sign of that in the logs, because the > commit message just says > > atomisp: use get_user_pages_fast() > > without mentioning how broken the old case was (even it if happened to work). Point... Frankly, my impression from the whole thing is that get_user_pages() and get_user_pages_unlocked() calling conventions are overcomplicated, especially since most of the callers either should be get_user_pages_fast() or happen to be inside implementations of get_user_pages_fast(). Grepping for the remaining callers right now yields just this: arch/cris/arch-v32/drivers/cryptocop.c:2722:err = get_user_pages((unsigned long int)(oper.indata + prev_ix), arch/cris/arch-v32/drivers/cryptocop.c:2736:err = get_user_pages((unsigned long int)oper.cipher_outdata, arch/ia64/kernel/err_inject.c:145: ret = get_user_pages(virt_addr, 1, FOLL_WRITE, NULL, NULL); arch/x86/mm/mpx.c:550: gup_ret = get_user_pages((unsigned long)addr, nr_pages, drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c:646:r = get_user_pages(userptr, num_pages, flags, p, NULL); drivers/gpu/drm/radeon/radeon_ttm.c:571:r = get_user_pages(userptr, num_pages, write ? FOLL_WRITE : 0, drivers/infiniband/core/umem.c:194: ret = get_user_pages(cur_base, drivers/infiniband/hw/mthca/mthca_memfree.c:475:ret = get_user_pages(uaddr & PAGE_MASK, 1, FOLL_WRITE, pages, NUL L); drivers/infiniband/hw/qib/qib_user_pages.c:70: ret = get_user_pages(start_page + got * PAGE_SIZE, drivers/infiniband/hw/usnic/usnic_uiom.c:146: ret = get_user_pages(cur_base, drivers/media/pci/ivtv/ivtv-udma.c:127: err = get_user_pages_unlocked(user_dma.uaddr, user_dma.page_count, drivers/media/pci/ivtv/ivtv-yuv.c:78: y_pages = get_user_pages_unlocked(y_dma.uaddr, drivers/media/pci/ivtv/ivtv-yuv.c:82: uv_pages = get_user_pages_unlocked(uv_dma.uaddr, drivers/media/v4l2-core/videobuf-dma-sg.c:188: err = get_user_pages(data & PAGE_MASK, dma->nr_pages, drivers/misc/mic/scif/scif_rma.c:1399: pinned_pages->nr_pages = get_user_pages( drivers/misc/sgi-gru/grufault.c:201:if (get_user_pages(vaddr, 1, write ? FOLL_WRITE : 0, &page, NULL) <= 0) mm/mempolicy.c:829: err = get_user_pages(addr & PAGE_MASK, 1, 0, &p, NULL); virt/kvm/kvm_main.c:1326: return get_user_pages(start, 1, flags, page, NULL); virt/kvm/kvm_main.c:1333: rc = get_user_pages(addr, 1, flags, NULL, NULL); virt/kvm/kvm_main.c:1395: npages = get_user_pages_unlocked(addr, 1, page, flags); and even those are dubious - e.g. cris ones could bloody well become a pair of get_user_pages_fast(), without ->mmap_sem being held across both calls. Itanic one is almost certainly buggered - we are not holding ->mmap_sem there. Hell knows... I wonder if exposing FOLL_... thing outside of mm/gup.c actually makes sense. With the users so heavily skewed towards just two combinations of flags (0 and FOLL_WRITE)... And for get_user_pages() itself it's even more ridiculous - vmalist (the last argument) is non-NULL in only one caller. Which uses it only to check if all of the VMAs happen to be hugetlb ones, apparently. FWIW, I wanted to trim the users of those two suckers and see what remains. And then go through those with maintainers of subsystems in question, to see what is really wanted there. That's for the coming cycle, though...
Re: [git pull] vfs.git get_user_pages_fast() conversion
On Thu, Nov 16, 2017 at 7:02 PM, Al Viro wrote: > A bunch of places switched to get_user_pages_fast(). I really would have liked a bit of a commentary. Looking at the individual patches, I notice this, for example: - down_read(¤t->mm->mmap_sem); - page_nr = get_user_pages((unsigned long)userptr, -(int)(bo->pgnr), 1, pages, NULL); - up_read(¤t->mm->mmap_sem); + page_nr = get_user_pages_fast((unsigned long)userptr, +(int)(bo->pgnr), 1, pages); from the atomisp conversion, and it made me throw up my hands in horror. Not because the conversion was wrong, but because the original code is so broken. In particular, that "1" that is unchanged in the arguments is correct in the conversion, but it was completely wrong in the original, even if it happened to work. it _should_ have been a FOLL_WRITE. Yes, it happens to have that value, but it was broken. (I note that a bit of grepping shows we have the same issue in a stale comment in mm/ksm.c). It would have been nice to see things like this mentioned in the commit message. Because I'm pretty sure you actually _realized_ that as you made the conversion, but there's no sign of that in the logs, because the commit message just says atomisp: use get_user_pages_fast() without mentioning how broken the old case was (even it if happened to work). Linus
[git pull] vfs.git get_user_pages_fast() conversion
A bunch of places switched to get_user_pages_fast(). The following changes since commit 2bd6bf03f4c1c59381d62c61d03f6cc3fe71f66e: Linux 4.14-rc1 (2017-09-16 15:47:51 -0700) are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git work.get_user_pages_fast for you to fetch changes up to 77478715ba9242017976fd01de189e77fa072f51: ceph: use get_user_pages_fast() (2017-09-22 23:14:41 -0400) Al Viro (8): vchiq_2835_arm: switch to get_user_pages_fast() rapidio: switch to get_user_pages_fast() fsl_hypervisor: switch to get_user_pages_fast() via_dmablit(): use get_user_pages_fast() st: use get_user_pages_fast() atomisp: use get_user_pages_fast() pvr2fs: use get_user_pages_fast() ceph: use get_user_pages_fast() drivers/gpu/drm/via/via_dmablit.c| 6 +++--- drivers/rapidio/devices/rio_mport_cdev.c | 6 ++ drivers/scsi/st.c| 6 +- .../staging/media/atomisp/pci/atomisp2/hmm/hmm_bo.c | 6 ++ .../interface/vchiq_arm/vchiq_2835_arm.c | 20 +++- drivers/video/fbdev/pvr2fb.c | 4 +--- drivers/virt/fsl_hypervisor.c| 4 ++-- net/ceph/pagevec.c | 4 ++-- 8 files changed, 20 insertions(+), 36 deletions(-)