Re: [PATCH 0/2] mm: put_user_page() call site conversion first

2019-02-14 Thread John Hubbard
On 2/14/19 4:23 PM, Ira Weiny wrote:
> On Thu, Feb 07, 2019 at 11:56:47PM -0800, john.hubb...@gmail.com wrote:
>> From: John Hubbard 
[...]
>>   mm: introduce put_user_page*(), placeholder versions
>>   infiniband/mm: convert put_page() to put_user_page*()
> 
> A bit late but, FWIW:
> 
> Reviewed-by: Ira Weiny 
> 
> John these are the pages sitting in your gup_dma/first_steps branch here,
> correct?
> 
> https://github.com/johnhubbard/linux.git
> 

That's an old branch. In fact, just deleted it now, in order to avoid further
confusion.

This is the current branch: 

gup_dma_core 

in that same git repo. It has the current set of call site conversions. 
Please note that there are a lot of conversions that are either incomplete
or likely just plain wrong, at this point, but it is sufficient to at least 
boot up and run things such as fio(1).



thanks,
-- 
John Hubbard
NVIDIA


Re: [PATCH 0/2] mm: put_user_page() call site conversion first

2019-02-14 Thread Ira Weiny
On Thu, Feb 07, 2019 at 11:56:47PM -0800, john.hubb...@gmail.com wrote:
> From: John Hubbard 
> 
> Hi,
> 
> It seems about time to post these initial patches: I think we have pretty
> good consensus on the concept and details of the put_user_pages() approach.
> Therefore, here are the first two patches, to get started on converting the
> get_user_pages() call sites to use put_user_page(), instead of put_page().
> This is in order to implement tracking of get_user_page() pages.
> 
> A discussion of the overall problem is below.
> 
> As mentioned in patch 0001, the steps are to fix the problem are:
> 
> 1) Provide put_user_page*() routines, intended to be used
>for releasing pages that were pinned via get_user_pages*().
> 
> 2) Convert all of the call sites for get_user_pages*(), to
>invoke put_user_page*(), instead of put_page(). This involves dozens of
>call sites, and will take some time.
> 
> 3) After (2) is complete, use get_user_pages*() and put_user_page*() to
>implement tracking of these pages. This tracking will be separate from
>the existing struct page refcounting.
> 
> 4) Use the tracking and identification of these pages, to implement
>special handling (especially in writeback paths) when the pages are
>backed by a filesystem.
> 
> This write up is lifted from the RFC v2 patchset cover letter [1]:
> 
> Overview
> 
> 
> Some kernel components (file systems, device drivers) need to access
> memory that is specified via process virtual address. For a long time, the
> API to achieve that was get_user_pages ("GUP") and its variations. However,
> GUP has critical limitations that have been overlooked; in particular, GUP
> does not interact correctly with filesystems in all situations. That means
> that file-backed memory + GUP is a recipe for potential problems, some of
> which have already occurred in the field.
> 
> GUP was first introduced for Direct IO (O_DIRECT), allowing filesystem code
> to get the struct page behind a virtual address and to let storage hardware
> perform a direct copy to or from that page. This is a short-lived access
> pattern, and as such, the window for a concurrent writeback of GUP'd page
> was small enough that there were not (we think) any reported problems.
> Also, userspace was expected to understand and accept that Direct IO was
> not synchronized with memory-mapped access to that data, nor with any
> process address space changes such as munmap(), mremap(), etc.
> 
> Over the years, more GUP uses have appeared (virtualization, device
> drivers, RDMA) that can keep the pages they get via GUP for a long period
> of time (seconds, minutes, hours, days, ...). This long-term pinning makes
> an underlying design problem more obvious.
> 
> In fact, there are a number of key problems inherent to GUP:
> 
> Interactions with file systems
> ==
> 
> File systems expect to be able to write back data, both to reclaim pages,
> and for data integrity. Allowing other hardware (NICs, GPUs, etc) to gain
> write access to the file memory pages means that such hardware can dirty
> the pages, without the filesystem being aware. This can, in some cases
> (depending on filesystem, filesystem options, block device, block device
> options, and other variables), lead to data corruption, and also to kernel
> bugs of the form:
> 
> kernel BUG at /build/linux-fQ94TU/linux-4.4.0/fs/ext4/inode.c:1899!
> backtrace:
> ext4_writepage
> __writepage
> write_cache_pages
> ext4_writepages
> do_writepages
> __writeback_single_inode
> writeback_sb_inodes
> __writeback_inodes_wb
> wb_writeback
> wb_workfn
> process_one_work
> worker_thread
> kthread
> ret_from_fork
> 
> ...which is due to the file system asserting that there are still buffer
> heads attached:
> 
> ({  \
> BUG_ON(!PagePrivate(page)); \
> ((struct buffer_head *)page_private(page)); \
> })
> 
> Dave Chinner's description of this is very clear:
> 
> "The fundamental issue is that ->page_mkwrite must be called on every
> write access to a clean file backed page, not just the first one.
> How long the GUP reference lasts is irrelevant, if the page is clean
> and you need to dirty it, you must call ->page_mkwrite before it is
> marked writeable and dirtied. Every. Time."
> 
> This is just one symptom of the larger design problem: filesystems do not
> actually support get_user_pages() being called on their pages, and letting
> hardware write directly to those pages--even though that pattern has been
> going on since about 2005 or so.
> 
> Long term GUP
> =
> 
> Long term GUP is an issue when FOLL_WRITE is specified to GUP (so, a
> writeable mapping is created), and the pages are file-backed. That can lead
> to