Re: [RFC v3 7/7] shm: isolate pinned pages when sealing files
Hi

On Wed, Jul 9, 2014 at 10:57 AM, Hugh Dickins wrote:
> On Fri, 13 Jun 2014, David Herrmann wrote:
>
>> When setting SEAL_WRITE, we must make sure nobody has a writable reference to the pages (via GUP or similar). We currently check references and wait some time for them to be dropped. This, however, might fail for several reasons, including:
>>  - the page is pinned for longer than we wait
>>  - while we wait, someone takes an already pinned page for read-access
>>
>> Therefore, this patch introduces page-isolation. When sealing a file with SEAL_WRITE, we copy all pages that have an elevated ref-count. The newpage is put in place atomically, the old page is detached and left alone. It will get reclaimed once the last external user dropped it.
>>
>> Signed-off-by: David Herrmann
>
> I've not checked it line by line, but this seems to be very good work; and I'm glad you have posted it, where we can refer back to it in future.
>
> However, I'm NAKing this patch, at least for now.
>
> The reason is simple and twofold.
>
> I absolutely do not want to be maintaining an alternative form of page migration in mm/shmem.c. Shmem has its own peculiar problems (mostly because of swap): adding a new dimension of very rarely exercised complication, and dependence on the rest of mm, is not wise.
>
> And sealing just does not need this. It is clearly technically superior to, and more satisfying than, the "wait-a-while-then-give-up" technique which it would replace. But in practice, the wait-a-while technique is quite good enough (and much better self-contained than this).
>
> I've heard no requirement to support sealing of objects pinned for I/O, and the sealer just would not have given the object out for that; the requirement is to give the recipient of a sealed object confidence that it cannot be susceptible to modification in that way.
>
> I doubt there will ever be an actual need for sealing to use this migration technique; but I can imagine us referring back to your work in future, when/if implementing revoke on regular files. And integrating this into mm/migrate.c's unmap_and_move() as an extra-force mode (proceed even when the page count is raised).
>
> I think the concerns I had, when Tony first proposed this migration copy technique, were in fact unfounded - I was worried by the new inverse COW. On reflection, I don't think this introduces any new risks, which are not already present in page migration, truncation and orphaned pages.
>
> I didn't begin to test it at all, but the only defects that stood out in your code were in the areas of memcg and mlock. I think that if we go down the road of duplicating pinned pages, then we do have to make a memcg charge on the new page in addition to the old page. And if any pages happen to be mlock'ed into an address space, then we ought to map in the replacement pages afterwards (as page migration does, whether mlock'ed or not).
>
> (You were perfectly correct to use unmap_mapping_range(), rather than try_to_unmap() as page migration does: because unmap_mapping_range() manages the VM_NONLINEAR case. But our intention, under way, is to scrap all VM_NONLINEAR code, and just emulate it with multiple vmas, in which case try_to_unmap() should do.)

Dropping VM_NONLINEAR would make a lot of stuff so much easier.

I'm fine with dropping this patch again. The mlock and memcg issues you raised are valid and should get fixed.

And indeed, my testing never triggered any real elevated page-refs except when I pinned them via FUSE. Therefore, the wait-for-pins function should be sufficient, indeed.

Thanks for the reviews! I will send v4 shortly.

David
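(For context: the wait-for-pins approach referred to above is the code this patch removes again at the end of its diff - it tags pinned slots with SHMEM_TAG_PINNED and rescans up to LAST_SCAN times. Below is only a rough sketch of that idea, built from the same helpers the patch itself uses; the function name, the omission of the radix-tree tagging, and the 100ms sleep between scans are simplifications and assumptions, not the real mm/shmem.c code.)

/*
 * Illustrative sketch only -- not the mm/shmem.c implementation.  The
 * "wait-a-while-then-give-up" idea: scan the mapping a bounded number of
 * times, sleep briefly between scans, and refuse to seal with -EBUSY if
 * elevated page references (GUP pins) persist.
 */
#include <linux/errno.h>
#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/pagevec.h>
#include <linux/radix-tree.h>
#include <linux/sched.h>

#define LAST_SCAN	4	/* assumed bound: roughly 4 x 100ms */

static int shmem_wait_for_pins_sketch(struct address_space *mapping)
{
	struct pagevec pvec;
	pgoff_t indices[PAGEVEC_SIZE];
	pgoff_t index;
	int scan, i;

	for (scan = 0; scan < LAST_SCAN; scan++) {
		bool pinned = false;

		pagevec_init(&pvec, 0);
		index = 0;
		while ((pvec.nr = find_get_entries(mapping, index, PAGEVEC_SIZE,
						   pvec.pages, indices))) {
			for (i = 0; i < pagevec_count(&pvec); i++) {
				struct page *page = pvec.pages[i];

				index = indices[i];
				if (radix_tree_exceptional_entry(page))
					continue;
				/*
				 * One reference is ours (find_get_entries),
				 * one belongs to the page cache, and each
				 * mapping adds one; anything beyond that is
				 * an external pin.
				 */
				if (page_count(page) - page_mapcount(page) > 2)
					pinned = true;
			}
			pagevec_remove_exceptionals(&pvec);
			pagevec_release(&pvec);
			cond_resched();
			index++;
		}

		if (!pinned)
			return 0;

		/* give transient users a chance to drop their references */
		schedule_timeout_killable(HZ / 10);
	}

	return -EBUSY;	/* long-term pin: caller refuses SEAL_WRITE */
}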
Re: [RFC v3 7/7] shm: isolate pinned pages when sealing files
On Fri, 13 Jun 2014, David Herrmann wrote:
> When setting SEAL_WRITE, we must make sure nobody has a writable reference to the pages (via GUP or similar). We currently check references and wait some time for them to be dropped. This, however, might fail for several reasons, including:
>  - the page is pinned for longer than we wait
>  - while we wait, someone takes an already pinned page for read-access
>
> Therefore, this patch introduces page-isolation. When sealing a file with SEAL_WRITE, we copy all pages that have an elevated ref-count. The newpage is put in place atomically, the old page is detached and left alone. It will get reclaimed once the last external user dropped it.
>
> Signed-off-by: David Herrmann

I've not checked it line by line, but this seems to be very good work; and I'm glad you have posted it, where we can refer back to it in future.

However, I'm NAKing this patch, at least for now.

The reason is simple and twofold.

I absolutely do not want to be maintaining an alternative form of page migration in mm/shmem.c. Shmem has its own peculiar problems (mostly because of swap): adding a new dimension of very rarely exercised complication, and dependence on the rest of mm, is not wise.

And sealing just does not need this. It is clearly technically superior to, and more satisfying than, the "wait-a-while-then-give-up" technique which it would replace. But in practice, the wait-a-while technique is quite good enough (and much better self-contained than this).

I've heard no requirement to support sealing of objects pinned for I/O, and the sealer just would not have given the object out for that; the requirement is to give the recipient of a sealed object confidence that it cannot be susceptible to modification in that way.

I doubt there will ever be an actual need for sealing to use this migration technique; but I can imagine us referring back to your work in future, when/if implementing revoke on regular files. And integrating this into mm/migrate.c's unmap_and_move() as an extra-force mode (proceed even when the page count is raised).

I think the concerns I had, when Tony first proposed this migration copy technique, were in fact unfounded - I was worried by the new inverse COW. On reflection, I don't think this introduces any new risks, which are not already present in page migration, truncation and orphaned pages.

I didn't begin to test it at all, but the only defects that stood out in your code were in the areas of memcg and mlock. I think that if we go down the road of duplicating pinned pages, then we do have to make a memcg charge on the new page in addition to the old page. And if any pages happen to be mlock'ed into an address space, then we ought to map in the replacement pages afterwards (as page migration does, whether mlock'ed or not).

(You were perfectly correct to use unmap_mapping_range(), rather than try_to_unmap() as page migration does: because unmap_mapping_range() manages the VM_NONLINEAR case. But our intention, under way, is to scrap all VM_NONLINEAR code, and just emulate it with multiple vmas, in which case try_to_unmap() should do.)
Hugh

> ---
>  mm/shmem.c | 218 +
>  1 file changed, 105 insertions(+), 113 deletions(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index ddc3998..34b14fb 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1237,6 +1237,110 @@ unlock:
>  	return error;
>  }
>
> +static int shmem_isolate_page(struct inode *inode, struct page *oldpage)
> +{
> +	struct address_space *mapping = inode->i_mapping;
> +	struct shmem_inode_info *info = SHMEM_I(inode);
> +	struct page *newpage;
> +	int error;
> +
> +	if (oldpage->mapping != mapping)
> +		return 0;
> +	if (page_count(oldpage) - page_mapcount(oldpage) <= 2)
> +		return 0;
> +
> +	if (page_mapped(oldpage))
> +		unmap_mapping_range(mapping,
> +				    (loff_t)oldpage->index << PAGE_CACHE_SHIFT,
> +				    PAGE_CACHE_SIZE, 0);
> +
> +	VM_BUG_ON_PAGE(PageWriteback(oldpage), oldpage);
> +	VM_BUG_ON_PAGE(page_has_private(oldpage), oldpage);
> +
> +	newpage = shmem_alloc_page(mapping_gfp_mask(mapping), info,
> +				   oldpage->index);
> +	if (!newpage)
> +		return -ENOMEM;
> +
> +	__set_page_locked(newpage);
> +	copy_highpage(newpage, oldpage);
> +	flush_dcache_page(newpage);
> +
> +	page_cache_get(newpage);
> +	SetPageUptodate(newpage);
> +	SetPageSwapBacked(newpage);
> +	newpage->mapping = mapping;
> +	newpage->index = oldpage->index;
> +
> +	cancel_dirty_page(oldpage, PAGE_CACHE_SIZE);
> +
> +	spin_lock_irq(&mapping->tree_lock);
> +	error = shmem_radix_tree_replace(mapping, oldpage->index,
> +					 oldpage, newpage);
> +	if (!error) {
> +
Re: [RFC v3 7/7] shm: isolate pinned pages when sealing files
On Fri, Jun 13, 2014 at 8:27 AM, David Herrmann wrote:
> Hi
>
> On Fri, Jun 13, 2014 at 5:06 PM, Andy Lutomirski wrote:
>> On Fri, Jun 13, 2014 at 3:36 AM, David Herrmann wrote:
>>> When setting SEAL_WRITE, we must make sure nobody has a writable reference to the pages (via GUP or similar). We currently check references and wait some time for them to be dropped. This, however, might fail for several reasons, including:
>>>  - the page is pinned for longer than we wait
>>>  - while we wait, someone takes an already pinned page for read-access
>>>
>>> Therefore, this patch introduces page-isolation. When sealing a file with SEAL_WRITE, we copy all pages that have an elevated ref-count. The newpage is put in place atomically, the old page is detached and left alone. It will get reclaimed once the last external user dropped it.
>>>
>>> Signed-off-by: David Herrmann
>>
>> Won't this have unexpected effects?
>>
>> Thread 1: start read into mapping backed by fd
>>
>> Thread 2: SEAL_WRITE
>>
>> Thread 1: read finishes. now the page doesn't match the sealed page
>
> Just to be clear: you're talking about read() calls that write into the memfd? (like my FUSE example does) Your language might be ambiguous to others as "read into" actually implies a write.
>
> No, this does not have unexpected effects. But yes, your conclusion is right. To be clear, this behavior would be part of the API. Any asynchronous write might be cut off by SEAL_WRITE _iff_ you unmap your buffer before the write finishes. But you actually have to extend your example:
>
> Thread 1: p = mmap(memfd, SIZE);
> Thread 1: h = async_read(some_fd, p, SIZE);
> Thread 1: munmap(p, SIZE);
> Thread 2: SEAL_WRITE
> Thread 1: async_wait(h);
>
> If you don't do the unmap(), then SEAL_WRITE will fail due to an elevated i_mmap_writable. I think this is fine. In fact, I remember reading that async-IO is not required to resolve user-space addresses at the time of the syscall, but might delay it to the time of the actual write. But you're right, it would be misleading that the AIO operation returns success. This would be part of the memfd-API, though. And if you mess with your address space while running an async-IO operation on it, you're screwed anyway.

Ok, I missed the part where you had to munmap to trigger the oddity. That seems fine to me.

> Btw., your sealing use-case is really odd. No-one guarantees that the SEAL_WRITE happens _after_ you schedule your async-read. In case you have some synchronization there, you just have to move it after waiting for your async-io to finish.
>
> Does that clear things up?

I think so.

--Andy
Re: [RFC v3 7/7] shm: isolate pinned pages when sealing files
Hi

On Fri, Jun 13, 2014 at 5:06 PM, Andy Lutomirski wrote:
> On Fri, Jun 13, 2014 at 3:36 AM, David Herrmann wrote:
>> When setting SEAL_WRITE, we must make sure nobody has a writable reference to the pages (via GUP or similar). We currently check references and wait some time for them to be dropped. This, however, might fail for several reasons, including:
>>  - the page is pinned for longer than we wait
>>  - while we wait, someone takes an already pinned page for read-access
>>
>> Therefore, this patch introduces page-isolation. When sealing a file with SEAL_WRITE, we copy all pages that have an elevated ref-count. The newpage is put in place atomically, the old page is detached and left alone. It will get reclaimed once the last external user dropped it.
>>
>> Signed-off-by: David Herrmann
>
> Won't this have unexpected effects?
>
> Thread 1: start read into mapping backed by fd
>
> Thread 2: SEAL_WRITE
>
> Thread 1: read finishes. now the page doesn't match the sealed page

Just to be clear: you're talking about read() calls that write into the memfd? (like my FUSE example does) Your language might be ambiguous to others as "read into" actually implies a write.

No, this does not have unexpected effects. But yes, your conclusion is right. To be clear, this behavior would be part of the API. Any asynchronous write might be cut off by SEAL_WRITE _iff_ you unmap your buffer before the write finishes. But you actually have to extend your example:

Thread 1: p = mmap(memfd, SIZE);
Thread 1: h = async_read(some_fd, p, SIZE);
Thread 1: munmap(p, SIZE);
Thread 2: SEAL_WRITE
Thread 1: async_wait(h);

If you don't do the unmap(), then SEAL_WRITE will fail due to an elevated i_mmap_writable. I think this is fine. In fact, I remember reading that async-IO is not required to resolve user-space addresses at the time of the syscall, but might delay it to the time of the actual write. But you're right, it would be misleading that the AIO operation returns success. This would be part of the memfd-API, though. And if you mess with your address space while running an async-IO operation on it, you're screwed anyway.

Btw., your sealing use-case is really odd. No-one guarantees that the SEAL_WRITE happens _after_ you schedule your async-read. In case you have some synchronization there, you just have to move it after waiting for your async-io to finish.

Does that clear things up?

Thanks
David
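(To make the i_mmap_writable point above concrete, here is a small userspace sketch. It is written against the memfd interface as it later became available - memfd_create(), F_ADD_SEALS, F_SEAL_WRITE, glibc >= 2.27 - whereas the RFC under discussion still calls the seal SEAL_WRITE, so the exact constant names are assumptions.)

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	const size_t size = 4096;
	int fd = memfd_create("buf", MFD_ALLOW_SEALING);
	void *p;

	if (fd < 0 || ftruncate(fd, size) < 0)
		return 1;

	/* Thread 1: p = mmap(memfd, SIZE); */
	p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;

	/*
	 * Thread 2: SEAL_WRITE while the writable mapping still exists.
	 * i_mmap_writable is elevated, so the kernel refuses the seal.
	 */
	if (fcntl(fd, F_ADD_SEALS, F_SEAL_WRITE) < 0)
		perror("F_ADD_SEALS before munmap");	/* expected: EBUSY */

	/*
	 * Thread 1: munmap(p, SIZE); -- now sealing succeeds, and any I/O
	 * still targeting p races against the seal, which is the case
	 * discussed in this thread.
	 */
	munmap(p, size);
	if (fcntl(fd, F_ADD_SEALS, F_SEAL_WRITE) == 0)
		printf("sealed after munmap\n");

	close(fd);
	return 0;
}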
Re: [RFC v3 7/7] shm: isolate pinned pages when sealing files
On Fri, Jun 13, 2014 at 3:36 AM, David Herrmann wrote:
> When setting SEAL_WRITE, we must make sure nobody has a writable reference to the pages (via GUP or similar). We currently check references and wait some time for them to be dropped. This, however, might fail for several reasons, including:
>  - the page is pinned for longer than we wait
>  - while we wait, someone takes an already pinned page for read-access
>
> Therefore, this patch introduces page-isolation. When sealing a file with SEAL_WRITE, we copy all pages that have an elevated ref-count. The newpage is put in place atomically, the old page is detached and left alone. It will get reclaimed once the last external user dropped it.
>
> Signed-off-by: David Herrmann

Won't this have unexpected effects?

Thread 1: start read into mapping backed by fd

Thread 2: SEAL_WRITE

Thread 1: read finishes. now the page doesn't match the sealed page

Is this okay? Or am I missing something? Are there really things that keep unnecessary writable pins around?

--Andy
[RFC v3 7/7] shm: isolate pinned pages when sealing files
When setting SEAL_WRITE, we must make sure nobody has a writable reference to the pages (via GUP or similar). We currently check references and wait some time for them to be dropped. This, however, might fail for several reasons, including:
 - the page is pinned for longer than we wait
 - while we wait, someone takes an already pinned page for read-access

Therefore, this patch introduces page-isolation. When sealing a file with SEAL_WRITE, we copy all pages that have an elevated ref-count. The newpage is put in place atomically, the old page is detached and left alone. It will get reclaimed once the last external user dropped it.

Signed-off-by: David Herrmann
---
 mm/shmem.c | 218 +
 1 file changed, 105 insertions(+), 113 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index ddc3998..34b14fb 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1237,6 +1237,110 @@ unlock:
 	return error;
 }
 
+static int shmem_isolate_page(struct inode *inode, struct page *oldpage)
+{
+	struct address_space *mapping = inode->i_mapping;
+	struct shmem_inode_info *info = SHMEM_I(inode);
+	struct page *newpage;
+	int error;
+
+	if (oldpage->mapping != mapping)
+		return 0;
+	if (page_count(oldpage) - page_mapcount(oldpage) <= 2)
+		return 0;
+
+	if (page_mapped(oldpage))
+		unmap_mapping_range(mapping,
+				    (loff_t)oldpage->index << PAGE_CACHE_SHIFT,
+				    PAGE_CACHE_SIZE, 0);
+
+	VM_BUG_ON_PAGE(PageWriteback(oldpage), oldpage);
+	VM_BUG_ON_PAGE(page_has_private(oldpage), oldpage);
+
+	newpage = shmem_alloc_page(mapping_gfp_mask(mapping), info,
+				   oldpage->index);
+	if (!newpage)
+		return -ENOMEM;
+
+	__set_page_locked(newpage);
+	copy_highpage(newpage, oldpage);
+	flush_dcache_page(newpage);
+
+	page_cache_get(newpage);
+	SetPageUptodate(newpage);
+	SetPageSwapBacked(newpage);
+	newpage->mapping = mapping;
+	newpage->index = oldpage->index;
+
+	cancel_dirty_page(oldpage, PAGE_CACHE_SIZE);
+
+	spin_lock_irq(&mapping->tree_lock);
+	error = shmem_radix_tree_replace(mapping, oldpage->index,
+					 oldpage, newpage);
+	if (!error) {
+		__inc_zone_page_state(newpage, NR_FILE_PAGES);
+		__dec_zone_page_state(oldpage, NR_FILE_PAGES);
+	}
+	spin_unlock_irq(&mapping->tree_lock);
+
+	if (error) {
+		newpage->mapping = NULL;
+		unlock_page(newpage);
+		page_cache_release(newpage);
+		page_cache_release(newpage);
+		return error;
+	}
+
+	mem_cgroup_replace_page_cache(oldpage, newpage);
+	lru_cache_add_anon(newpage);
+
+	oldpage->mapping = NULL;
+	page_cache_release(oldpage);
+	unlock_page(newpage);
+	page_cache_release(newpage);
+
+	return 1;
+}
+
+static int shmem_isolate_pins(struct inode *inode)
+{
+	struct address_space *mapping = inode->i_mapping;
+	struct pagevec pvec;
+	pgoff_t indices[PAGEVEC_SIZE];
+	pgoff_t index;
+	int i, ret, error;
+
+	pagevec_init(&pvec, 0);
+	index = 0;
+	error = 0;
+	while ((pvec.nr = find_get_entries(mapping, index, PAGEVEC_SIZE,
+					   pvec.pages, indices))) {
+		for (i = 0; i < pagevec_count(&pvec); i++) {
+			struct page *page = pvec.pages[i];
+
+			index = indices[i];
+			if (radix_tree_exceptional_entry(page))
+				continue;
+			if (page->mapping != mapping)
+				continue;
+			if (page_count(page) - page_mapcount(page) <= 2)
+				continue;
+
+			lock_page(page);
+			ret = shmem_isolate_page(inode, page);
+			if (ret < 0)
+				error = ret;
+			unlock_page(page);
+		}
+		pagevec_remove_exceptionals(&pvec);
+		pagevec_release(&pvec);
+		cond_resched();
+		index++;
+	}
+
+	return error;
+}
+
 static int shmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
 	struct inode *inode = file_inode(vma->vm_file);
@@ -1734,118 +1838,6 @@ static loff_t shmem_file_llseek(struct file *file, loff_t offset,
 					int whence)
 	return offset;
 }
 
-/*
- * We need a tag: a new tag would expand every radix_tree_node by 8 bytes,
- * so reuse a tag which we firmly believe is never set or cleared on shmem.
- */
-#define SHMEM_TAG_PINNED	PAGECACHE_TAG_TOWRITE
-#define LAST_SCAN		4	/* about 150ms max */
-
-static void