Re: [RFC v3 7/7] shm: isolate pinned pages when sealing files

2014-07-19 Thread David Herrmann
Hi

On Wed, Jul 9, 2014 at 10:57 AM, Hugh Dickins <hu...@google.com> wrote:
> On Fri, 13 Jun 2014, David Herrmann wrote:
>
>> When setting SEAL_WRITE, we must make sure nobody has a writable reference
>> to the pages (via GUP or similar). We currently check references and wait
>> some time for them to be dropped. This, however, might fail for several
>> reasons, including:
>>  - the page is pinned for longer than we wait
>>  - while we wait, someone takes an already pinned page for read-access
>>
>> Therefore, this patch introduces page-isolation. When sealing a file with
>> SEAL_WRITE, we copy all pages that have an elevated ref-count. The newpage
>> is put in place atomically, the old page is detached and left alone. It
>> will get reclaimed once the last external user dropped it.
>>
>> Signed-off-by: David Herrmann <dh.herrm...@gmail.com>
>
> I've not checked it line by line, but this seems to be very good work;
> and I'm glad you have posted it, where we can refer back to it in future.
>
> However, I'm NAKing this patch, at least for now.
>
> The reason is simple and twofold.
>
> I absolutely do not want to be maintaining an alternative form of
> page migration in mm/shmem.c.  Shmem has its own peculiar problems
> (mostly because of swap): adding a new dimension of very rarely
>> exercised complication, and dependence on the rest of mm, is not wise.
>
> And sealing just does not need this.  It is clearly technically
> superior to, and more satisfying than, the "wait-a-while-then-give-up"
> technique which it would replace.  But in practice, the wait-a-while
> technique is quite good enough (and much better self-contained than this).
>
> I've heard no requirement to support sealing of objects pinned for I/O,
> and the sealer just would not have given the object out for that; the
> requirement is to give the recipient of a sealed object confidence
> that it cannot be susceptible to modification in that way.
>
> I doubt there will ever be an actual need for sealing to use this
> migration technique; but I can imagine us referring back to your work in
> future, when/if implementing revoke on regular files.  And integrating
> this into mm/migrate.c's unmap_and_move() as an extra-force mode
> (proceed even when the page count is raised).
>
> I think the concerns I had, when Tony first proposed this migration copy
> technique, were in fact unfounded - I was worried by the new inverse COW.
> On reflection, I don't think this introduces any new risks, which are
> not already present in page migration, truncation and orphaned pages.
>
> I didn't begin to test it at all, but the only defects that stood out
> in your code were in the areas of memcg and mlock.  I think that if we
> go down the road of duplicating pinned pages, then we do have to make
> a memcg charge on the new page in addition to the old page.  And if
> any pages happen to be mlock'ed into an address space, then we ought
> to map in the replacement pages afterwards (as page migration does,
> whether mlock'ed or not).
>
> (You were perfectly correct to use unmap_mapping_range(), rather than
> try_to_unmap() as page migration does: because unmap_mapping_range()
> manages the VM_NONLINEAR case.  But our intention, under way, is to
> scrap all VM_NONLINEAR code, and just emulate it with multiple vmas,
> in which case try_to_unmap() should do.)

Dropping VM_NONLINEAR would make a lot of stuff so much easier.

I'm fine with dropping this patch again. The mlock and memcg issues
you raised are valid and should get fixed. And indeed, my testing
never triggered any real elevated page-refs except when I pinned pages
via FUSE. Therefore, the wait-for-pins function should be sufficient.

Thanks for the reviews! I will send v4 shortly.
David


Re: [RFC v3 7/7] shm: isolate pinned pages when sealing files

2014-07-09 Thread Hugh Dickins
On Fri, 13 Jun 2014, David Herrmann wrote:

> When setting SEAL_WRITE, we must make sure nobody has a writable reference
> to the pages (via GUP or similar). We currently check references and wait
> some time for them to be dropped. This, however, might fail for several
> reasons, including:
>  - the page is pinned for longer than we wait
>  - while we wait, someone takes an already pinned page for read-access
> 
> Therefore, this patch introduces page-isolation. When sealing a file with
> SEAL_WRITE, we copy all pages that have an elevated ref-count. The newpage
> is put in place atomically, the old page is detached and left alone. It
> will get reclaimed once the last external user dropped it.
> 
> Signed-off-by: David Herrmann <dh.herrm...@gmail.com>

I've not checked it line by line, but this seems to be very good work;
and I'm glad you have posted it, where we can refer back to it in future.

However, I'm NAKing this patch, at least for now.

The reason is simple and twofold.

I absolutely do not want to be maintaining an alternative form of
page migration in mm/shmem.c.  Shmem has its own peculiar problems
(mostly because of swap): adding a new dimension of very rarely
exercised complication, and dependence on the rest of mm, is not wise.

And sealing just does not need this.  It is clearly technically
superior to, and more satisfying than, the "wait-a-while-then-give-up"
technique which it would replace.  But in practice, the wait-a-while
technique is quite good enough (and much better self-contained than this). 
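
For concreteness: the wait-a-while technique scans the mapping a bounded
number of times, sleeps between scans, and gives up with -EBUSY if some
page still carries an external reference.  The sketch below is only a
reconstruction from the SHMEM_TAG_PINNED/LAST_SCAN remnants visible at
the end of the patch, not the literal code being removed:

static int shmem_wait_for_pins(struct address_space *mapping)
{
	struct pagevec pvec;
	pgoff_t indices[PAGEVEC_SIZE];
	pgoff_t index;
	int scan, i, error;

	error = 0;
	for (scan = 0; scan <= LAST_SCAN; scan++) {	/* LAST_SCAN == 4: ~150ms total */
		error = 0;
		if (scan)
			schedule_timeout_killable(HZ / 100);

		pagevec_init(&pvec, 0);
		index = 0;
		while ((pvec.nr = find_get_entries(mapping, index, PAGEVEC_SIZE,
						   pvec.pages, indices))) {
			for (i = 0; i < pagevec_count(&pvec); i++) {
				struct page *page = pvec.pages[i];

				index = indices[i];
				if (radix_tree_exceptional_entry(page))
					continue;
				/* one ref for the page cache, one for us */
				if (page_count(page) - page_mapcount(page) > 2)
					error = -EBUSY;
			}
			pagevec_remove_exceptionals(&pvec);
			pagevec_release(&pvec);
			cond_resched();
			index++;
		}
		if (!error)
			return 0;
	}
	return error;
}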

I've heard no requirement to support sealing of objects pinned for I/O,
and the sealer just would not have given the object out for that; the
requirement is to give the recipient of a sealed object confidence
that it cannot be susceptible to modification in that way.

I doubt there will ever be an actual need for sealing to use this
migration technique; but I can imagine us referring back to your work in
future, when/if implementing revoke on regular files.  And integrating
this into mm/migrate.c's unmap_and_move() as an extra-force mode
(proceed even when the page count is raised).

I think the concerns I had, when Tony first proposed this migration copy
technique, were in fact unfounded - I was worried by the new inverse COW.
On reflection, I don't think this introduces any new risks, which are
not already present in page migration, truncation and orphaned pages.

I didn't begin to test it at all, but the only defects that stood out
in your code were in the areas of memcg and mlock.  I think that if we
go down the road of duplicating pinned pages, then we do have to make
a memcg charge on the new page in addition to the old page.  And if
any pages happen to be mlock'ed into an address space, then we ought
to map in the replacement pages afterwards (as page migration does,
whether mlock'ed or not).

(You were perfectly correct to use unmap_mapping_range(), rather than
try_to_unmap() as page migration does: because unmap_mapping_range()
manages the VM_NONLINEAR case.  But our intention, under way, is to
scrap all VM_NONLINEAR code, and just emulate it with multiple vmas,
in which case try_to_unmap() should do.)

Hugh

> ---
>  mm/shmem.c | 218 +++++++++++++++++++++++++++++-------------------------------
>  1 file changed, 105 insertions(+), 113 deletions(-)
> 
> diff --git a/mm/shmem.c b/mm/shmem.c
> index ddc3998..34b14fb 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1237,6 +1237,110 @@ unlock:
>   return error;
>  }
>  
> +static int shmem_isolate_page(struct inode *inode, struct page *oldpage)
> +{
> + struct address_space *mapping = inode->i_mapping;
> + struct shmem_inode_info *info = SHMEM_I(inode);
> + struct page *newpage;
> + int error;
> +
> + if (oldpage->mapping != mapping)
> + return 0;
> + if (page_count(oldpage) - page_mapcount(oldpage) <= 2)
> + return 0;
> +
> + if (page_mapped(oldpage))
> + unmap_mapping_range(mapping,
> + (loff_t)oldpage->index << PAGE_CACHE_SHIFT,
> + PAGE_CACHE_SIZE, 0);
> +
> + VM_BUG_ON_PAGE(PageWriteback(oldpage), oldpage);
> + VM_BUG_ON_PAGE(page_has_private(oldpage), oldpage);
> +
> + newpage = shmem_alloc_page(mapping_gfp_mask(mapping), info,
> +oldpage->index);
> + if (!newpage)
> + return -ENOMEM;
> +
> + __set_page_locked(newpage);
> + copy_highpage(newpage, oldpage);
> + flush_dcache_page(newpage);
> +
> + page_cache_get(newpage);
> + SetPageUptodate(newpage);
> + SetPageSwapBacked(newpage);
> + newpage->mapping = mapping;
> + newpage->index = oldpage->index;
> +
> + cancel_dirty_page(oldpage, PAGE_CACHE_SIZE);
> +
> + spin_lock_irq(&mapping->tree_lock);
> + error = shmem_radix_tree_replace(mapping, oldpage->index,
> +  oldpage, newpage);
> + if (!error) {
> + __inc_zone_page_state(newpage, NR_FILE_PAGES);


Re: [RFC v3 7/7] shm: isolate pinned pages when sealing files

2014-06-13 Thread Andy Lutomirski
On Fri, Jun 13, 2014 at 8:27 AM, David Herrmann <dh.herrm...@gmail.com> wrote:
> Hi
>
> On Fri, Jun 13, 2014 at 5:06 PM, Andy Lutomirski <l...@amacapital.net> wrote:
>> On Fri, Jun 13, 2014 at 3:36 AM, David Herrmann <dh.herrm...@gmail.com>
>> wrote:
>>> When setting SEAL_WRITE, we must make sure nobody has a writable reference
>>> to the pages (via GUP or similar). We currently check references and wait
>>> some time for them to be dropped. This, however, might fail for several
>>> reasons, including:
>>>  - the page is pinned for longer than we wait
>>>  - while we wait, someone takes an already pinned page for read-access
>>>
>>> Therefore, this patch introduces page-isolation. When sealing a file with
>>> SEAL_WRITE, we copy all pages that have an elevated ref-count. The newpage
>>> is put in place atomically, the old page is detached and left alone. It
>>> will get reclaimed once the last external user dropped it.
>>>
>> Signed-off-by: David Herrmann <dh.herrm...@gmail.com>
>>
>> Won't this have unexpected effects?
>>
>> Thread 1:  start read into mapping backed by fd
>>
>> Thread 2:  SEAL_WRITE
>>
>> Thread 1: read finishes.  now the page doesn't match the sealed page
>
> Just to be clear: you're talking about read() calls that write into
> the memfd? (like my FUSE example does) Your language might be
> ambiguous to others as "read into" actually implies a write.
>
> No, this does not have unexpected effects. But yes, your conclusion is
> right. To be clear, this behavior would be part of the API. Any
> asynchronous write might be cut off by SEAL_WRITE _iff_ you unmap your
> buffer before the write finishes. But you actually have to extend your
> example:
>
> Thread 1: p = mmap(memfd, SIZE);
> Thread 1: h = async_read(some_fd, p, SIZE);
> Thread 1: munmap(p, SIZE);
> Thread 2: SEAL_WRITE
> Thread 1: async_wait(h);
>
> If you don't do the unmap(), then SEAL_WRITE will fail due to an
> elevated i_mmap_writable. I think this is fine. In fact, I remember
> reading that async-IO is not required to resolve user-space addresses
> at the time of the syscall, but might delay it to the time of the
> actual write. But you're right, it would be misleading that the AIO
> operation returns success. This would be part of the memfd-API,
> though. And if you mess with your address space while running an
> async-IO operation on it, you're screwed anyway.

Ok, I missed the part where you had to munmap to trigger the oddity.
That seems fine to me.

>
> Btw., your sealing use-case is really odd. No-one guarantees that the
> SEAL_WRITE happens _after_ you schedule your async-read. In case you
> have some synchronization there, you just have to move it after
> waiting for your async-io to finish.
>
> Does that clear things up?

I think so.

--Andy


Re: [RFC v3 7/7] shm: isolate pinned pages when sealing files

2014-06-13 Thread David Herrmann
Hi

On Fri, Jun 13, 2014 at 5:06 PM, Andy Lutomirski <l...@amacapital.net> wrote:
> On Fri, Jun 13, 2014 at 3:36 AM, David Herrmann <dh.herrm...@gmail.com> wrote:
>> When setting SEAL_WRITE, we must make sure nobody has a writable reference
>> to the pages (via GUP or similar). We currently check references and wait
>> some time for them to be dropped. This, however, might fail for several
>> reasons, including:
>>  - the page is pinned for longer than we wait
>>  - while we wait, someone takes an already pinned page for read-access
>>
>> Therefore, this patch introduces page-isolation. When sealing a file with
>> SEAL_WRITE, we copy all pages that have an elevated ref-count. The newpage
>> is put in place atomically, the old page is detached and left alone. It
>> will get reclaimed once the last external user dropped it.
>>
>> Signed-off-by: David Herrmann <dh.herrm...@gmail.com>
>
> Won't this have unexpected effects?
>
> Thread 1:  start read into mapping backed by fd
>
> Thread 2:  SEAL_WRITE
>
> Thread 1: read finishes.  now the page doesn't match the sealed page

Just to be clear: you're talking about read() calls that write into
the memfd? (like my FUSE example does) Your language might be
ambiguous to others as "read into" actually implies a write.

No, this does not have unexpected effects. But yes, your conclusion is
right. To be clear, this behavior would be part of the API. Any
asynchronous write might be cut off by SEAL_WRITE _iff_ you unmap your
buffer before the write finishes. But you actually have to extend your
example:

Thread 1: p = mmap(memfd, SIZE);
Thread 1: h = async_read(some_fd, p, SIZE);
Thread 1: munmap(p, SIZE);
Thread 2: SEAL_WRITE
Thread 1: async_wait(h);

If you don't do the unmap(), then SEAL_WRITE will fail due to an
elevated i_mmap_writable. I think this is fine. In fact, I remember
reading that async-IO is not required to resolve user-space addresses
at the time of the syscall, but might delay it to the time of the
actual write. But you're right, it would be misleading that the AIO
operation returns success. This would be part of the memfd-API,
though. And if you mess with your address space while running an
async-IO operation on it, you're screwed anyway.
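
For concreteness, here is a minimal userspace sketch of that sequence.
It is illustration only: it assumes the memfd_create()/F_ADD_SEALS
interface proposed in this series (constant names as used here), uses
libaio with O_DIRECT so the kernel really GUP-pins the destination
pages, and elides all error handling; "data" is a placeholder file:

#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t size = 4096;
	int memfd = memfd_create("buf", MFD_ALLOW_SEALING); /* assumed wrapper+flag */
	int some_fd = open("data", O_RDONLY | O_DIRECT);
	io_context_t ctx = 0;
	struct iocb cb, *cbs[1] = { &cb };
	struct io_event ev;
	char *p;

	ftruncate(memfd, size);
	p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, memfd, 0);

	io_setup(1, &ctx);
	io_prep_pread(&cb, some_fd, p, size, 0);
	io_submit(ctx, 1, cbs);	/* h = async_read(some_fd, p, SIZE): pages get pinned */

	munmap(p, size);	/* drops i_mmap_writable... */

	/* ...so sealing can succeed here; with this patch, still-pinned
	 * pages are copied and the in-flight read lands in detached pages */
	fcntl(memfd, F_ADD_SEALS, SEAL_WRITE);

	io_getevents(ctx, 1, 1, &ev, NULL);	/* async_wait(h) */
	io_destroy(ctx);
	return 0;
}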

Btw., your sealing use-case is really odd. No-one guarantees that the
SEAL_WRITE happens _after_ you schedule your async-read. In case you
have some synchronization there, you just have to move it after
waiting for your async-io to finish.

Does that clear things up?
Thanks
David


Re: [RFC v3 7/7] shm: isolate pinned pages when sealing files

2014-06-13 Thread Andy Lutomirski
On Fri, Jun 13, 2014 at 3:36 AM, David Herrmann <dh.herrm...@gmail.com> wrote:
> When setting SEAL_WRITE, we must make sure nobody has a writable reference
> to the pages (via GUP or similar). We currently check references and wait
> some time for them to be dropped. This, however, might fail for several
> reasons, including:
>  - the page is pinned for longer than we wait
>  - while we wait, someone takes an already pinned page for read-access
>
> Therefore, this patch introduces page-isolation. When sealing a file with
> SEAL_WRITE, we copy all pages that have an elevated ref-count. The newpage
> is put in place atomically, the old page is detached and left alone. It
> will get reclaimed once the last external user dropped it.
>
> Signed-off-by: David Herrmann <dh.herrm...@gmail.com>

Won't this have unexpected effects?

Thread 1:  start read into mapping backed by fd

Thread 2:  SEAL_WRITE

Thread 1: read finishes.  now the page doesn't match the sealed page

Is this okay?  Or am I missing something?

Are there really things that keep unnecessary writable pins around?

--Andy


[RFC v3 7/7] shm: isolate pinned pages when sealing files

2014-06-13 Thread David Herrmann
When setting SEAL_WRITE, we must make sure nobody has a writable reference
to the pages (via GUP or similar). We currently check references and wait
some time for them to be dropped. This, however, might fail for several
reasons, including:
 - the page is pinned for longer than we wait
 - while we wait, someone takes an already pinned page for read-access

Therefore, this patch introduces page-isolation. When sealing a file with
SEAL_WRITE, we copy all pages that have an elevated ref-count. The newpage
is put in place atomically, the old page is detached and left alone. It
will get reclaimed once the last external user dropped it.

Signed-off-by: David Herrmann <dh.herrm...@gmail.com>
---
 mm/shmem.c | 218 +++++++++++++++++++++++++++++-------------------------------
 1 file changed, 105 insertions(+), 113 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index ddc3998..34b14fb 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1237,6 +1237,110 @@ unlock:
return error;
 }
 
+static int shmem_isolate_page(struct inode *inode, struct page *oldpage)
+{
+   struct address_space *mapping = inode->i_mapping;
+   struct shmem_inode_info *info = SHMEM_I(inode);
+   struct page *newpage;
+   int error;
+
+   if (oldpage->mapping != mapping)
+   return 0;
+   if (page_count(oldpage) - page_mapcount(oldpage) <= 2)
+   return 0;
+
+   if (page_mapped(oldpage))
+   unmap_mapping_range(mapping,
+   (loff_t)oldpage->index << PAGE_CACHE_SHIFT,
+   PAGE_CACHE_SIZE, 0);
+
+   VM_BUG_ON_PAGE(PageWriteback(oldpage), oldpage);
+   VM_BUG_ON_PAGE(page_has_private(oldpage), oldpage);
+
+   newpage = shmem_alloc_page(mapping_gfp_mask(mapping), info,
+  oldpage->index);
+   if (!newpage)
+   return -ENOMEM;
+
+   __set_page_locked(newpage);
+   copy_highpage(newpage, oldpage);
+   flush_dcache_page(newpage);
+
+   page_cache_get(newpage);
+   SetPageUptodate(newpage);
+   SetPageSwapBacked(newpage);
+   newpage->mapping = mapping;
+   newpage->index = oldpage->index;
+
+   cancel_dirty_page(oldpage, PAGE_CACHE_SIZE);
+
+   spin_lock_irq(&mapping->tree_lock);
+   error = shmem_radix_tree_replace(mapping, oldpage->index,
+oldpage, newpage);
+   if (!error) {
+   __inc_zone_page_state(newpage, NR_FILE_PAGES);
+   __dec_zone_page_state(oldpage, NR_FILE_PAGES);
+   }
+   spin_unlock_irq(&mapping->tree_lock);
+
+   if (error) {
+   newpage->mapping = NULL;
+   unlock_page(newpage);
+   page_cache_release(newpage);
+   page_cache_release(newpage);
+   return error;
+   }
+
+   mem_cgroup_replace_page_cache(oldpage, newpage);
+   lru_cache_add_anon(newpage);
+
+   oldpage->mapping = NULL;
+   page_cache_release(oldpage);
+   unlock_page(newpage);
+   page_cache_release(newpage);
+
+   return 1;
+}
+
+static int shmem_isolate_pins(struct inode *inode)
+{
+   struct address_space *mapping = inode->i_mapping;
+   struct pagevec pvec;
+   pgoff_t indices[PAGEVEC_SIZE];
+   pgoff_t index;
+   int i, ret, error;
+
+   pagevec_init(&pvec, 0);
+   index = 0;
+   error = 0;
+   while ((pvec.nr = find_get_entries(mapping, index, PAGEVEC_SIZE,
+  pvec.pages, indices))) {
+   for (i = 0; i < pagevec_count(&pvec); i++) {
+   struct page *page = pvec.pages[i];
+
+   index = indices[i];
+   if (radix_tree_exceptional_entry(page))
+   continue;
+   if (page->mapping != mapping)
+   continue;
+   if (page_count(page) - page_mapcount(page) <= 2)
+   continue;
+
+   lock_page(page);
+   ret = shmem_isolate_page(inode, page);
+   if (ret < 0)
+   error = ret;
+   unlock_page(page);
+   }
+   pagevec_remove_exceptionals(&pvec);
+   pagevec_release(&pvec);
+   cond_resched();
+   index++;
+   }
+
+   return error;
+}
+
 static int shmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
struct inode *inode = file_inode(vma->vm_file);
@@ -1734,118 +1838,6 @@ static loff_t shmem_file_llseek(struct file *file, loff_t offset, int whence)
return offset;
 }
 
-/*
- * We need a tag: a new tag would expand every radix_tree_node by 8 bytes,
- * so reuse a tag which we firmly believe is never set or cleared on shmem.
- */
-#define SHMEM_TAG_PINNED        PAGECACHE_TAG_TOWRITE
-#define LAST_SCAN   4   /* about 150ms max */
-
-static void 
