Re: [PATCH 1/3] mm/mincore: make mincore() more conservative

2019-03-06 Thread Jiri Kosina
On Thu, 7 Mar 2019, Dominique Martinet wrote:

> > > 
> > >   addr = 4095
> > >   vma->vm_end = 4096
> > >   pages = 1000
> > > 
> > > then `end' is 4096 and `(end - addr) << PAGE_SHIFT' is zero, but it
> > > should have been 1.
> > 
> > Good catch! It should rather be something like
> > 
> > unsigned long pages = (end >> PAGE_SHIFT) - (addr >> PAGE_SHIFT);
> 
> That would be 0 for addr = 0 and vma->vm_end = 1; I assume we would
> still want to count that as one page.

Yeah, that was bogus as well, ETOOTIRED yesterday, sorry for the noise. 
Both the variants are off.

> I'm not too familiar with this area of the code, but I think there's a
> handy macro we can use for this, perhaps
>   DIV_ROUND_UP(end - addr, PAGE_SIZE) ?
> 
> kernel/kexec_core.c has defined PAGE_COUNT() which seems more
> appropriate but I do not see a global equivalent
> #define PAGE_COUNT(x) (((x) + PAGE_SIZE - 1) >> PAGE_SHIFT)

I'll fix that up when doing the other changes requested by Andrew.

Thanks,

-- 
Jiri Kosina
SUSE Labs



Re: [PATCH 1/3] mm/mincore: make mincore() more conservative

2019-03-06 Thread Dominique Martinet
Jiri Kosina wrote on Thu, Mar 07, 2019:
> > I'm not sure this is correct in all cases.   If
> > 
> > addr = 4095
> > vma->vm_end = 4096
> > pages = 1000
> > 
> > then `end' is 4096 and `(end - addr) << PAGE_SHIFT' is zero, but it
> > should have been 1.
> 
> Good catch! It should rather be something like
> 
>   unsigned long pages = (end >> PAGE_SHIFT) - (addr >> PAGE_SHIFT);

That would be 0 for addr = 0 and vma->vm_end = 1; I assume we would
still want to count that as one page.
I'm not too familiar with this area of the code, but I think there's a
handy macro we can use for this, perhaps
  DIV_ROUND_UP(end - addr, PAGE_SIZE) ?

kernel/kexec_core.c has defined PAGE_COUNT() which seems more
appropriate but I do not see a global equivalent
#define PAGE_COUNT(x) (((x) + PAGE_SIZE - 1) >> PAGE_SHIFT)

-- 
Dominique


Re: [PATCH 1/3] mm/mincore: make mincore() more conservative

2019-03-06 Thread Jiri Kosina
On Wed, 6 Mar 2019, Andrew Morton wrote:

> > The semantics of what mincore() considers to be resident is not completely
> > clear, but Linux has always (since 2.3.52, which is when mincore() was
> > initially done) treated it as "page is available in page cache".
> > 
> > That's potentially a problem, as that [in]directly exposes meta-information
> > about pagecache / memory mapping state even about memory not strictly 
> > belonging
> > to the process executing the syscall, opening possibilities for sidechannel
> > attacks.
> > 
> > Change the semantics of mincore() so that it only reveals pagecache 
> > information
> > for non-anonymous mappings that belog to files that the calling process 
> > could
> > (if it tried to) successfully open for writing.
> 
> "for writing" comes as a bit of a surprise.  Why not for reading?

I guess this is a rhetorical question from you :) but fair enough, good 
point, I'll explain this a bit more in the changelog and in the code 
comments.

> > @@ -189,8 +197,13 @@ static long do_mincore(unsigned long addr, unsigned 
> > long pages, unsigned char *v
> > vma = find_vma(current->mm, addr);
> > if (!vma || addr < vma->vm_start)
> > return -ENOMEM;
> > -   mincore_walk.mm = vma->vm_mm;
> > end = min(vma->vm_end, addr + (pages << PAGE_SHIFT));
> > +   if (!can_do_mincore(vma)) {
> > +   unsigned long pages = (end - addr) >> PAGE_SHIFT;
> 
> I'm not sure this is correct in all cases.   If
> 
>   addr = 4095
>   vma->vm_end = 4096
>   pages = 1000
> 
> then `end' is 4096 and `(end - addr) << PAGE_SHIFT' is zero, but it
> should have been 1.

Good catch! It should rather be something like

unsigned long pages = (end >> PAGE_SHIFT) - (addr >> PAGE_SHIFT);

I'll fix that up and resend tomorrow.

Thanks,

-- 
Jiri Kosina
SUSE Labs



Re: [PATCH 1/3] mm/mincore: make mincore() more conservative

2019-03-06 Thread Andrew Morton
On Wed, 30 Jan 2019 13:44:18 +0100 Vlastimil Babka  wrote:

> From: Jiri Kosina 
> 
> The semantics of what mincore() considers to be resident is not completely
> clear, but Linux has always (since 2.3.52, which is when mincore() was
> initially done) treated it as "page is available in page cache".
> 
> That's potentially a problem, as that [in]directly exposes meta-information
> about pagecache / memory mapping state even about memory not strictly 
> belonging
> to the process executing the syscall, opening possibilities for sidechannel
> attacks.
> 
> Change the semantics of mincore() so that it only reveals pagecache 
> information
> for non-anonymous mappings that belog to files that the calling process could
> (if it tried to) successfully open for writing.

"for writing" comes as a bit of a surprise.  Why not for reading?

Could we please explain the reasoning in the changelog and in the
(presently absent) comments which describe can_do_mincore()?

> @@ -189,8 +197,13 @@ static long do_mincore(unsigned long addr, unsigned long 
> pages, unsigned char *v
>   vma = find_vma(current->mm, addr);
>   if (!vma || addr < vma->vm_start)
>   return -ENOMEM;
> - mincore_walk.mm = vma->vm_mm;
>   end = min(vma->vm_end, addr + (pages << PAGE_SHIFT));
> + if (!can_do_mincore(vma)) {
> + unsigned long pages = (end - addr) >> PAGE_SHIFT;

I'm not sure this is correct in all cases.   If

addr = 4095
vma->vm_end = 4096
pages = 1000

then `end' is 4096 and `(end - addr) << PAGE_SHIFT' is zero, but it
should have been 1.

Please check?

A mincore test suite in tools/testing/selftests would be useful,
methinks.  To exercise such corner cases, check for future breakage,
etc.

> + memset(vec, 1, pages);
> + return pages;
> + }
> + mincore_walk.mm = vma->vm_mm;
>   err = walk_page_range(addr, end, _walk);
>   if (err < 0)
>   return err;




Re: [PATCH 1/3] mm/mincore: make mincore() more conservative

2019-02-01 Thread Vlastimil Babka
Here's updated version with Michal's suggestion, and acks:

I think this patch is fine to go, less sure about 2/3 and 3/3.

8<
>From 49f17d9f6a42ecc2a508125b0c880ff0402a6f49 Mon Sep 17 00:00:00 2001
From: Jiri Kosina 
Date: Wed, 16 Jan 2019 20:53:17 +0100
Subject: [PATCH v2] mm/mincore: make mincore() more conservative

The semantics of what mincore() considers to be resident is not completely
clear, but Linux has always (since 2.3.52, which is when mincore() was
initially done) treated it as "page is available in page cache".

That's potentially a problem, as that [in]directly exposes meta-information
about pagecache / memory mapping state even about memory not strictly belonging
to the process executing the syscall, opening possibilities for sidechannel
attacks.

Change the semantics of mincore() so that it only reveals pagecache information
for non-anonymous mappings that belog to files that the calling process could
(if it tried to) successfully open for writing.

[mho...@suse.com: restructure can_do_mincore() conditions]
Originally-by: Linus Torvalds 
Originally-by: Dominique Martinet 
Cc: Dominique Martinet 
Cc: Andy Lutomirski 
Cc: Dave Chinner 
Cc: Kevin Easton 
Cc: Matthew Wilcox 
Cc: Cyril Hrubis 
Cc: Tejun Heo 
Cc: Kirill A. Shutemov 
Cc: Daniel Gruss 
Signed-off-by: Jiri Kosina 
Signed-off-by: Vlastimil Babka 
Acked-by: Josh Snyder 
Acked-by: Michal Hocko 
---
 mm/mincore.c | 17 -
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/mm/mincore.c b/mm/mincore.c
index 218099b5ed31..b8842b849604 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -169,6 +169,16 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long 
addr, unsigned long end,
return 0;
 }
 
+static inline bool can_do_mincore(struct vm_area_struct *vma)
+{
+   if (vma_is_anonymous(vma))
+   return true;
+   if (!vma->vm_file)
+   return false;
+   return inode_owner_or_capable(file_inode(vma->vm_file)) ||
+   inode_permission(file_inode(vma->vm_file), MAY_WRITE) == 0;
+}
+
 /*
  * Do a chunk of "sys_mincore()". We've already checked
  * all the arguments, we hold the mmap semaphore: we should
@@ -189,8 +199,13 @@ static long do_mincore(unsigned long addr, unsigned long 
pages, unsigned char *v
vma = find_vma(current->mm, addr);
if (!vma || addr < vma->vm_start)
return -ENOMEM;
-   mincore_walk.mm = vma->vm_mm;
end = min(vma->vm_end, addr + (pages << PAGE_SHIFT));
+   if (!can_do_mincore(vma)) {
+   unsigned long pages = (end - addr) >> PAGE_SHIFT;
+   memset(vec, 1, pages);
+   return pages;
+   }
+   mincore_walk.mm = vma->vm_mm;
err = walk_page_range(addr, end, _walk);
if (err < 0)
return err;
-- 
2.20.1




Re: [PATCH 1/3] mm/mincore: make mincore() more conservative

2019-01-31 Thread Josh Snyder
On Thu, Jan 31, 2019 at 1:44 AM Michal Hocko  wrote:
> One thing is still not clear to me though. Is the new owner/writeable
> check OK for the Netflix-like usecases? I mean does happycache have
> appropriate access to the cache data? I have tried to re-read the
> original thread but couldn't find any confirmation.

The owner/writable check will suit every database that I've ever used
happycache with, including cassandra, postgres and git.

Acked-by: Josh Snyder 


Re: [PATCH 1/3] mm/mincore: make mincore() more conservative

2019-01-31 Thread Dominique Martinet
Michal Hocko wrote on Thu, Jan 31, 2019:
> > Change the semantics of mincore() so that it only reveals pagecache 
> > information
> > for non-anonymous mappings that belog to files that the calling process 
> > could
> > (if it tried to) successfully open for writing.
> 
> I agree that this is a better way than the original 574823bfab82
> ("Change mincore() to count "mapped" pages rather than "cached" pages").
> One thing is still not clear to me though. Is the new owner/writeable
> check OK for the Netflix-like usecases? I mean does happycache have
> appropriate access to the cache data? I have tried to re-read the
> original thread but couldn't find any confirmation.

It's enough for my use cases and Josh didn't seem to oppose, but since
he's not in Cc I don't think he would answer -- added him now :)

FWIW happycache writes in the current directory by default so I assume
in the way they use it it would usually have access one way or another.

> If this still doesn't help happycache kind of workloads then we should
> add a capability check IMO but this looks like a decent foundation to
> me.

the inode_owner_or_capable/inode_permission helpers already do allow
quite a few capabilities there


-- 
Dominique


Re: [PATCH 1/3] mm/mincore: make mincore() more conservative

2019-01-31 Thread Michal Hocko
On Wed 30-01-19 13:44:18, Vlastimil Babka wrote:
> From: Jiri Kosina 
> 
> The semantics of what mincore() considers to be resident is not completely
> clear, but Linux has always (since 2.3.52, which is when mincore() was
> initially done) treated it as "page is available in page cache".
> 
> That's potentially a problem, as that [in]directly exposes meta-information
> about pagecache / memory mapping state even about memory not strictly 
> belonging
> to the process executing the syscall, opening possibilities for sidechannel
> attacks.
> 
> Change the semantics of mincore() so that it only reveals pagecache 
> information
> for non-anonymous mappings that belog to files that the calling process could
> (if it tried to) successfully open for writing.

I agree that this is a better way than the original 574823bfab82
("Change mincore() to count "mapped" pages rather than "cached" pages").
One thing is still not clear to me though. Is the new owner/writeable
check OK for the Netflix-like usecases? I mean does happycache have
appropriate access to the cache data? I have tried to re-read the
original thread but couldn't find any confirmation.

I nit below

> Originally-by: Linus Torvalds 
> Originally-by: Dominique Martinet 
> Cc: Dominique Martinet 
> Cc: Andy Lutomirski 
> Cc: Dave Chinner 
> Cc: Kevin Easton 
> Cc: Matthew Wilcox 
> Cc: Cyril Hrubis 
> Cc: Tejun Heo 
> Cc: Kirill A. Shutemov 
> Cc: Daniel Gruss 
> Signed-off-by: Jiri Kosina 
> Signed-off-by: Vlastimil Babka 

other than that looks good to me.
Acked-by: Michal Hocko 

If this still doesn't help happycache kind of workloads then we should
add a capability check IMO but this looks like a decent foundation to
me.

> ---
>  mm/mincore.c | 15 ++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/mincore.c b/mm/mincore.c
> index 218099b5ed31..747a4907a3ac 100644
> --- a/mm/mincore.c
> +++ b/mm/mincore.c
> @@ -169,6 +169,14 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long 
> addr, unsigned long end,
>   return 0;
>  }
>  
> +static inline bool can_do_mincore(struct vm_area_struct *vma)
> +{
> + return vma_is_anonymous(vma) ||
> + (vma->vm_file &&
> + (inode_owner_or_capable(file_inode(vma->vm_file))
> +  || inode_permission(file_inode(vma->vm_file), 
> MAY_WRITE) == 0));
> +}

This is hard to read. Can we do
if (vma_is_anonymous(vma))
return true;
if (!vma->vm_file)
return false;
return inode_owner_or_capable(file_inode(vma->vm_file)) ||
inode_permission(file_inode(vma->vm_file), MAY_WRITE) == 0;

-- 
Michal Hocko
SUSE Labs