Re: [PATCH 1/3] mm/mincore: make mincore() more conservative
On Thu, 7 Mar 2019, Dominique Martinet wrote: > > > > > > addr = 4095 > > > vma->vm_end = 4096 > > > pages = 1000 > > > > > > then `end' is 4096 and `(end - addr) << PAGE_SHIFT' is zero, but it > > > should have been 1. > > > > Good catch! It should rather be something like > > > > unsigned long pages = (end >> PAGE_SHIFT) - (addr >> PAGE_SHIFT); > > That would be 0 for addr = 0 and vma->vm_end = 1; I assume we would > still want to count that as one page. Yeah, that was bogus as well, ETOOTIRED yesterday, sorry for the noise. Both the variants are off. > I'm not too familiar with this area of the code, but I think there's a > handy macro we can use for this, perhaps > DIV_ROUND_UP(end - addr, PAGE_SIZE) ? > > kernel/kexec_core.c has defined PAGE_COUNT() which seems more > appropriate but I do not see a global equivalent > #define PAGE_COUNT(x) (((x) + PAGE_SIZE - 1) >> PAGE_SHIFT) I'll fix that up when doing the other changes requested by Andrew. Thanks, -- Jiri Kosina SUSE Labs
Re: [PATCH 1/3] mm/mincore: make mincore() more conservative
Jiri Kosina wrote on Thu, Mar 07, 2019: > > I'm not sure this is correct in all cases. If > > > > addr = 4095 > > vma->vm_end = 4096 > > pages = 1000 > > > > then `end' is 4096 and `(end - addr) << PAGE_SHIFT' is zero, but it > > should have been 1. > > Good catch! It should rather be something like > > unsigned long pages = (end >> PAGE_SHIFT) - (addr >> PAGE_SHIFT); That would be 0 for addr = 0 and vma->vm_end = 1; I assume we would still want to count that as one page. I'm not too familiar with this area of the code, but I think there's a handy macro we can use for this, perhaps DIV_ROUND_UP(end - addr, PAGE_SIZE) ? kernel/kexec_core.c has defined PAGE_COUNT() which seems more appropriate but I do not see a global equivalent #define PAGE_COUNT(x) (((x) + PAGE_SIZE - 1) >> PAGE_SHIFT) -- Dominique
Re: [PATCH 1/3] mm/mincore: make mincore() more conservative
On Wed, 6 Mar 2019, Andrew Morton wrote: > > The semantics of what mincore() considers to be resident is not completely > > clear, but Linux has always (since 2.3.52, which is when mincore() was > > initially done) treated it as "page is available in page cache". > > > > That's potentially a problem, as that [in]directly exposes meta-information > > about pagecache / memory mapping state even about memory not strictly > > belonging > > to the process executing the syscall, opening possibilities for sidechannel > > attacks. > > > > Change the semantics of mincore() so that it only reveals pagecache > > information > > for non-anonymous mappings that belog to files that the calling process > > could > > (if it tried to) successfully open for writing. > > "for writing" comes as a bit of a surprise. Why not for reading? I guess this is a rhetorical question from you :) but fair enough, good point, I'll explain this a bit more in the changelog and in the code comments. > > @@ -189,8 +197,13 @@ static long do_mincore(unsigned long addr, unsigned > > long pages, unsigned char *v > > vma = find_vma(current->mm, addr); > > if (!vma || addr < vma->vm_start) > > return -ENOMEM; > > - mincore_walk.mm = vma->vm_mm; > > end = min(vma->vm_end, addr + (pages << PAGE_SHIFT)); > > + if (!can_do_mincore(vma)) { > > + unsigned long pages = (end - addr) >> PAGE_SHIFT; > > I'm not sure this is correct in all cases. If > > addr = 4095 > vma->vm_end = 4096 > pages = 1000 > > then `end' is 4096 and `(end - addr) << PAGE_SHIFT' is zero, but it > should have been 1. Good catch! It should rather be something like unsigned long pages = (end >> PAGE_SHIFT) - (addr >> PAGE_SHIFT); I'll fix that up and resend tomorrow. Thanks, -- Jiri Kosina SUSE Labs
Re: [PATCH 1/3] mm/mincore: make mincore() more conservative
On Wed, 30 Jan 2019 13:44:18 +0100 Vlastimil Babka wrote: > From: Jiri Kosina > > The semantics of what mincore() considers to be resident is not completely > clear, but Linux has always (since 2.3.52, which is when mincore() was > initially done) treated it as "page is available in page cache". > > That's potentially a problem, as that [in]directly exposes meta-information > about pagecache / memory mapping state even about memory not strictly > belonging > to the process executing the syscall, opening possibilities for sidechannel > attacks. > > Change the semantics of mincore() so that it only reveals pagecache > information > for non-anonymous mappings that belog to files that the calling process could > (if it tried to) successfully open for writing. "for writing" comes as a bit of a surprise. Why not for reading? Could we please explain the reasoning in the changelog and in the (presently absent) comments which describe can_do_mincore()? > @@ -189,8 +197,13 @@ static long do_mincore(unsigned long addr, unsigned long > pages, unsigned char *v > vma = find_vma(current->mm, addr); > if (!vma || addr < vma->vm_start) > return -ENOMEM; > - mincore_walk.mm = vma->vm_mm; > end = min(vma->vm_end, addr + (pages << PAGE_SHIFT)); > + if (!can_do_mincore(vma)) { > + unsigned long pages = (end - addr) >> PAGE_SHIFT; I'm not sure this is correct in all cases. If addr = 4095 vma->vm_end = 4096 pages = 1000 then `end' is 4096 and `(end - addr) << PAGE_SHIFT' is zero, but it should have been 1. Please check? A mincore test suite in tools/testing/selftests would be useful, methinks. To exercise such corner cases, check for future breakage, etc. > + memset(vec, 1, pages); > + return pages; > + } > + mincore_walk.mm = vma->vm_mm; > err = walk_page_range(addr, end, _walk); > if (err < 0) > return err;
Re: [PATCH 1/3] mm/mincore: make mincore() more conservative
Here's updated version with Michal's suggestion, and acks: I think this patch is fine to go, less sure about 2/3 and 3/3. 8< >From 49f17d9f6a42ecc2a508125b0c880ff0402a6f49 Mon Sep 17 00:00:00 2001 From: Jiri Kosina Date: Wed, 16 Jan 2019 20:53:17 +0100 Subject: [PATCH v2] mm/mincore: make mincore() more conservative The semantics of what mincore() considers to be resident is not completely clear, but Linux has always (since 2.3.52, which is when mincore() was initially done) treated it as "page is available in page cache". That's potentially a problem, as that [in]directly exposes meta-information about pagecache / memory mapping state even about memory not strictly belonging to the process executing the syscall, opening possibilities for sidechannel attacks. Change the semantics of mincore() so that it only reveals pagecache information for non-anonymous mappings that belog to files that the calling process could (if it tried to) successfully open for writing. [mho...@suse.com: restructure can_do_mincore() conditions] Originally-by: Linus Torvalds Originally-by: Dominique Martinet Cc: Dominique Martinet Cc: Andy Lutomirski Cc: Dave Chinner Cc: Kevin Easton Cc: Matthew Wilcox Cc: Cyril Hrubis Cc: Tejun Heo Cc: Kirill A. Shutemov Cc: Daniel Gruss Signed-off-by: Jiri Kosina Signed-off-by: Vlastimil Babka Acked-by: Josh Snyder Acked-by: Michal Hocko --- mm/mincore.c | 17 - 1 file changed, 16 insertions(+), 1 deletion(-) diff --git a/mm/mincore.c b/mm/mincore.c index 218099b5ed31..b8842b849604 100644 --- a/mm/mincore.c +++ b/mm/mincore.c @@ -169,6 +169,16 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, return 0; } +static inline bool can_do_mincore(struct vm_area_struct *vma) +{ + if (vma_is_anonymous(vma)) + return true; + if (!vma->vm_file) + return false; + return inode_owner_or_capable(file_inode(vma->vm_file)) || + inode_permission(file_inode(vma->vm_file), MAY_WRITE) == 0; +} + /* * Do a chunk of "sys_mincore()". We've already checked * all the arguments, we hold the mmap semaphore: we should @@ -189,8 +199,13 @@ static long do_mincore(unsigned long addr, unsigned long pages, unsigned char *v vma = find_vma(current->mm, addr); if (!vma || addr < vma->vm_start) return -ENOMEM; - mincore_walk.mm = vma->vm_mm; end = min(vma->vm_end, addr + (pages << PAGE_SHIFT)); + if (!can_do_mincore(vma)) { + unsigned long pages = (end - addr) >> PAGE_SHIFT; + memset(vec, 1, pages); + return pages; + } + mincore_walk.mm = vma->vm_mm; err = walk_page_range(addr, end, _walk); if (err < 0) return err; -- 2.20.1
Re: [PATCH 1/3] mm/mincore: make mincore() more conservative
On Thu, Jan 31, 2019 at 1:44 AM Michal Hocko wrote: > One thing is still not clear to me though. Is the new owner/writeable > check OK for the Netflix-like usecases? I mean does happycache have > appropriate access to the cache data? I have tried to re-read the > original thread but couldn't find any confirmation. The owner/writable check will suit every database that I've ever used happycache with, including cassandra, postgres and git. Acked-by: Josh Snyder
Re: [PATCH 1/3] mm/mincore: make mincore() more conservative
Michal Hocko wrote on Thu, Jan 31, 2019: > > Change the semantics of mincore() so that it only reveals pagecache > > information > > for non-anonymous mappings that belog to files that the calling process > > could > > (if it tried to) successfully open for writing. > > I agree that this is a better way than the original 574823bfab82 > ("Change mincore() to count "mapped" pages rather than "cached" pages"). > One thing is still not clear to me though. Is the new owner/writeable > check OK for the Netflix-like usecases? I mean does happycache have > appropriate access to the cache data? I have tried to re-read the > original thread but couldn't find any confirmation. It's enough for my use cases and Josh didn't seem to oppose, but since he's not in Cc I don't think he would answer -- added him now :) FWIW happycache writes in the current directory by default so I assume in the way they use it it would usually have access one way or another. > If this still doesn't help happycache kind of workloads then we should > add a capability check IMO but this looks like a decent foundation to > me. the inode_owner_or_capable/inode_permission helpers already do allow quite a few capabilities there -- Dominique
Re: [PATCH 1/3] mm/mincore: make mincore() more conservative
On Wed 30-01-19 13:44:18, Vlastimil Babka wrote: > From: Jiri Kosina > > The semantics of what mincore() considers to be resident is not completely > clear, but Linux has always (since 2.3.52, which is when mincore() was > initially done) treated it as "page is available in page cache". > > That's potentially a problem, as that [in]directly exposes meta-information > about pagecache / memory mapping state even about memory not strictly > belonging > to the process executing the syscall, opening possibilities for sidechannel > attacks. > > Change the semantics of mincore() so that it only reveals pagecache > information > for non-anonymous mappings that belog to files that the calling process could > (if it tried to) successfully open for writing. I agree that this is a better way than the original 574823bfab82 ("Change mincore() to count "mapped" pages rather than "cached" pages"). One thing is still not clear to me though. Is the new owner/writeable check OK for the Netflix-like usecases? I mean does happycache have appropriate access to the cache data? I have tried to re-read the original thread but couldn't find any confirmation. I nit below > Originally-by: Linus Torvalds > Originally-by: Dominique Martinet > Cc: Dominique Martinet > Cc: Andy Lutomirski > Cc: Dave Chinner > Cc: Kevin Easton > Cc: Matthew Wilcox > Cc: Cyril Hrubis > Cc: Tejun Heo > Cc: Kirill A. Shutemov > Cc: Daniel Gruss > Signed-off-by: Jiri Kosina > Signed-off-by: Vlastimil Babka other than that looks good to me. Acked-by: Michal Hocko If this still doesn't help happycache kind of workloads then we should add a capability check IMO but this looks like a decent foundation to me. > --- > mm/mincore.c | 15 ++- > 1 file changed, 14 insertions(+), 1 deletion(-) > > diff --git a/mm/mincore.c b/mm/mincore.c > index 218099b5ed31..747a4907a3ac 100644 > --- a/mm/mincore.c > +++ b/mm/mincore.c > @@ -169,6 +169,14 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long > addr, unsigned long end, > return 0; > } > > +static inline bool can_do_mincore(struct vm_area_struct *vma) > +{ > + return vma_is_anonymous(vma) || > + (vma->vm_file && > + (inode_owner_or_capable(file_inode(vma->vm_file)) > + || inode_permission(file_inode(vma->vm_file), > MAY_WRITE) == 0)); > +} This is hard to read. Can we do if (vma_is_anonymous(vma)) return true; if (!vma->vm_file) return false; return inode_owner_or_capable(file_inode(vma->vm_file)) || inode_permission(file_inode(vma->vm_file), MAY_WRITE) == 0; -- Michal Hocko SUSE Labs