Re: [PATCH V2 0/6] VA to numa node information

2018-11-10 Thread Prakash Sangappa




On 9/24/18 10:14 AM, Michal Hocko wrote:

On Fri 14-09-18 12:01:18, Steven Sistare wrote:

On 9/14/2018 1:56 AM, Michal Hocko wrote:

[...]

Why does this matter for something that is for analysis purposes.
Reading the file for the whole address space is far from a free
operation. Is the page walk optimization really essential for usability?
Moreover what prevents move_pages implementation to be clever for the
page walk itself? In other words why would we want to add a new API
rather than make the existing one faster for everybody.

One could optimize move pages.  If the caller passes a consecutive range
of small pages, and the page walk sees that a VA is mapped by a huge page,
then it can return the same numa node for each of the following VA's that fall
into the huge page range. It would be faster than 55 nsec per small page, but
hard to say how much faster, and the cost is still driven by the number of
small pages.

This is exactly what I was arguing for. There is some room for
improvements for the existing interface. I yet have to hear the explicit
usecase which would required even better performance that cannot be
achieved by the existing API.



The above-mentioned optimization to the move_pages() API helps when scanning
mapped huge pages, but it does not help if there are large sparse mappings
with only a few pages mapped. Alternatively, consider adding page walk support
to the move_pages() implementation and enhancing the API (a new flag?) to
return address range to numa node information. The page walk optimization
would certainly make a difference for usability.

We can have applications (like Oracle DB) with processes that have large
sparse mappings (in TBs), where only some areas of the mapped address range
are accessed; large portions of it have no page tables backing them.

This can become more prevalent on newer systems with multiple TBs of
memory.

Here is some data from pmap using the move_pages() API with the optimization.
The following table compares the time pmap takes to print the address mappings
of a large process, with numa node information, using the move_pages() API
versus using the /proc numa_vamaps file.

The pmap command was run on a process with 1.3 TB of address space with
sparse mappings.

                          ~1.3 TB sparse    250G dense segment with hugepages
move_pages                    8.33s               3.14s
optimized move_pages          6.29s               0.92s
/proc numa_vamaps             0.08s               0.04s

The second column ("250G dense") is the pmap time on a 250G address range of
this process, which maps hugepages (THP & hugetlb).



Re: [PATCH V2 0/6] VA to numa node information

2018-09-14 Thread Prakash Sangappa

On 9/13/2018 5:25 PM, Dave Hansen wrote:

On 09/13/2018 05:10 PM, Andrew Morton wrote:

Also, VMAs having THP pages can have a mix of 4k pages and hugepages.
The page walks would be efficient in scanning and determining if it is
a THP huge page and step over it. Whereas using the API, the application
would not know what page size mapping is used for a given VA and so would
have to again scan the VMA in units of 4k page size.

If this sounds reasonable, I can add it to the commit / patch description.

As we are judging whether this is a "good" interface, can you tell us a
bit about its scalability?  For instance, let's say someone has a 1TB
VMA that's populated with interleaved 4k pages.  How much data comes
out?  How long does it take to parse?  Will we effectively deadlock the
system if someone accidentally cat's the wrong /proc file?


For the worst case scenario you describe, it would be one line (range) for
each 4k page, which would be similar to what you get with '/proc/*/pagemap'.
The amount of data copied out at a time is based on the buffer size used in
the kernel, which is 1024 bytes. If one printed line (one range) is about 40
bytes (chars), that means about 25 lines per copy out. The main concern would
be holding the 'mmap_sem' lock, which can cause hangs. When the 1024-byte
buffer gets filled, mmap_sem is dropped and the buffer content is copied out
to the user buffer. Then the mmap_sem lock is reacquired and the page walk
continues as needed, until the specified user buffer size is filled or the
end of the process address space is reached.

One potential issue could be a large VA range with all pages populated from
one numa node; then the page walk could take longer while holding the
mmap_sem lock. This can be addressed by dropping and re-acquiring the
mmap_sem lock after a certain number of pages have been walked (say 512,
which is what happens in the '/proc/*/pagemap' case).




/proc seems like a really simple way to implement this, but it seems a
*really* odd choice for something that needs to collect a large amount
of data.  The lseek() stuff is a nice addition, but I wonder if it's
unwieldy to use in practice.  For instance, if you want to read data for
the VMA at 0x100 you lseek(fd, 0x100, SEEK_SET), right?  You read
~20 bytes of data and then the fd is at 0x120.  But, you're getting
data out at the next read() for (at least) the next page, which is also
available at 0x1001000.  Seems funky.  Do other /proc files behave this way?

Yes, SEEK_SET to the VA.  The lseek offset is the process VA, so it is not
going to be different from reading a normal text file, except that /proc
files are special. For example, in the '/proc/*/pagemap' file case, read
enforces that the seek/file offset and the user buffer size passed in be a
multiple of the pagemap_entry_t size, or else the read fails.

The usage for the numa_vamaps file will be to SEEK_SET to the VA from where
the VA range to numa node information needs to be read.

The 'fd' offset is not taken into consideration here, just the VA. Say each
VA range to numa node id line printed is about 40 bytes (chars). Now if the
read only read 20 bytes, it would have read part of the line. A subsequent
read would return the remaining bytes of the line, which are stored in the
kernel buffer.
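
To make the intended usage concrete, below is a minimal, hypothetical
user-space sketch of reading the proposed /proc/<pid>/numa_vamaps file: it
seeks to a starting VA and streams the text lines from there. The file name
and lseek semantics follow this patch series; everything else (argument
handling, buffer size) is illustrative only.

/* Hypothetical usage sketch (not part of the patch series): seek to a
 * starting VA in /proc/<pid>/numa_vamaps and print the range-to-node lines.
 * Build: cc -o vamaps vamaps.c
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>

int main(int argc, char **argv)
{
	char path[64], buf[4096];
	unsigned long start_va;
	ssize_t n;
	int fd;

	if (argc != 3) {
		fprintf(stderr, "usage: %s <pid> <start-va-hex>\n", argv[0]);
		return 1;
	}
	start_va = strtoul(argv[2], NULL, 16);
	snprintf(path, sizeof(path), "/proc/%s/numa_vamaps", argv[1]);

	fd = open(path, O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* The lseek offset is interpreted as the process VA to start from. */
	if (lseek(fd, (off_t)start_va, SEEK_SET) == (off_t)-1) {
		perror("lseek");
		return 1;
	}
	while ((n = read(fd, buf, sizeof(buf))) > 0)
		fwrite(buf, 1, (size_t)n, stdout);

	close(fd);
	return 0;
}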




Re: [RFC PATCH] Add /proc/<pid>/numa_vamaps for numa node information

2018-09-14 Thread Prakash Sangappa




On 9/14/18 5:49 AM, Jann Horn wrote:

On Fri, Sep 14, 2018 at 8:21 AM Michal Hocko  wrote:

On Fri 14-09-18 03:33:28, Jann Horn wrote:

On Wed, Sep 12, 2018 at 10:43 PM prakash.sangappa
 wrote:

On 05/09/2018 04:31 PM, Dave Hansen wrote:

On 05/07/2018 06:16 PM, prakash.sangappa wrote:

It will be /proc//numa_vamaps. Yes, the behavior will be
different with respect to seeking. Output will still be text and
the format will be same.

I want to get feedback on this approach.

I think it would be really great if you can write down a list of the
things you actually want to accomplish.  Dare I say: you need a
requirements list.

The numa_vamaps approach continues down the path of an ever-growing list
of highly-specialized /proc/ files.  I don't think that is
sustainable, even if it has been our trajectory for many years.

Pagemap wasn't exactly a shining example of us getting new ABIs right,
but it sounds like something along those is what we need.

Just sent out a V2 patch.  This patch simplifies the file content; it
only provides VA range to numa node id information.

The requirement is basically observability for performance analysis.

- Need to be able to determine VA range to numa node id information,
which also gives an idea of which ranges have memory allocated.

- The proc file /proc/<pid>/numa_vamaps is in text, so it is easy to
view directly.

The V2 patch supports seeking to a particular process VA, from where
the application can read the VA to numa node id information.

Also added the 'PTRACE_MODE_READ_REALCREDS' check when opening the
/proc file, as was indicated by Michal Hocko.

procfs files should use PTRACE_MODE_*_FSCREDS, not PTRACE_MODE_*_REALCREDS.

Out of my curiosity, what is the semantic difference? At least
kernel_move_pages uses PTRACE_MODE_READ_REALCREDS. Is this a bug?

No, that's fine. REALCREDS basically means "look at the caller's real
UID for the access check", while FSCREDS means "look at the caller's
filesystem UID". The ptrace access check has historically been using
the real UID, which is sorta weird, but normally works fine. Given
that this is documented, I didn't see any reason to change it for most
things that do ptrace access checks, even if the EUID would IMO be
more appropriate. But things that capture caller credentials at points
like open() really shouldn't look at the real UID; instead, they
should use the filesystem UID (which in practice is basically the same
as the EUID).

So in short, it depends on the interface you're coming through: Direct
syscalls use REALCREDS, things that go through the VFS layer use
FSCREDS.


So in this case, can the REALCREDS check be done in the read() system call
when reading the /proc file, instead of in the open() call?





Re: [PATCH V2 0/6] VA to numa node information

2018-09-14 Thread Prakash Sangappa




On 9/14/18 9:01 AM, Steven Sistare wrote:

On 9/14/2018 1:56 AM, Michal Hocko wrote:

On Thu 13-09-18 15:32:25, prakash.sangappa wrote:


The proc interface provides an efficient way to export address range
to numa node id mapping information compared to using the API.

Do you have any numbers?


For example, for sparsely populated mappings, if a VMA has large portions
not have any physical pages mapped, the page walk done thru the /proc file
interface can skip over non existent PMDs / ptes. Whereas using the
API the application would have to scan the entire VMA in page size units.

What prevents you from pre-filtering by reading /proc/$pid/maps to get
ranges of interest?

That works for skipping holes, but not for skipping huge pages.  I did a
quick experiment to time move_pages on a 3 GHz Xeon and a 4.18 kernel.
Allocate 128 GB and touch every small page.  Call move_pages with nodes=NULL
to get the node id for all pages, passing 512 consecutive small pages per
call to move_nodes. The total move_nodes time is 1.85 secs, and 55 nsec
per page.  Extrapolating to a 1 TB range, it would take 15 sec to retrieve
the numa node for every small page in the range.  That is not terrible, but
it is not interactive, and it becomes terrible for multiple TB.



Also, for valid VMAs in the 'maps' file, if the VMA is sparsely populated
with physical pages, the page walk can skip over non-existent page table
entries (PMDs) and so can be faster.

For example, reading the VA range of a 400GB VMA which has a few pages mapped
at the beginning and a few pages at the end, with the rest of the VMA not
having any pages, takes 0.001s using the /proc interface. Whereas with the
move_pages() API, passing 1024 consecutive small-page addresses per call, it
takes about 2.4 secs. This is on a similar system running a 4.19 kernel.
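
For reference, a minimal sketch of the move_pages()-based scan used for this
kind of comparison might look as follows. It queries node ids in batches of
1024 consecutive small pages by passing nodes == NULL (the query-only mode of
the existing API); the batch size, page granularity, and the self-scanning
main() are assumptions for illustration only, and error handling is minimal.

/* Sketch: query numa node ids for a VA range with move_pages(2), batching
 * 1024 consecutive small pages per call. nodes == NULL means "report the
 * node of each page in status[]". Link with -lnuma.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <numaif.h>

#define BATCH 1024

static void scan_range(pid_t pid, unsigned long start, unsigned long end)
{
	long page_size = sysconf(_SC_PAGESIZE);
	void *pages[BATCH];
	int status[BATCH];
	unsigned long va = start;

	while (va < end) {
		int i, n;

		for (n = 0; n < BATCH && va < end; n++, va += page_size)
			pages[n] = (void *)va;

		if (move_pages(pid, n, pages, NULL, status, 0) < 0) {
			perror("move_pages");
			return;
		}
		for (i = 0; i < n; i++) {
			if (status[i] >= 0)
				printf("%p N%d\n", pages[i], status[i]);
			/* negative status: page not present (e.g. -ENOENT) */
		}
	}
}

int main(void)
{
	/* Example: query this process's own heap allocation. */
	size_t len = 64 * sysconf(_SC_PAGESIZE);
	char *p = malloc(len);

	memset(p, 1, len);	/* touch so pages get allocated */
	scan_range(getpid(), (unsigned long)p, (unsigned long)p + len);
	free(p);
	return 0;
}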





[PATCH V2 3/6] Provide process address range to numa node id mapping

2018-09-12 Thread Prakash Sangappa
This patch provides process address range to numa node information
through the /proc/<pid>/numa_vamaps file. For address ranges not having
any pages mapped, a '-' is printed instead of the numa node id.

The following is a sample of the file format:

0040-0041 N1
0041-0047f000 N0
0047f000-0048 N2
0048-00481000 -
00481000-004a N0
004a-004a2000 -
004a2000-004aa000 N2
004aa000-004ad000 N0
004ad000-004ae000 -
..

Signed-off-by: Prakash Sangappa 
Reviewed-by: Steve Sistare 
---
 fs/proc/task_mmu.c | 158 +
 1 file changed, 158 insertions(+)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 02b553c..1371e379 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1845,6 +1845,162 @@ static int show_numa_map(struct seq_file *m, void *v)
return 0;
 }
 
+static int gather_hole_info_vamap(unsigned long start, unsigned long end,
+   struct mm_walk *walk)
+{
+   struct numa_maps *md = walk->private;
+   struct vm_area_struct *vma = walk->vma;
+
+   /*
+   * If in a nid, end walk at hole start.
+   * If no nid and vma changes, end walk at next vma start.
+   */
+   if (md->nid >= 0 || vma != find_vma(walk->mm, start)) {
+   md->nextaddr = start;
+   return 1;
+   }
+
+   if (md->nid == NUMA_VAMAPS_NID_NONE)
+   md->nid = NUMA_VAMAPS_NID_NOPAGES;
+
+   return 0;
+}
+
+static int vamap_vprintf(struct numa_vamaps_private *nvm, const char *f, ...)
+{
+   va_list args;
+   int len, space;
+
+   space = NUMA_VAMAPS_BUFSZ - nvm->count;
+   va_start(args, f);
+   len = vsnprintf(nvm->buf + nvm->count, space, f, args);
+   va_end(args);
+   if (len < space) {
+   nvm->count += len;
+   return 0;
+   }
+   return 1;
+}
+
+/*
+ * Display va-range to numa node info via /proc
+ */
+static ssize_t numa_vamaps_read(struct file *file, char __user *buf,
+   size_t count, loff_t *ppos)
+{
+   struct numa_vamaps_private *nvm = file->private_data;
+   struct vm_area_struct *vma, *tailvma;
+   struct numa_maps *md = &nvm->md;
+   struct mm_struct *mm = nvm->mm;
+   u64 vm_start = nvm->vm_start;
+   size_t ucount;
+   struct mm_walk walk = {
+   .hugetlb_entry = gather_hugetlb_stats,
+   .pmd_entry = gather_pte_stats,
+   .pte_hole = gather_hole_info_vamap,
+   .private = md,
+   .mm = mm,
+   };
+   int ret = 0, copied = 0, done = 0;
+
+   if (!mm || !mmget_not_zero(mm))
+   return 0;
+
+   if (count <= 0)
+   goto out_mm;
+
+   /* First copy leftover contents in buffer */
+   if (nvm->from)
+   goto docopy;
+
+repeat:
+   down_read(&mm->mmap_sem);
+   vma = find_vma(mm, vm_start);
+   if (!vma) {
+   done = 1;
+   goto out;
+   }
+
+   if (vma->vm_start > vm_start)
+   vm_start = vma->vm_start;
+
+   while (nvm->count < count) {
+   u64 vm_end;
+
+   /* Ensure we start with an empty numa_maps statistics */
+   memset(md, 0, sizeof(*md));
+   md->nid = NUMA_VAMAPS_NID_NONE; /* invalid nodeid at start */
+   md->nextaddr = 0;
+   md->isvamaps = 1;
+
+   if (walk_page_range(vm_start, vma->vm_end, &walk) < 0)
+   break;
+
+   /* nextaddr ends the range. if 0, reached the vma end */
+   vm_end = (md->nextaddr ? md->nextaddr : vma->vm_end);
+
+   /* break if buffer full */
+   if (md->nid >= 0 && md->node[md->nid]) {
+  if (vamap_vprintf(nvm, "%08lx-%08lx N%ld\n", vm_start,
+   vm_end, md->nid))
+   break;
+   } else if (vamap_vprintf(nvm, "%08lx-%08lx - \n", vm_start,
+   vm_end)) {
+   break;
+   }
+
+   /* advance to next VA */
+   vm_start = vm_end;
+   if (vm_end == vma->vm_end) {
+   vma = vma->vm_next;
+   if (!vma) {
+   done = 1;
+   break;
+   }
+   vm_start = vma->vm_start;
+   }
+   }
+out:
+   /* last, add gate vma details */
+   if (!vma && (tailvma = get_gate_vma(mm)) != NULL &&
+   vm_start < tailvma->vm_end) {
+   done = 0;
+   if (!vamap_vprintf(nvm, "%08lx-%08lx - \n",
+  tailvma->vm_start, tailvma->vm_end)) {
+   done = 1;
+   vm_start = tail

[PATCH V2 1/6] Add check to match numa node id when gathering pte stats

2018-09-12 Thread Prakash Sangappa
Add support to check if numa node id matches when gathering pte stats,
to be used by later patches.

Signed-off-by: Prakash Sangappa 
Reviewed-by: Steve Sistare 
---
 fs/proc/task_mmu.c | 44 +---
 1 file changed, 37 insertions(+), 7 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 5ea1d64..0e2095c 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1569,9 +1569,15 @@ struct numa_maps {
unsigned long mapcount_max;
unsigned long dirty;
unsigned long swapcache;
+   unsigned long nextaddr;
+   long nid;
+   long isvamaps;
unsigned long node[MAX_NUMNODES];
 };
 
+#define NUMA_VAMAPS_NID_NOPAGES (-1)
+#define NUMA_VAMAPS_NID_NONE    (-2)
+
 struct numa_maps_private {
struct proc_maps_private proc_maps;
struct numa_maps md;
@@ -1653,6 +1659,20 @@ static struct page *can_gather_numa_stats_pmd(pmd_t pmd,
 }
 #endif
 
+static bool
+vamap_match_nid(struct numa_maps *md, unsigned long addr, struct page *page)
+{
+   long target = (page ? page_to_nid(page) : NUMA_VAMAPS_NID_NOPAGES);
+
+   if (md->nid == NUMA_VAMAPS_NID_NONE)
+   md->nid = target;
+   if (md->nid == target)
+   return 0;
+   /* did not match */
+   md->nextaddr = addr;
+   return 1;
+}
+
 static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
unsigned long end, struct mm_walk *walk)
 {
@@ -1661,6 +1681,7 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long 
addr,
spinlock_t *ptl;
pte_t *orig_pte;
pte_t *pte;
+   int ret = 0;
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
ptl = pmd_trans_huge_lock(pmd, vma);
@@ -1668,11 +1689,13 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long 
addr,
struct page *page;
 
page = can_gather_numa_stats_pmd(*pmd, vma, addr);
-   if (page)
+   if (md->isvamaps)
+   ret = vamap_match_nid(md, addr, page);
+   if (page && !ret)
gather_stats(page, md, pmd_dirty(*pmd),
 HPAGE_PMD_SIZE/PAGE_SIZE);
spin_unlock(ptl);
-   return 0;
+   return ret;
}
 
if (pmd_trans_unstable(pmd))
@@ -1681,6 +1704,10 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long 
addr,
orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
do {
struct page *page = can_gather_numa_stats(*pte, vma, addr);
+   if (md->isvamaps && vamap_match_nid(md, addr, page)) {
+   ret = 1;
+   break;
+   }
if (!page)
continue;
gather_stats(page, md, pte_dirty(*pte), 1);
@@ -1688,7 +1715,7 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long 
addr,
} while (pte++, addr += PAGE_SIZE, addr != end);
pte_unmap_unlock(orig_pte, ptl);
cond_resched();
-   return 0;
+   return ret;
 }
 #ifdef CONFIG_HUGETLB_PAGE
 static int gather_hugetlb_stats(pte_t *pte, unsigned long hmask,
@@ -1697,15 +1724,18 @@ static int gather_hugetlb_stats(pte_t *pte, unsigned 
long hmask,
pte_t huge_pte = huge_ptep_get(pte);
struct numa_maps *md;
struct page *page;
+   int ret = 0;
+   md = walk->private;
 
if (!pte_present(huge_pte))
-   return 0;
+   return (md->isvamaps ? vamap_match_nid(md, addr, NULL) : 0);
 
page = pte_page(huge_pte);
-   if (!page)
-   return 0;
+   if (md->isvamaps)
+   ret = vamap_match_nid(md, addr, page);
+   if (!page || ret)
+   return ret;
 
-   md = walk->private;
gather_stats(page, md, pte_dirty(huge_pte), 1);
return 0;
 }
-- 
2.7.4



[PATCH V2 2/6] Add /proc/<pid>/numa_vamaps file for numa node information

2018-09-12 Thread Prakash Sangappa
Introduce supporting data structures and file operations. A later
patch will provide the changes for generating the file content.

Signed-off-by: Prakash Sangappa 
Reviewed-by: Steve Sistare 
---
 fs/proc/base.c |  2 ++
 fs/proc/internal.h |  1 +
 fs/proc/task_mmu.c | 42 ++
 3 files changed, 45 insertions(+)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index ccf86f1..1af99ae 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2927,6 +2927,7 @@ static const struct pid_entry tgid_base_stuff[] = {
REG("maps",   S_IRUGO, proc_pid_maps_operations),
 #ifdef CONFIG_NUMA
REG("numa_maps",  S_IRUGO, proc_pid_numa_maps_operations),
+   REG("numa_vamaps",  S_IRUGO, proc_numa_vamaps_operations),
 #endif
REG("mem",S_IRUSR|S_IWUSR, proc_mem_operations),
LNK("cwd",proc_cwd_link),
@@ -3313,6 +3314,7 @@ static const struct pid_entry tid_base_stuff[] = {
 #endif
 #ifdef CONFIG_NUMA
REG("numa_maps", S_IRUGO, proc_pid_numa_maps_operations),
+   REG("numa_vamaps",  S_IRUGO, proc_numa_vamaps_operations),
 #endif
REG("mem",   S_IRUSR|S_IWUSR, proc_mem_operations),
LNK("cwd",   proc_cwd_link),
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 5185d7f..994c7fd 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -298,6 +298,7 @@ extern const struct file_operations 
proc_pid_smaps_operations;
 extern const struct file_operations proc_pid_smaps_rollup_operations;
 extern const struct file_operations proc_clear_refs_operations;
 extern const struct file_operations proc_pagemap_operations;
+extern const struct file_operations proc_numa_vamaps_operations;
 
 extern unsigned long task_vsize(struct mm_struct *);
 extern unsigned long task_statm(struct mm_struct *,
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 0e2095c..02b553c 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1583,6 +1583,16 @@ struct numa_maps_private {
struct numa_maps md;
 };
 
+#define NUMA_VAMAPS_BUFSZ  1024
+struct numa_vamaps_private {
+   struct mm_struct *mm;
+   struct numa_maps md;
+   u64 vm_start;
+   size_t from;
+   size_t count; /* residual bytes in buf at offset 'from' */
+   char buf[NUMA_VAMAPS_BUFSZ]; /* buffer */
+};
+
 static void gather_stats(struct page *page, struct numa_maps *md, int 
pte_dirty,
unsigned long nr_pages)
 {
@@ -1848,6 +1858,34 @@ static int pid_numa_maps_open(struct inode *inode, 
struct file *file)
sizeof(struct numa_maps_private));
 }
 
+static int numa_vamaps_open(struct inode *inode, struct file *file)
+{
+   struct mm_struct *mm;
+   struct numa_vamaps_private *nvm;
+   nvm = kzalloc(sizeof(struct numa_vamaps_private), GFP_KERNEL);
+   if (!nvm)
+   return -ENOMEM;
+
+   mm = proc_mem_open(inode, PTRACE_MODE_READ);
+   if (IS_ERR(mm)) {
+   kfree(nvm);
+   return PTR_ERR(mm);
+   }
+   nvm->mm = mm;
+   file->private_data = nvm;
+   return 0;
+}
+
+static int numa_vamaps_release(struct inode *inode, struct file *file)
+{
+   struct numa_vamaps_private *nvm = file->private_data;
+
+   if (nvm->mm)
+   mmdrop(nvm->mm);
+   kfree(nvm);
+   return 0;
+}
+
 const struct file_operations proc_pid_numa_maps_operations = {
.open   = pid_numa_maps_open,
.read   = seq_read,
@@ -1855,4 +1893,8 @@ const struct file_operations 
proc_pid_numa_maps_operations = {
.release= proc_map_release,
 };
 
+const struct file_operations proc_numa_vamaps_operations = {
+   .open   = numa_vamaps_open,
+   .release= numa_vamaps_release,
+};
 #endif /* CONFIG_NUMA */
-- 
2.7.4



[PATCH V2 5/6] File /proc/<pid>/numa_vamaps access needs PTRACE_MODE_READ_REALCREDS check

2018-09-12 Thread Prakash Sangappa
Permission to access the /proc/<pid>/numa_vamaps file should be governed by
a PTRACE_MODE_READ_REALCREDS check, to restrict getting specific VA range to
numa node mapping information.

Signed-off-by: Prakash Sangappa 
Reviewed-by: Steve Sistare 
---
 fs/proc/base.c | 4 +++-
 fs/proc/task_mmu.c | 2 +-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 1af99ae..3c19a55 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -745,7 +745,9 @@ struct mm_struct *proc_mem_open(struct inode *inode, 
unsigned int mode)
struct mm_struct *mm = ERR_PTR(-ESRCH);
 
if (task) {
-   mm = mm_access(task, mode | PTRACE_MODE_FSCREDS);
+   if (!(mode & PTRACE_MODE_REALCREDS))
+   mode |= PTRACE_MODE_FSCREDS;
+   mm = mm_access(task, mode);
put_task_struct(task);
 
if (!IS_ERR_OR_NULL(mm)) {
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 93dce46..30b29d2 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -2043,7 +2043,7 @@ static int numa_vamaps_open(struct inode *inode, struct 
file *file)
if (!nvm)
return -ENOMEM;
 
-   mm = proc_mem_open(inode, PTRACE_MODE_READ);
+   mm = proc_mem_open(inode, PTRACE_MODE_READ | PTRACE_MODE_REALCREDS);
if (IS_ERR(mm)) {
kfree(nvm);
return PTR_ERR(mm);
-- 
2.7.4



[PATCH V2 6/6] /proc/pid/numa_vamaps: document in Documentation/filesystems/proc.txt

2018-09-12 Thread Prakash Sangappa
Add documentation for /proc/<pid>/numa_vamaps in
Documentation/filesystems/proc.txt

Signed-off-by: Prakash Sangappa 
Reviewed-by: Steve Sistare 
---
 Documentation/filesystems/proc.txt | 21 +
 1 file changed, 21 insertions(+)

diff --git a/Documentation/filesystems/proc.txt 
b/Documentation/filesystems/proc.txt
index 22b4b00..7095216 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -150,6 +150,9 @@ Table 1-1: Process specific entries in /proc
each mapping and flags associated with it
  numa_maps an extension based on maps, showing the memory locality and
binding policy as well as mem usage (in pages) of each mapping.
+ numa_vamaps   Presents information about mapped address ranges to numa node
+   from where the physical memory is allocated.
+
 ..
 
 For example, to get the status information of a process, all you have to do is
@@ -571,6 +574,24 @@ Where:
 node locality page counters (N0 == node0, N1 == node1, ...) and the kernel page
 size, in KB, that is backing the mapping up.
 
+The /proc/pid/numa_vamaps file shows mapped address ranges and the numa node
+id from where the physical pages are allocated. For mapped address ranges not
+having any pages mapped, a '-' is shown instead of the node id. Each line in
+the file shows one address range mapping to one numa node.
+
+address-range  numa-node-id
+
+0040-0041 N1
+0041-0047f000 N0
+0047f000-0048 N2
+0048-00481000 -
+00481000-004a N0
+004a-004a2000 -
+004a2000-004aa000 N2
+004aa000-004ad000 N0
+004ad000-004ae000 -
+..
+
 1.2 Kernel data
 ---
 
-- 
2.7.4



[PATCH V2 0/6] VA to numa node information

2018-09-12 Thread Prakash Sangappa
For analysis purposes it is useful to have numa node information
corresponding to the mapped virtual address ranges of a process. Currently,
the file /proc/<pid>/numa_maps provides a list of numa nodes from where pages
are allocated per VMA of a process. This is not useful if a user needs to
determine which numa node the mapped pages are allocated from for a
particular address range. It would have helped if the numa node information
presented in /proc/<pid>/numa_maps was broken down by VA ranges showing the
exact numa node from where the pages have been allocated.

The format of the /proc/<pid>/numa_maps file content is dependent on the
/proc/<pid>/maps file content, as mentioned in the manpage, i.e. one line
entry for every VMA corresponding to entries in the /proc/<pid>/maps file.
Therefore changing the output of /proc/<pid>/numa_maps may not be possible.

This patch set introduces the file /proc/<pid>/numa_vamaps, which
provides a proper breakdown of VA ranges by the numa node id from where the
mapped pages are allocated. For address ranges not having any pages mapped,
a '-' is printed instead of a numa node id.

It includes support for lseek, allowing seeking to a specific process virtual
address (VA), starting from where the address range to numa node information
can be read from this file.

The new file /proc/<pid>/numa_vamaps will be governed by ptrace access
mode PTRACE_MODE_READ_REALCREDS.

See following for previous discussion about this proposal

https://marc.info/?t=15252407341=1=2


Prakash Sangappa (6):
  Add check to match numa node id when gathering pte stats
  Add /proc/<pid>/numa_vamaps file for numa node information
  Provide process address range to numa node id mapping
  Add support to lseek /proc/<pid>/numa_vamaps file
  File /proc/<pid>/numa_vamaps access needs PTRACE_MODE_READ_REALCREDS check
  /proc/pid/numa_vamaps: document in Documentation/filesystems/proc.txt

 Documentation/filesystems/proc.txt |  21 +++
 fs/proc/base.c |   6 +-
 fs/proc/internal.h |   1 +
 fs/proc/task_mmu.c | 265 -
 4 files changed, 285 insertions(+), 8 deletions(-)

-- 
2.7.4



[PATCH V2 4/6] Add support to lseek /proc/<pid>/numa_vamaps file

2018-09-12 Thread Prakash Sangappa
Allow lseeking to a process virtual address(VA), starting from where
the address range to numa node information can be read. The lseek offset
will be the process virtual address.

Signed-off-by: Prakash Sangappa 
Reviewed-by: Steve Sistare 
---
 fs/proc/task_mmu.c | 23 ++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 1371e379..93dce46 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1866,6 +1866,27 @@ static int gather_hole_info_vamap(unsigned long start, 
unsigned long end,
return 0;
 }
 
+static loff_t numa_vamaps_llseek(struct file *file, loff_t offset, int orig)
+{
+   struct numa_vamaps_private *nvm = file->private_data;
+
+   if (orig == SEEK_CUR && offset < 0 && nvm->vm_start < -offset)
+   return -EINVAL;
+
+   switch (orig) {
+   case SEEK_SET:
+   nvm->vm_start = offset & PAGE_MASK;
+   break;
+   case SEEK_CUR:
+   nvm->vm_start += offset;
+   nvm->vm_start = nvm->vm_start & PAGE_MASK;
+   break;
+   default:
+   return -EINVAL;
+   }
+   return nvm->vm_start;
+}
+
 static int vamap_vprintf(struct numa_vamaps_private *nvm, const char *f, ...)
 {
va_list args;
@@ -2052,7 +2073,7 @@ const struct file_operations 
proc_pid_numa_maps_operations = {
 const struct file_operations proc_numa_vamaps_operations = {
.open   = numa_vamaps_open,
.read   = numa_vamaps_read,
-   .llseek = noop_llseek,
+   .llseek = numa_vamaps_llseek,
.release= numa_vamaps_release,
 };
 #endif /* CONFIG_NUMA */
-- 
2.7.4



Re: [RFC PATCH] Add /proc/<pid>/numa_vamaps for numa node information

2018-05-10 Thread Prakash Sangappa



On 5/10/18 12:42 AM, Michal Hocko wrote:

On Fri 04-05-18 09:18:11, Prakash Sangappa wrote:


On 5/4/18 4:12 AM, Michal Hocko wrote:

On Thu 03-05-18 15:39:49, prakash.sangappa wrote:

On 05/03/2018 11:03 AM, Christopher Lameter wrote:

On Tue, 1 May 2018, Prakash Sangappa wrote:


For analysis purposes it is useful to have numa node information
corresponding to the mapped address ranges of the process. Currently
/proc/<pid>/numa_maps provides the list of numa nodes from which pages are
allocated per VMA of the process. This is not useful if a user needs to
determine which numa node the mapped pages are allocated from for a
particular address range. It would have helped if the numa node information
presented in /proc/<pid>/numa_maps was broken down by VA ranges showing the
exact numa node from which the pages have been allocated.

Can't you write a small script that scans the information in numa_maps and
then displays the total pages per NUMA node and then a list of which
ranges have how many pages on a particular node?

Don't think we can determine which numa node a given user process
address range has pages from, based on the existing 'numa_maps' file.

yes we have. See move_pages...

Sure using move_pages, not based on just 'numa_maps'.


reading this file will not be restricted(i.e requiring CAP_SYS_ADMIN).

So a prime motivator here is security restricted access to numa_maps?

No it is the opposite. A regular user should be able to determine
numa node information.

Well, that breaks the layout randomization, doesn't it?

Exposing numa node information itself should not break randomization right?

I thought you planned to expose address ranges for each numa node as
well. /me confused.


Yes, are you suggesting this information should not be available to a regular
user?

Is it not possible to get that same information using the move_pages()
api as a regular user, although one / set of pages at a time?



It would be up to the application. In the case of randomization, the application
could generate address range traces of interest for debugging and then,
using the numa node information, one could determine where the memory is laid
out for analysis.

... even more confused
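
As a concrete illustration of the move_pages() alternative raised above, the existing API already answers "which node is this page on" when called in query mode (nodes == NULL): the status array receives the node id of each page, or a negative errno such as -ENOENT for addresses with no page mapped. A sketch, with a made-up region; link with -lnuma:

/* Sketch: query per-page numa nodes of the calling process via move_pages(2). */
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	long page_size = sysconf(_SC_PAGESIZE);
	unsigned long count = 16;			/* made-up number of pages */
	char *base = malloc(count * page_size);		/* stand-in for a region of interest */
	void **pages = calloc(count, sizeof(*pages));
	int *status = calloc(count, sizeof(*status));
	unsigned long i;

	for (i = 0; i < count; i++)
		pages[i] = base + i * page_size;

	/* nodes == NULL: do not move anything, just report the node of each page. */
	if (move_pages(getpid(), count, pages, NULL, status, 0) < 0) {
		perror("move_pages");
		return 1;
	}
	for (i = 0; i < count; i++) {
		if (status[i] < 0)
			printf("%p: no page (errno %d)\n", pages[i], -status[i]);
		else
			printf("%p: node %d\n", pages[i], status[i]);
	}
	return 0;
}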





Re: [RFC PATCH] Add /proc/<pid>/numa_vamaps for numa node information

2018-05-04 Thread Prakash Sangappa



On 5/4/18 7:57 AM, Christopher Lameter wrote:

On Thu, 3 May 2018, prakash.sangappa wrote:


exact numa node from where the pages have been allocated.

Can't you write a small script that scans the information in numa_maps and
then displays the total pages per NUMA node and then a list of which
ranges have how many pages on a particular node?

Don't think we can determine which numa node a given user process
address range has pages from, based on the existing 'numa_maps' file.

Well the information is contained in numa_maps I thought. What is missing?


Currently 'numa_maps' gives, per VMA, a list of the numa nodes memory is
allocated from.
Ex. we get something like the following from numa_maps:

04000  N0=1,N2=2 kernelpagesize_KB=4

The first field is the start address of a VMA. This VMA could be much larger
than 3 4k pages.

It does not say which address in the VMA has the pages mapped.




reading this file will not be restricted(i.e requiring CAP_SYS_ADMIN).

So a prime motivator here is security restricted access to numa_maps?

No it is the opposite. A regular user should be able to determine
numa node information.

That used to be the case until changes were made to the permissions for
reading numa_maps.





Re: [RFC PATCH] Add /proc/<pid>/numa_vamaps for numa node information

2018-05-04 Thread Prakash Sangappa



On 5/4/18 4:12 AM, Michal Hocko wrote:

On Thu 03-05-18 15:39:49, prakash.sangappa wrote:


On 05/03/2018 11:03 AM, Christopher Lameter wrote:

On Tue, 1 May 2018, Prakash Sangappa wrote:


For analysis purposes it is useful to have numa node information
corresponding to the mapped address ranges of the process. Currently
/proc/<pid>/numa_maps provides the list of numa nodes from which pages are
allocated per VMA of the process. This is not useful if a user needs to
determine which numa node the mapped pages are allocated from for a
particular address range. It would have helped if the numa node information
presented in /proc/<pid>/numa_maps was broken down by VA ranges showing the
exact numa node from which the pages have been allocated.

Can't you write a small script that scans the information in numa_maps and
then displays the total pages per NUMA node and then a list of which
ranges have how many pages on a particular node?

Don't think we can determine which numa node a given user process
address range has pages from, based on the existing 'numa_maps' file.

yes we have. See move_pages...


Sure using move_pages, not based on just 'numa_maps'.

  

reading this file will not be restricted(i.e requiring CAP_SYS_ADMIN).

So a prime motivator here is security restricted access to numa_maps?

No it is the opposite. A regular user should be able to determine
numa node information.

Well, that breaks the layout randomization, doesn't it?


Exposing numa node information itself should not break randomization right?

It would be up to the application. In the case of randomization, the application
could generate address range traces of interest for debugging and then,
using the numa node information, one could determine where the memory is laid
out for analysis.



[RFC PATCH] Add /proc/<pid>/numa_vamaps for numa node information

2018-05-01 Thread Prakash Sangappa
For analysis purposes it is useful to have numa node information
corresponding to the mapped address ranges of the process. Currently
/proc/<pid>/numa_maps provides the list of numa nodes from which pages are
allocated per VMA of the process. This is not useful if a user needs to
determine which numa node the mapped pages are allocated from for a
particular address range. It would have helped if the numa node information
presented in /proc/<pid>/numa_maps was broken down by VA ranges showing the
exact numa node from which the pages have been allocated.

The format of the /proc/<pid>/numa_maps file content is dependent on the
/proc/<pid>/maps file content, as mentioned in the manpage, i.e. one line
entry for every VMA corresponding to entries in the /proc/<pid>/maps file.
Therefore changing the output of /proc/<pid>/numa_maps may not be possible.

Hence, this patch proposes adding the file /proc/<pid>/numa_vamaps, which will
provide a proper breakdown of VA ranges by the numa node id from which the mapped
pages are allocated. For address ranges not having any pages mapped, a '-'
is printed instead of a numa node id. In addition, this file will include most
of the other information currently presented in /proc/<pid>/numa_maps. The
additional information is included for convenience. If this is not
preferred, the patch could be modified to just provide VA range to numa node
information, as the rest of the information is already available through the
/proc/<pid>/numa_maps file.

Since the VA range to numa node information does not include the page's PFN,
reading this file will not be restricted (i.e. will not require CAP_SYS_ADMIN).

Here is the snippet from the new file content showing the format.

0040-00401000 N0=1 kernelpagesize_kB=4 mapped=1 file=/tmp/hmap2
0060-00601000 N0=1 kernelpagesize_kB=4 anon=1 dirty=1 file=/tmp/hmap2
00601000-00602000 N0=1 kernelpagesize_kB=4 anon=1 dirty=1 file=/tmp/hmap2
7f021560-7f021580 N0=1 kernelpagesize_kB=2048 dirty=1 file=/mnt/f1
7f021580-7f0215c0 -  file=/mnt/f1
7f0215c0-7f0215e0 N0=1 kernelpagesize_kB=2048 dirty=1 file=/mnt/f1
7f0215e0-7f021620 -  file=/mnt/f1
..
7f0217ecb000-7f0217f2 N0=85 kernelpagesize_kB=4 mapped=85 mapmax=51
   file=/usr/lib64/libc-2.17.so
7f0217f2-7f0217f3 -  file=/usr/lib64/libc-2.17.so
7f0217f3-7f0217f9 N0=96 kernelpagesize_kB=4 mapped=96 mapmax=51
   file=/usr/lib64/libc-2.17.so
7f0217f9-7f0217fb -  file=/usr/lib64/libc-2.17.so
..

The 'pmap' command can be enhanced to include an option to show numa node
information, which it can read from this new proc file. This will be a
follow-on proposal.
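
To make the pmap enhancement concrete, parsing a numa_vamaps line only needs the address range and the first token after it (either '-' for an unmapped range or an Nn=count list). A hedged sketch, assuming the exact field layout shown in the snippet above:

/* Sketch: extract the VA range and node token from one numa_vamaps line. */
#include <stdio.h>

static int parse_vamaps_line(const char *line)
{
	unsigned long start, end;
	char tok[256];

	if (sscanf(line, "%lx-%lx %255s", &start, &end, tok) != 3)
		return -1;
	if (tok[0] == '-')
		printf("%#lx-%#lx: no pages mapped\n", start, end);
	else
		printf("%#lx-%#lx: %s\n", start, end, tok);	/* e.g. "N0=1" */
	return 0;
}

int main(void)
{
	/* Example lines lifted from the snippet above. */
	parse_vamaps_line("00601000-00602000 N0=1 kernelpagesize_kB=4 anon=1 dirty=1 file=/tmp/hmap2");
	parse_vamaps_line("7f021580-7f0215c0 -  file=/mnt/f1");
	return 0;
}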

There have been a couple of previous patch proposals to provide numa node
information based on the pfn or physical address. They seem to have not made
progress. Also, it would appear that reading numa node information based on the
PFN or physical address will require privileges (CAP_SYS_ADMIN), similar to
reading PFN info from /proc/<pid>/pagemap.

See
https://marc.info/?t=13963093821=1=2

https://marc.info/?t=13971872441=1=2

Signed-off-by: Prakash Sangappa <prakash.sanga...@oracle.com>
---
 fs/proc/base.c |   2 +
 fs/proc/internal.h |   3 +
 fs/proc/task_mmu.c | 299 -
 3 files changed, 303 insertions(+), 1 deletion(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 1b2ede6..8fd7cc5 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2960,6 +2960,7 @@ static const struct pid_entry tgid_base_stuff[] = {
REG("maps",   S_IRUGO, proc_pid_maps_operations),
 #ifdef CONFIG_NUMA
REG("numa_maps",  S_IRUGO, proc_pid_numa_maps_operations),
+   REG("numa_vamaps",  S_IRUGO, proc_pid_numa_vamaps_operations),
 #endif
REG("mem",S_IRUSR|S_IWUSR, proc_mem_operations),
LNK("cwd",proc_cwd_link),
@@ -3352,6 +3353,7 @@ static const struct pid_entry tid_base_stuff[] = {
 #endif
 #ifdef CONFIG_NUMA
REG("numa_maps", S_IRUGO, proc_tid_numa_maps_operations),
+   REG("numa_vamaps",  S_IRUGO, proc_tid_numa_vamaps_operations),
 #endif
REG("mem",   S_IRUSR|S_IWUSR, proc_mem_operations),
LNK("cwd",   proc_cwd_link),
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 0f1692e..9a3ff80 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -273,6 +273,7 @@ struct proc_maps_private {
 #ifdef CONFIG_NUMA
struct mempolicy *task_mempolicy;
 #endif
+   u64 vma_off;
 } __randomize_layout;
 
 struct mm_struct *proc_mem_open(struct inode *inode, unsigned int mode);
@@ -280,7 +281,9 @@ struct mm_struct *proc_mem_open(struct inode *inode, unsigned int mode);
 extern const struct file_operations proc_pid_maps_operations;
 extern const struct file_operations proc_tid_maps_operations;
 extern const struct file_operations proc_pid_numa_maps_operations;
+extern const struct file_operations proc_pid_numa_vamaps_operations;
 extern const struct file_operations proc_tid_numa_maps_operations;


[PATCH] Hugetlb pages rss accounting is incorrect in /proc/<pid>/smaps

2017-10-24 Thread Prakash Sangappa
Resident set size (Rss) accounting of hugetlb pages is currently not done
in /proc/<pid>/smaps. The pmap command reads Rss from
this file, and so it shows Rss as 0 in pmap -x output for
hugetlb mapped vmas. This patch fixes it.

Signed-off-by: Prakash Sangappa <prakash.sanga...@oracle.com>
---
 fs/proc/task_mmu.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 5589b4b..c7e1048 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -724,6 +724,7 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
mss->shared_hugetlb += huge_page_size(hstate_vma(vma));
else
mss->private_hugetlb += huge_page_size(hstate_vma(vma));
+   mss->resident += huge_page_size(hstate_vma(vma));
}
return 0;
 }
-- 
2.7.4
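
A quick way to see what the patch changes is to map and touch a hugetlb region and then look at the Rss field of that mapping's /proc/self/smaps entry: without the patch it reads 0 kB, with the patch applied the huge page size is accounted. A sketch, assuming a 2MB default huge page size and a reserved hugetlb pool (vm.nr_hugepages > 0):

/* Sketch: map and touch one huge page, then show its Rss line from smaps. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 2UL * 1024 * 1024;		/* assumes 2MB default huge page size */
	char addr[32], line[256];
	int in_vma = 0;
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap(MAP_HUGETLB)");	/* needs hugetlb pages reserved */
		return 1;
	}
	p[0] = 1;	/* fault in the huge page */

	snprintf(addr, sizeof(addr), "%lx-", (unsigned long)p);
	FILE *f = fopen("/proc/self/smaps", "r");
	while (f && fgets(line, sizeof(line), f)) {
		/* VMA header lines start with a lowercase hex address range. */
		if ((line[0] >= '0' && line[0] <= '9') ||
		    (line[0] >= 'a' && line[0] <= 'f'))
			in_vma = (strncmp(line, addr, strlen(addr)) == 0);
		else if (in_vma && strncmp(line, "Rss:", 4) == 0)
			fputs(line, stdout);	/* reads 0 kB without this patch */
	}
	return 0;
}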



Re: [PATCH v4] pidns: introduce syscall translate_pid

2017-10-17 Thread prakash sangappa


On 10/17/2017 3:40 PM, Andy Lutomirski wrote:

On Tue, Oct 17, 2017 at 3:35 PM, prakash sangappa
<prakash.sanga...@oracle.com> wrote:

On 10/17/2017 3:02 PM, Andy Lutomirski wrote:

On Tue, Oct 17, 2017 at 8:38 AM, Prakash Sangappa
<prakash.sanga...@oracle.com> wrote:


On 10/16/17 5:52 PM, Andy Lutomirski wrote:

On Mon, Oct 16, 2017 at 3:54 PM, prakash.sangappa
<prakash.sanga...@oracle.com> wrote:


On 10/16/2017 03:07 PM, Nagarathnam Muthusamy wrote:



On 10/16/2017 02:36 PM, Andrew Morton wrote:

On Sat, 14 Oct 2017 11:17:47 +0300 Konstantin Khlebnikov
<khlebni...@yandex-team.ru> wrote:


pid_t translate_pid(pid_t pid, int source, int target);

This syscall converts pid from source pid-ns into pid in target
pid-ns.
If pid is unreachable from target pid-ns it returns zero.

Pid-namespaces are referred file descriptors opened to proc files
/proc/[pid]/ns/pid or /proc/[pid]/ns/pid_for_children. Negative
argument
refers to current pid namespace, same as file /proc/self/ns/pid.

Kernel expose virtual pids in /proc/[pid]/status:NSpid, but
backward
translation requires scanning all tasks. Also pids could be
translated
by sending them through unix socket between namespaces, this
method
is
slow and insecure because other side is exposed inside pid
namespace.

Andrew asked why we might need this.

Such conversion is required for interaction between processes across
pid-namespaces.
For example to identify process in container by pid file looking
from
outside.

Two years ago I've solved this in project of mine with monstrous
code
which
forks couple times just to convert pid, lucky for me performance
wasn't
important.

That's a single user who needed this a single time, and found a
userspace-based solution anyway.  This is not exactly compelling!

Is there a stronger case to be made?  How does this change benefit
our
users?  Sell it to us!

Oracle database is planning to use pid namespace for sandboxing
database
instances and they need an API similar to translate_pid to effectively
translate process IDs from other pid namespaces. Prakash (cced in
mail)
can
provide more details on this usecase.


As Nagarathnam indicated, Oracle Database will be using pid namespaces
and
needs a direct method of converting pids of processes in the pid
namespace
hierarchy. In this use case multiple
nested PID namespaces will be used.  The currently available mechanism
are
not very efficient for this use case. For ex. as Konstantin described,
using
/proc/<pid>/status would require the application to scan all the pid's
status files to determine the pid of given process in a child
namespace.

Use of SCM_CREDENTIALS's socket message is another way, which would
require
every process starting inside a pid namespace to send this message and
the
receiving process in the target namespace would have to save the
converted
pid and reference it. This mechanism becomes cumbersome especially if
the
application has to deal with multiple nested pid namespaces. Also, the
Database needs to be able to convert a thread's global pid(gettid()).
Passing the thread's pid(gettid()) in SCM_CREDENTIALS message requires
CAP_SYS_ADMIN, which is an issue.

So having a direct method, like the API that Konstantin is proposing,
will
work best for the Database
since pid of a process in any of the nested pid namespaces can be
converted
as and when required. I think with the proposed API, the application
should
be able to convert pid of a process or tid(gettid()) of a thread as
well.


Can you explain what Oracle's database is planning to do with this
information?


Database uses the PID to programmatically find out if the process/thread
is
alive(kill 0) also send signals to the processes requesting it to dump
status/debug information and kill the processes in case of a shutdown
abort
of the instance.

What I'm wondering is: how does the caller of kill() end up
controlling a task whose pid it doesn't know in its own namespace?


I was generally describing how DB would use the PID of process. The above
description
was in the case when no namespaces are used.

With use of namespaces, the DB would convert the PID of processes inside
its children namespaces to PID in its namespace and use that pid to issue
kill().

Seems vaguely sensible.

If I were designing this type of system, I'd have a manager process in
each namespace running as PID 1, though -- PID 1 is special and needs
to understand what's going on anyway.  Then PID 1 would do the kill()
calls and wouldn't need translate_pid().


Yes, this has been tried out with the prototype use of PID namespaces in the DB.
It works, but would be slow as the manager would have to exchange messages with the
controlling processes which would be in the parent namespace.
DB could use the api to convert the pid.
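
Since translate_pid() is only a proposal at this point, there is no glibc wrapper and no allocated syscall number; the sketch below uses a hypothetical __NR_translate_pid purely to show the calling pattern described in this thread: translate a pid known inside a child pid namespace into the manager's namespace, then signal it directly. The container init pid and the inner pid are made-up values.

/* Sketch only: __NR_translate_pid is hypothetical, the syscall is not merged. */
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_translate_pid
#define __NR_translate_pid -1	/* placeholder; a number would be assigned if merged */
#endif

/* pid_t translate_pid(pid_t pid, int source, int target);
 * source/target are pid-namespace fds; a negative fd means "current namespace". */
static pid_t translate_pid(pid_t pid, int source, int target)
{
	return (pid_t)syscall(__NR_translate_pid, pid, source, target);
}

int main(void)
{
	int src = open("/proc/12345/ns/pid", O_RDONLY);	/* 12345: made-up container init pid */
	pid_t inner = 42;	/* made-up pid as seen inside that namespace */
	pid_t outer;

	if (src < 0) {
		perror("open ns/pid");
		return 1;
	}
	outer = translate_pid(inner, src, -1 /* current namespace */);
	if (outer <= 0) {
		fprintf(stderr, "translate_pid failed or pid not reachable\n");
		return 1;
	}
	kill(outer, 0);		/* liveness check, as in the DB use case */
	close(src);
	return 0;
}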



Re: [PATCH v4] pidns: introduce syscall translate_pid

2017-10-17 Thread prakash sangappa


On 10/17/2017 3:02 PM, Andy Lutomirski wrote:

On Tue, Oct 17, 2017 at 8:38 AM, Prakash Sangappa
<prakash.sanga...@oracle.com> wrote:


On 10/16/17 5:52 PM, Andy Lutomirski wrote:

On Mon, Oct 16, 2017 at 3:54 PM, prakash.sangappa
<prakash.sanga...@oracle.com> wrote:


On 10/16/2017 03:07 PM, Nagarathnam Muthusamy wrote:



On 10/16/2017 02:36 PM, Andrew Morton wrote:

On Sat, 14 Oct 2017 11:17:47 +0300 Konstantin Khlebnikov
<khlebni...@yandex-team.ru> wrote:


pid_t translate_pid(pid_t pid, int source, int target);

This syscall converts pid from source pid-ns into pid in target
pid-ns.
If pid is unreachable from target pid-ns it returns zero.

Pid-namespaces are referred file descriptors opened to proc files
/proc/[pid]/ns/pid or /proc/[pid]/ns/pid_for_children. Negative
argument
refers to current pid namespace, same as file /proc/self/ns/pid.

Kernel expose virtual pids in /proc/[pid]/status:NSpid, but
backward
translation requires scanning all tasks. Also pids could be
translated
by sending them through unix socket between namespaces, this method
is
slow and insecure because other side is exposed inside pid
namespace.

Andrew asked why we might need this.

Such conversion is required for interaction between processes across
pid-namespaces.
For example to identify process in container by pid file looking from
outside.

Two years ago I've solved this in project of mine with monstrous code
which
forks couple times just to convert pid, lucky for me performance
wasn't
important.

That's a single user who needed this a single time, and found a
userspace-based solution anyway.  This is not exactly compelling!

Is there a stronger case to be made?  How does this change benefit our
users?  Sell it to us!

Oracle database is planning to use pid namespace for sandboxing database
instances and they need an API similar to translate_pid to effectively
translate process IDs from other pid namespaces. Prakash (cced in mail)
can
provide more details on this usecase.


As Nagarathnam indicated, Oracle Database will be using pid namespaces
and
needs a direct method of converting pids of processes in the pid
namespace
hierarchy. In this use case multiple
nested PID namespaces will be used.  The currently available mechanism
are
not very efficient for this use case. For ex. as Konstantin described,
using
/proc/<pid>/status would require the application to scan all the pid's
status files to determine the pid of given process in a child namespace.

Use of SCM_CREDENTIALS's socket message is another way, which would
require
every process starting inside a pid namespace to send this message and
the
receiving process in the target namespace would have to save the
converted
pid and reference it. This mechanism becomes cumbersome especially if the
application has to deal with multiple nested pid namespaces. Also, the
Database needs to be able to convert a thread's global pid(gettid()).
Passing the thread's pid(gettid()) in SCM_CREDENTIALS message requires
CAP_SYS_ADMIN, which is an issue.

So having a direct method, like the API that Konstantin is proposing,
will
work best for the Database
since pid of a process in any of the nested pid namespaces can be
converted
as and when required. I think with the proposed API, the application
should
be able to convert pid of a process or tid(gettid()) of a thread as well.


Can you explain what Oracle's database is planning to do with this
information?


Database uses the PID to programmatically find out if the process/thread is
alive(kill 0) also send signals to the processes requesting it to dump
status/debug information and kill the processes in case of a shutdown abort
of the instance.

What I'm wondering is: how does the caller of kill() end up
controlling a task whose pid it doesn't know in its own namespace?


I was generally describing how DB would use the PID of process. The above
description was in the case when no namespaces are used.

With use of namespaces, the DB would convert the PID of processes inside
its children namespaces to PID in its namespace and use that pid to 
issue kill().


-Prakash.




-Prakash.
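
The /proc/[pid]/status based alternative mentioned above can be illustrated with a short sketch that reads the NSpid: line (the pid as seen in each nested namespace, outermost first) for one task; the backward translation Konstantin describes would have to repeat this over every pid directory, which is the inefficiency being pointed out. The target pid is a made-up value.

/* Sketch: read the NSpid: line for one task; a real scan repeats this per pid. */
#include <stdio.h>
#include <string.h>

int main(void)
{
	const char *path = "/proc/12345/status";	/* 12345: made-up task to inspect */
	char line[256];
	FILE *f = fopen(path, "r");

	if (!f) {
		perror("fopen");
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		/* "NSpid:  <pid in outermost ns>  <pid in next ns> ..." */
		if (strncmp(line, "NSpid:", 6) == 0) {
			fputs(line, stdout);
			break;
		}
	}
	fclose(f);
	return 0;
}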






Re: [PATCH v4] pidns: introduce syscall translate_pid

2017-10-17 Thread Prakash Sangappa



On 10/16/17 5:52 PM, Andy Lutomirski wrote:

On Mon, Oct 16, 2017 at 3:54 PM, prakash.sangappa
 wrote:


On 10/16/2017 03:07 PM, Nagarathnam Muthusamy wrote:



On 10/16/2017 02:36 PM, Andrew Morton wrote:

On Sat, 14 Oct 2017 11:17:47 +0300 Konstantin Khlebnikov
 wrote:


pid_t translate_pid(pid_t pid, int source, int target);

This syscall converts pid from source pid-ns into pid in target
pid-ns.
If pid is unreachable from target pid-ns it returns zero.

Pid-namespaces are referred file descriptors opened to proc files
/proc/[pid]/ns/pid or /proc/[pid]/ns/pid_for_children. Negative
argument
refers to current pid namespace, same as file /proc/self/ns/pid.

Kernel expose virtual pids in /proc/[pid]/status:NSpid, but backward
translation requires scanning all tasks. Also pids could be
translated
by sending them through unix socket between namespaces, this method
is
slow and insecure because other side is exposed inside pid namespace.

Andrew asked why we might need this.

Such conversion is required for interaction between processes across
pid-namespaces.
For example to identify process in container by pid file looking from
outside.

Two years ago I've solved this in project of mine with monstrous code
which
forks couple times just to convert pid, lucky for me performance wasn't
important.

That's a single user who needed this a single time, and found a
userspace-based solution anyway.  This is not exactly compelling!

Is there a stronger case to be made?  How does this change benefit our
users?  Sell it to us!

Oracle database is planning to use pid namespace for sandboxing database
instances and they need an API similar to translate_pid to effectively
translate process IDs from other pid namespaces. Prakash (cced in mail) can
provide more details on this usecase.


As Nagarathnam indicated, Oracle Database will be using pid namespaces and
needs a direct method of converting pids of processes in the pid namespace
hierarchy. In this use case multiple
nested PID namespaces will be used.  The currently available mechanism are
not very efficient for this use case. For ex. as Konstantin described, using
/proc/<pid>/status would require the application to scan all the pid's
status files to determine the pid of given process in a child namespace.

Use of SCM_CREDENTIALS's socket message is another way, which would require
every process starting inside a pid namespace to send this message and the
receiving process in the target namespace would have to save the converted
pid and reference it. This mechanism becomes cumbersome especially if the
application has to deal with multiple nested pid namespaces. Also, the
Database needs to be able to convert a thread's global pid(gettid()).
Passing the thread's pid(gettid()) in SCM_CREDENTIALS message requires
CAP_SYS_ADMIN, which is an issue.

So having a direct method, like the API that Konstantin is proposing, will
work best for the Database
since pid of a process in any of the nested pid namespaces can be converted
as and when required. I think with the proposed API, the application should
be able to convert pid of a process or tid(gettid()) of a thread as well.



Can you explain what Oracle's database is planning to do with this information?


The database uses the PID to programmatically find out if the process/thread
is alive (kill 0), and also to send signals to the processes requesting them to
dump status/debug information, and to kill the processes in case of a
shutdown abort of the instance.


-Prakash.




[PATCH v2] Userfaultfd: Add description for UFFD_FEATURE_SIGBUS

2017-10-09 Thread Prakash Sangappa
The userfaultfd feature UFFD_FEATURE_SIGBUS was merged recently and should
be available in the Linux 4.14 release. This patch provides the manpage
changes documenting this API.

Documents the following commit:

commit 2d6d6f5a09a96cc1fec7ed992b825e05f64cb50e
Author: Prakash Sangappa <prakash.sanga...@oracle.com>
Date: Wed Sep 6 16:23:39 2017 -0700

 mm: userfaultfd: add feature to request for a signal delivery

Signed-off-by: Prakash Sangappa <prakash.sanga...@oracle.com>
---
v2: Incorporated review feedback changes.
---
 man2/ioctl_userfaultfd.2 |  9 +
 man2/userfaultfd.2   | 23 +++
 2 files changed, 32 insertions(+)

diff --git a/man2/ioctl_userfaultfd.2 b/man2/ioctl_userfaultfd.2
index 60fd29b..32f0744 100644
--- a/man2/ioctl_userfaultfd.2
+++ b/man2/ioctl_userfaultfd.2
@@ -196,6 +196,15 @@ with the
 flag set,
 .BR memfd_create (2),
 and so on.
+.TP
+.B UFFD_FEATURE_SIGBUS
+Since Linux 4.14, If this feature bit is set, no page-fault events
+.B (UFFD_EVENT_PAGEFAULT)
+will be delivered, instead a
+.B SIGBUS
+signal will be sent to the faulting process. Applications using this
+feature will not require the use of a userfaultfd monitor for processing
+memory accesses to the regions registered with userfaultfd.
 .IP
 The returned
 .I ioctls
diff --git a/man2/userfaultfd.2 b/man2/userfaultfd.2
index 1741ee3..3c5b9c0 100644
--- a/man2/userfaultfd.2
+++ b/man2/userfaultfd.2
@@ -172,6 +172,29 @@ or
 .BR ioctl (2)
 operations to resolve the page fault.
 .PP
+Starting from Linux 4.14, if application sets
+.B UFFD_FEATURE_SIGBUS
+feature bit using
+.B UFFDIO_API
+.BR ioctl (2),
+no page fault notification will be forwarded to
+the user-space, instead a
+.B SIGBUS
+signal is delivered to the faulting process. With this feature,
+userfaultfd can be used for robustness purpose to simply catch
+any access to areas within the registered address range that do not
+have pages allocated, without having to listen to userfaultfd events.
+No userfaultfd monitor will be required for dealing with such memory
+accesses. For example, this feature can be useful for applications that
+want to prevent the kernel from automatically allocating pages and filling
+holes in sparse files when the hole is accessed thru mapped address.
+.PP
+The
+.B UFFD_FEATURE_SIGBUS
+feature is implicitly inherited through fork() if used in combination with
+.BR UFFD_FEATURE_FORK .
+
+.PP
 Details of the various
 .BR ioctl (2)
 operations can be found in
-- 
2.7.4
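
For reference, a minimal sketch of enabling the feature this manpage text describes: create a userfaultfd, request UFFD_FEATURE_SIGBUS in the UFFDIO_API handshake, and register a mapping; a later access to a missing page in that range then raises SIGBUS instead of queueing an event for a monitor. The mapping size and flags are illustrative, and a kernel of at least 4.14 is assumed.

/* Sketch: userfaultfd with UFFD_FEATURE_SIGBUS (kernel >= 4.14 assumed). */
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	size_t len = 4096;
	void *area;
	struct uffdio_api api;
	struct uffdio_register reg;
	int ufd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);

	if (ufd < 0) {
		perror("userfaultfd");
		return 1;
	}
	memset(&api, 0, sizeof(api));
	api.api = UFFD_API;
	api.features = UFFD_FEATURE_SIGBUS;	/* ask for SIGBUS instead of events */
	if (ioctl(ufd, UFFDIO_API, &api) == -1) {
		perror("UFFDIO_API");
		return 1;
	}
	area = mmap(NULL, len, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	memset(&reg, 0, sizeof(reg));
	reg.range.start = (unsigned long)area;
	reg.range.len = len;
	reg.mode = UFFDIO_REGISTER_MODE_MISSING;
	if (ioctl(ufd, UFFDIO_REGISTER, &reg) == -1) {
		perror("UFFDIO_REGISTER");
		return 1;
	}
	/* Touching 'area' now delivers SIGBUS rather than a UFFD_EVENT_PAGEFAULT. */
	printf("registered %p (+%zu) with UFFD_FEATURE_SIGBUS\n", area, len);
	return 0;
}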



Re: [PATCH] Userfaultfd: Add description for UFFD_FEATURE_SIGBUS

2017-10-06 Thread Prakash Sangappa

cc: Andrea Arcangeli


On 10/6/17 7:52 PM, Prakash Sangappa wrote:

Userfaultfd feature UFFD_FEATURE_SIGBUS was merged recently and should
be available in Linux 4.14 release. This patch is for the manpage
changes documenting this API.

Documents the following commit:

commit 2d6d6f5a09a96cc1fec7ed992b825e05f64cb50e
Author: Prakash Sangappa <prakash.sanga...@oracle.com>
Date: Wed Sep 6 16:23:39 2017 -0700

 mm: userfaultfd: add feature to request for a signal delivery

Signed-off-by: Prakash Sangappa <prakash.sanga...@oracle.com>
---
  man2/ioctl_userfaultfd.2 |  9 +
  man2/userfaultfd.2   | 17 +
  2 files changed, 26 insertions(+)

diff --git a/man2/ioctl_userfaultfd.2 b/man2/ioctl_userfaultfd.2
index 60fd29b..cfc65ae 100644
--- a/man2/ioctl_userfaultfd.2
+++ b/man2/ioctl_userfaultfd.2
@@ -196,6 +196,15 @@ with the
  flag set,
  .BR memfd_create (2),
  and so on.
+.TP
+.B UFFD_FEATURE_SIGBUS
+Since Linux 4.14, If this feature bit is set, no page-fault events(
+.B UFFD_EVENT_PAGEFAULT
+) will be delivered, instead a
+.B SIGBUS
+signal will be sent to the faulting process. Applications using this
+feature will not require the use of a userfaultfd monitor for handling
+page-fault events.
  .IP
  The returned
  .I ioctls
diff --git a/man2/userfaultfd.2 b/man2/userfaultfd.2
index 1741ee3..a033742 100644
--- a/man2/userfaultfd.2
+++ b/man2/userfaultfd.2
@@ -172,6 +172,23 @@ or
  .BR ioctl (2)
  operations to resolve the page fault.
  .PP
+Starting from Linux 4.14, if application sets
+.B UFFD_FEATURE_SIGBUS
+feature bit using
+.B UFFDIO_API
+.BR ioctl (2)
+, no page fault notification will be forwarded to
+the user-space, instead a
+.B SIGBUS
+signal is delivered to the faulting process. With this feature,
+userfaultfd can be used for robustness purpose to simply catch
+any access to areas within the registered address range that do not
+have pages allocated, without having to deal with page-fault events.
+No userfaultd monitor will be required for handling page faults. For
+example, this feature can be useful for applications that want to
+prevent the kernel from automatically allocating pages and filling
+holes in sparse files when the hole is accessed thru mapped address.
+.PP
  Details of the various
  .BR ioctl (2)
  operations can be found in




[PATCH] Userfaultfd: Add description for UFFD_FEATURE_SIGBUS

2017-10-06 Thread Prakash Sangappa
Userfaultfd feature UFFD_FEATURE_SIGBUS was merged recently and should
be available in Linux 4.14 release. This patch is for the manpage
changes documenting this API.

Documents the following commit:

commit 2d6d6f5a09a96cc1fec7ed992b825e05f64cb50e
Author: Prakash Sangappa <prakash.sanga...@oracle.com>
Date: Wed Sep 6 16:23:39 2017 -0700

mm: userfaultfd: add feature to request for a signal delivery

Signed-off-by: Prakash Sangappa <prakash.sanga...@oracle.com>
---
 man2/ioctl_userfaultfd.2 |  9 +
 man2/userfaultfd.2   | 17 +
 2 files changed, 26 insertions(+)

diff --git a/man2/ioctl_userfaultfd.2 b/man2/ioctl_userfaultfd.2
index 60fd29b..cfc65ae 100644
--- a/man2/ioctl_userfaultfd.2
+++ b/man2/ioctl_userfaultfd.2
@@ -196,6 +196,15 @@ with the
 flag set,
 .BR memfd_create (2),
 and so on.
+.TP
+.B UFFD_FEATURE_SIGBUS
+Since Linux 4.14, If this feature bit is set, no page-fault events(
+.B UFFD_EVENT_PAGEFAULT
+) will be delivered, instead a
+.B SIGBUS
+signal will be sent to the faulting process. Applications using this
+feature will not require the use of a userfaultfd monitor for handling
+page-fault events.
 .IP
 The returned
 .I ioctls
diff --git a/man2/userfaultfd.2 b/man2/userfaultfd.2
index 1741ee3..a033742 100644
--- a/man2/userfaultfd.2
+++ b/man2/userfaultfd.2
@@ -172,6 +172,23 @@ or
 .BR ioctl (2)
 operations to resolve the page fault.
 .PP
+Starting from Linux 4.14, if application sets
+.B UFFD_FEATURE_SIGBUS
+feature bit using
+.B UFFDIO_API
+.BR ioctl (2)
+, no page fault notification will be forwarded to
+the user-space, instead a
+.B SIGBUS
+signal is delivered to the faulting process. With this feature,
+userfaultfd can be used for robustness purpose to simply catch
+any access to areas within the registered address range that do not
+have pages allocated, without having to deal with page-fault events.
+No userfaultd monitor will be required for handling page faults. For
+example, this feature can be useful for applications that want to
+prevent the kernel from automatically allocating pages and filling
+holes in sparse files when the hole is accessed thru mapped address.
+.PP
 Details of the various
 .BR ioctl (2)
 operations can be found in
-- 
2.7.4
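
For illustration only, here is a minimal sketch (not taken from any patch in this
thread) of how an application might enable UFFD_FEATURE_SIGBUS so that touching a
missing page in a registered range raises SIGBUS instead of queueing a page-fault
event. The mapping size and flags are arbitrary and error handling is abbreviated.

#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);

	/* userfaultfd(2) has no glibc wrapper here; invoke it via syscall(2). */
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	if (uffd < 0) {
		perror("userfaultfd");
		return 1;
	}

	/* API handshake: request the SIGBUS feature (Linux >= 4.14). */
	struct uffdio_api api = {
		.api = UFFD_API,
		.features = UFFD_FEATURE_SIGBUS,
	};
	if (ioctl(uffd, UFFDIO_API, &api)) {
		perror("UFFDIO_API");
		return 1;
	}

	void *area = mmap(NULL, page, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (area == MAP_FAILED)
		return 1;

	/* Register the range for missing-page faults. */
	struct uffdio_register reg = {
		.range = { .start = (unsigned long)area, .len = page },
		.mode = UFFDIO_REGISTER_MODE_MISSING,
	};
	if (ioctl(uffd, UFFDIO_REGISTER, &reg)) {
		perror("UFFDIO_REGISTER");
		return 1;
	}

	/* This access now delivers SIGBUS instead of a userfault event. */
	*(volatile char *)area = 1;
	return 0;
}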



Re: [RESEND PATCH] Allow passing tid or pid in SCM_CREDENTIALS without CAP_SYS_ADMIN

2017-09-01 Thread Prakash Sangappa



On 8/30/17 10:41 AM, ebied...@xmission.com wrote:

Prakash Sangappa <prakash.sanga...@oracle.com> writes:



With regards to security, the question basically is what the consequence
of passing the wrong id is. As I understand it, whether the id is interpreted as a pid
or a tid, the effective uid and gid will be the same. It would be a problem
only if the incorrect interpretation of the id were to refer to a different process.
But that cannot happen, as the global tid (gettid(2)) of a thread is
unique.

There is also the issue that the receiving process could look, not see
the pid in proc and assume the sending process is dead.  That I suspect
is the larger danger.



Will this not be a bug in the application, if it is sending the wrong id?


As long as the thread is alive, that id cannot reference another process or
thread, unless the thread were to exit and the id got recycled and reused for another
thread or process. This would be no different from a process exiting and its
pid getting recycled, which is the case now.

Largely I agree.

If all you want are pid translations I suspect there are far easier ways
than updating the SCM_CREDENTIALS code.


What would be another, easier and more efficient way of doing pid translation?

Should a new API/mechanism be considered mainly for pid-translation purposes
for use with pid namespaces, say based on a pipe, something similar to
I_SENDFD?


Thanks,
-Prakash.


Eric





[RESEND PATCH] Allow passing tid or pid in SCM_CREDENTIALS without CAP_SYS_ADMIN

2017-08-28 Thread Prakash Sangappa
Currently, passing the tid (gettid(2)) of a thread in struct ucred in an
SCM_CREDENTIALS message requires the CAP_SYS_ADMIN capability; otherwise
it fails with an EPERM error. Some applications deal with the thread id
of a thread (tid), so it would help to allow a tid in an SCM_CREDENTIALS
message. Basically, either the tgid (the pid of the process) or the tid of
the thread should be allowed without the need for the CAP_SYS_ADMIN capability.

SCM_CREDENTIALS will be used to determine the global id of a process or
a thread running inside a pid namespace.

This patch adds necessary check to accept tid in SCM_CREDENTIALS
struct ucred.

Signed-off-by: Prakash Sangappa <prakash.sanga...@oracle.com>
---
 net/core/scm.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/core/scm.c b/net/core/scm.c
index b1ff8a4..9274197 100644
--- a/net/core/scm.c
+++ b/net/core/scm.c
@@ -55,6 +55,7 @@ static __inline__ int scm_check_creds(struct ucred *creds)
return -EINVAL;
 
if ((creds->pid == task_tgid_vnr(current) ||
+creds->pid == task_pid_vnr(current) ||
 ns_capable(task_active_pid_ns(current)->user_ns, CAP_SYS_ADMIN)) &&
((uid_eq(uid, cred->uid)   || uid_eq(uid, cred->euid) ||
  uid_eq(uid, cred->suid)) || ns_capable(cred->user_ns, 
CAP_SETUID)) &&
-- 
2.7.4
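
For reference, a minimal sketch (not part of the patch) of the sending side this
change affects: placing the sender's tid in ucred.pid of an SCM_CREDENTIALS message.
The socket pair and payload are arbitrary; per the commit message above, without the
patch the sendmsg() fails with EPERM unless the caller has CAP_SYS_ADMIN.

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	int sv[2];
	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv))
		return 1;

	struct ucred cred = {
		.pid = syscall(__NR_gettid),	/* tid instead of getpid() */
		.uid = getuid(),
		.gid = getgid(),
	};

	char data = 'x';
	struct iovec iov = { .iov_base = &data, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(cred))];
		struct cmsghdr align;
	} u;
	struct msghdr msg = {
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = u.buf, .msg_controllen = sizeof(u.buf),
	};

	/* Attach the explicit credentials as an SCM_CREDENTIALS control message. */
	struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_CREDENTIALS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(cred));
	memcpy(CMSG_DATA(cmsg), &cred, sizeof(cred));

	/* The receiver would normally enable SO_PASSCRED to read the credentials. */
	if (sendmsg(sv[0], &msg, 0) < 0)
		perror("sendmsg");	/* EPERM when the tid is rejected */
	return 0;
}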



[PATCH] Allow passing tid or pid in SCM_CREDENTIALS without CAP_SYS_ADMIN

2017-08-07 Thread Prakash Sangappa
Currently, passing the tid (gettid(2)) of a thread in struct ucred in an
SCM_CREDENTIALS message requires the CAP_SYS_ADMIN capability; otherwise
it fails with an EPERM error. Some applications deal with the thread id
of a thread (tid), so it would help to allow a tid in an SCM_CREDENTIALS
message. Basically, either the tgid (the pid of the process) or the tid of
the thread should be allowed without the need for the CAP_SYS_ADMIN capability.

SCM_CREDENTIALS will be used to determine the global id of a process or
a thread running inside a pid namespace.

This patch adds necessary check to accept tid in SCM_CREDENTIALS
struct ucred.

Signed-off-by: Prakash Sangappa <prakash.sanga...@oracle.com>
---
 net/core/scm.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/core/scm.c b/net/core/scm.c
index b1ff8a4..9274197 100644
--- a/net/core/scm.c
+++ b/net/core/scm.c
@@ -55,6 +55,7 @@ static __inline__ int scm_check_creds(struct ucred *creds)
return -EINVAL;
 
if ((creds->pid == task_tgid_vnr(current) ||
+creds->pid == task_pid_vnr(current) ||
 ns_capable(task_active_pid_ns(current)->user_ns, CAP_SYS_ADMIN)) &&
((uid_eq(uid, cred->uid)   || uid_eq(uid, cred->euid) ||
  uid_eq(uid, cred->suid)) || ns_capable(cred->user_ns, 
CAP_SETUID)) &&
-- 
2.7.4



[RESEND PATCH v3 2/2] userfaultfd: selftest: Add tests for UFFD_FEATURE_SIGBUS feature

2017-07-31 Thread Prakash Sangappa
This patch adds tests for UFFD_FEATURE_SIGBUS feature. The
tests will verify signal delivery instead of userfault events.
Also, test use of UFFDIO_COPY to allocate memory and retry
accessing monitored area after signal delivery.

This patch also fixes a bug in uffd_poll_thread() where 'uffd'
is leaked.

Signed-off-by: Prakash Sangappa <prakash.sanga...@oracle.com>
---
Change log

v3:  Eliminated use of sig_repeat variable and simplified error return.

v2:
  - Added comments to explain the tests.
  - Fixed test to fail immediately if signal repeats.
  - Addressed other review comments.

v1: https://lkml.org/lkml/2017/7/26/101
---
 tools/testing/selftests/vm/userfaultfd.c |  127 +-
 1 files changed, 124 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/vm/userfaultfd.c 
b/tools/testing/selftests/vm/userfaultfd.c
index 1eae79a..52740ae 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -66,6 +66,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #ifdef __NR_userfaultfd
 
@@ -408,6 +409,7 @@ static int copy_page(int ufd, unsigned long offset)
userfaults++;
break;
case UFFD_EVENT_FORK:
+   close(uffd);
uffd = msg.arg.fork.ufd;
pollfd[0].fd = uffd;
break;
@@ -572,6 +574,17 @@ static int userfaultfd_open(int features)
return 0;
 }
 
+sigjmp_buf jbuf, *sigbuf;
+
+static void sighndl(int sig, siginfo_t *siginfo, void *ptr)
+{
+   if (sig == SIGBUS) {
+   if (sigbuf)
+   siglongjmp(*sigbuf, 1);
+   abort();
+   }
+}
+
 /*
  * For non-cooperative userfaultfd test we fork() a process that will
  * generate pagefaults, will mremap the area monitored by the
@@ -585,19 +598,59 @@ static int userfaultfd_open(int features)
  * The release of the pages currently generates event for shmem and
  * anonymous memory (UFFD_EVENT_REMOVE), hence it is not checked
  * for hugetlb.
+ * For signal test(UFFD_FEATURE_SIGBUS), signal_test = 1, we register
+ * monitored area, generate pagefaults and test that signal is delivered.
+ * Use UFFDIO_COPY to allocate missing page and retry. For signal_test = 2
+ * test robustness use case - we release monitored area, fork a process
+ * that will generate pagefaults and verify signal is generated.
+ * This also tests UFFD_FEATURE_EVENT_FORK event along with the signal
+ * feature. Using monitor thread, verify no userfault events are generated.
  */
-static int faulting_process(void)
+static int faulting_process(int signal_test)
 {
unsigned long nr;
unsigned long long count;
unsigned long split_nr_pages;
+   unsigned long lastnr;
+   struct sigaction act;
+   unsigned long signalled = 0;
 
if (test_type != TEST_HUGETLB)
split_nr_pages = (nr_pages + 1) / 2;
else
split_nr_pages = nr_pages;
 
+   if (signal_test) {
+   sigbuf = &jbuf;
+   memset(&act, 0, sizeof(act));
+   act.sa_sigaction = sighndl;
+   act.sa_flags = SA_SIGINFO;
+   if (sigaction(SIGBUS, &act, 0)) {
+   perror("sigaction");
+   return 1;
+   }
+   lastnr = (unsigned long)-1;
+   }
+
for (nr = 0; nr < split_nr_pages; nr++) {
+   if (signal_test) {
+   if (sigsetjmp(*sigbuf, 1) != 0) {
+   if (nr == lastnr) {
+   fprintf(stderr, "Signal repeated\n");
+   return 1;
+   }
+
+   lastnr = nr;
+   if (signal_test == 1) {
+   if (copy_page(uffd, nr * page_size))
+   signalled++;
+   } else {
+   signalled++;
+   continue;
+   }
+   }
+   }
+
count = *area_count(area_dst, nr);
if (count != count_verify[nr]) {
fprintf(stderr,
@@ -607,6 +660,9 @@ static int faulting_process(void)
}
}
 
+   if (signal_test)
+   return signalled != split_nr_pages;
+
if (test_type == TEST_HUGETLB)
return 0;
 
@@ -761,7 +817,7 @@ static int userfaultfd_events_test(void)
perror("fork"), exit(1);
 
if (!pid)
-   return faulting_process();
+   return faulting_process(0);
 
waitpid(pid, &err, 0);
if (err)
@@ -778,6 +834,70 @@ static int userfaultfd_events_test(void)
retu
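
As an aside, the UFFDIO_COPY step that the signal_test == 1 path in the diff above
exercises looks roughly like the following sketch; the helper name and parameters are
illustrative and not from the selftest. The SIGBUS handler returns via siglongjmp(),
the missing page is resolved with UFFDIO_COPY, and the access is then retried.

#include <linux/userfaultfd.h>
#include <sys/ioctl.h>

/*
 * Resolve a missing page at 'dst_addr' by copying 'page_size' bytes from
 * 'src' into the registered range, then let the caller retry the access.
 */
static int resolve_missing_page(int uffd, unsigned long dst_addr,
				void *src, unsigned long page_size)
{
	struct uffdio_copy copy = {
		.dst = dst_addr,		/* page-aligned faulting address */
		.src = (unsigned long)src,	/* buffer holding the page contents */
		.len = page_size,
		.mode = 0,
	};

	/* On success, copy.copy holds the number of bytes copied. */
	return ioctl(uffd, UFFDIO_COPY, &copy) ? -1 : 0;
}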


[RESEND PATCH v3 1/2] userfaultfd: Add feature to request for a signal delivery

2017-07-31 Thread Prakash Sangappa
In some cases, the userfaultfd mechanism should just deliver a SIGBUS signal
to the faulting process, instead of the page-fault event. Dealing with
page-fault events using a monitor thread can be an overhead in these
cases. For example, applications like the database could use the signaling
mechanism for robustness purposes.

The database uses hugetlbfs for performance reasons. Files on a hugetlbfs
filesystem are created and huge pages are allocated using the fallocate() API.
Pages are deallocated/freed using fallocate() hole punching support.
These files are mmapped and accessed by many processes as shared memory.
The database keeps track of which offsets in the hugetlbfs file have
pages allocated.

Any access to a mapped address over holes in the file, which can occur due
to bugs in the application, is considered invalid, and the process is expected
to simply receive a SIGBUS.  However, currently when a hole in the file is
accessed via the mapped address, kernel/mm attempts to automatically
allocate a page at page-fault time, resulting in implicitly filling the
hole in the file. This may not be the desired behavior for applications
like the database that want to explicitly manage page allocations of
hugetlbfs files.

Using userfaultfd mechanism with this support to get a signal, database
application can prevent pages from being allocated implicitly when
processes access mapped address over holes in the file.

This patch adds the UFFD_FEATURE_SIGBUS feature to the userfaultfd mechanism to
request a SIGBUS signal.

See following for previous discussion about the database requirement
leading to this proposal as suggested by Andrea.

http://www.spinics.net/lists/linux-mm/msg129224.html

Signed-off-by: Prakash Sangappa <prakash.sanga...@oracle.com>
Reviewed-by: Mike Rapoport <r...@linux.vnet.ibm.com>
Reviewed-by: Andrea Arcangeli <aarca...@redhat.com>
---
 fs/userfaultfd.c |3 +++
 include/uapi/linux/userfaultfd.h |   10 +-
 2 files changed, 12 insertions(+), 1 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 1d622f2..0bbe7df 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -371,6 +371,9 @@ int handle_userfault(struct vm_fault *vmf, unsigned long 
reason)
VM_BUG_ON(reason & ~(VM_UFFD_MISSING|VM_UFFD_WP));
VM_BUG_ON(!(reason & VM_UFFD_MISSING) ^ !!(reason & VM_UFFD_WP));
 
+   if (ctx->features & UFFD_FEATURE_SIGBUS)
+   goto out;
+
/*
 * If it's already released don't get it. This avoids to loop
 * in __get_user_pages if userfaultfd_release waits on the
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 3b05953..d39d5db 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -23,7 +23,8 @@
   UFFD_FEATURE_EVENT_REMOVE |  \
   UFFD_FEATURE_EVENT_UNMAP |   \
   UFFD_FEATURE_MISSING_HUGETLBFS | \
-  UFFD_FEATURE_MISSING_SHMEM)
+  UFFD_FEATURE_MISSING_SHMEM | \
+  UFFD_FEATURE_SIGBUS)
 #define UFFD_API_IOCTLS\
((__u64)1 << _UFFDIO_REGISTER | \
 (__u64)1 << _UFFDIO_UNREGISTER |   \
@@ -153,6 +154,12 @@ struct uffdio_api {
 * UFFD_FEATURE_MISSING_SHMEM works the same as
 * UFFD_FEATURE_MISSING_HUGETLBFS, but it applies to shmem
 * (i.e. tmpfs and other shmem based APIs).
+*
+* UFFD_FEATURE_SIGBUS feature means no page-fault
+* (UFFD_EVENT_PAGEFAULT) event will be delivered, instead
+* a SIGBUS signal will be sent to the faulting process.
+* The application process can enable this behavior by adding
+* it to uffdio_api.features.
 */
 #define UFFD_FEATURE_PAGEFAULT_FLAG_WP (1<<0)
 #define UFFD_FEATURE_EVENT_FORK(1<<1)
@@ -161,6 +168,7 @@ struct uffdio_api {
 #define UFFD_FEATURE_MISSING_HUGETLBFS (1<<4)
 #define UFFD_FEATURE_MISSING_SHMEM (1<<5)
 #define UFFD_FEATURE_EVENT_UNMAP   (1<<6)
+#define UFFD_FEATURE_SIGBUS(1<<7)
__u64 features;
 
__u64 ioctls;
-- 
1.7.1
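
To make the workflow in the commit message concrete, here is a rough sketch of
allocating huge pages in a hugetlbfs file with fallocate() and then freeing them by
punching a hole; an access through a mapping of the punched range is what the SIGBUS
feature is meant to catch. The mount path and the 2MB huge page size are assumptions
for illustration, not taken from the patch.

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define HPAGE (2UL * 1024 * 1024)	/* assumed 2MB huge page size */

int main(void)
{
	/* hypothetical file on a hugetlbfs mount */
	int fd = open("/dev/hugepages/shmfile", O_CREAT | O_RDWR, 0600);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* allocate four huge pages worth of the file */
	if (fallocate(fd, 0, 0, 4 * HPAGE))
		perror("fallocate");

	/* deallocate (punch a hole over) the second huge page */
	if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		      HPAGE, HPAGE))
		perror("punch hole");

	void *p = mmap(NULL, 4 * HPAGE, PROT_READ | PROT_WRITE,
		       MAP_SHARED, fd, 0);
	(void)p;	/* touching p + HPAGE now faults over the hole */
	return 0;
}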



[RESEND PATCH v3 0/2] userfaultfd: selftest: Add tests for UFFD_FEATURE_SIGBUS feature

2017-07-31 Thread Prakash Sangappa
Resending, Andrew asked if I could resend the whole set.
Just added 'Reviewed-by' to commit log for 1/2, no other
changes.

v3 patch for 2/2 as before
http://marc.info/?l=linux-mm=150148267803458=2

Previous patch set here:
http://marc.info/?l=linux-mm=150095815413275=2

Prakash Sangappa (2):
  userfaultfd: Add feature to request for a signal delivery
  userfaultfd: selftest: Add tests for UFFD_FEATURE_SIGBUS feature

 fs/userfaultfd.c |3 +
 include/uapi/linux/userfaultfd.h |   10 ++-
 tools/testing/selftests/vm/userfaultfd.c |  127 +-
 3 files changed, 136 insertions(+), 4 deletions(-)



Re: [PATCH v2 2/2] userfaultfd: selftest: Add tests for UFFD_FEATURE_SIGBUS feature

2017-07-31 Thread Prakash Sangappa



On 7/30/17 12:07 AM, Mike Rapoport wrote:

On Thu, Jul 27, 2017 at 10:18:40PM -0400, Prakash Sangappa wrote:

This patch adds tests for UFFD_FEATURE_SIGBUS feature. The
tests will verify signal delivery instead of userfault events.
Also, test use of UFFDIO_COPY to allocate memory and retry
accessing monitored area after signal delivery.

This patch also fixes a bug in uffd_poll_thread() where 'uffd'
is leaked.

Signed-off-by: Prakash Sangappa <prakash.sanga...@oracle.com>
---
Change log

v2:
   - Added comments to explain the tests.
   - Fixed test to fail immediately if signal repeats.
   - Addressed other review comments.

v1: https://lkml.org/lkml/2017/7/26/101
---

Overall looks good to me, just small nitpick below.

[...]

for (nr = 0; nr < split_nr_pages; nr++) {
+   if (signal_test) {
+   if (sigsetjmp(*sigbuf, 1) != 0) {
+   if (nr == lastnr) {
+   sig_repeats++;

You can simply 'return 1' here, then sig_repeats variable can be dropped
and the return statement for signal_test can be simplified.


Ok, sent v3 patch with this change.

Thanks,
-Prakash.



[PATCH v3 2/2] userfaultfd: selftest: Add tests for UFFD_FEATURE_SIGBUS feature

2017-07-31 Thread Prakash Sangappa
This patch adds tests for UFFD_FEATURE_SIGBUS feature. The
tests will verify signal delivery instead of userfault events.
Also, test use of UFFDIO_COPY to allocate memory and retry
accessing monitored area after signal delivery.

This patch also fixes a bug in uffd_poll_thread() where 'uffd'
is leaked.

Signed-off-by: Prakash Sangappa <prakash.sanga...@oracle.com>
---
Change log

v3:  Eliminated use of sig_repeat variable and simplified error return.

v2:
  - Added comments to explain the tests.
  - Fixed test to fail immediately if signal repeats.
  - Addressed other review comments.

v1: https://lkml.org/lkml/2017/7/26/101
---
 tools/testing/selftests/vm/userfaultfd.c |  127 +-
 1 files changed, 124 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/vm/userfaultfd.c 
b/tools/testing/selftests/vm/userfaultfd.c
index 1eae79a..52740ae 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -66,6 +66,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #ifdef __NR_userfaultfd
 
@@ -408,6 +409,7 @@ static int copy_page(int ufd, unsigned long offset)
userfaults++;
break;
case UFFD_EVENT_FORK:
+   close(uffd);
uffd = msg.arg.fork.ufd;
pollfd[0].fd = uffd;
break;
@@ -572,6 +574,17 @@ static int userfaultfd_open(int features)
return 0;
 }
 
+sigjmp_buf jbuf, *sigbuf;
+
+static void sighndl(int sig, siginfo_t *siginfo, void *ptr)
+{
+   if (sig == SIGBUS) {
+   if (sigbuf)
+   siglongjmp(*sigbuf, 1);
+   abort();
+   }
+}
+
 /*
  * For non-cooperative userfaultfd test we fork() a process that will
  * generate pagefaults, will mremap the area monitored by the
@@ -585,19 +598,59 @@ static int userfaultfd_open(int features)
  * The release of the pages currently generates event for shmem and
  * anonymous memory (UFFD_EVENT_REMOVE), hence it is not checked
  * for hugetlb.
+ * For signal test(UFFD_FEATURE_SIGBUS), signal_test = 1, we register
+ * monitored area, generate pagefaults and test that signal is delivered.
+ * Use UFFDIO_COPY to allocate missing page and retry. For signal_test = 2
+ * test robustness use case - we release monitored area, fork a process
+ * that will generate pagefaults and verify signal is generated.
+ * This also tests UFFD_FEATURE_EVENT_FORK event along with the signal
+ * feature. Using monitor thread, verify no userfault events are generated.
  */
-static int faulting_process(void)
+static int faulting_process(int signal_test)
 {
unsigned long nr;
unsigned long long count;
unsigned long split_nr_pages;
+   unsigned long lastnr;
+   struct sigaction act;
+   unsigned long signalled = 0;
 
if (test_type != TEST_HUGETLB)
split_nr_pages = (nr_pages + 1) / 2;
else
split_nr_pages = nr_pages;
 
+   if (signal_test) {
+   sigbuf = &jbuf;
+   memset(&act, 0, sizeof(act));
+   act.sa_sigaction = sighndl;
+   act.sa_flags = SA_SIGINFO;
+   if (sigaction(SIGBUS, &act, 0)) {
+   perror("sigaction");
+   return 1;
+   }
+   lastnr = (unsigned long)-1;
+   }
+
for (nr = 0; nr < split_nr_pages; nr++) {
+   if (signal_test) {
+   if (sigsetjmp(*sigbuf, 1) != 0) {
+   if (nr == lastnr) {
+   fprintf(stderr, "Signal repeated\n");
+   return 1;
+   }
+
+   lastnr = nr;
+   if (signal_test == 1) {
+   if (copy_page(uffd, nr * page_size))
+   signalled++;
+   } else {
+   signalled++;
+   continue;
+   }
+   }
+   }
+
count = *area_count(area_dst, nr);
if (count != count_verify[nr]) {
fprintf(stderr,
@@ -607,6 +660,9 @@ static int faulting_process(void)
}
}
 
+   if (signal_test)
+   return signalled != split_nr_pages;
+
if (test_type == TEST_HUGETLB)
return 0;
 
@@ -761,7 +817,7 @@ static int userfaultfd_events_test(void)
perror("fork"), exit(1);
 
if (!pid)
-   return faulting_process();
+   return faulting_process(0);
 
waitpid(pid, , 0);
if (err)
@@ -778,6 +834,70 @@ static int userfaultfd_events_test(void)
retu


[PATCH v2 2/2] userfaultfd: selftest: Add tests for UFFD_FEATURE_SIGBUS feature

2017-07-27 Thread Prakash Sangappa
This patch adds tests for UFFD_FEATURE_SIGBUS feature. The
tests will verify signal delivery instead of userfault events.
Also, test use of UFFDIO_COPY to allocate memory and retry
accessing monitored area after signal delivery.

This patch also fixes a bug in uffd_poll_thread() where 'uffd'
is leaked.

Signed-off-by: Prakash Sangappa <prakash.sanga...@oracle.com>
---
Change log

v2:
  - Added comments to explain the tests.
  - Fixed test to fail immediately if signal repeats.
  - Addressed other review comments.

v1: https://lkml.org/lkml/2017/7/26/101
---
 tools/testing/selftests/vm/userfaultfd.c |  127 +-
 1 files changed, 124 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/vm/userfaultfd.c 
b/tools/testing/selftests/vm/userfaultfd.c
index 1eae79a..3976d7a 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -66,6 +66,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #ifdef __NR_userfaultfd
 
@@ -408,6 +409,7 @@ static int copy_page(int ufd, unsigned long offset)
userfaults++;
break;
case UFFD_EVENT_FORK:
+   close(uffd);
uffd = msg.arg.fork.ufd;
pollfd[0].fd = uffd;
break;
@@ -572,6 +574,17 @@ static int userfaultfd_open(int features)
return 0;
 }
 
+sigjmp_buf jbuf, *sigbuf;
+
+static void sighndl(int sig, siginfo_t *siginfo, void *ptr)
+{
+   if (sig == SIGBUS) {
+   if (sigbuf)
+   siglongjmp(*sigbuf, 1);
+   abort();
+   }
+}
+
 /*
  * For non-cooperative userfaultfd test we fork() a process that will
  * generate pagefaults, will mremap the area monitored by the
@@ -585,19 +598,59 @@ static int userfaultfd_open(int features)
  * The release of the pages currently generates event for shmem and
  * anonymous memory (UFFD_EVENT_REMOVE), hence it is not checked
  * for hugetlb.
+ * For signal test(UFFD_FEATURE_SIGBUS), signal_test = 1, we register
+ * monitored area, generate pagefaults and test that signal is delivered.
+ * Use UFFDIO_COPY to allocate missing page and retry. For signal_test = 2
+ * test robustness use case - we release monitored area, fork a process
+ * that will generate pagefaults and verify signal is generated.
+ * This also tests UFFD_FEATURE_EVENT_FORK event along with the signal
+ * feature. Using monitor thread, verify no userfault events are generated.
  */
-static int faulting_process(void)
+static int faulting_process(int signal_test)
 {
unsigned long nr;
unsigned long long count;
unsigned long split_nr_pages;
+   unsigned long lastnr;
+   struct sigaction act;
+   unsigned long signalled = 0, sig_repeats = 0;
 
if (test_type != TEST_HUGETLB)
split_nr_pages = (nr_pages + 1) / 2;
else
split_nr_pages = nr_pages;
 
+   if (signal_test) {
+   sigbuf = &jbuf;
+   memset(&act, 0, sizeof(act));
+   act.sa_sigaction = sighndl;
+   act.sa_flags = SA_SIGINFO;
+   if (sigaction(SIGBUS, &act, 0)) {
+   perror("sigaction");
+   return 1;
+   }
+   lastnr = (unsigned long)-1;
+   }
+
for (nr = 0; nr < split_nr_pages; nr++) {
+   if (signal_test) {
+   if (sigsetjmp(*sigbuf, 1) != 0) {
+   if (nr == lastnr) {
+   sig_repeats++;
+   break;
+   }
+
+   lastnr = nr;
+   if (signal_test == 1) {
+   if (copy_page(uffd, nr * page_size))
+   signalled++;
+   } else {
+   signalled++;
+   continue;
+   }
+   }
+   }
+
count = *area_count(area_dst, nr);
if (count != count_verify[nr]) {
fprintf(stderr,
@@ -607,6 +660,9 @@ static int faulting_process(void)
}
}
 
+   if (signal_test)
+   return !(signalled == split_nr_pages && sig_repeats == 0);
+
if (test_type == TEST_HUGETLB)
return 0;
 
@@ -761,7 +817,7 @@ static int userfaultfd_events_test(void)
perror("fork"), exit(1);
 
if (!pid)
-   return faulting_process();
+   return faulting_process(0);
 
waitpid(pid, &err, 0);
if (err)
@@ -778,6 +834,70 @@ static int userfaultfd_events_test(void)
return userfaults != nr_pages;
 }
 
+static int userfau


Re: [RESEND PATCH 1/2] userfaultfd: Add feature to request for a signal delivery

2017-07-27 Thread Prakash Sangappa

Yes, I will provide a man page update.

-Prakash.


On 7/27/17 4:58 AM, Michal Hocko wrote:

Please do not forget to provide a man page update with clarified
semantic.




Re: [RESEND PATCH 2/2] userfaultfd: selftest: Add tests for UFFD_FREATURE_SIGBUS

2017-07-26 Thread Prakash Sangappa



On 7/26/17 7:27 AM, Andrea Arcangeli wrote:

On Tue, Jul 25, 2017 at 12:47:42AM -0400, Prakash Sangappa wrote:

Signed-off-by: Prakash Sangappa <prakash.sanga...@oracle.com>
---
  tools/testing/selftests/vm/userfaultfd.c |  121 +-
  1 files changed, 118 insertions(+), 3 deletions(-)

Like Mike said, some comment about the test would be better; commit
messages are never one-liners in the kernel.


Ok




@@ -408,6 +409,7 @@ static int copy_page(int ufd, unsigned long offset)
userfaults++;
break;
case UFFD_EVENT_FORK:
+   close(uffd);
uffd = msg.arg.fork.ufd;
pollfd[0].fd = uffd;
break;

Isn't this fd leak bugfix independent of the rest of the changes? The
only side effects should have been that it could run out of fds, but I
assume this was found by source review as I doubt it could run out of fds.
This could be split off into a separate patch.


It is not just the fd leak; it causes problems here with the addition of the
new test userfaultfd_sig_test(). Since the original vma registration
persists in the parent, the subsequent registration in userfaultfd_events_test()
fails with an 'EBUSY' error, as the userfaultfd implementation does not allow
registering the same vma with another uffd while one exists.

Therefore, this change is needed. I could just leave this fix here along
with the rest of the changes; will that be OK?

-Prakash


Overall it looks a good test also exercising UFFD_EVENT_FORK at the
same time.

Thanks,
Andrea




Re: [RESEND PATCH 2/2] userfaultfd: selftest: Add tests for UFFD_FREATURE_SIGBUS

2017-07-26 Thread Prakash Sangappa



On 7/26/17 12:53 AM, Mike Rapoport wrote:

+
  /*
   * For non-cooperative userfaultfd test we fork() a process that will
   * generate pagefaults, will mremap the area monitored by the
@@ -585,19 +598,54 @@ static int userfaultfd_open(int features)
   * The release of the pages currently generates event for shmem and
   * anonymous memory (UFFD_EVENT_REMOVE), hence it is not checked
   * for hugetlb.
+ * For signal test(UFFD_FEATURE_SIGBUS), primarily test signal
+ * delivery and ensure no userfault events are generated.

Can you add some details about the tests? E.g. what is the meaning if
signal_test=1 and signal_test=2 and what is the difference between them?


Ok, I will.




   */
-static int faulting_process(void)
+static int faulting_process(int signal_test)
  {
unsigned long nr;
unsigned long long count;
unsigned long split_nr_pages;
+   unsigned long lastnr;
+   struct sigaction act;
+   unsigned long signalled=0, sig_repeats = 0;

Spaces around that '=' ^


Will fix it.



if (test_type != TEST_HUGETLB)
split_nr_pages = (nr_pages + 1) / 2;
else
split_nr_pages = nr_pages;

+   if (signal_test) {
+   sigbuf = &jbuf;
+   memset (&act, 0, sizeof(act));

There should be no space between function name and open parenthesis.


ok



+   act.sa_sigaction = sighndl;
+   act.sa_flags = SA_SIGINFO;
+   if (sigaction(SIGBUS, &act, 0)) {
+   perror("sigaction");
+   return 1;
+   }
+   lastnr = (unsigned long)-1;
+   }
+
for (nr = 0; nr < split_nr_pages; nr++) {
+   if (signal_test) {
+   if (sigsetjmp(*sigbuf, 1) != 0) {
+   if (nr == lastnr) {
+   sig_repeats++;
+   continue;

If I understand correctly, when nr == lastnr we get a repeated signal for
the same page and this is an error, right?


Yes,


Why would we continue the test and not return an error immediately?


Yes, it could just return an error. I will fix it.



+   }
+
+   lastnr = nr;
+   if (signal_test == 1) {
+   if (copy_page(uffd, nr * page_size))
+   signalled++;
+   } else {
+   signalled++;
+   continue;
+   }
+   }
+   }
+
count = *area_count(area_dst, nr);
if (count != count_verify[nr]) {
fprintf(stderr,
@@ -607,6 +655,8 @@ static int faulting_process(void)
}
}

+   if (signal_test)
+   return signalled != split_nr_pages || sig_repeats != 0;

I believe return !(signalled == split_nr_pages && sig_repeats == 0) is
clearer.
And a blank line after the return statement would be nice :)


Ok.
Will send out v2 patch with the changes.

Thanks,
-Prakash




[RESEND PATCH 0/2] userfaultfd: Add feature to request for a signal delivery

2017-07-24 Thread Prakash Sangappa
Hi Andrea, Mike,

Resending - fixed email address.

Here is the patch set for the proposed userfaultfd UFFD_FEATURE_SIGBUS
feature, including tests in selftest/vm/userfaultfd.c

Please review.

See following for previous discussion.

http://www.spinics.net/lists/linux-mm/msg129224.html
http://www.spinics.net/lists/linux-mm/msg130678.html


Thanks,

Prakash Sangappa (2):
  userfaultfd: Add feature to request for a signal delivery
  userfaultfd: selftest: Add tests for UFFD_FEATURE_SIGBUS

 fs/userfaultfd.c |3 +
 include/uapi/linux/userfaultfd.h |   10 ++-
 tools/testing/selftests/vm/userfaultfd.c |  121 +-
 3 files changed, 130 insertions(+), 4 deletions(-)



[RESEND PATCH 1/2] userfaultfd: Add feature to request for a signal delivery

2017-07-24 Thread Prakash Sangappa
In some cases, userfaultfd mechanism should just deliver a SIGBUS signal
to the faulting process, instead of the page-fault event. Dealing with
page-fault event using a monitor thread can be an overhead in these
cases. For example applications like the database could use the signaling
mechanism for robustness purpose.

Database uses hugetlbfs for performance reason. Files on hugetlbfs
filesystem are created and huge pages allocated using fallocate() API.
Pages are deallocated/freed using fallocate() hole punching support.
These files are mmapped and accessed by many processes as shared memory.
The database keeps track of which offsets in the hugetlbfs file have
pages allocated.

Any access to mapped address over holes in the file, which can occur due
to bugs in the application, is considered invalid and expect the process
to simply receive a SIGBUS.  However, currently when a hole in the file is
accessed via the mapped address, kernel/mm attempts to automatically
allocate a page at page fault time, resulting in implicitly filling the
hole in the file. This may not be the desired behavior for applications
like the database that want to explicitly manage page allocations of
hugetlbfs files.

Using userfaultfd mechanism with this support to get a signal, database
application can prevent pages from being allocated implicitly when
processes access mapped address over holes in the file.

This patch adds the UFFD_FEATURE_SIGBUS feature to the userfaultfd mechanism to
request for a SIGBUS signal.

See following for previous discussion about the database requirement
leading to this proposal as suggested by Andrea.

http://www.spinics.net/lists/linux-mm/msg129224.html

Signed-off-by: Prakash Sangappa <prakash.sanga...@oracle.com>
---
 fs/userfaultfd.c |3 +++
 include/uapi/linux/userfaultfd.h |   10 +-
 2 files changed, 12 insertions(+), 1 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 1d622f2..0bbe7df 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -371,6 +371,9 @@ int handle_userfault(struct vm_fault *vmf, unsigned long 
reason)
VM_BUG_ON(reason & ~(VM_UFFD_MISSING|VM_UFFD_WP));
VM_BUG_ON(!(reason & VM_UFFD_MISSING) ^ !!(reason & VM_UFFD_WP));
 
+   if (ctx->features & UFFD_FEATURE_SIGBUS)
+   goto out;
+
/*
 * If it's already released don't get it. This avoids to loop
 * in __get_user_pages if userfaultfd_release waits on the
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 3b05953..d39d5db 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -23,7 +23,8 @@
   UFFD_FEATURE_EVENT_REMOVE |  \
   UFFD_FEATURE_EVENT_UNMAP |   \
   UFFD_FEATURE_MISSING_HUGETLBFS | \
-  UFFD_FEATURE_MISSING_SHMEM)
+  UFFD_FEATURE_MISSING_SHMEM | \
+  UFFD_FEATURE_SIGBUS)
 #define UFFD_API_IOCTLS\
((__u64)1 << _UFFDIO_REGISTER | \
 (__u64)1 << _UFFDIO_UNREGISTER |   \
@@ -153,6 +154,12 @@ struct uffdio_api {
 * UFFD_FEATURE_MISSING_SHMEM works the same as
 * UFFD_FEATURE_MISSING_HUGETLBFS, but it applies to shmem
 * (i.e. tmpfs and other shmem based APIs).
+*
+* UFFD_FEATURE_SIGBUS feature means no page-fault
+* (UFFD_EVENT_PAGEFAULT) event will be delivered, instead
+* a SIGBUS signal will be sent to the faulting process.
+* The application process can enable this behavior by adding
+* it to uffdio_api.features.
 */
 #define UFFD_FEATURE_PAGEFAULT_FLAG_WP (1<<0)
 #define UFFD_FEATURE_EVENT_FORK(1<<1)
@@ -161,6 +168,7 @@ struct uffdio_api {
 #define UFFD_FEATURE_MISSING_HUGETLBFS (1<<4)
 #define UFFD_FEATURE_MISSING_SHMEM (1<<5)
 #define UFFD_FEATURE_EVENT_UNMAP   (1<<6)
+#define UFFD_FEATURE_SIGBUS(1<<7)
__u64 features;
 
__u64 ioctls;
-- 
1.7.1
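As the uffdio_api comment above says, an application opts in by adding the bit
to uffdio_api.features. A minimal user-space sketch of that handshake, assuming
headers that define UFFD_FEATURE_SIGBUS; the helper name and error handling are
illustrative, and the failing ioctl() doubles as a probe on older kernels that
do not know the feature bit:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

/* Open a userfaultfd and request SIGBUS delivery instead of fault events.
 * Returns the fd, or -1 if the running kernel lacks UFFD_FEATURE_SIGBUS. */
static int open_uffd_sigbus(void)
{
    struct uffdio_api api;
    int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);

    if (uffd < 0) {
        perror("userfaultfd");
        return -1;
    }

    memset(&api, 0, sizeof(api));
    api.api = UFFD_API;
    api.features = UFFD_FEATURE_SIGBUS;    /* request the new feature */
    if (ioctl(uffd, UFFDIO_API, &api) ||
        !(api.features & UFFD_FEATURE_SIGBUS)) {
        /* older kernel: fall back (e.g. to a monitor thread) or give up */
        close(uffd);
        return -1;
    }
    return uffd;
}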



[RESEND PATCH 2/2] userfaultfd: selftest: Add tests for UFFD_FEATURE_SIGBUS

2017-07-24 Thread Prakash Sangappa
Signed-off-by: Prakash Sangappa <prakash.sanga...@oracle.com>
---
 tools/testing/selftests/vm/userfaultfd.c |  121 +-
 1 files changed, 118 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/vm/userfaultfd.c 
b/tools/testing/selftests/vm/userfaultfd.c
index 1eae79a..6a43e84 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -66,6 +66,7 @@
 #include 
 #include 
 #include 
+#include <setjmp.h>
 
 #ifdef __NR_userfaultfd
 
@@ -408,6 +409,7 @@ static int copy_page(int ufd, unsigned long offset)
userfaults++;
break;
case UFFD_EVENT_FORK:
+   close(uffd);
uffd = msg.arg.fork.ufd;
pollfd[0].fd = uffd;
break;
@@ -572,6 +574,17 @@ static int userfaultfd_open(int features)
return 0;
 }
 
+sigjmp_buf jbuf, *sigbuf;
+
+static void sighndl(int sig, siginfo_t *siginfo, void *ptr)
+{
+if (sig == SIGBUS) {
+if (sigbuf)
+ siglongjmp(*sigbuf, 1);
+abort();
+}
+}
+
 /*
  * For non-cooperative userfaultfd test we fork() a process that will
  * generate pagefaults, will mremap the area monitored by the
@@ -585,19 +598,54 @@ static int userfaultfd_open(int features)
  * The release of the pages currently generates event for shmem and
  * anonymous memory (UFFD_EVENT_REMOVE), hence it is not checked
  * for hugetlb.
+ * For signal test(UFFD_FEATURE_SIGBUS), primarily test signal
+ * delivery and ensure no userfault events are generated.
  */
-static int faulting_process(void)
+static int faulting_process(int signal_test)
 {
unsigned long nr;
unsigned long long count;
unsigned long split_nr_pages;
+   unsigned long lastnr;
+   struct sigaction act;
+   unsigned long signalled=0, sig_repeats = 0;
 
if (test_type != TEST_HUGETLB)
split_nr_pages = (nr_pages + 1) / 2;
else
split_nr_pages = nr_pages;
 
+   if (signal_test) {
+   sigbuf = &jbuf;
+   memset (&act, 0, sizeof(act));
+   act.sa_sigaction = sighndl;
+   act.sa_flags = SA_SIGINFO;
+   if (sigaction(SIGBUS, &act, 0)) {
+   perror("sigaction");
+   return 1;
+   }
+   lastnr = (unsigned long)-1;
+   }
+
for (nr = 0; nr < split_nr_pages; nr++) {
+   if (signal_test) {
+   if (sigsetjmp(*sigbuf, 1) != 0) {
+   if (nr == lastnr) {
+   sig_repeats++;
+   continue;
+   }
+
+   lastnr = nr;
+   if (signal_test == 1) {
+   if (copy_page(uffd, nr * page_size))
+   signalled++;
+   } else {
+   signalled++;
+   continue;
+   }
+   }
+   }
+
count = *area_count(area_dst, nr);
if (count != count_verify[nr]) {
fprintf(stderr,
@@ -607,6 +655,8 @@ static int faulting_process(void)
}
}
 
+   if (signal_test)
+   return signalled != split_nr_pages || sig_repeats != 0;
if (test_type == TEST_HUGETLB)
return 0;
 
@@ -761,7 +811,7 @@ static int userfaultfd_events_test(void)
perror("fork"), exit(1);
 
if (!pid)
-   return faulting_process();
+   return faulting_process(0);
 
waitpid(pid, &err, 0);
if (err)
@@ -778,6 +828,70 @@ static int userfaultfd_events_test(void)
return userfaults != nr_pages;
 }
 
+static int userfaultfd_sig_test(void)
+{
+   struct uffdio_register uffdio_register;
+   unsigned long expected_ioctls;
+   unsigned long userfaults;
+   pthread_t uffd_mon;
+   int err, features;
+   pid_t pid;
+   char c;
+
+   printf("testing signal delivery: ");
+   fflush(stdout);
+
+   if (uffd_test_ops->release_pages(area_dst))
+   return 1;
+
+   features = UFFD_FEATURE_EVENT_FORK|UFFD_FEATURE_SIGBUS;
+   if (userfaultfd_open(features) < 0)
+   return 1;
+   fcntl(uffd, F_SETFL, uffd_flags | O_NONBLOCK);
+
+   uffdio_register.range.start = (unsigned long) area_dst;
+   uffdio_register.range.len = nr_pages * page_size;
+   uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
+   if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register))
+   fprintf(stderr, "register failure\n"), exit(1);
+
+  

Re: [RFC PATCH] userfaultfd: Add feature to request for a signal delivery

2017-06-30 Thread prakash sangappa


On 6/30/2017 6:08 AM, Andrea Arcangeli wrote:

On Fri, Jun 30, 2017 at 11:47:35AM +0200, Michal Hocko wrote:

[...]

As an aside, I remember that prior to MADV_FREE there was a long
discussion about lazy freeing of memory from userspace. Some users
wanted to be signalled when their memory was freed by the system so that
they could rebuild the original content (e.g. uncompressed images in
memory). It seems like MADV_FREE + this signalling could be used for
that usecase. John would surely know more about those usecases.

That would provide an equivalent API to the one volatile pages
provided agreed. So it would allow to adapt code (if any?) more easily
to drop the duplicate feature in volatile pages code (however it would
be faster if the userland code using volatile pages lazy reclaim mode
was converted to poll the uffd so the kernel talks directly to the
monitor without involving a SIGBUS signal handler which will cause
spurious enter/exit if compared to signal-less uffd API).

The main benefit in my view is not volatile pages but that
UFFD_FEATURE_SIGBUS would work equally well to enforce robustness on
all kind of memory not only hugetlbfs (so one could run the database
with robustness on THP over tmpfs) and the new cache can be injected
in the filesystem using UFFDIO_COPY which is likely faster than
fallocate as UFFDIO_COPY was already demonstrated to be faster even
than a regular page fault.


Interesting that UFFDIO_COPY is faster than fallocate().  In the DB use case
the page does not need to be allocated at the time a process trips on the
hugetlbfs file hole and receives SIGBUS.  fallocate() is called on the
hugetlbfs file when more memory needs to be allocated, by a separate process.
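For reference, a hedged sketch of the UFFDIO_COPY call discussed here; uffd,
dst, src and page_size are placeholders assumed to be set up elsewhere, and for
a hugetlbfs-backed range the length must be the huge page size:

#include <string.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Atomically fill one page of a userfaultfd-registered mapping from 'src'. */
static int fill_page(int uffd, void *dst, const void *src, size_t page_size)
{
    struct uffdio_copy copy;

    memset(&copy, 0, sizeof(copy));
    copy.dst = (unsigned long) dst;    /* must be page aligned */
    copy.src = (unsigned long) src;
    copy.len = page_size;
    copy.mode = 0;                     /* 0 also wakes any waiting faulters */

    if (ioctl(uffd, UFFDIO_COPY, &copy))
        return -1;
    return copy.copy == (long long) page_size ? 0 : -1;
}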


It's also simpler to handle backwards compatibility with the
UFFDIO_API call, that allows probing if UFFD_FEATURE_SIGBUS is
supported by the running kernel regardless of kernel version (so it
can be backported and enabled by the database, without the database
noticing it's on an older kernel version).


Yes, this is useful as this change will need to be back ported.


So while this wasn't the intended way to use the userfault and I
already pointed out the possibility to use a single monitor to do all
this, I'm positive about UFFD_FEATURE_SIGBUS if the overhead of having
a monitor is so concerning.

Ultimately there are many pros and just a single cons: the branch in
handle_userfault().

I wonder if it would be possible to use static_branch_enable() in
UFFDIO_API and static_branch_unlikely in handle_userfault() to
eliminate that branch but perhaps it's overkill and UFFDIO_API is
unprivileged and it would send an IPI to all CPUs. I don't think we
normally expose the static_branch_enable() to unprivileged userland
and making UFFD_FEATURE_SIGBUS a privileged op doesn't sound
attractive (although the alternative of altering a hugetlbfs mount
option would be a privileged op).


Regarding the hugetlbfs mount option, one consideration is to allow mounts of
hugetlbfs inside a user namespace's mount namespace, which would allow
non-privileged processes to mount hugetlbfs for use inside a user namespace.

This may be needed even for the 'min_size' mount option, using which an
application could reserve huge pages and mount a filesystem for its use,
without the need to have privileges, given the system has enough hugepages
configured.  It seems that if non-privileged processes are allowed to mount a
hugetlbfs filesystem, then min_size should be subject to some resource limits.

Mounting inside user namespaces will be a different patch proposal later.
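To make the min_size discussion concrete, a sketch of the mount call; the mount
point and sizes are hypothetical, and today this still needs CAP_SYS_ADMIN in
the initial namespaces:

#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
    /* Reserve 2G worth of huge pages for this filesystem at mount time
     * and cap it at 4G; values are purely illustrative. */
    if (mount("none", "/mnt/dbhuge", "hugetlbfs", 0, "min_size=2G,size=4G")) {
        perror("mount hugetlbfs");
        return 1;
    }
    return 0;
}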




Thanks,
Andrea




Re: [RFC PATCH] userfaultfd: Add feature to request for a signal delivery

2017-06-28 Thread Prakash Sangappa



On 6/28/17 6:18 AM, Mike Rapoport wrote:

On Tue, Jun 27, 2017 at 09:01:20AM -0700, Prakash Sangappa wrote:

On 6/27/17 8:35 AM, Mike Rapoport wrote:


On Tue, Jun 27, 2017 at 09:06:43AM +0200, Michal Hocko wrote:

This is an user visible API so let's CC linux-api mailing list.

On Mon 26-06-17 12:46:13, Prakash Sangappa wrote:


Any access to mapped address over holes in the file, which can occur due
to bugs in the application, is considered invalid and expect the process
to simply receive a SIGBUS.  However, currently when a hole in the file is
accessed via the mapped address, kernel/mm attempts to automatically
allocate a page at page fault time, resulting in implicitly filling the
hole in the file. This may not be the desired behavior for applications
like the database that want to explicitly manage page allocations of
hugetlbfs files.

So you register UFFD_FEATURE_SIGBUS on each region that you are unmapping
and then just let those offenders die?

If I understand correctly, the database will create the mapping, then it'll
open userfaultfd and register those mappings with the userfault.
Afterwards, when the application accesses a hole userfault will cause
SIGBUS and the application will process it in whatever way it likes, e.g.
just die.

Yes.


What I don't understand is why won't you use a userfault monitor process that
will take care of the page fault events?
It shouldn't be much overhead running it, and it can keep track of all the
userfault file descriptors for you and it will allow more versatile error
handling than SIGBUS.


Co-ordination with the external monitor process by all the database
processes to send their userfaultfd is still an overhead.

You are planning to register in userfaultfd only the holes you punch to
deallocate pages, am I right?



No, the entire mmap'ed region. The DB processes would mmap(MAP_NORESERVE)
hugetlbfs files and register this mapped address with userfaultfd once,
right after the mmap() call.
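A sketch of that setup, registering the whole mapping immediately after mmap();
it assumes the hugetlbfs file and a userfaultfd opened with UFFD_FEATURE_SIGBUS
already exist, and the helper name and error handling are illustrative:

#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/userfaultfd.h>

/* Map 'len' bytes of an open hugetlbfs file and register the whole range;
 * 'len' should be a multiple of the huge page size. */
static void *map_and_register(int huge_fd, int uffd, size_t len)
{
    struct uffdio_register reg;
    void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_NORESERVE, huge_fd, 0);

    if (addr == MAP_FAILED)
        return NULL;

    memset(&reg, 0, sizeof(reg));
    reg.range.start = (unsigned long) addr;
    reg.range.len = len;
    reg.mode = UFFDIO_REGISTER_MODE_MISSING;
    if (ioctl(uffd, UFFDIO_REGISTER, &reg)) {
        munmap(addr, len);
        return NULL;
    }
    /* With UFFD_FEATURE_SIGBUS enabled, touching a hole in the file through
     * this mapping now raises SIGBUS instead of allocating a page. */
    return addr;
}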



And the co-ordination of the userfault file descriptor with the monitor
would have been added after calls to fallocate() and userfaultfd_register()?


Well, the database application does not need to deal with a monitor.



I've just been thinking that maybe it would be possible to use
UFFD_EVENT_REMOVE for this case. We anyway need to implement the generation
of UFFD_EVENT_REMOVE for the case of hole punching in hugetlbfs for
non-cooperative userfaultfd. It could be that it will solve your issue as
well.



Will this result in a signal delivery?

In the use case described, the database application does not need any event
for hole punching. Basically, it just needs a signal for any invalid access
to a mapped area over holes in the file.



Re: [PATCH RFC] hugetlbfs 'noautofill' mount option

2017-06-27 Thread Prakash Sangappa



On 6/20/17 4:35 PM, Prakash Sangappa wrote:



On 6/16/17 6:15 AM, Andrea Arcangeli wrote:

Adding a single if (ctx->feature & UFFD_FEATURE_SIGBUS) goto out,
branch for this corner case to handle_userfault() isn't great and the
hugetlbfs mount option is absolutely zero cost to the handle_userfault
which is primarily why I'm not against it.. although it's not going to
be measurable so it would be ok also to add such feature.



If implementing UFFD_FEATURE_SIGBUS is preferred instead of the mount 
option, I could look into that.



Implementing UFFD_FEATURE_SIGBUS seems reasonable.

I wanted to note here on this thread that I sent out a separate
RFC patch for review, adding UFFD_FEATURE_SIGBUS.

See,
http://marc.info/?l=linux-mm&m=149857975906880&w=2




Re: [RFC PATCH] userfaultfd: Add feature to request for a signal delivery

2017-06-27 Thread Prakash Sangappa

On 6/27/17 8:35 AM, Mike Rapoport wrote:


On Tue, Jun 27, 2017 at 09:06:43AM +0200, Michal Hocko wrote:

This is an user visible API so let's CC linux-api mailing list.

On Mon 26-06-17 12:46:13, Prakash Sangappa wrote:

In some cases, userfaultfd mechanism should just deliver a SIGBUS signal
to the faulting process, instead of the page-fault event. Dealing with
page-fault event using a monitor thread can be an overhead in these
cases. For example applications like the database could use the signaling
mechanism for robustness purpose.

this is rather confusing. What is the reason that the monitor would be
slower than signal delivery and handling?


Database uses hugetlbfs for performance reason. Files on hugetlbfs
filesystem are created and huge pages allocated using fallocate() API.
Pages are deallocated/freed using fallocate() hole punching support.
These files are mmapped and accessed by many processes as shared memory.
The database keeps track of which offsets in the hugetlbfs file have
pages allocated.

Any access to mapped address over holes in the file, which can occur due
to bugs in the application, is considered invalid and expect the process
to simply receive a SIGBUS.  However, currently when a hole in the file is
accessed via the mapped address, kernel/mm attempts to automatically
allocate a page at page fault time, resulting in implicitly filling the
hole in the file. This may not be the desired behavior for applications
like the database that want to explicitly manage page allocations of
hugetlbfs files.

So you register UFFD_FEATURE_SIGBUS on each region that you are unmapping
and then just let those offenders die?
  
If I understand correctly, the database will create the mapping, then it'll
open userfaultfd and register those mappings with the userfault.
Afterwards, when the application accesses a hole userfault will cause
SIGBUS and the application will process it in whatever way it likes, e.g.
just die.


Yes.


What I don't understand is why won't you use a userfault monitor process that
will take care of the page fault events?
It shouldn't be much overhead running it, and it can keep track of all the
userfault file descriptors for you and it will allow more versatile error
handling than SIGBUS.



Co-ordination with the external monitor process by all the database
processes to send their userfaultfd is still an overhead.



Using userfaultfd mechanism, with this support to get a signal, database
application can prevent pages from being allocated implicitly when
processes access mapped address over holes in the file.

This patch adds the feature to request for a SIGBUS signal to userfaultfd
mechanism.

See following for previous discussion about the database requirement
leading to this proposal as suggested by Andrea.

http://www.spinics.net/lists/linux-mm/msg129224.html

Please make those requirements part of the changelog.


Signed-off-by: Prakash <prakash.sanga...@oracle.com>
---
  fs/userfaultfd.c |  5 +
  include/uapi/linux/userfaultfd.h | 10 +-
  2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 1d622f2..5686d6d2 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -371,6 +371,11 @@ int handle_userfault(struct vm_fault *vmf, unsigned
long reason)
  VM_BUG_ON(reason & ~(VM_UFFD_MISSING|VM_UFFD_WP));
  VM_BUG_ON(!(reason & VM_UFFD_MISSING) ^ !!(reason & VM_UFFD_WP));

+if (ctx->features & UFFD_FEATURE_SIGBUS) {
+goto out;
+}
+
  /*
   * If it's already released don't get it. This avoids to loop
   * in __get_user_pages if userfaultfd_release waits on the
diff --git a/include/uapi/linux/userfaultfd.h
b/include/uapi/linux/userfaultfd.h
index 3b05953..d39d5db 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -23,7 +23,8 @@
 UFFD_FEATURE_EVENT_REMOVE |\
 UFFD_FEATURE_EVENT_UNMAP |\
 UFFD_FEATURE_MISSING_HUGETLBFS |\
-   UFFD_FEATURE_MISSING_SHMEM)
+   UFFD_FEATURE_MISSING_SHMEM |\
+   UFFD_FEATURE_SIGBUS)
  #define UFFD_API_IOCTLS\
  ((__u64)1 << _UFFDIO_REGISTER |\
   (__u64)1 << _UFFDIO_UNREGISTER |\
@@ -153,6 +154,12 @@ struct uffdio_api {
   * UFFD_FEATURE_MISSING_SHMEM works the same as
   * UFFD_FEATURE_MISSING_HUGETLBFS, but it applies to shmem
   * (i.e. tmpfs and other shmem based APIs).
+ *
+ * UFFD_FEATURE_SIGBUS feature means no page-fault
+ * (UFFD_EVENT_PAGEFAULT) event will be delivered, instead
+ * a SIGBUS signal will be sent to the faulting process.
+ * The application process can enable this behavior by adding
+ * it to uffdio_api.features.
   */
  #define UFFD_FEATURE_PAGEFAULT_FLAG_WP(1<<0)
  #define UFFD_FEATURE_EVENT_FORK(1<<1)

Re: [RFC PATCH] userfaultfd: Add feature to request for a signal delivery

2017-06-27 Thread Prakash Sangappa

On 6/27/17 12:06 AM, Michal Hocko wrote:


This is an user visible API so let's CC linux-api mailing list.

On Mon 26-06-17 12:46:13, Prakash Sangappa wrote:

In some cases, userfaultfd mechanism should just deliver a SIGBUS signal
to the faulting process, instead of the page-fault event. Dealing with
page-fault event using a monitor thread can be an overhead in these
cases. For example applications like the database could use the signaling
mechanism for robustness purpose.

this is rather confusing. What is the reason that the monitor would be
slower than signal delivery and handling?


There are a large number of single-threaded database processes involved;
each of these processes would require a monitor thread, which is considered
an overhead.




Database uses hugetlbfs for performance reason. Files on hugetlbfs
filesystem are created and huge pages allocated using fallocate() API.
Pages are deallocated/freed using fallocate() hole punching support.
These files are mmapped and accessed by many processes as shared memory.
The database keeps track of which offsets in the hugetlbfs file have
pages allocated.

Any access to mapped address over holes in the file, which can occur due
to bugs in the application, is considered invalid and expect the process
to simply receive a SIGBUS.  However, currently when a hole in the file is
accessed via the mapped address, kernel/mm attempts to automatically
allocate a page at page fault time, resulting in implicitly filling the
hole in the file. This may not be the desired behavior for applications
like the database that want to explicitly manage page allocations of
hugetlbfs files.

So you register UFFD_FEATURE_SIGBUS on each region that you are unmapping
and then just let those offenders die?


The database application will create the mapping and register it with
userfaultfd. Subsequently, when a process accesses the mapping over a hole,
it will receive SIGBUS and die.
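A sketch of the kind of SIGBUS handler such a process might install; strictly,
the default SIGBUS action already kills the process, so a handler like this is
only needed to log the access first, and everything beyond the standard
sigaction()/siginfo_t usage is illustrative:

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static void sigbus_handler(int sig, siginfo_t *si, void *ctx)
{
    /* si->si_addr holds the faulting mapped address (a hole in the file).
     * Only async-signal-safe calls here; a real handler might record
     * si->si_addr somewhere before dying. */
    static const char msg[] = "invalid access over a file hole, exiting\n";
    ssize_t unused;

    (void) sig;
    (void) si;
    (void) ctx;
    unused = write(STDERR_FILENO, msg, sizeof(msg) - 1);
    (void) unused;
    _exit(1);
}

static void install_sigbus_handler(void)
{
    struct sigaction act;

    memset(&act, 0, sizeof(act));
    act.sa_sigaction = sigbus_handler;
    act.sa_flags = SA_SIGINFO;
    if (sigaction(SIGBUS, &act, NULL)) {
        perror("sigaction");
        exit(1);
    }
}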




Using userfaultfd mechanism, with this support to get a signal, database
application can prevent pages from being allocated implicitly when
processes access mapped address over holes in the file.

This patch adds the feature to request for a SIGBUS signal to userfaultfd
mechanism.

See following for previous discussion about the database requirement
leading to this proposal as suggested by Andrea.

http://www.spinics.net/lists/linux-mm/msg129224.html

Please make those requirements part of the changelog.


The requirement is described above, which is the need for the database
application to not have holes filled implicitly. Sorry if this was not clear.
I will update the change log and send a v2 patch again.





[RFC PATCH] userfaultfd: Add feature to request for a signal delivery

2017-06-26 Thread Prakash Sangappa

In some cases, userfaultfd mechanism should just deliver a SIGBUS signal
to the faulting process, instead of the page-fault event. Dealing with
page-fault event using a monitor thread can be an overhead in these
cases. For example applications like the database could use the signaling
mechanism for robustness purpose.

Database uses hugetlbfs for performance reason. Files on hugetlbfs
filesystem are created and huge pages allocated using fallocate() API.
Pages are deallocated/freed using fallocate() hole punching support.
These files are mmapped and accessed by many processes as shared memory.
The database keeps track of which offsets in the hugetlbfs file have
pages allocated.

Any access to mapped address over holes in the file, which can occur due
to bugs in the application, is considered invalid and expect the process
to simply receive a SIGBUS.  However, currently when a hole in the file is
accessed via the mapped address, kernel/mm attempts to automatically
allocate a page at page fault time, resulting in implicitly filling the
hole in the file. This may not be the desired behavior for applications
like the database that want to explicitly manage page allocations of
hugetlbfs files.

Using userfaultfd mechanism, with this support to get a signal, database
application can prevent pages from being allocated implicitly when
processes access mapped address over holes in the file.

This patch adds the feature to request for a SIGBUS signal to userfaultfd
mechanism.

See following for previous discussion about the database requirement
leading to this proposal as suggested by Andrea.

http://www.spinics.net/lists/linux-mm/msg129224.html

Signed-off-by: Prakash 
---
 fs/userfaultfd.c |  5 +
 include/uapi/linux/userfaultfd.h | 10 +-
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 1d622f2..5686d6d2 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -371,6 +371,11 @@ int handle_userfault(struct vm_fault *vmf, unsigned 
long reason)

 VM_BUG_ON(reason & ~(VM_UFFD_MISSING|VM_UFFD_WP));
 VM_BUG_ON(!(reason & VM_UFFD_MISSING) ^ !!(reason & VM_UFFD_WP));

+if (ctx->features & UFFD_FEATURE_SIGBUS) {
+goto out;
+}
+
 /*
  * If it's already released don't get it. This avoids to loop
  * in __get_user_pages if userfaultfd_release waits on the
diff --git a/include/uapi/linux/userfaultfd.h 
b/include/uapi/linux/userfaultfd.h

index 3b05953..d39d5db 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -23,7 +23,8 @@
UFFD_FEATURE_EVENT_REMOVE |\
UFFD_FEATURE_EVENT_UNMAP |\
UFFD_FEATURE_MISSING_HUGETLBFS |\
-   UFFD_FEATURE_MISSING_SHMEM)
+   UFFD_FEATURE_MISSING_SHMEM |\
+   UFFD_FEATURE_SIGBUS)
 #define UFFD_API_IOCTLS\
 ((__u64)1 << _UFFDIO_REGISTER |\
  (__u64)1 << _UFFDIO_UNREGISTER |\
@@ -153,6 +154,12 @@ struct uffdio_api {
  * UFFD_FEATURE_MISSING_SHMEM works the same as
  * UFFD_FEATURE_MISSING_HUGETLBFS, but it applies to shmem
  * (i.e. tmpfs and other shmem based APIs).
+ *
+ * UFFD_FEATURE_SIGBUS feature means no page-fault
+ * (UFFD_EVENT_PAGEFAULT) event will be delivered, instead
+ * a SIGBUS signal will be sent to the faulting process.
+ * The application process can enable this behavior by adding
+ * it to uffdio_api.features.
  */
 #define UFFD_FEATURE_PAGEFAULT_FLAG_WP(1<<0)
 #define UFFD_FEATURE_EVENT_FORK(1<<1)
@@ -161,6 +168,7 @@ struct uffdio_api {
 #define UFFD_FEATURE_MISSING_HUGETLBFS(1<<4)
 #define UFFD_FEATURE_MISSING_SHMEM(1<<5)
 #define UFFD_FEATURE_EVENT_UNMAP(1<<6)
+#define UFFD_FEATURE_SIGBUS(1<<7)
 __u64 features;

 __u64 ioctls;
--
2.7.4
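The allocate/free cycle the changelog describes maps onto two fallocate()
calls on the hugetlbfs file; a sketch with hypothetical helper names, where
offsets and lengths must be multiples of the huge page size:

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>

/* Allocate one huge page worth of the file at 'off'... */
static int db_alloc_page(int huge_fd, off_t off, off_t huge_page_size)
{
    return fallocate(huge_fd, 0, off, huge_page_size);
}

/* ...and free it again, leaving a hole.  Any further access through a
 * mapping registered with UFFD_FEATURE_SIGBUS then gets a SIGBUS instead
 * of silently refilling the hole. */
static int db_free_page(int huge_fd, off_t off, off_t huge_page_size)
{
    return fallocate(huge_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                     off, huge_page_size);
}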



Re: [PATCH RFC] hugetlbfs 'noautofill' mount option

2017-06-20 Thread Prakash Sangappa



On 6/16/17 6:15 AM, Andrea Arcangeli wrote:

Hello Prakash,



Thanks for your response. Comments inline.



On Tue, May 09, 2017 at 01:59:34PM -0700, Prakash Sangappa wrote:


On 5/9/17 1:58 AM, Christoph Hellwig wrote:

On Mon, May 08, 2017 at 03:12:42PM -0700, prakash.sangappa wrote:

Regarding #3 as a general feature, do we want to
consider this and the complexity associated with the
implementation?

We have to.  Given that no one has exclusive access to hugetlbfs
a mount option is fundamentally the wrong interface.


A hugetlbfs filesystem may need to be mounted for exclusive use by
an application. Note, recently the 'min_size' mount option was added
to hugetlbfs, which would reserve minimum number of huge pages
for that filesystem for use by an application. If the filesystem with
min size specified, is not setup for exclusive use by an application,
then the purpose of reserving huge pages is defeated.  The
min_size option was for use by applications like the database.

Also, I am investigating enabling hugetlbfs mounts within user
namespace's mount namespace. That would allow an application
to mount a hugetlbfs filesystem inside a namespace exclusively for
its use, running as a non root user. For this it seems like the 'min_size'
should be subject to some user limits. Anyways, mounting inside
user namespaces is  a different discussion.

So, if a filesystem has to be setup for exclusive use by an application,
then different mount options can be used for that filesystem.

Before userfaultfd I used a madvise that triggered SIGBUS. Aside from
performance that is much lower than userfaultfd because of the return
to userland, SIGBUS handling and new enter kernel to communicate
through a pipe with a memory manager, it couldn't work reliably
because you're not going to get exact information on the virtual
address that triggered the fault if the SIGBUS triggers in some random
in a copy-user of some random syscall, depending on the syscall some
random error will be returned. So it couldn't work transparently to
the app as far as syscalls and get_user_pages drivers were concerned.



Sure, seems like that would be the case if an application wants to take 
some action as a result of the fault.




With your solution if you pass a corrupted pointer to a random read()
syscall you're going to get a error, but supposedly you already handle
any syscall error and stop the app.



Yes, the expectation is that the application will handle the error and
stop. This would be similar to an application passing an invalid
address to a system call.

So, in the use case for this feature, accessing the mapped address over
a hole in a hugetlbfs file is invalid. The application will keep track of
the valid regions.





This is a special case because you don't care about performance and
you don't care about not returning random EFAULT errors from syscalls
like read().



Exactly.



This mount option seems non intrusive enough and hugetlbfs is quite
special already, so I'm not particularly concerned by the fact it's
one more special tweak.

If it would be enough to convert the SIGBUS into a (killable) process
hang, you could still use uffd and there would be no need to send the
uffd to a manager. You'd find the corrupting buggy process stuck in
handle_userfault().



This could be a useful feature in debug mode. However, in the normal
mode the application should exit/die.




As an alternative to the mount option we could consider adding
UFFD_FEATURE_SIGBUS, which tells handle_userfault() to simply return
VM_FAULT_SIGBUS in the presence of a pagefault event. You'd still get
weird EFAULT or erratic retvals from syscalls, so it would only be
usable for your robustness feature. Then you could use UFFDIO_COPY
too to fill the memory atomically, which runs faster than a page fault
(fallocate punch hole still required to zap it).

Adding a single "if (ctx->features & UFFD_FEATURE_SIGBUS) goto out;"
branch for this corner case to handle_userfault() isn't great, and the
hugetlbfs mount option is absolutely zero cost to handle_userfault(),
which is primarily why I'm not against it... although the branch isn't
going to be measurable, so it would be ok also to add such a feature.



Sure, UFFD_FEATURE_SIGBUS would address the use case for the database 
using hugetlbfs. This could be a generic API and so could perhaps be 
useful in other cases as well.


However, for this the userfaultfd(2) fd has to be opened in order to 
register, and this fd has to remain open. Is this ok? Also, even though a 
monitor thread will not be required for this particular feature, hopefully 
it will not hinder future enhancements to userfaultfd.


The expectation is that the overhead of registering UFFD_FEATURE_SIGBUS is 
minimal, and the registration will be done by the application once after 
every mmap() call as required; hopefully this does not need to be done 
frequently. In the database use case, the registration will mainly be 
done once in the beginning when mapping huge pages.
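
For reference, a minimal userspace sketch of that registration, assuming the
UFFD_FEATURE_SIGBUS feature discussed above; the address/length are
placeholders and error handling is omitted:

#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Sketch: enable SIGBUS-on-missing-page for an mmap()ed hugetlbfs range.
 * 'addr'/'len' are placeholders; error handling omitted for brevity. */
static int register_sigbus_range(void *addr, size_t len)
{
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);

	struct uffdio_api api = {
		.api = UFFD_API,
		.features = UFFD_FEATURE_SIGBUS,	/* no monitor thread needed */
	};
	ioctl(uffd, UFFDIO_API, &api);

	struct uffdio_register reg = {
		.range = { .start = (unsigned long)addr, .len = len },
		.mode  = UFFDIO_REGISTER_MODE_MISSING,
	};
	ioctl(uffd, UFFDIO_REGISTER, &reg);

	/* The fd must stay open; a fault on a missing page in the range now
	 * raises SIGBUS instead of queueing a userfault event. */
	return uffd;
}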

Re: [PATCH RFC] hugetlbfs 'noautofill' mount option

2017-05-16 Thread Prakash Sangappa



On 5/9/17 1:59 PM, Prakash Sangappa wrote:



On 5/9/17 1:58 AM, Christoph Hellwig wrote:

On Mon, May 08, 2017 at 03:12:42PM -0700, prakash.sangappa wrote:

Regarding #3 as a general feature, do we want to
consider this and the complexity associated with the
implementation?

We have to.  Given that no one has exclusive access to hugetlbfs
a mount option is fundamentally the wrong interface.



A hugetlbfs filesystem may need to be mounted for exclusive use by
an application. Note, recently the 'min_size' mount option was added
to hugetlbfs, which would reserve minimum number of huge pages
for that filesystem for use by an application. If the filesystem with
min size specified, is not setup for exclusive use by an application,
then the purpose of reserving huge pages is defeated.  The
min_size option was for use by applications like the database.

Also, I am investigating enabling hugetlbfs mounts within user
namespace's mount namespace. That would allow an application
to mount a hugetlbfs filesystem inside a namespace exclusively for
its use, running as a non root user. For this it seems like the 'min_size'
should be subject to some user limits. Anyways, mounting inside
user namespaces is a different discussion.

So, if a filesystem has to be setup for exclusive use by an application,
then different mount options can be used for that filesystem.



Any further comments?

Cc'ing Andrea as we had discussed this requirement for the Database.




Re: [PATCH RFC] hugetlbfs 'noautofill' mount option

2017-05-09 Thread Prakash Sangappa



On 5/9/17 1:58 AM, Christoph Hellwig wrote:

On Mon, May 08, 2017 at 03:12:42PM -0700, prakash.sangappa wrote:

Regarding #3 as a general feature, do we want to
consider this and the complexity associated with the
implementation?

We have to.  Given that no one has exclusive access to hugetlbfs
a mount option is fundamentally the wrong interface.



A hugetlbfs filesystem may need to be mounted for exclusive use by
an application. Note, recently the 'min_size' mount option was added
to hugetlbfs, which would reserve minimum number of huge pages
for that filesystem for use by an application. If the filesystem with
min size specified, is not setup for exclusive use by an application,
then the purpose of reserving huge pages is defeated.  The
min_size option was for use by applications like the database.

Also, I am investigating enabling hugetlbfs mounts within user
namespace's mount namespace. That would allow an application
to mount a hugetlbfs filesystem inside a namespace exclusively for
its use, running as a non root user. For this it seems like the 'min_size'
should be subject to some user limits. Anyways, mounting inside
user namespaces is  a different discussion.

So, if a filesystem has to be setup for exclusive use by an application,
then different mount options can be used for that filesystem.



Re: [PATCH RFC] hugetlbfs 'noautofill' mount option

2017-05-07 Thread Prakash Sangappa



On 5/3/17 12:02 PM, Prakash Sangappa wrote:

On 5/2/17 4:43 PM, Dave Hansen wrote:


Ideally, it would be something that is *not* specifically for hugetlbfs.
  MADV_NOAUTOFILL, for instance, could be defined to SIGSEGV whenever
memory is touched that was not populated with MADV_WILLNEED, mlock(), 
etc...


If this is a generic advice type, the necessary support would have to be
implemented in the various filesystems which can support it.

The proposed behavior for 'noautofill' was to not fill holes in files
(like sparse files).

In the page fault path, mm would not know whether the mmapped address on
which the fault occurred is over a hole in the file, or just that the page
is not available in the page cache. The underlying filesystem would be
called, and it determines whether it is a hole; that is where it would fail
and not fill the hole, if this support is added. Normally, filesystems which
support sparse files (holes in a file) automatically fill the hole when it
is accessed. Then there is the issue of filesystem block size vs. page size.
If the block size is smaller than the page size, noautofill would only work
if the hole size is equal to, or a multiple of, the page size.

In the case of hugetlbfs it is much more straightforward, since this
filesystem is not like normal filesystems and the file sizes are multiples
of huge pages. A hole will be a multiple of the huge page size. For this
reason, should the advice be specific to hugetlbfs?






Any further comments? I think introducing a general madvise option or an
mmap flag applicable to all filesystems may not be required. The
'noautofill' behavior would be specifically useful in the hugetlbfs
filesystem.


So, if it is specific to hugetlbfs, will the mount option be ok? Or would
adding a madvise / mmap option specific to hugetlbfs be preferred?




Re: [PATCH RFC] hugetlbfs 'noautofill' mount option

2017-05-03 Thread Prakash Sangappa

On 5/2/17 4:43 PM, Dave Hansen wrote:


On 05/02/2017 04:34 PM, Prakash Sangappa wrote:

Similarly, a madvise() option also requires an additional system call by every
process mapping the file; this is considered an overhead for the database.

How long-lived are these processes?  For a database, I'd assume that
this would happen a single time, or a single time per mmap() at process
startup time.  Such a syscall would be doing something on the order of
taking mmap_sem, walking the VMA tree, setting a bit per VMA, and
unlocking.  That's a pretty cheap one-time cost...
Plus a call into the filesystem (a_ops?) to check whether the underlying
filesystem supports not filling holes on mapped access, before setting the
bit per vma. Although the overhead may not be that bad.

Database processes can exit and new ones be started, for instance, depending
on database activity.



If we do consider a new madvise() option, will it be acceptable
since this will be specifically for hugetlbfs file mappings?

Ideally, it would be something that is *not* specifically for hugetlbfs.
  MADV_NOAUTOFILL, for instance, could be defined to SIGSEGV whenever
memory is touched that was not populated with MADV_WILLNEED, mlock(), etc...


If this is a generic advice type, the necessary support would have to be
implemented in the various filesystems which can support it.

The proposed behavior for 'noautofill' was to not fill holes in files
(like sparse files).

In the page fault path, mm would not know whether the mmapped address on
which the fault occurred is over a hole in the file, or just that the page
is not available in the page cache. The underlying filesystem would be
called, and it determines whether it is a hole; that is where it would fail
and not fill the hole, if this support is added. Normally, filesystems which
support sparse files (holes in a file) automatically fill the hole when it
is accessed. Then there is the issue of filesystem block size vs. page size.
If the block size is smaller than the page size, noautofill would only work
if the hole size is equal to, or a multiple of, the page size.

In the case of hugetlbfs it is much more straightforward, since this
filesystem is not like normal filesystems and the file sizes are multiples
of huge pages. A hole will be a multiple of the huge page size. For this
reason, should the advice be specific to hugetlbfs?






If so,
would a new flag to mmap() call itself be acceptable, which would
define the proposed behavior?. That way no additional system calls
need to be made.

I don't feel super strongly about it, but I guess an mmap() flag could
work too.



The same goes for the mmap call, if it is a generic flag.

-Prakash.
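
Purely as an illustration of the per-mapping cost being discussed, a sketch
of the generic madvise() variant. MADV_NOAUTOFILL is hypothetical (it exists
only as a suggestion in this thread), and the constant's value below is a
made-up placeholder:

#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_NOAUTOFILL
#define MADV_NOAUTOFILL	200	/* hypothetical flag, placeholder value only */
#endif

/* Sketch: each process would pay one extra syscall per mapping to ask
 * that faults over file holes not allocate pages. */
static void *map_noautofill(int hugetlb_fd, size_t len)
{
	void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
			  hugetlb_fd, 0);
	if (addr == MAP_FAILED)
		return NULL;

	madvise(addr, len, MADV_NOAUTOFILL);	/* the extra per-mapping call */
	return addr;
}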






Re: [PATCH RFC] hugetlbfs 'noautofill' mount option

2017-05-02 Thread Prakash Sangappa



On 5/2/17 2:32 PM, Dave Hansen wrote:

On 05/01/2017 11:00 AM, Prakash Sangappa wrote:

This patch adds a new hugetlbfs mount option 'noautofill', to indicate that
pages should not be allocated at page fault time when accessed thru mmapped
address.

I think the main argument against doing something like this is further
specializing hugetlbfs.  I was really hoping that userfaultfd would be
usable for your needs here.

Could you elaborate on other options that you considered?  Did you look
at userfaultfd?  What about an madvise() option that disallows backing
allocations?



Yes, we did consider userfaultfd and madvise(). The use case in mind is 
the database.


With a database, a large number of single-threaded processes are involved,
which map the hugetlbfs file and use it for shared memory. The concern with
using userfaultfd is the setup overhead and having an additional thread per
process to monitor the userfaultfd. Even if the additional thread can be
avoided by using an external monitor process, with each process sending its
userfaultfd to this monitor process, the setup overhead remains.

Similarly, a madvise() option also requires an additional system call by
every process mapping the file; this is considered an overhead for the
database.

If we do consider a new madvise() option, would it be acceptable given
that it would be specifically for hugetlbfs file mappings? If so, would a
new flag to the mmap() call itself be acceptable, which would define the
proposed behavior? That way no additional system calls need to be made.
Again, this mmap flag would apply specifically to hugetlbfs file mappings.

With the proposed mount option, it would enforce one consistent behavior,
and applications using this filesystem would not have to take additional
steps as with userfaultfd or madvise().
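
For context, a sketch of how such a dedicated mount could be set up;
'noautofill' is the option proposed by this patch (not upstream), and the
mount point and min_size value are arbitrary examples:

#include <sys/mount.h>

/* Sketch: mount a hugetlbfs instance reserved for one application, with a
 * minimum huge page reservation and the proposed 'noautofill' option. */
static int mount_db_hugetlbfs(void)
{
	return mount("none", "/mnt/hugedb", "hugetlbfs", 0,
		     "min_size=250G,noautofill");
}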



Re: [PATCH RFC] hugetlbfs 'noautofill' mount option

2017-05-02 Thread Prakash Sangappa



On 5/2/17 3:53 AM, Anshuman Khandual wrote:

On 05/01/2017 11:30 PM, Prakash Sangappa wrote:

Some applications like a database use hugetlbfs for performance
reasons. Files on hugetlbfs filesystem are created and huge pages
allocated using fallocate() API. Pages are deallocated/freed using
fallocate() hole punching support that has been added to hugetlbfs.
These files are mmapped and accessed by many processes as shared memory.
Such applications keep track of which offsets in the hugetlbfs file have
pages allocated.

Any access to mapped address over holes in the file, which can occur due

s/mapped/unmapped/ ^ ?


It is 'mapped' address.




to bugs in the application, is considered invalid, and the process is
expected to simply receive a SIGBUS.  However, currently when a hole in the
file is accessed via the mapped address, kernel/mm attempts to automatically
allocate a page at page fault time, resulting in implicitly filling the hole

But this is expected when you try to control the file allocation from
a mapped address. Any changes while walking past or writing the range
in the mapped memory should be reflected exactly in the file on disk.
Why is it not a valid behavior?

Sure, that is a valid behavior. However, hugetlbfs is a pseudo filesystem
and the purpose is for applications to use hugepage memory. The contents
of this filesystem are not backed by disk, nor are they swapped out.

The proposed new behavior is only applicable to a hugetlbfs filesystem
mounted with the new 'noautofill' mount option. The file's page
allocation/freeing is managed using the fallocate() API.

For hugetlbfs filesystems mounted without this option, there is no
change in behavior.



in the file. This may not be the desired behavior for applications like the
database that want to explicitly manage page allocations of hugetlbfs
files.

This patch adds a new hugetlbfs mount option 'noautofill', to indicate that
pages should not be allocated at page fault time when accessed thru mmapped
address.

When should the page be allocated for the mapping?

The application would allocate/free file pages using the fallocate() API.
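
As a reference, a minimal sketch of that allocate/free cycle on a hugetlbfs
file descriptor; the huge page size and range are illustrative and error
handling is omitted:

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>

#define HPAGE_SZ	(2UL * 1024 * 1024)	/* assuming 2MB huge pages */

/* Sketch: back a range of a hugetlbfs file with huge pages, then free
 * them again by punching a hole over the same range. */
static void alloc_then_free(int fd)
{
	/* allocate the first 4 huge pages of the file */
	fallocate(fd, 0, 0, 4 * HPAGE_SZ);

	/* ... processes mmap() the file and use it as shared memory ... */

	/* deallocate: punch a hole; the range is now a hole in the file */
	fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		  0, 4 * HPAGE_SZ);
}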




