Re: [PATCH V2 0/6] VA to numa node information
On 9/24/18 10:14 AM, Michal Hocko wrote:
> On Fri 14-09-18 12:01:18, Steven Sistare wrote:
>> On 9/14/2018 1:56 AM, Michal Hocko wrote:
>> [...]
>>> Why does this matter for something that is for analysis purposes?
>>> Reading the file for the whole address space is far from a free
>>> operation. Is the page walk optimization really essential for
>>> usability? Moreover, what prevents the move_pages implementation from
>>> being clever about the page walk itself? In other words, why would we
>>> want to add a new API rather than make the existing one faster for
>>> everybody?
>>
>> One could optimize move_pages. If the caller passes a consecutive
>> range of small pages, and the page walk sees that a VA is mapped by a
>> huge page, then it can return the same numa node for each of the
>> following VAs that fall into the huge page range. It would be faster
>> than 55 nsec per small page, but it is hard to say how much faster,
>> and the cost is still driven by the number of small pages.
>
> This is exactly what I was arguing for. There is some room for
> improvement in the existing interface. I have yet to hear the explicit
> use case which would require even better performance than can be
> achieved with the existing API.

The above mentioned optimization to the move_pages() API helps when scanning mapped huge pages, but does not help if there are large sparse mappings with few pages mapped. Otherwise, consider adding page walk support in the move_pages() implementation and enhancing the API (a new flag?) to return address range to numa node information.

The page walk optimization would certainly make a difference for usability. We can have applications (like Oracle DB) with processes that have large sparse mappings (in TBs), with only some areas of the mapped address range being accessed - i.e., large portions do not have page tables backing them. This will become more prevalent on newer systems with multiple TBs of memory.

Here is some data from pmap using the move_pages() API with the optimization.
The following table compares the time pmap takes to print the address mapping of a large process with numa node information, using the move_pages() API vs. pmap using the /proc numa_vamaps file. The pmap command was run on a process with 1.3 TB of address space, with sparse mappings. The second column is the pmap time on a 250G address range of this process, which maps hugepages (THP & hugetlb).

                        ~1.3 TB sparse   250G dense segment with hugepages
  move_pages            8.33s            3.14s
  optimized move_pages  6.29s            0.92s
  /proc numa_vamaps     0.08s            0.04s
Re: [PATCH V2 0/6] VA to numa node information
On 9/13/2018 5:25 PM, Dave Hansen wrote:
> On 09/13/2018 05:10 PM, Andrew Morton wrote:
>> Also, VMAs having THP pages can have a mix of 4k pages and hugepages.
>> The page walk would be efficient in scanning and determining if it is
>> a THP huge page and stepping over it. Whereas using the API, the
>> application would not know what page size mapping is used for a given
>> VA and so would have to again scan the VMA in units of 4k page size.
>> If this sounds reasonable, I can add it to the commit / patch
>> description.
>
> As we are judging whether this is a "good" interface, can you tell us a
> bit about its scalability? For instance, let's say someone has a 1TB
> VMA that's populated with interleaved 4k pages. How much data comes
> out? How long does it take to parse? Will we effectively deadlock the
> system if someone accidentally cat's the wrong /proc file?

For the worst case scenario you describe, it would be one line (range) for each 4k page, which is similar to what you get with '/proc/*/pagemap'. The amount of data copied out at a time is based on the buffer size used in the kernel, which is 1024 bytes. That is, if one line (one range) printed is about 40 bytes (chars), that means about 25 lines per copy-out.

The main concern would be holding the 'mmap_sem' lock, which can cause hangs. When the 1024-byte buffer gets filled, the mmap_sem is dropped and the buffer content is copied out to the user buffer. Then the mmap_sem lock is reacquired and the page walk continues as needed, until the specified user buffer size is filled or the end of the process address space is reached. One potential issue could be a large VA range with all pages populated from one numa node: the page walk could then take longer while holding the mmap_sem lock. This can be addressed by dropping and re-acquiring the mmap_sem lock after a certain number of pages have been walked (say 512, which is what happens in the '/proc/*/pagemap' case).

> /proc seems like a really simple way to implement this, but it seems a
> *really* odd choice for something that needs to collect a large amount
> of data. The lseek() stuff is a nice addition, but I wonder if it's
> unwieldy to use in practice. For instance, if you want to read data
> for the VMA at 0x100 you lseek(fd, 0x100, SEEK_SET), right? You read
> ~20 bytes of data and then the fd is at 0x120. But, you're getting
> data out at the next read() for (at least) the next page, which is
> also available at 0x1001000. Seems funky. Do other /proc files behave
> this way?

Yes, SEEK_SET to the VA. The lseek offset is the process VA, so it is not going to be different from reading a normal text file, except that /proc files are special. E.g., in the '/proc/*/pagemap' file case, read enforces that the seek/file offset and the user buffer size passed in be a multiple of the pagemap_entry_t size, or else the read fails.

The usage for the numa_vamaps file will be to SEEK_SET to the VA from where the VA range to numa node information needs to be read. The 'fd' offset is not taken into consideration here, just the VA. Say each VA range to numa node id line printed is about 40 bytes (chars). Now if the read only reads 20 bytes, it would have read part of the line. A subsequent read would read the remaining bytes of the line, which are stored in the kernel buffer.
Re: [RFC PATCH] Add /proc/<pid>/numa_vamaps for numa node information
On 9/14/18 5:49 AM, Jann Horn wrote:
> On Fri, Sep 14, 2018 at 8:21 AM Michal Hocko wrote:
>> On Fri 14-09-18 03:33:28, Jann Horn wrote:
>>> On Wed, Sep 12, 2018 at 10:43 PM prakash.sangappa wrote:
>>>> On 05/09/2018 04:31 PM, Dave Hansen wrote:
>>>>> On 05/07/2018 06:16 PM, prakash.sangappa wrote:
>>>>>> It will be /proc/<pid>/numa_vamaps. Yes, the behavior will be
>>>>>> different with respect to seeking. Output will still be text and
>>>>>> the format will be the same. I want to get feedback on this
>>>>>> approach.
>>>>>
>>>>> I think it would be really great if you can write down a list of
>>>>> the things you actually want to accomplish. Dare I say: you need a
>>>>> requirements list. The numa_vamaps approach continues down the
>>>>> path of an ever-growing list of highly-specialized /proc/<pid>/
>>>>> files. I don't think that is sustainable, even if it has been our
>>>>> trajectory for many years. Pagemap wasn't exactly a shining
>>>>> example of us getting new ABIs right, but it sounds like something
>>>>> along those lines is what we need.
>>>>
>>>> Just sent out a V2 patch. This patch simplifies the file content.
>>>> It only provides VA range to numa node id information. The
>>>> requirement is basically observability for performance analysis.
>>>>
>>>> - Need to be able to determine VA range to numa node id
>>>>   information, which also gives an idea of which ranges have memory
>>>>   allocated.
>>>> - The proc file /proc/<pid>/numa_vamaps is in text, so it is easy
>>>>   to view directly. The V2 patch supports seeking to a particular
>>>>   process VA from where the application can read the VA to numa
>>>>   node id information.
>>>>
>>>> Also added the 'PTRACE_MODE_READ_REALCREDS' check when opening the
>>>> /proc file, as was indicated by Michal Hocko.
>>>
>>> procfs files should use PTRACE_MODE_*_FSCREDS, not
>>> PTRACE_MODE_*_REALCREDS.
>>
>> Out of my curiosity, what is the semantic difference? At least
>> kernel_move_pages uses PTRACE_MODE_READ_REALCREDS. Is this a bug?
>
> No, that's fine. REALCREDS basically means "look at the caller's real
> UID for the access check", while FSCREDS means "look at the caller's
> filesystem UID".
>
> The ptrace access check has historically been using the real UID,
> which is sorta weird, but normally works fine. Given that this is
> documented, I didn't see any reason to change it for most things that
> do ptrace access checks, even if the EUID would IMO be more
> appropriate. But things that capture caller credentials at points like
> open() really shouldn't look at the real UID; instead, they should use
> the filesystem UID (which in practice is basically the same as the
> EUID). So in short, it depends on the interface you're coming through:
> direct syscalls use REALCREDS, things that go through the VFS layer
> use FSCREDS.

So in this case, can the REALCREDS check be done in the read() system call when reading the /proc file, instead of in the open() call?
Re: [PATCH V2 0/6] VA to numa node information
On 9/14/18 9:01 AM, Steven Sistare wrote:
> On 9/14/2018 1:56 AM, Michal Hocko wrote:
>> On Thu 13-09-18 15:32:25, prakash.sangappa wrote:
>>> The proc interface provides an efficient way to export address range
>>> to numa node id mapping information compared to using the API.
>>
>> Do you have any numbers?
>>
>>> For example, for sparsely populated mappings, if a VMA has large
>>> portions not having any physical pages mapped, the page walk done
>>> through the /proc file interface can skip over non-existent PMDs /
>>> PTEs. Whereas using the API, the application would have to scan the
>>> entire VMA in page size units.
>>
>> What prevents you from pre-filtering by reading /proc/$pid/maps to
>> get ranges of interest?
>
> That works for skipping holes, but not for skipping huge pages. I did
> a quick experiment to time move_pages on a 3 GHz Xeon and a 4.18
> kernel. Allocate 128 GB and touch every small page. Call move_pages
> with nodes=NULL to get the node id for all pages, passing 512
> consecutive small pages per call to move_pages. The total move_pages
> time is 1.85 secs, and 55 nsec per page. Extrapolating to a 1 TB
> range, it would take 15 sec to retrieve the numa node for every small
> page in the range. That is not terrible, but it is not interactive,
> and it becomes terrible for multiple TB.

Also, for valid VMAs in the 'maps' file, if the VMA is sparsely populated with physical pages, the page walk can skip over non-existent page table entries (PMDs) and so can be faster. For example, reading the VA range of a 400GB VMA which has a few pages mapped at the beginning and a few pages at the end, with the rest of the VMA not having any pages, takes 0.001s using the /proc interface. Whereas with the move_pages() API, passing 1024 consecutive small page addresses, it takes about 2.4 secs. This is on a similar system running a 4.19 kernel.
[PATCH V2 3/6] Provide process address range to numa node id mapping
This patch provides process address range to numa node information thru the /proc/<pid>/numa_vamaps file. For address ranges not having any pages mapped, a '-' is printed instead of the numa node id. Following is a sample of the file format:

00400000-00410000 N1
00410000-0047f000 N0
0047f000-00480000 N2
00480000-00481000 -
00481000-004a0000 N0
004a0000-004a2000 -
004a2000-004aa000 N2
004aa000-004ad000 N0
004ad000-004ae000 -
..

Signed-off-by: Prakash Sangappa
Reviewed-by: Steve Sistare
---
 fs/proc/task_mmu.c | 158 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 158 insertions(+)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 02b553c..1371e379 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1845,6 +1845,162 @@ static int show_numa_map(struct seq_file *m, void *v)
 	return 0;
 }
 
+static int gather_hole_info_vamap(unsigned long start, unsigned long end,
+			struct mm_walk *walk)
+{
+	struct numa_maps *md = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+
+	/*
+	 * If in a nid, end walk at hole start.
+	 * If no nid and vma changes, end walk at next vma start.
+	 */
+	if (md->nid >= 0 || vma != find_vma(walk->mm, start)) {
+		md->nextaddr = start;
+		return 1;
+	}
+
+	if (md->nid == NUMA_VAMAPS_NID_NONE)
+		md->nid = NUMA_VAMAPS_NID_NOPAGES;
+
+	return 0;
+}
+
+static int vamap_vprintf(struct numa_vamaps_private *nvm, const char *f, ...)
+{
+	va_list args;
+	int len, space;
+
+	space = NUMA_VAMAPS_BUFSZ - nvm->count;
+	va_start(args, f);
+	len = vsnprintf(nvm->buf + nvm->count, space, f, args);
+	va_end(args);
+	if (len < space) {
+		nvm->count += len;
+		return 0;
+	}
+	return 1;
+}
+
+/*
+ * Display va-range to numa node info via /proc
+ */
+static ssize_t numa_vamaps_read(struct file *file, char __user *buf,
+		size_t count, loff_t *ppos)
+{
+	struct numa_vamaps_private *nvm = file->private_data;
+	struct vm_area_struct *vma, *tailvma;
+	struct numa_maps *md = &nvm->md;
+	struct mm_struct *mm = nvm->mm;
+	u64 vm_start = nvm->vm_start;
+	size_t ucount;
+	struct mm_walk walk = {
+		.hugetlb_entry = gather_hugetlb_stats,
+		.pmd_entry = gather_pte_stats,
+		.pte_hole = gather_hole_info_vamap,
+		.private = md,
+		.mm = mm,
+	};
+	int ret = 0, copied = 0, done = 0;
+
+	if (!mm || !mmget_not_zero(mm))
+		return 0;
+
+	if (count <= 0)
+		goto out_mm;
+
+	/* First copy leftover contents in buffer */
+	if (nvm->from)
+		goto docopy;
+
+repeat:
+	down_read(&mm->mmap_sem);
+	vma = find_vma(mm, vm_start);
+	if (!vma) {
+		done = 1;
+		goto out;
+	}
+
+	if (vma->vm_start > vm_start)
+		vm_start = vma->vm_start;
+
+	while (nvm->count < count) {
+		u64 vm_end;
+
+		/* Ensure we start with an empty numa_maps statistics */
+		memset(md, 0, sizeof(*md));
+		md->nid = NUMA_VAMAPS_NID_NONE;	/* invalid nodeid at start */
+		md->nextaddr = 0;
+		md->isvamaps = 1;
+
+		if (walk_page_range(vm_start, vma->vm_end, &walk) < 0)
+			break;
+
+		/* nextaddr ends the range. if 0, reached the vma end */
+		vm_end = (md->nextaddr ? md->nextaddr : vma->vm_end);
+
+		/* break if buffer full */
+		if (md->nid >= 0 && md->node[md->nid]) {
+			if (vamap_vprintf(nvm, "%08lx-%08lx N%ld\n", vm_start,
+			    vm_end, md->nid))
+				break;
+		} else if (vamap_vprintf(nvm, "%08lx-%08lx - \n", vm_start,
+			   vm_end)) {
+			break;
+		}
+
+		/* advance to next VA */
+		vm_start = vm_end;
+		if (vm_end == vma->vm_end) {
+			vma = vma->vm_next;
+			if (!vma) {
+				done = 1;
+				break;
+			}
+			vm_start = vma->vm_start;
+		}
+	}
+out:
+	/* last, add gate vma details */
+	if (!vma && (tailvma = get_gate_vma(mm)) != NULL &&
+	    vm_start < tailvma->vm_end) {
+		done = 0;
+		if (!vamap_vprintf(nvm, "%08lx-%08lx - \n",
+		    tailvma->vm_start, tailvma->vm_end)) {
+			done = 1;
+			vm_start = tail
[PATCH V2 1/6] Add check to match numa node id when gathering pte stats
Add support to check if the numa node id matches when gathering pte stats, to be used by later patches.

Signed-off-by: Prakash Sangappa
Reviewed-by: Steve Sistare
---
 fs/proc/task_mmu.c | 44 +++++++++++++++++++++++++++++++++++++-------
 1 file changed, 37 insertions(+), 7 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 5ea1d64..0e2095c 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1569,9 +1569,15 @@ struct numa_maps {
 	unsigned long mapcount_max;
 	unsigned long dirty;
 	unsigned long swapcache;
+	unsigned long nextaddr;
+	long nid;
+	long isvamaps;
 	unsigned long node[MAX_NUMNODES];
 };
 
+#define	NUMA_VAMAPS_NID_NOPAGES	(-1)
+#define	NUMA_VAMAPS_NID_NONE	(-2)
+
 struct numa_maps_private {
 	struct proc_maps_private proc_maps;
 	struct numa_maps md;
@@ -1653,6 +1659,20 @@ static struct page *can_gather_numa_stats_pmd(pmd_t pmd,
 }
 #endif
 
+static bool
+vamap_match_nid(struct numa_maps *md, unsigned long addr, struct page *page)
+{
+	long target = (page ? page_to_nid(page) : NUMA_VAMAPS_NID_NOPAGES);
+
+	if (md->nid == NUMA_VAMAPS_NID_NONE)
+		md->nid = target;
+	if (md->nid == target)
+		return 0;
+	/* did not match */
+	md->nextaddr = addr;
+	return 1;
+}
+
 static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 		unsigned long end, struct mm_walk *walk)
 {
@@ -1661,6 +1681,7 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 	spinlock_t *ptl;
 	pte_t *orig_pte;
 	pte_t *pte;
+	int ret = 0;
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	ptl = pmd_trans_huge_lock(pmd, vma);
@@ -1668,11 +1689,13 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 		struct page *page;
 
 		page = can_gather_numa_stats_pmd(*pmd, vma, addr);
-		if (page)
+		if (md->isvamaps)
+			ret = vamap_match_nid(md, addr, page);
+		if (page && !ret)
 			gather_stats(page, md, pmd_dirty(*pmd),
 				     HPAGE_PMD_SIZE/PAGE_SIZE);
 		spin_unlock(ptl);
-		return 0;
+		return ret;
 	}
 
 	if (pmd_trans_unstable(pmd))
@@ -1681,6 +1704,10 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 	orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
 	do {
 		struct page *page = can_gather_numa_stats(*pte, vma, addr);
+		if (md->isvamaps && vamap_match_nid(md, addr, page)) {
+			ret = 1;
+			break;
+		}
 		if (!page)
 			continue;
 		gather_stats(page, md, pte_dirty(*pte), 1);
@@ -1688,7 +1715,7 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 	} while (pte++, addr += PAGE_SIZE, addr != end);
 	pte_unmap_unlock(orig_pte, ptl);
 	cond_resched();
-	return 0;
+	return ret;
 }
 #ifdef CONFIG_HUGETLB_PAGE
 static int gather_hugetlb_stats(pte_t *pte, unsigned long hmask,
@@ -1697,15 +1724,18 @@ static int gather_hugetlb_stats(pte_t *pte, unsigned long hmask,
 	pte_t huge_pte = huge_ptep_get(pte);
 	struct numa_maps *md;
 	struct page *page;
+	int ret = 0;
 
+	md = walk->private;
 	if (!pte_present(huge_pte))
-		return 0;
+		return (md->isvamaps ? vamap_match_nid(md, addr, NULL) : 0);
 
 	page = pte_page(huge_pte);
-	if (!page)
-		return 0;
+	if (md->isvamaps)
+		ret = vamap_match_nid(md, addr, page);
+	if (!page || ret)
+		return ret;
 
-	md = walk->private;
 	gather_stats(page, md, pte_dirty(huge_pte), 1);
 	return 0;
 }
-- 
2.7.4
[PATCH V2 2/6] Add /proc/<pid>/numa_vamaps file for numa node information
Introduce supporting data structures and file operations. A later patch will provide the changes for generating the file content.

Signed-off-by: Prakash Sangappa
Reviewed-by: Steve Sistare
---
 fs/proc/base.c     |  2 ++
 fs/proc/internal.h |  1 +
 fs/proc/task_mmu.c | 42 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 45 insertions(+)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index ccf86f1..1af99ae 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2927,6 +2927,7 @@ static const struct pid_entry tgid_base_stuff[] = {
 	REG("maps",       S_IRUGO, proc_pid_maps_operations),
 #ifdef CONFIG_NUMA
 	REG("numa_maps",  S_IRUGO, proc_pid_numa_maps_operations),
+	REG("numa_vamaps", S_IRUGO, proc_numa_vamaps_operations),
 #endif
 	REG("mem",        S_IRUSR|S_IWUSR, proc_mem_operations),
 	LNK("cwd",        proc_cwd_link),
@@ -3313,6 +3314,7 @@ static const struct pid_entry tid_base_stuff[] = {
 #endif
 #ifdef CONFIG_NUMA
 	REG("numa_maps",  S_IRUGO, proc_pid_numa_maps_operations),
+	REG("numa_vamaps", S_IRUGO, proc_numa_vamaps_operations),
 #endif
 	REG("mem",        S_IRUSR|S_IWUSR, proc_mem_operations),
 	LNK("cwd",        proc_cwd_link),
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 5185d7f..994c7fd 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -298,6 +298,7 @@ extern const struct file_operations proc_pid_smaps_operations;
 extern const struct file_operations proc_pid_smaps_rollup_operations;
 extern const struct file_operations proc_clear_refs_operations;
 extern const struct file_operations proc_pagemap_operations;
+extern const struct file_operations proc_numa_vamaps_operations;
 
 extern unsigned long task_vsize(struct mm_struct *);
 extern unsigned long task_statm(struct mm_struct *,
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 0e2095c..02b553c 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1583,6 +1583,16 @@ struct numa_maps_private {
 	struct numa_maps md;
 };
 
+#define NUMA_VAMAPS_BUFSZ	1024
+struct numa_vamaps_private {
+	struct mm_struct *mm;
+	struct numa_maps md;
+	u64 vm_start;
+	size_t from;
+	size_t count;			/* residual bytes in buf at offset 'from' */
+	char buf[NUMA_VAMAPS_BUFSZ];	/* buffer */
+};
+
 static void gather_stats(struct page *page, struct numa_maps *md, int pte_dirty,
 			unsigned long nr_pages)
 {
@@ -1848,6 +1858,34 @@ static int pid_numa_maps_open(struct inode *inode, struct file *file)
 				sizeof(struct numa_maps_private));
 }
 
+static int numa_vamaps_open(struct inode *inode, struct file *file)
+{
+	struct mm_struct *mm;
+	struct numa_vamaps_private *nvm;
+
+	nvm = kzalloc(sizeof(struct numa_vamaps_private), GFP_KERNEL);
+	if (!nvm)
+		return -ENOMEM;
+
+	mm = proc_mem_open(inode, PTRACE_MODE_READ);
+	if (IS_ERR(mm)) {
+		kfree(nvm);
+		return PTR_ERR(mm);
+	}
+	nvm->mm = mm;
+	file->private_data = nvm;
+	return 0;
+}
+
+static int numa_vamaps_release(struct inode *inode, struct file *file)
+{
+	struct numa_vamaps_private *nvm = file->private_data;
+
+	if (nvm->mm)
+		mmdrop(nvm->mm);
+	kfree(nvm);
+	return 0;
+}
+
 const struct file_operations proc_pid_numa_maps_operations = {
 	.open		= pid_numa_maps_open,
 	.read		= seq_read,
@@ -1855,4 +1893,8 @@ const struct file_operations proc_pid_numa_maps_operations = {
 	.release	= proc_map_release,
 };
 
+const struct file_operations proc_numa_vamaps_operations = {
+	.open		= numa_vamaps_open,
+	.release	= numa_vamaps_release,
+};
 #endif /* CONFIG_NUMA */
-- 
2.7.4
[PATCH V2 5/6] File /proc/<pid>/numa_vamaps access needs PTRACE_MODE_READ_REALCREDS check
Permission to access the /proc/<pid>/numa_vamaps file should be governed by the PTRACE_MODE_READ_REALCREDS check, to restrict who can obtain specific VA range to numa node mapping information. Signed-off-by: Prakash Sangappa Reviewed-by: Steve Sistare --- fs/proc/base.c | 4 +++- fs/proc/task_mmu.c | 2 +- 2 files changed, 4 insertions(+), 2 deletions(-) diff --git a/fs/proc/base.c b/fs/proc/base.c index 1af99ae..3c19a55 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -745,7 +745,9 @@ struct mm_struct *proc_mem_open(struct inode *inode, unsigned int mode) struct mm_struct *mm = ERR_PTR(-ESRCH); if (task) { - mm = mm_access(task, mode | PTRACE_MODE_FSCREDS); + if (!(mode & PTRACE_MODE_REALCREDS)) + mode |= PTRACE_MODE_FSCREDS; + mm = mm_access(task, mode); put_task_struct(task); if (!IS_ERR_OR_NULL(mm)) { diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 93dce46..30b29d2 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -2043,7 +2043,7 @@ static int numa_vamaps_open(struct inode *inode, struct file *file) if (!nvm) return -ENOMEM; - mm = proc_mem_open(inode, PTRACE_MODE_READ); + mm = proc_mem_open(inode, PTRACE_MODE_READ | PTRACE_MODE_REALCREDS); if (IS_ERR(mm)) { kfree(nvm); return PTR_ERR(mm); -- 2.7.4
[PATCH V2 6/6] /proc/pid/numa_vamaps: document in Documentation/filesystems/proc.txt
Add documentation for /proc/<pid>/numa_vamaps in Documentation/filesystems/proc.txt Signed-off-by: Prakash Sangappa Reviewed-by: Steve Sistare --- Documentation/filesystems/proc.txt | 21 + 1 file changed, 21 insertions(+) diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index 22b4b00..7095216 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -150,6 +150,9 @@ Table 1-1: Process specific entries in /proc each mapping and flags associated with it numa_maps an extension based on maps, showing the memory locality and binding policy as well as mem usage (in pages) of each mapping. + numa_vamaps Presents information about mapped address ranges to numa node + from where the physical memory is allocated. + .. For example, to get the status information of a process, all you have to do is @@ -571,6 +574,24 @@ Where: node locality page counters (N0 == node0, N1 == node1, ...) and the kernel page size, in KB, that is backing the mapping up. +The /proc/pid/numa_vamaps shows mapped address ranges to numa node id from +where the physical pages are allocated. For mapped address ranges not having +any pages mapped a '-' is shown instead of the node id. Each line in the file +will show address range to one numa node. + +address-range numa-node-id + +0040-0041 N1 +0041-0047f000 N0 +0047f000-0048 N2 +0048-00481000 - +00481000-004a N0 +004a-004a2000 - +004a2000-004aa000 N2 +004aa000-004ad000 N0 +004ad000-004ae000 - +.. + 1.2 Kernel data --- -- 2.7.4
[PATCH V2 1/6] Add check to match numa node id when gathering pte stats
Add support to check if the numa node id matches when gathering pte stats, to be used by later patches. Signed-off-by: Prakash Sangappa Reviewed-by: Steve Sistare --- fs/proc/task_mmu.c | 44 +--- 1 file changed, 37 insertions(+), 7 deletions(-) diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 5ea1d64..0e2095c 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -1569,9 +1569,15 @@ struct numa_maps { unsigned long mapcount_max; unsigned long dirty; unsigned long swapcache; + unsigned long nextaddr; + long nid; + long isvamaps; unsigned long node[MAX_NUMNODES]; }; +#define NUMA_VAMAPS_NID_NOPAGES (-1) +#define NUMA_VAMAPS_NID_NONE (-2) + struct numa_maps_private { struct proc_maps_private proc_maps; struct numa_maps md; @@ -1653,6 +1659,20 @@ static struct page *can_gather_numa_stats_pmd(pmd_t pmd, } #endif +static bool +vamap_match_nid(struct numa_maps *md, unsigned long addr, struct page *page) +{ + long target = (page ? page_to_nid(page) : NUMA_VAMAPS_NID_NOPAGES); + + if (md->nid == NUMA_VAMAPS_NID_NONE) + md->nid = target; + if (md->nid == target) + return 0; + /* did not match */ + md->nextaddr = addr; + return 1; +} + static int gather_pte_stats(pmd_t *pmd, unsigned long addr, unsigned long end, struct mm_walk *walk) { @@ -1661,6 +1681,7 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr, spinlock_t *ptl; pte_t *orig_pte; pte_t *pte; + int ret = 0; #ifdef CONFIG_TRANSPARENT_HUGEPAGE ptl = pmd_trans_huge_lock(pmd, vma); @@ -1668,11 +1689,13 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr, struct page *page; page = can_gather_numa_stats_pmd(*pmd, vma, addr); - if (page) + if (md->isvamaps) + ret = vamap_match_nid(md, addr, page); + if (page && !ret) gather_stats(page, md, pmd_dirty(*pmd), HPAGE_PMD_SIZE/PAGE_SIZE); spin_unlock(ptl); - return 0; + return ret; } if (pmd_trans_unstable(pmd)) @@ -1681,6 +1704,10 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr, orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
do { struct page *page = can_gather_numa_stats(*pte, vma, addr); + if (md->isvamaps && vamap_match_nid(md, addr, page)) { + ret = 1; + break; + } if (!page) continue; gather_stats(page, md, pte_dirty(*pte), 1); @@ -1688,7 +1715,7 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr, } while (pte++, addr += PAGE_SIZE, addr != end); pte_unmap_unlock(orig_pte, ptl); cond_resched(); - return 0; + return ret; } #ifdef CONFIG_HUGETLB_PAGE static int gather_hugetlb_stats(pte_t *pte, unsigned long hmask, @@ -1697,15 +1724,18 @@ static int gather_hugetlb_stats(pte_t *pte, unsigned long hmask, pte_t huge_pte = huge_ptep_get(pte); struct numa_maps *md; struct page *page; + int ret = 0; + md = walk->private; if (!pte_present(huge_pte)) - return 0; + return (md->isvamaps ? vamap_match_nid(md, addr, NULL) : 0); page = pte_page(huge_pte); - if (!page) - return 0; + if (md->isvamaps) + ret = vamap_match_nid(md, addr, page); + if (!page || ret) + return ret; - md = walk->private; gather_stats(page, md, pte_dirty(huge_pte), 1); return 0; } -- 2.7.4
[PATCH V2 0/6] VA to numa node information
For analysis purposes it is useful to have numa node information corresponding to the mapped virtual address ranges of a process. Currently, the file /proc/<pid>/numa_maps provides a list of numa nodes from where pages are allocated, per VMA of a process. This is not useful if a user needs to determine which numa node the mapped pages are allocated from for a particular address range. It would have helped if the numa node information presented in /proc/<pid>/numa_maps was broken down by VA ranges showing the exact numa node from where the pages have been allocated. The format of the /proc/<pid>/numa_maps file content is dependent on the /proc/<pid>/maps file content, as mentioned in the manpage; i.e. one line entry for every VMA corresponding to entries in the /proc/<pid>/maps file. Therefore changing the output of /proc/<pid>/numa_maps may not be possible. This patch set introduces the file /proc/<pid>/numa_vamaps, which provides a proper breakdown of VA ranges by the numa node id from where the mapped pages are allocated. For address ranges not having any pages mapped, a '-' is printed instead of a numa node id. The file supports lseek, allowing a seek to a specific process virtual address (VA) from which the address range to numa node information can be read. The new file /proc/<pid>/numa_vamaps will be governed by ptrace access mode PTRACE_MODE_READ_REALCREDS. See the following for previous discussion about this proposal: https://marc.info/?t=15252407341=1=2 Prakash Sangappa (6): Add check to match numa node id when gathering pte stats Add /proc/<pid>/numa_vamaps file for numa node information Provide process address range to numa node id mapping Add support to lseek /proc/<pid>/numa_vamaps file File /proc/<pid>/numa_vamaps access needs PTRACE_MODE_READ_REALCREDS check /proc/pid/numa_vamaps: document in Documentation/filesystems/proc.txt Documentation/filesystems/proc.txt | 21 +++ fs/proc/base.c | 6 +- fs/proc/internal.h | 1 + fs/proc/task_mmu.c | 265 - 4 files changed, 285 insertions(+), 8 deletions(-) -- 2.7.4
[PATCH V2 4/6] Add support to lseek /proc/<pid>/numa_vamaps file
Allow lseek to a process virtual address (VA), from which point the address range to numa node information can be read. The lseek offset is the process virtual address. Signed-off-by: Prakash Sangappa Reviewed-by: Steve Sistare --- fs/proc/task_mmu.c | 23 ++- 1 file changed, 22 insertions(+), 1 deletion(-) diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 1371e379..93dce46 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -1866,6 +1866,27 @@ static int gather_hole_info_vamap(unsigned long start, unsigned long end, return 0; } +static loff_t numa_vamaps_llseek(struct file *file, loff_t offset, int orig) +{ + struct numa_vamaps_private *nvm = file->private_data; + + if (orig == SEEK_CUR && offset < 0 && nvm->vm_start < -offset) + return -EINVAL; + + switch (orig) { + case SEEK_SET: + nvm->vm_start = offset & PAGE_MASK; + break; + case SEEK_CUR: + nvm->vm_start += offset; + nvm->vm_start = nvm->vm_start & PAGE_MASK; + break; + default: + return -EINVAL; + } + return nvm->vm_start; +} + static int vamap_vprintf(struct numa_vamaps_private *nvm, const char *f, ...) { va_list args; @@ -2052,7 +2073,7 @@ const struct file_operations proc_pid_numa_maps_operations = { const struct file_operations proc_numa_vamaps_operations = { .open = numa_vamaps_open, .read = numa_vamaps_read, - .llseek = noop_llseek, + .llseek = numa_vamaps_llseek, .release= numa_vamaps_release, }; #endif /* CONFIG_NUMA */ -- 2.7.4
Re: [RFC PATCH] Add /proc/<pid>/numa_vamaps for numa node information
On 5/10/18 12:42 AM, Michal Hocko wrote: On Fri 04-05-18 09:18:11, Prakash Sangappa wrote: On 5/4/18 4:12 AM, Michal Hocko wrote: On Thu 03-05-18 15:39:49, prakash.sangappa wrote: On 05/03/2018 11:03 AM, Christopher Lameter wrote: On Tue, 1 May 2018, Prakash Sangappa wrote: For analysis purpose it is useful to have numa node information corresponding mapped address ranges of the process. Currently /proc//numa_maps provides list of numa nodes from where pages are allocated per VMA of the process. This is not useful if an user needs to determine which numa node the mapped pages are allocated from for a particular address range. It would have helped if the numa node information presented in /proc//numa_maps was broken down by VA ranges showing the exact numa node from where the pages have been allocated. Cant you write a small script that scans the information in numa_maps and then displays the total pages per NUMA node and then a list of which ranges have how many pages on a particular node? Don't think we can determine which numa node a given user process address range has pages from, based on the existing 'numa_maps' file. yes we have. See move_pages... Sure using move_pages, not based on just 'numa_maps'. reading this file will not be restricted(i.e requiring CAP_SYS_ADMIN). So a prime motivator here is security restricted access to numa_maps? No it is the opposite. A regular user should be able to determine numa node information. Well, that breaks the layout randomization, doesn't it? Exposing numa node information itself should not break randomization right? I thought you planned to expose address ranges for each numa node as well. /me confused. Yes, are you suggesting this information should not be available to a regular user? Is it not possible to get that same information using the move_pages() api as a regular user, although one / set of pages at a time? It would be upto the application. 
In case of randomization, the application could generate address range traces of interest for debugging and then using numa node information one could determine where the memory is laid out for analysis. ... even more confused
Re: [RFC PATCH] Add /proc/<pid>/numa_vamaps for numa node information
On 5/4/18 7:57 AM, Christopher Lameter wrote: On Thu, 3 May 2018, prakash.sangappa wrote: exact numa node from where the pages have been allocated. Cant you write a small script that scans the information in numa_maps and then displays the total pages per NUMA node and then a list of which ranges have how many pages on a particular node? Don't think we can determine which numa node a given user process address range has pages from, based on the existing 'numa_maps' file. Well the information is contained in numa_maps I thought. What is missing? Currently 'numa_maps' gives a list of numa nodes, memory is allocated per VMA. Ex. we get something like from numa_maps. 04000 N0=1,N2=2 kernelpagesize_KB=4 First is the start address of a VMA. This VMA could be much larger then 3 4k pages. It does not say which address in the VMA has the pages mapped. reading this file will not be restricted(i.e requiring CAP_SYS_ADMIN). So a prime motivator here is security restricted access to numa_maps? No it is the opposite. A regular user should be able to determine numa node information. That used to be the case until changes were made to the permissions for reading numa_maps.
Re: [RFC PATCH] Add /proc/<pid>/numa_vamaps for numa node information
On 5/4/18 4:12 AM, Michal Hocko wrote: On Thu 03-05-18 15:39:49, prakash.sangappa wrote: On 05/03/2018 11:03 AM, Christopher Lameter wrote: On Tue, 1 May 2018, Prakash Sangappa wrote: For analysis purpose it is useful to have numa node information corresponding mapped address ranges of the process. Currently /proc//numa_maps provides list of numa nodes from where pages are allocated per VMA of the process. This is not useful if an user needs to determine which numa node the mapped pages are allocated from for a particular address range. It would have helped if the numa node information presented in /proc//numa_maps was broken down by VA ranges showing the exact numa node from where the pages have been allocated. Cant you write a small script that scans the information in numa_maps and then displays the total pages per NUMA node and then a list of which ranges have how many pages on a particular node? Don't think we can determine which numa node a given user process address range has pages from, based on the existing 'numa_maps' file. yes we have. See move_pages... Sure using move_pages, not based on just 'numa_maps'. reading this file will not be restricted(i.e requiring CAP_SYS_ADMIN). So a prime motivator here is security restricted access to numa_maps? No it is the opposite. A regular user should be able to determine numa node information. Well, that breaks the layout randomization, doesn't it? Exposing numa node information itself should not break randomization right? It would be upto the application. In case of randomization, the application could generate address range traces of interest for debugging and then using numa node information one could determine where the memory is laid out for analysis.
[RFC PATCH] Add /proc/<pid>/numa_vamaps for numa node information
For analysis purposes it is useful to have numa node information corresponding to the mapped address ranges of a process. Currently /proc/<pid>/numa_maps provides a list of numa nodes from where pages are allocated, per VMA of the process. This is not useful if a user needs to determine which numa node the mapped pages are allocated from for a particular address range. It would have helped if the numa node information presented in /proc/<pid>/numa_maps was broken down by VA ranges showing the exact numa node from where the pages have been allocated. The format of the /proc/<pid>/numa_maps file content is dependent on the /proc/<pid>/maps file content, as mentioned in the manpage; i.e. one line entry for every VMA corresponding to entries in the /proc/<pid>/maps file. Therefore changing the output of /proc/<pid>/numa_maps may not be possible. Hence, this patch proposes adding the file /proc/<pid>/numa_vamaps, which will provide a proper breakdown of VA ranges by the numa node id from where the mapped pages are allocated. For address ranges not having any pages mapped, a '-' is printed instead of a numa node id. In addition, this file will include most of the other information currently presented in /proc/<pid>/numa_maps. The additional information is included for convenience. If this is not preferred, the patch could be modified to provide just the VA range to numa node information, as the rest of the information is already available through the /proc/<pid>/numa_maps file. Since the VA range to numa node information does not include the page's PFN, reading this file will not be restricted (i.e. will not require CAP_SYS_ADMIN). Here is a snippet from the new file content showing the format.
00400000-00401000 N0=1 kernelpagesize_kB=4 mapped=1 file=/tmp/hmap2
00600000-00601000 N0=1 kernelpagesize_kB=4 anon=1 dirty=1 file=/tmp/hmap2
00601000-00602000 N0=1 kernelpagesize_kB=4 anon=1 dirty=1 file=/tmp/hmap2
7f0215600000-7f0215800000 N0=1 kernelpagesize_kB=2048 dirty=1 file=/mnt/f1
7f0215800000-7f0215c00000 - file=/mnt/f1
7f0215c00000-7f0215e00000 N0=1 kernelpagesize_kB=2048 dirty=1 file=/mnt/f1
7f0215e00000-7f0216200000 - file=/mnt/f1
..
7f0217ecb000-7f0217f20000 N0=85 kernelpagesize_kB=4 mapped=85 mapmax=51 file=/usr/lib64/libc-2.17.so
7f0217f20000-7f0217f30000 - file=/usr/lib64/libc-2.17.so
7f0217f30000-7f0217f90000 N0=96 kernelpagesize_kB=4 mapped=96 mapmax=51 file=/usr/lib64/libc-2.17.so
7f0217f90000-7f0217fb0000 - file=/usr/lib64/libc-2.17.so
..

The 'pmap' command can be enhanced with an option to show NUMA node information, which it can read from this new proc file. That will be a follow-on proposal.

There have been a couple of previous patch proposals to provide NUMA node information based on PFN or physical address. They seem to have not made progress. Also, it would appear that reading NUMA node information based on PFN or physical address would require privileges (CAP_SYS_ADMIN), similar to reading PFN info from /proc/<pid>/pagemap.
See https://marc.info/?t=13963093821=1=2 https://marc.info/?t=13971872441=1=2

Signed-off-by: Prakash Sangappa <prakash.sanga...@oracle.com>
---
 fs/proc/base.c     |   2 +
 fs/proc/internal.h |   3 +
 fs/proc/task_mmu.c | 299 -
 3 files changed, 303 insertions(+), 1 deletion(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 1b2ede6..8fd7cc5 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2960,6 +2960,7 @@ static const struct pid_entry tgid_base_stuff[] = {
 	REG("maps", S_IRUGO, proc_pid_maps_operations),
 #ifdef CONFIG_NUMA
 	REG("numa_maps", S_IRUGO, proc_pid_numa_maps_operations),
+	REG("numa_vamaps", S_IRUGO, proc_pid_numa_vamaps_operations),
 #endif
 	REG("mem", S_IRUSR|S_IWUSR, proc_mem_operations),
 	LNK("cwd", proc_cwd_link),
@@ -3352,6 +3353,7 @@ static const struct pid_entry tid_base_stuff[] = {
 #endif
 #ifdef CONFIG_NUMA
 	REG("numa_maps", S_IRUGO, proc_tid_numa_maps_operations),
+	REG("numa_vamaps", S_IRUGO, proc_tid_numa_vamaps_operations),
 #endif
 	REG("mem", S_IRUSR|S_IWUSR, proc_mem_operations),
 	LNK("cwd", proc_cwd_link),
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 0f1692e..9a3ff80 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -273,6 +273,7 @@ struct proc_maps_private {
 #ifdef CONFIG_NUMA
 	struct mempolicy *task_mempolicy;
 #endif
+	u64 vma_off;
 } __randomize_layout;

 struct mm_struct *proc_mem_open(struct inode *inode, unsigned int mode);
@@ -280,7 +281,9 @@ struct mm_struct *proc_mem_open(struct inode *inode, unsigned int mode);
 extern const struct file_operations proc_pid_maps_operations;
 extern const struct file_operations proc_tid_maps_operations;
 extern const struct file_operations proc_pid_numa_maps_operations;
+extern const struct file_operations proc_pid_numa_vamaps_operations;
 extern const struct file_operations proc_tid_numa_maps
[PATCH] Hugetlb pages Rss accounting is incorrect in /proc/<pid>/smaps
Resident set size (Rss) accounting of hugetlb pages is currently not done in /proc/<pid>/smaps. The pmap command reads Rss from this file, so it shows Rss as 0 in 'pmap -x' output for hugetlb-mapped VMAs. This patch fixes it.

Signed-off-by: Prakash Sangappa <prakash.sanga...@oracle.com>
---
 fs/proc/task_mmu.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 5589b4b..c7e1048 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -724,6 +724,7 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
 			mss->shared_hugetlb += huge_page_size(hstate_vma(vma));
 		else
 			mss->private_hugetlb += huge_page_size(hstate_vma(vma));
+		mss->resident += huge_page_size(hstate_vma(vma));
 	}
 	return 0;
 }
--
2.7.4
Re: [PATCH v4] pidns: introduce syscall translate_pid
On 10/17/2017 3:40 PM, Andy Lutomirski wrote: On Tue, Oct 17, 2017 at 3:35 PM, prakash sangappa <prakash.sanga...@oracle.com> wrote: On 10/17/2017 3:02 PM, Andy Lutomirski wrote: On Tue, Oct 17, 2017 at 8:38 AM, Prakash Sangappa <prakash.sanga...@oracle.com> wrote: On 10/16/17 5:52 PM, Andy Lutomirski wrote: On Mon, Oct 16, 2017 at 3:54 PM, prakash.sangappa <prakash.sanga...@oracle.com> wrote: On 10/16/2017 03:07 PM, Nagarathnam Muthusamy wrote: On 10/16/2017 02:36 PM, Andrew Morton wrote: On Sat, 14 Oct 2017 11:17:47 +0300 Konstantin Khlebnikov <khlebni...@yandex-team.ru> wrote:

pid_t translate_pid(pid_t pid, int source, int target);

This syscall converts a pid from the source pid namespace into a pid in the target pid namespace. If the pid is unreachable from the target pid namespace it returns zero. Pid namespaces are referred to by file descriptors opened on the proc files /proc/[pid]/ns/pid or /proc/[pid]/ns/pid_for_children. A negative argument refers to the current pid namespace, the same as the file /proc/self/ns/pid. The kernel exposes virtual pids in /proc/[pid]/status:NSpid, but backward translation requires scanning all tasks. Pids can also be translated by sending them through a unix socket between namespaces, but this method is slow and insecure because the other side is exposed inside the pid namespace.

Andrew asked why we might need this. Such conversion is required for interaction between processes across pid namespaces, for example to identify a process in a container by its pid file when looking from outside. Two years ago I solved this in a project of mine with monstrous code which forks a couple of times just to convert a pid; luckily for me, performance wasn't important.

That's a single user who needed this a single time, and found a userspace-based solution anyway. This is not exactly compelling! Is there a stronger case to be made? How does this change benefit our users? Sell it to us!
Oracle Database is planning to use pid namespaces for sandboxing database instances and needs an API similar to translate_pid to efficiently translate process IDs from other pid namespaces. Prakash (cc'ed in mail) can provide more details on this use case.

As Nagarathnam indicated, Oracle Database will be using pid namespaces and needs a direct method of converting pids of processes in the pid namespace hierarchy. In this use case multiple nested pid namespaces will be used. The currently available mechanisms are not very efficient for this use case. For example, as Konstantin described, using /proc/<pid>/status would require the application to scan all the pids' status files to determine the pid of a given process in a child namespace. Use of the SCM_CREDENTIALS socket message is another way, which would require every process starting inside a pid namespace to send this message, and the receiving process in the target namespace would have to save the converted pid and reference it. This mechanism becomes cumbersome, especially if the application has to deal with multiple nested pid namespaces. Also, the database needs to be able to convert a thread's global pid (gettid()). Passing the thread's pid (gettid()) in an SCM_CREDENTIALS message requires CAP_SYS_ADMIN, which is an issue.

So having a direct method, like the API that Konstantin is proposing, will work best for the database, since the pid of a process in any of the nested pid namespaces can be converted as and when required. I think that with the proposed API the application should be able to convert the pid of a process or the tid (gettid()) of a thread as well.

Can you explain what Oracle's database is planning to do with this information?

The database uses the pid to programmatically find out if the process/thread is alive (kill 0), to send signals to the processes requesting them to dump status/debug information, and to kill the processes in case of a shutdown abort of the instance.
What I'm wondering is: how does the caller of kill() end up controlling a task whose pid it doesn't know in its own namespace?

I was generally describing how the DB would use the pid of a process. The above description was for the case when no namespaces are used. With the use of namespaces, the DB would convert the pid of processes inside its child namespaces to the pid in its own namespace and use that pid to issue kill().

Seems vaguely sensible. If I were designing this type of system, I'd have a manager process in each namespace running as PID 1, though -- PID 1 is special and needs to understand what's going on anyway. Then PID 1 would do the kill() calls and wouldn't need translate_pid().

Yes, this has been tried out with the prototype use of pid namespaces in the DB. It works, but it would be slow, as the manager would have to exchange messages with the controlling processes, which would be in the parent namespace. The DB could use the API to convert the pid directly.
Re: [PATCH v4] pidns: introduce syscall translate_pid
On 10/17/2017 3:02 PM, Andy Lutomirski wrote: On Tue, Oct 17, 2017 at 8:38 AM, Prakash Sangappa <prakash.sanga...@oracle.com> wrote: On 10/16/17 5:52 PM, Andy Lutomirski wrote: On Mon, Oct 16, 2017 at 3:54 PM, prakash.sangappa <prakash.sanga...@oracle.com> wrote: On 10/16/2017 03:07 PM, Nagarathnam Muthusamy wrote: On 10/16/2017 02:36 PM, Andrew Morton wrote: On Sat, 14 Oct 2017 11:17:47 +0300 Konstantin Khlebnikov <khlebni...@yandex-team.ru> wrote:

pid_t translate_pid(pid_t pid, int source, int target);

This syscall converts a pid from the source pid namespace into a pid in the target pid namespace. If the pid is unreachable from the target pid namespace it returns zero. Pid namespaces are referred to by file descriptors opened on the proc files /proc/[pid]/ns/pid or /proc/[pid]/ns/pid_for_children. A negative argument refers to the current pid namespace, the same as the file /proc/self/ns/pid. The kernel exposes virtual pids in /proc/[pid]/status:NSpid, but backward translation requires scanning all tasks. Pids can also be translated by sending them through a unix socket between namespaces, but this method is slow and insecure because the other side is exposed inside the pid namespace.

Andrew asked why we might need this. Such conversion is required for interaction between processes across pid namespaces, for example to identify a process in a container by its pid file when looking from outside. Two years ago I solved this in a project of mine with monstrous code which forks a couple of times just to convert a pid; luckily for me, performance wasn't important.

That's a single user who needed this a single time, and found a userspace-based solution anyway. This is not exactly compelling! Is there a stronger case to be made? How does this change benefit our users? Sell it to us!

Oracle Database is planning to use pid namespaces for sandboxing database instances and needs an API similar to translate_pid to efficiently translate process IDs from other pid namespaces. Prakash (cc'ed in mail) can provide more details on this use case.
As Nagarathnam indicated, Oracle Database will be using pid namespaces and needs a direct method of converting pids of processes in the pid namespace hierarchy. In this use case multiple nested pid namespaces will be used. The currently available mechanisms are not very efficient for this use case. For example, as Konstantin described, using /proc/<pid>/status would require the application to scan all the pids' status files to determine the pid of a given process in a child namespace. Use of the SCM_CREDENTIALS socket message is another way, which would require every process starting inside a pid namespace to send this message, and the receiving process in the target namespace would have to save the converted pid and reference it. This mechanism becomes cumbersome, especially if the application has to deal with multiple nested pid namespaces. Also, the database needs to be able to convert a thread's global pid (gettid()). Passing the thread's pid (gettid()) in an SCM_CREDENTIALS message requires CAP_SYS_ADMIN, which is an issue.

So having a direct method, like the API that Konstantin is proposing, will work best for the database, since the pid of a process in any of the nested pid namespaces can be converted as and when required. I think that with the proposed API the application should be able to convert the pid of a process or the tid (gettid()) of a thread as well.

Can you explain what Oracle's database is planning to do with this information?

The database uses the pid to programmatically find out if the process/thread is alive (kill 0), to send signals to the processes requesting them to dump status/debug information, and to kill the processes in case of a shutdown abort of the instance.

What I'm wondering is: how does the caller of kill() end up controlling a task whose pid it doesn't know in its own namespace?

I was generally describing how the DB would use the pid of a process. The above description was for the case when no namespaces are used.
With the use of namespaces, the DB would convert the pid of processes inside its child namespaces to the pid in its own namespace and use that pid to issue kill(). -Prakash.
Re: [PATCH v4] pidns: introduce syscall translate_pid
On 10/16/17 5:52 PM, Andy Lutomirski wrote: On Mon, Oct 16, 2017 at 3:54 PM, prakash.sangappa wrote: On 10/16/2017 03:07 PM, Nagarathnam Muthusamy wrote: On 10/16/2017 02:36 PM, Andrew Morton wrote: On Sat, 14 Oct 2017 11:17:47 +0300 Konstantin Khlebnikov wrote:

pid_t translate_pid(pid_t pid, int source, int target);

This syscall converts a pid from the source pid namespace into a pid in the target pid namespace. If the pid is unreachable from the target pid namespace it returns zero. Pid namespaces are referred to by file descriptors opened on the proc files /proc/[pid]/ns/pid or /proc/[pid]/ns/pid_for_children. A negative argument refers to the current pid namespace, the same as the file /proc/self/ns/pid. The kernel exposes virtual pids in /proc/[pid]/status:NSpid, but backward translation requires scanning all tasks. Pids can also be translated by sending them through a unix socket between namespaces, but this method is slow and insecure because the other side is exposed inside the pid namespace.

Andrew asked why we might need this. Such conversion is required for interaction between processes across pid namespaces, for example to identify a process in a container by its pid file when looking from outside. Two years ago I solved this in a project of mine with monstrous code which forks a couple of times just to convert a pid; luckily for me, performance wasn't important.

That's a single user who needed this a single time, and found a userspace-based solution anyway. This is not exactly compelling! Is there a stronger case to be made? How does this change benefit our users? Sell it to us!

Oracle Database is planning to use pid namespaces for sandboxing database instances and needs an API similar to translate_pid to efficiently translate process IDs from other pid namespaces. Prakash (cc'ed in mail) can provide more details on this use case.

As Nagarathnam indicated, Oracle Database will be using pid namespaces and needs a direct method of converting pids of processes in the pid namespace hierarchy. In this use case multiple nested pid namespaces will be used.
The currently available mechanisms are not very efficient for this use case. For example, as Konstantin described, using /proc/<pid>/status would require the application to scan all the pids' status files to determine the pid of a given process in a child namespace. Use of the SCM_CREDENTIALS socket message is another way, which would require every process starting inside a pid namespace to send this message, and the receiving process in the target namespace would have to save the converted pid and reference it. This mechanism becomes cumbersome, especially if the application has to deal with multiple nested pid namespaces. Also, the database needs to be able to convert a thread's global pid (gettid()). Passing the thread's pid (gettid()) in an SCM_CREDENTIALS message requires CAP_SYS_ADMIN, which is an issue.

So having a direct method, like the API that Konstantin is proposing, will work best for the database, since the pid of a process in any of the nested pid namespaces can be converted as and when required. I think that with the proposed API the application should be able to convert the pid of a process or the tid (gettid()) of a thread as well.

Can you explain what Oracle's database is planning to do with this information?

The database uses the pid to programmatically find out if the process/thread is alive (kill 0), to send signals to the processes requesting them to dump status/debug information, and to kill the processes in case of a shutdown abort of the instance. -Prakash.
Re: [PATCH v4] pidns: introduce syscall translate_pid
On 10/16/17 5:52 PM, Andy Lutomirski wrote: On Mon, Oct 16, 2017 at 3:54 PM, prakash.sangappa wrote: On 10/16/2017 03:07 PM, Nagarathnam Muthusamy wrote: On 10/16/2017 02:36 PM, Andrew Morton wrote: On Sat, 14 Oct 2017 11:17:47 +0300 Konstantin Khlebnikov wrote: pid_t translate_pid(pid_t pid, int source, int target); This syscall converts pid from source pid-ns into pid in target pid-ns. If pid is unreachable from target pid-ns it returns zero. Pid-namespaces are referred file descriptors opened to proc files /proc/[pid]/ns/pid or /proc/[pid]/ns/pid_for_children. Negative argument refers to current pid namespace, same as file /proc/self/ns/pid. Kernel expose virtual pids in /proc/[pid]/status:NSpid, but backward translation requires scanning all tasks. Also pids could be translated by sending them through unix socket between namespaces, this method is slow and insecure because other side is exposed inside pid namespace. Andrew asked why we might need this. Such conversion is required for interaction between processes across pid-namespaces. For example to identify process in container by pid file looking from outside. Two years ago I've solved this in project of mine with monstrous code which forks couple times just to convert pid, lucky for me performance wasn't important. That's a single user who needed this a single time, and found a userspace-based solution anyway. This is not exactly compelling! Is there a stronger case to be made? How does this change benefit our users? Sell it to us! Oracle database is planning to use pid namespace for sandboxing database instances and they need an API similar to translate_pid to effectively translate process IDs from other pid namespaces. Prakash (cced in mail) can provide more details on this usecase. As Nagarathnam indicated, Oracle Database will be using pid namespaces and needs a direct method of converting pids of processes in the pid namespace hierarchy. In this use case multiple nested PID namespaces will be used. 
The currently available mechanisms are not very efficient for this use case. For example, as Konstantin described, using /proc//status would require the application to scan all the pids' status files to determine the pid of a given process in a child namespace. Use of the SCM_CREDENTIALS socket message is another way, which would require every process starting inside a pid namespace to send this message, and the receiving process in the target namespace would have to save the converted pid and reference it. This mechanism becomes cumbersome, especially if the application has to deal with multiple nested pid namespaces. Also, the Database needs to be able to convert a thread's global pid (gettid()). Passing the thread's pid (gettid()) in an SCM_CREDENTIALS message requires CAP_SYS_ADMIN, which is an issue. So having a direct method, like the API that Konstantin is proposing, will work best for the Database, since the pid of a process in any of the nested pid namespaces can be converted as and when required. I think with the proposed API, the application should be able to convert the pid of a process or the tid (gettid()) of a thread as well. Can you explain what Oracle's database is planning to do with this information? The database uses the PID to programmatically find out whether the process/thread is alive (kill 0), to send signals to the processes requesting them to dump status/debug information, and to kill the processes in case of a shutdown abort of the instance. -Prakash.
[PATCH v2] Userfaultfd: Add description for UFFD_FEATURE_SIGBUS
Userfaultfd feature UFFD_FEATURE_SIGBUS was merged recently and should be available in the Linux 4.14 release. This patch is for the manpage changes documenting this API. Documents the following commit: commit 2d6d6f5a09a96cc1fec7ed992b825e05f64cb50e Author: Prakash Sangappa <prakash.sanga...@oracle.com> Date: Wed Sep 6 16:23:39 2017 -0700 mm: userfaultfd: add feature to request for a signal delivery Signed-off-by: Prakash Sangappa <prakash.sanga...@oracle.com> --- v2: Incorporated review feedback changes. --- man2/ioctl_userfaultfd.2 | 9 + man2/userfaultfd.2 | 23 +++ 2 files changed, 32 insertions(+) diff --git a/man2/ioctl_userfaultfd.2 b/man2/ioctl_userfaultfd.2 index 60fd29b..32f0744 100644 --- a/man2/ioctl_userfaultfd.2 +++ b/man2/ioctl_userfaultfd.2 @@ -196,6 +196,15 @@ with the flag set, .BR memfd_create (2), and so on. +.TP +.B UFFD_FEATURE_SIGBUS +Since Linux 4.14, if this feature bit is set, no page-fault events +.B (UFFD_EVENT_PAGEFAULT) +will be delivered; instead, a +.B SIGBUS +signal will be sent to the faulting process. Applications using this +feature will not require the use of a userfaultfd monitor for processing +memory accesses to the regions registered with userfaultfd. .IP The returned .I ioctls diff --git a/man2/userfaultfd.2 b/man2/userfaultfd.2 index 1741ee3..3c5b9c0 100644 --- a/man2/userfaultfd.2 +++ b/man2/userfaultfd.2 @@ -172,6 +172,29 @@ or .BR ioctl (2) operations to resolve the page fault. .PP +Starting from Linux 4.14, if the application sets the +.B UFFD_FEATURE_SIGBUS +feature bit using the +.B UFFDIO_API +.BR ioctl (2), +no page fault notification will be forwarded to +user space; instead, a +.B SIGBUS +signal is delivered to the faulting process. With this feature, +userfaultfd can be used for robustness purposes to simply catch +any access to areas within the registered address range that do not +have pages allocated, without having to listen to userfaultfd events. 
+No userfaultfd monitor will be required for dealing with such memory +accesses. For example, this feature can be useful for applications that +want to prevent the kernel from automatically allocating pages and filling +holes in sparse files when the hole is accessed through a mapped address. +.PP +The +.B UFFD_FEATURE_SIGBUS +feature is implicitly inherited through fork() if used in combination with +.BR UFFD_FEATURE_FORK . + +.PP Details of the various .BR ioctl (2) operations can be found in -- 2.7.4
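A minimal sketch of enabling the feature this man-page patch documents: open a userfaultfd and request UFFD_FEATURE_SIGBUS through the UFFDIO_API handshake. This is an illustrative sketch, not code from the patch; note that on many systems creating a userfaultfd requires privilege (or vm.unprivileged_userfaultfd=1), so the sketch reports failure rather than assuming success.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Older uapi headers may predate the feature; the value matches the
 * kernel patch's definition. */
#ifndef UFFD_FEATURE_SIGBUS
#define UFFD_FEATURE_SIGBUS (1 << 7)
#endif

/* Open a userfaultfd with UFFD_FEATURE_SIGBUS enabled.
 * Returns the fd, or -1 if the syscall or the feature is unavailable. */
static int uffd_open_sigbus(void)
{
	int fd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	if (fd < 0)
		return -1;	/* e.g. EPERM when unprivileged uffd is disabled */

	struct uffdio_api api;
	memset(&api, 0, sizeof(api));
	api.api = UFFD_API;
	api.features = UFFD_FEATURE_SIGBUS;	/* SIGBUS instead of events */
	if (ioctl(fd, UFFDIO_API, &api) == -1 ||
	    !(api.features & UFFD_FEATURE_SIGBUS)) {
		close(fd);
		return -1;	/* kernel older than 4.14, or feature refused */
	}
	return fd;
}
```

Regions subsequently registered on this fd with UFFDIO_REGISTER raise SIGBUS on a missing-page fault instead of queuing UFFD_EVENT_PAGEFAULT, so no monitor thread is needed.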
Re: [PATCH] Userfaultfd: Add description for UFFD_FEATURE_SIGBUS
cc: Andrea Arcangeli On 10/6/17 7:52 PM, Prakash Sangappa wrote: Userfaultfd feature UFFD_FEATURE_SIGBUS was merged recently and should be available in the Linux 4.14 release. This patch is for the manpage changes documenting this API. Documents the following commit: commit 2d6d6f5a09a96cc1fec7ed992b825e05f64cb50e Author: Prakash Sangappa <prakash.sanga...@oracle.com> Date: Wed Sep 6 16:23:39 2017 -0700 mm: userfaultfd: add feature to request for a signal delivery Signed-off-by: Prakash Sangappa <prakash.sanga...@oracle.com> --- man2/ioctl_userfaultfd.2 | 9 + man2/userfaultfd.2 | 17 + 2 files changed, 26 insertions(+) diff --git a/man2/ioctl_userfaultfd.2 b/man2/ioctl_userfaultfd.2 index 60fd29b..cfc65ae 100644 --- a/man2/ioctl_userfaultfd.2 +++ b/man2/ioctl_userfaultfd.2 @@ -196,6 +196,15 @@ with the flag set, .BR memfd_create (2), and so on. +.TP +.B UFFD_FEATURE_SIGBUS +Since Linux 4.14, if this feature bit is set, no page-fault events( +.B UFFD_EVENT_PAGEFAULT +) will be delivered; instead, a +.B SIGBUS +signal will be sent to the faulting process. Applications using this +feature will not require the use of a userfaultfd monitor for handling +page-fault events. .IP The returned .I ioctls diff --git a/man2/userfaultfd.2 b/man2/userfaultfd.2 index 1741ee3..a033742 100644 --- a/man2/userfaultfd.2 +++ b/man2/userfaultfd.2 @@ -172,6 +172,23 @@ or .BR ioctl (2) operations to resolve the page fault. .PP +Starting from Linux 4.14, if the application sets the +.B UFFD_FEATURE_SIGBUS +feature bit using the +.B UFFDIO_API +.BR ioctl (2) +, no page fault notification will be forwarded to +user space; instead, a +.B SIGBUS +signal is delivered to the faulting process. With this feature, +userfaultfd can be used for robustness purposes to simply catch +any access to areas within the registered address range that do not +have pages allocated, without having to deal with page-fault events. +No userfaultfd monitor will be required for handling page faults. 
For +example, this feature can be useful for applications that want to +prevent the kernel from automatically allocating pages and filling +holes in sparse files when the hole is accessed through a mapped address. +.PP Details of the various .BR ioctl (2) operations can be found in
[PATCH] Userfaultfd: Add description for UFFD_FEATURE_SIGBUS
Userfaultfd feature UFFD_FEATURE_SIGBUS was merged recently and should be available in the Linux 4.14 release. This patch is for the manpage changes documenting this API. Documents the following commit: commit 2d6d6f5a09a96cc1fec7ed992b825e05f64cb50e Author: Prakash Sangappa <prakash.sanga...@oracle.com> Date: Wed Sep 6 16:23:39 2017 -0700 mm: userfaultfd: add feature to request for a signal delivery Signed-off-by: Prakash Sangappa <prakash.sanga...@oracle.com> --- man2/ioctl_userfaultfd.2 | 9 + man2/userfaultfd.2 | 17 + 2 files changed, 26 insertions(+) diff --git a/man2/ioctl_userfaultfd.2 b/man2/ioctl_userfaultfd.2 index 60fd29b..cfc65ae 100644 --- a/man2/ioctl_userfaultfd.2 +++ b/man2/ioctl_userfaultfd.2 @@ -196,6 +196,15 @@ with the flag set, .BR memfd_create (2), and so on. +.TP +.B UFFD_FEATURE_SIGBUS +Since Linux 4.14, if this feature bit is set, no page-fault events( +.B UFFD_EVENT_PAGEFAULT +) will be delivered; instead, a +.B SIGBUS +signal will be sent to the faulting process. Applications using this +feature will not require the use of a userfaultfd monitor for handling +page-fault events. .IP The returned .I ioctls diff --git a/man2/userfaultfd.2 b/man2/userfaultfd.2 index 1741ee3..a033742 100644 --- a/man2/userfaultfd.2 +++ b/man2/userfaultfd.2 @@ -172,6 +172,23 @@ or .BR ioctl (2) operations to resolve the page fault. .PP +Starting from Linux 4.14, if the application sets the +.B UFFD_FEATURE_SIGBUS +feature bit using the +.B UFFDIO_API +.BR ioctl (2) +, no page fault notification will be forwarded to +user space; instead, a +.B SIGBUS +signal is delivered to the faulting process. With this feature, +userfaultfd can be used for robustness purposes to simply catch +any access to areas within the registered address range that do not +have pages allocated, without having to deal with page-fault events. +No userfaultfd monitor will be required for handling page faults. 
For +example, this feature can be useful for applications that want to +prevent the kernel from automatically allocating pages and filling +holes in sparse files when the hole is accessed through a mapped address. +.PP Details of the various .BR ioctl (2) operations can be found in -- 2.7.4
Re: [RESEND PATCH] Allow passing tid or pid in SCM_CREDENTIALS without CAP_SYS_ADMIN
On 8/30/17 10:41 AM, ebied...@xmission.com wrote: Prakash Sangappa <prakash.sanga...@oracle.com> writes: With regards to security, the question basically is what is the consequence of passing the wrong id. As I understand it, interpreting the id to be pid or tid, the effective uid and gid will be the same. It would be a problem only if the incorrect interpretation of the id would refer to a different process. But that cannot happen, as the global tid (gettid()) of a thread is unique. There is also the issue that the receiving process could look, not see the pid in proc and assume the sending process is dead. That I suspect is the larger danger. Will this not be a bug in the application, if it is sending the wrong id? As long as the thread is alive, that id cannot reference another process / thread. Unless the thread were to exit and the id gets recycled and used for another thread or process. This would be no different from a process exiting and its pid getting recycled, which is the case now. Largely I agree. If all you want are pid translations I suspect there are far easier ways than updating the SCM_CREDENTIALS code. What would be another easier and more efficient way of doing pid translation? Should a new API/mechanism be considered mainly for pid translation purposes for use with pid namespaces, say based on 'pipe', something similar to I_SENDFD? Thanks, -Prakash. Eric
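For reference, the baseline mechanism under discussion looks roughly like this: a process sends its own credentials as an explicit SCM_CREDENTIALS control message over an AF_UNIX socket, and the receiver (with SO_PASSCRED enabled) reads back the kernel-checked struct ucred, with the pid translated into the receiver's pid namespace. It is the sendmsg() step that rejects a bare tid with EPERM without CAP_SYS_ADMIN. This is an illustrative sketch using a socketpair within a single process.

```c
#define _GNU_SOURCE		/* for struct ucred, SCM_CREDENTIALS */
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send our own (pid, uid, gid) as SCM_CREDENTIALS over a socketpair and
 * read the kernel-verified struct ucred back. Returns 0 on success. */
static int exchange_creds(struct ucred *out)
{
	int sv[2], on = 1;
	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		return -1;
	if (setsockopt(sv[1], SOL_SOCKET, SO_PASSCRED, &on, sizeof(on)) < 0)
		return -1;

	/* Sender: passing our tgid needs no privilege; passing a bare tid
	 * is what currently requires CAP_SYS_ADMIN. */
	struct ucred creds = { .pid = getpid(), .uid = getuid(), .gid = getgid() };
	char byte = 'x';
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union { char buf[CMSG_SPACE(sizeof(struct ucred))]; struct cmsghdr align; } cbuf;
	struct msghdr msg = { 0 };
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = cbuf.buf;
	msg.msg_controllen = sizeof(cbuf.buf);
	struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_CREDENTIALS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(struct ucred));
	memcpy(CMSG_DATA(cmsg), &creds, sizeof(creds));
	if (sendmsg(sv[0], &msg, 0) != 1)
		return -1;

	/* Receiver: the kernel translates the pid into our pid namespace. */
	struct msghdr rmsg = { 0 };
	rmsg.msg_iov = &iov;
	rmsg.msg_iovlen = 1;
	rmsg.msg_control = cbuf.buf;
	rmsg.msg_controllen = sizeof(cbuf.buf);
	if (recvmsg(sv[1], &rmsg, 0) != 1)
		return -1;
	for (cmsg = CMSG_FIRSTHDR(&rmsg); cmsg; cmsg = CMSG_NXTHDR(&rmsg, cmsg))
		if (cmsg->cmsg_level == SOL_SOCKET &&
		    cmsg->cmsg_type == SCM_CREDENTIALS) {
			memcpy(out, CMSG_DATA(cmsg), sizeof(*out));
			close(sv[0]);
			close(sv[1]);
			return 0;
		}
	return -1;
}
```

The cumbersome part described in the thread is that every process inside the namespace must do this once so the outside receiver can record its translated pid.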
[RESEND PATCH] Allow passing tid or pid in SCM_CREDENTIALS without CAP_SYS_ADMIN
Currently, passing the tid (gettid(2)) of a thread in struct ucred in an SCM_CREDENTIALS message requires the CAP_SYS_ADMIN capability; otherwise it fails with an EPERM error. Some applications deal with the thread id of a thread (tid), so it would help to allow a tid in the SCM_CREDENTIALS message. Basically, either the tgid (pid of the process) or the tid of the thread should be allowed without the need for the CAP_SYS_ADMIN capability. SCM_CREDENTIALS will be used to determine the global id of a process or a thread running inside a pid namespace. This patch adds the necessary check to accept a tid in the SCM_CREDENTIALS struct ucred. Signed-off-by: Prakash Sangappa <prakash.sanga...@oracle.com> --- net/core/scm.c | 1 + 1 file changed, 1 insertion(+) diff --git a/net/core/scm.c b/net/core/scm.c index b1ff8a4..9274197 100644 --- a/net/core/scm.c +++ b/net/core/scm.c @@ -55,6 +55,7 @@ static __inline__ int scm_check_creds(struct ucred *creds) return -EINVAL; if ((creds->pid == task_tgid_vnr(current) || +creds->pid == task_pid_vnr(current) || ns_capable(task_active_pid_ns(current)->user_ns, CAP_SYS_ADMIN)) && ((uid_eq(uid, cred->uid) || uid_eq(uid, cred->euid) || uid_eq(uid, cred->suid)) || ns_capable(cred->user_ns, CAP_SETUID)) && -- 2.7.4
[PATCH] Allow passing tid or pid in SCM_CREDENTIALS without CAP_SYS_ADMIN
Currently, passing the tid (gettid(2)) of a thread in struct ucred in an SCM_CREDENTIALS message requires the CAP_SYS_ADMIN capability; otherwise it fails with an EPERM error. Some applications deal with the thread id of a thread (tid), so it would help to allow a tid in the SCM_CREDENTIALS message. Basically, either the tgid (pid of the process) or the tid of the thread should be allowed without the need for the CAP_SYS_ADMIN capability. SCM_CREDENTIALS will be used to determine the global id of a process or a thread running inside a pid namespace. This patch adds the necessary check to accept a tid in the SCM_CREDENTIALS struct ucred. Signed-off-by: Prakash Sangappa <prakash.sanga...@oracle.com> --- net/core/scm.c | 1 + 1 file changed, 1 insertion(+) diff --git a/net/core/scm.c b/net/core/scm.c index b1ff8a4..9274197 100644 --- a/net/core/scm.c +++ b/net/core/scm.c @@ -55,6 +55,7 @@ static __inline__ int scm_check_creds(struct ucred *creds) return -EINVAL; if ((creds->pid == task_tgid_vnr(current) || +creds->pid == task_pid_vnr(current) || ns_capable(task_active_pid_ns(current)->user_ns, CAP_SYS_ADMIN)) && ((uid_eq(uid, cred->uid) || uid_eq(uid, cred->euid) || uid_eq(uid, cred->suid)) || ns_capable(cred->user_ns, CAP_SETUID)) && -- 2.7.4
[RESEND PATCH v3 2/2] userfaultfd: selftest: Add tests for UFFD_FEATURE_SIGBUS feature
This patch adds tests for UFFD_FEATURE_SIGBUS feature. The tests will verify signal delivery instead of userfault events. Also, test use of UFFDIO_COPY to allocate memory and retry accessing monitored area after signal delivery. This patch also fixes a bug in uffd_poll_thread() where 'uffd' is leaked. Signed-off-by: Prakash Sangappa <prakash.sanga...@oracle.com> --- Change log v3: Eliminated use of sig_repeat variable and simplified error return. v2: - Added comments to explain the tests. - Fixed test to fail immediately if signal repeats. - Addressed other review comments. v1: https://lkml.org/lkml/2017/7/26/101 --- tools/testing/selftests/vm/userfaultfd.c | 127 +- 1 files changed, 124 insertions(+), 3 deletions(-) diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c index 1eae79a..52740ae 100644 --- a/tools/testing/selftests/vm/userfaultfd.c +++ b/tools/testing/selftests/vm/userfaultfd.c @@ -66,6 +66,7 @@ #include #include #include +#include #ifdef __NR_userfaultfd @@ -408,6 +409,7 @@ static int copy_page(int ufd, unsigned long offset) userfaults++; break; case UFFD_EVENT_FORK: + close(uffd); uffd = msg.arg.fork.ufd; pollfd[0].fd = uffd; break; @@ -572,6 +574,17 @@ static int userfaultfd_open(int features) return 0; } +sigjmp_buf jbuf, *sigbuf; + +static void sighndl(int sig, siginfo_t *siginfo, void *ptr) +{ + if (sig == SIGBUS) { + if (sigbuf) + siglongjmp(*sigbuf, 1); + abort(); + } +} + /* * For non-cooperative userfaultfd test we fork() a process that will * generate pagefaults, will mremap the area monitored by the @@ -585,19 +598,59 @@ static int userfaultfd_open(int features) * The release of the pages currently generates event for shmem and * anonymous memory (UFFD_EVENT_REMOVE), hence it is not checked * for hugetlb. + * For signal test(UFFD_FEATURE_SIGBUS), signal_test = 1, we register + * monitored area, generate pagefaults and test that signal is delivered. 
+ * Use UFFDIO_COPY to allocate missing page and retry. For signal_test = 2 + * test robustness use case - we release monitored area, fork a process + * that will generate pagefaults and verify signal is generated. + * This also tests UFFD_FEATURE_EVENT_FORK event along with the signal + * feature. Using monitor thread, verify no userfault events are generated. */ -static int faulting_process(void) +static int faulting_process(int signal_test) { unsigned long nr; unsigned long long count; unsigned long split_nr_pages; + unsigned long lastnr; + struct sigaction act; + unsigned long signalled = 0; if (test_type != TEST_HUGETLB) split_nr_pages = (nr_pages + 1) / 2; else split_nr_pages = nr_pages; + if (signal_test) { + sigbuf = &jbuf; + memset(&act, 0, sizeof(act)); + act.sa_sigaction = sighndl; + act.sa_flags = SA_SIGINFO; + if (sigaction(SIGBUS, &act, 0)) { + perror("sigaction"); + return 1; + } + lastnr = (unsigned long)-1; + } + for (nr = 0; nr < split_nr_pages; nr++) { + if (signal_test) { + if (sigsetjmp(*sigbuf, 1) != 0) { + if (nr == lastnr) { + fprintf(stderr, "Signal repeated\n"); + return 1; + } + + lastnr = nr; + if (signal_test == 1) { + if (copy_page(uffd, nr * page_size)) + signalled++; + } else { + signalled++; + continue; + } + } + } + count = *area_count(area_dst, nr); if (count != count_verify[nr]) { fprintf(stderr, @@ -607,6 +660,9 @@ static int faulting_process(void) } } + if (signal_test) + return signalled != split_nr_pages; + if (test_type == TEST_HUGETLB) return 0; @@ -761,7 +817,7 @@ static int userfaultfd_events_test(void) perror("fork"), exit(1); if (!pid) - return faulting_process(0); waitpid(pid, &err, 0); if (err) @@ -778,6 +834,70 @@ static int userfaultfd_events_test(void) return userfaults != nr_pages; } +sta
[RESEND PATCH v3 1/2] userfaultfd: Add feature to request for a signal delivery
In some cases, the userfaultfd mechanism should just deliver a SIGBUS signal to the faulting process, instead of the page-fault event. Dealing with page-fault events using a monitor thread can be an overhead in these cases. For example, applications like the database could use the signaling mechanism for robustness purposes. The database uses hugetlbfs for performance reasons. Files on the hugetlbfs filesystem are created and huge pages allocated using the fallocate() API. Pages are deallocated/freed using fallocate() hole punching support. These files are mmapped and accessed by many processes as shared memory. The database keeps track of which offsets in the hugetlbfs file have pages allocated. Any access to a mapped address over holes in the file, which can occur due to bugs in the application, is considered invalid, and the expectation is that the process simply receives a SIGBUS. However, currently when a hole in the file is accessed via the mapped address, kernel/mm attempts to automatically allocate a page at page fault time, resulting in implicitly filling the hole in the file. This may not be the desired behavior for applications like the database that want to explicitly manage page allocations of hugetlbfs files. Using the userfaultfd mechanism with this support to get a signal, the database application can prevent pages from being allocated implicitly when processes access mapped addresses over holes in the file. This patch adds the UFFD_FEATURE_SIGBUS feature to the userfaultfd mechanism to request delivery of a SIGBUS signal. See the following for the previous discussion about the database requirement leading to this proposal, as suggested by Andrea. 
http://www.spinics.net/lists/linux-mm/msg129224.html

Signed-off-by: Prakash Sangappa <prakash.sanga...@oracle.com>
Reviewed-by: Mike Rapoport <r...@linux.vnet.ibm.com>
Reviewed-by: Andrea Arcangeli <aarca...@redhat.com>
---
 fs/userfaultfd.c                 |    3 +++
 include/uapi/linux/userfaultfd.h |   10 +++++++++-
 2 files changed, 12 insertions(+), 1 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 1d622f2..0bbe7df 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -371,6 +371,9 @@ int handle_userfault(struct vm_fault *vmf, unsigned long reason)
 	VM_BUG_ON(reason & ~(VM_UFFD_MISSING|VM_UFFD_WP));
 	VM_BUG_ON(!(reason & VM_UFFD_MISSING) ^ !!(reason & VM_UFFD_WP));
 
+	if (ctx->features & UFFD_FEATURE_SIGBUS)
+		goto out;
+
 	/*
 	 * If it's already released don't get it. This avoids to loop
 	 * in __get_user_pages if userfaultfd_release waits on the
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 3b05953..d39d5db 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -23,7 +23,8 @@
 			   UFFD_FEATURE_EVENT_REMOVE |	\
 			   UFFD_FEATURE_EVENT_UNMAP |		\
 			   UFFD_FEATURE_MISSING_HUGETLBFS |	\
-			   UFFD_FEATURE_MISSING_SHMEM)
+			   UFFD_FEATURE_MISSING_SHMEM |		\
+			   UFFD_FEATURE_SIGBUS)
 #define UFFD_API_IOCTLS				\
 	((__u64)1 << _UFFDIO_REGISTER |		\
 	 (__u64)1 << _UFFDIO_UNREGISTER |	\
@@ -153,6 +154,12 @@ struct uffdio_api {
 	 * UFFD_FEATURE_MISSING_SHMEM works the same as
 	 * UFFD_FEATURE_MISSING_HUGETLBFS, but it applies to shmem
 	 * (i.e. tmpfs and other shmem based APIs).
+	 *
+	 * UFFD_FEATURE_SIGBUS feature means no page-fault
+	 * (UFFD_EVENT_PAGEFAULT) event will be delivered, instead
+	 * a SIGBUS signal will be sent to the faulting process.
+	 * The application process can enable this behavior by adding
+	 * it to uffdio_api.features.
 	 */
 #define UFFD_FEATURE_PAGEFAULT_FLAG_WP		(1<<0)
 #define UFFD_FEATURE_EVENT_FORK			(1<<1)
@@ -161,6 +168,7 @@ struct uffdio_api {
 #define UFFD_FEATURE_MISSING_HUGETLBFS		(1<<4)
 #define UFFD_FEATURE_MISSING_SHMEM		(1<<5)
 #define UFFD_FEATURE_EVENT_UNMAP		(1<<6)
+#define UFFD_FEATURE_SIGBUS			(1<<7)
 	__u64 features;
 
 	__u64 ioctls;
--
1.7.1
[RESEND PATCH v3 0/2] userfaultfd: selftest: Add tests for UFFD_FEATURE_SIGBUS feature
Resending, Andrew asked if I could resend the whole set. Just added 'Reviewed-by' to the commit log for 1/2, no other changes.

v3 patch for 2/2 as before: http://marc.info/?l=linux-mm&m=150148267803458&w=2
Previous patch set here: http://marc.info/?l=linux-mm&m=150095815413275&w=2

Prakash Sangappa (2):
  userfaultfd: Add feature to request for a signal delivery
  userfaultfd: selftest: Add tests for UFFD_FEATURE_SIGBUS feature

 fs/userfaultfd.c                         |    3 +
 include/uapi/linux/userfaultfd.h         |   10 ++-
 tools/testing/selftests/vm/userfaultfd.c |  127 +-
 3 files changed, 136 insertions(+), 4 deletions(-)
Re: [PATCH v2 2/2] userfaultfd: selftest: Add tests for UFFD_FEATURE_SIGBUS feature
On 7/30/17 12:07 AM, Mike Rapoport wrote: On Thu, Jul 27, 2017 at 10:18:40PM -0400, Prakash Sangappa wrote: This patch adds tests for the UFFD_FEATURE_SIGBUS feature. The tests verify signal delivery instead of userfault events, and also test the use of UFFDIO_COPY to allocate memory and retry accessing the monitored area after signal delivery. This patch also fixes a bug in uffd_poll_thread() where 'uffd' is leaked. Signed-off-by: Prakash Sangappa <prakash.sanga...@oracle.com> --- Change log v2: - Added comments to explain the tests. - Fixed test to fail immediately if signal repeats. - Addressed other review comments. v1: https://lkml.org/lkml/2017/7/26/101 --- Overall looks good to me, just a small nitpick below. [...] for (nr = 0; nr < split_nr_pages; nr++) { + if (signal_test) { + if (sigsetjmp(*sigbuf, 1) != 0) { + if (nr == lastnr) { + sig_repeats++; You can simply 'return 1' here, then the sig_repeats variable can be dropped and the return statement for signal_test can be simplified. Ok, sent the v3 patch with this change. Thanks, -Prakash.
[PATCH v3 2/2] userfaultfd: selftest: Add tests for UFFD_FEATURE_SIGBUS feature
This patch adds tests for the UFFD_FEATURE_SIGBUS feature. The tests verify signal delivery instead of userfault events, and also test the use of UFFDIO_COPY to allocate memory and retry accessing the monitored area after signal delivery. This patch also fixes a bug in uffd_poll_thread() where 'uffd' is leaked.

Signed-off-by: Prakash Sangappa <prakash.sanga...@oracle.com>
---
Change log
v3: Eliminated use of sig_repeat variable and simplified error return.
v2: - Added comments to explain the tests.
    - Fixed test to fail immediately if signal repeats.
    - Addressed other review comments.
v1: https://lkml.org/lkml/2017/7/26/101
---
 tools/testing/selftests/vm/userfaultfd.c |  127 ++++++++++++++++++++++++++++-
 1 files changed, 124 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
index 1eae79a..52740ae 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -66,6 +66,7 @@
 #include
 #include
 #include
+#include <setjmp.h>
 
 #ifdef __NR_userfaultfd
 
@@ -408,6 +409,7 @@ static int copy_page(int ufd, unsigned long offset)
 			userfaults++;
 			break;
 		case UFFD_EVENT_FORK:
+			close(uffd);
 			uffd = msg.arg.fork.ufd;
 			pollfd[0].fd = uffd;
 			break;
@@ -572,6 +574,17 @@ static int userfaultfd_open(int features)
 	return 0;
 }
 
+sigjmp_buf jbuf, *sigbuf;
+
+static void sighndl(int sig, siginfo_t *siginfo, void *ptr)
+{
+	if (sig == SIGBUS) {
+		if (sigbuf)
+			siglongjmp(*sigbuf, 1);
+		abort();
+	}
+}
+
 /*
  * For non-cooperative userfaultfd test we fork() a process that will
  * generate pagefaults, will mremap the area monitored by the
@@ -585,19 +598,59 @@ static int userfaultfd_open(int features)
  * The release of the pages currently generates event for shmem and
  * anonymous memory (UFFD_EVENT_REMOVE), hence it is not checked
  * for hugetlb.
+ * For signal test(UFFD_FEATURE_SIGBUS), signal_test = 1, we register
+ * monitored area, generate pagefaults and test that signal is delivered.
+ * Use UFFDIO_COPY to allocate missing page and retry. For signal_test = 2
+ * test robustness use case - we release monitored area, fork a process
+ * that will generate pagefaults and verify signal is generated.
+ * This also tests UFFD_FEATURE_EVENT_FORK event along with the signal
+ * feature. Using monitor thread, verify no userfault events are generated.
  */
-static int faulting_process(void)
+static int faulting_process(int signal_test)
 {
 	unsigned long nr;
 	unsigned long long count;
 	unsigned long split_nr_pages;
+	unsigned long lastnr;
+	struct sigaction act;
+	unsigned long signalled = 0;
 
 	if (test_type != TEST_HUGETLB)
 		split_nr_pages = (nr_pages + 1) / 2;
 	else
 		split_nr_pages = nr_pages;
 
+	if (signal_test) {
+		sigbuf = &jbuf;
+		memset(&act, 0, sizeof(act));
+		act.sa_sigaction = sighndl;
+		act.sa_flags = SA_SIGINFO;
+		if (sigaction(SIGBUS, &act, 0)) {
+			perror("sigaction");
+			return 1;
+		}
+		lastnr = (unsigned long)-1;
+	}
+
 	for (nr = 0; nr < split_nr_pages; nr++) {
+		if (signal_test) {
+			if (sigsetjmp(*sigbuf, 1) != 0) {
+				if (nr == lastnr) {
+					fprintf(stderr, "Signal repeated\n");
+					return 1;
+				}
+
+				lastnr = nr;
+				if (signal_test == 1) {
+					if (copy_page(uffd, nr * page_size))
+						signalled++;
+				} else {
+					signalled++;
+					continue;
+				}
+			}
+		}
+
 		count = *area_count(area_dst, nr);
 		if (count != count_verify[nr]) {
 			fprintf(stderr,
@@ -607,6 +660,9 @@ static int faulting_process(void)
 		}
 	}
 
+	if (signal_test)
+		return signalled != split_nr_pages;
+
 	if (test_type == TEST_HUGETLB)
 		return 0;
 
@@ -761,7 +817,7 @@ static int userfaultfd_events_test(void)
 		perror("fork"), exit(1);
 
 	if (!pid)
-		return faulting_process();
+		return faulting_process(0);
 
 	waitpid(pid, &err, 0);
 	if (err)
@@ -778,6 +834,70 @@ static int userfaultfd_events_test(void)
 	return userfaults != nr_pages;
 }
 
+sta
[PATCH v2 2/2] userfaultfd: selftest: Add tests for UFFD_FEATURE_SIGBUS feature
This patch adds tests for the UFFD_FEATURE_SIGBUS feature. The tests verify signal delivery instead of userfault events, and also test the use of UFFDIO_COPY to allocate memory and retry accessing the monitored area after signal delivery. This patch also fixes a bug in uffd_poll_thread() where 'uffd' is leaked.

Signed-off-by: Prakash Sangappa <prakash.sanga...@oracle.com>
---
Change log
v2: - Added comments to explain the tests.
    - Fixed test to fail immediately if signal repeats.
    - Addressed other review comments.
v1: https://lkml.org/lkml/2017/7/26/101
---
 tools/testing/selftests/vm/userfaultfd.c |  127 ++++++++++++++++++++++++++++-
 1 files changed, 124 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
index 1eae79a..3976d7a 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -66,6 +66,7 @@
 #include
 #include
 #include
+#include <setjmp.h>
 
 #ifdef __NR_userfaultfd
 
@@ -408,6 +409,7 @@ static int copy_page(int ufd, unsigned long offset)
 			userfaults++;
 			break;
 		case UFFD_EVENT_FORK:
+			close(uffd);
 			uffd = msg.arg.fork.ufd;
 			pollfd[0].fd = uffd;
 			break;
@@ -572,6 +574,17 @@ static int userfaultfd_open(int features)
 	return 0;
 }
 
+sigjmp_buf jbuf, *sigbuf;
+
+static void sighndl(int sig, siginfo_t *siginfo, void *ptr)
+{
+	if (sig == SIGBUS) {
+		if (sigbuf)
+			siglongjmp(*sigbuf, 1);
+		abort();
+	}
+}
+
 /*
  * For non-cooperative userfaultfd test we fork() a process that will
  * generate pagefaults, will mremap the area monitored by the
@@ -585,19 +598,59 @@ static int userfaultfd_open(int features)
  * The release of the pages currently generates event for shmem and
  * anonymous memory (UFFD_EVENT_REMOVE), hence it is not checked
  * for hugetlb.
+ * For signal test(UFFD_FEATURE_SIGBUS), signal_test = 1, we register
+ * monitored area, generate pagefaults and test that signal is delivered.
+ * Use UFFDIO_COPY to allocate missing page and retry. For signal_test = 2
+ * test robustness use case - we release monitored area, fork a process
+ * that will generate pagefaults and verify signal is generated.
+ * This also tests UFFD_FEATURE_EVENT_FORK event along with the signal
+ * feature. Using monitor thread, verify no userfault events are generated.
  */
-static int faulting_process(void)
+static int faulting_process(int signal_test)
 {
 	unsigned long nr;
 	unsigned long long count;
 	unsigned long split_nr_pages;
+	unsigned long lastnr;
+	struct sigaction act;
+	unsigned long signalled = 0, sig_repeats = 0;
 
 	if (test_type != TEST_HUGETLB)
 		split_nr_pages = (nr_pages + 1) / 2;
 	else
 		split_nr_pages = nr_pages;
 
+	if (signal_test) {
+		sigbuf = &jbuf;
+		memset(&act, 0, sizeof(act));
+		act.sa_sigaction = sighndl;
+		act.sa_flags = SA_SIGINFO;
+		if (sigaction(SIGBUS, &act, 0)) {
+			perror("sigaction");
+			return 1;
+		}
+		lastnr = (unsigned long)-1;
+	}
+
 	for (nr = 0; nr < split_nr_pages; nr++) {
+		if (signal_test) {
+			if (sigsetjmp(*sigbuf, 1) != 0) {
+				if (nr == lastnr) {
+					sig_repeats++;
+					break;
+				}
+
+				lastnr = nr;
+				if (signal_test == 1) {
+					if (copy_page(uffd, nr * page_size))
+						signalled++;
+				} else {
+					signalled++;
+					continue;
+				}
+			}
+		}
+
 		count = *area_count(area_dst, nr);
 		if (count != count_verify[nr]) {
 			fprintf(stderr,
@@ -607,6 +660,9 @@ static int faulting_process(void)
 		}
 	}
 
+	if (signal_test)
+		return !(signalled == split_nr_pages && sig_repeats == 0);
+
 	if (test_type == TEST_HUGETLB)
 		return 0;
 
@@ -761,7 +817,7 @@ static int userfaultfd_events_test(void)
 		perror("fork"), exit(1);
 
 	if (!pid)
-		return faulting_process();
+		return faulting_process(0);
 
 	waitpid(pid, &err, 0);
 	if (err)
@@ -778,6 +834,70 @@ static int userfaultfd_events_test(void)
 	return userfaults != nr_pages;
 }
 
+static int userfau
Re: [RESEND PATCH 1/2] userfaultfd: Add feature to request for a signal delivery
Yes, I will provide a man page update. -Prakash. On 7/27/17 4:58 AM, Michal Hocko wrote: Please do not forget to provide a man page update with the clarified semantics.
Re: [RESEND PATCH 2/2] userfaultfd: selftest: Add tests for UFFD_FREATURE_SIGBUS
On 7/26/17 7:27 AM, Andrea Arcangeli wrote: On Tue, Jul 25, 2017 at 12:47:42AM -0400, Prakash Sangappa wrote: Signed-off-by: Prakash Sangappa <prakash.sanga...@oracle.com> --- tools/testing/selftests/vm/userfaultfd.c | 121 +- 1 files changed, 118 insertions(+), 3 deletions(-) Like Mike said, some comment about the test would be better; commit messages are never one-liners in the kernel. Ok @@ -408,6 +409,7 @@ static int copy_page(int ufd, unsigned long offset) userfaults++; break; case UFFD_EVENT_FORK: + close(uffd); uffd = msg.arg.fork.ufd; pollfd[0].fd = uffd; break; Isn't this fd-leak bugfix independent of the rest of the changes? The only side effect should have been that it could run out of fds, but I assume this was found by source review, as I doubt it could run out of fds. This could be split off into a separate patch. Not just the fd leak - it causes problems here with the addition of the new test userfaultfd_sig_test(). Since the original vma registration persists in the parent, subsequent registration in userfaultfd_events_test() fails with an 'EBUSY' error, as the userfault implementation does not allow registering the same vma with another uffd while one exists. Therefore, this change is needed. I could just leave this fix here along with the rest of the changes - will that be ok? -Prakash Overall it looks like a good test, also exercising UFFD_EVENT_FORK at the same time. Thanks, Andrea
Re: [RESEND PATCH 2/2] userfaultfd: selftest: Add tests for UFFD_FREATURE_SIGBUS
On 7/26/17 12:53 AM, Mike Rapoport wrote: + /* * For non-cooperative userfaultfd test we fork() a process that will * generate pagefaults, will mremap the area monitored by the @@ -585,19 +598,54 @@ static int userfaultfd_open(int features) * The release of the pages currently generates event for shmem and * anonymous memory (UFFD_EVENT_REMOVE), hence it is not checked * for hugetlb. + * For signal test(UFFD_FEATURE_SIGBUS), primarily test signal + * delivery and ensure no userfault events are generated. Can you add some details about the tests? E.g. what is the meaning of signal_test=1 and signal_test=2, and what is the difference between them? Ok, I will. */ -static int faulting_process(void) +static int faulting_process(int signal_test) { unsigned long nr; unsigned long long count; unsigned long split_nr_pages; + unsigned long lastnr; + struct sigaction act; + unsigned long signalled=0, sig_repeats = 0; Spaces around that '=' ^ Will fix it. if (test_type != TEST_HUGETLB) split_nr_pages = (nr_pages + 1) / 2; else split_nr_pages = nr_pages; + if (signal_test) { + sigbuf = &jbuf; + memset (&act, 0, sizeof(act)); There should be no space between the function name and the open parenthesis. ok + act.sa_sigaction = sighndl; + act.sa_flags = SA_SIGINFO; + if (sigaction(SIGBUS, &act, 0)) { + perror("sigaction"); + return 1; + } + lastnr = (unsigned long)-1; + } + for (nr = 0; nr < split_nr_pages; nr++) { + if (signal_test) { + if (sigsetjmp(*sigbuf, 1) != 0) { + if (nr == lastnr) { + sig_repeats++; + continue; If I understand correctly, when nr == lastnr we get a repeated signal for the same page and this is an error, right? Yes. Why would we continue the test and not return an error immediately? Yes, it could just return an error. I will fix it.
+ } + + lastnr = nr; + if (signal_test == 1) { + if (copy_page(uffd, nr * page_size)) + signalled++; + } else { + signalled++; + continue; + } + } + } + count = *area_count(area_dst, nr); if (count != count_verify[nr]) { fprintf(stderr, @@ -607,6 +655,8 @@ static int faulting_process(void) } } + if (signal_test) + return signalled != split_nr_pages || sig_repeats != 0; I believe return !(signalled == split_nr_pages && sig_repeats == 0) is clearer. And a blank line after the return statement would be nice :) Ok. Will send out a v2 patch with the changes. Thanks, -Prakash
Re: [RESEND PATCH 2/2] userfaultfd: selftest: Add tests for UFFD_FREATURE_SIGBUS
On 7/26/17 12:53 AM, Mike Rapoport wrote: + /* * For non-cooperative userfaultfd test we fork() a process that will * generate pagefaults, will mremap the area monitored by the @@ -585,19 +598,54 @@ static int userfaultfd_open(int features) * The release of the pages currently generates event for shmem and * anonymous memory (UFFD_EVENT_REMOVE), hence it is not checked * for hugetlb. + * For signal test(UFFD_FEATURE_SIGBUS), primarily test signal + * delivery and ensure no userfault events are generated. Can you add some details about the tests? E.g. what is the meaning if signal_test=1 and signal_test=2 and what is the difference between them? Ok, I will. */ -static int faulting_process(void) +static int faulting_process(int signal_test) { unsigned long nr; unsigned long long count; unsigned long split_nr_pages; + unsigned long lastnr; + struct sigaction act; + unsigned long signalled=0, sig_repeats = 0; Spaces around that '=' ^ Will fix it. if (test_type != TEST_HUGETLB) split_nr_pages = (nr_pages + 1) / 2; else split_nr_pages = nr_pages; + if (signal_test) { + sigbuf = + memset (, 0, sizeof(act)); There should be no space between function name and open parenthesis. ok + act.sa_sigaction = sighndl; + act.sa_flags = SA_SIGINFO; + if (sigaction(SIGBUS, , 0)) { + perror("sigaction"); + return 1; + } + lastnr = (unsigned long)-1; + } + for (nr = 0; nr < split_nr_pages; nr++) { + if (signal_test) { + if (sigsetjmp(*sigbuf, 1) != 0) { + if (nr == lastnr) { + sig_repeats++; + continue; If I understand correctly, when nr == lastnr we get a repeated signal for the same page and this is an error, right? Yes, Why would we continue the test and won't return error immediately? Yes, it could just return error. I will fix it. 
+ } + + lastnr = nr; + if (signal_test == 1) { + if (copy_page(uffd, nr * page_size)) + signalled++; + } else { + signalled++; + continue; + } + } + } + count = *area_count(area_dst, nr); if (count != count_verify[nr]) { fprintf(stderr, @@ -607,6 +655,8 @@ static int faulting_process(void) } } + if (signal_test) + return signalled != split_nr_pages || sig_repeats != 0; I believe return !(signalled == split_nr_pages && sig_repeats == 0) is clearer. And a blank line after the return statement would be nice :) Ok. Will send out v2 patch with the changes. Thanks, -Prakash
[RESEND PATCH 0/2] userfaultfd: Add feature to request for a signal delivery
Hi Andrea, Mike, Resending - fixed email address. Here is the patch set for the proposed userfaultfd UFFD_FEATURE_SIGBUS feature, including tests in selftest/vm/userfaultfd.c Please review. See following for previous discussion. http://www.spinics.net/lists/linux-mm/msg129224.html http://www.spinics.net/lists/linux-mm/msg130678.html Thanks, Prakash Sangappa (2): userfaultfd: Add feature to request for a signal delivery userfaultfd: selftest: Add tests for UFFD_FEATURE_SIGBUS fs/userfaultfd.c | 3 + include/uapi/linux/userfaultfd.h | 10 ++- tools/testing/selftests/vm/userfaultfd.c | 121 +- 3 files changed, 130 insertions(+), 4 deletions(-)
[RESEND PATCH 1/2] userfaultfd: Add feature to request for a signal delivery
In some cases, the userfaultfd mechanism should just deliver a SIGBUS signal to the faulting process, instead of the page-fault event. Dealing with page-fault events using a monitor thread can be an overhead in these cases. For example, applications like the database could use the signaling mechanism for robustness purposes. The database uses hugetlbfs for performance reasons. Files on a hugetlbfs filesystem are created and huge pages allocated using the fallocate() API. Pages are deallocated/freed using fallocate() hole punching support. These files are mmapped and accessed by many processes as shared memory. The database keeps track of which offsets in the hugetlbfs file have pages allocated. Any access to a mapped address over holes in the file, which can occur due to bugs in the application, is considered invalid, and the process is expected to simply receive a SIGBUS. However, currently when a hole in the file is accessed via the mapped address, kernel/mm attempts to automatically allocate a page at page fault time, resulting in implicitly filling the hole in the file. This may not be the desired behavior for applications like the database that want to explicitly manage page allocations of hugetlbfs files. Using the userfaultfd mechanism with this support to get a signal, the database application can prevent pages from being allocated implicitly when processes access mapped addresses over holes in the file. This patch adds the UFFD_FEATURE_SIGBUS feature to the userfaultfd mechanism to request a SIGBUS signal. See following for previous discussion about the database requirement leading to this proposal as suggested by Andrea. 
http://www.spinics.net/lists/linux-mm/msg129224.html Signed-off-by: Prakash Sangappa <prakash.sanga...@oracle.com> --- fs/userfaultfd.c | 3 +++ include/uapi/linux/userfaultfd.h | 10 +- 2 files changed, 12 insertions(+), 1 deletions(-) diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index 1d622f2..0bbe7df 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -371,6 +371,9 @@ int handle_userfault(struct vm_fault *vmf, unsigned long reason) VM_BUG_ON(reason & ~(VM_UFFD_MISSING|VM_UFFD_WP)); VM_BUG_ON(!(reason & VM_UFFD_MISSING) ^ !!(reason & VM_UFFD_WP)); + if (ctx->features & UFFD_FEATURE_SIGBUS) + goto out; + /* * If it's already released don't get it. This avoids to loop * in __get_user_pages if userfaultfd_release waits on the diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h index 3b05953..d39d5db 100644 --- a/include/uapi/linux/userfaultfd.h +++ b/include/uapi/linux/userfaultfd.h @@ -23,7 +23,8 @@ UFFD_FEATURE_EVENT_REMOVE | \ UFFD_FEATURE_EVENT_UNMAP | \ UFFD_FEATURE_MISSING_HUGETLBFS | \ - UFFD_FEATURE_MISSING_SHMEM) + UFFD_FEATURE_MISSING_SHMEM | \ + UFFD_FEATURE_SIGBUS) #define UFFD_API_IOCTLS \ ((__u64)1 << _UFFDIO_REGISTER | \ (__u64)1 << _UFFDIO_UNREGISTER | \ @@ -153,6 +154,12 @@ struct uffdio_api { * UFFD_FEATURE_MISSING_SHMEM works the same as * UFFD_FEATURE_MISSING_HUGETLBFS, but it applies to shmem * (i.e. tmpfs and other shmem based APIs). + * + * UFFD_FEATURE_SIGBUS feature means no page-fault + * (UFFD_EVENT_PAGEFAULT) event will be delivered, instead + * a SIGBUS signal will be sent to the faulting process. + * The application process can enable this behavior by adding + * it to uffdio_api.features. */ #define UFFD_FEATURE_PAGEFAULT_FLAG_WP (1<<0) #define UFFD_FEATURE_EVENT_FORK (1<<1) @@ -161,6 +168,7 @@ struct uffdio_api { #define UFFD_FEATURE_MISSING_HUGETLBFS (1<<4) #define UFFD_FEATURE_MISSING_SHMEM (1<<5) #define UFFD_FEATURE_EVENT_UNMAP (1<<6) +#define UFFD_FEATURE_SIGBUS (1<<7) __u64 features; __u64 ioctls; -- 1.7.1
[RESEND PATCH 2/2] userfaultfd: selftest: Add tests for UFFD_FEATURE_SIGBUS
Signed-off-by: Prakash Sangappa <prakash.sanga...@oracle.com> --- tools/testing/selftests/vm/userfaultfd.c | 121 +- 1 files changed, 118 insertions(+), 3 deletions(-) diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c index 1eae79a..6a43e84 100644 --- a/tools/testing/selftests/vm/userfaultfd.c +++ b/tools/testing/selftests/vm/userfaultfd.c @@ -66,6 +66,7 @@ #include #include #include +#include #ifdef __NR_userfaultfd @@ -408,6 +409,7 @@ static int copy_page(int ufd, unsigned long offset) userfaults++; break; case UFFD_EVENT_FORK: + close(uffd); uffd = msg.arg.fork.ufd; pollfd[0].fd = uffd; break; @@ -572,6 +574,17 @@ static int userfaultfd_open(int features) return 0; } +sigjmp_buf jbuf, *sigbuf; + +static void sighndl(int sig, siginfo_t *siginfo, void *ptr) +{ +if (sig == SIGBUS) { +if (sigbuf) + siglongjmp(*sigbuf, 1); +abort(); +} +} + /* * For non-cooperative userfaultfd test we fork() a process that will * generate pagefaults, will mremap the area monitored by the @@ -585,19 +598,54 @@ static int userfaultfd_open(int features) * The release of the pages currently generates event for shmem and * anonymous memory (UFFD_EVENT_REMOVE), hence it is not checked * for hugetlb. + * For signal test(UFFD_FEATURE_SIGBUS), primarily test signal + * delivery and ensure no userfault events are generated. 
*/ -static int faulting_process(void) +static int faulting_process(int signal_test) { unsigned long nr; unsigned long long count; unsigned long split_nr_pages; + unsigned long lastnr; + struct sigaction act; + unsigned long signalled=0, sig_repeats = 0; if (test_type != TEST_HUGETLB) split_nr_pages = (nr_pages + 1) / 2; else split_nr_pages = nr_pages; + if (signal_test) { + sigbuf = &jbuf; + memset (&act, 0, sizeof(act)); + act.sa_sigaction = sighndl; + act.sa_flags = SA_SIGINFO; + if (sigaction(SIGBUS, &act, 0)) { + perror("sigaction"); + return 1; + } + lastnr = (unsigned long)-1; + } + for (nr = 0; nr < split_nr_pages; nr++) { + if (signal_test) { + if (sigsetjmp(*sigbuf, 1) != 0) { + if (nr == lastnr) { + sig_repeats++; + continue; + } + + lastnr = nr; + if (signal_test == 1) { + if (copy_page(uffd, nr * page_size)) + signalled++; + } else { + signalled++; + continue; + } + } + } + count = *area_count(area_dst, nr); if (count != count_verify[nr]) { fprintf(stderr, @@ -607,6 +655,8 @@ static int faulting_process(void) } } + if (signal_test) + return signalled != split_nr_pages || sig_repeats != 0; if (test_type == TEST_HUGETLB) return 0; @@ -761,7 +811,7 @@ static int userfaultfd_events_test(void) perror("fork"), exit(1); if (!pid) - return faulting_process(); + return faulting_process(0); waitpid(pid, &err, 0); if (err) @@ -778,6 +828,70 @@ static int userfaultfd_events_test(void) return userfaults != nr_pages; } +static int userfaultfd_sig_test(void) +{ + struct uffdio_register uffdio_register; + unsigned long expected_ioctls; + unsigned long userfaults; + pthread_t uffd_mon; + int err, features; + pid_t pid; + char c; + + printf("testing signal delivery: "); + fflush(stdout); + + if (uffd_test_ops->release_pages(area_dst)) + return 1; + + features = UFFD_FEATURE_EVENT_FORK|UFFD_FEATURE_SIGBUS; + if (userfaultfd_open(features) < 0) + return 1; + fcntl(uffd, F_SETFL, uffd_flags | O_NONBLOCK); + + uffdio_register.range.start = (unsigned long) area_dst; + uffdio_register.range.len 
= nr_pages * page_size; + uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING; + if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register)) + fprintf(stderr, "register failure\n"), exit(1);
Re: [RFC PATCH] userfaultfd: Add feature to request for a signal delivery
On 6/30/2017 6:08 AM, Andrea Arcangeli wrote: On Fri, Jun 30, 2017 at 11:47:35AM +0200, Michal Hocko wrote: [...] As an aside, I remember that prior to MADV_FREE there was a long discussion about lazy freeing of memory from userspace. Some users wanted to be signalled when their memory was freed by the system so that they could rebuild the original content (e.g. uncompressed images in memory). It seems like MADV_FREE + this signalling could be used for that usecase. John would surely know more about those usecases. That would provide an equivalent API to the one volatile pages provided. Agreed. So it would allow to adapt code (if any?) more easily to drop the duplicate feature in volatile pages code (however it would be faster if the userland code using volatile pages lazy reclaim mode was converted to poll the uffd so the kernel talks directly to the monitor without involving a SIGBUS signal handler which will cause spurious enter/exit if compared to signal-less uffd API). The main benefit in my view is not volatile pages but that UFFD_FEATURE_SIGBUS would work equally well to enforce robustness on all kinds of memory not only hugetlbfs (so one could run the database with robustness on THP over tmpfs) and the new cache can be injected in the filesystem using UFFDIO_COPY which is likely faster than fallocate as UFFDIO_COPY was already demonstrated to be faster even than a regular page fault. Interesting that UFFDIO_COPY is faster than fallocate(). In the DB use case the page does not need to be allocated at the time a process trips on the hugetlbfs file hole and receives SIGBUS. fallocate() is called on the hugetlbfs file, when more memory needs to be allocated by a separate process. 
It's also simpler to handle backwards compatibility with the UFFDIO_API call, that allows probing if UFFD_FEATURE_SIGBUS is supported by the running kernel regardless of kernel version (so it can be backported and enabled by the database, without the database noticing it's on an older kernel version). Yes, this is useful as this change will need to be backported. So while this wasn't the intended way to use the userfault and I already pointed out the possibility to use a single monitor to do all this, I'm positive about UFFD_FEATURE_SIGBUS if the overhead of having a monitor is so concerning. Ultimately there are many pros and just a single con: the branch in handle_userfault(). I wonder if it would be possible to use static_branch_enable() in UFFDIO_API and static_branch_unlikely in handle_userfault() to eliminate that branch but perhaps it's overkill and UFFDIO_API is unprivileged and it would send an IPI to all CPUs. I don't think we normally expose the static_branch_enable() to unprivileged userland and making UFFD_FEATURE_SIGBUS a privileged op doesn't sound attractive (although the alternative of altering a hugetlbfs mount option would be a privileged op). Regarding the hugetlbfs mount option, one consideration is to allow mounts of hugetlbfs inside a user namespace's mount namespace. Which would allow non-privileged processes to mount hugetlbfs for use inside a user namespace. This may be needed even for the 'min_size' mount option using which an application could reserve huge pages and mount a filesystem for its use, without the need to have privileges given the system has enough hugepages configured. It seems if non-privileged processes are allowed to mount the hugetlbfs filesystem, then min_size should be subject to some resource limits. Mounting inside a user namespace will be a different patch proposal later. Thanks, Andrea
Re: [RFC PATCH] userfaultfd: Add feature to request for a signal delivery
On 6/28/17 6:18 AM, Mike Rapoport wrote: On Tue, Jun 27, 2017 at 09:01:20AM -0700, Prakash Sangappa wrote: On 6/27/17 8:35 AM, Mike Rapoport wrote: On Tue, Jun 27, 2017 at 09:06:43AM +0200, Michal Hocko wrote: This is a user-visible API so let's CC linux-api mailing list. On Mon 26-06-17 12:46:13, Prakash Sangappa wrote: Any access to mapped address over holes in the file, which can occur due to bugs in the application, is considered invalid and expect the process to simply receive a SIGBUS. However, currently when a hole in the file is accessed via the mapped address, kernel/mm attempts to automatically allocate a page at page fault time, resulting in implicitly filling the hole in the file. This may not be the desired behavior for applications like the database that want to explicitly manage page allocations of hugetlbfs files. So you register UFFD_FEATURE_SIGBUS on each region that you are unmapping and then just let those offenders die? If I understand correctly, the database will create the mapping, then it'll open userfaultfd and register those mappings with the userfault. Afterwards, when the application accesses a hole userfault will cause SIGBUS and the application will process it in whatever way it likes, e.g. just die. Yes. What I don't understand is why you won't use a userfault monitor process that will take care of the page fault events? It shouldn't be much overhead running it and it can keep track of all the userfault file descriptors for you and it will allow more versatile error handling than SIGBUS. Co-ordination with the external monitor process by all the database processes to send their userfaultfd is still an overhead. You are planning to register in userfaultfd only the holes you punch to deallocate pages, am I right? No, the entire mmap'ed region. The DB processes would mmap(MAP_NORESERVE) hugetlbfs files, register this mapped address with userfaultfd once right after the mmap() call. 
And the co-ordination of the userfault file descriptor with the monitor would have been added after calls to fallocate() and userfaultfd_register()? Well, the database application does not need to deal with a monitor. I've just been thinking that maybe it would be possible to use UFFD_EVENT_REMOVE for this case. We anyway need to implement the generation of UFFD_EVENT_REMOVE for the case of hole punching in hugetlbfs for non-cooperative userfaultfd. It could be that it will solve your issue as well. Will this result in a signal delivery? In the use case described, the database application does not need any event for hole punching. Basically, just a signal for any invalid access to mapped area over holes in the file.
Re: [PATCH RFC] hugetlbfs 'noautofill' mount option
On 6/20/17 4:35 PM, Prakash Sangappa wrote: On 6/16/17 6:15 AM, Andrea Arcangeli wrote: Adding a single if (ctx->feature & UFFD_FEATURE_SIGBUS) goto out, branch for this corner case to handle_userfault() isn't great and the hugetlbfs mount option is absolutely zero cost to the handle_userfault which is primarily why I'm not against it.. although it's not going to be measurable so it would be ok also to add such feature. If implementing UFFD_FEATURE_SIGBUS is preferred instead of the mount option, I could look into that. Implementing UFFD_FEATURE_SIGBUS seems reasonable. I wanted to note here on this thread that I sent out a separate RFC patch review for adding UFFD_FEATURE_SIGBUS. See, http://marc.info/?l=linux-mm&m=149857975906880&w=2
Re: [RFC PATCH] userfaultfd: Add feature to request for a signal delivery
On 6/27/17 8:35 AM, Mike Rapoport wrote: On Tue, Jun 27, 2017 at 09:06:43AM +0200, Michal Hocko wrote: This is a user-visible API so let's CC linux-api mailing list. On Mon 26-06-17 12:46:13, Prakash Sangappa wrote: In some cases, userfaultfd mechanism should just deliver a SIGBUS signal to the faulting process, instead of the page-fault event. Dealing with page-fault event using a monitor thread can be an overhead in these cases. For example applications like the database could use the signaling mechanism for robustness purpose. this is rather confusing. What is the reason that the monitor would be slower than signal delivery and handling? Database uses hugetlbfs for performance reason. Files on hugetlbfs filesystem are created and huge pages allocated using fallocate() API. Pages are deallocated/freed using fallocate() hole punching support. These files are mmapped and accessed by many processes as shared memory. The database keeps track of which offsets in the hugetlbfs file have pages allocated. Any access to mapped address over holes in the file, which can occur due to bugs in the application, is considered invalid and expect the process to simply receive a SIGBUS. However, currently when a hole in the file is accessed via the mapped address, kernel/mm attempts to automatically allocate a page at page fault time, resulting in implicitly filling the hole in the file. This may not be the desired behavior for applications like the database that want to explicitly manage page allocations of hugetlbfs files. So you register UFFD_FEATURE_SIGBUS on each region that you are unmapping and then just let those offenders die? If I understand correctly, the database will create the mapping, then it'll open userfaultfd and register those mappings with the userfault. Afterwards, when the application accesses a hole userfault will cause SIGBUS and the application will process it in whatever way it likes, e.g. just die. Yes. 
What I don't understand is why you won't use a userfault monitor process that will take care of the page fault events? It shouldn't be much overhead running it, it can keep track of all the userfault file descriptors for you, and it will allow more versatile error handling than SIGBUS.

Coordination with the external monitor process by all the database processes to send their userfaultfd is still an overhead. Using the userfaultfd mechanism with this support to get a signal, the database application can prevent pages from being allocated implicitly when processes access mapped addresses over holes in the file. This patch adds the feature to request a SIGBUS signal from the userfaultfd mechanism. See the following for the previous discussion about the database requirement leading to this proposal, as suggested by Andrea. http://www.spinics.net/lists/linux-mm/msg129224.html

Please make those requirements part of the changelog.

Signed-off-by: Prakash <prakash.sanga...@oracle.com>
---
 fs/userfaultfd.c                 |  5 +++++
 include/uapi/linux/userfaultfd.h | 10 +++++++++-
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 1d622f2..5686d6d2 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -371,6 +371,11 @@ int handle_userfault(struct vm_fault *vmf, unsigned long reason)
 	VM_BUG_ON(reason & ~(VM_UFFD_MISSING|VM_UFFD_WP));
 	VM_BUG_ON(!(reason & VM_UFFD_MISSING) ^ !!(reason & VM_UFFD_WP));
 
+	if (ctx->features & UFFD_FEATURE_SIGBUS) {
+		goto out;
+	}
+
 	/*
 	 * If it's already released don't get it. This avoids to loop
 	 * in __get_user_pages if userfaultfd_release waits on the
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 3b05953..d39d5db 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -23,7 +23,8 @@
 			   UFFD_FEATURE_EVENT_REMOVE |	\
 			   UFFD_FEATURE_EVENT_UNMAP |	\
 			   UFFD_FEATURE_MISSING_HUGETLBFS |	\
-			   UFFD_FEATURE_MISSING_SHMEM)
+			   UFFD_FEATURE_MISSING_SHMEM |	\
+			   UFFD_FEATURE_SIGBUS)
 #define UFFD_API_IOCTLS				\
 	((__u64)1 << _UFFDIO_REGISTER |		\
 	 (__u64)1 << _UFFDIO_UNREGISTER |	\
@@ -153,6 +154,12 @@ struct uffdio_api {
 	 * UFFD_FEATURE_MISSING_SHMEM works the same as
 	 * UFFD_FEATURE_MISSING_HUGETLBFS, but it applies to shmem
 	 * (i.e. tmpfs and other shmem based APIs).
+	 *
+	 * UFFD_FEATURE_SIGBUS feature means no page-fault
+	 * (UFFD_EVENT_PAGEFAULT) event will be delivered, instead
+	 * a SIGBUS signal will be sent to the faulting process.
+	 * The application process can enable this behavior by adding
+	 * it to uffdio_api.features.
 	 */
 #define UFFD_FEATURE_PAGEFAULT_FLAG_WP		(1<<0)
 #define UFFD_FEATURE_EVENT_FORK		(1<<1)
Re: [RFC PATCH] userfaultfd: Add feature to request for a signal delivery
On 6/27/17 12:06 AM, Michal Hocko wrote:
This is a user visible API so let's CC the linux-api mailing list.

On Mon 26-06-17 12:46:13, Prakash Sangappa wrote:
In some cases, the userfaultfd mechanism should just deliver a SIGBUS signal to the faulting process instead of the page-fault event. Dealing with the page-fault event using a monitor thread can be an overhead in these cases. For example, applications like the database could use the signaling mechanism for robustness purposes.

this is rather confusing. What is the reason that the monitor would be slower than signal delivery and handling?

There are a large number of single-threaded database processes involved; each of these processes would require a monitor thread, which is considered an overhead.

The database uses hugetlbfs for performance reasons. Files on a hugetlbfs filesystem are created and huge pages allocated using the fallocate() API. Pages are deallocated/freed using fallocate() hole-punching support. These files are mmapped and accessed by many processes as shared memory. The database keeps track of which offsets in the hugetlbfs file have pages allocated. Any access to a mapped address over holes in the file, which can occur due to bugs in the application, is considered invalid, and the process is expected to simply receive a SIGBUS. However, currently when a hole in the file is accessed via the mapped address, kernel/mm attempts to automatically allocate a page at page fault time, implicitly filling the hole in the file. This may not be the desired behavior for applications like the database that want to explicitly manage page allocations of hugetlbfs files.

So you register UFFD_FEATURE_SIGBUS on each region that you are unmapping and then just let those offenders die?

The database application will create the mapping and register with userfault. Subsequently, when a process accesses the mapping over a hole, it will receive SIGBUS and die.
Using the userfaultfd mechanism with this support to get a signal, the database application can prevent pages from being allocated implicitly when processes access mapped addresses over holes in the file. This patch adds the feature to request a SIGBUS signal from the userfaultfd mechanism. See the following for the previous discussion about the database requirement leading to this proposal, as suggested by Andrea. http://www.spinics.net/lists/linux-mm/msg129224.html

Please make those requirements part of the changelog.

The requirement is described above, which is the need for the database application to not fill holes implicitly. Sorry if this was not clear. I will update the changelog and send a v2 patch.
[RFC PATCH] userfaultfd: Add feature to request for a signal delivery
In some cases, the userfaultfd mechanism should just deliver a SIGBUS signal to the faulting process instead of the page-fault event. Dealing with the page-fault event using a monitor thread can be an overhead in these cases. For example, applications like the database could use the signaling mechanism for robustness purposes.

The database uses hugetlbfs for performance reasons. Files on a hugetlbfs filesystem are created and huge pages allocated using the fallocate() API. Pages are deallocated/freed using fallocate() hole-punching support. These files are mmapped and accessed by many processes as shared memory. The database keeps track of which offsets in the hugetlbfs file have pages allocated. Any access to a mapped address over holes in the file, which can occur due to bugs in the application, is considered invalid, and the process is expected to simply receive a SIGBUS. However, currently when a hole in the file is accessed via the mapped address, kernel/mm attempts to automatically allocate a page at page fault time, implicitly filling the hole in the file. This may not be the desired behavior for applications like the database that want to explicitly manage page allocations of hugetlbfs files.

Using the userfaultfd mechanism with this support to get a signal, the database application can prevent pages from being allocated implicitly when processes access mapped addresses over holes in the file. This patch adds the feature to request a SIGBUS signal from the userfaultfd mechanism. See the following for the previous discussion about the database requirement leading to this proposal, as suggested by Andrea.
http://www.spinics.net/lists/linux-mm/msg129224.html

Signed-off-by: Prakash
---
 fs/userfaultfd.c                 |  5 +++++
 include/uapi/linux/userfaultfd.h | 10 +++++++++-
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 1d622f2..5686d6d2 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -371,6 +371,11 @@ int handle_userfault(struct vm_fault *vmf, unsigned long reason)
 	VM_BUG_ON(reason & ~(VM_UFFD_MISSING|VM_UFFD_WP));
 	VM_BUG_ON(!(reason & VM_UFFD_MISSING) ^ !!(reason & VM_UFFD_WP));
 
+	if (ctx->features & UFFD_FEATURE_SIGBUS) {
+		goto out;
+	}
+
 	/*
 	 * If it's already released don't get it. This avoids to loop
 	 * in __get_user_pages if userfaultfd_release waits on the
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 3b05953..d39d5db 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -23,7 +23,8 @@
 			   UFFD_FEATURE_EVENT_REMOVE |	\
 			   UFFD_FEATURE_EVENT_UNMAP |	\
 			   UFFD_FEATURE_MISSING_HUGETLBFS |	\
-			   UFFD_FEATURE_MISSING_SHMEM)
+			   UFFD_FEATURE_MISSING_SHMEM |	\
+			   UFFD_FEATURE_SIGBUS)
 #define UFFD_API_IOCTLS				\
 	((__u64)1 << _UFFDIO_REGISTER |		\
 	 (__u64)1 << _UFFDIO_UNREGISTER |	\
@@ -153,6 +154,12 @@ struct uffdio_api {
 	 * UFFD_FEATURE_MISSING_SHMEM works the same as
 	 * UFFD_FEATURE_MISSING_HUGETLBFS, but it applies to shmem
	 * (i.e. tmpfs and other shmem based APIs).
+	 *
+	 * UFFD_FEATURE_SIGBUS feature means no page-fault
+	 * (UFFD_EVENT_PAGEFAULT) event will be delivered, instead
+	 * a SIGBUS signal will be sent to the faulting process.
+	 * The application process can enable this behavior by adding
+	 * it to uffdio_api.features.
 	 */
 #define UFFD_FEATURE_PAGEFAULT_FLAG_WP		(1<<0)
 #define UFFD_FEATURE_EVENT_FORK		(1<<1)
@@ -161,6 +168,7 @@ struct uffdio_api {
 #define UFFD_FEATURE_MISSING_HUGETLBFS		(1<<4)
 #define UFFD_FEATURE_MISSING_SHMEM		(1<<5)
 #define UFFD_FEATURE_EVENT_UNMAP		(1<<6)
+#define UFFD_FEATURE_SIGBUS			(1<<7)
 	__u64 features;
 
 	__u64 ioctls;
--
2.7.4
Re: [PATCH RFC] hugetlbfs 'noautofill' mount option
On 6/16/17 6:15 AM, Andrea Arcangeli wrote:
Hello Prakash,

Thanks for your response. Comments inline.

On Tue, May 09, 2017 at 01:59:34PM -0700, Prakash Sangappa wrote:
On 5/9/17 1:58 AM, Christoph Hellwig wrote:
On Mon, May 08, 2017 at 03:12:42PM -0700, prakash.sangappa wrote:
Regarding #3 as a general feature, do we want to consider this and the complexity associated with the implementation?

We have to. Given that no one has exclusive access to hugetlbfs, a mount option is fundamentally the wrong interface.

A hugetlbfs filesystem may need to be mounted for exclusive use by an application. Note that the 'min_size' mount option was recently added to hugetlbfs, which reserves a minimum number of huge pages for that filesystem for use by an application. If a filesystem with min_size specified is not set up for exclusive use by an application, then the purpose of reserving huge pages is defeated. The min_size option was for use by applications like the database. Also, I am investigating enabling hugetlbfs mounts within a user namespace's mount namespace. That would allow an application to mount a hugetlbfs filesystem inside a namespace exclusively for its use, running as a non-root user. For this it seems like 'min_size' should be subject to some user limits. Anyway, mounting inside user namespaces is a different discussion. So, if a filesystem has to be set up for exclusive use by an application, then different mount options can be used for that filesystem.

Before userfaultfd I used a madvise that triggered SIGBUS. Aside from performance, which is much lower than userfaultfd because of the return to userland, SIGBUS handling, and re-entering the kernel to communicate through a pipe with a memory manager, it couldn't work reliably, because you're not going to get exact information on the virtual address that triggered the fault if the SIGBUS triggers in a copy-user of some random syscall; depending on the syscall, some random error will be returned.
So it couldn't work transparently to the app as far as syscalls and get_user_pages drivers were concerned.

Sure, seems like that would be the case if an application wants to take some action as a result of the fault.

With your solution, if you pass a corrupted pointer to a random read() syscall you're going to get an error, but supposedly you already handle any syscall error and stop the app.

Yes, the expectation is that the application will handle the error and stop. This would be similar to an application passing an invalid address to a system call. So, in the use case for this feature, accessing the mapped address over a hole in the hugetlbfs file is invalid. The application will keep track of the valid regions.

This is a special case because you don't care about performance and you don't care about not returning random EFAULT errors from syscalls like read().

Exactly.

This mount option seems non-intrusive enough, and hugetlbfs is quite special already, so I'm not particularly concerned by the fact it's one more special tweak. If it would be enough to convert the SIGBUS into a (killable) process hang, you could still use uffd and there would be no need to send the uffd to a manager. You'd find the corrupting buggy process stuck in handle_userfault().

This could be a useful feature in debug mode. However, in the normal mode the application should exit/die.

As an alternative to the mount option we could consider adding UFFD_FEATURE_SIGBUS, which tells handle_userfault() to simply return VM_FAULT_SIGBUS in the presence of a page fault event. You'd still get weird EFAULT or erratic retvals from syscalls, so it would only be usable for your robustness feature. Then you could use UFFDIO_COPY too to fill the memory atomically, which runs faster than a page fault (fallocate punch hole is still required to zap it).
Adding a single "if (ctx->features & UFFD_FEATURE_SIGBUS) goto out;" branch for this corner case to handle_userfault() isn't great, and the hugetlbfs mount option is absolutely zero cost to handle_userfault, which is primarily why I'm not against it.. although it's not going to be measurable, so it would be ok also to add such a feature.

Sure, UFFD_FEATURE_SIGBUS would address the use case for the database using hugetlbfs. This could be a generic API and so could be useful in other cases as well, maybe? However for this, userfaultfd(2) has to be opened to register. This fd has to remain open. Is this ok? Also, even though a monitor thread will not be required for this particular feature, hopefully it will not hinder future enhancements to userfaultfd. The expectation is that the overhead of registering UFFD_FEATURE_SIGBUS is minimal, and the registration will be done by the application once after every mmap() call as required; hopefully this is not required to be done frequently. In the database use case, the registration will mainly be done once in the beginning when mapping hugetlbfs files.
Re: [PATCH RFC] hugetlbfs 'noautofill' mount option
On 5/9/17 1:59 PM, Prakash Sangappa wrote:
On 5/9/17 1:58 AM, Christoph Hellwig wrote:
On Mon, May 08, 2017 at 03:12:42PM -0700, prakash.sangappa wrote:
Regarding #3 as a general feature, do we want to consider this and the complexity associated with the implementation?

We have to. Given that no one has exclusive access to hugetlbfs, a mount option is fundamentally the wrong interface.

A hugetlbfs filesystem may need to be mounted for exclusive use by an application. Note that the 'min_size' mount option was recently added to hugetlbfs, which reserves a minimum number of huge pages for that filesystem for use by an application. If a filesystem with min_size specified is not set up for exclusive use by an application, then the purpose of reserving huge pages is defeated. The min_size option was for use by applications like the database. Also, I am investigating enabling hugetlbfs mounts within a user namespace's mount namespace. That would allow an application to mount a hugetlbfs filesystem inside a namespace exclusively for its use, running as a non-root user. For this it seems like 'min_size' should be subject to some user limits. Anyway, mounting inside user namespaces is a different discussion. So, if a filesystem has to be set up for exclusive use by an application, then different mount options can be used for that filesystem.

Any further comments? Cc'ing Andrea, as we had discussed this requirement for the database.
Re: [PATCH RFC] hugetlbfs 'noautofill' mount option
On 5/9/17 1:58 AM, Christoph Hellwig wrote: On Mon, May 08, 2017 at 03:12:42PM -0700, prakash.sangappa wrote: Regarding #3 as a general feature, do we want to consider this and the complexity associated with the implementation? We have to. Given that no one has exclusive access to hugetlbfs, a mount option is fundamentally the wrong interface. A hugetlbfs filesystem may need to be mounted for exclusive use by an application. Note that the 'min_size' mount option was recently added to hugetlbfs, which reserves a minimum number of huge pages for that filesystem for use by an application. If a filesystem with min_size specified is not set up for exclusive use by an application, then the purpose of reserving huge pages is defeated. The min_size option was intended for use by applications like the database. Also, I am investigating enabling hugetlbfs mounts within a user namespace's mount namespace. That would allow an application to mount a hugetlbfs filesystem inside a namespace exclusively for its use, running as a non-root user. For this it seems like 'min_size' should be subject to some user limits. Anyway, mounting inside user namespaces is a different discussion. So, if a filesystem has to be set up for exclusive use by an application, then different mount options can be used for that filesystem.
Re: [PATCH RFC] hugetlbfs 'noautofill' mount option
On 5/3/17 12:02 PM, Prakash Sangappa wrote: On 5/2/17 4:43 PM, Dave Hansen wrote: Ideally, it would be something that is *not* specifically for hugetlbfs. MADV_NOAUTOFILL, for instance, could be defined to SIGSEGV whenever memory is touched that was not populated with MADV_WILLNEED, mlock(), etc... If this is a generic advice type, the necessary support will have to be implemented in the various filesystems that can support it. The proposed behavior for 'noautofill' was to not fill holes in files (like sparse files). In the page fault path, mm would not know whether the mmapped address on which the fault occurred is over a hole in the file or whether the page is simply not present in the page cache. The underlying filesystem would be called, and it determines if it is a hole; that is where it would fail and not fill the hole, if this support were added. Normally, filesystems that support sparse files (holes in files) automatically fill a hole when it is accessed. Then there is the issue of filesystem block size vs. page size. If the block size is smaller than the page size, it could mean that noautofill would only work if the hole size is equal to, or a multiple of, the page size. In the case of hugetlbfs it is much more straightforward: this filesystem is not like normal filesystems, and file sizes are a multiple of the huge page size, so a hole will also be a multiple of the huge page size. For this reason, should the advice be specific to hugetlbfs? Any further comments? I think introducing a general madvise option or an mmap flag applicable to all filesystems may not be required. The 'noautofill' behavior would be specifically useful in the hugetlbfs filesystem. So, if it is specific to hugetlbfs, would the mount option be OK? Otherwise, would adding a madvise()/mmap() option specific to hugetlbfs be preferred?
Re: [PATCH RFC] hugetlbfs 'noautofill' mount option
On 5/2/17 4:43 PM, Dave Hansen wrote: On 05/02/2017 04:34 PM, Prakash Sangappa wrote: Similarly, a madvise() option also requires an additional system call by every process mapping the file; this is considered an overhead for the database. How long-lived are these processes? For a database, I'd assume that this would happen a single time, or a single time per mmap() at process startup time. Such a syscall would be doing something on the order of taking mmap_sem, walking the VMA tree, setting a bit per VMA, and unlocking. That's a pretty cheap one-time cost... Plus a call into the filesystem (a_ops?) to check whether the underlying filesystem supports not filling holes on mapped access, before setting the bit per VMA. Although the overhead may not be that bad. Database processes can exit and new ones be started, for instance, depending on database activity. If we do consider a new madvise() option, will it be acceptable given that it would be specifically for hugetlbfs file mappings? Ideally, it would be something that is *not* specifically for hugetlbfs. MADV_NOAUTOFILL, for instance, could be defined to SIGSEGV whenever memory is touched that was not populated with MADV_WILLNEED, mlock(), etc... If this is a generic advice type, the necessary support will have to be implemented in the various filesystems that can support it. The proposed behavior for 'noautofill' was to not fill holes in files (like sparse files). In the page fault path, mm would not know whether the mmapped address on which the fault occurred is over a hole in the file or whether the page is simply not present in the page cache. The underlying filesystem would be called, and it determines if it is a hole; that is where it would fail and not fill the hole, if this support were added. Normally, filesystems that support sparse files (holes in files) automatically fill a hole when it is accessed. Then there is the issue of filesystem block size vs. page size.
If the block size is smaller than the page size, it could mean that noautofill would only work if the hole size is equal to, or a multiple of, the page size. In the case of hugetlbfs it is much more straightforward: this filesystem is not like normal filesystems, and file sizes are a multiple of the huge page size, so a hole will also be a multiple of the huge page size. For this reason, should the advice be specific to hugetlbfs? If so, would a new flag to the mmap() call itself be acceptable, which would define the proposed behavior? That way no additional system calls need to be made. I don't feel super strongly about it, but I guess an mmap() flag could work too. Same goes for the mmap call, if it is a generic flag. -Prakash.
Re: [PATCH RFC] hugetlbfs 'noautofill' mount option
On 5/2/17 2:32 PM, Dave Hansen wrote: On 05/01/2017 11:00 AM, Prakash Sangappa wrote: This patch adds a new hugetlbfs mount option 'noautofill', to indicate that pages should not be allocated at page fault time when accessed through a mmapped address. I think the main argument against doing something like this is further specializing hugetlbfs. I was really hoping that userfaultfd would be usable for your needs here. Could you elaborate on other options that you considered? Did you look at userfaultfd? What about an madvise() option that disallows backing allocations? Yes, we did consider userfaultfd and madvise(). The use case in mind is the database. With a database, a large number of single-threaded processes are involved, which map the hugetlbfs file and use it for shared memory. The concern with using userfaultfd is the overhead of the setup and of having an additional thread per process to monitor the userfaultfd. Even if the additional thread can be avoided by using an external monitor process, with each process sending its userfaultfd to this monitor process, the setup overhead still exists. Similarly, a madvise() option also requires an additional system call by every process mapping the file; this is considered an overhead for the database. If we do consider a new madvise() option, will it be acceptable given that it would be specifically for hugetlbfs file mappings? If so, would a new flag to the mmap() call itself be acceptable, which would define the proposed behavior? That way no additional system calls need to be made. Again, this mmap flag would be applicable specifically to hugetlbfs file mappings. With the proposed mount option, it would enforce one consistent behavior, and the application using this filesystem would not have to take additional steps as with userfaultfd or madvise().
Re: [PATCH RFC] hugetlbfs 'noautofill' mount option
On 5/2/17 3:53 AM, Anshuman Khandual wrote: On 05/01/2017 11:30 PM, Prakash Sangappa wrote: Some applications, like a database, use hugetlbfs for performance reasons. Files on a hugetlbfs filesystem are created and huge pages allocated using the fallocate() API. Pages are deallocated/freed using the fallocate() hole-punching support that has been added to hugetlbfs. These files are mmapped and accessed by many processes as shared memory. Such applications keep track of which offsets in the hugetlbfs file have pages allocated. Any access to a mapped address over holes in the file, which can occur due s/mapped/unmapped/ ^ ? It is 'mapped' address. to bugs in the application, is considered invalid, and the application expects the process to simply receive a SIGBUS. However, currently when a hole in the file is accessed via the mapped address, kernel/mm attempts to automatically allocate a page at page fault time, resulting in implicitly filling the hole But this is expected when you try to control the file allocation from a mapped address. Any changes while walking past or writing the range in the memory mapped should reflect exactly in the file on the disk. Why is it not a valid behavior? Sure, that is a valid behavior. However, hugetlbfs is a pseudo filesystem and its purpose is for applications to use huge-page memory. The contents of this filesystem are not backed by disk, nor are they swapped out. The proposed new behavior is only applicable to hugetlbfs filesystems mounted with the new 'noautofill' mount option. The file's page allocation/freeing is managed using the fallocate() API. For hugetlbfs filesystems mounted without this option, there is no change in behavior. in the file. This may not be the desired behavior for applications like the database that want to explicitly manage page allocations of hugetlbfs files. This patch adds a new hugetlbfs mount option 'noautofill', to indicate that pages should not be allocated at page fault time when accessed through a mmapped address.
When should the page be allocated for the mapping? The application would allocate/free file pages using the fallocate() API.