Re: [PATCH 0/1] powerpc/vdso: remove redundant locking in update_vsyscall_tz()
On Fri, Nov 09, 2012 at 09:57:49AM +0800, Shan Hai wrote:

The locking in update_vsyscall_tz() is not only unnecessary, because the vdso code copies the data unprotected in __kernel_gettimeofday(), but it also introduces a hard to reproduce race condition between update_vsyscall() and update_vsyscall_tz(), which causes user space processes to loop forever in vdso code. The following patch removes the locking from update_vsyscall_tz().

---
 arch/powerpc/kernel/time.c | 5 -----
 1 file changed, 5 deletions(-)

Hi Ben,

Would you please review this one?

From 167eac293b07e0ee201ffe5043ec442d52495a48 Mon Sep 17 00:00:00 2001
From: Shan Hai <shan@windriver.com>
Date: Mon, 5 Nov 2012 18:24:22 +0800
Subject: [PATCH 1/1] powerpc/vdso: remove redundant locking in update_vsyscall_tz()

The locking is not only unnecessary, because the vdso code copies the data unprotected in __kernel_gettimeofday(), but also erroneous, because updating tb_update_count is not atomic; this introduces a hard to reproduce race condition between update_vsyscall() and update_vsyscall_tz(), which in turn causes a user space process to loop forever in vdso code.

The scenario below describes the race condition (x is tb_update_count, x==0 initially):

           Boot CPU                       other CPU
       proc_P: x==0
       timer interrupt
           update_vsyscall
 x==1          x++; sync
                                      settimeofday
                                          update_vsyscall_tz
 x==2                                         x++; sync
 x==3          sync; x++
                                              sync; x++
       proc_P: x==3 (loops until x becomes even)

The race arises because the ++ operator is implemented as three instructions and is not atomic on powerpc.

A similar change was made for x86 in commit 6c260d58634 (x86: vdso: Remove bogus locking in update_vsyscall_tz)

Signed-off-by: Shan Hai <shan@windriver.com>
---
 arch/powerpc/kernel/time.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 03b29a6..426141f 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -859,13 +859,8 @@ void update_vsyscall(struct timespec *wall_time, struct timespec *wtm,
 
 void update_vsyscall_tz(void)
 {
-	/* Make userspace gettimeofday spin until we're done. */
-	++vdso_data->tb_update_count;
-	smp_mb();
 	vdso_data->tz_minuteswest = sys_tz.tz_minuteswest;
 	vdso_data->tz_dsttime = sys_tz.tz_dsttime;
-	smp_mb();
-	++vdso_data->tb_update_count;
 }
 
 static void __init clocksource_init(void)
-- 
1.7.10.4

Best Regards
Shan Hai

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 0/1] powerpc/vdso: remove redundant locking in update_vsyscall_tz()
The locking in update_vsyscall_tz() is not only unnecessary, because the vdso code copies the data unprotected in __kernel_gettimeofday(), but it also introduces a hard to reproduce race condition between update_vsyscall() and update_vsyscall_tz(), which causes user space processes to loop forever in vdso code. The following patch removes the locking from update_vsyscall_tz().

---
 arch/powerpc/kernel/time.c | 5 -----
 1 file changed, 5 deletions(-)

Thanks
Shan Hai

From 167eac293b07e0ee201ffe5043ec442d52495a48 Mon Sep 17 00:00:00 2001
From: Shan Hai <shan@windriver.com>
Date: Mon, 5 Nov 2012 18:24:22 +0800
Subject: [PATCH 1/1] powerpc/vdso: remove redundant locking in update_vsyscall_tz()

The locking is not only unnecessary, because the vdso code copies the data unprotected in __kernel_gettimeofday(), but also erroneous, because updating tb_update_count is not atomic; this introduces a hard to reproduce race condition between update_vsyscall() and update_vsyscall_tz(), which in turn causes a user space process to loop forever in vdso code.

The scenario below describes the race condition (x is tb_update_count, x==0 initially):

           Boot CPU                       other CPU
       proc_P: x==0
       timer interrupt
           update_vsyscall
 x==1          x++; sync
                                      settimeofday
                                          update_vsyscall_tz
 x==2                                         x++; sync
 x==3          sync; x++
                                              sync; x++
       proc_P: x==3 (loops until x becomes even)

The race arises because the ++ operator is implemented as three instructions and is not atomic on powerpc.

A similar change was made for x86 in commit 6c260d58634 (x86: vdso: Remove bogus locking in update_vsyscall_tz)

Signed-off-by: Shan Hai <shan@windriver.com>
---
 arch/powerpc/kernel/time.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 03b29a6..426141f 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -859,13 +859,8 @@ void update_vsyscall(struct timespec *wall_time, struct timespec *wtm,
 
 void update_vsyscall_tz(void)
 {
-	/* Make userspace gettimeofday spin until we're done. */
-	++vdso_data->tb_update_count;
-	smp_mb();
 	vdso_data->tz_minuteswest = sys_tz.tz_minuteswest;
 	vdso_data->tz_dsttime = sys_tz.tz_dsttime;
-	smp_mb();
-	++vdso_data->tb_update_count;
 }
 
 static void __init clocksource_init(void)
-- 
1.7.10.4
Re: [RFC/PATCH] mm/futex: Fix futex writes on archs with SW tracking of dirty & young
On 07/22/2011 06:59 AM, Andrew Morton wrote:
On Fri, 22 Jul 2011 08:52:06 +1000 Benjamin Herrenschmidt <b...@kernel.crashing.org> wrote:
On Thu, 2011-07-21 at 15:36 -0700, Andrew Morton wrote:
On Tue, 19 Jul 2011 14:29:22 +1000 Benjamin Herrenschmidt <b...@kernel.crashing.org> wrote:

The futex code currently attempts to write to user memory within a pagefault-disabled section, and if that fails, tries to fix it up using get_user_pages(). This doesn't work on archs where the dirty and young bits are maintained by software, since they will gate access permission in the TLB and will not be updated by gup(). In addition, there's an expectation on some archs that a spurious write fault triggers a local TLB flush, and that is missing from the picture as well. I decided that adding those features to gup() would be too much for this already too complex function, and instead added a new, simpler fixup_user_fault(), which is essentially a wrapper around handle_mm_fault() that the futex code can call.

Signed-off-by: Benjamin Herrenschmidt <b...@kernel.crashing.org>
---
Shan, can you test this? It might not fix the problem

um, what problem? There's no description here of the user-visible effects of the bug, hence it's hard to work out which kernel version(s) should receive this patch.

Shan could give you an actual example (it was in the previous thread), but basically: a livelock, as the kernel keeps trying and trying the in_atomic op and never resolves it.

What kernel version(s) should receive this patch?

I haven't dug. Probably anything it applies on, as far back as we did that trick of atomic access + gup() for futexes.

You're not understanding me. I need a good reason to merge this into 3.0. The -stable maintainers need even better reasons to merge this into earlier kernels. Please provide those reasons!

Summary:

- Encountered a 100% system CPU usage problem on a pthread_mutex allocated in a shared memory region; the problem occurs only when priority inheritance (PTHREAD_PRIO_INHERIT) is set on the mutex.
- ftrace results reveal that an infinite loop in futex_lock_pi() caused the high CPU usage.
- The powerpc e500 was affected but x86 was not. I have not tested other archs, so I am not sure whether they are affected by the problem.
- Tested on 2.6.34 and 3.0-rc7; both are affected, and earlier versions might be affected as well.

Please refer to the threads "[PATCH 0/1] Fixup write permission of TLB on powerpc e500 core" and "[PATCH 1/1] Fixup write permission of TLB on powerpc e500 core" for the whole story. The test case code is provided in the [PATCH 0/1].

Thanks
Shan Hai

(Documentation/stable_kernel_rules.txt, 4th bullet)

(And it's not just me and the -stable maintainers. Distro maintainers will also look at this patch and wonder whether they should merge it.)
Re: [RFC/PATCH] mm/futex: Fix futex writes on archs with SW tracking of dirty & young
On 07/19/2011 03:46 PM, Benjamin Herrenschmidt wrote:
On Tue, 2011-07-19 at 13:38 +0800, Shan Hai wrote:

What you said is another path, that is futex_wake_op(), but what about futex_lock_pi(), in which my test case failed? Your patch will call handle_mm_fault() on every futex contention in the futex_lock_pi path.

futex_lock_pi()
	ret = futex_lock_pi_atomic(uaddr, hb, &q.key, &q.pi_state, current, 0);
	case -EFAULT:
		goto uaddr_faulted;
	...
uaddr_faulted:
	ret = fault_in_user_writeable(uaddr);

Euh ... and how do we get to uaddr_faulted? You may have missed the return statement right before it :-) From what I can tell, we only get there as a result of -EFAULT from futex_lock_pi_atomic(), which is exactly the case we are trying to cover.

Got it. If fault_in_user_writeable() is designed to catch exactly the write-permission fault problem we discuss here, then your patch fixes that very nicely. We should fix it up by directly calling handle_mm_fault(), like you did, because we know for sure what just happened (a permission violation); it is not necessary to check what happened by calling gup->follow_page(), and furthermore follow_page() failed to report the fault :-)

Thanks
Shan Hai

.../...

[PATCH 1/1] Fixup write permission of TLB on powerpc e500 core? (I will fix the stupid errors in my original patch if the concept is acceptable.) In this way we could decrease the overhead of handle_mm_fault() in the path which does not need a write-permission fixup.

Which overhead? gup does handle_mm_fault() as well if needed.

It does it *if needed*, and this requirement is rare in my opinion.

And how does gup figure out if it's needed? By walking down the page tables in follow_page... What does handle_mm_fault() do? Walk down the page tables... The main (if not the only) relevant difference here is going to be the spurious-fault TLB invalidate for writes ... which is a nop on x86 and needed in all the cases we care about (and if it's not needed, then it's up to the arch to nop it out; we should probably do it on powerpc too ... but that's an unrelated discussion).

Cheers, Ben.

Thanks
Shan Hai

What I do is I replace what is arguably an abuse of gup() in the case where a fixup -is- needed with a dedicated function designed to perform the said fixup ... and do it properly, which gup() didn't :-)

Cheers, Ben.

Thanks
Shan Hai

Cheers, Ben.

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9670f71..1036614 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -985,6 +985,8 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 			struct page **pages);
 struct page *get_dump_page(unsigned long addr);
+extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
+			    unsigned long address, unsigned int fault_flags);
 
 extern int try_to_release_page(struct page * page, gfp_t gfp_mask);
 extern void do_invalidatepage(struct page *page, unsigned long offset);

diff --git a/kernel/futex.c b/kernel/futex.c
index fe28dc2..7a0a4ed 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -355,8 +355,8 @@ static int fault_in_user_writeable(u32 __user *uaddr)
 	int ret;
 
 	down_read(&mm->mmap_sem);
-	ret = get_user_pages(current, mm, (unsigned long)uaddr,
-			     1, 1, 0, NULL, NULL);
+	ret = fixup_user_fault(current, mm, (unsigned long)uaddr,
+			       FAULT_FLAG_WRITE);
 	up_read(&mm->mmap_sem);
 
 	return ret < 0 ? ret : 0;

diff --git a/mm/memory.c b/mm/memory.c
index 40b7531..b967fb0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1815,7 +1815,64 @@ next_page:
 }
 EXPORT_SYMBOL(__get_user_pages);
 
-/**
+/*
+ * fixup_user_fault() - manually resolve a user page fault
+ * @tsk: the task_struct to use for page fault accounting, or
+ *       NULL if faults are not to be recorded.
+ * @mm: mm_struct of target mm
+ * @address: user address
+ * @fault_flags: flags to pass down to handle_mm_fault()
+ *
+ * This is meant to be called in the specific scenario where for
+ * locking reasons we try to access user memory in atomic context
+ * (within a pagefault_disable() section), this returns -EFAULT,
+ * and we want to resolve the user fault before trying again.
+ *
+ * Typically this is meant to be used by the futex code.
+ *
+ * The main difference with get_user_pages() is that this function
+ * will unconditionally call handle_mm_fault() which will in turn
+ * perform all the necessary SW fixup of the dirty and young bits
+ * in the PTE, while get_user_pages() only guarantees to update
+ * these in the struct page.
+ *
+ * This is important for some architectures where those bits also
+ * gate the access permission to the page because they are
+ * maintained in software
Re: [RFC/PATCH] mm/futex: Fix futex writes on archs with SW tracking of dirty & young
On 07/19/2011 04:26 PM, David Laight wrote:

Got it. If fault_in_user_writeable() is designed to catch exactly the write-permission fault problem we discuss here, then your patch fixes that very nicely. We should fix it up by directly calling handle_mm_fault(), like you did, because we know for sure what just happened (a permission violation); it is not necessary to check what happened by calling gup->follow_page(), and furthermore follow_page() failed to report the fault :-)

One thought I've had - and I don't know enough about the data area in use to know if it is a problem - is what happens if a different cpu faults on the same user page and has already marked it 'valid' between the fault happening and the fault handler looking at the page tables to find out why. If any of the memory areas are shared, it might be that the PTE (etc) might already show the page as writable by the time the fault handler is looking at them - this might confuse it!

David

There is no problem at all if by *valid* you mean page present and writable: when fault_in_user_writeable() is called, the PTE for the shared page was already set up by demand paging, with pte.present and pte.write set, and the reason the fault was taken is a permission violation on a present and writable user page, which occurs on SW dirty/young tracking architectures.

Thanks
Shan Hai
Re: [PATCH 1/1] Fixup write permission of TLB on powerpc e500 core
: */
 		mark_page_accessed(page);
 	}
+	if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED)) {
 		/*
 		 * The preliminary mapping check is mainly to avoid the

@@ -3358,21 +3386,8 @@ int handle_pte_fault(struct mm_struct *mm,
 		if (!pte_write(entry))
 			return do_wp_page(mm, vma, address,
 					  pte, pmd, ptl, entry);
-		entry = pte_mkdirty(entry);
-	}
-	entry = pte_mkyoung(entry);
-	if (ptep_set_access_flags(vma, address, pte, entry,
-				  flags & FAULT_FLAG_WRITE)) {
-		update_mmu_cache(vma, address, pte);
-	} else {
-		/*
-		 * This is needed only for protection faults but the arch code
-		 * is not yet telling us if this is a protection fault or not.
-		 * This still avoids useless tlb flushes for .text page faults
-		 * with threads.
-		 */
-		if (flags & FAULT_FLAG_WRITE)
-			flush_tlb_fix_spurious_fault(vma, address);
 	}
+	handle_pte_sw_young_dirty(vma, address, pte, flags & FAULT_FLAG_WRITE);
 unlock:
 	pte_unmap_unlock(pte, ptl);
 	return 0;

So what about the following?

diff --git a/mm/memory.c b/mm/memory.c
index 9b8a01d..fb48122 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1442,6 +1442,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsig
 	spinlock_t *ptl;
 	struct page *page;
 	struct mm_struct *mm = vma->vm_mm;
+	int fix_write_permission = false;
 
 	page = follow_huge_addr(mm, address, flags & FOLL_WRITE);
 	if (!IS_ERR(page)) {
@@ -1519,6 +1520,11 @@ split_fallthrough:
 		if ((flags & FOLL_WRITE) &&
 		    !pte_dirty(pte) && !PageDirty(page))
 			set_page_dirty(page);
+
+#ifdef CONFIG_FIXUP_WRITE_PERMISSION
+		if ((flags & FOLL_WRITE) && !pte_dirty(pte))
+			fix_write_permission = true;
+#endif
 		/*
 		 * pte_mkyoung() would be more correct here, but atomic care
 		 * is needed to avoid losing the dirty bit: it is easier to use
@@ -1551,7 +1557,7 @@ split_fallthrough:
 unlock:
 	pte_unmap_unlock(ptep, ptl);
 out:
-	return page;
+	return (fix_write_permission == true) ? NULL : page;
 
 bad_page:
 	pte_unmap_unlock(ptep, ptl);

From the CONFIG_FIXUP_WRITE_PERMISSION and (flags & FOLL_WRITE) && !pte_dirty(pte) checks, follow_page() can figure out that the caller wants to write to a (present, writable, non-dirty) PTE, and that the architecture wants the problem fixed up (indicated by CONFIG_FIXUP_WRITE_PERMISSION); so let follow_page() return NULL to __get_user_pages(), and let handle_mm_fault() fix up the dirty/young tracking.

Checking the following code, we can conclude that handle_mm_fault() is ready to handle the faults, and the write-permission violation is a kind of fault, so why don't we let handle_mm_fault() handle that fault?

__get_user_pages()
	while (!(page = follow_page(vma, start, foll_flags))) {
		...
		ret = handle_mm_fault(mm, vma, start, fault_flags);
		...
	}

Thanks
Shan Hai
Re: [PATCH 1/1] Fixup write permission of TLB on powerpc e500 core
On 07/18/2011 03:01 PM, Benjamin Herrenschmidt wrote:
On Mon, 2011-07-18 at 14:48 +0800, Shan Hai wrote:

It could not fix the problem; refer to the following reply for the reasons.

.../...

diff --git a/kernel/futex.c b/kernel/futex.c
index fe28dc2..02adff7 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -355,8 +355,8 @@ static int fault_in_user_writeable(u32 __user *uaddr)
 	int ret;
 
 	down_read(&mm->mmap_sem);
-	ret = get_user_pages(current, mm, (unsigned long)uaddr,
-			     1, 1, 0, NULL, NULL);
+	ret = __get_user_pages(current, mm, (unsigned long)uaddr, 1,
+			       FOLL_WRITE | FOLL_FIXFAULT, NULL, NULL, NULL);

The FOLL_FIXFAULT is filtered out at the following code:

get_user_pages()
	if (write)
		flags |= FOLL_WRITE;

I'm not sure what you're talking about here; you may notice that I'm calling __get_user_pages(), not get_user_pages(). Make sure you get my -second- post of the patch (the one with a proper description and s-o-b), since the first one was a mis-send of a wip version.

I am sorry, I hadn't tried your newer patch. I tried it, but it still could not work in my test environment; I will dig into it and tell you why it failed later.

+
+	if (flags & FOLL_FIXFAULT)
+		handle_pte_sw_young_dirty(vma, address, ptep,
+					  flags & FOLL_WRITE);
 
 	if (flags & FOLL_TOUCH) {
 		if ((flags & FOLL_WRITE) && !pte_dirty(pte) && !PageDirty(page))

Calling handle_pte_sw_young_dirty() before the !pte_dirty(pte) check might have problems.

No, this is on purpose. My initial patch was only calling it under the same condition as the FOLL_TOUCH case, but I got concerned by this whole flush_tlb_fix_spurious_fault() business. Basically, our generic code is designed so that relaxing write protection on a PTE can be done without flushing the TLB on all CPUs, so that a spurious fault on a secondary CPU will flush the TLB at that point. I don't know which arch relies on this feature (ARM maybe?), but if we are going to be semantically equivalent to a real fault, we must also do that, so the right thing to do here is to always call in there if FOLL_FIXFAULT is set. It's up to the caller to only set FOLL_FIXFAULT when really trying to deal with an -EFAULT, to avoid possible unnecessary overhead, but in this case I think we are fine; this is all a fallback slow path.

.../...

So what about the following?

diff --git a/mm/memory.c b/mm/memory.c
index 9b8a01d..fb48122 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1442,6 +1442,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsig
 	spinlock_t *ptl;
 	struct page *page;
 	struct mm_struct *mm = vma->vm_mm;
+	int fix_write_permission = false;
 
 	page = follow_huge_addr(mm, address, flags & FOLL_WRITE);
 	if (!IS_ERR(page)) {
@@ -1519,6 +1520,11 @@ split_fallthrough:
 		if ((flags & FOLL_WRITE) &&
 		    !pte_dirty(pte) && !PageDirty(page))
 			set_page_dirty(page);
+
+#ifdef CONFIG_FIXUP_WRITE_PERMISSION
+		if ((flags & FOLL_WRITE) && !pte_dirty(pte))
+			fix_write_permission = true;
+#endif
 		/*
 		 * pte_mkyoung() would be more correct here, but atomic care
 		 * is needed to avoid losing the dirty bit: it is easier to use
@@ -1551,7 +1557,7 @@ split_fallthrough:
 unlock:
 	pte_unmap_unlock(ptep, ptl);
 out:
-	return page;
+	return (fix_write_permission == true) ? NULL : page;
 
 bad_page:
 	pte_unmap_unlock(ptep, ptl);

Your patch is not only uglier (more ifdefs) but also incomplete, since it doesn't handle the young case, and it doesn't handle the spurious-fault case either.

Yep, I know holding lots of ifdefs everywhere is not so good, but if we had some other way (I don't know how, as of now) to figure out that the arch needs the write-permission fixup, we could eradicate the ugly ifdefs here. I think handle_mm_fault() could do all the dirty/young tracking, because the purpose of making follow_page() return NULL to its caller is that we want handle_mm_fault() to be called on a write-permission protection fault.

Thanks
Shan Hai

What the futex code is trying to do is use gup() as a way to fix up after a fault, which means essentially having the -exact- same semantics as a normal fault would have. Thus, by factoring out the common fault-fixup code and using that exact same code in gup(), we get a much more robust guarantee that this will work in the long run. I don't expect gup to be that commonly used to fix up access after an attempt at doing a user access with page faults disabled; only those cases will need to be modified to use the new flag.

From the CONFIG_FIXUP_WRITE_PERMISSION and (flags & FOLL_WRITE) && !pte_dirty(pte) the follow_page() could figure out that the caller wants to write
Re: [PATCH 1/1] Fixup write permission of TLB on powerpc e500 core
On 07/18/2011 03:36 PM, Benjamin Herrenschmidt wrote:
On Mon, 2011-07-18 at 15:26 +0800, Shan Hai wrote:

I am sorry, I hadn't tried your newer patch. I tried it, but it still could not work in my test environment; I will dig into it and tell you why it failed later.

Ok, please let me know what you find!

I have not found the reason for the failure yet. I tried the following based on your code:

(1)
diff --git a/kernel/futex.c b/kernel/futex.c
index fe28dc2..820556d 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -353,10 +353,11 @@ static int fault_in_user_writeable(u32 __user *uaddr)
 {
 	struct mm_struct *mm = current->mm;
 	int ret;
+	int flags = FOLL_TOUCH | FOLL_GET | FOLL_WRITE | FOLL_FIXFAULT;
 
 	down_read(&mm->mmap_sem);
-	ret = get_user_pages(current, mm, (unsigned long)uaddr,
-			     1, 1, 0, NULL, NULL);
+	ret = __get_user_pages(current, mm, (unsigned long)uaddr, 1,
+			       flags, NULL, NULL, NULL);
 	up_read(&mm->mmap_sem);
 
 	return ret < 0 ? ret : 0;

(2)
diff --git a/mm/memory.c b/mm/memory.c
index 9b8a01d..f7ba26e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
...
+
+	if ((flags & (FOLL_WRITE | FOLL_FIXFAULT)) && !pte_dirty(pte))
+		handle_pte_sw_young_dirty(vma, address, ptep,
+					  FAULT_FLAG_WRITE);
...

And everything looks good, but it still couldn't work; more investigation is needed.

Yep, I know holding lots of ifdefs everywhere is not so good, but if we had some other way (I don't know how, as of now) to figure out that the arch needs the write-permission fixup, we could eradicate the ugly ifdefs here. I think handle_mm_fault() could do all the dirty/young tracking, because the purpose of making follow_page() return NULL to its caller is that we want handle_mm_fault() to be called on a write-permission protection fault.

I see your point. Rather than factoring the fixup code out, we could force gup to call handle_mm_fault()... that makes sense. However, I don't think we should special-case archs. There are plenty of cases where we don't care about this fixup, even on archs that do SW tracking of dirty and young, for example when gup is used for subsequent DMA. Only the (rare?) cases where it's used as a means to fix up a failing atomic user access are relevant. So I believe we should still pass an explicit flag to __get_user_pages(), as I propose, to activate that behaviour.

How about the following one? The write-permission fixup behaviour is triggered explicitly by the trouble-making parts like futex, as you suggested. In this way, follow_page() mimics exactly how the MMU faults on atomic access to the user pages, and we could handle the fault via the already existing handle_mm_fault(), which also does the dirty/young tracking properly.

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9670f71..8a76694 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1546,6 +1546,7 @@ struct page *follow_page(struct vm_area_struct *, unsigned long address,
 #define FOLL_MLOCK	0x40	/* mark page as mlocked */
 #define FOLL_SPLIT	0x80	/* don't return transhuge pages, split them */
 #define FOLL_HWPOISON	0x100	/* check page is hwpoisoned */
+#define FOLL_FIXFAULT	0x200	/* fixup after a fault (PTE dirty/young upd) */
 
 typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
 			void *data);

diff --git a/kernel/futex.c b/kernel/futex.c
index fe28dc2..820556d 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -353,10 +353,11 @@ static int fault_in_user_writeable(u32 __user *uaddr)
 {
 	struct mm_struct *mm = current->mm;
 	int ret;
+	int flags = FOLL_TOUCH | FOLL_GET | FOLL_WRITE | FOLL_FIXFAULT;
 
 	down_read(&mm->mmap_sem);
-	ret = get_user_pages(current, mm, (unsigned long)uaddr,
-			     1, 1, 0, NULL, NULL);
+	ret = __get_user_pages(current, mm, (unsigned long)uaddr, 1,
+			       flags, NULL, NULL, NULL);
 	up_read(&mm->mmap_sem);
 
 	return ret < 0 ? ret : 0;

diff --git a/mm/memory.c b/mm/memory.c
index 9b8a01d..5682501 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1442,6 +1442,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 	spinlock_t *ptl;
 	struct page *page;
 	struct mm_struct *mm = vma->vm_mm;
+	int fix_write_permission = 0;
 
 	page = follow_huge_addr(mm, address, flags & FOLL_WRITE);
 	if (!IS_ERR(page)) {
@@ -1519,6 +1520,9 @@ split_fallthrough:
 		if ((flags & FOLL_WRITE) &&
 		    !pte_dirty(pte) && !PageDirty(page))
 			set_page_dirty(page);
+
+		if ((flags & (FOLL_WRITE | FOLL_FIXFAULT)) && !pte_dirty(pte))
+			fix_write_permission = 1;
 		/*
 		 * pte_mkyoung() would be more correct here, but atomic care
 		 * is needed to avoid losing the dirty bit: it is easier to use
@@ -1551,7 +1555,7 @@ split_fallthrough:
 unlock:
 	pte_unmap_unlock
Re: [RFC/PATCH] mm/futex: Fix futex writes on archs with SW tracking of dirty & young
On 07/19/2011 12:29 PM, Benjamin Herrenschmidt wrote:

The futex code currently attempts to write to user memory within a pagefault-disabled section, and if that fails, tries to fix it up using get_user_pages(). This doesn't work on archs where the dirty and young bits are maintained by software, since they will gate access permission in the TLB and will not be updated by gup(). In addition, there's an expectation on some archs that a spurious write fault triggers a local TLB flush, and that is missing from the picture as well. I decided that adding those features to gup() would be too much for this already too complex function, and instead added a new, simpler fixup_user_fault(), which is essentially a wrapper around handle_mm_fault() that the futex code can call.

Signed-off-by: Benjamin Herrenschmidt <b...@kernel.crashing.org>
---
Shan, can you test this? It might not fix the problem, since I'm starting to have the nasty feeling that you are hitting what is a subtly different issue, or my previous patch should have worked (but then I might have made a stupid mistake as well), but let us know anyway.

Ok, I will test the patch. I think this should work, because it's similar to my first posted patch; the difference is that I tried to do it in futex_atomic_cmpxchg_inatomic(), in the ppc-specific path, at a lower level than yours, as in fault_in_user_writeable() :-) Anyway, I will notify you of the test result.

Thanks
Shan Hai

Cheers, Ben.

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9670f71..1036614 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -985,6 +985,8 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 			struct page **pages);
 struct page *get_dump_page(unsigned long addr);
+extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
+			    unsigned long address, unsigned int fault_flags);
 
 extern int try_to_release_page(struct page * page, gfp_t gfp_mask);
 extern void do_invalidatepage(struct page *page, unsigned long offset);

diff --git a/kernel/futex.c b/kernel/futex.c
index fe28dc2..7a0a4ed 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -355,8 +355,8 @@ static int fault_in_user_writeable(u32 __user *uaddr)
 	int ret;
 
 	down_read(&mm->mmap_sem);
-	ret = get_user_pages(current, mm, (unsigned long)uaddr,
-			     1, 1, 0, NULL, NULL);
+	ret = fixup_user_fault(current, mm, (unsigned long)uaddr,
+			       FAULT_FLAG_WRITE);
 	up_read(&mm->mmap_sem);
 
 	return ret < 0 ? ret : 0;

diff --git a/mm/memory.c b/mm/memory.c
index 40b7531..b967fb0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1815,7 +1815,64 @@ next_page:
 }
 EXPORT_SYMBOL(__get_user_pages);
 
-/**
+/*
+ * fixup_user_fault() - manually resolve a user page fault
+ * @tsk: the task_struct to use for page fault accounting, or
+ *       NULL if faults are not to be recorded.
+ * @mm: mm_struct of target mm
+ * @address: user address
+ * @fault_flags: flags to pass down to handle_mm_fault()
+ *
+ * This is meant to be called in the specific scenario where for
+ * locking reasons we try to access user memory in atomic context
+ * (within a pagefault_disable() section), this returns -EFAULT,
+ * and we want to resolve the user fault before trying again.
+ *
+ * Typically this is meant to be used by the futex code.
+ *
+ * The main difference with get_user_pages() is that this function
+ * will unconditionally call handle_mm_fault() which will in turn
+ * perform all the necessary SW fixup of the dirty and young bits
+ * in the PTE, while get_user_pages() only guarantees to update
+ * these in the struct page.
+ *
+ * This is important for some architectures where those bits also
+ * gate the access permission to the page because they are
+ * maintained in software. On such architecture, gup() will not
+ * be enough to make a subsequent access succeed.
+ *
+ * This should be called with the mm_sem held for read.
+ */
+int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
+		     unsigned long address, unsigned int fault_flags)
+{
+	struct vm_area_struct *vma;
+	int ret;
+
+	vma = find_extend_vma(mm, address);
+	if (!vma || address < vma->vm_start)
+		return -EFAULT;
+
+	ret = handle_mm_fault(mm, vma, address, fault_flags);
+	if (ret & VM_FAULT_ERROR) {
+		if (ret & VM_FAULT_OOM)
+			return -ENOMEM;
+		if (ret & (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE))
+			return -EHWPOISON;
+		if (ret & VM_FAULT_SIGBUS)
+			return -EFAULT;
+		BUG();
+	}
+	if (tsk) {
+		if (ret & VM_FAULT_MAJOR)
+			tsk->maj_flt++;
+		else
Re: [RFC/PATCH] mm/futex: Fix futex writes on archs with SW tracking of dirty & young
On 07/19/2011 12:29 PM, Benjamin Herrenschmidt wrote:
> The futex code currently attempts to write to user memory within a pagefault-disabled section, and if that fails, tries to fix it up using get_user_pages(). This doesn't work on archs where the dirty and young bits are maintained by software, since they will gate access permission in the TLB, and will not be updated by gup().
>
> In addition, there's an expectation on some archs that a spurious write fault triggers a local TLB flush, and that is missing from the picture as well.
>
> I decided that adding those features to gup() would be too much for this already too complex function, and instead added a new, simpler fixup_user_fault() which is essentially a wrapper around handle_mm_fault() which the futex code can call.
>
> Signed-off-by: Benjamin Herrenschmidt <b...@kernel.crashing.org>
> ---
> Shan, can you test this? It might not fix the problem, since I'm starting to have the nasty feeling that you are hitting what is somewhat a subtly different issue, or my previous patch should have worked (but then I might have made a stupid mistake as well), but let us know anyway.

The patch works, but I have certain confusions:

- Do we want to call handle_mm_fault() on each futex_lock_pi() even though in most cases no write permission fixup is needed?
- How about letting the archs do their own write permission fixup, as I did in my original "[PATCH 1/1] Fixup write permission of TLB on powerpc e500 core"? (I will fix the stupid errors in my original patch if the concept is acceptable.) In this way we could avoid the overhead of handle_mm_fault() in the path which does not need a write permission fixup.

Thanks
Shan Hai

> Cheers, Ben.
Re: [RFC/PATCH] mm/futex: Fix futex writes on archs with SW tracking of dirty & young
On 07/19/2011 01:24 PM, Benjamin Herrenschmidt wrote:
> On Tue, 2011-07-19 at 13:17 +0800, Shan Hai wrote:
>> The patch works, but I have certain confusions,
>> - Do we want to call handle_mm_fault() on each futex_lock_pi() even though in most cases no write permission fixup is needed?
>
> Don't we only ever call this when futex_atomic_op_inuser() failed? Which means a fixup -is- needed. The fast path is still there.

What you said is another path, that is futex_wake_op(), but what about futex_lock_pi(), in which my test case failed? Your patch will call handle_mm_fault() on every futex contention in the futex_lock_pi() path:

	futex_lock_pi()
		ret = futex_lock_pi_atomic(uaddr, hb, &q.key, &q.pi_state, current, 0);
		case -EFAULT:
			goto uaddr_faulted;
	...
	uaddr_faulted:
		ret = fault_in_user_writeable(uaddr);

>> - How about let the archs do their own write permission fixup as what I did in my original
>
> Why? This is generic and will fix all archs at once with generic code, which is a significant improvement in my book, and a lot more maintainable :-)

If the overhead in the futex_lock_pi() path is not considerable, then yes, fixing it up generically is nice :-)

>> [PATCH 1/1] Fixup write permission of TLB on powerpc e500 core? (I will fix the stupid errors in my original patch if the concept is acceptable.) In this way we could decrease the overhead of handle_mm_fault in the path which does not need write permission fixup.
>
> Which overhead? gup does handle_mm_fault() as well if needed.

It does it *if needed*, and this requirement is rare in my opinion.

> What I do is I replace what is arguably an abuse of gup() in the case where a fixup -is- needed with a dedicated function designed to perform the said fixup ... and do it properly, which gup() didn't :-)
>
> Cheers, Ben.

Thanks
Shan Hai
Re: [PATCH 1/1] Fixup write permission of TLB on powerpc e500 core
On 07/17/2011 07:02 PM, Peter Zijlstra wrote:
> On Sun, 2011-07-17 at 09:49 +1000, Benjamin Herrenschmidt wrote:
>> In the meantime, other than rewriting the futex code to not require those in-atomic accesses (can't it just access the pages via the linear mapping and/or kmap after the gup?),
>
> That'll wreck performance on things like ARM and SPARC that have to deal with cache aliasing.
>
>> all I see would be a way to force dirty and young after gup, with appropriate locks, or a variant of gup (via a flag?) to require it to do so.
>
> Again, _WHY_ isn't gup(.write=1) a complete write fault? It's supposed to be, it needs to break COW, do dirty page tracking and call page_mkwrite. I'm still thinking this e500 stuff is smoking crack. ARM has no hardware dirty bit either, and yet it works for them. I can't exactly tell how because I got lost in there, but it does, again, suggest e500 is on crack.

Ok, the following feature of the architecture causes the failure of gup(.write=1) on dirtying pages:

- it allows pages to be protected from supervisor-mode writes

On ARM you cannot protect pages from supervisor-mode writes, can you? That means all writable user pages are writable for the supervisor too, but that does not hold for at least x86 and powerpc: both can be configured to protect pages from supervisor-mode writes.

Think about the following situation: a page fault occurs on the kernel trying to write to a writable shared user page which is read-only to the kernel, and the following conditions hold,

- the page is *present*, because it is a shared page
- the page is *writable*, because demand paging set up the pte of the current process that way

The follow_page() called in __get_user_pages() returns non-NULL to its caller on the above-mentioned *present* and *writable* page, so gup(.write=1) has no chance to set the pte dirty by calling handle_mm_fault(); follow_page() has no knowledge of supervisor-mode write-protected pages. That's the culprit in the bug discussed here.
Thanks
Shan Hai

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 1/1] Fixup write permission of TLB on powerpc e500 core
On 07/17/2011 10:48 PM, Benjamin Herrenschmidt wrote:
> On Sun, 2011-07-17 at 21:33 +0800, Shan Hai wrote:
>> On ARM you cannot protect pages from supervisor-mode writes, can you? That means all writable user pages are writable for the supervisor too, but that does not hold for at least x86 and powerpc: both can be configured to protect pages from supervisor-mode writes.
>
> That doesn't sound right... how would put_user() work properly then? A cursory glance at the ARM code doesn't show it doing anything special, just stores ... but I might have missed something.

That's true for ARM; the reason put_user() works properly is that the first-time access to the write-protected page triggers a page fault, and handle_mm_fault() will fix up the write permission for the kernel, because at that time no one has disabled page faults, as is done in the futex case.

>> Think about the following situation: a page fault occurs on the kernel trying to write to a writable shared user page which is read-only to the kernel, and the following conditions hold,
>> - the page is *present*, because it is a shared page
>> - the page is *writable*, because demand paging set up the pte of the current process that way
>> The follow_page() called in __get_user_pages() returns non-NULL to its caller on the above-mentioned *present* and *writable* page, so gup(.write=1) has no chance to set the pte dirty by calling handle_mm_fault(); follow_page() has no knowledge of supervisor-mode write-protected pages. That's the culprit in the bug discussed here.
>
> Right, the problem is with writable pages that have lost (or never had, but usually it's lost, due to swapping for example) their dirty bit, or any page that has lost young. From what I can tell, we need to either fix those bits from the caller of gup (futex code), which sounds nasty, or more easily fix those from gup itself, possibly under control of flags in the write argument to avoid breaking code relying on the existing behaviour, especially vs. dirty.
So, since the SW-tracked dirty/young bits and supervisor-protected pages potentially affect not only *futex* but also other components of the kernel that might access a non-dirty supervisor-protected page, in my opinion it is more sensible to fix this in gup rather than in the futex code.

Thanks
Shan Hai

> Cheers, Ben.
Re: [PATCH 1/1] Fixup write permission of TLB on powerpc e500 core
On 07/15/2011 06:23 AM, Peter Zijlstra wrote:
> On Fri, 2011-07-15 at 16:07 +0800, Shan Hai wrote:
>> The kernel has no write permission on COW pages by default on the e500 core; this will cause an endless loop in futex_lock_pi, because the futex code assumes the kernel has write permission on COW pages. Grant write permission to the kernel on COW pages when an access violation page fault occurs.
>>
>> Signed-off-by: Shan Hai <haishan@gmail.com>
>> ---
>>  arch/powerpc/include/asm/futex.h |   11 ++++++++++-
>>  arch/powerpc/include/asm/tlb.h   |   25 +++++++++++++++++++++++++
>>  2 files changed, 35 insertions(+), 1 deletions(-)
>>
>> [snip]
>>
>> +/* Grant write permission to the kernel on a page. */
>> +static inline void __tlb_fixup_write_permission(struct mm_struct *mm,
>> +				unsigned long address)
>> +{
>> +#if defined(CONFIG_FSL_BOOKE)
>> +	/* Grant write permission to the kernel on a page by setting TLB.SW
>> +	 * bit, the bit setting operation is tricky here, calling
>> +	 * handle_mm_fault with FAULT_FLAG_WRITE causes _PAGE_DIRTY bit of
>> +	 * the pte to be set, the _PAGE_DIRTY of the pte is translated into
>> +	 * TLB.SW on Powerpc e500 core.
>> +	 */
>> +	struct vm_area_struct *vma;
>> +
>> +	vma = find_vma(mm, address);
>
> Uhm, find_vma() needs mmap_sem, and futex_atomic_cmpxchg_inatomic() is most certainly not called with that lock held.
>
>> +	if (likely(vma)) {
>> +		/* only fixup present page */
>> +		if (follow_page(vma, address, FOLL_WRITE)) {
>> +			handle_mm_fault(mm, vma, address, FAULT_FLAG_WRITE);
>
> So how can this toggle your sw dirty/young tracking, that's pretty much what gup(.write=1) does too!

That's right, gup(.write=1) wants to do the same thing as the above code snippet, but it fails for the following reason: get_user_pages() will not dirty the pte, because follow_page() returns non-NULL on a *present* and *writable* page (the page which holds the lock is present because it is a shared page, and writable because demand paging set it up so for a shared writable page), so the handle_mm_fault() in __get_user_pages() never gets called.

Why can the above code do the job? Because calling handle_mm_fault() will set the pte dirty, via [do_anonymous_page(), memory.c]:

	if (vma->vm_flags & VM_WRITE)
		entry = pte_mkwrite(pte_mkdirty(entry));

Thanks
Shan Hai

>> +			flush_tlb_page(vma, address);
>> +		}
>> +	}
>> +#endif
>> +}
>> +
>>  #endif /* __KERNEL__ */
>>  #endif /* __ASM_POWERPC_TLB_H */
Re: [PATCH 0/1] Fixup write permission of TLB on powerpc e500 core
On 07/15/2011 08:20 PM, Benjamin Herrenschmidt wrote:
> On Fri, 2011-07-15 at 11:32 -0400, Shan Hai wrote:
>> I agree with you, the problem could be triggered by accessing any user space page which has kernel read-only permission in a page-fault-disabled context; the problem also affects architectures which depend on SW dirty/young tracking, as stated by Benjamin in this thread. In the e500 case, commit 6cfd8990e27d3a491c1c605d6cbc18a46ae51fef removed the write permission fixup from the TLB miss handlers and left it to generic code, so it might be the right time to fix up the write permission here in the generic code.
>
> But we can't. We must not modify the PTE from an interrupt context, and the atomic variants of user accesses can be called in such contexts. I think the problem is that we try to actually do things other than just peek at user memory (for backtraces etc...), but actually useful things, in page-fault-disabled contexts. That's bad, and various archs' mm were designed with the assumption that this never happens.

Yes, I understood; the *here* above means 'generic code' like the futex code. I am sorry for my ambiguous description.

> If the futex case is seldom here, we could probably find a way to work around in that specific case. That's what my patch wants to do. However, I -still- don't understand why gup didn't fix up the write permission. gup doesn't set dirty?

Yep, gup doesn't set dirty, because when the page fault occurs on the kernel accessing a user page which is read-only to the kernel, the following conditions hold,

- the page is present, because it is a shared page
- the page is writable, because demand paging set up the pte of the current process that way

The follow_page() called in __get_user_pages() returns non-NULL to its caller on the above-mentioned present and writable page, so gup(.write=1) has no chance to set the pte dirty by calling handle_mm_fault().

Thanks
Shan Hai

> Cheers, Ben.
Re: [PATCH 1/1] Fixup write permission of TLB on powerpc e500 core
On 07/15/2011 11:24 AM, Peter Zijlstra wrote:
> On Fri, 2011-07-15 at 11:18 -0400, Shan Hai wrote:
>>>> +	vma = find_vma(mm, address);
>>>
>>> Uhm, find_vma() needs mmap_sem, and futex_atomic_cmpxchg_inatomic() is most certainly not called with that lock held.
>>
>> My fault, that will be fixed in the V2 patch.
>
> But you cannot, the function isn't called _atomic_ just for kicks, it's used while holding spinlocks.

Yes we can do that: _atomic_ here is just the atomicity of the cmpxchg, implemented by the combination of the 'lwarx' and 'stwcx.' instructions as done in the spinlock implementation, so even if we hold the mmap_sem that has no impact on the _atomic_ feature of futex_atomic_cmpxchg_inatomic().

>>>> +	if (likely(vma)) {
>>>> +		/* only fixup present page */
>>>> +		if (follow_page(vma, address, FOLL_WRITE)) {
>>>> +			handle_mm_fault(mm, vma, address, FAULT_FLAG_WRITE);
>>>
>>> So how can this toggle your sw dirty/young tracking, that's pretty much what gup(.write=1) does too!
>>
>> Because the kernel-read-only permission of the page is transparent to follow_page(), handle_mm_fault() is never activated in __get_user_pages(), so gup(.write=1) cannot help to fix up the write permission.
>
> So why do you need the vma? Is it like I wrote earlier that you don't have spare PTE bits and need the vma flags to see if it may become writable?

Need the vma just in order to call handle_mm_fault(), that's all.

> gup(.write=1) not triggering this is a serious problem though, not something you can just paper over. I wouldn't be at all surprised to find there's more things broken because of that.

In my opinion another solution might be to check the read-only-for-kernel property of a page in follow_page() on gup(.write=1), to avoid this problem on all architectures.

Thanks
Shan Hai
[PATCH 0/1] Fixup write permission of TLB on powerpc e500 core
The following test case reveals a bug in futex_lock_pi().

BUG: On FUTEX_LOCK_PI, there is an infinite loop in futex_lock_pi() on the Powerpc e500 core.

Cause: The Linux kernel on the e500 core has no write permission on the COW page; refer to the head comment of the following test code.

ftrace on the test case:

 [000] 353.990181: futex_lock_pi_atomic <-futex_lock_pi
 [000] 353.990185: cmpxchg_futex_value_locked <-futex_lock_pi_atomic
 [snip]
 [000] 353.990191: do_page_fault <-handle_page_fault
 [000] 353.990192: bad_page_fault <-handle_page_fault
 [000] 353.990193: search_exception_tables <-bad_page_fault
 [snip]
 [000] 353.990199: get_user_pages <-fault_in_user_writeable
 [snip]
 [000] 353.990208: mark_page_accessed <-follow_page
 [000] 353.990222: futex_lock_pi_atomic <-futex_lock_pi
 [snip]
 [000] 353.990230: cmpxchg_futex_value_locked <-futex_lock_pi_atomic
 [ a loop occurs here ]

/*
 * A test case for revealing an infinite loop in the futex_lock_pi().
 * - there are 2 processes, parent and a child
 * - the parent process allocates and initializes a pthread_mutex MUTEX in a
 *   shared memory region
 * - the parent process holds the MUTEX and does long time computing
 * - the child process tries to hold the MUTEX during the parent holding it and
 *   traps into the kernel for waiting on the MUTEX because of contention
 * - the kernel loops in futex_lock_pi()
 * - result of 'top' command reveals that the system usage of CPU is 100%
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <errno.h>
#include <pthread.h>
#include <string.h>
#include <signal.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/mman.h>

enum { SHM_INIT, SHM_GET };
enum { PARENT, CHILD };

#define FIXED_MMAP_ADDR 0x2000
#define MMAP_SIZE 0x200

static int shmid;
static char shm_name[100];
static int sleep_period = 10;

void * shmem_init(int flag)
{
	int start = FIXED_MMAP_ADDR;
	int memory_size = MMAP_SIZE;
	int mode = 0666;
	void *addr;
	int ret;

	sprintf(shm_name, "/shmem_1234");
	shmid = shm_open(shm_name, O_RDWR | O_EXCL | O_CREAT | O_TRUNC, mode);
	if (shmid < 0) {
		if (errno == EEXIST) {
			printf("shm_open: %s\n", strerror(errno));
			shmid = shm_open(shm_name, O_RDWR, mode);
		} else {
			printf("failed to shm_open, err=%s\n", strerror(errno));
			return NULL;
		}
	}

	ret = fcntl(shmid, F_SETFD, FD_CLOEXEC);
	if (ret < 0) {
		printf("fcntl: %s\n", strerror(errno));
		return NULL;
	}

	ret = ftruncate(shmid, memory_size);
	if (ret < 0) {
		printf("ftruncate: %s\n", strerror(errno));
		return NULL;
	}

	addr = mmap((void *)start, memory_size, PROT_READ | PROT_WRITE,
		    MAP_SHARED | MAP_FIXED, shmid, 0);
	if (addr == MAP_FAILED) {
		printf("mmap: %s\n", strerror(errno));
		close(shmid);
		shm_unlink(shm_name);
		return NULL;
	}

	if (flag == SHM_INIT)
		memset(addr, 0, memory_size);

	return (void *)start;
}

pthread_mutex_t * shmem_mutex_init(int flag)
{
	pthread_mutex_t *pmutex = (pthread_mutex_t *)shmem_init(flag);
	pthread_mutexattr_t attr;

	if (flag == SHM_INIT) {
		pthread_mutexattr_init(&attr);
		pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
		pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
		pthread_mutexattr_setrobust_np(&attr, PTHREAD_MUTEX_STALLED_NP);
		pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
		if (pthread_mutex_init(pmutex, &attr) != 0) {
			printf("Init mutex failed, err=%s\n", strerror(errno));
			pthread_mutexattr_destroy(&attr);
			return NULL;
		}
	}

	return pmutex;
}

void long_running_task(int flag)
{
	static int counter = 0;

	if (flag == PARENT)
		usleep(5*sleep_period);
	else
		usleep(3*sleep_period);

	counter = (counter + 1) % 100;
	printf("%d: completed %d computing\n", getpid(), counter);
}

void sig_handler(int signum)
{
	close(shmid);
	shm_unlink(shm_name);
	exit(0);
}

int main(int argc, char *argv[])
{
	pthread_mutex_t *mutex_parent, *mutex_child;

	signal(SIGUSR1, sig_handler);

	if (fork()) { /* parent process */
		if ((mutex_parent = shmem_mutex_init(SHM_INIT)) == NULL) {
			printf("failed to get the shmem_mutex\n");
			exit(-1);
		}
[PATCH 1/1] Fixup write permission of TLB on powerpc e500 core
The kernel has no write permission on COW pages by default on the e500 core; this will cause an endless loop in futex_lock_pi, because the futex code assumes the kernel has write permission on COW pages. Grant write permission to the kernel on COW pages when an access violation page fault occurs.

Signed-off-by: Shan Hai <haishan@gmail.com>
---
 arch/powerpc/include/asm/futex.h |   11 ++++++++++-
 arch/powerpc/include/asm/tlb.h   |   25 +++++++++++++++++++++++++
 2 files changed, 35 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/include/asm/futex.h b/arch/powerpc/include/asm/futex.h
index c94e4a3..54c3e74 100644
--- a/arch/powerpc/include/asm/futex.h
+++ b/arch/powerpc/include/asm/futex.h
@@ -8,6 +8,7 @@
 #include <asm/errno.h>
 #include <asm/synch.h>
 #include <asm/asm-compat.h>
+#include <asm/tlb.h>
 
 #define __futex_atomic_op(insn, ret, oldval, uaddr, oparg) \
   __asm__ __volatile ( \
@@ -113,7 +114,15 @@ futex_atomic_cmpxchg_inatomic(u32 *uval, u32 __user *uaddr,
         : "cc", "memory");
 
 	*uval = prev;
-	return ret;
+
+	/* Futex assumes the kernel has permission to write to
+	 * COW pages, grant the kernel write permission on COW
+	 * pages because it has none by default.
+	 */
+	if (ret == -EFAULT)
+		__tlb_fixup_write_permission(current->mm, (unsigned long)uaddr);
+
+	return ret;
 }
 
 #endif /* __KERNEL__ */
diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
index e2b428b..3863c6a 100644
--- a/arch/powerpc/include/asm/tlb.h
+++ b/arch/powerpc/include/asm/tlb.h
@@ -45,5 +45,30 @@ static inline void __tlb_remove_tlb_entry(struct mmu_gather *tlb, pte_t *ptep,
 #endif
 }
 
+/* Grant write permission to the kernel on a page. */
+static inline void __tlb_fixup_write_permission(struct mm_struct *mm,
+				unsigned long address)
+{
+#if defined(CONFIG_FSL_BOOKE)
+	/* Grant write permission to the kernel on a page by setting TLB.SW
+	 * bit, the bit setting operation is tricky here, calling
+	 * handle_mm_fault with FAULT_FLAG_WRITE causes _PAGE_DIRTY bit of
+	 * the pte to be set, the _PAGE_DIRTY of the pte is translated into
+	 * TLB.SW on Powerpc e500 core.
+	 */
+	struct vm_area_struct *vma;
+
+	vma = find_vma(mm, address);
+	if (likely(vma)) {
+		/* only fixup present page */
+		if (follow_page(vma, address, FOLL_WRITE)) {
+			handle_mm_fault(mm, vma, address, FAULT_FLAG_WRITE);
+			flush_tlb_page(vma, address);
+		}
+	}
+#endif
+}
+
 #endif /* __KERNEL__ */
 #endif /* __ASM_POWERPC_TLB_H */
-- 
1.7.1
Re: [PATCH 0/1] Fixup write permission of TLB on powerpc e500 core
On 07/15/2011 04:44 PM, Peter Zijlstra wrote:
> On Fri, 2011-07-15 at 16:38 +0800, MailingLists wrote:
>> On 07/15/2011 04:20 PM, Peter Zijlstra wrote:
>>> On Fri, 2011-07-15 at 16:07 +0800, Shan Hai wrote:
>>>> The following test case could reveal a bug in the futex_lock_pi()
>>>>
>>>> BUG: On FUTEX_LOCK_PI, there is an infinite loop in futex_lock_pi() on the Powerpc e500 core.
>>>>
>>>> Cause: The Linux kernel on the e500 core has no write permission on the COW page, refer to the head comment of the following test code.
>>>>
>>>> [ftrace snipped]
>>>
>>> But but but but, that get_user_pages(.write=1, .force=0) should result in a COW break, getting our own writable page. What is this e500 thing smoking that this doesn't work?
>>
>> A page could be set to read-only by the kernel (supervisor in the powerpc literature) on the e500, and that's what the kernel does: it sets the SW (supervisor write) bit in the TLB entry to grant the kernel write permission on a page. Further, the SW bit is set according to the DIRTY flag of the PTE; PTE.DIRTY is set in do_page_fault(), but futex_lock_pi() disabled page faults, so PTE.DIRTY can never be set, and neither can the SW bit: unbreakable COW occurred, infinite loop followed.
>
> I'm fairly sure fault_in_user_writeable() has PF enabled as it takes mmap_sem, and pagefault_disable() is akin to preempt_disable() on mainline. Also get_user_pages() fully expects to be able to schedule, and in fact can call the full pf handler path all by its lonesome self.
The whole scenario should be:

- the child process triggers a page fault at the first access to the lock, and it gets its own writable page, but the page is *clean*, because the access was just checking the status of the lock (I am sorry for the "unbreakable COW" above)
- futex_lock_pi() is invoked because of the lock contention, and futex_atomic_cmpxchg_inatomic() tries to take the lock; it finds the lock free, so it tries to write to the lock for reservation; a page fault occurs because the page is read-only for the kernel (e500 specific), and -EFAULT is returned to the caller
- fault_in_user_writeable() tries to fix the fault, but from the get_user_pages() point of view everything is OK, because the COW was already broken; retry futex_lock_pi_atomic()
- futex_lock_pi_atomic() -> futex_atomic_cmpxchg_inatomic(): another write protection page fault
- infinite loop

Thanks
Shan Hai
Re: [PATCH 0/1] Fixup write permission of TLB on powerpc e500 core
On 07/15/2011 05:50 PM, Peter Zijlstra wrote:
> On Fri, 2011-07-15 at 17:08 +0800, Shan Hai wrote:
>> The whole scenario should be,
>> [snip]
>> - fault_in_user_writeable() tries to fix the fault, but from the get_user_pages() point of view everything is OK, because the COW was already broken; retry futex_lock_pi_atomic()
>
> but that's a bug right there, gup(.write=1) _should_ be a complete write fault, and as such toggle your sw dirty/young tracking.

The fault-causing futex_atomic_cmpxchg_inatomic() is protected by pagefault_disable(), so the page fault handler has no chance to toggle the SW dirty/young tracking.

Thanks
Shan Hai

>> - futex_lock_pi_atomic() -> futex_atomic_cmpxchg_inatomic(): another write protection page fault
>> - infinite loop
Re: [PATCH 1/1] Fixup write permission of TLB on powerpc e500 core
On 07/15/2011 06:23 AM, Peter Zijlstra wrote:
> On Fri, 2011-07-15 at 16:07 +0800, Shan Hai wrote:
>> The kernel has no write permission on COW pages by default on the
>> e500 core; this causes an endless loop in futex_lock_pi, because the
>> futex code assumes the kernel has write permission on COW pages.
>> Grant the kernel write permission on COW pages when an access
>> violation page fault occurs.
>>
>> Signed-off-by: Shan Hai <haishan@gmail.com>
>> ---
>>  arch/powerpc/include/asm/futex.h |   11 ++-
>>  arch/powerpc/include/asm/tlb.h   |   25 +
>>  2 files changed, 35 insertions(+), 1 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/futex.h b/arch/powerpc/include/asm/futex.h
>> index c94e4a3..54c3e74 100644
>> --- a/arch/powerpc/include/asm/futex.h
>> +++ b/arch/powerpc/include/asm/futex.h
>> @@ -8,6 +8,7 @@
>>  #include <asm/errno.h>
>>  #include <asm/synch.h>
>>  #include <asm/asm-compat.h>
>> +#include <asm/tlb.h>
>>
>>  #define __futex_atomic_op(insn, ret, oldval, uaddr, oparg) \
>>    __asm__ __volatile ( \
>> @@ -113,7 +114,15 @@ futex_atomic_cmpxchg_inatomic(u32 *uval, u32 __user *uaddr,
>>          : "cc", "memory");
>>
>>  	*uval = prev;
>> -	return ret;
>> +
>> +	/* Futex assumes the kernel has permission to write to
>> +	 * COW pages, grant the kernel write permission on COW
>> +	 * pages because it has none by default.
>> +	 */
>> +	if (ret == -EFAULT)
>> +		__tlb_fixup_write_permission(current->mm, (unsigned long)uaddr);
>> +
>> +	return ret;
>>  }
>>
>>  #endif /* __KERNEL__ */
>>
>> diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
>> index e2b428b..3863c6a 100644
>> --- a/arch/powerpc/include/asm/tlb.h
>> +++ b/arch/powerpc/include/asm/tlb.h
>> @@ -45,5 +45,30 @@ static inline void __tlb_remove_tlb_entry(struct mmu_gather *tlb, pte_t *ptep,
>>  #endif
>>  }
>>
>> +/* Grant write permission to the kernel on a page.
>> + */
>> +static inline void __tlb_fixup_write_permission(struct mm_struct *mm,
>> +						unsigned long address)
>> +{
>> +#if defined(CONFIG_FSL_BOOKE)
>> +	/* Grant write permission to the kernel on a page by setting the
>> +	 * TLB.SW bit. The bit setting operation is tricky here: calling
>> +	 * handle_mm_fault with FAULT_FLAG_WRITE causes the _PAGE_DIRTY
>> +	 * bit of the pte to be set, and the _PAGE_DIRTY of the pte is
>> +	 * translated into TLB.SW on the PowerPC e500 core.
>> +	 */
>> +
>> +	struct vm_area_struct *vma;
>> +
>> +	vma = find_vma(mm, address);
>
> Uhm, find_vma() needs mmap_sem, and futex_atomic_cmpxchg_inatomic()
> is most certainly not called with that lock held.

My fault, that will be fixed in the V2 patch.

>> +	if (likely(vma)) {
>> +		/* only fixup present page */
>> +		if (follow_page(vma, address, FOLL_WRITE)) {
>> +			handle_mm_fault(mm, vma, address, FAULT_FLAG_WRITE);
>
> So how can this toggle your sw dirty/young tracking, that's pretty
> much what gup(.write=1) does too!

Because the kernel's read-only permission on the page is transparent
to follow_page(), handle_mm_fault() is not activated in
__get_user_pages(), so gup(.write=1) cannot fix up the write
permission.

Thanks
Shan Hai

>> +			flush_tlb_page(vma, address);
>> +		}
>> +	}
>> +#endif
>> +}
>> +
>>  #endif /* __KERNEL__ */
>>  #endif /* __ASM_POWERPC_TLB_H */
Re: [PATCH 0/1] Fixup write permission of TLB on powerpc e500 core
On 07/15/2011 06:32 AM, David Laight wrote:
>> The fault-causing futex_atomic_cmpxchg_inatomic() is protected by
>> pagefault_disable(), so the page fault handler has no chance to
>> toggle the SW dirty/young tracking.
>
> Perhaps that is the bug! Whatever pagefault_disable() does, it
> shouldn't disable the SW dirty/young tracking - which should only
> need bits moving in the page table itself (and a TLB update?) rather
> than any operations on the rest of the data areas. It looks to me as
> though this could happen any time a page is marked inaccessible by
> the dirty/young tracking, not just as a result of COW.

I agree with you. The problem can be triggered by accessing, in a
page-fault-disabled context, any user space page on which the kernel
has only read permission; it also affects other architectures that
depend on SW dirty/young tracking, as Benjamin stated earlier in this
thread.

In the e500 case, commit 6cfd8990e27d3a491c1c605d6cbc18a46ae51fef
removed the write permission fixup from the TLB miss handlers and left
it to generic code, so it might be the right time to fix up the write
permission there in the generic code.

Thanks
Shan Hai

David
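David's point can be illustrated with a tiny userspace model of software dirty tracking (all names below are hypothetical, not kernel structures): hardware write permission is withheld until the first write fault sets the dirty bit, so a write attempted with page faults disabled fails without the tracking bits ever being updated.

```c
#include <assert.h>

/* Hypothetical PTE with software-tracked dirty state: the mapping is
 * writable, but the hardware write bit stays clear until the first
 * write fault marks the page dirty (cf. _PAGE_DIRTY -> TLB.SW on
 * e500). */
struct pte_model {
	int vma_writable;  /* the mapping permits writes            */
	int hw_writable;   /* the TLB entry actually permits writes */
	int dirty;         /* software dirty bit                    */
};

/* Write-fault handler: mark the page dirty, then grant hardware
 * write permission. */
static int handle_write_fault(struct pte_model *pte)
{
	if (!pte->vma_writable)
		return -1;   /* genuine protection fault */
	pte->dirty = 1;
	pte->hw_writable = 1;
	return 0;
}

/* A store: succeeds outright, faults into the handler, or -- with
 * page faults disabled -- fails with -EFAULT (-14) and leaves the
 * tracking bits untouched, which is the situation in the futex
 * path. */
static int do_write(struct pte_model *pte, int pagefault_disabled)
{
	if (pte->hw_writable)
		return 0;
	if (pagefault_disabled)
		return -14;
	return handle_write_fault(pte);
}
```

The first write with faults disabled fails and leaves the page clean; once a write is allowed to fault normally, the dirty bit is set and later fault-disabled writes succeed.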
math_efp.c: Fixed SPE data type conversion failure
The following test case failed on a PowerPC sbc8548 with CONFIG_SPE:

#include <stdio.h>

static float fm;
static signed int si_min = (-2147483647 - 1);
static unsigned int ui;

int main()
{
	fm = (float) si_min;
	ui = (unsigned int) fm;
	printf("ui=%d, should be %d\n", ui, si_min);
	return 0;
}

Result: ui=-1, should be -2147483648

The reason is a failure to emulate the negative float to unsigned
integer conversion instruction in the SPE emulation code.
[PATCH] Fix SPE float to integer conversion failure
Conversion from float to integer should be based on both the
instruction encoding and the sign of the operand.

Signed-off-by: Shan Hai <shan@windriver.com>
---
 arch/powerpc/math-emu/math_efp.c |   12 ++++++++----
 1 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/math-emu/math_efp.c b/arch/powerpc/math-emu/math_efp.c
index 41f4ef3..64faf10 100644
--- a/arch/powerpc/math-emu/math_efp.c
+++ b/arch/powerpc/math-emu/math_efp.c
@@ -320,7 +320,8 @@ int do_spe_mathemu(struct pt_regs *regs)
 			} else {
 				_FP_ROUND_ZERO(1, SB);
 			}
-			FP_TO_INT_S(vc.wp[1], SB, 32, ((func & 0x3) != 0));
+			FP_TO_INT_S(vc.wp[1], SB, 32,
+					(((func & 0x3) != 0) || SB_s));
 			goto update_regs;

 		default:
@@ -458,7 +459,8 @@ cmp_s:
 			} else {
 				_FP_ROUND_ZERO(2, DB);
 			}
-			FP_TO_INT_D(vc.wp[1], DB, 32, ((func & 0x3) != 0));
+			FP_TO_INT_D(vc.wp[1], DB, 32,
+					(((func & 0x3) != 0) || DB_s));
 			goto update_regs;

 		default:
@@ -589,8 +591,10 @@ cmp_d:
 				_FP_ROUND_ZERO(1, SB0);
 				_FP_ROUND_ZERO(1, SB1);
 			}
-			FP_TO_INT_S(vc.wp[0], SB0, 32, ((func & 0x3) != 0));
-			FP_TO_INT_S(vc.wp[1], SB1, 32, ((func & 0x3) != 0));
+			FP_TO_INT_S(vc.wp[0], SB0, 32,
+					(((func & 0x3) != 0) || SB0_s));
+			FP_TO_INT_S(vc.wp[1], SB1, 32,
+					(((func & 0x3) != 0) || SB1_s));
 			goto update_regs;

 		default:
--
1.7.0.2
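The signedness decision the patch makes can be sketched in userspace C. The helper below is hypothetical: it mirrors the `(((func & 0x3) != 0) || SB_s)` condition from the patch but omits the saturation and NaN handling done by the kernel's soft-fp FP_TO_INT macros.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the fixed conversion rule: convert as signed
 * whenever the instruction encoding asks for it (func & 0x3) or the
 * operand is negative.  Saturation/NaN handling of the real soft-fp
 * macros is omitted. */
static uint32_t fp_to_int32(float f, unsigned int func)
{
	int rsigned = ((func & 0x3) != 0) || (f < 0.0f);

	if (rsigned)
		return (uint32_t)(int32_t)f; /* signed convert, reinterpret */
	return (uint32_t)f;                  /* plain unsigned convert */
}
```

For `(float)INT_MIN` this yields 0x80000000, which prints as -2147483648 with %d, matching what the earlier test case expects, rather than the -1 observed before the fix.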