
Re: [PATCH 0/1] powerpc/vdso: remove redundant locking in update_vsyscall_tz()

2012-11-14 Thread Shan Hai
On Fri, Nov 09, 2012 at 09:57:49AM +0800, Shan Hai wrote:
 The locking in update_vsyscall_tz() is not only unnecessary because the vdso
 code copies the data unprotected in __kernel_gettimeofday() but also
 introduces a hard to reproduce race condition between update_vsyscall()
 and update_vsyscall_tz(), which causes user space processes to loop
 forever in vdso code.
 
 The following patch removes the locking from update_vsyscall_tz().
 
 ---
 arch/powerpc/kernel/time.c | 5 -----
  1 file changed, 5 deletions(-)


Hi Ben,

Would you please review this one?
 
From 167eac293b07e0ee201ffe5043ec442d52495a48 Mon Sep 17 00:00:00 2001
From: Shan Hai <shan@windriver.com>
Date: Mon, 5 Nov 2012 18:24:22 +0800
Subject: [PATCH 1/1] powerpc/vdso: remove redundant locking in
 update_vsyscall_tz()

Locking is not only unnecessary because the vdso code copies the data
unprotected in __kernel_gettimeofday() but also erroneous because updating
the tb_update_count is not atomic and introduces a hard to reproduce race
condition between update_vsyscall() and update_vsyscall_tz(), which further
causes user space process to loop forever in vdso code.

The below scenario describes the race condition,
x==0	Boot CPU			other CPU
	proc_P: x==0
	timer interrupt
		update_vsyscall
x==1		x++;sync		settimeofday
	update_vsyscall_tz
x==2		x++;sync
x==3		sync;x++
		sync;x++
	proc_P: x==3 (loops until x becomes even)

The race arises because the ++ operator is implemented as three
instructions and is not atomic on powerpc.

A similar change was made for x86 in commit 6c260d58634
(x86: vdso: Remove bogus locking in update_vsyscall_tz)

Signed-off-by: Shan Hai <shan@windriver.com>
---
 arch/powerpc/kernel/time.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 03b29a6..426141f 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -859,13 +859,8 @@ void update_vsyscall(struct timespec *wall_time, struct timespec *wtm,
 
 void update_vsyscall_tz(void)
 {
-	/* Make userspace gettimeofday spin until we're done. */
-	++vdso_data->tb_update_count;
-	smp_mb();
 	vdso_data->tz_minuteswest = sys_tz.tz_minuteswest;
 	vdso_data->tz_dsttime = sys_tz.tz_dsttime;
-	smp_mb();
-	++vdso_data->tb_update_count;
 }
 
 static void __init clocksource_init(void)
-- 
1.7.10.4

Best Regards
Shan Hai


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH 0/1] powerpc/vdso: remove redundant locking in update_vsyscall_tz()

2012-11-08 Thread Shan Hai
The locking in update_vsyscall_tz() is not only unnecessary because the vdso
code copies the data unprotected in __kernel_gettimeofday() but also
introduces a hard to reproduce race condition between update_vsyscall()
and update_vsyscall_tz(), which causes user space processes to loop
forever in vdso code.

The following patch removes the locking from update_vsyscall_tz().

---
 arch/powerpc/kernel/time.c | 5 -----
 1 file changed, 5 deletions(-)

Thanks
Shan Hai
From 167eac293b07e0ee201ffe5043ec442d52495a48 Mon Sep 17 00:00:00 2001
From: Shan Hai <shan@windriver.com>
Date: Mon, 5 Nov 2012 18:24:22 +0800
Subject: [PATCH 1/1] powerpc/vdso: remove redundant locking in
 update_vsyscall_tz()

Locking is not only unnecessary because the vdso code copies the data
unprotected in __kernel_gettimeofday() but also erroneous because updating
the tb_update_count is not atomic and introduces a hard to reproduce race
condition between update_vsyscall() and update_vsyscall_tz(), which further
causes user space process to loop forever in vdso code.

The below scenario describes the race condition,
x==0	Boot CPU			other CPU
	proc_P: x==0
	timer interrupt
		update_vsyscall
x==1		x++;sync		settimeofday
	update_vsyscall_tz
x==2		x++;sync
x==3		sync;x++
		sync;x++
	proc_P: x==3 (loops until x becomes even)

The race arises because the ++ operator is implemented as three
instructions and is not atomic on powerpc.

A similar change was made for x86 in commit 6c260d58634
(x86: vdso: Remove bogus locking in update_vsyscall_tz)

Signed-off-by: Shan Hai <shan@windriver.com>
---
 arch/powerpc/kernel/time.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 03b29a6..426141f 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -859,13 +859,8 @@ void update_vsyscall(struct timespec *wall_time, struct timespec *wtm,
 
 void update_vsyscall_tz(void)
 {
-	/* Make userspace gettimeofday spin until we're done. */
-	++vdso_data->tb_update_count;
-	smp_mb();
 	vdso_data->tz_minuteswest = sys_tz.tz_minuteswest;
 	vdso_data->tz_dsttime = sys_tz.tz_dsttime;
-	smp_mb();
-	++vdso_data->tb_update_count;
 }
 
 static void __init clocksource_init(void)
-- 
1.7.10.4


Re: [RFC/PATCH] mm/futex: Fix futex writes on archs with SW tracking of dirty & young

2011-07-21 Thread Shan Hai

On 07/22/2011 06:59 AM, Andrew Morton wrote:

On Fri, 22 Jul 2011 08:52:06 +1000
Benjamin Herrenschmidt <b...@kernel.crashing.org> wrote:


On Thu, 2011-07-21 at 15:36 -0700, Andrew Morton wrote:

On Tue, 19 Jul 2011 14:29:22 +1000
Benjamin Herrenschmidt <b...@kernel.crashing.org> wrote:


The futex code currently attempts to write to user memory within
a pagefault disabled section, and if that fails, tries to fix it
up using get_user_pages().

This doesn't work on archs where the dirty and young bits are
maintained by software, since they will gate access permission
in the TLB, and will not be updated by gup().

In addition, there's an expectation on some archs that a
spurious write fault triggers a local TLB flush, and that is
missing from the picture as well.

I decided that adding those features to gup() would be too much
for this already too complex function, and instead added a new
simpler fixup_user_fault() which is essentially a wrapper around
handle_mm_fault() which the futex code can call.

Signed-off-by: Benjamin Herrenschmidt <b...@kernel.crashing.org>
---

Shan, can you test this ? It might not fix the problem

um, what problem.  There's no description here of the user-visible
effects of the bug hence it's hard to work out what kernel version(s)
should receive this patch.

Shan could give you an actual example (it was in the previous thread),
but basically, livelock as the kernel keeps trying and trying the
in_atomic op and never resolves it.


What kernel version(s) should receive this patch?

I haven't dug. Probably anything it applies on as far as we did that
trick of atomic + gup() for futex.

You're not understanding me.

I need a good reason to merge this into 3.0.

The -stable maintainers need even better reasons to merge this into
earlier kernels.

Please provide those reasons!



Summary:
- Encountered a 100% CPU system usage problem on a pthread_mutex allocated
in a shared memory region; the problem occurs only when PRIORITY_INHERITANCE
is set on the pthread_mutex.
- The ftrace result reveals that an infinite loop in futex_lock_pi
caused the high CPU usage.
- The powerpc e500 was affected but the x86 was not. I have not tested
on other archs, so I am not sure whether the other archs are affected
by the problem.
- Tested it on 2.6.34 and 3.0-rc7; both are affected, and earlier
versions might be affected too.

Please refer to the threads [PATCH 0/1] Fixup write permission of TLB on
powerpc e500 core and [PATCH 1/1] Fixup write permission of TLB on
powerpc e500 core for the whole story.

The test case code is provided in the [PATCH 0/1].

Thanks
Shan Hai


(Documentation/stable_kernel_rules.txt, 4th bullet)

(And it's not just me and -stable maintainers.  Distro maintainers will
also look at this patch and wonder whether they should merge it)




Re: [RFC/PATCH] mm/futex: Fix futex writes on archs with SW tracking of dirty & young

2011-07-19 Thread Shan Hai

On 07/19/2011 03:46 PM, Benjamin Herrenschmidt wrote:

On Tue, 2011-07-19 at 13:38 +0800, Shan Hai wrote:


What you said is another path, that is futex_wake_op(),
but what about futex_lock_pi in which my test case failed?
your patch will call handle_mm_fault on every futex contention
in the futex_lock_pi path.

futex_lock_pi()
  ret = futex_lock_pi_atomic(uaddr, hb, &q.key, &q.pi_state, current, 0);
  case -EFAULT:
  goto uaddr_faulted;

  ...
uaddr_faulted:
  ret = fault_in_user_writeable(uaddr);

Euh ... and how do we get to uaddr_faulted ? You may have missed the
return statement right before it :-)

 From what I can tell we only get there as a result of -EFAULT from
futex_lock_pi_atomic() which is exactly the case we are trying to
cover.



Got it. If the fault_in_user_writeable() is designed to catch the
exact same write permission fault problem we discuss here, then
your patch fixes that very nicely. We should fix it up by directly
calling handle_mm_fault like you did, because we know for sure
what just happened (a permission violation); it's not necessary
to check what happened by calling gup->follow_page, and
furthermore follow_page failed to report the fault :-)

Thanks
Shan Hai


  .../...

   [PATCH 1/1] Fixup write permission of TLB on powerpc e500 core?
   (I will fix the stupid errors in my original patch if the concept
is acceptable)
   in this way we could decrease the overhead of handle_mm_fault
   in the path which does not need write permission fixup.

Which overhead ? gup does handle_mm_fault() as well if needed.

it does it *if needed*, and this requirement is rare in my opinion.

And how does gup figure out if it's needed ? By walking down the page
tables in follow_page... what does handle_mm_fault do ? walk down the
page tables...

The main (if not the only) relevant difference here is going to be the
spurious fault TLB invalidate for writes ... which is a nop on x86
and needed in all the cases we care about (and if it's not needed, then
it's up to the arch to nop it out, we should probably do it on powerpc
too ...  but that's an unrelated discussion).

Cheers,
Ben.


Thanks
Shan Hai


What I do is I replace what is arguably an abuse of gup() in the case
where a fixup -is- needed with a dedicated function designed to perform
the said fixup ... and do it properly which gup() didn't :-)

Cheers,
Ben.


Thanks
Shan Hai

Cheers,
Ben.

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9670f71..1036614 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -985,6 +985,8 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
int get_user_pages_fast(unsigned long start, int nr_pages, int write,
struct page **pages);
struct page *get_dump_page(unsigned long addr);
+extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
+   unsigned long address, unsigned int fault_flags);

extern int try_to_release_page(struct page * page, gfp_t gfp_mask);
extern void do_invalidatepage(struct page *page, unsigned long offset);
diff --git a/kernel/futex.c b/kernel/futex.c
index fe28dc2..7a0a4ed 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -355,8 +355,8 @@ static int fault_in_user_writeable(u32 __user *uaddr)
int ret;

 	down_read(&mm->mmap_sem);
-	ret = get_user_pages(current, mm, (unsigned long)uaddr,
-			     1, 1, 0, NULL, NULL);
+	ret = fixup_user_fault(current, mm, (unsigned long)uaddr,
+			       FAULT_FLAG_WRITE);
 	up_read(&mm->mmap_sem);
 
 	return ret < 0 ? ret : 0;
diff --git a/mm/memory.c b/mm/memory.c
index 40b7531..b967fb0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1815,7 +1815,64 @@ next_page:
}
EXPORT_SYMBOL(__get_user_pages);

-/**
+/*
+ * fixup_user_fault() - manually resolve a user page fault
+ * @tsk:	the task_struct to use for page fault accounting, or
+ *		NULL if faults are not to be recorded.
+ * @mm:		mm_struct of target mm
+ * @address:	user address
+ * @fault_flags: flags to pass down to handle_mm_fault()
+ *
+ * This is meant to be called in the specific scenario where for
+ * locking reasons we try to access user memory in atomic context
+ * (within a pagefault_disable() section), this returns -EFAULT,
+ * and we want to resolve the user fault before trying again.
+ *
+ * Typically this is meant to be used by the futex code.
+ *
+ * The main difference with get_user_pages() is that this function
+ * will unconditionally call handle_mm_fault() which will in turn
+ * perform all the necessary SW fixup of the dirty and young bits
+ * in the PTE, while get_user_pages() only guarantees to update
+ * these in the struct page.
+ *
+ * This is important for some architectures where those bits also
+ * gate the access permission to the page because they are
+ * maintained in software

Re: [RFC/PATCH] mm/futex: Fix futex writes on archs with SW tracking of dirty & young

2011-07-19 Thread Shan Hai

On 07/19/2011 04:26 PM, David Laight wrote:



Got it. If the fault_in_user_writeable() is designed to catch the
exact same write permission fault problem we discuss here, then
your patch fixes that very nicely. We should fix it up by directly
calling handle_mm_fault like you did, because we know for sure
what just happened (a permission violation); it's not necessary
to check what happened by calling gup->follow_page, and
furthermore follow_page failed to report the fault :-)

One thought I've had - and I don't know enough about the data
area in use to know if it is a problem - is what happens if
a different cpu faults on the same user page and has already
marked it 'valid' between the fault happening and the fault
handler looking at the page tables to find out why.
If any of the memory areas are shared, it might be that the
PTE (etc) might already show the page a writable by the
time the fault handler is looking at them - this might confuse it!



There is no problem at all if by *valid* you mean page present
and writable: when fault_in_user_writeable() is called, the pte
for the shared page has already been set up by demand paging,
with pte.present and pte.write set. The reason the fault was
taken is a permission violation on a present and writable user
page, which occurs on sw dirty/young tracking architectures.

Thanks
Shan Hai


David






Re: [PATCH 1/1] Fixup write permission of TLB on powerpc e500 core

2011-07-18 Thread Shan Hai
 */
mark_page_accessed(page);
}
+
 	if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED)) {
/*
 * The preliminary mapping check is mainly to avoid the
@@ -3358,21 +3386,8 @@ int handle_pte_fault(struct mm_struct *mm,
 		if (!pte_write(entry))
 			return do_wp_page(mm, vma, address,
 					pte, pmd, ptl, entry);
-		entry = pte_mkdirty(entry);
-	}
-	entry = pte_mkyoung(entry);
-	if (ptep_set_access_flags(vma, address, pte, entry, flags & FAULT_FLAG_WRITE)) {
-		update_mmu_cache(vma, address, pte);
-	} else {
-		/*
-		 * This is needed only for protection faults but the arch code
-		 * is not yet telling us if this is a protection fault or not.
-		 * This still avoids useless tlb flushes for .text page faults
-		 * with threads.
-		 */
-		if (flags & FAULT_FLAG_WRITE)
-			flush_tlb_fix_spurious_fault(vma, address);
 	}
+	handle_pte_sw_young_dirty(vma, address, pte, flags & FAULT_FLAG_WRITE);
 unlock:
 	pte_unmap_unlock(pte, ptl);
 	return 0;




So what about the following?
diff --git a/mm/memory.c b/mm/memory.c
index 9b8a01d..fb48122 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1442,6 +1442,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 
 	spinlock_t *ptl;
 	struct page *page;
 	struct mm_struct *mm = vma->vm_mm;
+	int fix_write_permission = false;
 
 	page = follow_huge_addr(mm, address, flags & FOLL_WRITE);
 	if (!IS_ERR(page)) {
@@ -1519,6 +1520,11 @@ split_fallthrough:
 	if ((flags & FOLL_WRITE) &&
 	    !pte_dirty(pte) && !PageDirty(page))
 		set_page_dirty(page);
+
+#ifdef CONFIG_FIXUP_WRITE_PERMISSION
+	if ((flags & FOLL_WRITE) && !pte_dirty(pte))
+		fix_write_permission = true;
+#endif
 	/*
 	 * pte_mkyoung() would be more correct here, but atomic care
 	 * is needed to avoid losing the dirty bit: it is easier to use
 	 */
@@ -1551,7 +1557,7 @@ split_fallthrough:
 unlock:
 	pte_unmap_unlock(ptep, ptl);
 out:
-	return page;
+	return (fix_write_permission == true) ? NULL : page;
 
 bad_page:
 	pte_unmap_unlock(ptep, ptl);



From the CONFIG_FIXUP_WRITE_PERMISSION and
(flags & FOLL_WRITE) && !pte_dirty(pte) checks, follow_page()
can figure out that the caller wants to write to a
(present && writable && non-dirty) pte, and that the architecture
wants the problem fixed up, as indicated by
CONFIG_FIXUP_WRITE_PERMISSION; so let follow_page() return NULL to
__get_user_pages, and let handle_mm_fault fix up the dirty/young
tracking.

Checking the following code, we can conclude that handle_mm_fault
is ready to handle faults, and a write permission violation is
a kind of fault, so why don't we let handle_mm_fault
handle that fault?

__get_user_pages()
 while (!(page = follow_page(vma, start, foll_flags))) {
...
ret = handle_mm_fault(mm, vma, start,
fault_flags);
...
}

Thanks
Shan Hai


Re: [PATCH 1/1] Fixup write permission of TLB on powerpc e500 core

2011-07-18 Thread Shan Hai

On 07/18/2011 03:01 PM, Benjamin Herrenschmidt wrote:

On Mon, 2011-07-18 at 14:48 +0800, Shan Hai wrote:


It could not fix the problem; refer to the following reply for
the reasons.

  .../...


diff --git a/kernel/futex.c b/kernel/futex.c
index fe28dc2..02adff7 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -355,8 +355,8 @@ static int fault_in_user_writeable(u32 __user *uaddr)
int ret;

 	down_read(&mm->mmap_sem);
-	ret = get_user_pages(current, mm, (unsigned long)uaddr,
-			     1, 1, 0, NULL, NULL);
+	ret = __get_user_pages(current, mm, (unsigned long)uaddr, 1,
+			       FOLL_WRITE | FOLL_FIXFAULT, NULL, NULL, NULL);

the FOLL_FIXFAULT is filtered out by the following code in
get_user_pages():
	if (write)
		flags |= FOLL_WRITE;


I'm not sure what you're talking about here, you may notice that I'm
calling __get_user_pages() not get_user_pages(). Make sure you get my
-second- post of the patch (the one with a proper description & s-o-b)
since the first one was a mis-send of an wip version.



I am sorry I hadn't tried your newer patch. I tried it, but it still
could not work in my test environment; I will dig into it and tell you
why it failed later.


+
+	if (flags & FOLL_FIXFAULT)
+		handle_pte_sw_young_dirty(vma, address, ptep,
+					  flags & FOLL_WRITE);
 	if (flags & FOLL_TOUCH) {
 		if ((flags & FOLL_WRITE) &&
 		    !pte_dirty(pte) && !PageDirty(page))

calling handle_pte_sw_young_dirty before the !pte_dirty(pte) check
might have problems.

No this is on purpose.

My initial patch was only calling it under the same condition as the
FOLL_TOUCH case, but I got concerned by this whole
flush_tlb_fix_spurious_fault() business.

Basically, our generic code is designed so that relaxing write
protection on a PTE can be done without flushing the TLB on all CPUs, so
that a spurious fault on a secondary CPU will flush the TLB at that
point.

I don't know which arch relies on this feature (ARM maybe ?) but if we
are going to be semantically equivalent to a real fault, we must also do
that, so the right thing to do here is to always call in there if
FOLL_FIXFAULT is set.

It's up to the caller to only set FOLL_FIXFAULT when really trying to
deal with an -EFAULT, to avoid possible unnecessary overhead, but in
this case I think we are fine, this is all a fallback slow path.

  .../...


So what about the following?
diff --git a/mm/memory.c b/mm/memory.c
index 9b8a01d..fb48122 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1442,6 +1442,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 	spinlock_t *ptl;
 	struct page *page;
 	struct mm_struct *mm = vma->vm_mm;
+	int fix_write_permission = false;
 
 	page = follow_huge_addr(mm, address, flags & FOLL_WRITE);
 	if (!IS_ERR(page)) {
@@ -1519,6 +1520,11 @@ split_fallthrough:
 	if ((flags & FOLL_WRITE) &&
 	    !pte_dirty(pte) && !PageDirty(page))
 		set_page_dirty(page);
+
+#ifdef CONFIG_FIXUP_WRITE_PERMISSION
+	if ((flags & FOLL_WRITE) && !pte_dirty(pte))
+		fix_write_permission = true;
+#endif
 	/*
 	 * pte_mkyoung() would be more correct here, but atomic care
 	 * is needed to avoid losing the dirty bit: it is easier to use
 	 */
@@ -1551,7 +1557,7 @@ split_fallthrough:
 unlock:
 	pte_unmap_unlock(ptep, ptl);
 out:
-	return page;
+	return (fix_write_permission == true) ? NULL : page;
 
 bad_page:
 	pte_unmap_unlock(ptep, ptl);

Your patch is not only uglier (more ifdef's) but also incomplete, since it
doesn't handle the young case and it doesn't handle the spurious fault
case either.



Yep, I know holding lots of ifdef's everywhere is not so good,
but if we had some other way (I don't know how till now) to
figure out that the arch needs the write permission fixup,
we could eradicate the ugly ifdef's here.

I think handle_mm_fault could do all the dirty/young tracking,
because the purpose of making follow_page return NULL to
its caller is to get handle_mm_fault called on a
write permission protection fault.

Thanks
Shan Hai


What the futex code is trying to do is use gup() as a way to fixup from
a fault which means essentially to have the -exact- same semantics as a
normal fault would have.

Thus by factoring the common fault fixup code and using that exact same
code in gup(), we get a much more robust guarantee that this will work
in the long run.

I don't expect gup to be that commonly used to fixup access after an
attempt at doing a user access with page faults disabled, only those
case will need to be modified to use the new flag.


   From the CONFIG_FIXUP_WRITE_PERMISSION and
(flags & FOLL_WRITE) && !pte_dirty(pte) the follow_page()
could figure out that the caller want to write

Re: [PATCH 1/1] Fixup write permission of TLB on powerpc e500 core

2011-07-18 Thread Shan Hai

On 07/18/2011 03:36 PM, Benjamin Herrenschmidt wrote:

On Mon, 2011-07-18 at 15:26 +0800, Shan Hai wrote:

I am sorry I hadn't tried your newer patch, I tried it but it still
could not work in my test environment, I will dig into and tell you
why that failed later.

Ok, please let me know what you find !



I have not found out the reason why it failed yet;
I tried the following based on your code,
(1)
diff --git a/kernel/futex.c b/kernel/futex.c
index fe28dc2..820556d 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -353,10 +353,11 @@ static int fault_in_user_writeable(u32 __user *uaddr)
 {
 	struct mm_struct *mm = current->mm;
 	int ret;
+	int flags = FOLL_TOUCH | FOLL_GET | FOLL_WRITE | FOLL_FIXFAULT;
 
 	down_read(&mm->mmap_sem);
-	ret = get_user_pages(current, mm, (unsigned long)uaddr,
-			     1, 1, 0, NULL, NULL);
+	ret = __get_user_pages(current, mm, (unsigned long)uaddr, 1,
+			       flags, NULL, NULL, NULL);
 	up_read(&mm->mmap_sem);
 
 	return ret < 0 ? ret : 0;

(2)
diff --git a/mm/memory.c b/mm/memory.c
index 9b8a01d..f7ba26e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
...
+
+	if ((flags & (FOLL_WRITE | FOLL_FIXFAULT)) && !pte_dirty(pte))
+		handle_pte_sw_young_dirty(vma, address, ptep,
+					  FAULT_FLAG_WRITE);
...

And everything looks good, but it still couldn't work; it needs more
investigation.



Yep, I know holding lots of ifdef's everywhere is not so good,
but if we had some other way (I don't know how till now) to
figure out that the arch needs the write permission fixup,
we could eradicate the ugly ifdef's here.

I think handle_mm_fault could do all the dirty/young tracking,
because the purpose of making follow_page return NULL to
its caller is to get handle_mm_fault called on a
write permission protection fault.

I see your point. Rather than factoring the fixup code out, we could
force gup to call handle_mm_fault()... that makes sense.

However, I don't think we should special case archs. There's plenty of
cases where we don't care about this fixup even on archs that do SW
tracking of dirty and young. For example when gup is used for
subsequent DMA.

Only the (rare ?) cases where it's used as a mean to fixup a failing
atomic user access are relevant.

So I believe we should still pass an explicit flag to __get_user_pages()
as I propose to activate that behaviour.



How about the following one?
The write permission fixup behaviour is triggered explicitly by
the trouble-making parts like futex, as you suggested.

In this way, the follow_page() mimics exactly how the MMU
faults on atomic access to the user pages, and we can handle
the fault with the already existing handle_mm_fault, which also
does the dirty/young tracking properly.


diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9670f71..8a76694 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1546,6 +1546,7 @@ struct page *follow_page(struct vm_area_struct *, unsigned long address,
 
 #define FOLL_MLOCK	0x40	/* mark page as mlocked */
 #define FOLL_SPLIT	0x80	/* don't return transhuge pages, split them */
 #define FOLL_HWPOISON	0x100	/* check page is hwpoisoned */
+#define FOLL_FIXFAULT	0x200	/* fixup after a fault (PTE dirty/young upd) */
 
 typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
 			void *data);
diff --git a/kernel/futex.c b/kernel/futex.c
index fe28dc2..820556d 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -353,10 +353,11 @@ static int fault_in_user_writeable(u32 __user *uaddr)
 {
 	struct mm_struct *mm = current->mm;
 	int ret;
+	int flags = FOLL_TOUCH | FOLL_GET | FOLL_WRITE | FOLL_FIXFAULT;
 
 	down_read(&mm->mmap_sem);
-	ret = get_user_pages(current, mm, (unsigned long)uaddr,
-			     1, 1, 0, NULL, NULL);
+	ret = __get_user_pages(current, mm, (unsigned long)uaddr, 1,
+			       flags, NULL, NULL, NULL);
 	up_read(&mm->mmap_sem);
 
 	return ret < 0 ? ret : 0;
diff --git a/mm/memory.c b/mm/memory.c
index 9b8a01d..5682501 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1442,6 +1442,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 
 	spinlock_t *ptl;
 	struct page *page;
 	struct mm_struct *mm = vma->vm_mm;
+	int fix_write_permission = 0;
 
 	page = follow_huge_addr(mm, address, flags & FOLL_WRITE);
 	if (!IS_ERR(page)) {
@@ -1519,6 +1520,9 @@ split_fallthrough:
 	if ((flags & FOLL_WRITE) &&
 	    !pte_dirty(pte) && !PageDirty(page))
 		set_page_dirty(page);
+
+	if ((flags & (FOLL_WRITE | FOLL_FIXFAULT)) && !pte_dirty(pte))
+		fix_write_permission = 1;
 	/*
 	 * pte_mkyoung() would be more correct here, but atomic care
 	 * is needed to avoid losing the dirty bit: it is easier to use
 	 */
@@ -1551,7 +1555,7 @@ split_fallthrough:
 unlock:
 	pte_unmap_unlock

Re: [RFC/PATCH] mm/futex: Fix futex writes on archs with SW tracking of dirty & young

2011-07-18 Thread Shan Hai

On 07/19/2011 12:29 PM, Benjamin Herrenschmidt wrote:

The futex code currently attempts to write to user memory within
a pagefault disabled section, and if that fails, tries to fix it
up using get_user_pages().

This doesn't work on archs where the dirty and young bits are
maintained by software, since they will gate access permission
in the TLB, and will not be updated by gup().

In addition, there's an expectation on some archs that a
spurious write fault triggers a local TLB flush, and that is
missing from the picture as well.

I decided that adding those features to gup() would be too much
for this already too complex function, and instead added a new
simpler fixup_user_fault() which is essentially a wrapper around
handle_mm_fault() which the futex code can call.

Signed-off-by: Benjamin Herrenschmidt <b...@kernel.crashing.org>
---

Shan, can you test this ? It might not fix the problem since I'm
starting to have the nasty feeling that you are hitting what is
somewhat a subtly different issue or my previous patch should
have worked (but then I might have done a stupid mistake as well)
but let us know anyway.



Ok, I will test the patch. I think this should work, because
it's similar to my first posted patch; the difference is that
I tried to do it in futex_atomic_cmpxchg_inatomic() in
the ppc-specific path, a lower level than yours in
fault_in_user_writeable() :-)

Anyway, I will notify you on the test result.

Thanks
Shan Hai


Cheers,
Ben.

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9670f71..1036614 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -985,6 +985,8 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
  int get_user_pages_fast(unsigned long start, int nr_pages, int write,
struct page **pages);
  struct page *get_dump_page(unsigned long addr);
+extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
+   unsigned long address, unsigned int fault_flags);

  extern int try_to_release_page(struct page * page, gfp_t gfp_mask);
  extern void do_invalidatepage(struct page *page, unsigned long offset);
diff --git a/kernel/futex.c b/kernel/futex.c
index fe28dc2..7a0a4ed 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -355,8 +355,8 @@ static int fault_in_user_writeable(u32 __user *uaddr)
int ret;

 	down_read(&mm->mmap_sem);
-	ret = get_user_pages(current, mm, (unsigned long)uaddr,
-			     1, 1, 0, NULL, NULL);
+	ret = fixup_user_fault(current, mm, (unsigned long)uaddr,
+			       FAULT_FLAG_WRITE);
 	up_read(&mm->mmap_sem);
 
 	return ret < 0 ? ret : 0;
diff --git a/mm/memory.c b/mm/memory.c
index 40b7531..b967fb0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1815,7 +1815,64 @@ next_page:
  }
  EXPORT_SYMBOL(__get_user_pages);

-/**
+/*
+ * fixup_user_fault() - manually resolve a user page fault
+ * @tsk:	the task_struct to use for page fault accounting, or
+ *		NULL if faults are not to be recorded.
+ * @mm:		mm_struct of target mm
+ * @address:	user address
+ * @fault_flags: flags to pass down to handle_mm_fault()
+ *
+ * This is meant to be called in the specific scenario where for
+ * locking reasons we try to access user memory in atomic context
+ * (within a pagefault_disable() section), this returns -EFAULT,
+ * and we want to resolve the user fault before trying again.
+ *
+ * Typically this is meant to be used by the futex code.
+ *
+ * The main difference with get_user_pages() is that this function
+ * will unconditionally call handle_mm_fault() which will in turn
+ * perform all the necessary SW fixup of the dirty and young bits
+ * in the PTE, while handle_mm_fault() only guarantees to update
+ * these in the struct page.
+ *
+ * This is important for some architectures where those bits also
+ * gate the access permission to the page because they are
+ * maintained in software. On such architectures, gup() will not
+ * be enough to make a subsequent access succeed.
+ *
+ * This should be called with the mm_sem held for read.
+ */
+int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
+unsigned long address, unsigned int fault_flags)
+{
+   struct vm_area_struct *vma;
+   int ret;
+
+   vma = find_extend_vma(mm, address);
+   if (!vma || address < vma->vm_start)
+   return -EFAULT;
+   
+   ret = handle_mm_fault(mm, vma, address, fault_flags);
+   if (ret & VM_FAULT_ERROR) {
+   if (ret & VM_FAULT_OOM)
+   return -ENOMEM;
+   if (ret & (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE))
+   return -EHWPOISON;
+   if (ret & VM_FAULT_SIGBUS)
+   return -EFAULT;
+   BUG();
+   }
+   if (tsk) {
+   if (ret & VM_FAULT_MAJOR)
+   tsk->maj_flt++;
+   else

Re: [RFC/PATCH] mm/futex: Fix futex writes on archs with SW tracking of dirty & young

2011-07-18 Thread Shan Hai

On 07/19/2011 12:29 PM, Benjamin Herrenschmidt wrote:

The futex code currently attempts to write to user memory within
a pagefault disabled section, and if that fails, tries to fix it
up using get_user_pages().

This doesn't work on archs where the dirty and young bits are
maintained by software, since they will gate access permission
in the TLB, and will not be updated by gup().

In addition, there's an expectation on some archs that a
spurious write fault triggers a local TLB flush, and that is
missing from the picture as well.

I decided that adding those features to gup() would be too much
for this already too complex function, and instead added a new
simpler fixup_user_fault() which is essentially a wrapper around
handle_mm_fault() which the futex code can call.

Signed-off-by: Benjamin Herrenschmidt <b...@kernel.crashing.org>
---

Shan, can you test this ? It might not fix the problem since I'm
starting to have the nasty feeling that you are hitting what is
somewhat a subtly different issue or my previous patch should
have worked (but then I might have done a stupid mistake as well)
but let us know anyway.



The patch works, but I have certain confusions,
- Do we want to call handle_mm_fault() on each futex_lock_pi(),
even though in most cases no write-permission
fixup is needed?
- How about letting the archs do their own write-permission
fixup, as I did in my original
[PATCH 1/1] Fixup write permission of TLB on powerpc e500 core?
(I will fix the stupid errors in my original patch if the concept
is acceptable)

in this way we could decrease the overhead of handle_mm_fault
in the path which does not need write permission fixup.

Thanks
Shan Hai

Cheers,
Ben.

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9670f71..1036614 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -985,6 +985,8 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
  int get_user_pages_fast(unsigned long start, int nr_pages, int write,
struct page **pages);
  struct page *get_dump_page(unsigned long addr);
+extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
+   unsigned long address, unsigned int fault_flags);

  extern int try_to_release_page(struct page * page, gfp_t gfp_mask);
  extern void do_invalidatepage(struct page *page, unsigned long offset);
diff --git a/kernel/futex.c b/kernel/futex.c
index fe28dc2..7a0a4ed 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -355,8 +355,8 @@ static int fault_in_user_writeable(u32 __user *uaddr)
int ret;

down_read(&mm->mmap_sem);
-   ret = get_user_pages(current, mm, (unsigned long)uaddr,
-            1, 1, 0, NULL, NULL);
+   ret = fixup_user_fault(current, mm, (unsigned long)uaddr,
+            FAULT_FLAG_WRITE);
up_read(&mm->mmap_sem);

return ret < 0 ? ret : 0;
diff --git a/mm/memory.c b/mm/memory.c
index 40b7531..b967fb0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1815,7 +1815,64 @@ next_page:
  }
  EXPORT_SYMBOL(__get_user_pages);

-/**
+/*
+ * fixup_user_fault() - manually resolve a user page fault
+ * @tsk:   the task_struct to use for page fault accounting, or
+ * NULL if faults are not to be recorded.
+ * @mm:mm_struct of target mm
+ * @address:   user address
+ * @fault_flags:flags to pass down to handle_mm_fault()
+ *
+ * This is meant to be called in the specific scenario where for
+ * locking reasons we try to access user memory in atomic context
+ * (within a pagefault_disable() section), this returns -EFAULT,
+ * and we want to resolve the user fault before trying again.
+ *
+ * Typically this is meant to be used by the futex code.
+ *
+ * The main difference with get_user_pages() is that this function
+ * will unconditionally call handle_mm_fault() which will in turn
+ * perform all the necessary SW fixup of the dirty and young bits
+ * in the PTE, while handle_mm_fault() only guarantees to update
+ * these in the struct page.
+ *
+ * This is important for some architectures where those bits also
+ * gate the access permission to the page because they are
+ * maintained in software. On such architectures, gup() will not
+ * be enough to make a subsequent access succeed.
+ *
+ * This should be called with the mm_sem held for read.
+ */
+int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
+unsigned long address, unsigned int fault_flags)
+{
+   struct vm_area_struct *vma;
+   int ret;
+
+   vma = find_extend_vma(mm, address);
+   if (!vma || address < vma->vm_start)
+   return -EFAULT;
+   
+   ret = handle_mm_fault(mm, vma, address, fault_flags);
+   if (ret & VM_FAULT_ERROR) {
+   if (ret & VM_FAULT_OOM)
+   return -ENOMEM;
+   if (ret & (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE))
+   return

Re: [RFC/PATCH] mm/futex: Fix futex writes on archs with SW tracking of dirty & young

2011-07-18 Thread Shan Hai

On 07/19/2011 01:24 PM, Benjamin Herrenschmidt wrote:

On Tue, 2011-07-19 at 13:17 +0800, Shan Hai wrote:


The patch works, but I have certain confusions,
- Do we want to handle_mm_fault on each futex_lock_pi
  even though in most cases there is no write permission
  fixup's needed?

Don't we only ever call this when futex_atomic_op_inuser() failed?
Which means a fixup -is- needed. The fast path is still there.



What you said is another path, that is futex_wake_op(),
but what about futex_lock_pi in which my test case failed?
your patch will call handle_mm_fault on every futex contention
in the futex_lock_pi path.

futex_lock_pi()
ret = futex_lock_pi_atomic(uaddr, hb, &q.key, &q.pi_state, current, 0);
case -EFAULT:
goto uaddr_faulted;

...
uaddr_faulted:
ret = fault_in_user_writeable(uaddr);



- How about let the archs do their own write permission
  fixup as what I did in my original

Why ? This is generic and will fix all archs at once with generic code
which is a significant improvement in my book and a lot more
maintainable :-)



If the overhead in the futex_lock_pi path is not considerable, then yes,
fixing it up generically is nice :-)


  [PATCH 1/1] Fixup write permission of TLB on powerpc e500 core?
  (I will fix the stupid errors in my original patch if the concept
is acceptable)
  in this way we could decrease the overhead of handle_mm_fault
  in the path which does not need write permission fixup.

Which overhead ? gup does handle_mm_fault() as well if needed.


it does it *if needed*, and this requirement is rare in my opinion.


Thanks
Shan Hai


What I do is I replace what is arguably an abuse of gup() in the case
where a fixup -is- needed with a dedicated function designed to perform
the said fixup ... and do it properly which gup() didn't :-)

Cheers,
Ben.


Thanks
Shan Hai

Cheers,
Ben.

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9670f71..1036614 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -985,6 +985,8 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
   int get_user_pages_fast(unsigned long start, int nr_pages, int write,
struct page **pages);
   struct page *get_dump_page(unsigned long addr);
+extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
+   unsigned long address, unsigned int fault_flags);

   extern int try_to_release_page(struct page * page, gfp_t gfp_mask);
   extern void do_invalidatepage(struct page *page, unsigned long offset);
diff --git a/kernel/futex.c b/kernel/futex.c
index fe28dc2..7a0a4ed 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -355,8 +355,8 @@ static int fault_in_user_writeable(u32 __user *uaddr)
int ret;

down_read(&mm->mmap_sem);
-   ret = get_user_pages(current, mm, (unsigned long)uaddr,
-            1, 1, 0, NULL, NULL);
+   ret = fixup_user_fault(current, mm, (unsigned long)uaddr,
+            FAULT_FLAG_WRITE);
up_read(&mm->mmap_sem);

return ret < 0 ? ret : 0;
diff --git a/mm/memory.c b/mm/memory.c
index 40b7531..b967fb0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1815,7 +1815,64 @@ next_page:
   }
   EXPORT_SYMBOL(__get_user_pages);

-/**
+/*
+ * fixup_user_fault() - manually resolve a user page fault
+ * @tsk:   the task_struct to use for page fault accounting, or
+ * NULL if faults are not to be recorded.
+ * @mm:mm_struct of target mm
+ * @address:   user address
+ * @fault_flags:flags to pass down to handle_mm_fault()
+ *
+ * This is meant to be called in the specific scenario where for
+ * locking reasons we try to access user memory in atomic context
+ * (within a pagefault_disable() section), this returns -EFAULT,
+ * and we want to resolve the user fault before trying again.
+ *
+ * Typically this is meant to be used by the futex code.
+ *
+ * The main difference with get_user_pages() is that this function
+ * will unconditionally call handle_mm_fault() which will in turn
+ * perform all the necessary SW fixup of the dirty and young bits
+ * in the PTE, while handle_mm_fault() only guarantees to update
+ * these in the struct page.
+ *
+ * This is important for some architectures where those bits also
+ * gate the access permission to the page because they are
+ * maintained in software. On such architectures, gup() will not
+ * be enough to make a subsequent access succeed.
+ *
+ * This should be called with the mm_sem held for read.
+ */
+int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
+unsigned long address, unsigned int fault_flags)
+{
+   struct vm_area_struct *vma;
+   int ret;
+
+   vma = find_extend_vma(mm, address);
+   if (!vma || address < vma->vm_start)
+   return -EFAULT;
+   
+   ret = handle_mm_fault(mm, vma, address, fault_flags);
+   if (ret

Re: [PATCH 1/1] Fixup write permission of TLB on powerpc e500 core

2011-07-17 Thread Shan Hai

On 07/17/2011 07:02 PM, Peter Zijlstra wrote:

On Sun, 2011-07-17 at 09:49 +1000, Benjamin Herrenschmidt wrote:

In the meantime, other than rewriting the futex code to not require
those in-atomic accesses (can't it just access the pages via the linear
mapping and/or kmap after the gup ?),

That'll wreck performance on things like ARM and SPARC that have to deal
with cache aliasing.


  all I see would be a way to force
dirty and young after gup, with appropriate locks, or a variant of gup
(via a flag ?) to require it to do so.

Again, _WHY_ isn't gup(.write=1) a complete write fault? It's supposed to
be, it needs to break COW, do dirty page tracking and call page_mkwrite.
I'm still thinking this e500 stuff is smoking crack.

ARM has no hardware dirty bit either, and yet it works for them. I can't
exactly tell how because I got lost in there, but it does, again,
suggest e500 is on crack.


OK, the following feature of the architecture causes gup(.write=1)
to fail at dirtying pages:
- it allows pages to be protected from supervisor-mode writes

On ARM you cannot protect pages from supervisor-mode writes,
can you? That means all writable user pages are writable for
the supervisor too, but that does not hold for at least x86 and
powerpc: both can be configured to protect pages from
supervisor-mode writes.

Think about the following situation,
a page fault occurs on the kernel trying to write to a writable shared
user page which is read only to the kernel, the following conditions hold,
- the page is *present*, because its a shared page
- the page is *writable*, because demand paging sets up the pte for
the current process to so

The follow_page() called in __get_user_pages() returns non-NULL
to its caller on the above mentioned *present* and *writable* page,
so gup(.write=1) has no chance to set the pte dirty by calling
handle_mm_fault();

follow_page() has no knowledge of supervisor-mode write protected pages,
and that is the culprit in the bug discussed here.

Thanks
Shan Hai


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 1/1] Fixup write permission of TLB on powerpc e500 core

2011-07-17 Thread Shan Hai

On 07/17/2011 10:48 PM, Benjamin Herrenschmidt wrote:

On Sun, 2011-07-17 at 21:33 +0800, Shan Hai wrote:

On ARM you could not protect pages from supervisor-mode writes,
isn't it?  That means, all writable user pages are writable for
supervisor too, but its not hold for at least x86 and powerpc,
x86 and powerpc can be configured to protect pages from
supervisor-mode writes.

That doesn't sound right... how would put_user() work properly then? A
cursory glance at the ARM code doesn't show it doing anything special,
just stores ... but I might have missed something.



That is how it really is on ARM; the reason put_user() works properly
is that the first access to the write-protected page triggers a page
fault, and handle_mm_fault() fixes up the write permission for the
kernel, because at that point no one has disabled page faults as is
done in the futex case.


Think about the following situation,
a page fault occurs on the kernel trying to write to a writable shared
user page which is read only to the kernel, the following conditions
hold,
- the page is *present*, because its a shared page
- the page is *writable*, because demand paging sets up the pte for
  the current process to so

The follow_page() called in __get_user_pages() returns non-NULL
to its caller on the above mentioned *present* and *writable* page,
so gup(.write=1) has no chance to set the pte dirty by calling
handle_mm_fault();
follow_page() has no knowledge of supervisor-mode write protected
pages,
and that is the culprit in the bug discussed here.

Right, the problem is with writable pages that have lost (or never had
but usually it's lost, due to swapping for example) their dirty bit, or
any page that has lost young.

From what I can tell, we need to either fix those bits from the caller
of gup (futex code), which sound nasty, or more easily fix those from
gup itself, possibly under control of flags in the write argument to
avoid breaking code relying on the existing behaviour, expecially vs.
dirty.



So, since the SW-tracked dirty/young bits and supervisor-protected
pages potentially affect not only *futex* but also other components
of the kernel that might access a non-dirty supervisor-protected page,
in my opinion it is more sensible to fix this in gup than to fix
it in the futex code.

Thanks
Shan Hai


Cheers,
Ben.






Re: [PATCH 1/1] Fixup write permission of TLB on powerpc e500 core

2011-07-16 Thread Shan Hai

On 07/15/2011 06:23 AM, Peter Zijlstra wrote:

On Fri, 2011-07-15 at 16:07 +0800, Shan Hai wrote:

The kernel has no write permission on COW pages by default on the e500 core; this
will cause an endless loop in futex_lock_pi, because the futex code assumes the kernel
has write permission on COW pages. Grant write permission to the kernel on COW
pages when an access-violation page fault occurs.

Signed-off-by: Shan Hai haishan@gmail.com
---
  arch/powerpc/include/asm/futex.h |   11 ++-
  arch/powerpc/include/asm/tlb.h   |   25 +
  2 files changed, 35 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/include/asm/futex.h b/arch/powerpc/include/asm/futex.h
index c94e4a3..54c3e74 100644
--- a/arch/powerpc/include/asm/futex.h
+++ b/arch/powerpc/include/asm/futex.h
@@ -8,6 +8,7 @@
  #include <asm/errno.h>
  #include <asm/synch.h>
  #include <asm/asm-compat.h>
+#include <asm/tlb.h>

  #define __futex_atomic_op(insn, ret, oldval, uaddr, oparg) \
__asm__ __volatile ( \
@@ -113,7 +114,15 @@ futex_atomic_cmpxchg_inatomic(u32 *uval, u32 __user *uaddr,
: "cc", "memory");

*uval = prev;
-return ret;
+
+   /* Futex assumes the kernel has permission to write to
+* COW pages, grant the kernel write permission on COW
+* pages because it has none by default.
+*/
+   if (ret == -EFAULT)
+   __tlb_fixup_write_permission(current->mm, (unsigned long)uaddr);
+
+   return ret;
  }

  #endif /* __KERNEL__ */
diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
index e2b428b..3863c6a 100644
--- a/arch/powerpc/include/asm/tlb.h
+++ b/arch/powerpc/include/asm/tlb.h
@@ -45,5 +45,30 @@ static inline void __tlb_remove_tlb_entry(struct mmu_gather *tlb, pte_t *ptep,
  #endif
  }

+/* Grant write permission to the kernel on a page. */
+static inline void __tlb_fixup_write_permission(struct mm_struct *mm,
+   unsigned long address)
+{
+#if defined(CONFIG_FSL_BOOKE)
+   /* Grant write permission to the kernel on a page by setting TLB.SW
+* bit, the bit setting operation is tricky here, calling
+* handle_mm_fault with FAULT_FLAG_WRITE causes _PAGE_DIRTY bit of
+* the pte to be set, the _PAGE_DIRTY of the pte is translated into
+* TLB.SW on Powerpc e500 core.
+*/
+
+   struct vm_area_struct *vma;
+
+   vma = find_vma(mm, address);

Uhm, find_vma() needs mmap_sem, and futex_atomic_cmpxchg_inatomic() is
most certainly not called with that lock held.


+   if (likely(vma)) {
+   /* only fixup present page */
+   if (follow_page(vma, address, FOLL_WRITE)) {
+   handle_mm_fault(mm, vma, address, FAULT_FLAG_WRITE);

So how can this toggle your sw dirty/young tracking, that's pretty much
what gup(.write=1) does too!



That's right, gup(.write=1) wants to do the same thing as the
above code snippet, but it fails for the following reason:
get_user_pages() does not dirty the pte, because follow_page()
returns non-NULL on a *present* and *writable* page; the page
holding the lock is present because it is a shared page, and
writable because demand paging set it up that way for a shared
writable page, so handle_mm_fault() in __get_user_pages()
never gets called.

Why the above code could do the same task, because by calling
handle_mm_fault() will set pte dirty by
[do_annonymous_page(), memory.c]
if (vma->vm_flags & VM_WRITE)
entry = pte_mkwrite(pte_mkdirty(entry));

Thanks
Shan Hai


+   flush_tlb_page(vma, address);
+   }
+   }
+#endif
+}
+
  #endif /* __KERNEL__ */
  #endif /* __ASM_POWERPC_TLB_H */




Re: [PATCH 0/1] Fixup write permission of TLB on powerpc e500 core

2011-07-16 Thread Shan Hai

On 07/15/2011 08:20 PM, Benjamin Herrenschmidt wrote:

On Fri, 2011-07-15 at 11:32 -0400, Shan Hai wrote:

I agree with you, the problem could be triggered by accessing
any user space page which has kernel read only permission
in the page fault disabled context, the problem also affects
architectures which depend on SW dirty/young tracking as
stated by Benjamin in this thread.

In the e500 case, the commit 6cfd8990e27d3a491c1c605d6cbc18a46ae51fef
removed the write permission fixup from TLB miss handlers and left it to
generic code, so it might be right time to fixup the write permission here
in the generic code.

But we can't. We must not modify the PTE from an interrupt context, and
the atomic variants of user accesses can be called in such contexts.

I think the problem is that we try to actually do things other than just
peek at user memory (for backtraces etc...) but actually useful things
in page fault disabled contexts. That's bad and various archs mm were
designed with the assumption that this never happens.



Yes I understood, the *here* above means 'generic code' like futex code,
I am sorry for my ambiguous description.


If the futex case is seldom here, we could probably find a way to work
around in that specific case.



That's what my patch wants to do.


However, I -still- don't understand why gup didn't fixup the write
permission. gup doesn't set dirty ?



Yep, gup doesn't set dirty, because when the page fault
occurs on the kernel accessing a user page which is
read only to the kernel, the following conditions hold:
- the page is present, because it is a shared page
- the page is writable, because demand paging
set up the pte of the current process that way

The follow_page() called in __get_user_pages()
returns non-NULL to its caller on the above mentioned
present and writable page, so gup(.write=1) has no
chance to set the pte dirty by calling handle_mm_fault()

Thanks
Shan Hai

Cheers,
Ben.






Re: [PATCH 1/1] Fixup write permission of TLB on powerpc e500 core

2011-07-16 Thread Shan Hai

On 07/15/2011 11:24 AM, Peter Zijlstra wrote:

On Fri, 2011-07-15 at 11:18 -0400, Shan Hai wrote:


+   vma = find_vma(mm, address);

Uhm, find_vma() needs mmap_sem, and futex_atomic_cmpxchg_inatomic() is
most certainly not called with that lock held.


My fault, that will be fixed in the V2 patch.

But you cannot, the function isn't called _atomic_ just for kicks, it's
used while holding spinlocks.



Yes we can do that; _atomic_ here just refers to the atomicity of the cmpxchg,
implemented by the combination of the 'lwarx' and 'stwcx.' instructions
as in the spinlock implementation, so even if we hold the
mmap_sem it has no impact on the _atomic_ property of
futex_atomic_cmpxchg_inatomic().


+   if (likely(vma)) {
+   /* only fixup present page */
+   if (follow_page(vma, address, FOLL_WRITE)) {
+   handle_mm_fault(mm, vma, address, FAULT_FLAG_WRITE);

So how can this toggle your sw dirty/young tracking, that's pretty much
what gup(.write=1) does too!


because the kernel-read-only permission of the page is transparent
to follow_page(), handle_mm_fault() is not activated
in __get_user_pages(), so gup(.write=1) cannot help fix up
the write permission.

So why do you need the vma? Is it like I wrote earlier that you don't
have spare PTE bits and need the vma flags to see if it may become
writable?



Need vma for the reason to call handle_mm_fault(), that's all.


gup(.write=1) not triggering this is a serious problem though, not
something you can just paper over. I wouldn't be at all surprised to
find there's more things broken because of that.


In my opinion another solution might be to check the read-only-to-kernel
property of a page in follow_page() on gup(.write=1), to avoid this
problem on all architectures.

Thanks
Shan Hai



[PATCH 0/1] Fixup write permission of TLB on powerpc e500 core

2011-07-15 Thread Shan Hai
The following test case could reveal a bug in the futex_lock_pi()

BUG: On FUTEX_LOCK_PI, there is an infinite loop in futex_lock_pi()
on the Powerpc e500 core.
Cause: The linux kernel on the e500 core has no write permission on
the COW page; refer to the head comment of the following test code.
 
ftrace on test case:
[000]   353.990181: futex_lock_pi_atomic <-futex_lock_pi
[000]   353.990185: cmpxchg_futex_value_locked <-futex_lock_pi_atomic
[snip]
[000]   353.990191: do_page_fault <-handle_page_fault
[000]   353.990192: bad_page_fault <-handle_page_fault
[000]   353.990193: search_exception_tables <-bad_page_fault
[snip]
[000]   353.990199: get_user_pages <-fault_in_user_writeable
[snip]
[000]   353.990208: mark_page_accessed <-follow_page
[000]   353.990222: futex_lock_pi_atomic <-futex_lock_pi
[snip]
[000]   353.990230: cmpxchg_futex_value_locked <-futex_lock_pi_atomic
[ a loop occurs here ]


/* 
 * A test case for revealing an infinite loop in the futex_lock_pi().
 * - there are 2 processes, parent and a child
 * - the parent process allocates and initializes a pthread_mutex MUTEX in a 
 *  shared memory region
 * - the parent process holds the MUTEX and does a long computation
 * - the child process tries to take the MUTEX while the parent is holding it and
 *  traps into the kernel to wait on the MUTEX because of contention
 * - the kernel loops in futex_lock_pi()
 * - result of 'top' command reveals that the system usage of CPU is 100%
 */

#include <stdio.h>
#include <stdlib.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <errno.h>
#include <pthread.h>
#include <string.h>
#include <signal.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/mman.h>

enum { SHM_INIT, SHM_GET };
enum { PARENT, CHILD };

#define FIXED_MMAP_ADDR 0x2000
#define MMAP_SIZE   0x200

static int shmid;
static char shm_name[100];
static int sleep_period = 10;

void * shmem_init(int flag)
{
int start = FIXED_MMAP_ADDR;
int memory_size = MMAP_SIZE;
int mode = 0666;
void *addr;
int ret;

sprintf(shm_name, "/shmem_1234");

shmid = shm_open (shm_name, O_RDWR | O_EXCL | O_CREAT | O_TRUNC, mode);

if (shmid < 0) {
if (errno == EEXIST) {
printf ("shm_open: %s\n", strerror(errno)); 
shmid = shm_open (shm_name, O_RDWR, mode);
} else {
printf("failed to shm_open, err=%s\n", strerror(errno));
return NULL;
}
}

ret = fcntl (shmid, F_SETFD, FD_CLOEXEC);
if (ret < 0) {
printf("fcntl: %s\n", strerror(errno));
return NULL;
}

ret = ftruncate (shmid, memory_size);
if (ret < 0) {
printf("ftruncate: %s\n", strerror(errno));
return NULL;
}

addr = mmap ((void *)start, memory_size, PROT_READ | PROT_WRITE, 
MAP_SHARED | MAP_FIXED, shmid, 0);

if (addr == MAP_FAILED) {
printf ("mmap: %s\n", strerror(errno)); 
close (shmid);
shm_unlink (shm_name);
return NULL;
}

if (flag == SHM_INIT)
memset(addr, 0, memory_size);

return (void *)start;
}

pthread_mutex_t * shmem_mutex_init(int flag)
{
pthread_mutex_t * pmutex = (pthread_mutex_t *)shmem_init(flag);
pthread_mutexattr_t attr;

if (flag == SHM_INIT) {
pthread_mutexattr_init (&attr);
pthread_mutexattr_setpshared (&attr, PTHREAD_PROCESS_SHARED);
pthread_mutexattr_setprotocol (&attr, PTHREAD_PRIO_INHERIT);
pthread_mutexattr_setrobust_np (&attr, 
PTHREAD_MUTEX_STALLED_NP);
pthread_mutexattr_settype (&attr, PTHREAD_MUTEX_ERRORCHECK);
if (pthread_mutex_init (pmutex, &attr) != 0) {
printf("Init mutex failed, err=%s\n", strerror(errno));
pthread_mutexattr_destroy (&attr);
return NULL;
}
}

return pmutex;
}

void long_running_task(int flag)
{
static int counter = 0;

if (flag == PARENT) 
usleep(5*sleep_period);
else
usleep(3*sleep_period);

counter = (counter + 1) % 100;
printf("%d: completed %d computing\n", getpid(), counter);
}

void sig_handler(int signum)
{
close(shmid);
shm_unlink(shm_name);

exit(0);
}

int main(int argc, char *argv[])
{
pthread_mutex_t *mutex_parent, *mutex_child;

signal(SIGUSR1, sig_handler);

if (fork()) { /* parent process */
if ((mutex_parent = shmem_mutex_init(SHM_INIT)) == NULL) {
printf("failed to get the shmem_mutex\n");
exit(-1);
}
 

[PATCH 1/1] Fixup write permission of TLB on powerpc e500 core

2011-07-15 Thread Shan Hai
The kernel has no write permission on COW pages by default on the e500 core; this
will cause an endless loop in futex_lock_pi, because the futex code assumes the kernel
has write permission on COW pages. Grant write permission to the kernel on COW
pages when an access-violation page fault occurs.

Signed-off-by: Shan Hai haishan@gmail.com
---
 arch/powerpc/include/asm/futex.h |   11 ++-
 arch/powerpc/include/asm/tlb.h   |   25 +
 2 files changed, 35 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/include/asm/futex.h b/arch/powerpc/include/asm/futex.h
index c94e4a3..54c3e74 100644
--- a/arch/powerpc/include/asm/futex.h
+++ b/arch/powerpc/include/asm/futex.h
@@ -8,6 +8,7 @@
 #include <asm/errno.h>
 #include <asm/synch.h>
 #include <asm/asm-compat.h>
+#include <asm/tlb.h>
 
 #define __futex_atomic_op(insn, ret, oldval, uaddr, oparg) \
   __asm__ __volatile ( \
@@ -113,7 +114,15 @@ futex_atomic_cmpxchg_inatomic(u32 *uval, u32 __user *uaddr,
: "cc", "memory");
 
*uval = prev;
-return ret;
+
+   /* Futex assumes the kernel has permission to write to
+* COW pages, grant the kernel write permission on COW
+* pages because it has none by default.
+*/
+   if (ret == -EFAULT)
+   __tlb_fixup_write_permission(current->mm, (unsigned long)uaddr);
+
+   return ret;
 }
 
 #endif /* __KERNEL__ */
diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
index e2b428b..3863c6a 100644
--- a/arch/powerpc/include/asm/tlb.h
+++ b/arch/powerpc/include/asm/tlb.h
@@ -45,5 +45,30 @@ static inline void __tlb_remove_tlb_entry(struct mmu_gather *tlb, pte_t *ptep,
 #endif
 }
 
+/* Grant write permission to the kernel on a page. */
+static inline void __tlb_fixup_write_permission(struct mm_struct *mm,
+   unsigned long address)
+{
+#if defined(CONFIG_FSL_BOOKE)
+   /* Grant write permission to the kernel on a page by setting TLB.SW
+* bit, the bit setting operation is tricky here, calling
+* handle_mm_fault with FAULT_FLAG_WRITE causes _PAGE_DIRTY bit of
+* the pte to be set, the _PAGE_DIRTY of the pte is translated into
+* TLB.SW on Powerpc e500 core.
+*/
+
+   struct vm_area_struct *vma;
+
+   vma = find_vma(mm, address);
+   if (likely(vma)) {
+   /* only fixup present page */
+   if (follow_page(vma, address, FOLL_WRITE)) {
+   handle_mm_fault(mm, vma, address, FAULT_FLAG_WRITE);
+   flush_tlb_page(vma, address);
+   }
+   }
+#endif
+}
+
 #endif /* __KERNEL__ */
 #endif /* __ASM_POWERPC_TLB_H */
-- 
1.7.1



Re: [PATCH 0/1] Fixup write permission of TLB on powerpc e500 core

2011-07-15 Thread Shan Hai

On 07/15/2011 04:44 PM, Peter Zijlstra wrote:

On Fri, 2011-07-15 at 16:38 +0800, MailingLists wrote:

On 07/15/2011 04:20 PM, Peter Zijlstra wrote:

On Fri, 2011-07-15 at 16:07 +0800, Shan Hai wrote:

The following test case could reveal a bug in the futex_lock_pi()

BUG: On FUTEX_LOCK_PI, there is an infinite loop in futex_lock_pi()
  on the Powerpc e500 core.
Cause: The linux kernel on the e500 core has no write permission on
  the COW page; refer to the head comment of the following test code.

ftrace on test case:
[000]   353.990181: futex_lock_pi_atomic <-futex_lock_pi
[000]   353.990185: cmpxchg_futex_value_locked <-futex_lock_pi_atomic
[snip]
[000]   353.990191: do_page_fault <-handle_page_fault
[000]   353.990192: bad_page_fault <-handle_page_fault
[000]   353.990193: search_exception_tables <-bad_page_fault
[snip]
[000]   353.990199: get_user_pages <-fault_in_user_writeable
[snip]
[000]   353.990208: mark_page_accessed <-follow_page
[000]   353.990222: futex_lock_pi_atomic <-futex_lock_pi
[snip]
[000]   353.990230: cmpxchg_futex_value_locked <-futex_lock_pi_atomic
[ a loop occurs here ]


But but but but, that get_user_pages(.write=1, .force=0) should result
in a COW break, getting our own writable page.

What is this e500 thing smoking that this doesn't work?

A page can be set read only to the kernel (supervisor, in the powerpc
literature) on the e500, and that is what the kernel does: it sets the
SW (supervisor write) bit in the TLB entry to grant the kernel write
permission on a page.

Furthermore, the SW bit is set according to the DIRTY flag of the PTE.
PTE.DIRTY is set in do_page_fault(), but futex_lock_pi() disables
page faults, so PTE.DIRTY can never be set, and neither can the SW bit:
the COW is unbreakable and an infinite loop follows.

I'm fairly sure fault_in_user_writeable() has PF enabled as it takes
mmap_sem, and pagefault_disable() is akin to preempt_disable() on mainline.

Also get_user_pages() fully expects to be able to schedule, and in fact
can call the full pf handler path all by its lonesome self.


The whole scenario should be:
- the child process triggers a page fault on its first access to the
lock and gets its own writable page, but the page is *clean* because
the access was only to check the status of the lock.
I am sorry for the "unbreakable COW" above.
- futex_lock_pi() is invoked because of the lock contention, and
futex_atomic_cmpxchg_inatomic() tries to take the lock; it finds
the lock free and tries to write to it for reservation, but a page
fault occurs because the page is read-only for the kernel
(e500 specific), and -EFAULT is returned to the caller
- fault_in_user_writeable() tries to fix the fault, but from the
get_user_pages() view everything is OK because the COW was already
broken, so futex_lock_pi_atomic() is retried
- futex_lock_pi_atomic() --> futex_atomic_cmpxchg_inatomic():
another write-protection page fault
- infinite loop

Thanks
Shan Hai




Re: [PATCH 0/1] Fixup write permission of TLB on powerpc e500 core

2011-07-15 Thread Shan Hai

On 07/15/2011 05:50 PM, Peter Zijlstra wrote:

On Fri, 2011-07-15 at 17:08 +0800, Shan Hai wrote:

The whole scenario should be,
- the child process triggers a page fault at the first time access to
  the lock, and it got its own writable page, but its *clean* for
  the reason just for checking the status of the lock.
  I am sorry for above unbreakable COW.
- the futex_lock_pi() is invoked because of the lock contention,
  and the futex_atomic_cmpxchg_inatomic() tries to get the lock,
  it found out the lock is free so tries to write to the lock for
  reservation, a page fault occurs, because the page is read only
  for kernel(e500 specific), and returns -EFAULT to the caller
- the fault_in_user_writeable() tries to fix the fault,
  but from the get_user_pages() view everything is ok, because
  the COW was already broken, retry futex_lock_pi_atomic()

but that's a bug right there, gup(.write=1) _should_ be a complete write
fault, and as such toggle your sw dirty/young tracking.



The fault raised by futex_atomic_cmpxchg_inatomic() happens under
pagefault_disable(), so the page fault handler has no chance to
toggle the SW dirty/young tracking.

Thanks
Shan Hai


- futex_lock_pi_atomic() --> futex_atomic_cmpxchg_inatomic(),
  another write protection page fault
- infinite loop




Re: [PATCH 1/1] Fixup write permission of TLB on powerpc e500 core

2011-07-15 Thread Shan Hai

On 07/15/2011 06:23 AM, Peter Zijlstra wrote:

On Fri, 2011-07-15 at 16:07 +0800, Shan Hai wrote:

The kernel has no write permission on COW pages by default on the e500
core; this causes an endless loop in futex_lock_pi(), because the futex
code assumes the kernel has write permission on COW pages. Grant write
permission to the kernel on COW pages when an access-violation page
fault occurs.

Signed-off-by: Shan Hai <haishan@gmail.com>
---
  arch/powerpc/include/asm/futex.h |   11 ++-
  arch/powerpc/include/asm/tlb.h   |   25 +
  2 files changed, 35 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/include/asm/futex.h b/arch/powerpc/include/asm/futex.h
index c94e4a3..54c3e74 100644
--- a/arch/powerpc/include/asm/futex.h
+++ b/arch/powerpc/include/asm/futex.h
@@ -8,6 +8,7 @@
  #include <asm/errno.h>
  #include <asm/synch.h>
  #include <asm/asm-compat.h>
+#include <asm/tlb.h>

  #define __futex_atomic_op(insn, ret, oldval, uaddr, oparg) \
__asm__ __volatile ( \
@@ -113,7 +114,15 @@ futex_atomic_cmpxchg_inatomic(u32 *uval, u32 __user *uaddr,
  : "cc", "memory");

*uval = prev;
-return ret;
+
+   /* Futex assumes the kernel has permission to write to
+* COW pages, grant the kernel write permission on COW
+* pages because it has none by default.
+*/
+   if (ret == -EFAULT)
+   __tlb_fixup_write_permission(current->mm, (unsigned long)uaddr);
+
+   return ret;
  }

  #endif /* __KERNEL__ */
diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
index e2b428b..3863c6a 100644
--- a/arch/powerpc/include/asm/tlb.h
+++ b/arch/powerpc/include/asm/tlb.h
@@ -45,5 +45,30 @@ static inline void __tlb_remove_tlb_entry(struct mmu_gather *tlb, pte_t *ptep,
  #endif
  }

+/* Grant write permission to the kernel on a page. */
+static inline void __tlb_fixup_write_permission(struct mm_struct *mm,
+   unsigned long address)
+{
+#if defined(CONFIG_FSL_BOOKE)
+   /* Grant write permission to the kernel on a page by setting TLB.SW
+* bit, the bit setting operation is tricky here, calling
+* handle_mm_fault with FAULT_FLAG_WRITE causes _PAGE_DIRTY bit of
+* the pte to be set, the _PAGE_DIRTY of the pte is translated into
+* TLB.SW on Powerpc e500 core.
+*/
+
+   struct vm_area_struct *vma;
+
+   vma = find_vma(mm, address);

Uhm, find_vma() needs mmap_sem, and futex_atomic_cmpxchg_inatomic() is
most certainly not called with that lock held.



My fault, that will be fixed in the V2 patch.


+   if (likely(vma)) {
+   /* only fixup present page */
+   if (follow_page(vma, address, FOLL_WRITE)) {
+   handle_mm_fault(mm, vma, address, FAULT_FLAG_WRITE);

So how can this toggle your sw dirty/young tracking, that's pretty much
what gup(.write=1) does too!



Because the kernel's read-only permission on the page is transparent
to follow_page(), handle_mm_fault() is never invoked from
__get_user_pages(), so gup(.write=1) cannot help to fix up the write
permission.


Thanks
Shan Hai


+   flush_tlb_page(vma, address);
+   }
+   }
+#endif
+}
+
  #endif /* __KERNEL__ */
  #endif /* __ASM_POWERPC_TLB_H */




Re: [PATCH 0/1] Fixup write permission of TLB on powerpc e500 core

2011-07-15 Thread Shan Hai

On 07/15/2011 06:32 AM, David Laight wrote:



The fault causing futex_atomic_cmpxchg_inatomic() is
protected by pagefault_disable(), so the page fault handler has
no chance to toggle the SW dirty/young tracking.

Perhaps that is the bug!
Whatever pagefault_disable() does, it shouldn't disable the
SW dirty/young tracking - which should only need bits moving
in the page table itself (and TLB update??) rather than any
operations on the rest of the data areas.

It looks to me as though this could happen any time a page
is marked inaccessible by the dirty/young tracking.
Not just as a result of COW.



I agree with you, the problem could be triggered by accessing
any user space page which has kernel read-only permission
in a page-fault-disabled context; the problem also affects
architectures which depend on SW dirty/young tracking, as
stated by Benjamin in this thread.

In the e500 case, commit 6cfd8990e27d3a491c1c605d6cbc18a46ae51fef
removed the write permission fixup from the TLB miss handlers and
left it to generic code, so it might be the right time to fix up the
write permission here in the generic code.


Thanks
Shan Hai


David






math_efp.c: Fixed SPE data type conversion failure

2010-11-16 Thread Shan Hai

The following test case failed on Powerpc sbc8548 with CONFIG_SPE

#include <stdio.h>

static float fm;
static signed int si_min = (-2147483647 - 1);
static unsigned int ui;

int main()
{
	fm = (float) si_min;
	ui = (unsigned int) fm;
	printf("ui=%d, should be %d\n", ui, si_min);

	return 0;
}
Result: ui=-1, should be -2147483648

The reason is a failure to emulate the negative float to unsigned integer
conversion instruction in the SPE driver.



[PATCH] Fix SPE float to integer conversion failure

2010-11-16 Thread Shan Hai
Conversion from float to integer should be based on both the instruction
encoding and the sign of the operand.

Signed-off-by: Shan Hai <shan@windriver.com>
---
 arch/powerpc/math-emu/math_efp.c |   12 
 1 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/math-emu/math_efp.c b/arch/powerpc/math-emu/math_efp.c
index 41f4ef3..64faf10 100644
--- a/arch/powerpc/math-emu/math_efp.c
+++ b/arch/powerpc/math-emu/math_efp.c
@@ -320,7 +320,8 @@ int do_spe_mathemu(struct pt_regs *regs)
} else {
_FP_ROUND_ZERO(1, SB);
}
-   FP_TO_INT_S(vc.wp[1], SB, 32, ((func & 0x3) != 0));
+   FP_TO_INT_S(vc.wp[1], SB, 32,
+   (((func & 0x3) != 0) || SB_s));
goto update_regs;
 
default:
@@ -458,7 +459,8 @@ cmp_s:
} else {
_FP_ROUND_ZERO(2, DB);
}
-   FP_TO_INT_D(vc.wp[1], DB, 32, ((func & 0x3) != 0));
+   FP_TO_INT_D(vc.wp[1], DB, 32,
+   (((func & 0x3) != 0) || DB_s));
goto update_regs;
 
default:
@@ -589,8 +591,10 @@ cmp_d:
_FP_ROUND_ZERO(1, SB0);
_FP_ROUND_ZERO(1, SB1);
}
-   FP_TO_INT_S(vc.wp[0], SB0, 32, ((func & 0x3) != 0));
-   FP_TO_INT_S(vc.wp[1], SB1, 32, ((func & 0x3) != 0));
+   FP_TO_INT_S(vc.wp[0], SB0, 32,
+   (((func & 0x3) != 0) || SB0_s));
+   FP_TO_INT_S(vc.wp[1], SB1, 32,
+   (((func & 0x3) != 0) || SB1_s));
goto update_regs;
 
default:
-- 
1.7.0.2
