Re: Generic page fault (Was: libsigsegv ....)

2015-02-28 Thread Benjamin Herrenschmidt
On Sat, 2015-02-28 at 16:41 -0800, Linus Torvalds wrote:
> On Sat, Feb 28, 2015 at 3:02 PM, Benjamin Herrenschmidt
>  wrote:
> >
> > Anyway, here's the current patch:
> 
> Ok, I think I like this approach better.
> 
> Your FAULT_FLAG_EXEC handling is wrong, though. It shouldn't check
> VM_WRITE, it should check VM_EXEC. A bit too much copy-paste ;)

Ah yes indeed :) I would have caught that pretty quickly once I tried
powerpc.

> Btw, it's quite possible that we could just do all the PF_PROT
> handling at the x86 level, before even calling the generic fault
> handler. It's not like we even need the vma or the mm semaphore: if
> it's a non-write protection fault, we always SIGSEGV. So why even
> bother getting the locks and looking up the page tables etc?

For non-write possibly, though that means we always end up going
through handle_mm_fault for a non-present PTE before we decide
that it wasn't actually accessible.

Not a huge deal I suppose unless something (ab)uses mprotect + emulation
for things like spying on MMIO accesses (been there) or god knows what
other userspace based paging or GC might be out there. It's asking for
trouble but the check in access_error() doesn't hurt much either.

On the other hand, we are all about simplifying code here, aren't
we ? :-) It's not like the early bail-out on x86 will buy us much and
it's going to be more arch specific code since they'll all have a
different variant of PF_PROT.

One thing that worries me a bit, however, with the PF_PROT handling in
the generic case is the case of archs who support
present-but-not-accessible PTEs in "HW" (for us that basically means no
_PAGE_USER), as these might use such fault for NUMA balancing, or
software maintained "young" bit, and in those case, bailing out on
PF_PROT is wrong. I'll have to figure out if that ever happens when I
dig through more archs.

> Now, that PF_PROT handling isn't exactly performance-critical, but it
> might help to remove the odd x86 special case from the generic code.

I don't think it's that odd on x86. If you look at what other archs do,
there's a bit of everything out there. Just powerpc may or may not
support exec permission for example depending on the CPU generation and
when it doesn't it's basically similar to x86. I yet have to fully
understand what ARM does (looks simple, just need to make sure it's what
I think it is and check various ARM generations), sparc has its
oddities, etc...

None of that is terribly urgent and I'm pretty busy so give me a bit
more time now that we have some kind of direction, to go through a few
more archs and see where it takes us. I'll start gathering cross
compilers and qemu debian images in the next few days.

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Generic page fault (Was: libsigsegv ....)

2015-02-28 Thread Linus Torvalds
On Sat, Feb 28, 2015 at 3:02 PM, Benjamin Herrenschmidt
 wrote:
>
> Anyway, here's the current patch:

Ok, I think I like this approach better.

Your FAULT_FLAG_EXEC handling is wrong, though. It shouldn't check
VM_WRITE, it should check VM_EXEC. A bit too much copy-paste ;)

Btw, it's quite possible that we could just do all the PF_PROT
handling at the x86 level, before even calling the generic fault
handler. It's not like we even need the vma or the mm semaphore: if
it's a non-write protection fault, we always SIGSEGV. So why even
bother getting the locks and looking up the page tables etc?

Now, that PF_PROT handling isn't exactly performance-critical, but it
might help to remove the odd x86 special case from the generic code.

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Generic page fault (Was: libsigsegv ....)

2015-02-28 Thread Benjamin Herrenschmidt
On Sun, 2015-03-01 at 09:16 +1100, Benjamin Herrenschmidt wrote:
> So for error handling, I'm trying to simply return the VM_FAULT_* flags
> from generic_page_fault see where that takes us. That's a way to avoid
> passing an arch specific struct around. It also allows my hack to
> account major faults with the hypervisor to be done outside the generic
> code completely (no hook).
>
> .../...

Here's what it looks like for x86 only and without completely sorting
out the fatal signal business. However I might still have to do the
arch pointer you mentioned for sparc and possibly other archs, but so
far it looks better already.

Note that if I add that arch pointer, I might stop messing around
or even returning "fault" and instead just return a simple enum
minor,major,error and let inline arch hooks populate the arch pointer
with the error details in whatever fashion the arch prefers. However
I suspect they'll all end up with sig and si_code in there...

Anyway, here's the current patch:

 arch/x86/include/asm/fault.h |  21 
 arch/x86/mm/fault.c  | 233 ---
 include/linux/fault.h|  24 +
 include/linux/mm.h   |   5 +-
 mm/Makefile  |   2 +-
 mm/fault.c   | 196 
 6 files changed, 266 insertions(+), 215 deletions(-)
 create mode 100644 arch/x86/include/asm/fault.h
 create mode 100644 include/linux/fault.h
 create mode 100644 mm/fault.c

diff --git a/arch/x86/include/asm/fault.h b/arch/x86/include/asm/fault.h
new file mode 100644
index 000..04263ec
--- /dev/null
+++ b/arch/x86/include/asm/fault.h
@@ -0,0 +1,21 @@
+#ifndef _ASM_X86_FAULT_H
+#define _ASM_X86_FAULT_H
+
+#include 
+#include 
+
+/* Check if the stack is allowed to grow during a user page fault */
+static inline bool stack_can_grow(struct pt_regs *regs, unsigned long flags,
+ unsigned long address,
+ struct vm_area_struct *vma)
+{
+   /*
+* Accessing the stack below %sp is always a bug.
+* The large cushion allows instructions like enter
+* and pusha to work. ("enter $65535, $31" pushes
+* 32 pointers and then decrements %sp by 65535.)
+*/
+   return address + 65536 + 32 * sizeof(unsigned long) >= regs->sp;
+}
+
+#endif /*  _ASM_X86_FAULT_H */
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index ede025f..19a8a91 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -13,6 +13,7 @@
 #include  /* hstate_index_to_shift*/
 #include /* prefetchw*/
 #include /* exception_enter(), ...   */
+#include 
 
 #include  /* dotraplinkage, ...   */
 #include/* pgd_*(), ... */
@@ -748,8 +749,7 @@ show_signal_msg(struct pt_regs *regs, unsigned long 
error_code,
printk(KERN_CONT "\n");
 }
 
-static void
-__bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
+static void __bad_area(struct pt_regs *regs, unsigned long error_code,
   unsigned long address, int si_code)
 {
struct task_struct *tsk = current;
@@ -804,39 +804,10 @@ __bad_area_nosemaphore(struct pt_regs *regs, unsigned 
long error_code,
no_context(regs, error_code, address, SIGSEGV, si_code);
 }
 
-static noinline void
-bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
-unsigned long address)
-{
-   __bad_area_nosemaphore(regs, error_code, address, SEGV_MAPERR);
-}
-
-static void
-__bad_area(struct pt_regs *regs, unsigned long error_code,
-  unsigned long address, int si_code)
-{
-   struct mm_struct *mm = current->mm;
-
-   /*
-* Something tried to access memory that isn't in our memory map..
-* Fix it, but check if it's kernel or user first..
-*/
-   up_read(>mmap_sem);
-
-   __bad_area_nosemaphore(regs, error_code, address, si_code);
-}
-
-static noinline void
-bad_area(struct pt_regs *regs, unsigned long error_code, unsigned long address)
-{
-   __bad_area(regs, error_code, address, SEGV_MAPERR);
-}
-
-static noinline void
-bad_area_access_error(struct pt_regs *regs, unsigned long error_code,
- unsigned long address)
+static inline void bad_area(struct pt_regs *regs, unsigned long error_code,
+   unsigned long address, int si_code)
 {
-   __bad_area(regs, error_code, address, SEGV_ACCERR);
+   __bad_area(regs, error_code, address, si_code);
 }
 
 static void
@@ -871,40 +842,6 @@ do_sigbus(struct pt_regs *regs, unsigned long error_code, 
unsigned long address,
force_sig_info_fault(SIGBUS, code, address, tsk, fault);
 }
 
-static noinline void
-mm_fault_error(struct pt_regs *regs, unsigned long error_code,
-  unsigned long address, unsigned int fault)
-{
-   if (fatal_signal_pending(current) && 

Re: Generic page fault (Was: libsigsegv ....)

2015-02-28 Thread Benjamin Herrenschmidt
On Sun, 2015-03-01 at 09:16 +1100, Benjamin Herrenschmidt wrote:
> So for error handling, I'm trying to simply return the VM_FAULT_* flags
> from generic_page_fault see where that takes us. That's a way to avoid
> passing an arch specific struct around. It also allows my hack to
> account major faults with the hypervisor to be done outside the generic
> code completely (no hook).

I suspect sparc64 will defeat the "not passing an arch struct around"
though... oh well.

Ben.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Generic page fault (Was: libsigsegv ....)

2015-02-28 Thread Benjamin Herrenschmidt
On Sat, 2015-02-28 at 13:49 -0800, Linus Torvalds wrote:

 .../...

>  - we handle write faults separately (see the first part of access_error()
> 
>  - so now we know it was a read or an instruction fetch
> 
>  - if PF_PROT is set, that means that the present bit was set in the
> page tables, so it must have been an exec access to a NX page
> 
>  - otherwise, we just say "PROTNONE means no access, otherwise
> populate the page tables"
> 
> .. and if it turns out that it was a PF_INSTR to a NX page, we'll end
> up taking the page fault *again* after it's been populated, and now
> since the page table was populated, the access_error() will catch it
> with the PF_PROT case.
> 
> Or something like that. I might have screwed up some detail, but it
> should all work.

I see, it should work yes, I'll still add that FAULT_FLAG_EXEC for
those who can tell reliably but it shouldn't hurt for x86 to not set it.

Cheers,
Ben.


>  Linus
> --
> To unsubscribe from this list: send the line "unsubscribe linux-arch" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Generic page fault (Was: libsigsegv ....)

2015-02-28 Thread Benjamin Herrenschmidt
So for error handling, I'm trying to simply return the VM_FAULT_* flags
from generic_page_fault see where that takes us. That's a way to avoid
passing an arch specific struct around. It also allows my hack to
account major faults with the hypervisor to be done outside the generic
code completely (no hook).

We will process generically some of the flags first such as the repeat
logic or major/minor accounting of course.

For that to work, I'm adding a VM_FAULT_ACCESS (that gets OR'ed with
VM_FAULT_SIGSEGV) to differentiate SEGV_MAPERR and SEGV_ACCERR. So far
so good.

However, I noticed a small discrepancy on x86 in the handling of fatal
signals:

I see two path that can be hit on a fatal signal. The "obvious"
one is the one in access_error() which calls no_context() with a 0
signal argument, the other path is in the retry handling, which will in
this case call no_context() with SIGBUS/BUS_ADRERR. 

Now, the only place in there that seems to care about the signal that
gets passed in is the sig_on_uaccess_error case. On one case (0 sig),
that test will be skipped, on the other case (SIGBUS), that test will be
done and might result in a sigbus being generated, which might override
the original deadly signal (or am I missing something ?)

Now I don't completely understand how the x86 vsyscall stuff works so I
don't know precisely in what circumstances that test matters, I'll need
you help there.

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Generic page fault (Was: libsigsegv ....)

2015-02-28 Thread Linus Torvalds
On Sat, Feb 28, 2015 at 1:14 PM, Benjamin Herrenschmidt
 wrote:
>
> BTW. I fail to see how x86 checks PF_INSTR vs. VM_NOEXEC ... or it doesn't ?

It doesn't. x86 traditionally doesn't have an execute bit, so
traditionally "read == exec".

So PF_INSTR really wasn't historically very useful, in that it would
only show if the *first* access to a page was an instruction fetch -
if you did a regular read to brign the page in, then subsequent
instruction fetches would just work.

Then NX came along, and what happens now is

 - we handle write faults separately (see the first part of access_error()

 - so now we know it was a read or an instruction fetch

 - if PF_PROT is set, that means that the present bit was set in the
page tables, so it must have been an exec access to a NX page

 - otherwise, we just say "PROTNONE means no access, otherwise
populate the page tables"

.. and if it turns out that it was a PF_INSTR to a NX page, we'll end
up taking the page fault *again* after it's been populated, and now
since the page table was populated, the access_error() will catch it
with the PF_PROT case.

Or something like that. I might have screwed up some detail, but it
should all work.

 Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Generic page fault (Was: libsigsegv ....)

2015-02-28 Thread Benjamin Herrenschmidt
On Sat, 2015-02-28 at 11:56 -0800, Linus Torvalds wrote:
> On Fri, Feb 27, 2015 at 11:12 PM, Benjamin Herrenschmidt
>  wrote:
> >
> > Let me know what you think of the approach.
> 
> Hmm. I'm not happy with just how many of those arch wrapper/helper
> functions there are, and some of it looks a bit unportable.
> 
> For example, the code "knows" that "err_code" and "address" are the
> only two architecture-specific pieces of information (in addition to
> "struct pt_regs", of course.
> 
> And the fault_is_user/write() stuff seems unnecessary - we have the
> generic FAULT_FLAG_USER/WRITE flags for that, but instead of passing
> in the generic version to the generic function, we pass in the
> arch-specific ones.

I was trying to keep that fault mask local but it might indeed be a
mistake.

The reasoning is that if you look at more archs (the scary one is
sparc), there is potentially more arch "gunk" that needs to be hooked in
after find_vma()...

And a lot of that need arch specific flags coming in. For example on ppc
I check if the instruction is allowed to grow the stack before we take
the mm sem and pass that information for use by the stack growing check.

Here I could just I suppose add some arch specific fault flags but it
might get messy... 

> The same goes for "access_error()". Right now it's an arch-specific
> helper function, but it actually does some generic tests (like
> checking the vma protections), and I think it could/should be entirely
> generic if we just added the necessary FAULT_FLAG_READ/EXEC/NOTPRESENT
> information to the "flags" register. Just looking at the ppc version,
> my gut feel is that "access_error()" is just horribly error-prone
> as-is even from an architecture standpoint.

I was weary of going down the path of having access_error() be an arch
helper because it gets even wilder with other archs and the crazy
platform flags we deal with on ppc. Here. I was thinking of looking at
factoring that in a second step. You are right that most of it seems
to related to execute permission however (at least sparc, mips, powerpc
and arm).

The way to go would be to define a FAULT_FLAG_EXEC which the arch only
sets if it supports enforcing exec at fault time.

BTW. I fail to see how x86 checks PF_INSTR vs. VM_NOEXEC ... or it doesn't ?

> Yes, the "read/exec/notpresent" bits are a bit weird, but old x86
> isn't the only architecture that doesn't necessarily know the
> difference between read and exec.
> 
> So I'd like a bit more encapsulation. The generic code should never
> really need to look at the arch-specific details, although it is true
> that then the error handling cases will likely need it (in order to
> put it on the arch-specific stack info or similar).

Right, we might still have to pass error_code around, I'll have a look.

> So my *gut* feel is that the generic code should be passed
> 
>  - address and the generic flags (ie FAULT_FLAG_USER/WRITE filled in
> by the caller)
> 
>These are the only things the generic code should need to use itself
>
>  - I guess we do need to pass in "struct pt_regs", because things like
> generic tracing actually want it.

And error handling but it could be in the latter...

>  - an opaque "arch specific fault information structure pointer". Not
> used for anything but passing back to the error functions (ie very
> much *not* for helper functions for the normal case, like the current
> "access_error()" - if it's actually needed for those kinds of things,
> then I'm wrong about the model)

Well, it might be needed for stack growth check. It's also somewhat
means we have yet *another* fault information structure which is a bit
sad but I don't see a good way to fold them into one.

For errors I was thinking instead of returning error info from the
generic code and leaving the generation of oops/signal to the caller,
which means less arch gunk to carry around.

>This would (for x86) contain "error_code", but I have the feeling
> that other architectures might need/want more than one word of
> information.
> 
> But I don't know. Maybe I'm wrong. I don't hate the patch as-is, I
> just have this feeling that it coudl be more "generic", and less
> "random small arch helpers".

I don't like it that much either, it's a first draft to see what the
waters are like.

I'll play around more and see what It comes to.

Cheers,
Ben.

> 
>   Linus
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majord...@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: mailto:"d...@kvack.org;> em...@kvack.org 


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Generic page fault (Was: libsigsegv ....)

2015-02-28 Thread Linus Torvalds
On Sat, Feb 28, 2015 at 11:56 AM, Linus Torvalds
 wrote:
>
> But I don't know. Maybe I'm wrong. I don't hate the patch as-is, I
> just have this feeling that it coudl be more "generic", and less
> "random small arch helpers".

Oh, and I definitely agree with you on the "single handle_bad_fault()"
thing rather than the current
handle_bad_area/handle_kernel_fault/do_sigbus.

 Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Generic page fault (Was: libsigsegv ....)

2015-02-28 Thread Linus Torvalds
On Fri, Feb 27, 2015 at 11:12 PM, Benjamin Herrenschmidt
 wrote:
>
> Let me know what you think of the approach.

Hmm. I'm not happy with just how many of those arch wrapper/helper
functions there are, and some of it looks a bit unportable.

For example, the code "knows" that "err_code" and "address" are the
only two architecture-specific pieces of information (in addition to
"struct pt_regs", of course.

And the fault_is_user/write() stuff seems unnecessary - we have the
generic FAULT_FLAG_USER/WRITE flags for that, but instead of passing
in the generic version to the generic function, we pass in the
arch-specific ones.

The same goes for "access_error()". Right now it's an arch-specific
helper function, but it actually does some generic tests (like
checking the vma protections), and I think it could/should be entirely
generic if we just added the necessary FAULT_FLAG_READ/EXEC/NOTPRESENT
information to the "flags" register. Just looking at the ppc version,
my gut feel is that "access_error()" is just horribly error-prone
as-is even from an architecture standpoint.

Yes, the "read/exec/notpresent" bits are a bit weird, but old x86
isn't the only architecture that doesn't necessarily know the
difference between read and exec.

So I'd like a bit more encapsulation. The generic code should never
really need to look at the arch-specific details, although it is true
that then the error handling cases will likely need it (in order to
put it on the arch-specific stack info or similar).

So my *gut* feel is that the generic code should be passed

 - address and the generic flags (ie FAULT_FLAG_USER/WRITE filled in
by the caller)

   These are the only things the generic code should need to use itself

 - I guess we do need to pass in "struct pt_regs", because things like
generic tracing actually want it.

 - an opaque "arch specific fault information structure pointer". Not
used for anything but passing back to the error functions (ie very
much *not* for helper functions for the normal case, like the current
"access_error()" - if it's actually needed for those kinds of things,
then I'm wrong about the model)

   This would (for x86) contain "error_code", but I have the feeling
that other architectures might need/want more than one word of
information.

But I don't know. Maybe I'm wrong. I don't hate the patch as-is, I
just have this feeling that it coudl be more "generic", and less
"random small arch helpers".

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Generic page fault (Was: libsigsegv ....)

2015-02-28 Thread Benjamin Herrenschmidt
On Sat, 2015-02-28 at 18:14 +1100, Benjamin Herrenschmidt wrote:
> On Sat, 2015-02-28 at 18:12 +1100, Benjamin Herrenschmidt wrote:
> > 
> > Let me know what you think of the approach. It's boot tested on x86_64
> > in qemu and 
> 
>  ... and powerpc64 LE on qemu as well :-)

One thing I'd like to do is fold handle_kernel_fault() into
handle_bad_area() (and in fact fold do_sigbus as well).

Basically have a single handle_bad_fault() that takes sig and si_code as
arguments which we call from the generic code for all faults. It will
test for kernel vs. user mode and do the right thing (and we could
handle the sigbus special case by simply comparing sig to sigbus).

The one reason I haven't done it so far is that x86 handle_bad_area()
has the is_f00f_bug() call which isn't do for other cases of
no_context() and I'm not sure generalizing it is safe for all cases (or
maybe I can call it only when sig is SIGSEGV ?) ... I don't actually
understand what it does :)

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Generic page fault (Was: libsigsegv ....)

2015-02-28 Thread Benjamin Herrenschmidt
On Sat, 2015-02-28 at 18:14 +1100, Benjamin Herrenschmidt wrote:
 On Sat, 2015-02-28 at 18:12 +1100, Benjamin Herrenschmidt wrote:
  
  Let me know what you think of the approach. It's boot tested on x86_64
  in qemu and 
 
  ... and powerpc64 LE on qemu as well :-)

One thing I'd like to do is fold handle_kernel_fault() into
handle_bad_area() (and in fact fold do_sigbus as well).

Basically have a single handle_bad_fault() that takes sig and si_code as
arguments which we call from the generic code for all faults. It will
test for kernel vs. user mode and do the right thing (and we could
handle the sigbus special case by simply comparing sig to sigbus).

The one reason I haven't done it so far is that x86 handle_bad_area()
has the is_f00f_bug() call which isn't do for other cases of
no_context() and I'm not sure generalizing it is safe for all cases (or
maybe I can call it only when sig is SIGSEGV ?) ... I don't actually
understand what it does :)

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Generic page fault (Was: libsigsegv ....)

2015-02-28 Thread Benjamin Herrenschmidt
On Sun, 2015-03-01 at 09:16 +1100, Benjamin Herrenschmidt wrote:
 So for error handling, I'm trying to simply return the VM_FAULT_* flags
 from generic_page_fault see where that takes us. That's a way to avoid
 passing an arch specific struct around. It also allows my hack to
 account major faults with the hypervisor to be done outside the generic
 code completely (no hook).

 .../...

Here's what it looks like for x86 only and without completely sorting
out the fatal signal business. However I might still have to do the
arch pointer you mentioned for sparc and possibly other archs, but so
far it looks better already.

Note that if I add that arch pointer, I might stop messing around
or even returning fault and instead just return a simple enum
minor,major,error and let inline arch hooks populate the arch pointer
with the error details in whatever fashion the arch prefers. However
I suspect they'll all end up with sig and si_code in there...

Anyway, here's the current patch:

 arch/x86/include/asm/fault.h |  21 
 arch/x86/mm/fault.c  | 233 ---
 include/linux/fault.h|  24 +
 include/linux/mm.h   |   5 +-
 mm/Makefile  |   2 +-
 mm/fault.c   | 196 
 6 files changed, 266 insertions(+), 215 deletions(-)
 create mode 100644 arch/x86/include/asm/fault.h
 create mode 100644 include/linux/fault.h
 create mode 100644 mm/fault.c

diff --git a/arch/x86/include/asm/fault.h b/arch/x86/include/asm/fault.h
new file mode 100644
index 000..04263ec
--- /dev/null
+++ b/arch/x86/include/asm/fault.h
@@ -0,0 +1,21 @@
+#ifndef _ASM_X86_FAULT_H
+#define _ASM_X86_FAULT_H
+
+#include linux/types.h
+#include asm/ptrace.h
+
+/* Check if the stack is allowed to grow during a user page fault */
+static inline bool stack_can_grow(struct pt_regs *regs, unsigned long flags,
+ unsigned long address,
+ struct vm_area_struct *vma)
+{
+   /*
+* Accessing the stack below %sp is always a bug.
+* The large cushion allows instructions like enter
+* and pusha to work. (enter $65535, $31 pushes
+* 32 pointers and then decrements %sp by 65535.)
+*/
+   return address + 65536 + 32 * sizeof(unsigned long) = regs-sp;
+}
+
+#endif /*  _ASM_X86_FAULT_H */
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index ede025f..19a8a91 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -13,6 +13,7 @@
 #include linux/hugetlb.h /* hstate_index_to_shift*/
 #include linux/prefetch.h/* prefetchw*/
 #include linux/context_tracking.h/* exception_enter(), ...   */
+#include linux/fault.h
 
 #include asm/traps.h /* dotraplinkage, ...   */
 #include asm/pgalloc.h   /* pgd_*(), ... */
@@ -748,8 +749,7 @@ show_signal_msg(struct pt_regs *regs, unsigned long 
error_code,
printk(KERN_CONT \n);
 }
 
-static void
-__bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
+static void __bad_area(struct pt_regs *regs, unsigned long error_code,
   unsigned long address, int si_code)
 {
struct task_struct *tsk = current;
@@ -804,39 +804,10 @@ __bad_area_nosemaphore(struct pt_regs *regs, unsigned 
long error_code,
no_context(regs, error_code, address, SIGSEGV, si_code);
 }
 
-static noinline void
-bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
-unsigned long address)
-{
-   __bad_area_nosemaphore(regs, error_code, address, SEGV_MAPERR);
-}
-
-static void
-__bad_area(struct pt_regs *regs, unsigned long error_code,
-  unsigned long address, int si_code)
-{
-   struct mm_struct *mm = current-mm;
-
-   /*
-* Something tried to access memory that isn't in our memory map..
-* Fix it, but check if it's kernel or user first..
-*/
-   up_read(mm-mmap_sem);
-
-   __bad_area_nosemaphore(regs, error_code, address, si_code);
-}
-
-static noinline void
-bad_area(struct pt_regs *regs, unsigned long error_code, unsigned long address)
-{
-   __bad_area(regs, error_code, address, SEGV_MAPERR);
-}
-
-static noinline void
-bad_area_access_error(struct pt_regs *regs, unsigned long error_code,
- unsigned long address)
+static inline void bad_area(struct pt_regs *regs, unsigned long error_code,
+   unsigned long address, int si_code)
 {
-   __bad_area(regs, error_code, address, SEGV_ACCERR);
+   __bad_area(regs, error_code, address, si_code);
 }
 
 static void
@@ -871,40 +842,6 @@ do_sigbus(struct pt_regs *regs, unsigned long error_code, 
unsigned long address,
force_sig_info_fault(SIGBUS, code, address, tsk, fault);
 }
 
-static noinline void
-mm_fault_error(struct pt_regs *regs, unsigned long error_code,
-   

Re: Generic page fault (Was: libsigsegv ....)

2015-02-28 Thread Linus Torvalds
On Sat, Feb 28, 2015 at 3:02 PM, Benjamin Herrenschmidt
b...@kernel.crashing.org wrote:

 Anyway, here's the current patch:

Ok, I think I like this approach better.

Your FAULT_FLAG_EXEC handling is wrong, though. It shouldn't check
VM_WRITE, it should check VM_EXEC. A bit too much copy-paste ;)

Btw, it's quite possible that we could just do all the PF_PROT
handling at the x86 level, before even calling the generic fault
handler. It's not like we even need the vma or the mm semaphore: if
it's a non-write protection fault, we always SIGSEGV. So why even
bother getting the locks and looking up the page tables etc?

Now, that PF_PROT handling isn't exactly performance-critical, but it
might help to remove the odd x86 special case from the generic code.

  Linus
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Generic page fault (Was: libsigsegv ....)

2015-02-28 Thread Benjamin Herrenschmidt
On Sat, 2015-02-28 at 11:56 -0800, Linus Torvalds wrote:
 On Fri, Feb 27, 2015 at 11:12 PM, Benjamin Herrenschmidt
 b...@kernel.crashing.org wrote:
 
  Let me know what you think of the approach.
 
 Hmm. I'm not happy with just how many of those arch wrapper/helper
 functions there are, and some of it looks a bit unportable.
 
 For example, the code knows that err_code and address are the
 only two architecture-specific pieces of information (in addition to
 struct pt_regs, of course.
 
 And the fault_is_user/write() stuff seems unnecessary - we have the
 generic FAULT_FLAG_USER/WRITE flags for that, but instead of passing
 in the generic version to the generic function, we pass in the
 arch-specific ones.

I was trying to keep that fault mask local but it might indeed be a
mistake.

The reasoning is that if you look at more archs (the scary one is
sparc), there is potentially more arch gunk that needs to be hooked in
after find_vma()...

And a lot of that need arch specific flags coming in. For example on ppc
I check if the instruction is allowed to grow the stack before we take
the mm sem and pass that information for use by the stack growing check.

Here I could just I suppose add some arch specific fault flags but it
might get messy... 

 The same goes for access_error(). Right now it's an arch-specific
 helper function, but it actually does some generic tests (like
 checking the vma protections), and I think it could/should be entirely
 generic if we just added the necessary FAULT_FLAG_READ/EXEC/NOTPRESENT
 information to the flags register. Just looking at the ppc version,
 my gut feel is that access_error() is just horribly error-prone
 as-is even from an architecture standpoint.

I was weary of going down the path of having access_error() be an arch
helper because it gets even wilder with other archs and the crazy
platform flags we deal with on ppc. Here. I was thinking of looking at
factoring that in a second step. You are right that most of it seems
to related to execute permission however (at least sparc, mips, powerpc
and arm).

The way to go would be to define a FAULT_FLAG_EXEC which the arch only
sets if it supports enforcing exec at fault time.

BTW. I fail to see how x86 checks PF_INSTR vs. VM_NOEXEC ... or it doesn't ?

 Yes, the read/exec/notpresent bits are a bit weird, but old x86
 isn't the only architecture that doesn't necessarily know the
 difference between read and exec.
 
 So I'd like a bit more encapsulation. The generic code should never
 really need to look at the arch-specific details, although it is true
 that then the error handling cases will likely need it (in order to
 put it on the arch-specific stack info or similar).

Right, we might still have to pass error_code around, I'll have a look.

 So my *gut* feel is that the generic code should be passed
 
  - address and the generic flags (ie FAULT_FLAG_USER/WRITE filled in
 by the caller)
 
These are the only things the generic code should need to use itself

  - I guess we do need to pass in struct pt_regs, because things like
 generic tracing actually want it.

And error handling but it could be in the latter...

  - an opaque arch specific fault information structure pointer. Not
 used for anything but passing back to the error functions (ie very
 much *not* for helper functions for the normal case, like the current
 access_error() - if it's actually needed for those kinds of things,
 then I'm wrong about the model)

Well, it might be needed for stack growth check. It's also somewhat
means we have yet *another* fault information structure which is a bit
sad but I don't see a good way to fold them into one.

For errors I was thinking instead of returning error info from the
generic code and leaving the generation of oops/signal to the caller,
which means less arch gunk to carry around.

This would (for x86) contain error_code, but I have the feeling
 that other architectures might need/want more than one word of
 information.
 
 But I don't know. Maybe I'm wrong. I don't hate the patch as-is, I
 just have this feeling that it coudl be more generic, and less
 random small arch helpers.

I don't like it that much either, it's a first draft to see what the
waters are like.

I'll play around more and see what It comes to.

Cheers,
Ben.

 
   Linus
 
 --
 To unsubscribe, send a message with 'unsubscribe linux-mm' in
 the body to majord...@kvack.org.  For more info on Linux MM,
 see: http://www.linux-mm.org/ .
 Don't email: a href=mailto:d...@kvack.org; em...@kvack.org /a


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Generic page fault (Was: libsigsegv ....)

2015-02-28 Thread Benjamin Herrenschmidt
On Sun, 2015-03-01 at 09:16 +1100, Benjamin Herrenschmidt wrote:
 So for error handling, I'm trying to simply return the VM_FAULT_* flags
 from generic_page_fault see where that takes us. That's a way to avoid
 passing an arch specific struct around. It also allows my hack to
 account major faults with the hypervisor to be done outside the generic
 code completely (no hook).

I suspect sparc64 will defeat the not passing an arch struct around
though... oh well.

Ben.


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Generic page fault (Was: libsigsegv ....)

2015-02-28 Thread Benjamin Herrenschmidt
On Sat, 2015-02-28 at 13:49 -0800, Linus Torvalds wrote:

 .../...

  - we handle write faults separately (see the first part of access_error()
 
  - so now we know it was a read or an instruction fetch
 
  - if PF_PROT is set, that means that the present bit was set in the
 page tables, so it must have been an exec access to a NX page
 
  - otherwise, we just say PROTNONE means no access, otherwise
 populate the page tables
 
 .. and if it turns out that it was a PF_INSTR to a NX page, we'll end
 up taking the page fault *again* after it's been populated, and now
 since the page table was populated, the access_error() will catch it
 with the PF_PROT case.
 
 Or something like that. I might have screwed up some detail, but it
 should all work.

I see, it should work yes, I'll still add that FAULT_FLAG_EXEC for
those who can tell reliably but it shouldn't hurt for x86 to not set it.

Cheers,
Ben.


  Linus
 --
 To unsubscribe from this list: send the line unsubscribe linux-arch in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Generic page fault (Was: libsigsegv ....)

2015-02-28 Thread Linus Torvalds
On Sat, Feb 28, 2015 at 1:14 PM, Benjamin Herrenschmidt
b...@kernel.crashing.org wrote:

 BTW. I fail to see how x86 checks PF_INSTR vs. VM_NOEXEC ... or it doesn't ?

It doesn't. x86 traditionally doesn't have an execute bit, so
traditionally read == exec.

So PF_INSTR really wasn't historically very useful, in that it would
only show if the *first* access to a page was an instruction fetch -
if you did a regular read to brign the page in, then subsequent
instruction fetches would just work.

Then NX came along, and what happens now is

 - we handle write faults separately (see the first part of access_error()

 - so now we know it was a read or an instruction fetch

 - if PF_PROT is set, that means that the present bit was set in the
page tables, so it must have been an exec access to a NX page

 - otherwise, we just say PROTNONE means no access, otherwise
populate the page tables

.. and if it turns out that it was a PF_INSTR to a NX page, we'll end
up taking the page fault *again* after it's been populated, and now
since the page table was populated, the access_error() will catch it
with the PF_PROT case.

Or something like that. I might have screwed up some detail, but it
should all work.

 Linus
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Generic page fault (Was: libsigsegv ....)

2015-02-28 Thread Benjamin Herrenschmidt
So for error handling, I'm trying to simply return the VM_FAULT_* flags
from generic_page_fault see where that takes us. That's a way to avoid
passing an arch specific struct around. It also allows my hack to
account major faults with the hypervisor to be done outside the generic
code completely (no hook).

We will process generically some of the flags first such as the repeat
logic or major/minor accounting of course.

For that to work, I'm adding a VM_FAULT_ACCESS (that gets OR'ed with
VM_FAULT_SIGSEGV) to differentiate SEGV_MAPERR and SEGV_ACCERR. So far
so good.

However, I noticed a small discrepancy on x86 in the handling of fatal
signals:

I see two path that can be hit on a fatal signal. The obvious
one is the one in access_error() which calls no_context() with a 0
signal argument, the other path is in the retry handling, which will in
this case call no_context() with SIGBUS/BUS_ADRERR. 

Now, the only place in there that seems to care about the signal that
gets passed in is the sig_on_uaccess_error case. On one case (0 sig),
that test will be skipped, on the other case (SIGBUS), that test will be
done and might result in a sigbus being generated, which might override
the original deadly signal (or am I missing something ?)

Now I don't completely understand how the x86 vsyscall stuff works so I
don't know precisely in what circumstances that test matters, I'll need
you help there.

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Generic page fault (Was: libsigsegv ....)

2015-02-28 Thread Linus Torvalds
On Sat, Feb 28, 2015 at 11:56 AM, Linus Torvalds
torva...@linux-foundation.org wrote:

 But I don't know. Maybe I'm wrong. I don't hate the patch as-is, I
 just have this feeling that it coudl be more generic, and less
 random small arch helpers.

Oh, and I definitely agree with you on the single handle_bad_fault()
thing rather than the current
handle_bad_area/handle_kernel_fault/do_sigbus.

 Linus
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Generic page fault (Was: libsigsegv ....)

2015-02-28 Thread Linus Torvalds
On Fri, Feb 27, 2015 at 11:12 PM, Benjamin Herrenschmidt
b...@kernel.crashing.org wrote:

 Let me know what you think of the approach.

Hmm. I'm not happy with just how many of those arch wrapper/helper
functions there are, and some of it looks a bit unportable.

For example, the code knows that err_code and address are the
only two architecture-specific pieces of information (in addition to
struct pt_regs, of course.

And the fault_is_user/write() stuff seems unnecessary - we have the
generic FAULT_FLAG_USER/WRITE flags for that, but instead of passing
in the generic version to the generic function, we pass in the
arch-specific ones.

The same goes for access_error(). Right now it's an arch-specific
helper function, but it actually does some generic tests (like
checking the vma protections), and I think it could/should be entirely
generic if we just added the necessary FAULT_FLAG_READ/EXEC/NOTPRESENT
information to the flags register. Just looking at the ppc version,
my gut feel is that access_error() is just horribly error-prone
as-is even from an architecture standpoint.

Yes, the read/exec/notpresent bits are a bit weird, but old x86
isn't the only architecture that doesn't necessarily know the
difference between read and exec.

So I'd like a bit more encapsulation. The generic code should never
really need to look at the arch-specific details, although it is true
that then the error handling cases will likely need it (in order to
put it on the arch-specific stack info or similar).

So my *gut* feel is that the generic code should be passed

 - address and the generic flags (ie FAULT_FLAG_USER/WRITE filled in
by the caller)

   These are the only things the generic code should need to use itself

 - I guess we do need to pass in struct pt_regs, because things like
generic tracing actually want it.

 - an opaque arch specific fault information structure pointer. Not
used for anything but passing back to the error functions (ie very
much *not* for helper functions for the normal case, like the current
access_error() - if it's actually needed for those kinds of things,
then I'm wrong about the model)

   This would (for x86) contain error_code, but I have the feeling
that other architectures might need/want more than one word of
information.

But I don't know. Maybe I'm wrong. I don't hate the patch as-is, I
just have this feeling that it coudl be more generic, and less
random small arch helpers.

  Linus
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Generic page fault (Was: libsigsegv ....)

2015-02-28 Thread Benjamin Herrenschmidt
On Sat, 2015-02-28 at 16:41 -0800, Linus Torvalds wrote:
 On Sat, Feb 28, 2015 at 3:02 PM, Benjamin Herrenschmidt
 b...@kernel.crashing.org wrote:
 
  Anyway, here's the current patch:
 
 Ok, I think I like this approach better.
 
 Your FAULT_FLAG_EXEC handling is wrong, though. It shouldn't check
 VM_WRITE, it should check VM_EXEC. A bit too much copy-paste ;)

Ah yes indeed :) I would have caught that pretty quickly once I tried
powerpc.

 Btw, it's quite possible that we could just do all the PF_PROT
 handling at the x86 level, before even calling the generic fault
 handler. It's not like we even need the vma or the mm semaphore: if
 it's a non-write protection fault, we always SIGSEGV. So why even
 bother getting the locks and looking up the page tables etc?

For non-write possibly, though that means we always end up going
through handle_mm_fault for a non-present PTE before we decide
that it wasn't actually accessible.

Not a huge deal I suppose unless something (ab)uses mprotect + emulation
for things like spying on MMIO accesses (been there) or god knows what
other userspace based paging or GC might be out there. It's asking for
trouble but the check in access_error() doesn't hurt much either.

On the other hand, we are all about simplifying code here, aren't
we ? :-) It's not like the early bail-out on x86 will buy us much and
it's going to be more arch specific code since they'll all have a
different variant of PF_PROT.

One thing that worries me a bit, however, with the PF_PROT handling in
the generic case is the case of archs who support
present-but-not-accessible PTEs in HW (for us that basically means no
_PAGE_USER), as these might use such fault for NUMA balancing, or
software maintained young bit, and in those case, bailing out on
PF_PROT is wrong. I'll have to figure out if that ever happens when I
dig through more archs.

 Now, that PF_PROT handling isn't exactly performance-critical, but it
 might help to remove the odd x86 special case from the generic code.

I don't think it's that odd on x86. If you look at what other archs do,
there's a bit of everything out there. Just powerpc may or may not
support exec permission for example depending on the CPU generation and
when it doesn't it's basically similar to x86. I yet have to fully
understand what ARM does (looks simple, just need to make sure it's what
I think it is and check various ARM generations), sparc has its
oddities, etc...

None of that is terribly urgent and I'm pretty busy so give me a bit
more time now that we have some kind of direction, to go through a few
more archs and see where it takes us. I'll start gathering cross
compilers and qemu debian images in the next few days.

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Generic page fault (Was: libsigsegv ....)

2015-02-27 Thread Benjamin Herrenschmidt
On Sat, 2015-02-28 at 18:12 +1100, Benjamin Herrenschmidt wrote:
> 
> Let me know what you think of the approach. It's boot tested on x86_64
> in qemu and 

 ... and powerpc64 LE on qemu as well :-)

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Generic page fault (Was: libsigsegv ....)

2015-02-27 Thread Benjamin Herrenschmidt
On Sun, 2015-02-01 at 17:09 -0800, Linus Torvalds wrote:
> 
> Of course, what I *really* want would be to make a new
> "generic_mm_fault()" helper that would do all the normal stuff:
> 
>  - find_vma()
>  - check permissions and ranges
>  - call 'handle_mm_fault()'
>  - do the proper error, retry and minor/major fault handling
> 
> and then most architectures could just call that.

So I spent a bit of time today while the kids were playing quietly (it
does happen !), and came up with this (very) draft pair of patches
for x86 and powerpc. It's by no mean a finished product as you can see,
but it shows how "messy" things get. Basically a bit more messy than I
originally thought but it's not *too* bad either.

Let me know what you think of the approach. It's boot tested on x86_64
in qemu and .

Next I think I'll tackle ARM, test a bit more, clean a few things up and
submit, but by all means, please provide feedback on the approach before
that :)

I'm attaching the patches for now (there are two and I don't feel like
posting two emails on that subject. The final stuff will be broken down
in smaller bits).

Cheers,
Ben.

From 1e3060ecdb479a3dfd587a5870e0351e0b1b5ddc Mon Sep 17 00:00:00 2001
From: Benjamin Herrenschmidt 
Date: Sat, 28 Feb 2015 17:38:17 +1100
Subject: [PATCH 1/2] Move bulk of x86 __do_page_fault() to a
 generic_page_fault()

(Add add various hooks that other archs will need etc...)

Signed-off-by: Benjamin Herrenschmidt 
---
 arch/x86/include/asm/fault.h |  99 +
 arch/x86/mm/fault.c  | 253 +++
 include/linux/fault.h|  11 ++
 mm/Makefile  |   2 +-
 mm/fault.c   | 171 +
 5 files changed, 296 insertions(+), 240 deletions(-)
 create mode 100644 arch/x86/include/asm/fault.h
 create mode 100644 include/linux/fault.h
 create mode 100644 mm/fault.c

diff --git a/arch/x86/include/asm/fault.h b/arch/x86/include/asm/fault.h
new file mode 100644
index 000..7c1712e1
--- /dev/null
+++ b/arch/x86/include/asm/fault.h
@@ -0,0 +1,99 @@
+#ifndef _ASM_X86_FAULT_H
+#define _ASM_X86_FAULT_H
+
+#include 
+#include 
+
+/*
+ * Page fault error code bits:
+ *
+ *   bit 0 ==	 0: no page found	1: protection fault
+ *   bit 1 ==	 0: read access		1: write access
+ *   bit 2 ==	 0: kernel-mode access	1: user-mode access
+ *   bit 3 ==1: use of reserved bit detected
+ *   bit 4 ==1: fault was an instruction fetch
+ */
+enum x86_pf_error_code {
+
+	PF_PROT		=		1 << 0,
+	PF_WRITE	=		1 << 1,
+	PF_USER		=		1 << 2,
+	PF_RSVD		=		1 << 3,
+	PF_INSTR	=		1 << 4,
+};
+
+static inline bool fault_is_user(struct pt_regs *regs, unsigned long err_code)
+{
+	return err_code & PF_USER;
+}
+
+static inline bool fault_is_write(struct pt_regs *regs, unsigned long err_code)
+{
+	return err_code & PF_WRITE;
+}
+
+/* Return type for do_page_fault */
+typedef void gpf_ret_t;
+
+#define FAULT_NO_ERR
+
+/* Check if the stack is allowed to grow during a user page fault */
+static inline bool stack_can_grow(struct pt_regs *regs, unsigned long err_code,
+  unsigned long address,
+  struct vm_area_struct *vma)
+{
+	/*
+	 * Accessing the stack below %sp is always a bug.
+	 * The large cushion allows instructions like enter
+	 * and pusha to work. ("enter $65535, $31" pushes
+	 * 32 pointers and then decrements %sp by 65535.)
+	 */
+	return address + 65536 + 32 * sizeof(unsigned long) >= regs->sp;
+}
+
+/* Access validity check */
+static inline bool access_error(struct pt_regs *regs, unsigned long err_code,
+struct vm_area_struct *vma)
+{
+	if (err_code & PF_WRITE) {
+		/* write, present and write, not present: */
+		if (unlikely(!(vma->vm_flags & VM_WRITE)))
+			return true;
+		return false;
+	}
+
+	/* read, present: */
+	if (unlikely(err_code & PF_PROT))
+		return true;
+
+	/* read, not present: */
+	if (unlikely(!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE
+		return true;
+
+	return false;
+}
+
+/* Error handlers */
+
+gpf_ret_t handle_bad_area(struct pt_regs *regs, unsigned long error_code,
+			  unsigned long address, int si_code);
+
+
+void no_context(struct pt_regs *regs, unsigned long error_code,
+		unsigned long address, int signal, int si_code);
+
+static inline gpf_ret_t handle_kernel_fault(struct pt_regs *regs,
+	unsigned long error_code,
+	unsigned long address, int sig,
+	int si_code)
+{
+	no_context(regs, error_code, address, sig, si_code);
+}
+
+gpf_ret_t do_sigbus(struct pt_regs *regs, unsigned long error_code,
+		unsigned long address, unsigned int fault);
+
+static inline void arch_account_major_fault(void) { }
+
+
+#endif /*  _ASM_X86_FAULT_H */
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index ede025f..b7ca60a 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -13,6 +13,7 @@
 #include 		/* hstate_index_to_shift	*/
 #include 		/* prefetchw			*/
 #include 	/* exception_enter(), ...	*/
+#include 
 
 #include 			/* dotraplinkage, 

Generic page fault (Was: libsigsegv ....)

2015-02-27 Thread Benjamin Herrenschmidt
On Sun, 2015-02-01 at 17:09 -0800, Linus Torvalds wrote:
 
 Of course, what I *really* want would be to make a new
 generic_mm_fault() helper that would do all the normal stuff:
 
  - find_vma()
  - check permissions and ranges
  - call 'handle_mm_fault()'
  - do the proper error, retry and minor/major fault handling
 
 and then most architectures could just call that.

So I spent a bit of time today while the kids were playing quietly (it
does happen !), and came up with this (very) draft pair of patches
for x86 and powerpc. It's by no mean a finished product as you can see,
but it shows how messy things get. Basically a bit more messy than I
originally thought but it's not *too* bad either.

Let me know what you think of the approach. It's boot tested on x86_64
in qemu and .

Next I think I'll tackle ARM, test a bit more, clean a few things up and
submit, but by all means, please provide feedback on the approach before
that :)

I'm attaching the patches for now (there are two and I don't feel like
posting two emails on that subject. The final stuff will be broken down
in smaller bits).

Cheers,
Ben.

From 1e3060ecdb479a3dfd587a5870e0351e0b1b5ddc Mon Sep 17 00:00:00 2001
From: Benjamin Herrenschmidt b...@kernel.crashing.org
Date: Sat, 28 Feb 2015 17:38:17 +1100
Subject: [PATCH 1/2] Move bulk of x86 __do_page_fault() to a
 generic_page_fault()

(Add add various hooks that other archs will need etc...)

Signed-off-by: Benjamin Herrenschmidt b...@kernel.crashing.org
---
 arch/x86/include/asm/fault.h |  99 +
 arch/x86/mm/fault.c  | 253 +++
 include/linux/fault.h|  11 ++
 mm/Makefile  |   2 +-
 mm/fault.c   | 171 +
 5 files changed, 296 insertions(+), 240 deletions(-)
 create mode 100644 arch/x86/include/asm/fault.h
 create mode 100644 include/linux/fault.h
 create mode 100644 mm/fault.c

diff --git a/arch/x86/include/asm/fault.h b/arch/x86/include/asm/fault.h
new file mode 100644
index 000..7c1712e1
--- /dev/null
+++ b/arch/x86/include/asm/fault.h
@@ -0,0 +1,99 @@
+#ifndef _ASM_X86_FAULT_H
+#define _ASM_X86_FAULT_H
+
+#include linux/types.h
+#include asm/ptrace.h
+
+/*
+ * Page fault error code bits:
+ *
+ *   bit 0 ==	 0: no page found	1: protection fault
+ *   bit 1 ==	 0: read access		1: write access
+ *   bit 2 ==	 0: kernel-mode access	1: user-mode access
+ *   bit 3 ==1: use of reserved bit detected
+ *   bit 4 ==1: fault was an instruction fetch
+ */
+enum x86_pf_error_code {
+
+	PF_PROT		=		1  0,
+	PF_WRITE	=		1  1,
+	PF_USER		=		1  2,
+	PF_RSVD		=		1  3,
+	PF_INSTR	=		1  4,
+};
+
+static inline bool fault_is_user(struct pt_regs *regs, unsigned long err_code)
+{
+	return err_code  PF_USER;
+}
+
+static inline bool fault_is_write(struct pt_regs *regs, unsigned long err_code)
+{
+	return err_code  PF_WRITE;
+}
+
+/* Return type for do_page_fault */
+typedef void gpf_ret_t;
+
+#define FAULT_NO_ERR
+
+/* Check if the stack is allowed to grow during a user page fault */
+static inline bool stack_can_grow(struct pt_regs *regs, unsigned long err_code,
+  unsigned long address,
+  struct vm_area_struct *vma)
+{
+	/*
+	 * Accessing the stack below %sp is always a bug.
+	 * The large cushion allows instructions like enter
+	 * and pusha to work. (enter $65535, $31 pushes
+	 * 32 pointers and then decrements %sp by 65535.)
+	 */
+	return address + 65536 + 32 * sizeof(unsigned long) = regs-sp;
+}
+
+/* Access validity check */
+static inline bool access_error(struct pt_regs *regs, unsigned long err_code,
+struct vm_area_struct *vma)
+{
+	if (err_code  PF_WRITE) {
+		/* write, present and write, not present: */
+		if (unlikely(!(vma-vm_flags  VM_WRITE)))
+			return true;
+		return false;
+	}
+
+	/* read, present: */
+	if (unlikely(err_code  PF_PROT))
+		return true;
+
+	/* read, not present: */
+	if (unlikely(!(vma-vm_flags  (VM_READ | VM_EXEC | VM_WRITE
+		return true;
+
+	return false;
+}
+
+/* Error handlers */
+
+gpf_ret_t handle_bad_area(struct pt_regs *regs, unsigned long error_code,
+			  unsigned long address, int si_code);
+
+
+void no_context(struct pt_regs *regs, unsigned long error_code,
+		unsigned long address, int signal, int si_code);
+
+static inline gpf_ret_t handle_kernel_fault(struct pt_regs *regs,
+	unsigned long error_code,
+	unsigned long address, int sig,
+	int si_code)
+{
+	no_context(regs, error_code, address, sig, si_code);
+}
+
+gpf_ret_t do_sigbus(struct pt_regs *regs, unsigned long error_code,
+		unsigned long address, unsigned int fault);
+
+static inline void arch_account_major_fault(void) { }
+
+
+#endif /*  _ASM_X86_FAULT_H */
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index ede025f..b7ca60a 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -13,6 +13,7 @@
 #include linux/hugetlb.h		/* hstate_index_to_shift	*/
 #include linux/prefetch.h		/* prefetchw			*/
 #include 

Re: Generic page fault (Was: libsigsegv ....)

2015-02-27 Thread Benjamin Herrenschmidt
On Sat, 2015-02-28 at 18:12 +1100, Benjamin Herrenschmidt wrote:
 
 Let me know what you think of the approach. It's boot tested on x86_64
 in qemu and 

 ... and powerpc64 LE on qemu as well :-)

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/