Re: [RFC PATCH 0/7] Replace _PAGE_NUMA with PAGE_NONE protections

2014-11-19 Thread Mel Gorman
On Tue, Nov 18, 2014 at 12:18:43PM -0500, Sasha Levin wrote:
> On 11/18/2014 11:56 AM, Aneesh Kumar K.V wrote:
> >>> 4. Similarly, does the kernel boot properly without without patches?
> >> >
> >> > Yes, the kernel works fine without the patches both with and without fake
> >> > numa.
> > 
> > Hmm that is interesting. I am not sure how writeback_fid can be
> > related. We use writeback fid to enable client side caching with 9p
> > (cache=loose). We use this fid to write back dirty pages later. Can you
> > share the qemu command line used, 9p mount options and the test details ? 
> 
> I'm using kvmtool rather than qemu. rootfs is created via kernel parameters:
> 
> root=/dev/root rw rootflags=rw,trans=virtio,version=9p2000.L rootfstype=9p
> 
> The test is just running trinity, there's no 9p or mm specific test going on.
> 
> I've attached my .config.
> 

Ok, based on that I was able to reproduce the problem. I hope to have a
V2 before the end of the week. Thanks.

-- 
Mel Gorman
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 0/7] Replace _PAGE_NUMA with PAGE_NONE protections

2014-11-19 Thread Mel Gorman
On Tue, Nov 18, 2014 at 12:18:43PM -0500, Sasha Levin wrote:
 On 11/18/2014 11:56 AM, Aneesh Kumar K.V wrote:
  4. Similarly, does the kernel boot properly without without patches?
  
   Yes, the kernel works fine without the patches both with and without fake
   numa.
  
  Hmm that is interesting. I am not sure how writeback_fid can be
  related. We use writeback fid to enable client side caching with 9p
  (cache=loose). We use this fid to write back dirty pages later. Can you
  share the qemu command line used, 9p mount options and the test details ? 
 
 I'm using kvmtool rather than qemu. rootfs is created via kernel parameters:
 
 root=/dev/root rw rootflags=rw,trans=virtio,version=9p2000.L rootfstype=9p
 
 The test is just running trinity, there's no 9p or mm specific test going on.
 
 I've attached my .config.
 

Ok, based on that I was able to reproduce the problem. I hope to have a
V2 before the end of the week. Thanks.

-- 
Mel Gorman
SUSE Labs
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 0/7] Replace _PAGE_NUMA with PAGE_NONE protections

2014-11-18 Thread Sasha Levin
On 11/18/2014 11:56 AM, Aneesh Kumar K.V wrote:
>>> 4. Similarly, does the kernel boot properly without without patches?
>> >
>> > Yes, the kernel works fine without the patches both with and without fake
>> > numa.
> 
> Hmm that is interesting. I am not sure how writeback_fid can be
> related. We use writeback fid to enable client side caching with 9p
> (cache=loose). We use this fid to write back dirty pages later. Can you
> share the qemu command line used, 9p mount options and the test details ? 

I'm using kvmtool rather than qemu. rootfs is created via kernel parameters:

root=/dev/root rw rootflags=rw,trans=virtio,version=9p2000.L rootfstype=9p

The test is just running trinity, there's no 9p or mm specific test going on.

I've attached my .config.


Thanks,
Sasha


config-sasha.gz
Description: application/gzip


Re: [RFC PATCH 0/7] Replace _PAGE_NUMA with PAGE_NONE protections

2014-11-18 Thread Mel Gorman
On Tue, Nov 18, 2014 at 10:26:41PM +0530, Aneesh Kumar K.V wrote:
> Sasha Levin  writes:
> 
> > On 11/18/2014 10:42 AM, Mel Gorman wrote:
> >> 1. I'm assuming this is a KVM setup but can you confirm?
> >
> > Yes.
> >
> >> 2. Are you using numa=fake=N?
> >
> > Yes. numa=fake=24, which is probably way more nodes on any physical machine
> > than the new code was tested on?
> >
> >> 3. If you are using fake NUMA, what happens if you boot without it as
> >>that should make the patches a no-op?
> >
> > Nope, still seeing it without fake numa.
> >
> >> 4. Similarly, does the kernel boot properly without without patches?
> >
> > Yes, the kernel works fine without the patches both with and without fake
> > numa.
> 
> 
> Hmm that is interesting. I am not sure how writeback_fid can be
> related. We use writeback fid to enable client side caching with 9p
> (cache=loose). We use this fid to write back dirty pages later. Can you
> share the qemu command line used, 9p mount options and the test details ? 
> 

It would help if the test details included the kernel config. I got KVM
working again on an server with an older installation and while it
doesn't use 9p, I'm not seeing any other oddities either yet while
running trinity.

-- 
Mel Gorman
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 0/7] Replace _PAGE_NUMA with PAGE_NONE protections

2014-11-18 Thread Mel Gorman
On Tue, Nov 18, 2014 at 10:03:30PM +0530, Aneesh Kumar K.V wrote:
> > diff --git a/arch/powerpc/mm/copro_fault.c b/arch/powerpc/mm/copro_fault.c
> > index 5a236f0..46152aa 100644
> > --- a/arch/powerpc/mm/copro_fault.c
> > +++ b/arch/powerpc/mm/copro_fault.c
> > @@ -64,7 +64,12 @@ int copro_handle_mm_fault(struct mm_struct *mm, unsigned 
> > long ea,
> > if (!(vma->vm_flags & VM_WRITE))
> > goto out_unlock;
> > } else {
> > -   if (dsisr & DSISR_PROTFAULT)
> > +   /*
> > +* protfault should only happen due to us
> > +* mapping a region readonly temporarily. PROT_NONE
> > +* is also covered by the VMA check above.
> > +*/
> > +   if (WARN_ON_ONCE(dsisr & DSISR_PROTFAULT))
> > goto out_unlock;
> > if (!(vma->vm_flags & (VM_READ | VM_EXEC)))
> > goto out_unlock;
> 
> 
> we should do that DSISR_PROTFAILT check after vma->vm_flags. It is not
> that we will not hit DSISR_PROTFAULT, what we want to ensure here is that
> we get a prot fault only for cases convered by that vma check. So
> everything should be taking the if (!(vma->vm_flags & (VM_READ |
> VM_EXEC))) branch if it is a protfault. If not we would like to know
> about that. And hence the idea of not using WARN_ON_ONCE. I was also not
> sure whether we want to enable that always. The reason for keeping that
> within CONFIG_DEBUG_VM is to make sure that nobody ends up depending on
> PROTFAULT outside the vma check convered. So expectations is that
> developers working on feature will run with DEBUG_VM enable and finds
> this warning. We don't expect to hit this otherwise.
> 

/me slaps self. It's clear now and updated accordingly. Thanks.

-- 
Mel Gorman
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 0/7] Replace _PAGE_NUMA with PAGE_NONE protections

2014-11-18 Thread Aneesh Kumar K.V
Sasha Levin  writes:

> On 11/18/2014 10:42 AM, Mel Gorman wrote:
>> 1. I'm assuming this is a KVM setup but can you confirm?
>
> Yes.
>
>> 2. Are you using numa=fake=N?
>
> Yes. numa=fake=24, which is probably way more nodes on any physical machine
> than the new code was tested on?
>
>> 3. If you are using fake NUMA, what happens if you boot without it as
>>that should make the patches a no-op?
>
> Nope, still seeing it without fake numa.
>
>> 4. Similarly, does the kernel boot properly without without patches?
>
> Yes, the kernel works fine without the patches both with and without fake
> numa.


Hmm that is interesting. I am not sure how writeback_fid can be
related. We use writeback fid to enable client side caching with 9p
(cache=loose). We use this fid to write back dirty pages later. Can you
share the qemu command line used, 9p mount options and the test details ? 


-aneesh

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 0/7] Replace _PAGE_NUMA with PAGE_NONE protections

2014-11-18 Thread Sasha Levin
On 11/18/2014 10:42 AM, Mel Gorman wrote:
> 1. I'm assuming this is a KVM setup but can you confirm?

Yes.

> 2. Are you using numa=fake=N?

Yes. numa=fake=24, which is probably way more nodes on any physical machine
than the new code was tested on?

> 3. If you are using fake NUMA, what happens if you boot without it as
>that should make the patches a no-op?

Nope, still seeing it without fake numa.

> 4. Similarly, does the kernel boot properly without without patches?

Yes, the kernel works fine without the patches both with and without fake
numa.

> 5. Are any other patches applied because the line numbers are not lining
>up exactly?

I have quite a few more patches on top of next, but they're debug patches
that add VM_BUG_ONs in quite a few places.

One thing that was odd is that your patches had merge conflicts when applied
on -next in mm/huge-memory.c, so maybe that's where line number differences
are coming from.

> 6. As my own KVM setup appears broken, can you tell me if the host
>kernel has changed recently? If so, does using an older host kernel
>make a difference?

Nope, I've been using the same host kernel (Ubuntu's 3.16.0-24-generic #32)
for a while now.

> At the moment I'm scratching my head trying to figure out how the
> patches could break 9p like this as I don't believe KVM is doing any
> tricks with the same bits that could result in loss.

This issue reproduces rather easily, I'd be happy to try out debug patches
rather than having you guess at what might have gone wrong.


Thanks,
Sasha
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 0/7] Replace _PAGE_NUMA with PAGE_NONE protections

2014-11-18 Thread Aneesh Kumar K.V
Mel Gorman  writes:

> On Mon, Nov 17, 2014 at 01:56:19PM +0530, Aneesh Kumar K.V wrote:
>> Mel Gorman  writes:
>> 
>> > This is follow up from the "pipe/page fault oddness" thread.
>> >
>> > Automatic NUMA balancing depends on being able to protect PTEs to trap a
>> > fault and gather reference locality information. Very broadly speaking it
>> > would mark PTEs as not present and use another bit to distinguish between
>> > NUMA hinting faults and other types of faults. It was universally loved
>> > by everybody and caused no problems whatsoever. That last sentence might
>> > be a lie.
>> >
>> > This series is very heavily based on patches from Linus and Aneesh to
>> > replace the existing PTE/PMD NUMA helper functions with normal change
>> > protections. I did alter and add parts of it but I consider them relatively
>> > minor contributions. Note that the signed-offs here need addressing. I
>> > couldn't use "From" or Signed-off-by from the original authors as the
>> > patches had to be broken up and they were never signed off. I expect the
>> > two people involved will just stick their signed-off-by on it.
>> 
>> 
>> How about the additional change listed below for ppc64 ? One part of the
>> patch is to make sure that we don't hit the WARN_ON in set_pte and set_pmd
>> because we find the _PAGE_PRESENT bit set in case of numa fault. I
>> ended up relaxing the check there.
>> 
>
> I folded the set_pte_at and set_pmd_at changes into the patch "mm: Convert
> p[te|md]_numa users to p[te|md]_protnone_numa" with one change -- both
> set_pte_at and set_pmd_at checks are under CONFIG_DEBUG_VM for consistency.
>
>> Second part of the change is to add a WARN_ON to make sure we are
>> not depending on DSISR_PROTFAULT for anything else. We ideally should not
>> get a DSISR_PROTFAULT for PROT_NONE or NUMA fault. hash_page_mm do check
>> whether the access is allowed by pte before inserting a pte into hash
>> page table. Hence we will never find a PROT_NONE or PROT_NONE_NUMA ptes
>> in hash page table. But it is good to run with VM_WARN_ON ?
>> 
>
> Due to the nature of the check and when they are hit, I converted it to
> a WARN_ON_ONCE. Due to the exceptional circumstance the overhead should
> be non-existant and shouldn't need to be hidden below VM_WARN_ON. I also
> noted that with the patch the kernel  potentially no longer recovers
> from this exceptional cirsumstance and instead falls through. To avoid
> this, I preserved the "goto out_unlock".
>
> Is this still ok?
>
> ---8<---
> ppc64: Add paranoid warnings for unexpected DSISR_PROTFAULT
>
> ppc64 should not be depending on DSISR_PROTFAULT and it's unexpected
> if they are triggered. This patch adds warnings just in case they
> are being accidentally depended upon.
>
> Requires-signed-off-by: Aneesh Kumar K.V 
> Signed-off-by: Mel Gorman 
> ---
>  arch/powerpc/mm/copro_fault.c |  7 ++-
>  arch/powerpc/mm/fault.c   | 20 +---
>  2 files changed, 15 insertions(+), 12 deletions(-)
>
> diff --git a/arch/powerpc/mm/copro_fault.c b/arch/powerpc/mm/copro_fault.c
> index 5a236f0..46152aa 100644
> --- a/arch/powerpc/mm/copro_fault.c
> +++ b/arch/powerpc/mm/copro_fault.c
> @@ -64,7 +64,12 @@ int copro_handle_mm_fault(struct mm_struct *mm, unsigned 
> long ea,
>   if (!(vma->vm_flags & VM_WRITE))
>   goto out_unlock;
>   } else {
> - if (dsisr & DSISR_PROTFAULT)
> + /*
> +  * protfault should only happen due to us
> +  * mapping a region readonly temporarily. PROT_NONE
> +  * is also covered by the VMA check above.
> +  */
> + if (WARN_ON_ONCE(dsisr & DSISR_PROTFAULT))
>   goto out_unlock;
>   if (!(vma->vm_flags & (VM_READ | VM_EXEC)))
>   goto out_unlock;


we should do that DSISR_PROTFAILT check after vma->vm_flags. It is not
that we will not hit DSISR_PROTFAULT, what we want to ensure here is that
we get a prot fault only for cases convered by that vma check. So
everything should be taking the if (!(vma->vm_flags & (VM_READ |
VM_EXEC))) branch if it is a protfault. If not we would like to know
about that. And hence the idea of not using WARN_ON_ONCE. I was also not
sure whether we want to enable that always. The reason for keeping that
within CONFIG_DEBUG_VM is to make sure that nobody ends up depending on
PROTFAULT outside the vma check convered. So expectations is that
developers working on feature will run with DEBUG_VM enable and finds
this warning. We don't expect to hit this otherwise.

> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
> index 5007497..9d6e0b3 100644
> --- a/arch/powerpc/mm/fault.c
> +++ b/arch/powerpc/mm/fault.c
> @@ -396,17 +396,6 @@ good_area:
>  #endif /* CONFIG_8xx */
>
>   if (is_exec) {
> -#ifdef CONFIG_PPC_STD_MMU
> - /* Protection fault on exec go straight to failure on
> -  * Hash based MMUs as they 

Re: [RFC PATCH 0/7] Replace _PAGE_NUMA with PAGE_NONE protections

2014-11-18 Thread Mel Gorman
On Mon, Nov 17, 2014 at 01:56:19PM +0530, Aneesh Kumar K.V wrote:
> Mel Gorman  writes:
> 
> > This is follow up from the "pipe/page fault oddness" thread.
> >
> > Automatic NUMA balancing depends on being able to protect PTEs to trap a
> > fault and gather reference locality information. Very broadly speaking it
> > would mark PTEs as not present and use another bit to distinguish between
> > NUMA hinting faults and other types of faults. It was universally loved
> > by everybody and caused no problems whatsoever. That last sentence might
> > be a lie.
> >
> > This series is very heavily based on patches from Linus and Aneesh to
> > replace the existing PTE/PMD NUMA helper functions with normal change
> > protections. I did alter and add parts of it but I consider them relatively
> > minor contributions. Note that the signed-offs here need addressing. I
> > couldn't use "From" or Signed-off-by from the original authors as the
> > patches had to be broken up and they were never signed off. I expect the
> > two people involved will just stick their signed-off-by on it.
> 
> 
> How about the additional change listed below for ppc64 ? One part of the
> patch is to make sure that we don't hit the WARN_ON in set_pte and set_pmd
> because we find the _PAGE_PRESENT bit set in case of numa fault. I
> ended up relaxing the check there.
> 

I folded the set_pte_at and set_pmd_at changes into the patch "mm: Convert
p[te|md]_numa users to p[te|md]_protnone_numa" with one change -- both
set_pte_at and set_pmd_at checks are under CONFIG_DEBUG_VM for consistency.

> Second part of the change is to add a WARN_ON to make sure we are
> not depending on DSISR_PROTFAULT for anything else. We ideally should not
> get a DSISR_PROTFAULT for PROT_NONE or NUMA fault. hash_page_mm do check
> whether the access is allowed by pte before inserting a pte into hash
> page table. Hence we will never find a PROT_NONE or PROT_NONE_NUMA ptes
> in hash page table. But it is good to run with VM_WARN_ON ?
> 

Due to the nature of the check and when they are hit, I converted it to
a WARN_ON_ONCE. Due to the exceptional circumstance the overhead should
be non-existant and shouldn't need to be hidden below VM_WARN_ON. I also
noted that with the patch the kernel  potentially no longer recovers
from this exceptional cirsumstance and instead falls through. To avoid
this, I preserved the "goto out_unlock".

Is this still ok?

---8<---
ppc64: Add paranoid warnings for unexpected DSISR_PROTFAULT

ppc64 should not be depending on DSISR_PROTFAULT and it's unexpected
if they are triggered. This patch adds warnings just in case they
are being accidentally depended upon.

Requires-signed-off-by: Aneesh Kumar K.V 
Signed-off-by: Mel Gorman 
---
 arch/powerpc/mm/copro_fault.c |  7 ++-
 arch/powerpc/mm/fault.c   | 20 +---
 2 files changed, 15 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/mm/copro_fault.c b/arch/powerpc/mm/copro_fault.c
index 5a236f0..46152aa 100644
--- a/arch/powerpc/mm/copro_fault.c
+++ b/arch/powerpc/mm/copro_fault.c
@@ -64,7 +64,12 @@ int copro_handle_mm_fault(struct mm_struct *mm, unsigned 
long ea,
if (!(vma->vm_flags & VM_WRITE))
goto out_unlock;
} else {
-   if (dsisr & DSISR_PROTFAULT)
+   /*
+* protfault should only happen due to us
+* mapping a region readonly temporarily. PROT_NONE
+* is also covered by the VMA check above.
+*/
+   if (WARN_ON_ONCE(dsisr & DSISR_PROTFAULT))
goto out_unlock;
if (!(vma->vm_flags & (VM_READ | VM_EXEC)))
goto out_unlock;
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 5007497..9d6e0b3 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -396,17 +396,6 @@ good_area:
 #endif /* CONFIG_8xx */
 
if (is_exec) {
-#ifdef CONFIG_PPC_STD_MMU
-   /* Protection fault on exec go straight to failure on
-* Hash based MMUs as they either don't support per-page
-* execute permission, or if they do, it's handled already
-* at the hash level. This test would probably have to
-* be removed if we change the way this works to make hash
-* processors use the same I/D cache coherency mechanism
-* as embedded.
-*/
-#endif /* CONFIG_PPC_STD_MMU */
-
/*
 * Allow execution from readable areas if the MMU does not
 * provide separate controls over reading and executing.
@@ -421,6 +410,14 @@ good_area:
(cpu_has_feature(CPU_FTR_NOEXECUTE) ||
 !(vma->vm_flags & (VM_READ | VM_WRITE
goto bad_area;
+#ifdef CONFIG_PPC_STD_MMU
+   /*
+* protfault should only happen due to us
+

Re: [RFC PATCH 0/7] Replace _PAGE_NUMA with PAGE_NONE protections

2014-11-18 Thread Mel Gorman
On Fri, Nov 14, 2014 at 10:29:41PM -0500, Sasha Levin wrote:
> On 11/14/2014 08:32 AM, Mel Gorman wrote:> This is follow up from the 
> "pipe/page fault oddness" thread.
> 
> Hi Mel,
> 
> Applying this patch series I've started seeing the following straight away:
> 
> [  367.547848] page:ea0003fb7db0 count:1007 mapcount:1005 
> mapping:8800691f2f58 index:0x37
> [  367.551481] flags: 
> 0x5001aa8030202d(locked|referenced|uptodate|lru|writeback|unevictable|mlocked)
> [  367.555382] page dumped because: VM_BUG_ON_PAGE(!v9inode->writeback_fid)
> [  367.558262] page->mem_cgroup:88006d8a1bd8
> [  367.560403] [ cut here ]
> [  367.562343] kernel BUG at fs/9p/vfs_addr.c:190!
> [  367.564239] invalid opcode:  [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
> [  367.566991] Dumping ftrace buffer:
> [  367.568481](ftrace buffer empty)
> [  367.569914] Modules linked in:
> [  367.570254] CPU: 3 PID: 8234 Comm: kworker/u52:1 Not tainted 
> 3.18.0-rc4-next-20141114-sasha-00054-ga9ff95e-dirty #1459

Thanks Sasha. I don't see a next-20141114 so I looked at next-20141113 and
assuming they are similar. It does not appear that writeback_fid is a struct
page so it's not clear what VM_BUG_ON_PAGE means in this context. Certainly
the fields look screwy but I think it's just accessing garbage.

I tried reproducing this but my KVM setup appears to be broken after an
update and not even able to boot 3.17 properly let alone with the patches. I
still have a few questions though.

1. I'm assuming this is a KVM setup but can you confirm?
2. Are you using numa=fake=N?
3. If you are using fake NUMA, what happens if you boot without it as
   that should make the patches a no-op?
4. Similarly, does the kernel boot properly without without patches?
5. Are any other patches applied because the line numbers are not lining
   up exactly?
6. As my own KVM setup appears broken, can you tell me if the host
   kernel has changed recently? If so, does using an older host kernel
   make a difference?

At the moment I'm scratching my head trying to figure out how the
patches could break 9p like this as I don't believe KVM is doing any
tricks with the same bits that could result in loss.

-- 
Mel Gorman
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 0/7] Replace _PAGE_NUMA with PAGE_NONE protections

2014-11-18 Thread Mel Gorman
On Fri, Nov 14, 2014 at 10:29:41PM -0500, Sasha Levin wrote:
 On 11/14/2014 08:32 AM, Mel Gorman wrote: This is follow up from the 
 pipe/page fault oddness thread.
 
 Hi Mel,
 
 Applying this patch series I've started seeing the following straight away:
 
 [  367.547848] page:ea0003fb7db0 count:1007 mapcount:1005 
 mapping:8800691f2f58 index:0x37
 [  367.551481] flags: 
 0x5001aa8030202d(locked|referenced|uptodate|lru|writeback|unevictable|mlocked)
 [  367.555382] page dumped because: VM_BUG_ON_PAGE(!v9inode-writeback_fid)
 [  367.558262] page-mem_cgroup:88006d8a1bd8
 [  367.560403] [ cut here ]
 [  367.562343] kernel BUG at fs/9p/vfs_addr.c:190!
 [  367.564239] invalid opcode:  [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
 [  367.566991] Dumping ftrace buffer:
 [  367.568481](ftrace buffer empty)
 [  367.569914] Modules linked in:
 [  367.570254] CPU: 3 PID: 8234 Comm: kworker/u52:1 Not tainted 
 3.18.0-rc4-next-20141114-sasha-00054-ga9ff95e-dirty #1459

Thanks Sasha. I don't see a next-20141114 so I looked at next-20141113 and
assuming they are similar. It does not appear that writeback_fid is a struct
page so it's not clear what VM_BUG_ON_PAGE means in this context. Certainly
the fields look screwy but I think it's just accessing garbage.

I tried reproducing this but my KVM setup appears to be broken after an
update and not even able to boot 3.17 properly let alone with the patches. I
still have a few questions though.

1. I'm assuming this is a KVM setup but can you confirm?
2. Are you using numa=fake=N?
3. If you are using fake NUMA, what happens if you boot without it as
   that should make the patches a no-op?
4. Similarly, does the kernel boot properly without without patches?
5. Are any other patches applied because the line numbers are not lining
   up exactly?
6. As my own KVM setup appears broken, can you tell me if the host
   kernel has changed recently? If so, does using an older host kernel
   make a difference?

At the moment I'm scratching my head trying to figure out how the
patches could break 9p like this as I don't believe KVM is doing any
tricks with the same bits that could result in loss.

-- 
Mel Gorman
SUSE Labs
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 0/7] Replace _PAGE_NUMA with PAGE_NONE protections

2014-11-18 Thread Mel Gorman
On Mon, Nov 17, 2014 at 01:56:19PM +0530, Aneesh Kumar K.V wrote:
 Mel Gorman mgor...@suse.de writes:
 
  This is follow up from the pipe/page fault oddness thread.
 
  Automatic NUMA balancing depends on being able to protect PTEs to trap a
  fault and gather reference locality information. Very broadly speaking it
  would mark PTEs as not present and use another bit to distinguish between
  NUMA hinting faults and other types of faults. It was universally loved
  by everybody and caused no problems whatsoever. That last sentence might
  be a lie.
 
  This series is very heavily based on patches from Linus and Aneesh to
  replace the existing PTE/PMD NUMA helper functions with normal change
  protections. I did alter and add parts of it but I consider them relatively
  minor contributions. Note that the signed-offs here need addressing. I
  couldn't use From or Signed-off-by from the original authors as the
  patches had to be broken up and they were never signed off. I expect the
  two people involved will just stick their signed-off-by on it.
 
 
 How about the additional change listed below for ppc64 ? One part of the
 patch is to make sure that we don't hit the WARN_ON in set_pte and set_pmd
 because we find the _PAGE_PRESENT bit set in case of numa fault. I
 ended up relaxing the check there.
 

I folded the set_pte_at and set_pmd_at changes into the patch mm: Convert
p[te|md]_numa users to p[te|md]_protnone_numa with one change -- both
set_pte_at and set_pmd_at checks are under CONFIG_DEBUG_VM for consistency.

 Second part of the change is to add a WARN_ON to make sure we are
 not depending on DSISR_PROTFAULT for anything else. We ideally should not
 get a DSISR_PROTFAULT for PROT_NONE or NUMA fault. hash_page_mm do check
 whether the access is allowed by pte before inserting a pte into hash
 page table. Hence we will never find a PROT_NONE or PROT_NONE_NUMA ptes
 in hash page table. But it is good to run with VM_WARN_ON ?
 

Due to the nature of the check and when they are hit, I converted it to
a WARN_ON_ONCE. Due to the exceptional circumstance the overhead should
be non-existant and shouldn't need to be hidden below VM_WARN_ON. I also
noted that with the patch the kernel  potentially no longer recovers
from this exceptional cirsumstance and instead falls through. To avoid
this, I preserved the goto out_unlock.

Is this still ok?

---8---
ppc64: Add paranoid warnings for unexpected DSISR_PROTFAULT

ppc64 should not be depending on DSISR_PROTFAULT and it's unexpected
if they are triggered. This patch adds warnings just in case they
are being accidentally depended upon.

Requires-signed-off-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
Signed-off-by: Mel Gorman mgor...@suse.de
---
 arch/powerpc/mm/copro_fault.c |  7 ++-
 arch/powerpc/mm/fault.c   | 20 +---
 2 files changed, 15 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/mm/copro_fault.c b/arch/powerpc/mm/copro_fault.c
index 5a236f0..46152aa 100644
--- a/arch/powerpc/mm/copro_fault.c
+++ b/arch/powerpc/mm/copro_fault.c
@@ -64,7 +64,12 @@ int copro_handle_mm_fault(struct mm_struct *mm, unsigned 
long ea,
if (!(vma-vm_flags  VM_WRITE))
goto out_unlock;
} else {
-   if (dsisr  DSISR_PROTFAULT)
+   /*
+* protfault should only happen due to us
+* mapping a region readonly temporarily. PROT_NONE
+* is also covered by the VMA check above.
+*/
+   if (WARN_ON_ONCE(dsisr  DSISR_PROTFAULT))
goto out_unlock;
if (!(vma-vm_flags  (VM_READ | VM_EXEC)))
goto out_unlock;
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 5007497..9d6e0b3 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -396,17 +396,6 @@ good_area:
 #endif /* CONFIG_8xx */
 
if (is_exec) {
-#ifdef CONFIG_PPC_STD_MMU
-   /* Protection fault on exec go straight to failure on
-* Hash based MMUs as they either don't support per-page
-* execute permission, or if they do, it's handled already
-* at the hash level. This test would probably have to
-* be removed if we change the way this works to make hash
-* processors use the same I/D cache coherency mechanism
-* as embedded.
-*/
-#endif /* CONFIG_PPC_STD_MMU */
-
/*
 * Allow execution from readable areas if the MMU does not
 * provide separate controls over reading and executing.
@@ -421,6 +410,14 @@ good_area:
(cpu_has_feature(CPU_FTR_NOEXECUTE) ||
 !(vma-vm_flags  (VM_READ | VM_WRITE
goto bad_area;
+#ifdef CONFIG_PPC_STD_MMU
+   /*
+* protfault should only happen due to us
+

Re: [RFC PATCH 0/7] Replace _PAGE_NUMA with PAGE_NONE protections

2014-11-18 Thread Aneesh Kumar K.V
Mel Gorman mgor...@suse.de writes:

 On Mon, Nov 17, 2014 at 01:56:19PM +0530, Aneesh Kumar K.V wrote:
 Mel Gorman mgor...@suse.de writes:
 
  This is follow up from the pipe/page fault oddness thread.
 
  Automatic NUMA balancing depends on being able to protect PTEs to trap a
  fault and gather reference locality information. Very broadly speaking it
  would mark PTEs as not present and use another bit to distinguish between
  NUMA hinting faults and other types of faults. It was universally loved
  by everybody and caused no problems whatsoever. That last sentence might
  be a lie.
 
  This series is very heavily based on patches from Linus and Aneesh to
  replace the existing PTE/PMD NUMA helper functions with normal change
  protections. I did alter and add parts of it but I consider them relatively
  minor contributions. Note that the signed-offs here need addressing. I
  couldn't use From or Signed-off-by from the original authors as the
  patches had to be broken up and they were never signed off. I expect the
  two people involved will just stick their signed-off-by on it.
 
 
 How about the additional change listed below for ppc64 ? One part of the
 patch is to make sure that we don't hit the WARN_ON in set_pte and set_pmd
 because we find the _PAGE_PRESENT bit set in case of numa fault. I
 ended up relaxing the check there.
 

 I folded the set_pte_at and set_pmd_at changes into the patch mm: Convert
 p[te|md]_numa users to p[te|md]_protnone_numa with one change -- both
 set_pte_at and set_pmd_at checks are under CONFIG_DEBUG_VM for consistency.

 Second part of the change is to add a WARN_ON to make sure we are
 not depending on DSISR_PROTFAULT for anything else. We ideally should not
 get a DSISR_PROTFAULT for PROT_NONE or NUMA fault. hash_page_mm do check
 whether the access is allowed by pte before inserting a pte into hash
 page table. Hence we will never find a PROT_NONE or PROT_NONE_NUMA ptes
 in hash page table. But it is good to run with VM_WARN_ON ?
 

 Due to the nature of the check and when they are hit, I converted it to
 a WARN_ON_ONCE. Due to the exceptional circumstance the overhead should
 be non-existant and shouldn't need to be hidden below VM_WARN_ON. I also
 noted that with the patch the kernel  potentially no longer recovers
 from this exceptional cirsumstance and instead falls through. To avoid
 this, I preserved the goto out_unlock.

 Is this still ok?

 ---8---
 ppc64: Add paranoid warnings for unexpected DSISR_PROTFAULT

 ppc64 should not be depending on DSISR_PROTFAULT and it's unexpected
 if they are triggered. This patch adds warnings just in case they
 are being accidentally depended upon.

 Requires-signed-off-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
 Signed-off-by: Mel Gorman mgor...@suse.de
 ---
  arch/powerpc/mm/copro_fault.c |  7 ++-
  arch/powerpc/mm/fault.c   | 20 +---
  2 files changed, 15 insertions(+), 12 deletions(-)

 diff --git a/arch/powerpc/mm/copro_fault.c b/arch/powerpc/mm/copro_fault.c
 index 5a236f0..46152aa 100644
 --- a/arch/powerpc/mm/copro_fault.c
 +++ b/arch/powerpc/mm/copro_fault.c
 @@ -64,7 +64,12 @@ int copro_handle_mm_fault(struct mm_struct *mm, unsigned 
 long ea,
   if (!(vma-vm_flags  VM_WRITE))
   goto out_unlock;
   } else {
 - if (dsisr  DSISR_PROTFAULT)
 + /*
 +  * protfault should only happen due to us
 +  * mapping a region readonly temporarily. PROT_NONE
 +  * is also covered by the VMA check above.
 +  */
 + if (WARN_ON_ONCE(dsisr  DSISR_PROTFAULT))
   goto out_unlock;
   if (!(vma-vm_flags  (VM_READ | VM_EXEC)))
   goto out_unlock;


we should do that DSISR_PROTFAILT check after vma-vm_flags. It is not
that we will not hit DSISR_PROTFAULT, what we want to ensure here is that
we get a prot fault only for cases convered by that vma check. So
everything should be taking the if (!(vma-vm_flags  (VM_READ |
VM_EXEC))) branch if it is a protfault. If not we would like to know
about that. And hence the idea of not using WARN_ON_ONCE. I was also not
sure whether we want to enable that always. The reason for keeping that
within CONFIG_DEBUG_VM is to make sure that nobody ends up depending on
PROTFAULT outside the vma check convered. So expectations is that
developers working on feature will run with DEBUG_VM enable and finds
this warning. We don't expect to hit this otherwise.

 diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
 index 5007497..9d6e0b3 100644
 --- a/arch/powerpc/mm/fault.c
 +++ b/arch/powerpc/mm/fault.c
 @@ -396,17 +396,6 @@ good_area:
  #endif /* CONFIG_8xx */

   if (is_exec) {
 -#ifdef CONFIG_PPC_STD_MMU
 - /* Protection fault on exec go straight to failure on
 -  * Hash based MMUs as they either don't support per-page
 -  * execute permission, or if they do, 

Re: [RFC PATCH 0/7] Replace _PAGE_NUMA with PAGE_NONE protections

2014-11-18 Thread Sasha Levin
On 11/18/2014 10:42 AM, Mel Gorman wrote:
 1. I'm assuming this is a KVM setup but can you confirm?

Yes.

 2. Are you using numa=fake=N?

Yes. numa=fake=24, which is probably way more nodes on any physical machine
than the new code was tested on?

 3. If you are using fake NUMA, what happens if you boot without it as
that should make the patches a no-op?

Nope, still seeing it without fake numa.

 4. Similarly, does the kernel boot properly without without patches?

Yes, the kernel works fine without the patches both with and without fake
numa.

 5. Are any other patches applied because the line numbers are not lining
up exactly?

I have quite a few more patches on top of next, but they're debug patches
that add VM_BUG_ONs in quite a few places.

One thing that was odd is that your patches had merge conflicts when applied
on -next in mm/huge-memory.c, so maybe that's where line number differences
are coming from.

 6. As my own KVM setup appears broken, can you tell me if the host
kernel has changed recently? If so, does using an older host kernel
make a difference?

Nope, I've been using the same host kernel (Ubuntu's 3.16.0-24-generic #32)
for a while now.

 At the moment I'm scratching my head trying to figure out how the
 patches could break 9p like this as I don't believe KVM is doing any
 tricks with the same bits that could result in loss.

This issue reproduces rather easily, I'd be happy to try out debug patches
rather than having you guess at what might have gone wrong.


Thanks,
Sasha
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 0/7] Replace _PAGE_NUMA with PAGE_NONE protections

2014-11-18 Thread Aneesh Kumar K.V
Sasha Levin sasha.le...@oracle.com writes:

 On 11/18/2014 10:42 AM, Mel Gorman wrote:
 1. I'm assuming this is a KVM setup but can you confirm?

 Yes.

 2. Are you using numa=fake=N?

 Yes. numa=fake=24, which is probably way more nodes on any physical machine
 than the new code was tested on?

 3. If you are using fake NUMA, what happens if you boot without it as
that should make the patches a no-op?

 Nope, still seeing it without fake numa.

 4. Similarly, does the kernel boot properly without without patches?

 Yes, the kernel works fine without the patches both with and without fake
 numa.


Hmm that is interesting. I am not sure how writeback_fid can be
related. We use writeback fid to enable client side caching with 9p
(cache=loose). We use this fid to write back dirty pages later. Can you
share the qemu command line used, 9p mount options and the test details ? 


-aneesh

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 0/7] Replace _PAGE_NUMA with PAGE_NONE protections

2014-11-18 Thread Mel Gorman
On Tue, Nov 18, 2014 at 10:03:30PM +0530, Aneesh Kumar K.V wrote:
  diff --git a/arch/powerpc/mm/copro_fault.c b/arch/powerpc/mm/copro_fault.c
  index 5a236f0..46152aa 100644
  --- a/arch/powerpc/mm/copro_fault.c
  +++ b/arch/powerpc/mm/copro_fault.c
  @@ -64,7 +64,12 @@ int copro_handle_mm_fault(struct mm_struct *mm, unsigned 
  long ea,
  if (!(vma-vm_flags  VM_WRITE))
  goto out_unlock;
  } else {
  -   if (dsisr  DSISR_PROTFAULT)
  +   /*
  +* protfault should only happen due to us
  +* mapping a region readonly temporarily. PROT_NONE
  +* is also covered by the VMA check above.
  +*/
  +   if (WARN_ON_ONCE(dsisr  DSISR_PROTFAULT))
  goto out_unlock;
  if (!(vma-vm_flags  (VM_READ | VM_EXEC)))
  goto out_unlock;
 
 
 we should do that DSISR_PROTFAILT check after vma-vm_flags. It is not
 that we will not hit DSISR_PROTFAULT, what we want to ensure here is that
 we get a prot fault only for cases convered by that vma check. So
 everything should be taking the if (!(vma-vm_flags  (VM_READ |
 VM_EXEC))) branch if it is a protfault. If not we would like to know
 about that. And hence the idea of not using WARN_ON_ONCE. I was also not
 sure whether we want to enable that always. The reason for keeping that
 within CONFIG_DEBUG_VM is to make sure that nobody ends up depending on
 PROTFAULT outside the vma check convered. So expectations is that
 developers working on feature will run with DEBUG_VM enable and finds
 this warning. We don't expect to hit this otherwise.
 

/me slaps self. It's clear now and updated accordingly. Thanks.

-- 
Mel Gorman
SUSE Labs
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 0/7] Replace _PAGE_NUMA with PAGE_NONE protections

2014-11-18 Thread Mel Gorman
On Tue, Nov 18, 2014 at 10:26:41PM +0530, Aneesh Kumar K.V wrote:
 Sasha Levin sasha.le...@oracle.com writes:
 
  On 11/18/2014 10:42 AM, Mel Gorman wrote:
  1. I'm assuming this is a KVM setup but can you confirm?
 
  Yes.
 
  2. Are you using numa=fake=N?
 
  Yes. numa=fake=24, which is probably way more nodes on any physical machine
  than the new code was tested on?
 
  3. If you are using fake NUMA, what happens if you boot without it as
 that should make the patches a no-op?
 
  Nope, still seeing it without fake numa.
 
  4. Similarly, does the kernel boot properly without without patches?
 
  Yes, the kernel works fine without the patches both with and without fake
  numa.
 
 
 Hmm that is interesting. I am not sure how writeback_fid can be
 related. We use writeback fid to enable client side caching with 9p
 (cache=loose). We use this fid to write back dirty pages later. Can you
 share the qemu command line used, 9p mount options and the test details ? 
 

It would help if the test details included the kernel config. I got KVM
working again on an server with an older installation and while it
doesn't use 9p, I'm not seeing any other oddities either yet while
running trinity.

-- 
Mel Gorman
SUSE Labs
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 0/7] Replace _PAGE_NUMA with PAGE_NONE protections

2014-11-18 Thread Sasha Levin
On 11/18/2014 11:56 AM, Aneesh Kumar K.V wrote:
 4. Similarly, does the kernel boot properly without without patches?
 
  Yes, the kernel works fine without the patches both with and without fake
  numa.
 
 Hmm that is interesting. I am not sure how writeback_fid can be
 related. We use writeback fid to enable client side caching with 9p
 (cache=loose). We use this fid to write back dirty pages later. Can you
 share the qemu command line used, 9p mount options and the test details ? 

I'm using kvmtool rather than qemu. rootfs is created via kernel parameters:

root=/dev/root rw rootflags=rw,trans=virtio,version=9p2000.L rootfstype=9p

The test is just running trinity, there's no 9p or mm specific test going on.

I've attached my .config.


Thanks,
Sasha


config-sasha.gz
Description: application/gzip


Re: [RFC PATCH 0/7] Replace _PAGE_NUMA with PAGE_NONE protections

2014-11-17 Thread Aneesh Kumar K.V
Mel Gorman  writes:

> This is follow up from the "pipe/page fault oddness" thread.
>
> Automatic NUMA balancing depends on being able to protect PTEs to trap a
> fault and gather reference locality information. Very broadly speaking it
> would mark PTEs as not present and use another bit to distinguish between
> NUMA hinting faults and other types of faults. It was universally loved
> by everybody and caused no problems whatsoever. That last sentence might
> be a lie.
>
> This series is very heavily based on patches from Linus and Aneesh to
> replace the existing PTE/PMD NUMA helper functions with normal change
> protections. I did alter and add parts of it but I consider them relatively
> minor contributions. Note that the signed-offs here need addressing. I
> couldn't use "From" or Signed-off-by from the original authors as the
> patches had to be broken up and they were never signed off. I expect the
> two people involved will just stick their signed-off-by on it.


How about the additional change listed below for ppc64 ? One part of the
patch is to make sure that we don't hit the WARN_ON in set_pte and set_pmd
because we find the _PAGE_PRESENT bit set in case of numa fault. I
ended up relaxing the check there.

Second part of the change is to add a WARN_ON to make sure we are
not depending on DSISR_PROTFAULT for anything else. We ideally should not
get a DSISR_PROTFAULT for PROT_NONE or NUMA fault. hash_page_mm do check
whether the access is allowed by pte before inserting a pte into hash
page table. Hence we will never find a PROT_NONE or PROT_NONE_NUMA ptes
in hash page table. But it is good to run with VM_WARN_ON ?

I also added a similar change to handle CAPI. 

This will also need an ack from Ben and Paul . (added them to Cc:) 

With the below patch you can add

Acked-by: Aneesh Kumar K.V 

for the respective patches.

diff --git a/arch/powerpc/mm/copro_fault.c b/arch/powerpc/mm/copro_fault.c
index 5a236f082c78..2e208afb7f4c 100644
--- a/arch/powerpc/mm/copro_fault.c
+++ b/arch/powerpc/mm/copro_fault.c
@@ -64,10 +64,14 @@ int copro_handle_mm_fault(struct mm_struct *mm, unsigned 
long ea,
if (!(vma->vm_flags & VM_WRITE))
goto out_unlock;
} else {
-   if (dsisr & DSISR_PROTFAULT)
-   goto out_unlock;
if (!(vma->vm_flags & (VM_READ | VM_EXEC)))
goto out_unlock;
+   /*
+* protfault should only happen due to us
+* mapping a region readonly temporarily. PROT_NONE
+* is also covered by the VMA check above.
+*/
+   VM_WARN_ON(dsisr & DSISR_PROTFAULT);
}
 
ret = 0;
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 50074972d555..6df9483e316f 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -396,17 +396,6 @@ good_area:
 #endif /* CONFIG_8xx */
 
if (is_exec) {
-#ifdef CONFIG_PPC_STD_MMU
-   /* Protection fault on exec go straight to failure on
-* Hash based MMUs as they either don't support per-page
-* execute permission, or if they do, it's handled already
-* at the hash level. This test would probably have to
-* be removed if we change the way this works to make hash
-* processors use the same I/D cache coherency mechanism
-* as embedded.
-*/
-#endif /* CONFIG_PPC_STD_MMU */
-
/*
 * Allow execution from readable areas if the MMU does not
 * provide separate controls over reading and executing.
@@ -421,6 +410,14 @@ good_area:
(cpu_has_feature(CPU_FTR_NOEXECUTE) ||
 !(vma->vm_flags & (VM_READ | VM_WRITE
goto bad_area;
+#ifdef CONFIG_PPC_STD_MMU
+   /*
+* protfault should only happen due to us
+* mapping a region readonly temporarily. PROT_NONE
+* is also covered by the VMA check above.
+*/
+   VM_WARN_ON(error_code & DSISR_PROTFAULT);
+#endif /* CONFIG_PPC_STD_MMU */
/* a write */
} else if (is_write) {
if (!(vma->vm_flags & VM_WRITE))
@@ -430,6 +427,7 @@ good_area:
} else {
if (!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)))
goto bad_area;
+   VM_WARN_ON(error_code & DSISR_PROTFAULT);
}
 
/*
diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index c90e602677c9..75b08098fcf5 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -172,9 +172,13 @@ static pte_t set_access_flags_filter(pte_t pte, struct 
vm_area_struct *vma,
 void set_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
pte_t pte)
 {
-#ifdef CONFIG_DEBUG_VM
-   

Re: [RFC PATCH 0/7] Replace _PAGE_NUMA with PAGE_NONE protections

2014-11-17 Thread Aneesh Kumar K.V
Mel Gorman mgor...@suse.de writes:

 This is follow up from the pipe/page fault oddness thread.

 Automatic NUMA balancing depends on being able to protect PTEs to trap a
 fault and gather reference locality information. Very broadly speaking it
 would mark PTEs as not present and use another bit to distinguish between
 NUMA hinting faults and other types of faults. It was universally loved
 by everybody and caused no problems whatsoever. That last sentence might
 be a lie.

 This series is very heavily based on patches from Linus and Aneesh to
 replace the existing PTE/PMD NUMA helper functions with normal change
 protections. I did alter and add parts of it but I consider them relatively
 minor contributions. Note that the signed-offs here need addressing. I
 couldn't use From or Signed-off-by from the original authors as the
 patches had to be broken up and they were never signed off. I expect the
 two people involved will just stick their signed-off-by on it.


How about the additional change listed below for ppc64 ? One part of the
patch is to make sure that we don't hit the WARN_ON in set_pte and set_pmd
because we find the _PAGE_PRESENT bit set in case of numa fault. I
ended up relaxing the check there.

Second part of the change is to add a WARN_ON to make sure we are
not depending on DSISR_PROTFAULT for anything else. We ideally should not
get a DSISR_PROTFAULT for PROT_NONE or NUMA fault. hash_page_mm do check
whether the access is allowed by pte before inserting a pte into hash
page table. Hence we will never find a PROT_NONE or PROT_NONE_NUMA ptes
in hash page table. But it is good to run with VM_WARN_ON ?

I also added a similar change to handle CAPI. 

This will also need an ack from Ben and Paul . (added them to Cc:) 

With the below patch you can add

Acked-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com

for the respective patches.

diff --git a/arch/powerpc/mm/copro_fault.c b/arch/powerpc/mm/copro_fault.c
index 5a236f082c78..2e208afb7f4c 100644
--- a/arch/powerpc/mm/copro_fault.c
+++ b/arch/powerpc/mm/copro_fault.c
@@ -64,10 +64,14 @@ int copro_handle_mm_fault(struct mm_struct *mm, unsigned 
long ea,
if (!(vma-vm_flags  VM_WRITE))
goto out_unlock;
} else {
-   if (dsisr  DSISR_PROTFAULT)
-   goto out_unlock;
if (!(vma-vm_flags  (VM_READ | VM_EXEC)))
goto out_unlock;
+   /*
+* protfault should only happen due to us
+* mapping a region readonly temporarily. PROT_NONE
+* is also covered by the VMA check above.
+*/
+   VM_WARN_ON(dsisr  DSISR_PROTFAULT);
}
 
ret = 0;
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 50074972d555..6df9483e316f 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -396,17 +396,6 @@ good_area:
 #endif /* CONFIG_8xx */
 
if (is_exec) {
-#ifdef CONFIG_PPC_STD_MMU
-   /* Protection fault on exec go straight to failure on
-* Hash based MMUs as they either don't support per-page
-* execute permission, or if they do, it's handled already
-* at the hash level. This test would probably have to
-* be removed if we change the way this works to make hash
-* processors use the same I/D cache coherency mechanism
-* as embedded.
-*/
-#endif /* CONFIG_PPC_STD_MMU */
-
/*
 * Allow execution from readable areas if the MMU does not
 * provide separate controls over reading and executing.
@@ -421,6 +410,14 @@ good_area:
(cpu_has_feature(CPU_FTR_NOEXECUTE) ||
 !(vma-vm_flags  (VM_READ | VM_WRITE
goto bad_area;
+#ifdef CONFIG_PPC_STD_MMU
+   /*
+* protfault should only happen due to us
+* mapping a region readonly temporarily. PROT_NONE
+* is also covered by the VMA check above.
+*/
+   VM_WARN_ON(error_code  DSISR_PROTFAULT);
+#endif /* CONFIG_PPC_STD_MMU */
/* a write */
} else if (is_write) {
if (!(vma-vm_flags  VM_WRITE))
@@ -430,6 +427,7 @@ good_area:
} else {
if (!(vma-vm_flags  (VM_READ | VM_EXEC | VM_WRITE)))
goto bad_area;
+   VM_WARN_ON(error_code  DSISR_PROTFAULT);
}
 
/*
diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index c90e602677c9..75b08098fcf5 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -172,9 +172,13 @@ static pte_t set_access_flags_filter(pte_t pte, struct 
vm_area_struct *vma,
 void set_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
pte_t pte)
 {
-#ifdef CONFIG_DEBUG_VM
-   

Re: [RFC PATCH 0/7] Replace _PAGE_NUMA with PAGE_NONE protections

2014-11-14 Thread Sasha Levin
On 11/14/2014 08:32 AM, Mel Gorman wrote:> This is follow up from the 
"pipe/page fault oddness" thread.

Hi Mel,

Applying this patch series I've started seeing the following straight away:

[  367.547848] page:ea0003fb7db0 count:1007 mapcount:1005 
mapping:8800691f2f58 index:0x37
[  367.551481] flags: 
0x5001aa8030202d(locked|referenced|uptodate|lru|writeback|unevictable|mlocked)
[  367.555382] page dumped because: VM_BUG_ON_PAGE(!v9inode->writeback_fid)
[  367.558262] page->mem_cgroup:88006d8a1bd8
[  367.560403] [ cut here ]
[  367.562343] kernel BUG at fs/9p/vfs_addr.c:190!
[  367.564239] invalid opcode:  [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
[  367.566991] Dumping ftrace buffer:
[  367.568481](ftrace buffer empty)
[  367.569914] Modules linked in:
[  367.570254] CPU: 3 PID: 8234 Comm: kworker/u52:1 Not tainted 
3.18.0-rc4-next-20141114-sasha-00054-ga9ff95e-dirty #1459
[  367.570254] Workqueue: writeback bdi_writeback_workfn (flush-9p-1)
[  367.570254] task: 8801e21d8000 ti: 8801e1f34000 task.ti: 
8801e1f34000
[  367.570254] RIP: v9fs_vfs_writepage_locked (fs/9p/vfs_addr.c:190 
(discriminator 1))
[  367.570254] RSP: 0018:8801e1f376c8  EFLAGS: 00010286
[  367.570254] RAX: 0021 RBX: ea0003fb7db0 RCX: 
[  367.570254] RDX: 0021 RSI: 9208b2e6 RDI: 8801e21d8d0c
[  367.570254] RBP: 8801e1f37728 R08: 0001 R09: 
[  367.570254] R10: 0001 R11: 0001 R12: 8800691f2d48
[  367.570254] R13: 1000 R14: 8800691f2c30 R15: 8800691f2c98
[  367.570254] FS:  () GS:8801e5c0() 
knlGS:
[  367.570254] CS:  0010 DS:  ES:  CR0: 8005003b
[  367.570254] CR2:  CR3: ca00c000 CR4: 06a0
[  367.570254] DR0: 8100 DR1:  DR2: 
[  367.570254] DR3:  DR6: 0ff0 DR7: 0600
[  367.570254] Stack:
[  367.570254]  da003c43b1a1 da003c43b1a1  
0002
[  367.570254]  0002 00037000 8801e1f37758 
ea0003fb7db0
[  367.570254]   8801e1f37a60 8801e1f37a60 
ea0003fb7db0
[  367.570254] Call Trace:
[  367.570254] v9fs_vfs_writepage (fs/9p/vfs_addr.c:212)
[  367.570254] __writepage (include/linux/pagemap.h:32 mm/page-writeback.c:2006)
[  367.570254] write_cache_pages (mm/page-writeback.c:1943)
[  367.570254] ? bdi_set_max_ratio (mm/page-writeback.c:2003)
[  367.570254] ? sched_clock_local (kernel/sched/clock.c:202)
[  367.570254] generic_writepages (mm/page-writeback.c:2030)
[  367.570254] do_writepages (mm/page-writeback.c:2047)
[  367.570254] __writeback_single_inode (fs/fs-writeback.c:461 (discriminator 
3))
[  367.570254] writeback_sb_inodes (fs/fs-writeback.c:706)
[  367.570254] __writeback_inodes_wb (fs/fs-writeback.c:749)
[  367.570254] wb_writeback (fs/fs-writeback.c:880)
[  367.570254] ? __lock_is_held (kernel/locking/lockdep.c:3518)
[  367.570254] bdi_writeback_workfn (fs/fs-writeback.c:1015 
fs/fs-writeback.c:1060)
[  367.570254] process_one_work (kernel/workqueue.c:2023 
include/linux/jump_label.h:114 include/trace/events/workqueue.h:111 
kernel/workqueue.c:2028)
[  367.570254] ? process_one_work (kernel/workqueue.c:2020)
[  367.570254] ? get_lock_stats (kernel/locking/lockdep.c:249)
[  367.570254] worker_thread (include/linux/list.h:189 kernel/workqueue.c:2156)
[  367.570254] ? __schedule (./arch/x86/include/asm/bitops.h:311 
include/linux/thread_info.h:91 include/linux/sched.h:2939 
kernel/sched/core.c:2848)
[  367.570254] ? rescuer_thread (kernel/workqueue.c:2100)
[  367.570254] kthread (kernel/kthread.c:207)
[  367.570254] ? flush_kthread_work (kernel/kthread.c:176)
[  367.570254] ret_from_fork (arch/x86/kernel/entry_64.S:348)
[  367.570254] ? flush_kthread_work (kernel/kthread.c:176)
[ 367.570254] Code: 48 83 c4 38 44 89 f0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 66 2e 
0f 1f 84 00 00 00 00 00 48 c7 c6 f8 18 37 93 48 89 df e8 e1 8b 93 fe <0f> 0b 48 
89 de 48 c7 c7 30 bd 9f 95 48 89 4d b8 e8 10 5f 02 0f

All code

   0:   48 83 c4 38 add$0x38,%rsp
   4:   44 89 f0mov%r14d,%eax
   7:   5b  pop%rbx
   8:   41 5c   pop%r12
   a:   41 5d   pop%r13
   c:   41 5e   pop%r14
   e:   41 5f   pop%r15
  10:   5d  pop%rbp
  11:   c3  retq
  12:   66 2e 0f 1f 84 00 00nopw   %cs:0x0(%rax,%rax,1)
  19:   00 00 00
  1c:   48 c7 c6 f8 18 37 93mov$0x933718f8,%rsi
  23:   48 89 dfmov%rbx,%rdi
  26:   e8 e1 8b 93 fe  callq  0xfe938c0c
  2b:*  0f 0b   ud2 <-- trapping instruction
  2d:   48 89 demov%rbx,%rsi
  30:   48 c7 

Re: [RFC PATCH 0/7] Replace _PAGE_NUMA with PAGE_NONE protections

2014-11-14 Thread Linus Torvalds
On Fri, Nov 14, 2014 at 5:32 AM, Mel Gorman  wrote:
>
> This series is very heavily based on patches from Linus and Aneesh to
> replace the existing PTE/PMD NUMA helper functions with normal change
> protections. I did alter and add parts of it but I consider them relatively
> minor contributions. Note that the signed-offs here need addressing. I
> couldn't use "From" or Signed-off-by from the original authors as the
> patches had to be broken up and they were never signed off. I expect the
> two people involved will just stick their signed-off-by on it.

Feel free to just take authorship of my parts, and make my
"Needs-sign-off's" be just "Acked-by:"

Or alternatively keep them as "Signed-off-by:", even when it looks a
bit odd if it doesn't have a "From:" me, when the actual patch won't
then actually go through me - I'm assuming this will come in through
the -mm tree.

As to the ppc parts, obviously it would be good to have Aneesh re-test
the series..

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 0/7] Replace _PAGE_NUMA with PAGE_NONE protections

2014-11-14 Thread Linus Torvalds
On Fri, Nov 14, 2014 at 5:32 AM, Mel Gorman mgor...@suse.de wrote:

 This series is very heavily based on patches from Linus and Aneesh to
 replace the existing PTE/PMD NUMA helper functions with normal change
 protections. I did alter and add parts of it but I consider them relatively
 minor contributions. Note that the signed-offs here need addressing. I
 couldn't use From or Signed-off-by from the original authors as the
 patches had to be broken up and they were never signed off. I expect the
 two people involved will just stick their signed-off-by on it.

Feel free to just take authorship of my parts, and make my
Needs-sign-off's be just Acked-by:

Or alternatively keep them as Signed-off-by:, even when it looks a
bit odd if it doesn't have a From: me, when the actual patch won't
then actually go through me - I'm assuming this will come in through
the -mm tree.

As to the ppc parts, obviously it would be good to have Aneesh re-test
the series..

Linus
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 0/7] Replace _PAGE_NUMA with PAGE_NONE protections

2014-11-14 Thread Sasha Levin
On 11/14/2014 08:32 AM, Mel Gorman wrote: This is follow up from the 
pipe/page fault oddness thread.

Hi Mel,

Applying this patch series I've started seeing the following straight away:

[  367.547848] page:ea0003fb7db0 count:1007 mapcount:1005 
mapping:8800691f2f58 index:0x37
[  367.551481] flags: 
0x5001aa8030202d(locked|referenced|uptodate|lru|writeback|unevictable|mlocked)
[  367.555382] page dumped because: VM_BUG_ON_PAGE(!v9inode-writeback_fid)
[  367.558262] page-mem_cgroup:88006d8a1bd8
[  367.560403] [ cut here ]
[  367.562343] kernel BUG at fs/9p/vfs_addr.c:190!
[  367.564239] invalid opcode:  [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
[  367.566991] Dumping ftrace buffer:
[  367.568481](ftrace buffer empty)
[  367.569914] Modules linked in:
[  367.570254] CPU: 3 PID: 8234 Comm: kworker/u52:1 Not tainted 
3.18.0-rc4-next-20141114-sasha-00054-ga9ff95e-dirty #1459
[  367.570254] Workqueue: writeback bdi_writeback_workfn (flush-9p-1)
[  367.570254] task: 8801e21d8000 ti: 8801e1f34000 task.ti: 
8801e1f34000
[  367.570254] RIP: v9fs_vfs_writepage_locked (fs/9p/vfs_addr.c:190 
(discriminator 1))
[  367.570254] RSP: 0018:8801e1f376c8  EFLAGS: 00010286
[  367.570254] RAX: 0021 RBX: ea0003fb7db0 RCX: 
[  367.570254] RDX: 0021 RSI: 9208b2e6 RDI: 8801e21d8d0c
[  367.570254] RBP: 8801e1f37728 R08: 0001 R09: 
[  367.570254] R10: 0001 R11: 0001 R12: 8800691f2d48
[  367.570254] R13: 1000 R14: 8800691f2c30 R15: 8800691f2c98
[  367.570254] FS:  () GS:8801e5c0() 
knlGS:
[  367.570254] CS:  0010 DS:  ES:  CR0: 8005003b
[  367.570254] CR2:  CR3: ca00c000 CR4: 06a0
[  367.570254] DR0: 8100 DR1:  DR2: 
[  367.570254] DR3:  DR6: 0ff0 DR7: 0600
[  367.570254] Stack:
[  367.570254]  da003c43b1a1 da003c43b1a1  
0002
[  367.570254]  0002 00037000 8801e1f37758 
ea0003fb7db0
[  367.570254]   8801e1f37a60 8801e1f37a60 
ea0003fb7db0
[  367.570254] Call Trace:
[  367.570254] v9fs_vfs_writepage (fs/9p/vfs_addr.c:212)
[  367.570254] __writepage (include/linux/pagemap.h:32 mm/page-writeback.c:2006)
[  367.570254] write_cache_pages (mm/page-writeback.c:1943)
[  367.570254] ? bdi_set_max_ratio (mm/page-writeback.c:2003)
[  367.570254] ? sched_clock_local (kernel/sched/clock.c:202)
[  367.570254] generic_writepages (mm/page-writeback.c:2030)
[  367.570254] do_writepages (mm/page-writeback.c:2047)
[  367.570254] __writeback_single_inode (fs/fs-writeback.c:461 (discriminator 
3))
[  367.570254] writeback_sb_inodes (fs/fs-writeback.c:706)
[  367.570254] __writeback_inodes_wb (fs/fs-writeback.c:749)
[  367.570254] wb_writeback (fs/fs-writeback.c:880)
[  367.570254] ? __lock_is_held (kernel/locking/lockdep.c:3518)
[  367.570254] bdi_writeback_workfn (fs/fs-writeback.c:1015 
fs/fs-writeback.c:1060)
[  367.570254] process_one_work (kernel/workqueue.c:2023 
include/linux/jump_label.h:114 include/trace/events/workqueue.h:111 
kernel/workqueue.c:2028)
[  367.570254] ? process_one_work (kernel/workqueue.c:2020)
[  367.570254] ? get_lock_stats (kernel/locking/lockdep.c:249)
[  367.570254] worker_thread (include/linux/list.h:189 kernel/workqueue.c:2156)
[  367.570254] ? __schedule (./arch/x86/include/asm/bitops.h:311 
include/linux/thread_info.h:91 include/linux/sched.h:2939 
kernel/sched/core.c:2848)
[  367.570254] ? rescuer_thread (kernel/workqueue.c:2100)
[  367.570254] kthread (kernel/kthread.c:207)
[  367.570254] ? flush_kthread_work (kernel/kthread.c:176)
[  367.570254] ret_from_fork (arch/x86/kernel/entry_64.S:348)
[  367.570254] ? flush_kthread_work (kernel/kthread.c:176)
[ 367.570254] Code: 48 83 c4 38 44 89 f0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 66 2e 
0f 1f 84 00 00 00 00 00 48 c7 c6 f8 18 37 93 48 89 df e8 e1 8b 93 fe 0f 0b 48 
89 de 48 c7 c7 30 bd 9f 95 48 89 4d b8 e8 10 5f 02 0f

All code

   0:   48 83 c4 38 add$0x38,%rsp
   4:   44 89 f0mov%r14d,%eax
   7:   5b  pop%rbx
   8:   41 5c   pop%r12
   a:   41 5d   pop%r13
   c:   41 5e   pop%r14
   e:   41 5f   pop%r15
  10:   5d  pop%rbp
  11:   c3  retq
  12:   66 2e 0f 1f 84 00 00nopw   %cs:0x0(%rax,%rax,1)
  19:   00 00 00
  1c:   48 c7 c6 f8 18 37 93mov$0x933718f8,%rsi
  23:   48 89 dfmov%rbx,%rdi
  26:   e8 e1 8b 93 fe  callq  0xfe938c0c
  2b:*  0f 0b   ud2 -- trapping instruction
  2d:   48 89 demov%rbx,%rsi
  30:   48 c7 c7 30 bd