Re: [PATCH 1/2] KVM: MMU: fix SMAP virtualization

2015-06-10 Thread Andrew Morton
On Wed, 10 Jun 2015 11:02:46 -0700 Davidlohr Bueso d...@stgolabs.net wrote:

 On Wed, 2015-05-27 at 10:53 +0800, Xiao Guangrong wrote:
  
  On 05/26/2015 10:48 PM, Paolo Bonzini wrote:
  
  
   On 26/05/2015 16:45, Edward Cree wrote:
   This breaks older compilers that can't initialize anon structures.
  
    How old? Even gcc 3.1 says you can use unnamed struct/union fields, and
   3.2 is the minimum version required to compile the kernel as mentioned
   in the README.
  
   We could simply just name the structure, but I doubt this is the
   only place in the kernel code where it's being used this way :)
   This appears to be GCC bug #10676, see 
   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=10676
    Says it was fixed in 4.6, but I believe the kernel supports GCCs much
    older than that (back to 3.2).  I personally hit it on 4.4.7, the
    version shipped with RHEL6.6.
  
   Yes, it will be fixed soon(ish).  Probably before you can get rid of the
   obnoxious disclaimer... :)
  
  It has been fixed by Andrew:
  
  From: Andrew Morton a...@linux-foundation.org
  Subject: arch/x86/kvm/mmu.c: work around gcc-4.4.4 bug
 
 Folks, this fix is missing in both -tip and Linus' (4.1-rc7) tree. At
 least I'm running into it.

hm, yes, it's in linux-next so I dropped my copy.

I'll send it in to Linus today.
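
For reference, a minimal reduction of the gcc bug in question (a
hypothetical example, not the actual mmu.c hunk):

	struct entry {
		union {
			struct {
				int low;
				int high;
			};	/* anonymous struct member */
			long word;
		};
	};

	/*
	 * gcc <= 4.4 rejects a designated initializer that names fields
	 * of an anonymous member (gcc bug #10676, fixed in 4.6):
	 */
	static struct entry broken = { .low = 1, .high = 2 };

	/* Workaround in the same spirit as the mmu.c fix: assign at runtime. */
	static struct entry fixed;
	static void init_fixed(void)
	{
		fixed.low = 1;
		fixed.high = 2;
	}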


Re: [PATCH 22/23] userfaultfd: avoid mmap_sem read recursion in mcopy_atomic

2015-05-22 Thread Andrew Morton
On Thu, 14 May 2015 19:31:19 +0200 Andrea Arcangeli aarca...@redhat.com wrote:

 If the rwsem starves writers it wasn't strictly a bug, but lockdep
 doesn't like it, and this avoids depending on low-level implementation
 details of the lock.
 
 ...

 @@ -229,13 +246,33 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
  
   if (!zeropage)
   err = mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma,
 -				dst_addr, src_addr);
 +				dst_addr, src_addr, page);
   else
   err = mfill_zeropage_pte(dst_mm, dst_pmd, dst_vma,
dst_addr);
  
   cond_resched();
  
 + if (unlikely(err == -EFAULT)) {
 + void *page_kaddr;
 +
 + BUILD_BUG_ON(zeropage);

I'm not sure what this is trying to do.  BUILD_BUG_ON(local_variable)?

It goes bang in my build.  I'll just delete it.

 +	up_read(&dst_mm->mmap_sem);
 + BUG_ON(!page);
 +
 + page_kaddr = kmap(page);
 + err = copy_from_user(page_kaddr,
 +  (const void __user *) src_addr,
 +  PAGE_SIZE);
 + kunmap(page);
 + if (unlikely(err)) {
 + err = -EFAULT;
 + goto out;
 + }
 + goto retry;
 + } else
 + BUG_ON(page);
 +
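
For context on why that BUILD_BUG_ON() goes bang: it only accepts
compile-time constant expressions.  A sketch (the function name here is
made up for illustration):

	#include <linux/bug.h>

	static int mcopy_check(bool zeropage)
	{
		/* Fine: sizeof() is a compile-time constant expression. */
		BUILD_BUG_ON(sizeof(unsigned long) < 4);

		/*
		 * Not fine: 'zeropage' is a runtime value, so the compiler
		 * cannot evaluate the assertion while building; depending on
		 * the compiler this breaks the build outright.  A runtime
		 * assertion wants BUG_ON() or WARN_ON() instead.
		 */
		if (WARN_ON(zeropage))
			return -EINVAL;
		return 0;
	}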



Re: [PATCH 22/23] userfaultfd: avoid mmap_sem read recursion in mcopy_atomic

2015-05-22 Thread Andrew Morton

There's a more serious failure with i386 allmodconfig:

fs/userfaultfd.c:145:2: note: in expansion of macro 'BUILD_BUG_ON'
  BUILD_BUG_ON(sizeof(struct uffd_msg) != 32);

I'm surprised the feature is even reachable on i386 builds?


Re: [PATCH 2/4] mm: introduce mm_forbids_zeropage function

2014-10-22 Thread Andrew Morton
On Wed, 22 Oct 2014 13:09:28 +0200 Dominik Dingel din...@linux.vnet.ibm.com 
wrote:

 Add a new function stub to allow architectures to disable, for an
 mm_struct, the backing of non-present, anonymous pages with
 read-only empty zero pages.
 
 ...

 --- a/include/linux/mm.h
 +++ b/include/linux/mm.h
 @@ -56,6 +56,10 @@ extern int sysctl_legacy_va_layout;
  #define __pa_symbol(x)  __pa(RELOC_HIDE((unsigned long)(x), 0))
  #endif
  
 +#ifndef mm_forbids_zeropage
 +#define mm_forbids_zeropage(X)  (0)
 +#endif

Can we document this please?  What it does, why it does it.  We should
also specify precisely which arch header file is responsible for
defining mm_forbids_zeropage.
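
For illustration, the shape being asked for, with a hypothetical arch
override (the context field name is invented; s390's real override lives
in its asm/pgtable.h):

	/* include/linux/mm.h -- generic fallback, from the patch above: */
	#ifndef mm_forbids_zeropage
	#define mm_forbids_zeropage(X)	(0)
	#endif

	/*
	 * arch/xxx/include/asm/pgtable.h -- hypothetical override, so that
	 * anonymous read faults in this mm are never backed by the shared
	 * zero page:
	 */
	#define mm_forbids_zeropage(mm)	((mm)->context.forbids_zeropage)

The fault path then guards zero-page use with
if (!mm_forbids_zeropage(mm)) before installing the zero page mapping.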




Re: [PATCH 2/4] mm: introduce mm_forbids_zeropage function

2014-10-22 Thread Andrew Morton
On Wed, 22 Oct 2014 21:45:52 +0200 Dominik Dingel din...@linux.vnet.ibm.com 
wrote:

   +#ifndef mm_forbids_zeropage
   +#define mm_forbids_zeropage(X)  (0)
   +#endif
  
  Can we document this please?  What it does, why it does it.  We should
  also specify precisely which arch header file is responsible for
  defining mm_forbids_zeropage.
  
 
 I will add a comment like:
 
 /*
  * To prevent common memory management code establishing
  * a zero page mapping on a read fault.
  * This function should be implemented within asm/pgtable.h.

s/function should be implemented/macro should be defined/

  * s390 does this to prevent multiplexing of hardware bits
  * related to the physical page in case of virtualization.
  */
 
 Okay?

Looks great, thanks.


Re: [PATCH v2] kvm: Faults which trigger IO release the mmap_sem

2014-09-22 Thread Andrew Morton
On Mon, 22 Sep 2014 23:32:36 +0200 Paolo Bonzini pbonz...@redhat.com wrote:

 On 22/09/2014 22:49, Andres Lagar-Cavilla wrote:
Paolo, should I recut including the recent Reviewed-by's?
  
   No, I'll add them myself.
  Paolo, is this patch waiting for something? Is Gleb's Reviewed-by enough?
 
 It's waiting for an Acked-by on the mm/ changes.
 

The MM changes look good to me.


Re: [PATCH] mm: export symbol dependencies of is_zero_pfn()

2014-09-12 Thread Andrew Morton
On Fri, 12 Sep 2014 22:17:23 +0200 Ard Biesheuvel ard.biesheu...@linaro.org 
wrote:

 In order to make the static inline function is_zero_pfn() callable by
 modules, export its symbol dependencies 'zero_pfn' and (for s390 and
 mips) 'zero_page_mask'.

So hexagon and score get the export if/when needed.

 We need this for KVM, as CONFIG_KVM is a tristate for all supported
 architectures except ARM and arm64, and testing a pfn whether it refers
 to the zero page is required to correctly distinguish the zero page
 from other special RAM ranges that may also have the PG_reserved bit
 set, but need to be treated as MMIO memory.
 
 Signed-off-by: Ard Biesheuvel ard.biesheu...@linaro.org
 ---
  arch/mips/mm/init.c | 1 +
  arch/s390/mm/init.c | 1 +
  mm/memory.c | 2 ++

Looks OK to me.  Please include the patch in whichever tree it is that
needs it, and merge it up via that tree.
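
A sketch of the consumer side that motivates the exports: is_zero_pfn()
is a static inline, so it is compiled into each module and drags in
zero_pfn (and, on s390/mips, zero_page_mask) as data dependencies.
Roughly what a KVM-like module wants to write (illustrative only):

	#include <linux/mm.h>

	/*
	 * Treat a PG_reserved page as MMIO unless it is the zero page.
	 * Inlining is_zero_pfn() here is what requires the exports.
	 */
	static bool pfn_looks_like_mmio(unsigned long pfn)
	{
		if (pfn_valid(pfn))
			return !is_zero_pfn(pfn) &&
			       PageReserved(pfn_to_page(pfn));
		return true;
	}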



Re: [PATCH] CMA: generalize CMA reserved area management functionality (fixup)

2014-07-17 Thread Andrew Morton
On Thu, 17 Jul 2014 11:36:07 +0200 Marek Szyprowski m.szyprow...@samsung.com 
wrote:

 MAX_CMA_AREAS is used by other subsystems (e.g. arch/arm/mm/dma-mapping.c),
 so we need to provide a correct definition even if CMA is disabled.
 This patch fixes that.
 
 Reported-by: Sylwester Nawrocki s.nawro...@samsung.com
 Signed-off-by: Marek Szyprowski m.szyprow...@samsung.com
 ---
  include/linux/cma.h | 4 ++++
  1 file changed, 4 insertions(+)
 
 diff --git a/include/linux/cma.h b/include/linux/cma.h
 index 9a18a2b1934c..c077635cad76 100644
 --- a/include/linux/cma.h
 +++ b/include/linux/cma.h
 @@ -5,7 +5,11 @@
   * There is always at least global CMA area and a few optional
   * areas configured in kernel .config.
   */
 +#ifdef CONFIG_CMA
   #define MAX_CMA_AREAS	(1 + CONFIG_CMA_AREAS)
  +#else
  +#define MAX_CMA_AREAS	(0)
  +#endif
  
  struct cma;

Joonsoo already fixed this up, a bit differently:
http://ozlabs.org/~akpm/mmots/broken-out/cma-generalize-cma-reserved-area-management-functionality-fix.patch

Which approach makes more sense?



From: Joonsoo Kim iamjoonsoo@lge.com
Subject: CMA: fix ARM build failure related to MAX_CMA_AREAS definition

If CMA is disabled, CONFIG_CMA_AREAS isn't defined, so a compile error
occurs. To fix it, define MAX_CMA_AREAS even when CONFIG_CMA_AREAS
isn't defined.

Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com
Reported-by: Stephen Rothwell s...@canb.auug.org.au
Signed-off-by: Andrew Morton a...@linux-foundation.org
---

 include/linux/cma.h |6 ++
 1 file changed, 6 insertions(+)

diff -puN 
include/linux/cma.h~cma-generalize-cma-reserved-area-management-functionality-fix
 include/linux/cma.h
--- 
a/include/linux/cma.h~cma-generalize-cma-reserved-area-management-functionality-fix
+++ a/include/linux/cma.h
@@ -5,8 +5,14 @@
  * There is always at least global CMA area and a few optional
  * areas configured in kernel .config.
  */
+#ifdef CONFIG_CMA_AREAS
 #define MAX_CMA_AREAS  (1 + CONFIG_CMA_AREAS)
 
+#else
+#define MAX_CMA_AREAS  (0)
+
+#endif
+
 struct cma;
 
 extern phys_addr_t cma_get_base(struct cma *cma);
_
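
Either way, the reason the definition must survive CMA=n is that
consumers size static arrays with it unconditionally.  A sketch of the
arm dma-mapping usage (struct body abbreviated from memory):

	#include <linux/types.h>
	#include <linux/init.h>
	#include <linux/cma.h>

	struct dma_contig_early_reserve {
		phys_addr_t	base;
		unsigned long	size;
	};

	/*
	 * arch/arm/mm/dma-mapping.c declares roughly this; with
	 * MAX_CMA_AREAS undefined it fails to compile even though CMA is
	 * off.  (A zero-sized array is acceptable as a GNU C extension.)
	 */
	static struct dma_contig_early_reserve dma_mmu_remap[MAX_CMA_AREAS] __initdata;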


Re: [PATCH v3 -next 0/9] CMA: generalize CMA reserved area management code

2014-06-25 Thread Andrew Morton
On Wed, 25 Jun 2014 14:33:56 +0200 Marek Szyprowski m.szyprow...@samsung.com 
wrote:

  That's probably easier.  Marek, I'll merge these into -mm (and hence
  -next and git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git)
  and shall hold them pending you review/ack/test/etc, OK?
 
 Ok. I've tested them and they work fine. I'm sorry that you had to wait for
 me for a few days. You can now add:
 
 Acked-and-tested-by: Marek Szyprowski m.szyprow...@samsung.com

Thanks.

 I've also rebased my pending patches onto this set (I will send them soon).
 
  The question is now whether you want to keep the discussed patches in your
  -mm tree, or whether I should take them into my -next branch. If you keep
  them, I assume you will also take the patches which depend on the discussed
  changes.

Yup, that works.


Re: [PATCH v3 -next 4/9] DMA, CMA: support arbitrary bitmap granularity

2014-06-18 Thread Andrew Morton
On Mon, 16 Jun 2014 14:40:46 +0900 Joonsoo Kim iamjoonsoo@lge.com wrote:

 PPC KVM's CMA area management requires arbitrary bitmap granularity,
 since it wants to reserve a very large memory region and manage it with a
 bitmap in which one bit covers several pages, to reduce management overhead.
 So support arbitrary bitmap granularity for the following generalization.
 
 ...

 --- a/drivers/base/dma-contiguous.c
 +++ b/drivers/base/dma-contiguous.c
 @@ -38,6 +38,7 @@ struct cma {
   unsigned long   base_pfn;
   unsigned long   count;
   unsigned long   *bitmap;
 + unsigned int order_per_bit; /* Order of pages represented by one bit */
 	struct mutex	lock;
  };
  
 @@ -157,9 +158,37 @@ void __init dma_contiguous_reserve(phys_addr_t limit)
  
  static DEFINE_MUTEX(cma_mutex);
  
 +static unsigned long cma_bitmap_aligned_mask(struct cma *cma, int align_order)
 +{
 +	return (1 << (align_order >> cma->order_per_bit)) - 1;
 +}

Might want a 1UL << ... here.

 +static unsigned long cma_bitmap_maxno(struct cma *cma)
 +{
 +	return cma->count >> cma->order_per_bit;
 +}
 +
 +static unsigned long cma_bitmap_pages_to_bits(struct cma *cma,
 +					      unsigned long pages)
 +{
 +	return ALIGN(pages, 1 << cma->order_per_bit) >> cma->order_per_bit;
 +}

Ditto.  I'm not really sure what the compiler will do in these cases,
but would prefer not to rely on it anyway!


--- 
a/drivers/base/dma-contiguous.c~dma-cma-support-arbitrary-bitmap-granularity-fix
+++ a/drivers/base/dma-contiguous.c
@@ -160,7 +160,7 @@ static DEFINE_MUTEX(cma_mutex);
 
 static unsigned long cma_bitmap_aligned_mask(struct cma *cma, int align_order)
 {
-	return (1 << (align_order >> cma->order_per_bit)) - 1;
+	return (1UL << (align_order >> cma->order_per_bit)) - 1;
 }
 
 static unsigned long cma_bitmap_maxno(struct cma *cma)
@@ -171,7 +171,7 @@ static unsigned long cma_bitmap_maxno(st
 static unsigned long cma_bitmap_pages_to_bits(struct cma *cma,
unsigned long pages)
 {
-	return ALIGN(pages, 1 << cma->order_per_bit) >> cma->order_per_bit;
+	return ALIGN(pages, 1UL << cma->order_per_bit) >> cma->order_per_bit;
 }
 
 static void cma_clear_bitmap(struct cma *cma, unsigned long pfn, int count)
_
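
The underlying trap, for reference: a plain 1 is an int, so the shift is
performed in 32-bit arithmetic and misbehaves once the shift count
reaches 31, even on a 64-bit kernel.  A minimal sketch:

	static unsigned long shift_mask(unsigned int order)
	{
		/*
		 * (1 << order) is computed as a 32-bit int: order == 31
		 * yields a negative value before widening, and order >= 32
		 * is undefined behaviour.  Forcing 1UL makes the whole
		 * expression use unsigned long arithmetic, which is what
		 * the fix above does.
		 */
		return (1UL << order) - 1;
	}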



Re: [PATCH v3 -next 0/9] CMA: generalize CMA reserved area management code

2014-06-18 Thread Andrew Morton
On Tue, 17 Jun 2014 10:25:07 +0900 Joonsoo Kim iamjoonsoo@lge.com wrote:

  v2:
 - Although this patchset looks very different from v1, the end result,
 that is, mm/cma.c, is the same as v1's. So I carry the Acks over to patches 6-7.
  
  This patchset is based on linux-next 20140610.
  
  Thanks for taking care of this. I will test it with my setup and if
  everything goes well, I will take it to my -next tree. If any branch
  is required for anyone to continue his works on top of those patches,
  let me know, I will also prepare it.
 
 Hello,
 
 I'm glad to hear that. :)
 But, there is one concern. As you already know, I am preparing further
 patches (Aggressively allocate the pages on CMA reserved memory). It
 may be highly related to the MM branch and also slightly depends on these CMA
 changes. In this case, what is the best strategy for merging this
 patchset? IMHO, Andrew's tree is the more appropriate branch. If there is
 no issue in this case, I am willing to develop further patches based
 on your tree.

That's probably easier.  Marek, I'll merge these into -mm (and hence
-next and git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git)
and shall hold them pending you review/ack/test/etc, OK?


Re: [PATCH -mm 0/3] fix numa vs kvm scalability issue

2014-02-19 Thread Andrew Morton
On Wed, 19 Feb 2014 09:59:17 +0100 Peter Zijlstra pet...@infradead.org wrote:

 On Tue, Feb 18, 2014 at 05:12:43PM -0500, r...@redhat.com wrote:
  The NUMA scanning code can end up iterating over many gigabytes
  of unpopulated memory, especially in the case of a freshly started
  KVM guest with lots of memory.
  
  This results in the mmu notifier code being called even when
  there are no mapped pages in a virtual address range. The amount
  of time wasted can be enough to trigger soft lockup warnings
  with very large (2TB) KVM guests.
  
  This patch moves the mmu notifier call to the pmd level, which
  represents 1GB areas of memory on x86-64. Furthermore, the mmu
  notifier code is only called from the address in the PMD where
  present mappings are first encountered.
  
  The hugetlbfs code is left alone for now; hugetlb mappings are
  not relocatable, and as such are left alone by the NUMA code,
  and should never trigger this problem to begin with.
  
  The series also adds a cond_resched to task_numa_work, to
  fix another potential latency issue.
 
 Andrew, I'll pick up the first kernel/sched/ patch; do you want the
 other two mm/ patches?

That works, thanks.


Re: [PATCH 01/14] asmlinkage, kvm: Make kvm_rebooting visible

2014-02-10 Thread Andrew Morton
On Sat,  8 Feb 2014 08:51:57 +0100 Andi Kleen a...@linux.intel.com wrote:

 kvm_rebooting is referenced from assembler code, thus
 needs to be visible.

So I read the gcc page and looked at the __visible definition but I
still don't really get it.  What goes wrong if the __visible isn't
present on these referenced-from-asm identifiers?

 Cc: g...@redhat.com
 Cc: pbonz...@redhat.com

Grumble.  Email addresses go into commits with display names and full
addresses.  At least, 89.3% of them do.  Some sucker has to go through
these and fix them up.  I'd prefer it not be me ;)


 ...

 --- a/virt/kvm/kvm_main.c
 +++ b/virt/kvm/kvm_main.c
 @@ -102,7 +102,7 @@ static void kvm_release_pfn_dirty(pfn_t pfn);
  static void mark_page_dirty_in_slot(struct kvm *kvm,
   struct kvm_memory_slot *memslot, gfn_t gfn);
  
 -bool kvm_rebooting;
 +__visible bool kvm_rebooting;
  EXPORT_SYMBOL_GPL(kvm_rebooting);
  
  static bool largepages_enabled = true;
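
For reference, the short answer as far as I can tell from the LTO
series: __visible expands to gcc's externally_visible attribute, and the
problem only appears with whole-program optimization.  A sketch:

	/* include/linux/compiler-gcc4.h (gcc >= 4.6): */
	#define __visible __attribute__((externally_visible))

	/*
	 * With -fwhole-program or LTO, gcc assumes it can see every
	 * reference to a symbol.  References from .S files and inline asm
	 * are invisible to it, so an apparently unreferenced kvm_rebooting
	 * could be localized or discarded and the link would then fail.
	 * externally_visible disables that assumption for this symbol.
	 */
	__visible bool kvm_rebooting;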


Re: [PATCH-v3 1/4] idr: Percpu ida

2013-08-28 Thread Andrew Morton
On Wed, 28 Aug 2013 12:53:17 -0700 Kent Overstreet k...@daterainc.com wrote:

   + while (1) {
   +	spin_lock(&pool->lock);
   +
   +	/*
   +	 * prepare_to_wait() must come before steal_tags(), in case
   +	 * percpu_ida_free() on another cpu flips a bit in
   +	 * cpus_have_tags
   +	 *
   +	 * global lock held and irqs disabled, don't need percpu lock
   +	 */
   +	prepare_to_wait(&pool->wait, &wait, TASK_UNINTERRUPTIBLE);
   +
   +	if (!tags->nr_free)
   +		alloc_global_tags(pool, tags);
   +	if (!tags->nr_free)
   +		steal_tags(pool, tags);
   +
   +	if (tags->nr_free) {
   +		tag = tags->freelist[--tags->nr_free];
   +		if (tags->nr_free)
   +			set_bit(smp_processor_id(),
   +				pool->cpus_have_tags);
   +	}
   +
   +	spin_unlock(&pool->lock);
   +	local_irq_restore(flags);
   +
   +	if (tag >= 0 || !(gfp & __GFP_WAIT))
   +		break;
   +
   +	schedule();
   +
   +	local_irq_save(flags);
   +	tags = this_cpu_ptr(pool->tag_cpu);
   + }
  
  What guarantees that this wait will terminate?
 
 It seems fairly clear to me from the break statement a couple lines up;
 if we were passed __GFP_WAIT we terminate iff we successfully allocated a
 tag. If we weren't passed __GFP_WAIT we never actually sleep.

OK ;)  Let me rephrase.  What guarantees that a tag will become available?

If what we have here is an open-coded __GFP_NOFAIL then that is
potentially problematic.
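
In caller terms, the contract under discussion looks like this (a sketch
against the patch's API; the wrapper function is invented for
illustration):

	static int grab_tag(struct percpu_ida *pool, bool can_sleep)
	{
		/*
		 * Mempool-style semantics: with __GFP_WAIT the call may
		 * sleep indefinitely until another thread frees a tag, so
		 * it is only safe when tags are guaranteed to come back
		 * (e.g. in-flight IO tags).  Without __GFP_WAIT it returns
		 * -ENOSPC immediately instead of sleeping.
		 */
		return percpu_ida_alloc(pool, can_sleep ? GFP_KERNEL : GFP_NOWAIT);
	}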


Re: [PATCH] percpu ida: Switch to cpumask_t, add some comments

2013-08-28 Thread Andrew Morton
On Wed, 28 Aug 2013 12:55:17 -0700 Kent Overstreet k...@daterainc.com wrote:

 Fixup patch, addressing Andrew's review feedback:

Looks reasonable.

  lib/idr.c   | 38 +-

I still don't think it should be in this file.

You say that some as-yet-unmerged patches will tie the new code into
the old ida code.  But will it do it in a manner which requires that
the two reside in the same file?



Re: [PATCH-v3 1/4] idr: Percpu ida

2013-08-28 Thread Andrew Morton
On Wed, 28 Aug 2013 13:44:54 -0700 Kent Overstreet k...@daterainc.com wrote:

What guarantees that this wait will terminate?
   
   It seems fairly clear to me from the break statement a couple lines up;
    if we were passed __GFP_WAIT we terminate iff we successfully allocated a
   tag. If we weren't passed __GFP_WAIT we never actually sleep.
  
  OK ;)  Let me rephrase.  What guarantees that a tag will become available?
  
  If what we have here is an open-coded __GFP_NOFAIL then that is
  potentially problematic.
 
 It's the same semantics as a mempool, really - it'll succeed when a tag
 gets freed.

OK, that's reasonable if the code is being used to generate IO tags -
we expect the in-flight tags to eventually be returned.

But if a client of this code is using the allocator for something
totally different, there is no guarantee that the act of waiting will
result in any tags being returned.

(These are core design principles/constraints which should be
explicitly documented in a place where future readers will see them!)



Re: [PATCH] percpu ida: Switch to cpumask_t, add some comments

2013-08-28 Thread Andrew Morton
On Wed, 28 Aug 2013 14:00:10 -0700 Kent Overstreet k...@daterainc.com wrote:

 On Wed, Aug 28, 2013 at 01:25:50PM -0700, Andrew Morton wrote:
  On Wed, 28 Aug 2013 12:55:17 -0700 Kent Overstreet k...@daterainc.com 
  wrote:
  
   Fixup patch, addressing Andrew's review feedback:
  
  Looks reasonable.
  
lib/idr.c   | 38 +-
  
  I still don't think it should be in this file.
  
  You say that some as-yet-unmerged patches will tie the new code into
  the old ida code.  But will it do it in a manner which requires that
  the two reside in the same file?
 
 Not require, no - but it's just intimate enough with my ida rewrite that
 I think it makes sense; it makes some use of stuff that should be
 internal to the ida code.
 
 Mostly just sharing the lock though, since I got rid of the ida
 interfaces that don't do locking, but percpu ida needs a lock that also
 covers what ida needs.
 
 It also makes use of a ganged allocation interface, but there's no real
 reason ida can't expose that, it's just unlikely to be useful to
 anything but percpu ida.
 
 The other reason I think it makes sense to live in idr.c is more for
 users of the code; as you pointed out as far as the user's perspective
 percpu ida isn't doing anything fundamentally different from ida, so I
 think it makes sense for the code to live in the same place as a
 kindness to future kernel developers who are trying to find their way
 around the various library code.

I found things to be quite the opposite - it took 5 minutes of staring,
head-scratching, double-checking and penny-dropping before I was
confident that the newly-added code actually has nothing at all to do
with the current code.  Putting it in the same file was misleading, and
I got misled.


Re: [PATCH-v3 1/4] idr: Percpu ida

2013-08-28 Thread Andrew Morton
On Wed, 28 Aug 2013 14:12:17 -0700 Kent Overstreet k...@daterainc.com wrote:

 How's this look?
 
 diff --git a/lib/idr.c b/lib/idr.c
 index 15c021c..a3f8e9a 100644
 --- a/lib/idr.c
 +++ b/lib/idr.c
 @@ -1288,6 +1288,11 @@ static inline unsigned alloc_local_tag(struct 
 percpu_ida *pool,
   * Safe to be called from interrupt context (assuming it isn't passed
   * __GFP_WAIT, of course).
   *
 + * @gfp indicates whether or not to wait until a free id is available (it's
 + * not used for internal memory allocations); thus if passed __GFP_WAIT we
 + * may sleep however long it takes until another thread frees an id (same
 + * semantics as a mempool).

Looks good.  Mentioning the mempool thing is effective - people
understand that.


Re: [PATCH] percpu ida: Switch to cpumask_t, add some comments

2013-08-28 Thread Andrew Morton
On Wed, 28 Aug 2013 14:23:58 -0700 Kent Overstreet k...@daterainc.com wrote:

  I found things to be quite the opposite - it took 5 minutes of staring,
  head-scratching, double-checking and penny-dropping before I was
  confident that the newly-added code actually has nothing at all to do
  with the current code.  Putting it in the same file was misleading, and
  I got misled.
 
 Ok... and I could see how the fact that it currently _doesn't_ have
 anything to do with the existing code would be confusing...
 
 Do you think that if/when it's making use of the ida rewrite it'll be
 ok? Or would you still prefer to have it in a new file

I'm constitutionally reluctant to ever assume that any out-of-tree code
will be merged.  Maybe you'll get hit by a bus, and maybe the code
sucks ;)

Are you sure that the two things are so tangled together that they must
live in the same file?  If there's some nice layering between ida and
percpu_ida then perhaps such a physical separation would remain
appropriate?

 (and if so, any preference on the naming?)

percpu_ida.c?


Re: [PATCH-v3 1/4] idr: Percpu ida

2013-08-20 Thread Andrew Morton
On Fri, 16 Aug 2013 23:09:06 + Nicholas A. Bellinger 
n...@linux-iscsi.org wrote:

 From: Kent Overstreet k...@daterainc.com
 
 Percpu frontend for allocating ids. With percpu allocation (that works),
 it's impossible to guarantee it will always be possible to allocate all
 nr_tags - typically, some will be stuck on a remote percpu freelist
 where the current job can't get to them.
 
 We do guarantee that it will always be possible to allocate at least
 (nr_tags / 2) tags - this is done by keeping track of which and how many
 cpus have tags on their percpu freelists. On allocation failure if
 enough cpus have tags that there could potentially be (nr_tags / 2) tags
 stuck on remote percpu freelists, we then pick a remote cpu at random to
 steal from.
 
 Note that there's no cpu hotplug notifier - we don't care, because
 steal_tags() will eventually get the down cpu's tags. We _could_ satisfy
 more allocations if we had a notifier - but we'll still meet our
 guarantees and it's absolutely not a correctness issue, so I don't think
 it's worth the extra code.

 ...

  include/linux/idr.h |   53 +
  lib/idr.c   |  316 
 +--

I don't think this should be in idr.[ch] at all.  It has no
relationship with the existing code.  Apart from duplicating its
functionality :(

 
 ...

 @@ -243,4 +245,55 @@ static inline int ida_get_new(struct ida *ida, int *p_id)
  
  void __init idr_init_cache(void);
  
 +/* Percpu IDA/tag allocator */
 +
 +struct percpu_ida_cpu;
 +
 +struct percpu_ida {
 + /*
 +  * number of tags available to be allocated, as passed to
 +  * percpu_ida_init()
 +  */
 +	unsigned		nr_tags;
 +
 + struct percpu_ida_cpu __percpu  *tag_cpu;
 +
 + /*
 +  * Bitmap of cpus that (may) have tags on their percpu freelists:
 +  * steal_tags() uses this to decide when to steal tags, and which cpus
 +  * to try stealing from.
 +  *
 +  * It's ok for a freelist to be empty when its bit is set - steal_tags()
 +  * will just keep looking - but the bitmap _must_ be set whenever a
 +  * percpu freelist does have tags.
 +  */
 + unsigned long   *cpus_have_tags;

Why not cpumask_t?
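
For reference, the cpumask form being suggested here -- which the
follow-up "percpu ida: Switch to cpumask_t" patch adopts -- would look
something like this sketch:

	#include <linux/cpumask.h>

	static cpumask_t cpus_have_tags;	/* instead of unsigned long * */

	static void note_cpu_has_tags(void)
	{
		cpumask_set_cpu(smp_processor_id(), &cpus_have_tags);
	}

	static int next_steal_victim(int last_stolen)
	{
		/* cycle through candidate cpus, wrapping at the end */
		int cpu = cpumask_next(last_stolen, &cpus_have_tags);

		if (cpu >= nr_cpu_ids)
			cpu = cpumask_first(&cpus_have_tags);
		return cpu;
	}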

 + struct {
 + spinlock_t  lock;
 + /*
 +  * When we go to steal tags from another cpu (see steal_tags()),
 +  * we want to pick a cpu at random. Cycling through them every
 +  * time we steal is a bit easier and more or less equivalent:
 +  */
 +		unsigned	cpu_last_stolen;
 +
 + /* For sleeping on allocation failure */
 + wait_queue_head_t   wait;
 +
 + /*
 +  * Global freelist - it's a stack where nr_free points to the
 +  * top
 +  */
 +		unsigned	nr_free;
 +		unsigned	*freelist;
 +	} ____cacheline_aligned_in_smp;

Why the ____cacheline_aligned_in_smp?

 +};
 
 ...

 +
 +/* Percpu IDA */
 +
 +/*
 + * Number of tags we move between the percpu freelist and the global
 + * freelist at a time

between a percpu freelist would be more accurate?

 + */
 +#define IDA_PCPU_BATCH_MOVE  32U
 +
 +/* Max size of percpu freelist, */
 +#define IDA_PCPU_SIZE	((IDA_PCPU_BATCH_MOVE * 3) / 2)
 +
 +struct percpu_ida_cpu {
 + spinlock_t  lock;
 +	unsigned	nr_free;
 +	unsigned	freelist[];
 +};

Data structure needs documentation.  There's one of these per cpu.  I
guess nr_free and freelist are clear enough.  The presence of a lock
in a percpu data structure is a surprise.  It's for cross-cpu stealing,
I assume?

 +static inline void move_tags(unsigned *dst, unsigned *dst_nr,
 +  unsigned *src, unsigned *src_nr,
 +  unsigned nr)
 +{
 + *src_nr -= nr;
 + memcpy(dst + *dst_nr, src + *src_nr, sizeof(unsigned) * nr);
 + *dst_nr += nr;
 +}
 +
 
 ...

 +static inline void alloc_global_tags(struct percpu_ida *pool,
 +  struct percpu_ida_cpu *tags)
 +{
 +	move_tags(tags->freelist, &tags->nr_free,
 +		  pool->freelist, &pool->nr_free,
 +		  min(pool->nr_free, IDA_PCPU_BATCH_MOVE));
 +}

Document this function?

 +static inline unsigned alloc_local_tag(struct percpu_ida *pool,
 +struct percpu_ida_cpu *tags)
 +{
 + int tag = -ENOSPC;
 +
 +	spin_lock(&tags->lock);
 +	if (tags->nr_free)
 +		tag = tags->freelist[--tags->nr_free];
 +	spin_unlock(&tags->lock);
 +
 + return tag;
 +}

I guess this one's clear enough, if the data structure relationships are
understood.

 +/**
 + * percpu_ida_alloc - allocate a tag
 + * @pool: pool to allocate from
 + * @gfp: gfp flags
 + *
 + * Returns a tag - an integer in the 

Re: [PATCH 1/2] mm: add support for discard of unused ptes

2013-07-30 Thread Andrew Morton
On Thu, 25 Jul 2013 10:54:20 +0200 Martin Schwidefsky schwidef...@de.ibm.com 
wrote:

 From: Konstantin Weitz konstantin.we...@gmail.com
 
 In a virtualized environment and given an appropriate interface the guest
 can mark pages as unused while they are free (for the s390 implementation
 see git commit 45e576b1c3d00206, "guest page hinting light"). For the host
 the unused state is a property of the pte.
 
 This patch adds the primitive 'pte_unused' and code to the host swap out
 handler so that pages marked as unused by all mappers are not swapped out
 but discarded instead, thus saving one IO for swap out and potentially
 another one for swap in.
 
 ...

 --- a/include/asm-generic/pgtable.h
 +++ b/include/asm-generic/pgtable.h
 @@ -193,6 +193,19 @@ static inline int pte_same(pte_t pte_a, pte_t pte_b)
  }
  #endif
  
 +#ifndef __HAVE_ARCH_PTE_UNUSED
 +/*
 + * Some architectures provide facilities to virtualization guests
 + * so that they can flag allocated pages as unused. This allows the
 + * host to transparently reclaim unused pages. This function returns
 + * whether the pte's page is unused.
 + */
 +static inline int pte_unused(pte_t pte)
 +{
 + return 0;
 +}
 +#endif
 +
  #ifndef __HAVE_ARCH_PMD_SAME
  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
  static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b)
 diff --git a/mm/rmap.c b/mm/rmap.c
 index cd356df..2291f25 100644
 --- a/mm/rmap.c
 +++ b/mm/rmap.c
 @@ -1234,6 +1234,16 @@ int try_to_unmap_one(struct page *page, struct 
 vm_area_struct *vma,
   }
   set_pte_at(mm, address, pte,
  swp_entry_to_pte(make_hwpoison_entry(page)));
 + } else if (pte_unused(pteval)) {
 + /*
 +  * The guest indicated that the page content is of no
 +  * interest anymore. Simply discard the pte, vmscan
 +  * will take care of the rest.
 +  */
 + if (PageAnon(page))
 + dec_mm_counter(mm, MM_ANONPAGES);
 + else
 + dec_mm_counter(mm, MM_FILEPAGES);
   } else if (PageAnon(page)) {
   swp_entry_t entry = { .val = page_private(page) };

Obviously harmless.  Please include this in whatever tree carries
[PATCH 2/2] s390/kvm: support collaborative memory management.
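
The arch side that pairs with this stub, roughly as s390 wires it up (a
sketch; _PAGE_UNUSED is the software pte bit set by the guest page
hinting code):

	/* arch/s390/include/asm/pgtable.h */
	#define __HAVE_ARCH_PTE_UNUSED
	static inline int pte_unused(pte_t pte)
	{
		return pte_val(pte) & _PAGE_UNUSED;
	}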


Re: [PATCH 04/10] powerpc: Prepare to support kernel handling of IOMMU map/unmap

2013-07-24 Thread Andrew Morton
On Tue, 23 Jul 2013 12:22:59 +1000 Alexey Kardashevskiy a...@ozlabs.ru wrote:

 Ping, anyone, please?

ew, you top-posted.

 Ben needs ack from any of MM people before proceeding with this patch. Thanks!

For what?  The three lines of comment in page-flags.h?   ack :)

Manipulating page->_count directly is considered poor form.  Don't
blame us if we break your code ;)

Actually, the manipulation in realmode_get_page() duplicates the
existing get_page_unless_zero() and the one in realmode_put_page()
could perhaps be placed in mm.h with a suitable name and some
documentation.  That would improve your form and might protect the code
from getting broken later on.
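
The existing helper being referred to (include/linux/mm.h in kernels of
that era):

	/*
	 * Grab a reference unless the page has already dropped to a zero
	 * refcount; realmode_get_page() open-codes exactly this on
	 * page->_count.
	 */
	static inline int get_page_unless_zero(struct page *page)
	{
		return atomic_inc_not_zero(&page->_count);
	}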



Re: [PATCH 8/9] mm: compaction: Cache if a pageblock was scanned and no pages were isolated

2012-09-25 Thread Andrew Morton
On Tue, 25 Sep 2012 10:12:07 +0100
Mel Gorman mgor...@suse.de wrote:

 First, we'd introduce a variant of get_pageblock_migratetype() that returns
 all the bits for the pageblock flags and then helpers to extract either the
 migratetype or the PG_migrate_skip. We already are incurring the cost of
 get_pageblock_migratetype() so it will not be much more expensive than what
 is already there. If there is an allocation or free within a pageblock that
 has the PG_migrate_skip bit set then we increment a counter. When the counter
 reaches some to-be-decided threshold then compaction may clear all the
 bits. This would match the criteria of the clearing being based on activity.
 
 There are four potential problems with this
 
 1. The logic to retrieve all the bits and split them up will be a little
 convoluted but maybe it would not be that bad.
 
 2. The counter is a shared-writable cache line but obviously it could
be moved to vmstat and incremented with inc_zone_page_state to offset
the cost a little.
 
 3. The biggest weakness is that there is no way to know if the
counter is incremented based on activity in a small subset of blocks.
 
 4. What should the threshold be?
 
 The first problem is minor but the other three are potentially a mess.
 Adding another vmstat counter is bad enough in itself but if the counter
 is incremented based on a small subset of pageblocks, the hint becomes
 potentially useless.
 
 However, does this match what you have in mind or am I over-complicating
 things?

Sounds complicated.

Using wall time really does suck.  Are you sure you can't think of
something more logical?

How would we demonstrate the suckage?  What would be the observeable downside of
switching that 5 seconds to 5 hours?

 +	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
 + struct page *page;
 + if (!pfn_valid(pfn))
 + continue;
 +
 + page = pfn_to_page(pfn);
 + if (zone != page_zone(page))
 + continue;
 +
 + clear_pageblock_skip(page);
 + }

What's the worst-case loop count here?

   
   zone->spanned_pages >> pageblock_order
  
  What's the worst-case value of (zone->spanned_pages >> pageblock_order) :)
 
 Let's take an unlikely case - a 128G single-node machine. That loop count
 on x86-64 would be 65536. It'll be fast enough, particularly in this
 path.

That could easily exceed a millisecond.  Can/should we stick a
cond_resched() in there?
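
Concretely, that suggestion would amount to something like this (sketch
only):

	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
		struct page *page;

		/* up to ~64K iterations on a 128G zone; yield periodically */
		cond_resched();

		if (!pfn_valid(pfn))
			continue;

		page = pfn_to_page(pfn);
		if (zone != page_zone(page))
			continue;

		clear_pageblock_skip(page);
	}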


Re: [PATCH 5/9] mm: compaction: Acquire the zone-lru_lock as late as possible

2012-09-25 Thread Andrew Morton
On Tue, 25 Sep 2012 17:13:27 +0900
Minchan Kim minc...@kernel.org wrote:

 I see. To me, your wording is better than the current comment.
 I hope the comment can be more explicit.
 
 diff --git a/mm/compaction.c b/mm/compaction.c
 index df01b4e..f1d2cc7 100644
 --- a/mm/compaction.c
 +++ b/mm/compaction.c
 @@ -542,8 +542,9 @@ isolate_migratepages_range(struct zone *zone, struct 
 compact_control *cc,
  * splitting and collapsing (collapsing has already happened
  * if PageLRU is set) but the lock is not necessarily taken
  * here and it is wasteful to take it just to check transhuge.
 -* Check transhuge without lock and skip if it's either a
 -* transhuge or hugetlbfs page.
 +* Check transhuge without lock and *skip* if it's either a
 +* transhuge or hugetlbfs page because it's not safe to call
 +* compound_order.
  */
 if (PageTransHuge(page)) {
 if (!locked)

Going a bit further:

--- 
a/mm/compaction.c~mm-compaction-acquire-the-zone-lru_lock-as-late-as-possible-fix
+++ a/mm/compaction.c
@@ -415,7 +415,8 @@ isolate_migratepages_range(struct zone *
 * if PageLRU is set) but the lock is not necessarily taken
 * here and it is wasteful to take it just to check transhuge.
 * Check transhuge without lock and skip if it's either a
-* transhuge or hugetlbfs page.
+* transhuge or hugetlbfs page because calling compound_order()
+* requires lru_lock to exclude isolation and splitting.
 */
if (PageTransHuge(page)) {
if (!locked)
_


but...  the requirement to hold lru_lock for compound_order() is news
to me.  It doesn't seem to be written down or explained anywhere, and
one wonders why the cheerily undocumented compound_lock() doesn't have
this effect.  What's going on here??



Re: [PATCH 8/9] mm: compaction: Cache if a pageblock was scanned and no pages were isolated

2012-09-24 Thread Andrew Morton
On Mon, 24 Sep 2012 10:39:38 +0100
Mel Gorman mgor...@suse.de wrote:

 On Fri, Sep 21, 2012 at 02:36:56PM -0700, Andrew Morton wrote:
 
  Also, what has to be done to avoid the polling altogether?  eg/ie, zap
  a pageblock's PB_migrate_skip synchronously, when something was done to
  that pageblock which justifies repolling it?
  
 
 The something event you are looking for is pages being freed or
 allocated in the page allocator. A movable page being allocated in block
 or a page being freed should clear the PB_migrate_skip bit if it's set.
 Unfortunately this would impact the fast path of the alloc and free paths
 of the page allocator. I felt that that was too high a price to pay.

We already do a similar thing in the page allocator: clearing of
->all_unreclaimable and ->pages_scanned.  But that isn't on the fast
path really - it happens once per pcp unload.  Can we do something
like that?  Drop some hint into the zone without having to visit each
page?

  
   ...
  
   +static void reset_isolation_suitable(struct zone *zone)
   +{
    +	unsigned long start_pfn = zone->zone_start_pfn;
    +	unsigned long end_pfn = zone->zone_start_pfn + zone->spanned_pages;
   + unsigned long pfn;
   +
   + /*
   +  * Do not reset more than once every five seconds. If allocations are
   +  * failing sufficiently quickly to allow this to happen then continually
   +  * scanning for compaction is not going to help. The choice of five
   +  * seconds is arbitrary but will mitigate excessive scanning.
   +  */
    +	if (time_before(jiffies, zone->compact_blockskip_expire))
    +		return;
    +	zone->compact_blockskip_expire = jiffies + (HZ * 5);
   +
   + /* Walk the zone and mark every pageblock as suitable for isolation */
    +	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
   + struct page *page;
   + if (!pfn_valid(pfn))
   + continue;
   +
   + page = pfn_to_page(pfn);
   + if (zone != page_zone(page))
   + continue;
   +
   + clear_pageblock_skip(page);
   + }
  
  What's the worst-case loop count here?
  
 
 zone->spanned_pages >> pageblock_order

What's the worst-case value of (zone->spanned_pages >> pageblock_order) :)


Re: [PATCH 4/9] mm: compaction: Abort compaction loop if lock is contended or run too long

2012-09-21 Thread Andrew Morton
On Fri, 21 Sep 2012 11:46:18 +0100
Mel Gorman mgor...@suse.de wrote:

 Changelog since V2
 o Fix BUG_ON triggered due to pages left on cc.migratepages
 o Make compact_zone_order() require non-NULL arg `contended'
 
 Changelog since V1
 o only abort the compaction if lock is contended or run too long
 o Rearranged the code by Andrea Arcangeli.
 
 isolate_migratepages_range() might isolate no pages, for example when
 zone->lru_lock is contended and compaction is running asynchronously. In this
 case we should abort compaction; otherwise compact_zone will run a
 useless loop and make zone->lru_lock even more contended.

hm, this appears to be identical to

mm-compaction-abort-compaction-loop-if-lock-is-contended-or-run-too-long.patch
mm-compaction-abort-compaction-loop-if-lock-is-contended-or-run-too-long-fix.patch
mm-compaction-abort-compaction-loop-if-lock-is-contended-or-run-too-long-fix-2.patch

so I simply omitted patches 2, 3 and 4.


Re: [PATCH 6/9] mm: compaction: Acquire the zone-lock as late as possible

2012-09-21 Thread Andrew Morton
On Fri, 21 Sep 2012 11:46:20 +0100
Mel Gorman mgor...@suse.de wrote:

 Compaction's free scanner acquires the zone->lock when checking for PageBuddy
 pages and isolating them. It does this even if there are no PageBuddy pages
 in the range.
 
 This patch defers acquiring the zone lock for as long as possible. In the
 event there are no free pages in the pageblock then the lock will not be
 acquired at all, which reduces contention on zone->lock.

 ...

 --- a/mm/compaction.c
 +++ b/mm/compaction.c
 @@ -93,6 +93,28 @@ static inline bool compact_trylock_irqsave(spinlock_t 
 *lock,
   return compact_checklock_irqsave(lock, flags, false, cc);
  }
  
 +/* Returns true if the page is within a block suitable for migration to */
 +static bool suitable_migration_target(struct page *page)
 +{
 +

stray newline

 + int migratetype = get_pageblock_migratetype(page);
 +
 + /* Don't interfere with memory hot-remove or the min_free_kbytes blocks 
 */
 + if (migratetype == MIGRATE_ISOLATE || migratetype == MIGRATE_RESERVE)
 + return false;
 +
 + /* If the page is a large free page, then allow migration */
 +	if (PageBuddy(page) && page_order(page) >= pageblock_order)
 + return true;
 +
 + /* If the block is MIGRATE_MOVABLE or MIGRATE_CMA, allow migration */
 + if (migrate_async_suitable(migratetype))
 + return true;
 +
 + /* Otherwise skip the block */
 + return false;
 +}
 +

 ...

 @@ -168,23 +193,38 @@ static unsigned long isolate_freepages_block(unsigned 
 long blockpfn,
   int isolated, i;
   struct page *page = cursor;
  
 - if (!pfn_valid_within(blockpfn)) {
 - if (strict)
 - return 0;
 - continue;
 - }
 + if (!pfn_valid_within(blockpfn))
 + goto strict_check;
   nr_scanned++;
  
 - if (!PageBuddy(page)) {
 - if (strict)
 - return 0;
 - continue;
 - }
 + if (!PageBuddy(page))
 + goto strict_check;
 +
 + /*
 +  * The zone lock must be held to isolate freepages. This
 +  * unfortunately this is a very coarse lock and can be

this this

 +  * heavily contended if there are parallel allocations
 +  * or parallel compactions. For async compaction do not
 +  * spin on the lock and we acquire the lock as late as
 +  * possible.
 +  */
 +	locked = compact_checklock_irqsave(&cc->zone->lock, &flags,
 + locked, cc);
 + if (!locked)
 + break;
 +
 + /* Recheck this is a suitable migration target under lock */
 + if (!strict  !suitable_migration_target(page))
 + break;
 +
 + /* Recheck this is a buddy page under lock */
 + if (!PageBuddy(page))
 + goto strict_check;
  
   /* Found a free page, break it into order-0 pages */
   isolated = split_free_page(page);
   if (!isolated  strict)
 - return 0;
 + goto strict_check;
   total_isolated += isolated;
   for (i = 0; i  isolated; i++) {
 	list_add(&page->lru, freelist);
 @@ -196,9 +236,23 @@ static unsigned long isolate_freepages_block(unsigned 
 long blockpfn,
   blockpfn += isolated - 1;
   cursor += isolated - 1;
   }
 +
 + continue;
 +
 +strict_check:
 + /* Abort isolation if the caller requested strict isolation */
 + if (strict) {
 + total_isolated = 0;
 + goto out;
 + }
   }
  
   trace_mm_compaction_isolate_freepages(nr_scanned, total_isolated);
 +
 +out:
 + if (locked)
 +	spin_unlock_irqrestore(&cc->zone->lock, flags);
 +
   return total_isolated;
  }

A few things about this function.

Would it be cleaner if we did

	if (!strict) {
		if (!suitable_migration_target(page))
			break;
	} else {
		if (!PageBuddy(page))
			goto strict_check;
	}

and then remove the test of `strict' from strict_check (which then
should be renamed)?

Which perhaps means that the whole `strict_check' block can go away:

	if (!strict) {
		if (!suitable_migration_target(page))
			break;
	} else {
		if (!PageBuddy(page)) {
			total_isolated = 0;
			goto out;
		}
	}

Have a think about it?  The function is a little straggly.

Secondly, it is correct/desirable to skip the (now misnamed

Re: [PATCH 8/9] mm: compaction: Cache if a pageblock was scanned and no pages were isolated

2012-09-21 Thread Andrew Morton
On Fri, 21 Sep 2012 11:46:22 +0100
Mel Gorman mgor...@suse.de wrote:

 When compaction was implemented it was known that scanning could potentially
 be excessive. The ideal was that a counter be maintained for each pageblock
 but maintaining this information would incur a severe penalty due to a
 shared writable cache line. It has reached the point where the scanning
 costs are a serious problem, particularly on long-lived systems where a
 large process starts and allocates a large number of THPs at the same time.
 
 Instead of using a shared counter, this patch adds another bit to the
 pageblock flags called PG_migrate_skip. If a pageblock is scanned by
 either migrate or free scanner and 0 pages were isolated, the pageblock
 is marked to be skipped in the future. When scanning, this bit is checked
 before any scanning takes place and the block skipped if set.
 
 The main difficulty with a patch like this is deciding when to ignore the
 cached information. If it's ignored too often, the scanning rates will still
 be excessive. If the information is too stale then allocations will fail
 that might have otherwise succeeded. In this patch
 
 o CMA always ignores the information
 o If the migrate and free scanner meet then the cached information will
   be discarded if it's at least 5 seconds since the last time the cache
   was discarded
 o If there are a large number of allocation failures, discard the cache.
 
 The time-based heuristic is very clumsy but there are few choices for a
 better event. Depending solely on multiple allocation failures still allows
 excessive scanning when THP allocations are failing in quick succession
 due to memory pressure. Waiting until memory pressure is relieved would
 cause compaction to continually fail instead of using reclaim/compaction
 to try to allocate the page. The time-based mechanism is clumsy but a better
 option is not obvious.

ick.

Wall time has sooo little relationship to what's happening in there. 
If we *have* to use polling, cannot we clock the poll with some metric
which is at least vaguely related to the amount of activity?  Number
(or proportion) of pages scanned, for example?  Or reset everything on
the Nth trip around the zone?  Or even a combination of one of these
*and* of wall time, so the system will at least work harder when MM is
under load.

Also, what has to be done to avoid the polling altogether?  eg/ie, zap
a pageblock's PB_migrate_skip synchronously, when something was done to
that pageblock which justifies repolling it?


 ...

 +static void reset_isolation_suitable(struct zone *zone)
 +{
 +	unsigned long start_pfn = zone->zone_start_pfn;
 +	unsigned long end_pfn = zone->zone_start_pfn + zone->spanned_pages;
 + unsigned long pfn;
 +
 + /*
 +  * Do not reset more than once every five seconds. If allocations are
 +  * failing sufficiently quickly to allow this to happen then continually
 +  * scanning for compaction is not going to help. The choice of five
 +  * seconds is arbitrary but will mitigate excessive scanning.
 +  */
 + if (time_before(jiffies, zone->compact_blockskip_expire))
 + return;
 + zone->compact_blockskip_expire = jiffies + (HZ * 5);
 +
 + /* Walk the zone and mark every pageblock as suitable for isolation */
 + for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
 + struct page *page;
 + if (!pfn_valid(pfn))
 + continue;
 +
 + page = pfn_to_page(pfn);
 + if (zone != page_zone(page))
 + continue;
 +
 + clear_pageblock_skip(page);
 + }

What's the worst-case loop count here?

 +}
 +

 ...




Re: qemu-kvm loops after kernel update

2012-09-18 Thread Andrew Morton
On Tue, 18 Sep 2012 21:40:31 +0200
Jiri Slaby jsl...@suse.cz wrote:

 On 09/13/2012 11:59 AM, Avi Kivity wrote:
  On 09/12/2012 09:11 PM, Jiri Slaby wrote:
  On 09/12/2012 10:18 AM, Avi Kivity wrote:
  On 09/12/2012 11:13 AM, Jiri Slaby wrote:
 
Please provide the output of vmxcap
  (http://goo.gl/c5lUO),
 
 Unrestricted guest   no
 
  The big real mode fixes.
 
 
 
  and a snapshot of kvm_stat while the guest is hung.
 
  kvm statistics
 
exits  6778198  615942
host_state_reload 1988 187
irq_exits 1523 138
mmu_cache_miss   4   0
fpu_reload   1   0
 
  Please run this as root so we get the tracepoint based output; and press
  'x' when it's running so we get more detailed output.
 
  kvm statistics
 
kvm_exit  13798699  330708
kvm_entry 13799110  330708
kvm_page_fault13793650  330604
kvm_exit(EXCEPTION_NMI)6188458  330604
kvm_exit(EXTERNAL_INTERRUPT)  2169 105
kvm_exit(TPR_BELOW_THRESHOLD)   82   0
kvm_exit(IO_INSTRUCTION) 6   0
 
  Strange, it's unable to fault in the very first page.
 
 I bisected that. Note the bisection log. I have never seen something 
 like that :D:
 git bisect start
 git bisect bad 3de9d1a1500472bc80478bd75e33fa9c1eba1422
 git bisect good fea7a08acb13524b47711625eebea40a0ede69a0
 git bisect good 95a2fe4baa1ad444df5f94bfc9416fc6b4b34cef
 git bisect good f42c0d57a5a60da03c705bdea9fbba381112dd60
 git bisect good 31a2e241a9e37a133278959044960c229acc5714
 git bisect good f15fb01c5593fa1b58cc7a8a9c59913e2625bf2e
 git bisect good 16d21ff46f5d50e311d07406c31f96916e5e8e1a
 git bisect good 0b84592f458b4e8567aa7d803aff382c1d3b64fd
 git bisect bad b955428e7f14cd29fe9d8059efa3ea4be679c83d
 git bisect bad 20c4da4f68fcade05eda9c9b7dbad0a78cc5efe8
 git bisect bad 31b90ed2a90f80fb528ac55ee357a815e1dedc36
 git bisect bad b273fe14ee5b38cecc7bce94ff35a0bf9ee4
 git bisect bad de426dbe9a60706b91b40397f69f819a39a06b6b
 git bisect bad 6b998094ec50248e72b9f251d0607b58b18dba38
 git bisect bad cf9b81d47a89f5d404a0cd8013b461617751e520
 
 === 8 ===
 
 Reverting cf9b81d47a89 (mm: wrap calls to set_pte_at_notify with 
 invalidate_range_start and invalidate_range_end) on the top of today's 
 -next fixes the issue.

hm, thanks.  This will probably take some time to resolve so I think
I'll drop

mm-move-all-mmu-notifier-invocations-to-be-done-outside-the-pt-lock.patch
mm-move-all-mmu-notifier-invocations-to-be-done-outside-the-pt-lock-fix.patch
mm-move-all-mmu-notifier-invocations-to-be-done-outside-the-pt-lock-fix-fix.patch
mm-wrap-calls-to-set_pte_at_notify-with-invalidate_range_start-and-invalidate_range_end.patch
 


Re: qemu-kvm loops after kernel update

2012-09-18 Thread Andrew Morton
On Wed, 19 Sep 2012 10:00:34 +1000 Stephen Rothwell s...@canb.auug.org.au 
wrote:

 Hi Andrew,
 
 On Tue, 18 Sep 2012 12:46:46 -0700 Andrew Morton a...@linux-foundation.org 
 wrote:
 
  hm, thanks.  This will probably take some time to resolve so I think
  I'll drop
  
  mm-move-all-mmu-notifier-invocations-to-be-done-outside-the-pt-lock.patch
  mm-move-all-mmu-notifier-invocations-to-be-done-outside-the-pt-lock-fix.patch
  mm-move-all-mmu-notifier-invocations-to-be-done-outside-the-pt-lock-fix-fix.patch
  mm-wrap-calls-to-set_pte_at_notify-with-invalidate_range_start-and-invalidate_range_end.patch
 
 Should I attempt to remove these from the akpm tree in linux-next today?

That would be best - there's no point in having people test (and debug)
dead stuff.

 Or should I just wait for a new mmotm?

You could be brave and test http://ozlabs.org/~akpm/mmots/ for me :)


Re: [PATCH] mm: mmu_notifier: fix inconsistent memory between secondary MMU and host

2012-08-22 Thread Andrew Morton
On Wed, 22 Aug 2012 18:29:55 +0200
Andrea Arcangeli aarca...@redhat.com wrote:

 On Wed, Aug 22, 2012 at 02:03:41PM +0800, Xiao Guangrong wrote:
  On 08/21/2012 11:06 PM, Andrea Arcangeli wrote:
   CPU0  CPU1
  oldpage[1] == 0 (both guest & host)
   oldpage[0] = 1
   trigger do_wp_page
  
  We always do ptep_clear_flush before set_pte_at_notify(),
  at this point, we have done:
pte = 0 and flush all tlbs
   mmu_notifier_change_pte
   spte = newpage + writable
 guest does newpage[1] = 1
 vmexit
 host read oldpage[1] == 0
  
    It cannot happen: at this point pte = 0, so the host cannot
    access oldpage anymore; a host read can generate a #PF, and it
    will be blocked on the page table lock until CPU 0 releases
   the lock.
 
 Agreed, this is why your fix is safe.
 
 ...

 Thanks a lot for fixing this subtle race!

I'll take that as an ack.

Unfortunately we weren't told the user-visible effects of the bug,
which often makes it hard to determine which kernel versions should be
patched.  Please do always provide this information when fixing a bug.




Re: [PATCH] mm: mmu_notifier: fix inconsistent memory between secondary MMU and host

2012-08-22 Thread Andrew Morton
On Wed, 22 Aug 2012 21:50:43 +0200
Andrea Arcangeli aarca...@redhat.com wrote:

 Hi Andrew,
 
 On Wed, Aug 22, 2012 at 12:15:35PM -0700, Andrew Morton wrote:
  On Wed, 22 Aug 2012 18:29:55 +0200
  Andrea Arcangeli aarca...@redhat.com wrote:
  
   On Wed, Aug 22, 2012 at 02:03:41PM +0800, Xiao Guangrong wrote:
On 08/21/2012 11:06 PM, Andrea Arcangeli wrote:
 CPU0  CPU1
    oldpage[1] == 0 (both guest & host)
 oldpage[0] = 1
 trigger do_wp_page

We always do ptep_clear_flush before set_pte_at_notify(),
at this point, we have done:
  pte = 0 and flush all tlbs
 mmu_notifier_change_pte
 spte = newpage + writable
   guest does newpage[1] = 1
   vmexit
   host read oldpage[1] == 0

   It cannot happen: at this point pte = 0, so the host cannot
   access oldpage anymore; a host read can generate a #PF, and it
   will be blocked on the page table lock until CPU 0
 releases the lock.
   
   Agreed, this is why your fix is safe.
   
   ...
  
   Thanks a lot for fixing this subtle race!
  
  I'll take that as an ack.
 
 Yes thanks!
 
 I'd also like a comment that explains why in that case the order is
 reversed. The reverse order immediately rings an alarm bell otherwise
 ;). But the comment can be added with an incremental patch.

If you can suggest some text I'll type it in right now.
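
For illustration - wording mine, content taken only from Xiao's
explanation quoted above - such a comment might read:

	/*
	 * set_pte_at_notify() must run after ptep_clear_flush(): in
	 * the window between the two the pte is clear, so any host
	 * access faults and blocks on the page table lock, and the
	 * secondary MMU cannot still see the old page once the new
	 * spte has been set up.
	 */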

  Unfortunately we weren't told the user-visible effects of the bug,
  which often makes it hard to determine which kernel versions should be
  patched.  Please do always provide this information when fixing a bug.
 
 This is best answered by Xiao who said it's a testcase triggering
 this.
 
 It requires the guest reading memory on CPU0 while the host writes to
 the same memory on CPU1, while CPU2 triggers the copy on write fault
 on another part of the same page (slightly before CPU1 writes). The
 host writes of CPU1 would need to happen in a microsecond window, and
 they wouldn't be immediately propagated to the guest in CPU0. They
 would still appear in the guest but with a microsecond delay (the
 guest has the spte mapped readonly when this happens so it's only a
 guest microsecond delayed reading problem as far as I can tell). I
 guess most of the time it would fall into the undefined by timing
 scenario so it's hard to tell how the side effect could escalate.

OK, thanks.  For this sort of thing I am dependent upon KVM people to
suggest whether the fix should be in 3.6 and whether -stable needs it.



Re: [PATCH] KVM: Avoid wasting pages for small lpage_info arrays

2012-05-15 Thread Andrew Morton
On Tue, 15 May 2012 11:02:17 +0300
Avi Kivity a...@redhat.com wrote:

 On 05/14/2012 04:29 PM, Takuya Yoshikawa wrote:
  On Sun, 13 May 2012 13:20:46 +0300
  Avi Kivity a...@redhat.com wrote:
 
   I don't feel that the savings is worth the extra complication.  We save
   two pages per memslot here.
 
  Using a 4KB vmalloced page for a 16B array is ...
 
  Actually I felt like you before and did not do this, but recently there
  was a talk about creating hundreds of memslots.
 
   What about using kvmalloc() instead of vmalloc()?  It's in
   security/apparmor now, but can be made generic.
 
  Andrew once, maybe more than once, rejected making such an API generic, saying
  that there should not be a generic criterion by which we can decide which
  function - vmalloc() or kmalloc() - to use.
 
  So each caller should decide by its own criteria.
 
  In this case, we need to implement kvm specific kvmalloc().
  BTW, we are already doing this for dirty_bitmap.
 
 Okay, a local kvmalloc() is better than open-coding the logic.
 
 Andrew, prepare yourself for some code duplication.

There are reasons for avoiding vmalloc().

The kernel does not run in a virtual memory environment.  It is a
harsh, low-level environment and kernel code should be robust. 
Assuming that you can allocate vast amounts of contiguous memory is not
robust.  Robust code will implement data structures which avoid this
weakness.
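
For context, the kind of local kvmalloc() helper Avi refers to is
typically something like this sketch (function name and threshold are
illustrative only - and it is exactly the pattern the above is a
warning about):

	static void *kvm_kvzalloc(unsigned long size)
	{
		/* fall back to vmalloc for anything too big to expect
		 * physically contiguous pages for */
		if (size > PAGE_SIZE)
			return vzalloc(size);
		return kzalloc(size, GFP_KERNEL);
	}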


Re: [PATCH] bitops: add _local bitops

2012-05-09 Thread Andrew Morton
On Wed, 9 May 2012 16:45:29 +0300
Michael S. Tsirkin m...@redhat.com wrote:

 kvm needs to update some hypervisor variables atomically
 in a sense that the operation can't be interrupted
 in the middle. However the hypervisor always runs
 on the same CPU so it does not need any memory
 barrier or lock prefix.

Well.  It adds more complexity, makes the kernel harder to understand
and maintain and introduces more opportunities for developers to add
bugs.  So from that point of view, the best way of handling this patch
is to delete it.

Presumably the patch offers some benefit to offset all those costs.
But you didn't tell us what that benefit is, so we cannot make
a decision.

IOW: numbers, please.  Convincing ones, for realistic test cases.


Secondly: can KVM just use __set_bit() and friends?  I suspect those
interfaces happen to meet your requirements.  At least on architectures
you care about.
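
For example (a sketch - update_hv_flag is a made-up name, and whether
the non-atomic variants are single-instruction RMWs, hence safe against
interruption, is architecture-dependent, though it holds on x86):

	static void update_hv_flag(unsigned long *word, int nr)
	{
		/* non-atomic variant: no LOCK prefix and no barriers;
		 * on x86 this still compiles to a single rmw
		 * instruction, so an interrupt cannot split it */
		__set_bit(nr, word);
	}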



Re: [PATCH] kvm: don't call mmu_shrinker w/o used_mmu_pages

2012-04-20 Thread Andrew Morton
On Fri, 13 Apr 2012 15:38:41 -0700
Ying Han ying...@google.com wrote:

 The mmu_shrink() is heavy by itself, iterating all kvms and holding
 the kvm_lock. We spotted the code w/ Rik during LSF, and it turns out we
 don't need to call the shrinker if there is nothing to shrink.
 

We should probably tell the kvm maintainers about this ;)

 --- a/arch/x86/kvm/mmu.c
 +++ b/arch/x86/kvm/mmu.c
 @@ -188,6 +188,11 @@ static u64 __read_mostly shadow_mmio_mask;
  
  static void mmu_spte_set(u64 *sptep, u64 spte);
  
 +static inline int get_kvm_total_used_mmu_pages()
 +{
 + return percpu_counter_read_positive(&kvm_total_used_mmu_pages);
 +}
 +
  void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask)
  {
   shadow_mmio_mask = mmio_mask;
 @@ -3900,6 +3905,9 @@ static int mmu_shrink(struct shrinker *shrink, struct shrink_control *sc)
   if (nr_to_scan == 0)
   goto out;
  
 + if (!get_kvm_total_used_mmu_pages())
 + return 0;
 +
 + raw_spin_lock(&kvm_lock);
  
   list_for_each_entry(kvm, vm_list, vm_list) {
 @@ -3926,7 +3934,7 @@ static int mmu_shrink(struct shrinker *shrink, struct shrink_control *sc)
   raw_spin_unlock(&kvm_lock);
  
  out:
  - return percpu_counter_read_positive(&kvm_total_used_mmu_pages);
 + return get_kvm_total_used_mmu_pages();
  }
  
  static struct shrinker mmu_shrinker = {

There's a small functional change: percpu_counter_read_positive() is an
approximate thing, so there will be cases where some pages are
accounted for only in the percpu_counter's per-cpu
accumulators.  In that case mmu_shrink() will bale out when there are
in fact some freeable pages available.  This is hopefully unimportant.
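
To spell out the difference (both interfaces live in
lib/percpu_counter.c; sketch only):

	/* cheap but approximate: ignores the per-cpu deltas */
	pages = percpu_counter_read_positive(&kvm_total_used_mmu_pages);

	/* exact but slower: takes the counter's lock and folds the
	 * per-cpu deltas in */
	pages = percpu_counter_sum(&kvm_total_used_mmu_pages);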

Do we actually know that this patch helps anything?  Any measurements? Is
kvm_total_used_mmu_pages==0 at all common?



Re: [PATCH 0/3] Unmapped page cache control (v5)

2011-03-30 Thread Andrew Morton
On Wed, 30 Mar 2011 11:00:26 +0530
Balbir Singh bal...@linux.vnet.ibm.com wrote:

 Data from the previous patchsets can be found at
 https://lkml.org/lkml/2010/11/30/79

It would be nice if the data for the current patchset was present in
the current patchset's changelog!



Re: [PATCH 3/3] Provide control over unmapped pages (v5)

2011-03-30 Thread Andrew Morton
On Wed, 30 Mar 2011 11:02:38 +0530
Balbir Singh bal...@linux.vnet.ibm.com wrote:

 Changelog v4
 1. Added documentation for max_unmapped_pages
 2. Better #ifdef'ing of max_unmapped_pages and min_unmapped_pages
 
 Changelog v2
 1. Use a config option to enable the code (Andrew Morton)
 2. Explain the magic tunables in the code or at least attempt
to explain them (General comment)
 3. Hint uses of the boot parameter with unlikely (Andrew Morton)
 4. Use better names (balanced is not a good naming convention)
 
 Provide control using zone_reclaim() and a boot parameter. The
 code reuses functionality from zone_reclaim() to isolate unmapped
 pages and reclaim them as a priority, ahead of other mapped pages.
 

This:

akpm:/usr/src/25 grep '^+#' 
patches/provide-control-over-unmapped-pages-v5.patch 
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA)
+#endif
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
+#endif
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA)
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
+#endif
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
+#else
+#endif
+#ifdef CONFIG_NUMA
+#else
+#define zone_reclaim_mode 0
+#endif
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA)
+#endif
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
+#endif
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA)
+#endif
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
+#endif
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA)
+#endif
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
+#endif
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
+#else /* !CONFIG_UNMAPPED_PAGECACHE_CONTROL */
+#endif
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA)
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
+#endif
+#endif
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)

is getting out of control.  What happens if we just make the feature
non-configurable?

 +static int __init unmapped_page_control_parm(char *str)
 +{
 + unmapped_page_control = 1;
 + /*
 +  * XXX: Should we tweak swappiness here?
 +  */
 + return 1;
 +}
 +__setup("unmapped_page_control", unmapped_page_control_parm);

That looks like a pain - it requires a reboot to change the option,
which makes testing harder and slower.  Methinks you're being a bit
virtualization-centric here!

 +#else /* !CONFIG_UNMAPPED_PAGECACHE_CONTROL */
 +static inline void reclaim_unmapped_pages(int priority,
 + struct zone *zone, struct scan_control *sc)
 +{
 + return 0;
 +}
 +#endif
 +
  static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
 struct scan_control *sc)
  {
 @@ -2371,6 +2394,12 @@ loop_again:
   shrink_active_list(SWAP_CLUSTER_MAX, zone,
   sc, priority, 0);
  
 + /*
 +  * We do unmapped page reclaim once here and once
 +  * below, so that we don't lose out
 +  */
 + reclaim_unmapped_pages(priority, zone, sc);

Doing this here seems wrong.  balance_pgdat() does two passes across
the zones.  The first pass is a read-only work-out-what-to-do pass and
the second pass is a now-reclaim-some-stuff pass.  But here we've stuck
a do-some-reclaiming operation inside the first, work-out-what-to-do pass.


 @@ -2408,6 +2437,11 @@ loop_again:
   continue;
  
   sc.nr_scanned = 0;
 + /*
 +  * Reclaim unmapped pages upfront, this should be
 +  * really cheap

Comment is mysterious.  Why is it cheap?

 +  */
 + reclaim_unmapped_pages(priority, zone, sc);


I dunno, the whole thing seems rather nasty to me.

It sticks a magical reclaim-unmapped-pages operation right in the
middle of regular page reclaim.  This means that reclaim will walk the
LRU looking at mapped and unmapped pages.  Then it will walk some more,
looking at only unmapped pages and moving the mapped ones to the head
of the LRU.  Then it goes back to looking at mapped and unmapped pages.
So it rather screws up the LRU ordering and page aging, does it not?

Also, the special-case handling sticks out like a sore thumb.  Would it
not be better to manage the mapped/unmapped bias within the core of the
regular scanning?  ie: in shrink_page_list().
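
i.e. something vaguely like this, perhaps - a hypothetical sketch only,
reusing the patch's own identifiers inside shrink_page_list():

	struct zone *zone = page_zone(page);

	/* when unmapped page cache is over its limit, keep mapped
	 * pages on the LRU and go after the unmapped ones first */
	if (unmapped_page_control && page_mapped(page) &&
	    zone_unmapped_file_pages(zone) > zone->min_unmapped_pages)
		goto keep_locked;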


Re: [PATCH 0/3] Unmapped page cache control (v5)

2011-03-30 Thread Andrew Morton
On Thu, 31 Mar 2011 10:57:03 +0530 Balbir Singh bal...@linux.vnet.ibm.com 
wrote:

 * Andrew Morton a...@linux-foundation.org [2011-03-30 16:36:07]:
 
  On Wed, 30 Mar 2011 11:00:26 +0530
  Balbir Singh bal...@linux.vnet.ibm.com wrote:
  
   Data from the previous patchsets can be found at
   https://lkml.org/lkml/2010/11/30/79
  
  It would be nice if the data for the current patchset was present in
  the current patchset's changelog!
 
 
 Sure, since there were no major changes, I put in a URL. The main
 change was the documentation update. 

Well some poor schmuck has to copy and paste the data into the
changelog so it's still there in five years time.  It's better to carry
this info around in the patch's own metadata, and to maintain
and update it.



Re: [RFC PATCH 1/3] Weight-balanced tree

2011-02-24 Thread Andrew Morton
On Tue, 22 Feb 2011 11:55:05 -0700
Alex Williamson alex.william...@redhat.com wrote:

 Signed-off-by: Alex Williamson alex.william...@redhat.com
 ---
 
  include/linux/wbtree.h |   55 
  lib/Makefile   |3 +
  lib/wbtree.c   |  170 
 
  3 files changed, 227 insertions(+), 1 deletions(-)
  create mode 100644 include/linux/wbtree.h
  create mode 100644 lib/wbtree.c
 
 diff --git a/include/linux/wbtree.h b/include/linux/wbtree.h
 new file mode 100644
 index 000..ef70f6f
 --- /dev/null
 +++ b/include/linux/wbtree.h
 @@ -0,0 +1,55 @@
 +/*
 + * Weight-balanced binary tree
 + *
 + * The balance of this tree is based on search probability.  The
 + * heaviest weighted nodes (the ones we're most likely to hit) are
 + * at the top of each subtree.
 + *
 + * Copyright (C) 2011 Red Hat, Inc.
 + *
 + * Authors:
 + *   Alex Williamson alex.william...@redhat.com
 + *
 + * This work is licensed under the terms of the GNU GPL, version 2.  See
 + * the COPYING file in the top-level directory.
 + */
 +#ifndef _LINUX_WBTREE_H
 +#define _LINUX_WBTREE_H
 +
 +#include <linux/kernel.h>
 +#include <linux/stddef.h>
 +
 +struct wb_node {
 + struct wb_node *wb_parent;
 + struct wb_node *wb_left;
 + struct wb_node *wb_right;
 + unsigned long wb_weight;
 +};
 +
 +struct wb_root {
 + struct wb_node *wb_node;
 +};
 +
 +#define WB_ROOT (struct wb_root) { NULL, }
 +#define wb_entry(ptr, type, member) container_of(ptr, type, member)

Please implement this as a C function.  That way we get typechecking
which container_of() is unable to provide.
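
i.e., at each use site, something like this sketch (the containing type
is made up for the example):

	static inline struct foo *wb_node_to_foo(struct wb_node *node)
	{
		/* unlike the wb_entry() macro, passing anything other
		 * than a struct wb_node * here is a compile error */
		return container_of(node, struct foo, wb_node);
	}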

Otherwise the code looks OK to me, apart from the total lack of
documentation.  Please document at least all the exported interfaces. 
The documentation should be designed to tell people how to maintain
this code, and how to use this code reliably and well from other subsystems.
Don't fall into the trap of just dumbly filling out templates.

The documentation should also carefully explain the locking
requirements which these functions place upon callers.  (rwlocks used
how?  rcu?)

Once that is all squared away, please merge the code via the KVM tree.



Re: [PATCH 3/3][RESEND] Provide control over unmapped pages (v4)

2011-02-08 Thread Andrew Morton
On Tue, 01 Feb 2011 22:25:45 +0530
Balbir Singh bal...@linux.vnet.ibm.com wrote:

 Changelog v4
 1. Add max_unmapped_ratio and use that as the upper limit
 to check when to shrink the unmapped page cache (Christoph
 Lameter)
 
 Changelog v2
 1. Use a config option to enable the code (Andrew Morton)
 2. Explain the magic tunables in the code or at least attempt
to explain them (General comment)
 3. Hint uses of the boot parameter with unlikely (Andrew Morton)
 4. Use better names (balanced is not a good naming convention)
 
 Provide control using zone_reclaim() and a boot parameter. The
 code reuses functionality from zone_reclaim() to isolate unmapped
 pages and reclaim them as a priority, ahead of other mapped pages.
 A new sysctl for max_unmapped_ratio is provided and set to 16;
 when more than 16% of the total zone pages are unmapped, we start
 shrinking unmapped page cache.

We'll need some documentation for sysctl_max_unmapped_ratio, please. 
In Documentation/sysctl/vm.txt, I suppose.

It will be interesting to find out what this ratio refers to.  It
appears to be a percentage.  We've had problems in the past where 1% was
way too much and we had to change the kernel to provide much
finer-grained control.


 ...

 --- a/include/linux/mmzone.h
 +++ b/include/linux/mmzone.h
 @@ -306,7 +306,10 @@ struct zone {
   /*
* zone reclaim becomes active if more unmapped pages exist.
*/
 +#if defined(CONFIG_UNMAPPED_PAGE_CONTROL) || defined(CONFIG_NUMA)
   unsigned long   min_unmapped_pages;
 + unsigned long   max_unmapped_pages;
 +#endif

This change breaks the connection between min_unmapped_pages and its
documentation, and fails to document max_unmapped_pages.

Also, afacit if CONFIG_NUMA=y and CONFIG_UNMAPPED_PAGE_CONTROL=n,
max_unmapped_pages will be present in the kernel image and will appear
in /proc but it won't actually do anything.  Seems screwed up and
misleading.


 ...

 +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
 +/*
 + * Routine to reclaim unmapped pages, inspired from the code under
 + * CONFIG_NUMA that does unmapped page and slab page control by keeping
 + * min_unmapped_pages in the zone. We currently reclaim just unmapped
 + * pages, slab control will come in soon, at which point this routine
 + * should be called reclaim cached pages
 + */
 +unsigned long reclaim_unmapped_pages(int priority, struct zone *zone,
 + struct scan_control *sc)
 +{
 + if (unlikely(unmapped_page_control) &&
 + (zone_unmapped_file_pages(zone) > zone->min_unmapped_pages)) {
 + struct scan_control nsc;
 + unsigned long nr_pages;
 +
 + nsc = *sc;
 +
 + nsc.swappiness = 0;
 + nsc.may_writepage = 0;
 + nsc.may_unmap = 0;
 + nsc.nr_reclaimed = 0;
 +
 + nr_pages = zone_unmapped_file_pages(zone) -
 + zone->min_unmapped_pages;
 + /*
 +  * We don't want to be too aggressive with our
 +  * reclaim, it is our best effort to control
 +  * unmapped pages
 +  */
 + nr_pages >>= 3;
 +
 + zone_reclaim_pages(zone, nsc, nr_pages);
 + return nsc.nr_reclaimed;
 + }
 + return 0;
 +}

This returns an undocumented ulong which is never used by callers.




Re: [PATCH 1/2] mm, Make __get_user_pages return -EHWPOISON for HWPOISON page optionally

2011-01-27 Thread Andrew Morton
On Thu, 27 Jan 2011 15:29:05 +0800
Huang Ying ying.hu...@intel.com wrote:

 Hi, Andrew,
 
 On Thu, 2011-01-20 at 23:50 +0800, Marcelo Tosatti wrote:
  On Mon, Jan 17, 2011 at 08:47:39AM +0800, Huang Ying wrote:
   Hi, Andrew,
   
   On Sun, 2011-01-16 at 23:35 +0800, Avi Kivity wrote:
On 01/14/2011 03:37 AM, Huang Ying wrote:
 On Thu, 2011-01-13 at 18:43 +0800, Avi Kivity wrote:
   On 01/13/2011 10:42 AM, Huang Ying wrote:
 Make __get_user_pages return -EHWPOISON for HWPOISON page only 
  if
 FOLL_HWPOISON is specified.  With this patch, the interested 
  callers
 can distinguish HWPOISON page from general FAULT page, while 
  other
 callers will still get -EFAULT for pages, so the user space 
  interface
 need not be changed.
   
 get_user_pages_hwpoison is added as a variant of get_user_pages 
  that
 can return -EHWPOISON for HWPOISON page.
   
 This feature is needed by KVM, where UCR MCE should be relayed 
  to
 guest for HWPOISON page, while instruction emulation and MMIO 
  will be
 tried for general FAULT page.
   
 The idea comes from Andrew Morton.
   
 Signed-off-by: Huang Ying ying.hu...@intel.com
 Cc: Andrew Morton a...@linux-foundation.org
   
 +#ifdef CONFIG_MEMORY_FAILURE
 +int get_user_pages_hwpoison(struct task_struct *tsk, struct 
  mm_struct *mm,
 +   unsigned long start, int nr_pages, 
  int write,
 +   int force, struct page **pages,
 +   struct vm_area_struct **vmas);
 +#else
 
   Since we'd also like to add get_user_pages_noio(), perhaps adding a
   flags field to get_user_pages() is better.

 Or export __get_user_pages()?

That's better, yes.
   
   Do you think it is a good idea to export __get_user_pages() instead of
   adding get_user_pages_noio() and get_user_pages_hwpoison()?
  
  Better Andrew and/or Linus should decide it.
 
 We really need your comments about this.
 

Export __get_user_pages(), I guess.  We don't need hwpoison/kvm-specific
stuff in mm/memory.c and
get_user_pages_hwpoison()/get_user_pages_noio() are very thin
wrappers.

The number of args to these functions is getting nutty - you'll
probably find that it is beneficial to inline these wrapper functions, if
the number of callsites is small.
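
The thin wrapper in question would reduce to something like this sketch
(argument handling abbreviated; the flag names are from the era's
linux/mm.h, and the real __get_user_pages() prototype may differ in
detail):

	static inline int get_user_pages_hwpoison(struct task_struct *tsk,
			struct mm_struct *mm, unsigned long start,
			int nr_pages, int write, int force,
			struct page **pages, struct vm_area_struct **vmas)
	{
		unsigned int flags = FOLL_TOUCH | FOLL_HWPOISON;

		if (pages)
			flags |= FOLL_GET;
		if (write)
			flags |= FOLL_WRITE;
		if (force)
			flags |= FOLL_FORCE;
		return __get_user_pages(tsk, mm, start, nr_pages, flags,
					pages, vmas);
	}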



Re: [PATCH 1/2] mm, Make __get_user_pages return -EHWPOISON for HWPOISON page optionally

2011-01-27 Thread Andrew Morton
On Fri, 28 Jan 2011 01:57:11 +0100
Andi Kleen a...@firstfloor.org wrote:

 I personally would consider it cleaner to have clearly
 defined wrappers instead of complicated flags in the caller.
 
  The number of args to these functions is getting nutty - you'll
  probably find that it is beneficial to inline these wrapper functions, if
  the number of callsites is small.
 
 Really the functions are so heavy weight it should not matter.
 The problem with inlining is that you end up with the code in
 the header file and I personally find that much harder to browse
 instead of having everything in one file.

You'll cope.

 Also longer term we'll get compilers that can do cross-file inlining
 for optimized builds.

Which we'll probably need to turn off all over the place :(

 So please better avoid these kinds of micro optimizations unless
 it's a really really extremely speed-critical path.

It's not just speed and it's not just .text size, either. Calling a
ten-arg function consumes stack space.


Re: [RFC 1/3] mm, Make __get_user_pages return -EHWPOISON for HWPOISON page optionally

2010-12-22 Thread Andrew Morton
On Wed, 22 Dec 2010 10:51:55 +0800
Huang Ying ying.hu...@intel.com wrote:

 Make __get_user_pages return -EHWPOISON for HWPOISON page only if
 FOLL_HWPOISON is specified.  With this patch, the interested callers
 can distinguish HWPOISON page from general FAULT page, while other
 callers will still get -EFAULT for pages, so the user space interface
 need not be changed.
 
 get_user_pages_hwpoison is added as a variant of get_user_pages that
 can return -EHWPOISON for HWPOISON page.
 
 This feature is needed by KVM, where UCR MCE should be relayed to
 guest for HWPOISON page, while instruction emulation and MMIO will be
 tried for general FAULT page.
 
 The idea comes from Andrew Morton.

hm, I don't remember that.  I suspect it came from someone else?

 Signed-off-by: Huang Ying ying.hu...@intel.com
 Cc: Andrew Morton a...@linux-foundation.org
 ---
  include/asm-generic/errno.h |2 +
  include/linux/mm.h  |5 
  mm/memory.c |   53 
 +---
  3 files changed, 57 insertions(+), 3 deletions(-)
 
 --- a/include/asm-generic/errno.h
 +++ b/include/asm-generic/errno.h
 @@ -108,4 +108,6 @@
  
  #define ERFKILL  132 /* Operation not possible due to 
 RF-kill */
  
 +#define EHWPOISON133 /* Memory page has hardware error */

So this can be propagated to userspace?

If so, which syscalls might return EHWPOISON and are there any manpage
implications?

  #endif
 --- a/include/linux/mm.h
 +++ b/include/linux/mm.h
 @@ -860,6 +860,10 @@ int get_user_pages(struct task_struct *t
   struct page **pages, struct vm_area_struct **vmas);
  int get_user_pages_fast(unsigned long start, int nr_pages, int write,
   struct page **pages);
 +int get_user_pages_hwpoison(struct task_struct *tsk, struct mm_struct *mm,
 + unsigned long start, int nr_pages, int write,
 + int force, struct page **pages,
 + struct vm_area_struct **vmas);
  struct page *get_dump_page(unsigned long addr);
  
  extern int try_to_release_page(struct page * page, gfp_t gfp_mask);
 @@ -1415,6 +1419,7 @@ struct page *follow_page(struct vm_area_
  #define FOLL_GET 0x04/* do get_page on page */
  #define FOLL_DUMP0x08/* give error on hole if it would be zero */
  #define FOLL_FORCE   0x10/* get_user_pages read/write w/o permission */
 +#define FOLL_HWPOISON0x20/* check page is hwpoisoned */
  
  typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
   void *data);
 --- a/mm/memory.c
 +++ b/mm/memory.c
 @@ -1449,9 +1449,16 @@ int __get_user_pages(struct task_struct
  if (ret & VM_FAULT_ERROR) {
  if (ret & VM_FAULT_OOM)
  return i ? i : -ENOMEM;
 - if (ret &
 - (VM_FAULT_HWPOISON|VM_FAULT_HWPOISON_LARGE|
 -  VM_FAULT_SIGBUS))
 + if (ret & (VM_FAULT_HWPOISON |
 +VM_FAULT_HWPOISON_LARGE)) {
 + if (i)
 + return i;
 + else if (gup_flags & FOLL_HWPOISON)
 + return -EHWPOISON;
 + else
 + return -EFAULT;
 + }
 + if (ret & VM_FAULT_SIGBUS)

hm, that function is getting a bit unwieldy.

  /**
 + * get_user_pages_hwpoison() - pin user pages in memory, return hwpoison 
 status
 + * @tsk: task_struct of target task
 + * @mm:  mm_struct of target mm
 + * @start:   starting user address
 + * @nr_pages:number of pages from start to pin
 + * @write:   whether pages will be written to by the caller
 + * @force:   whether to force write access even if user mapping is
 + *   readonly. This will result in the page being COWed even
 + *   in MAP_SHARED mappings. You do not want this.
 + * @pages:   array that receives pointers to the pages pinned.
 + *   Should be at least nr_pages long. Or NULL, if caller
 + *   only intends to ensure the pages are faulted in.
 + * @vmas:array of pointers to vmas corresponding to each page.
 + *   Or NULL if the caller does not require them.
 + *
 + * Returns number of pages pinned.
 + *
 + * If the page table or memory page is hwpoisoned, return -EHWPOISON.
 + *
 + * Otherwise, same as get_user_pages.
 + */
 +int get_user_pages_hwpoison(struct task_struct *tsk, struct mm_struct *mm,
 + unsigned long start, int nr_pages, int write

Re: [RFC 1/3] mm, Make __get_user_pages return -EHWPOISON for HWPOISON page optionally

2010-12-22 Thread Andrew Morton
On Thu, 23 Dec 2010 08:39:59 +0800
Huang Ying ying.hu...@intel.com wrote:

   --- a/mm/memory.c
   +++ b/mm/memory.c
   @@ -1449,9 +1449,16 @@ int __get_user_pages(struct task_struct
  if (ret & VM_FAULT_ERROR) {
  if (ret & VM_FAULT_OOM)
  return i ? i : -ENOMEM;
    - if (ret &
    - (VM_FAULT_HWPOISON|VM_FAULT_HWPOISON_LARGE|
    -  VM_FAULT_SIGBUS))
    + if (ret & (VM_FAULT_HWPOISON |
    +VM_FAULT_HWPOISON_LARGE)) {
    + if (i)
    + return i;
    + else if (gup_flags & FOLL_HWPOISON)
    + return -EHWPOISON;
    + else
    + return -EFAULT;
    + }
    + if (ret & VM_FAULT_SIGBUS)
  
  hm, that function is getting a bit unwieldy.
 
 Yes. Do you think the following code is better?
 
 return i ? i : (gup_flags & FOLL_HWPOISON) ? -EHWPOISON : -EFAULT;

nope ;)

The function just needs to be ripped up and redone somehow.  That's not
an appropriate activity in the context of this patch though.



Re: [PATCH 1/3] Move zone_reclaim() outside of CONFIG_NUMA

2010-11-30 Thread Andrew Morton
On Tue, 30 Nov 2010 15:45:12 +0530
Balbir Singh bal...@linux.vnet.ibm.com wrote:

 This patch moves zone_reclaim and associated helpers
 outside CONFIG_NUMA. This infrastructure is reused
 in the patches for page cache control that follow.
 

Thereby adding a nice dollop of bloat to everyone's kernel.  I don't
think that is justifiable given that the audience for this feature is
about eight people :(

How's about CONFIG_UNMAPPED_PAGECACHE_CONTROL?

Also this patch instantiates sysctl_min_unmapped_ratio and
sysctl_min_slab_ratio on non-NUMA builds but fails to make those
tunables actually tunable in procfs.  Changes to sysctl.c are
needed.

 Reviewed-by: Christoph Lameter c...@linux.com

More careful reviewers, please.


Re: [PATCH 3/3] Provide control over unmapped pages

2010-11-30 Thread Andrew Morton
On Tue, 30 Nov 2010 15:46:31 +0530
Balbir Singh bal...@linux.vnet.ibm.com wrote:

 Provide control using zone_reclaim() and a boot parameter. The
 code reuses functionality from zone_reclaim() to isolate unmapped
 pages and reclaim them as a priority, ahead of other mapped pages.
 
 Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
 ---
  include/linux/swap.h |5 ++-
  mm/page_alloc.c  |7 +++--
  mm/vmscan.c  |   72 
 +-
  3 files changed, 79 insertions(+), 5 deletions(-)
 
 diff --git a/include/linux/swap.h b/include/linux/swap.h
 index eba53e7..78b0830 100644
 --- a/include/linux/swap.h
 +++ b/include/linux/swap.h
 @@ -252,11 +252,12 @@ extern int vm_swappiness;
  extern int remove_mapping(struct address_space *mapping, struct page *page);
  extern long vm_total_pages;
  
 -#ifdef CONFIG_NUMA
 -extern int zone_reclaim_mode;
  extern int sysctl_min_unmapped_ratio;
  extern int sysctl_min_slab_ratio;

This change will need to be moved into the first patch.

  extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
 +extern bool should_balance_unmapped_pages(struct zone *zone);
 +#ifdef CONFIG_NUMA
 +extern int zone_reclaim_mode;
  #else
  #define zone_reclaim_mode 0
  static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int 
 order)
 diff --git a/mm/page_alloc.c b/mm/page_alloc.c
 index 62b7280..4228da3 100644
 --- a/mm/page_alloc.c
 +++ b/mm/page_alloc.c
 @@ -1662,6 +1662,9 @@ zonelist_scan:
   unsigned long mark;
   int ret;
  
 + if (should_balance_unmapped_pages(zone))
 + wakeup_kswapd(zone, order);

gack, this is on the page allocator fastpath, isn't it?  So
99.999% of the world's machines end up doing a pointless call to a
pointless function which pointlessly tests a pointless global and
pointlessly returns?  All because of some whacky KSM thing?

The speed and space overhead of this code should be *zero* if
!CONFIG_UNMAPPED_PAGECACHE_CONTROL and should be minimal if
CONFIG_UNMAPPED_PAGECACHE_CONTROL=y.  The way to do the latter is to
inline the test of unmapped_page_control into callers and only if it is
true (and use unlikely(), please) do we call into the KSM gunk.
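
In other words, roughly this shape (a sketch, reusing the patch's own
names):

	static inline void maybe_balance_unmapped_pages(int priority,
			struct zone *zone, struct scan_control *sc)
	{
	#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
		if (unlikely(unmapped_page_control))
			balance_unmapped_pages(priority, zone, sc);
	#endif
	}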

 --- a/mm/vmscan.c
 +++ b/mm/vmscan.c
 @@ -145,6 +145,21 @@ static DECLARE_RWSEM(shrinker_rwsem);
  #define scanning_global_lru(sc)  (1)
  #endif
  
 +static unsigned long balance_unmapped_pages(int priority, struct zone *zone,
 + struct scan_control *sc);
 +static int unmapped_page_control __read_mostly;
 +
 +static int __init unmapped_page_control_parm(char *str)
 +{
 + unmapped_page_control = 1;
 + /*
 +  * XXX: Should we tweak swappiness here?
 +  */
 + return 1;
 +}
 +__setup("unmapped_page_control", unmapped_page_control_parm);

aw c'mon guys, everybody knows that when you add a kernel parameter you
document it in Documentation/kernel-parameters.txt.

  static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
 struct scan_control *sc)
  {
 @@ -2223,6 +2238,12 @@ loop_again:
   shrink_active_list(SWAP_CLUSTER_MAX, zone,
   sc, priority, 0);
  
 + /*
 +  * We do unmapped page balancing once here and once
 +  * below, so that we don't lose out
 +  */
 + balance_unmapped_pages(priority, zone, sc);
 +
   if (!zone_watermark_ok_safe(zone, order,
   high_wmark_pages(zone), 0, 0)) {
   end_zone = i;
 @@ -2258,6 +2279,11 @@ loop_again:
   continue;
  
   sc.nr_scanned = 0;
 + /*
 +  * Balance unmapped pages upfront, this should be
 +  * really cheap
 +  */
 + balance_unmapped_pages(priority, zone, sc);

More unjustifiable overhead on a commonly-executed codepath.

   /*
* Call soft limit reclaim before calling shrink_zone.
 @@ -2491,7 +2517,8 @@ void wakeup_kswapd(struct zone *zone, int order)
   pgdat-kswapd_max_order = order;
   if (!waitqueue_active(pgdat-kswapd_wait))
   return;
 - if (zone_watermark_ok_safe(zone, order, low_wmark_pages(zone), 0, 0))
 + if (zone_watermark_ok_safe(zone, order, low_wmark_pages(zone), 0, 0) &&
 + !should_balance_unmapped_pages(zone))
   return;
  
   trace_mm_vmscan_wakeup_kswapd(pgdat-node_id, zone_idx(zone), order);
 @@ -2740,6 +2767,49 @@ zone_reclaim_unmapped_pages(struct zone *zone, struct 
 scan_control *sc,
  }
  
  /*
 + * Routine to balance unmapped pages, inspired from the code 

Re: [PATCH 1/2] Add vzalloc shortcut

2010-10-18 Thread Andrew Morton
On Sat, 16 Oct 2010 12:33:31 +0800
Dave Young hidave.darks...@gmail.com wrote:

 Add vzalloc for convenience of the vmalloc-then-memset-zero case.
 
 Use __GFP_ZERO in vzalloc to zero fill the allocated memory.
 
 Signed-off-by: Dave Young hidave.darks...@gmail.com
 ---
  include/linux/vmalloc.h |1 +
  mm/vmalloc.c|   13 +
  2 files changed, 14 insertions(+)
 
 --- linux-2.6.orig/include/linux/vmalloc.h2010-08-22 15:31:38.0 
 +0800
 +++ linux-2.6/include/linux/vmalloc.h 2010-10-16 10:50:54.739996121 +0800
 @@ -53,6 +53,7 @@ static inline void vmalloc_init(void)
  #endif
  
  extern void *vmalloc(unsigned long size);
 +extern void *vzalloc(unsigned long size);
  extern void *vmalloc_user(unsigned long size);
  extern void *vmalloc_node(unsigned long size, int node);
  extern void *vmalloc_exec(unsigned long size);
 --- linux-2.6.orig/mm/vmalloc.c   2010-08-22 15:31:39.0 +0800
 +++ linux-2.6/mm/vmalloc.c2010-10-16 10:51:57.126665918 +0800
 @@ -1604,6 +1604,19 @@ void *vmalloc(unsigned long size)
  EXPORT_SYMBOL(vmalloc);
  
  /**
 + *   vzalloc  -  allocate virtually contiguous memory with zero filled

s/filled/fill/

 + *   @size:  allocation size
 + *   Allocate enough pages to cover @size from the page level
 + *   allocator and map them into contiguous kernel virtual space.
 + */
 +void *vzalloc(unsigned long size)
 +{
 + return __vmalloc_node(size, 1, GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO,
 + PAGE_KERNEL, -1, __builtin_return_address(0));
 +}
 +EXPORT_SYMBOL(vzalloc);

We'd need to add the same interface to nommu, please.
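Something like this sketch in mm/nommu.c, presumably (assuming nommu's
__vmalloc() still takes (size, gfp_mask, prot) there):

	void *vzalloc(unsigned long size)
	{
		return __vmalloc(size, GFP_KERNEL | __GFP_ZERO,
				 PAGE_KERNEL);
	}
	EXPORT_SYMBOL(vzalloc);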

Also, a slightly better implementation would be

static inline void *__vmalloc_node_flags(unsigned long size, gfp_t flags)
{
return __vmalloc_node(size, 1, flags, PAGE_KERNEL, -1,
__builtin_return_address(0));
}

void *vzalloc(unsigned long size)
{
return __vmalloc_node_flags(size,
GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO);
}

void *vmalloc(unsigned long size)
{
return __vmalloc_node_flags(size, GFP_KERNEL | __GFP_HIGHMEM);
}

just to avoid code duplication (and possible later errors derived from it).

Perhaps it should be always_inline, so the __builtin_return_address()
can't get broken.
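i.e. (sketch):

	/* __always_inline keeps __builtin_return_address(0) pointing
	 * at vmalloc()/vzalloc()'s caller rather than at this helper */
	static __always_inline void *__vmalloc_node_flags(unsigned long size,
							  gfp_t flags)
	{
		return __vmalloc_node(size, 1, flags, PAGE_KERNEL, -1,
				      __builtin_return_address(0));
	}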

Or just leave it the way you had it :)




Re: [PATCH 1/2] Add vzalloc shortcut

2010-10-18 Thread Andrew Morton
On Tue, 19 Oct 2010 09:55:17 +0800 Dave Young hidave.darks...@gmail.com wrote:

 On Tue, Oct 19, 2010 at 9:27 AM, Dave Young hidave.darks...@gmail.com wrote:
  On Tue, Oct 19, 2010 at 7:46 AM, Andrew Morton
 
  Also, a slightly better implementation would be
 
   static inline void *__vmalloc_node_flags(unsigned long size, gfp_t flags)
   {
  return __vmalloc_node(size, 1, flags, PAGE_KERNEL, -1,
   __builtin_return_address(0));
  }
 
 Is this better? might __vmalloc_node_flags be used by callers other than vmalloc?
 
 static inline void *__vmalloc_node_flags(unsigned long size, int node,
 gfp_t flags)

I have no strong opinions, really.  If we add more and more arguments
to __vmalloc_node_flags() it ends up looking like __vmalloc_node(), so we
may as well just call __vmalloc_node().  Do whatever feels good ;)



Re: [PATCH] cgroups: fix API thinko

2010-08-25 Thread Andrew Morton
On Fri, 06 Aug 2010 10:38:24 -0600
Alex Williamson alex.william...@redhat.com wrote:

 On Fri, 2010-08-06 at 09:34 -0700, Sridhar Samudrala wrote:
  On 8/5/2010 3:59 PM, Michael S. Tsirkin wrote:
   cgroup_attach_task_current_cg API that we have upstream is backwards: we
   really need an API to attach to the cgroups from another process A to
   the current one.
  
   In our case (vhost), a privileged user wants to attach its task to
   cgroups
   from a less privileged one; the API makes us run it in the other
   task's context, and this fails.
  
   So let's make the API generic and just pass in 'from' and 'to' tasks.
   Add an inline wrapper for cgroup_attach_task_current_cg to avoid
   breaking bisect.
  
   Signed-off-by: Michael S. Tsirkinm...@redhat.com
   ---
  
   Paul, Li, Sridhar, could you please review the following
   patch?
  
   I only compile-tested it due to travel, but looks
   straight-forward to me.
   Alex Williamson volunteered to test and report the results.
   Sending out now for review as I might be offline for a bit.
   Will only try to merge when done, obviously.
  
   If OK, I would like to merge this through -net tree,
   together with the patch fixing vhost-net.
   Let me know if that sounds ok.
  
   Thanks!
  
   This patch is on top of net-next, it is needed for fix
   vhost-net regression in net-next, where a non-priveledged
   process can't enable the device anymore:
  
   when qemu uses vhost, inside the ioctl call it
   creates a thread, and tries to add
   this thread to the groups of current, and it fails.
   But we control the thread, so to solve the problem,
    we really should tell it 'connect to our cgroups'.

So am I correct to assume that this change is now needed in 2.6.36, and
unneeded in 2.6.35?

Can it affect the userspace-kernel API in any manner?  If so, it
should be backported into earlier kernels to reduce the number of
incompatible kernels out there.

Paul, did you have any comments?

I didn't see any update in response to the minor review comments, so...


 include/linux/cgroup.h |1 +
 kernel/cgroup.c|6 +++---
 2 files changed, 4 insertions(+), 3 deletions(-)

diff -puN include/linux/cgroup.h~cgroups-fix-api-thinko-fix 
include/linux/cgroup.h
--- a/include/linux/cgroup.h~cgroups-fix-api-thinko-fix
+++ a/include/linux/cgroup.h
@@ -579,6 +579,7 @@ void cgroup_iter_end(struct cgroup *cgrp
 int cgroup_scan_tasks(struct cgroup_scanner *scan);
 int cgroup_attach_task(struct cgroup *, struct task_struct *);
 int cgroup_attach_task_all(struct task_struct *from, struct task_struct *);
+
 static inline int cgroup_attach_task_current_cg(struct task_struct *tsk)
 {
return cgroup_attach_task_all(current, tsk);
diff -puN kernel/cgroup.c~cgroups-fix-api-thinko-fix kernel/cgroup.c
--- a/kernel/cgroup.c~cgroups-fix-api-thinko-fix
+++ a/kernel/cgroup.c
@@ -1798,13 +1798,13 @@ out:
 int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
 {
struct cgroupfs_root *root;
-   struct cgroup *cur_cg;
int retval = 0;
 
cgroup_lock();
for_each_active_root(root) {
-   cur_cg = task_cgroup_from_root(from, root);
-   retval = cgroup_attach_task(cur_cg, tsk);
+   struct cgroup *from_cg = task_cgroup_from_root(from, root);
+
+   retval = cgroup_attach_task(from_cg, tsk);
if (retval)
break;
}
_



Re: + kvm-ia64-dereference-of-null-pointer-in-set_pal_result.patch added to -mm tree

2010-01-13 Thread Andrew Morton
On Wed, 13 Jan 2010 11:22:39 +0200 Avi Kivity a...@redhat.com wrote:

 On 01/13/2010 12:11 AM, a...@linux-foundation.org wrote:
  Subject: kvm/ia64: dereference of NULL pointer in set_pal_result()
  From: Roel Kluin roel.kl...@gmail.com
 
  Do not dereference a NULL pointer
 
  diff -puN 
  arch/ia64/kvm/kvm_fw.c~kvm-ia64-dereference-of-null-pointer-in-set_pal_result
   arch/ia64/kvm/kvm_fw.c
  --- 
  a/arch/ia64/kvm/kvm_fw.c~kvm-ia64-dereference-of-null-pointer-in-set_pal_result
  +++ a/arch/ia64/kvm/kvm_fw.c
  @@ -75,9 +75,11 @@ static void set_pal_result(struct kvm_vc
  struct exit_ctl_data *p;
 
  p = kvm_get_exit_data(vcpu);
  -   if (p && p->exit_reason == EXIT_REASON_PAL_CALL) {
  +   if (!p)
  +   return;
  +   if (p->exit_reason == EXIT_REASON_PAL_CALL) {
  p->u.pal_data.ret = result;
  -   return ;
  +   return;
  }
  INIT_PAL_STATUS_UNIMPLEMENTED(p->u.pal_data.ret);
}
 
 
 
 kvm_get_exit_data() cannot return a NULL pointer.

In that case set_pal_result() doesn't need to test for that.

Roel looks for code along the lines of

if (p)
...

*p;

 Where did this come from?

I got it off linux-kernel.


Re: vga_arb warning [was: mmotm 2009-11-01-10-01 uploaded]

2009-11-03 Thread Andrew Morton

Please cc me on mmotm bug reports!

On Sun, 01 Nov 2009 22:47:05 +0100 Jiri Slaby jirisl...@gmail.com wrote:

 On 11/01/2009 07:07 PM, a...@linux-foundation.org wrote:
  The mm-of-the-moment snapshot 2009-11-01-10-01 has been uploaded to
 
 Hi, I got the following warning while booting an image in qemu-kvm:
 
 WARNING: at fs/attr.c:158 notify_change+0x2da/0x310()
 Hardware name:
 Modules linked in:
 Pid: 1, comm: swapper Not tainted 2.6.32-rc5-mm1_64 #862
 Call Trace:
 [<ffffffff81038008>] warn_slowpath_common+0x78/0xb0
 [<ffffffff8103804f>] warn_slowpath_null+0xf/0x20
 [<ffffffff810d32ba>] notify_change+0x2da/0x310
 [<ffffffff810c5b88>] ? fsnotify_create+0x48/0x60
 [<ffffffff810c6d2b>] ? vfs_mknod+0xbb/0xe0
 [<ffffffff812487b6>] devtmpfs_create_node+0x1e6/0x270
 [<ffffffff811170d0>] ? sysfs_addrm_finish+0x20/0x280
 [<ffffffff811175d6>] ? __sysfs_add_one+0x26/0xf0
 [<ffffffff81117b6c>] ? sysfs_do_create_link+0xcc/0x160
 [<ffffffff81241cf0>] device_add+0x1e0/0x5b0
 [<ffffffff8124adb1>] ? pm_runtime_init+0xa1/0xb0
 [<ffffffff81248f05>] ? device_pm_init+0x65/0x70
 [<ffffffff812420d9>] device_register+0x19/0x20
 [<ffffffff81242290>] device_create_vargs+0xf0/0x120
 [<ffffffff812422ec>] device_create+0x2c/0x30
 [<ffffffff810c0516>] ? __register_chrdev+0x86/0xf0
 [<ffffffff81245599>] ? __class_create+0x69/0xa0
 [<ffffffff814326e9>] ? mutex_lock+0x19/0x50
 [<ffffffff811d4e23>] misc_register+0x93/0x170
 [<ffffffff818994a0>] ? vga_arb_device_init+0x0/0x77
 [<ffffffff818994b3>] vga_arb_device_init+0x13/0x77
 [<ffffffff818994a0>] ? vga_arb_device_init+0x0/0x77
 [<ffffffff810001e7>] do_one_initcall+0x37/0x190
 [<ffffffff8187d6ce>] kernel_init+0x172/0x1c8
 [<ffffffff81003c7a>] child_rip+0xa/0x20
 [<ffffffff8187d55c>] ? kernel_init+0x0/0x1c8
 [<ffffffff81003c70>] ? child_rip+0x0/0x20

There's a -mm-only debug patch:

http://userweb.kernel.org/~akpm/mmotm/broken-out/notify_change-callers-must-hold-i_mutex.patch

--- a/fs/attr.c~notify_change-callers-must-hold-i_mutex
+++ a/fs/attr.c
@@ -155,6 +155,8 @@ int notify_change(struct dentry * dentry
return -EPERM;
}
 
+   WARN_ON_ONCE(!mutex_is_locked(&inode->i_mutex));
+
now = current_fs_time(inode->i_sb);
 
attr->ia_ctime = now;
_

I forget why it was added.


It looks like it's blaming the open-coded notify_change() call in
devtmpfs_create_node().
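
If so, the fix is presumably just the standard calling convention
around notify_change() (a sketch; newattrs is a local struct iattr):

	mutex_lock(&dentry->d_inode->i_mutex);
	err = notify_change(dentry, &newattrs);
	mutex_unlock(&dentry->d_inode->i_mutex);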



Re: [was: mmotm 2009-10-09-01-07 uploaded]

2009-10-09 Thread Andrew Morton
On Fri, 9 Oct 2009 08:30:28 -0700
Randy Dunlap rdun...@xenotime.net wrote:

 From: Randy Dunlap randy.dun...@oracle.com
 
 When CONFIG_CPU_FREQ is disabled, cpufreq_get() needs a stub.
 Used by kvm (although it looks like a bit of the kvm code could
 be omitted when CONFIG_CPU_FREQ is disabled).
 
 arch/x86/built-in.o: In function `kvm_arch_init':
 (.text+0x10de7): undefined reference to `cpufreq_get'
  
 Signed-off-by: Randy Dunlap randy.dun...@oracle.com
 Tested-by: Eric Paris epa...@redhat.com
 ---
  include/linux/cpufreq.h |7 +++
  1 file changed, 7 insertions(+)
 
 --- linux-next-20091006.orig/include/linux/cpufreq.h
 +++ linux-next-20091006/include/linux/cpufreq.h
 @@ -291,8 +291,15 @@ struct global_attr {
  int cpufreq_get_policy(struct cpufreq_policy *policy, unsigned int cpu);
  int cpufreq_update_policy(unsigned int cpu);
  
 +#ifdef CONFIG_CPU_FREQ
  /* query the current CPU frequency (in kHz). If zero, cpufreq couldn't 
 detect it */
  unsigned int cpufreq_get(unsigned int cpu);
 +#else
 +static inline unsigned int cpufreq_get(unsigned int cpu)
 +{
 + return 0;
 +}
 +#endif

Thanks.  I'll merge this into mainline in the next batch I think.  It's
only needed by the KVM development tree but it's the correct thing to
do anyway and having it in mainline will simplify life for everyone.



Re: [PATCHv2 1/2] mm: export use_mm/unuse_mm to modules

2009-09-16 Thread Andrew Morton
On Thu, 17 Sep 2009 08:38:18 +0300 Michael S. Tsirkin m...@redhat.com wrote:

 Hi Andrew,
 On Tue, Aug 11, 2009 at 03:10:10PM -0700, Andrew Morton wrote:
  On Wed, 12 Aug 2009 00:27:52 +0300
  Michael S. Tsirkin m...@redhat.com wrote:
  
   vhost net module wants to do copy to/from user from a kernel thread,
   which needs use_mm (like what fs/aio has).  Move that into mm/ and
   export to modules.
  
  OK by me.  Please include this change in the virtio patchset.  Which I
  shall cheerfully not be looking at :)
 
 The virtio patches are somewhat delayed as we are ironing out the
 kernel/user interface with Rusty. Can the patch moving use_mm to mm/ be
 applied without exporting to modules for now? This will make it easier
 for virtio which will only have to patch in the EXPORT line.

That was 10,000 patches ago.

 I also have a small patch optimizing atomic usage in use_mm (which I did for
 virtio) and it's easier to apply it if the code is in the new place.
 
 If ok, pls let me know and I'll post the patch without the EXPORT line.

Please just send them all out.


Re: [PATCHv2 1/2] mm: export use_mm/unuse_mm to modules

2009-08-11 Thread Andrew Morton
On Wed, 12 Aug 2009 00:27:52 +0300
Michael S. Tsirkin m...@redhat.com wrote:

 vhost net module wants to do copy to/from user from a kernel thread,
 which needs use_mm (like what fs/aio has).  Move that into mm/ and
 export to modules.

OK by me.  Please include this change in the virtio patchset.  Which I
shall cheerfully not be looking at :)



Re: [patch] eventfd - revised interface and cleanups (2nd rev)

2009-06-24 Thread Andrew Morton
On Wed, 24 Jun 2009 15:47:47 -0700 (PDT)
Davide Libenzi davi...@xmailserver.org wrote:

 On Tue, 23 Jun 2009, Andrew Morton wrote:
 
  Maybe there is such a reason, and it hasn't yet been beaten into my
  skull.  But I still think it should be done in a separate patch.  One
  which comes with a full description of the reasons for the change, btw.
 
 Since your skull seems pretty hard to beat, would you like me to split it 
 for ya? :)

Split what?  My skull?

umm, yes please, I believe the patches should be split.  And I'm still
not seeing the justification for forcing CONFIG_EVENTFD onto all
CONFIG_AIO users!



Re: [patch] eventfd - revised interface and cleanups (2nd rev)

2009-06-24 Thread Andrew Morton
On Wed, 24 Jun 2009 16:52:06 -0700 (PDT)
Davide Libenzi davi...@xmailserver.org wrote:

  umm, yes please, I believe the patches should be split.  And I'm still
  not seeing the justification for forcing CONFIG_EVENTFD onto all
  CONFIG_AIO users!
 
 Eventfd notifications became part of the AIO API (it's not even delivered 
 through a new syscall, from the AIO side - same existing aiocb struct and 
 io_submit syscall) once we merged it,

That was a regression for existing embedded AIO users ;)

 so IMHO (AIO && !EVENTFD) would be
 similar to splitting AIO into AIO_READ and AIO_WRITE and having (AIO && !AIO_WRITE).
 Considering that the kernel config, once you unleash the CONFIG_EMBEDDED 
 pandora box, allows you to select (AIO && !EVENTFD) w/out even a warning
 about possible userspace breakages, this makes it rather a confusing 
 configuration if you ask me.

Sure.  But we do assume that once someone sets CONFIG_EMBEDDED, they
know what they're doing.  We prefer to give them maximum flexibility
and foot-shooting ability.  As long as the maintenance cost of doing so
in each case is reasonable, of course.

 It's not a biggie from the kernel side, just a few ugly error wrappers
 around functions. For me AIO (or whatever userspace visible kernel
 subsystem) should select all the components that are part of the userspace 
 API, but my argument ends here.

Sure, it's not a biggie either way.  Doubtful if many tiny systems are
using AIO anyway.  Heck, lots of them are running 2.4.18...

But from the general this-is-the-way-we-do-things POV, it's preferred
that the two features be separable under CONFIG_EMBEDDED if poss.




Re: [patch] eventfd - revised interface and cleanups (2nd rev)

2009-06-23 Thread Andrew Morton
On Tue, 23 Jun 2009 12:25:36 -0700 (PDT)
Davide Libenzi davi...@xmailserver.org wrote:

 The following patch changes the eventfd interface to de-couple the eventfd 
 memory context, from the file pointer instance.
 Without such change, there is no clean way to racely free handle the 
 POLLHUP event sent when the last instance of the file* goes away.
 Also, now the internal eventfd APIs are using the eventfd context instead 
 of the file*.
 Another cleanup this patch does, is making AIO select EVENTFD, instead of 
 adding a bunch of empty function stubs inside eventfd.h in order to 
 handle the (AIO && !EVENTFD) case.
 

If we really want to squeeze this into 2.6.31 then it would be helpful
to have justification in the changelog, please.  I see that some KVM
feature needs it, but what and why and why now, etc?

The patch conflicts with similar changes in the KVM tree, but that'll
just ruin sfr's life for a day.


Your patch has:


 ...
  static int eventfd_release(struct inode *inode, struct file *file)
  {
 - kfree(file->private_data);
 + struct eventfd_ctx *ctx = file->private_data;
 +
 + wake_up_poll(&ctx->wqh, POLLHUP);
 + eventfd_ctx_put(ctx);
   return 0;
  }

but the patch in linux-next has

static int eventfd_release(struct inode *inode, struct file *file)
{
struct eventfd_ctx *ctx = file->private_data;

/*
 * No need to hold the lock here, since we are on the file cleanup
 * path and the ones still attached to the wait queue will be
 * serialized by wake_up_locked_poll().
 */
wake_up_locked_poll(&ctx->wqh, POLLHUP);
kfree(ctx);
return 0;
}

which looks quite different (and has a nice comment!)

What's going on here?
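
(For context, the point of the decoupling is that in-kernel users hold a
reference on the eventfd context rather than on the file, so the final
fput() can signal POLLHUP without the waiters' memory vanishing.  A
minimal sketch of the consumer side, using the API from the patch:)

        struct eventfd_ctx *ctx;

        ctx = eventfd_ctx_fdget(fd);    /* pins the context, not the file */
        if (IS_ERR(ctx))
                return PTR_ERR(ctx);

        eventfd_signal(ctx, 1);         /* kick anyone polling the eventfd */

        eventfd_ctx_put(ctx);           /* safe even after userspace close()s fd */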


Re: [patch] eventfd - revised interface and cleanups (2nd rev)

2009-06-23 Thread Andrew Morton
On Tue, 23 Jun 2009 12:25:36 -0700 (PDT)
Davide Libenzi davi...@xmailserver.org wrote:

 Another cleanup this patch does, is making AIO select EVENTFD, instead of 
 adding a bunch of empty function stubs inside eventfd.h in order to 
  handle the (AIO && !EVENTFD) case.

Given that we're trying to squeak this patch into 2.6.31, it's quite
unfortunate to include unrelated changes.  Especially ones which
involve the always-problematic select.  Although this one looks less
than usually deadly due to the simple dependencies of AIO and eventfd.

However...

Is this a good way of fixing the (AIO && !EVENTFD) case?  Surely if the
developer selected this combination, he doesn't want to carry the
overhead of including eventfd.  IOW, if AIO can work acceptably without
eventfd being compiled into the kernel then adding the stubs is a
superior solution.




Re: [patch] eventfd - revised interface and cleanups (2nd rev)

2009-06-23 Thread Andrew Morton
On Tue, 23 Jun 2009 12:25:36 -0700 (PDT)
Davide Libenzi davi...@xmailserver.org wrote:

 The following patch changes the eventfd interface to de-couple the eventfd 
 memory context, from the file pointer instance.
 Without such change, there is no clean way to racely free handle the 
 POLLHUP event sent when the last instance of the file* goes away.
 Also, now the internal eventfd APIs are using the eventfd context instead 
 of the file*.
 Another cleanup this patch does, is making AIO select EVENTFD, instead of 
 adding a bunch of empty function stubs inside eventfd.h in order to 
  handle the (AIO && !EVENTFD) case.
 
 
 ...

 +/**
 + * eventfd_ctx_get - Acquires a reference to the internal eventfd context.
 + * @ctx: [in] Pointer to the eventfd context.
 + *
 + * Returns: In case of success, returns a pointer to the eventfd context,
 + *  otherwise a proper error code.

The description of the return value

 + */
 +struct eventfd_ctx *eventfd_ctx_get(struct eventfd_ctx *ctx)
 +{
 + kref_get(&ctx->kref);
 + return ctx;
 +}
 +EXPORT_SYMBOL_GPL(eventfd_ctx_get);

doesn't match the code.


Also...

 + * Returns: A pointer to the eventfd file structure in case of success, or a
 + *  proper error pointer in case of failure.


 + * Returns: In case of success, it returns a pointer to the internal eventfd
 + *  context, otherwise a proper error code.
 + */

I'm unsure what the word proper means in this context.

The term proper error pointer is understandable enough - something
you run IS_ERR() against.  error pointer would suffice.

But the term proper error code is getting a bit remote from reality.


Unfortunately the kernel doesn't have a simple and agreed-to term for
an ERR_PTR() thingy.  Perhaps we should invent one.  err_ptr?
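
(The idiom in question, at both ends -- producer() is a made-up name:)

        /* producer side: encode an errno in the pointer itself */
        if (!ctx)
                return ERR_PTR(-ENOMEM);

        /* consumer side: decode it back out */
        p = producer();
        if (IS_ERR(p))
                return PTR_ERR(p);      /* -ENOMEM again */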



Re: [patch] eventfd - revised interface and cleanups (2nd rev)

2009-06-23 Thread Andrew Morton
On Tue, 23 Jun 2009 13:59:07 -0700 (PDT)
Davide Libenzi davi...@xmailserver.org wrote:

 On Tue, 23 Jun 2009, Andrew Morton wrote:
 
  On Tue, 23 Jun 2009 12:25:36 -0700 (PDT)
  Davide Libenzi davi...@xmailserver.org wrote:
  
   Another cleanup this patch does, is making AIO select EVENTFD, instead of 
   adding a bunch of empty function stubs inside eventfd.h in order to 
    handle the (AIO && !EVENTFD) case.
  
  Given that we're trying to squeak this patch into 2.6.31, it's quite
  unfortunate to include unrelated changes.  Especially ones which
  involve the always-problematic select.  Although this one looks less
  than usually deadly due to the simple dependencies of AIO and eventfd.
  
  However...
  
   Is this a good way of fixing the (AIO && !EVENTFD) case?  Surely if the
  developer selected this combination, he doesn't want to carry the
  overhead of including eventfd.  IOW, if AIO can work acceptably without
  eventfd being compiled into the kernel then adding the stubs is a
  superior solution.
 
 IMO when someone says AIO included in the kernel, should get the whole 
 AIO functionality and not part of it.
  People already started using KAIO+eventfd, and a (AIO && !EVENTFD) config 
 will make their code fail.

That isn't for us to decide. Entire syscalls can be disabled in config.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch] eventfd - revised interface and cleanups (2nd rev)

2009-06-23 Thread Andrew Morton
On Tue, 23 Jun 2009 14:03:22 -0700 (PDT)
Davide Libenzi davi...@xmailserver.org wrote:

 On Tue, 23 Jun 2009, Andrew Morton wrote:
 
  On Tue, 23 Jun 2009 12:25:36 -0700 (PDT)
  Davide Libenzi davi...@xmailserver.org wrote:
  
   The following patch changes the eventfd interface to de-couple the 
   eventfd 
   memory context, from the file pointer instance.
   Without such change, there is no clean way to racely free handle the 
   POLLHUP event sent when the last instance of the file* goes away.
   Also, now the internal eventfd APIs are using the eventfd context instead 
   of the file*.
   Another cleanup this patch does, is making AIO select EVENTFD, instead of 
   adding a bunch of empty function stubs inside eventfd.h in order to 
    handle the (AIO && !EVENTFD) case.
   
   ...
  
   +/**
   + * eventfd_ctx_get - Acquires a reference to the internal eventfd 
   context.
   + * @ctx: [in] Pointer to the eventfd context.
   + *
   + * Returns: In case of success, returns a pointer to the eventfd context,
   + *  otherwise a proper error code.
  
  The description of the return value
 
 Should functions be describing all the returned error codes, ala man pages?
 

I think so.

 
   + */
   +struct eventfd_ctx *eventfd_ctx_get(struct eventfd_ctx *ctx)
   +{
    + kref_get(&ctx->kref);
   + return ctx;
   +}
   +EXPORT_SYMBOL_GPL(eventfd_ctx_get);
  
  doesn't match the code.

You appear to have not seen the above sentence.

  Also...
  
   + * Returns: A pointer to the eventfd file structure in case of success, 
   or a
   + *  proper error pointer in case of failure.
  
  
   + * Returns: In case of success, it returns a pointer to the internal 
   eventfd
   + *  context, otherwise a proper error code.
   + */
  
  I'm unsure what the word proper means in this context.
  
  The term proper error pointer is understandable enough - something
  you run IS_ERR() against.  error pointer would suffice.
  
  But the term proper error code is getting a bit remote from reality.
  
  Unfortunately the kernel doesn't have a simple and agreed-to term for
  an ERR_PTR() thingy.  Perhaps we should invent one.  err_ptr?
 
 OK, but you tricked me once again :)
 You posted your comments/changes while you merged the old version in -mm 
 already.

yeah, I never trust people.  You might lose the email or jump on a
plane and disappear for three weeks, then it all gets forgotten about
and lost.

If the code doesn't have any apparent showstoppers I'll often merge it
with a note that it isn't finalised.


Re: [patch] eventfd - revised interface and cleanups (2nd rev)

2009-06-23 Thread Andrew Morton
On Tue, 23 Jun 2009 14:25:05 -0700 (PDT)
Davide Libenzi davi...@xmailserver.org wrote:

 On Tue, 23 Jun 2009, Andrew Morton wrote:
 
  That isn't for us to decide. Entire syscalls can be disabled in config.
 
 That is not a well defined separate syscall though. It's a member/feature 
 of the aiocb.

I don't know what this means, really.

AIO eventfd support is a relatively recently added enhancement to AIO,
is it not?  Applications which continue to use the pre-May07 AIO
features will continue to work as before (they darn well better).  So
for such applications, AIO=y/EVENTFD=n kernels are usable and useful,
and eliminating this option is a loss?

Either way, I believe that this change should be unbundled from the
unrelated KVM one so we can handle it separately.


Re: [patch] eventfd - revised interface and cleanups (2nd rev)

2009-06-23 Thread Andrew Morton
On Tue, 23 Jun 2009 14:34:50 -0700 (PDT)
Davide Libenzi davi...@xmailserver.org wrote:

 On Tue, 23 Jun 2009, Andrew Morton wrote:
 
   Should functions be describing all the returned error codes, ala man 
   pages?
   
  
  I think so.
 
 This becomes pretty painful when the function calls other functions, for 
 which just relays the error code.
 Should we be just documenting the error codes introduced by the function 
 code, and say that the function returns errors A, B, C plus all the ones 
 returned by the called functions X, Y, Z?
 If not, it becomes hell in maintaining the comments...

Well.  Don't worry about making rules.  Taste and common sense apply.  Will
it be useful to readers if I explicitly document the return value.  If
yes then document away.  If no then don't.
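
(A sketch of what such a blurb can look like in kerneldoc form -- the
wording here is illustrative, not lifted from Davide's patch:)

        /**
         * eventfd_ctx_fdget - acquire a context reference from a file descriptor
         * @fd: [in] eventfd file descriptor
         *
         * Returns: a pointer to the eventfd context on success, or an ERR_PTR()
         * carrying -EBADF if @fd does not refer to an eventfd file.
         */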




Re: [patch] eventfd - revised interface and cleanups (2nd rev)

2009-06-23 Thread Andrew Morton
On Tue, 23 Jun 2009 14:48:51 -0700 (PDT)
Davide Libenzi davi...@xmailserver.org wrote:

   This becomes pretty painful when the function calls other functions, for 
   which just relays the error code.
   Should we be just documenting the error codes introduced by the function 
   code, and say that the function returns errors A, B, C plus all the ones 
   returned by the called functions X, Y, Z?
   If not, it becomes hell in maintaining the comments...
  
  Well.  Don't worry about making rules.  Taste and common sense apply.  Will
  it be useful to readers if I explicitly document the return value.  If
  yes then document away.  If no then don't.
 
 Are you OK with the format in the patch below?

Looks great to me.

Obviously the cost of maintaining this level of detail is fairly high,
and the chances of bitrot are also high.  So I wouldn't be expecting
people to document things at this level in general.  But if you're
prepared to maintain this then good for you.





Re: [patch] eventfd - revised interface and cleanups (2nd rev)

2009-06-23 Thread Andrew Morton
On Tue, 23 Jun 2009 15:49:53 -0700 (PDT)
Davide Libenzi davi...@xmailserver.org wrote:

 On Tue, 23 Jun 2009, Andrew Morton wrote:
 
  On Tue, 23 Jun 2009 14:25:05 -0700 (PDT)
  Davide Libenzi davi...@xmailserver.org wrote:
  
   On Tue, 23 Jun 2009, Andrew Morton wrote:
   
That isn't for us to decide. Entire syscalls can be disabled in config.
   
   That is not a well defined separate syscall though. It's a member/feature 
   of the aiocb.
  
  I don't know what this means, really.
 
 This is the struct iocb:
 
 struct iocb {
   ...
   u_int32_t   aio_resfd;
 };
 
 And the only interface to access KAIO is io_submit().
 IMO the end user perceives the KAIO functionality as the full deployment 
 of the iocb struct and the io_submit() accessory.
  Can code not using eventfd work in (AIO && !EVENTFD) mode? Sure.
 It is a kinda confusing configuration from the end user POV IMO.

What's confusing about it?

Most programmers who are using AIO probably don't even know that the
eventfd hookup is available.  And even if they did, they might not want
to use it, to remain compatible with older kernels.

I'm still not seeing any compelling reason for unconditionally adding
kernel text which some users don't need.

Maybe there is such a reason, and it hasn't yet been beaten into my
skull.  But I still think it should be done in a separate patch.  One
which comes with a full description of the reasons for the change, btw.


Re: [PATCH 5/5] add ksm kernel shared memory driver.

2009-05-13 Thread Andrew Morton
On Mon, 20 Apr 2009 04:36:06 +0300
Izik Eidus iei...@redhat.com wrote:

 Ksm is driver that allow merging identical pages between one or more
 applications in way unvisible to the application that use it.
 Pages that are merged are marked as readonly and are COWed when any
 application try to change them.
 
 Ksm is used for cases where using fork() is not suitable,
 one of this cases is where the pages of the application keep changing
 dynamicly and the application cannot know in advance what pages are
 going to be identical.
 
 Ksm works by walking over the memory pages of the applications it
 scan in order to find identical pages.
 It uses a two sorted data strctures called stable and unstable trees
 to find in effective way the identical pages.
 
 When ksm finds two identical pages, it marks them as readonly and merges
 them into single one page,
 after the pages are marked as readonly and merged into one page, linux
 will treat this pages as normal copy_on_write pages and will fork them
 when write access will happen to them.
 
 Ksm scan just memory areas that were registred to be scanned by it.
 
 ...
 + copy_user_highpage(kpage, page1, addr1, vma);
 ...

Breaks ppc64 allmodconfig because that architecture doesn't export its
copy_user_page() to modules.

Architectures are inconsistent about this.  x86 _does_ export it,
because it bounces it to the exported copy_page().

So can I ask that you sit down and work out upon which architectures it
really makes sense to offer KSM?  Disallow the others in Kconfig and
arrange for copy_user_highpage() to be available on the allowed architectures?

Thanks.


Re: [PATCH 5/5] add ksm kernel shared memory driver.

2009-04-30 Thread Andrew Morton
On Thu, 30 Apr 2009 20:46:24 +0300 Izik Eidus iei...@redhat.com wrote:

 Following patchs change the api to be more robust, the result change of
 the api came after conversation i had with Andrea and Chris about how
 to make the api as stable as we can,
 
 In addition i hope this patchset fix the cross compilation problems, i
 compiled it on itanium (doesnt have _PAGE_RW) and it seems to work.

eek, please don't send multiple patches per email - it's surprisingly
disruptive to everything.

What is the relationship between these patches and the ones I merged
the other day?



Re: [KVM PATCH v3 0/2] irqfd

2009-04-29 Thread Andrew Morton
On Mon, 27 Apr 2009 14:33:24 -0400 Gregory Haskins ghask...@novell.com wrote:

 (Applies to kvm.git 41b76d8d0487c26d6d4d3fe53c1ff59b3236f096)
 
 This series implements a mechanism called irqfd.  It lets you create
 an eventfd based file-desriptor to inject interrupts to a kvm guest.  We
 associate one gsi per fd for fine-grained routing.

It'd be nice if the KVM weenies amongst us were to be told why one
would want to inject interrupts into a KVM guest.  Monosyllabic words
would be preferred ;)

 We do not have a user of this interface in this series, though note
 future version of virtual-bus (v4 and above) will be based on this.

So I assume that this patchset will be merged if/when a user of it is
merged?

 The first patch will require mainline buy-in, particularly from Davide
 (cc'd).  The last patch is kvm specific.

Three EXPORT_SYMBOL_GPL()s.  Once the shouting has subsided I'd suggest
that this be merged via the KVM tree.




Re: [PATCH 5/5] add ksm kernel shared memory driver.

2009-04-27 Thread Andrew Morton
On Mon, 20 Apr 2009 04:36:06 +0300
Izik Eidus iei...@redhat.com wrote:

 Ksm is driver that allow merging identical pages between one or more
 applications in way unvisible to the application that use it.
 Pages that are merged are marked as readonly and are COWed when any
 application try to change them.

Breaks sparc64 and probably lots of other architectures:

mm/ksm.c: In function `try_to_merge_two_pages_alloc':
mm/ksm.c:697: error: `_PAGE_RW' undeclared (first use in this function)

there should be an official arch-independent way of manipulating
vma->vm_page_prot, but I'm not immediately finding it.

An alternative (and quite inferior) fix would be to disable ksm on
architectures which don't implement _PAGE_RW.  That's most of them.



Re: [PATCH 4/4] add ksm kernel shared memory driver.

2009-04-15 Thread Andrew Morton
On Thu, 16 Apr 2009 01:37:25 +0300
Izik Eidus iei...@redhat.com wrote:

 Andrew Morton wrote:
  On Thu,  9 Apr 2009 06:58:41 +0300
  Izik Eidus iei...@redhat.com wrote:
 

 
  Confused.  In the covering email you indicated that v2 of the patchset
  had abandoned ioctls and had moved the interface to sysfs.

 We have abandoned the ioctls that control the ksm behavior (how much cpu 
 it take, how much kernel pages it may allocate and so on...)
 But we still use ioctls to register the application memory to be used 
 with ksm.

hm. ioctls make kernel people weep and gnash teeth.

An appropriate interface would be to add new syscalls.  But as ksm is
an optional thing and can even be modprobed, that doesn't work.  And
having a driver in mm/ which can be modprobed is kinda neat.

I can't immediately think of a nicer interface.  You could always poke
numbers into some pseudo-file but to me that seems as ugly, or uglier
than an ioctl (others seem to disagree).

Ho hum.  Please design the ioctl interface so that it doesn't need any
compat handling if poss.
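
(The usual recipe for a compat-clean ioctl ABI: fixed-width types only, no
bare pointers or longs, explicit padding so 32-bit and 64-bit userspace lay
the structure out identically -- which is what the ksm_memory_region later
in this series does:)

        struct ksm_memory_region {
                __u32 npages;   /* fixed width: same size on 32- and 64-bit */
                __u32 pad;      /* explicit pad keeps 'addr' 8-byte aligned */
                __u64 addr;     /* addresses travel as u64, never as void * */
        };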



Re: [PATCH 0/4] ksm - dynamic page sharing driver for linux v3

2009-04-14 Thread Andrew Morton
On Thu,  9 Apr 2009 06:58:37 +0300
Izik Eidus iei...@redhat.com wrote:

 KSM is a linux driver that allows dynamicly sharing identical memory
 pages between one or more processes.

Generally looks OK to me.  But that doesn't mean much.  We should rub
bottles with words like hugh and nick on them to be sure.



 ...

  include/linux/ksm.h  |   48 ++
  include/linux/miscdevice.h   |1 +
  include/linux/mm.h   |5 +
  include/linux/mmu_notifier.h |   34 +
  include/linux/rmap.h |   11 +
  mm/Kconfig   |6 +
  mm/Makefile  |1 +
  mm/ksm.c | 1674 
 ++
  mm/memory.c  |   90 +++-
  mm/mmu_notifier.c|   20 +
  mm/rmap.c|  139 

And it's pretty unobtrusive for what it is.  I expect we can get this
into 2.6.31 unless there are some pratfalls which I missed.



Re: [PATCH 3/4] add replace_page(): change the page pte is pointing to.

2009-04-14 Thread Andrew Morton
On Thu,  9 Apr 2009 06:58:40 +0300
Izik Eidus iei...@redhat.com wrote:

 replace_page() allow changing the mapping of pte from one physical page
 into diffrent physical page.

At a high level, this is very similar to what page migration does.  Yet
this implementation shares nothing with the page migration code.

Can this situation be improved?

 this function is working by removing oldpage from the rmap and calling
 put_page on it, and by setting the pte to point into newpage and by
 inserting it to the rmap using page_add_file_rmap().
 
 note: newpage must be non anonymous page, the reason for this is:
 replace_page() is built to allow mapping one page into more than one
 virtual addresses, the mapping of this page can happen in diffrent
 offsets inside each vma, and therefore we cannot trust the page-index
 anymore.
 
 the side effect of this issue is that newpage cannot be anything but
 kernel allocated page that is not swappable.
 



Re: [PATCH 4/4] add ksm kernel shared memory driver.

2009-04-14 Thread Andrew Morton
On Thu,  9 Apr 2009 06:58:41 +0300
Izik Eidus iei...@redhat.com wrote:

 Ksm is driver that allow merging identical pages between one or more
 applications in way unvisible to the application that use it.
 Pages that are merged are marked as readonly and are COWed when any
 application try to change them.
 
 Ksm is used for cases where using fork() is not suitable,
 one of this cases is where the pages of the application keep changing
 dynamicly and the application cannot know in advance what pages are
 going to be identical.
 
 Ksm works by walking over the memory pages of the applications it
 scan in order to find identical pages.
 It uses a two sorted data strctures called stable and unstable trees
 to find in effective way the identical pages.
 
 When ksm finds two identical pages, it marks them as readonly and merges
 them into single one page,
 after the pages are marked as readonly and merged into one page, linux
 will treat this pages as normal copy_on_write pages and will fork them
 when write access will happen to them.
 
 Ksm scan just memory areas that were registred to be scanned by it.
 
 Ksm api:
 
 KSM_GET_API_VERSION:
 Give the userspace the api version of the module.
 
 KSM_CREATE_SHARED_MEMORY_AREA:
 Create shared memory reagion fd, that latter allow the user to register
 the memory region to scan by using:
 KSM_REGISTER_MEMORY_REGION and KSM_REMOVE_MEMORY_REGION
 
 KSM_REGISTER_MEMORY_REGION:
 Register userspace virtual address range to be scanned by ksm.
 This ioctl is using the ksm_memory_region structure:
 ksm_memory_region:
 __u32 npages;
  number of pages to share inside this memory region.
 __u32 pad;
 __u64 addr:
 the begining of the virtual address of this region.
 __u64 reserved_bits;
 reserved bits for future usage.
 
 KSM_REMOVE_MEMORY_REGION:
 Remove memory region from ksm.
 

 ...

 +/* ioctls for /dev/ksm */

Confused.  In the covering email you indicated that v2 of the patchset
had abandoned ioctls and had moved the interface to sysfs.

It would be good to completely (and briefly) describe KSM's proposed
userspace interfaces in the changelog or somewhere.  I'm a bit confused.
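
(To make the proposed interface concrete, roughly the userspace sequence
the changelog above describes -- error handling omitted, ioctl names as
defined in the patch:)

        #include <fcntl.h>
        #include <sys/ioctl.h>
        #include <linux/ksm.h>          /* the header added by this patch */

        int ksm_fd = open("/dev/ksm", O_RDWR);
        int api    = ioctl(ksm_fd, KSM_GET_API_VERSION, 0);
        int sma_fd = ioctl(ksm_fd, KSM_CREATE_SHARED_MEMORY_AREA, 0);

        struct ksm_memory_region region = {
                .npages = npages,                       /* pages in this region */
                .addr   = (__u64)(unsigned long)buf,    /* start of the range */
        };
        ioctl(sma_fd, KSM_REGISTER_MEMORY_REGION, &region);
        /* ... later ... */
        ioctl(sma_fd, KSM_REMOVE_MEMORY_REGION);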


 ...

 +/*
 + * slots_lock protect against removing and adding memory regions while a 
 scanner
 + * is in the middle of scanning.
 + */

protects

 +static DECLARE_RWSEM(slots_lock);
 +
 +/* The stable and unstable trees heads. */
 +struct rb_root root_stable_tree = RB_ROOT;
 +struct rb_root root_unstable_tree = RB_ROOT;
 +
 +
 +/* The number of linked list members inside the hash table */
 +static int nrmaps_hash;

A signed type doesn't seem appropriate.

 +/* rmap_hash hash table */
 +static struct hlist_head *rmap_hash;
 +
 +static struct kmem_cache *tree_item_cache;
 +static struct kmem_cache *rmap_item_cache;
 +
 +/* the number of nodes inside the stable tree */
 +static unsigned long nnodes_stable_tree;
 +
 +/* the number of kernel allocated pages outside the stable tree */
 +static unsigned long nkpage_out_tree;
 +
 +static int kthread_sleep; /* sleep time of the kernel thread */
 +static int kthread_pages_to_scan; /* npages to scan for the kernel thread */
 +static int kthread_max_kernel_pages; /* number of unswappable pages allowed 
 */

The kthread_max_kernel_pages isn't very illuminating.

The use of kthread in the identifier makes it look like part of the
kthread subsystem.

 +static unsigned long ksm_pages_shared;
 +static struct ksm_scan kthread_ksm_scan;
 +static int ksmd_flags;
 +static struct task_struct *kthread;
 +static DECLARE_WAIT_QUEUE_HEAD(kthread_wait);
 +static DECLARE_RWSEM(kthread_lock);
 +
 +

 ...

 +static pte_t *get_pte(struct mm_struct *mm, unsigned long addr)
 +{
 + pgd_t *pgd;
 + pud_t *pud;
 + pmd_t *pmd;
 + pte_t *ptep = NULL;
 +
 + pgd = pgd_offset(mm, addr);
 + if (!pgd_present(*pgd))
 + goto out;
 +
 + pud = pud_offset(pgd, addr);
 + if (!pud_present(*pud))
 + goto out;
 +
 + pmd = pmd_offset(pud, addr);
 + if (!pmd_present(*pmd))
 + goto out;
 +
 + ptep = pte_offset_map(pmd, addr);
 +out:
 + return ptep;
 +}

hm, this looks very generic.  Does it duplicate anything which core
kernel already provides?  If not, perhaps core kernel should provide
this (perhaps after some reorganisation).


 ...

 +static int rmap_hash_init(void)
 +{
 + if (!rmap_hash_size) {
 + struct sysinfo sinfo;
 +
 + si_meminfo(sinfo);
 + rmap_hash_size = sinfo.totalram / 10;

One slot per ten pages of physical memory?  Is this too large, too
small or just right?
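
(For scale: with 4GB of RAM that is roughly a million 4k pages, hence about
100,000 slots -- on the order of 800KB of vmalloc space for the table of
hlist heads on 64-bit.)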

 + }
 + nrmaps_hash = rmap_hash_size;
 + rmap_hash = vmalloc(nrmaps_hash * sizeof(struct hlist_head));
 + if (!rmap_hash)
 + return -ENOMEM;
 + memset(rmap_hash, 0, nrmaps_hash * sizeof(struct hlist_head));
 + return 0;
 +}
 +

 ...

 +static void break_cow(struct mm_struct *mm, unsigned long addr)
 +{
 + struct page *page[1];
 +
 + 

Re: [net-next PATCH 1/2] igbvf: add new driver to support 82576 virtual functions

2009-03-18 Thread Andrew Morton
On Wed, 18 Mar 2009 08:22:46 -0700 Alexander Duyck 
alexander.h.du...@intel.com wrote:

  +static int igbvf_set_ringparam(struct net_device *netdev,
  +   struct ethtool_ringparam *ring)
  +{
  + struct igbvf_adapter *adapter = netdev_priv(netdev);
  + struct igbvf_ring *tx_ring, *tx_old;
  + struct igbvf_ring *rx_ring, *rx_old;
  + int err;
  +
  + if ((ring->rx_mini_pending) || (ring->rx_jumbo_pending))
  + return -EINVAL;
  +
  + while (test_and_set_bit(__IGBVF_RESETTING, &adapter->state))
  + msleep(1);
  No timeout needed here?  Interrupts might not be working, for example..
  This bit isn't set in interrupt context.  This is always used out of 
  interrupt context and is just to prevent multiple setting changes at the 
  same time.
  
  Oh.  Can't use plain old mutex_lock()?
 
 We have one or two spots that actually check to see if the bit is set 
 and just report a warning instead of actually waiting on the bit to clear.

mutex_is_locked()?
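
(What the plain-mutex variant would look like -- "reset_lock" is a
hypothetical field standing in for the __IGBVF_RESETTING bit, not
something in the driver as posted:)

        mutex_lock(&adapter->reset_lock);       /* sleeps, no msleep(1) polling */
        /* ... resize the rings as in the patch above ... */
        mutex_unlock(&adapter->reset_lock);

        /* and the report-a-warning-only call sites become: */
        if (mutex_is_locked(&adapter->reset_lock))
                printk(KERN_WARNING "igbvf: reset already in progress\n");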


Re: [net-next PATCH 1/2] igbvf: add new driver to support 82576 virtual functions

2009-03-10 Thread Andrew Morton
On Tue, 10 Mar 2009 19:09:28 -0700 Jeff Kirsher jeffrey.t.kirs...@intel.com 
wrote:

 From: Alexander Duyck alexander.h.du...@intel.com
 
 This adds an igbvf driver to handle virtual functions provided
 by the igb driver.

The drive-by reader is now wondering what a virtual function is.
 

 ...

 +#define IGBVF_STAT(m, b) sizeof(((struct igbvf_adapter *)0)->m), \
 + offsetof(struct igbvf_adapter, m), \
 + offsetof(struct igbvf_adapter, b)

 +static const struct igbvf_stats igbvf_gstrings_stats[] = {
 + { "rx_packets", IGBVF_STAT(stats.gprc, stats.base_gprc) },
 + { "tx_packets", IGBVF_STAT(stats.gptc, stats.base_gptc) },
 + { "rx_bytes", IGBVF_STAT(stats.gorc, stats.base_gorc) },
 + { "tx_bytes", IGBVF_STAT(stats.gotc, stats.base_gotc) },
 + { "multicast", IGBVF_STAT(stats.mprc, stats.base_mprc) },
 + { "lbrx_bytes", IGBVF_STAT(stats.gorlbc, stats.base_gorlbc) },
 + { "lbrx_packets", IGBVF_STAT(stats.gprlbc, stats.base_gprlbc) },
 + { "tx_restart_queue", IGBVF_STAT(restart_queue, zero_base) },
 + { "rx_long_byte_count", IGBVF_STAT(stats.gorc, stats.base_gorc) },
 + { "rx_csum_offload_good", IGBVF_STAT(hw_csum_good, zero_base) },
 + { "rx_csum_offload_errors", IGBVF_STAT(hw_csum_err, zero_base) },
 + { "rx_header_split", IGBVF_STAT(rx_hdr_split, zero_base) },
 + { "alloc_rx_buff_failed", IGBVF_STAT(alloc_rx_buff_failed, zero_base) },
 +};

<stares at it for a while>

It would be clearer if `m' and `b' were (much) more meaningful identifiers.

 +#define IGBVF_GLOBAL_STATS_LEN   \
 + (sizeof(igbvf_gstrings_stats) / sizeof(struct igbvf_stats))

This is ARRAY_SIZE().
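
(i.e., with the helper from linux/kernel.h -- the same division plus a
build-time check that the argument really is an array:)

        #define IGBVF_GLOBAL_STATS_LEN  ARRAY_SIZE(igbvf_gstrings_stats)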

 +#define IGBVF_STATS_LEN (IGBVF_GLOBAL_STATS_LEN)

Why does this need to exist?


 ...

 +static int igbvf_set_ringparam(struct net_device *netdev,
 +   struct ethtool_ringparam *ring)
 +{
 + struct igbvf_adapter *adapter = netdev_priv(netdev);
 + struct igbvf_ring *tx_ring, *tx_old;
 + struct igbvf_ring *rx_ring, *rx_old;
 + int err;
 +
 + if ((ring->rx_mini_pending) || (ring->rx_jumbo_pending))
 + return -EINVAL;
 +
 + while (test_and_set_bit(__IGBVF_RESETTING, &adapter->state))
 + msleep(1);

No timeout needed here?  Interrupts might not be working, for example..

 + if (netif_running(adapter->netdev))
 + igbvf_down(adapter);
 +
 + tx_old = adapter->tx_ring;
 + rx_old = adapter->rx_ring;
 +
 + err = -ENOMEM;
 + tx_ring = kzalloc(sizeof(struct igbvf_ring), GFP_KERNEL);
 + if (!tx_ring)
 + goto err_alloc_tx;
 + /*
 +  * use a memcpy to save any previously configured
 +  * items like napi structs from having to be
 +  * reinitialized
 +  */
 + memcpy(tx_ring, tx_old, sizeof(struct igbvf_ring));
 +
 + rx_ring = kzalloc(sizeof(struct igbvf_ring), GFP_KERNEL);
 + if (!rx_ring)
 + goto err_alloc_rx;
 + memcpy(rx_ring, rx_old, sizeof(struct igbvf_ring));
 +
 + adapter->tx_ring = tx_ring;
 + adapter->rx_ring = rx_ring;
 +
 + rx_ring->count = max(ring->rx_pending, (u32)IGBVF_MIN_RXD);
 + rx_ring->count = min(rx_ring->count, (u32)(IGBVF_MAX_RXD));
 + rx_ring->count = ALIGN(rx_ring->count, REQ_RX_DESCRIPTOR_MULTIPLE);
 +
 + tx_ring->count = max(ring->tx_pending, (u32)IGBVF_MIN_TXD);
 + tx_ring->count = min(tx_ring->count, (u32)(IGBVF_MAX_TXD));
 + tx_ring->count = ALIGN(tx_ring->count, REQ_TX_DESCRIPTOR_MULTIPLE);
 +
 + if (netif_running(adapter->netdev)) {
 + /* Try to get new resources before deleting old */
 + err = igbvf_setup_rx_resources(adapter);
 + if (err)
 + goto err_setup_rx;
 + err = igbvf_setup_tx_resources(adapter);
 + if (err)
 + goto err_setup_tx;
 +
 + /*
 +  * restore the old in order to free it,
 +  * then add in the new
 +  */
 + adapter->rx_ring = rx_old;
 + adapter->tx_ring = tx_old;
 + igbvf_free_rx_resources(adapter);
 + igbvf_free_tx_resources(adapter);
 + kfree(tx_old);
 + kfree(rx_old);

That's odd-looking.  Why take a copy of rx_old and tx_old when we're
about to free them?

 + adapter->rx_ring = rx_ring;
 + adapter->tx_ring = tx_ring;
 + err = igbvf_up(adapter);
 + if (err)
 + goto err_setup;
 + }
 +
 + clear_bit(__IGBVF_RESETTING, &adapter->state);
 + return 0;
 +err_setup_tx:
 + igbvf_free_rx_resources(adapter);
 +err_setup_rx:
 + adapter->rx_ring = rx_old;
 + adapter->tx_ring = tx_old;
 + kfree(rx_ring);
 +err_alloc_rx:
 + kfree(tx_ring);
 +err_alloc_tx:
 + igbvf_up(adapter);
 +err_setup:
 + clear_bit(__IGBVF_RESETTING, &adapter->state);
 + return err;
 +}
 +

 ...

 +static void igbvf_diag_test(struct net_device *netdev,
 +struct ethtool_test 

Re: linux-next: Tree for January 23 (kvm)

2009-01-28 Thread Andrew Morton
(cc mailing lists)

On Wed, 28 Jan 2009 12:26:16 -0200
Marcelo Tosatti mtosa...@redhat.com wrote:

 On Wed, Jan 28, 2009 at 11:20:16AM +0100, Alexander Graf wrote:
 
  On 28.01.2009, at 10:51, Andrew Morton wrote:
 
  On Fri, 23 Jan 2009 12:34:24 -0800 Randy Dunlap 
  randy.dun...@oracle.com wrote:
 
  Stephen Rothwell wrote:
  Hi all,
 
  News:  I will be on leave next week, so there will probably be no
  linux-next release until Feb 2.
 
  Changes since 20090122:
 
 
  kvm changes cause this build error on i386:
 
  when kvm is built as module:
  ERROR: __moddi3 [arch/x86/kvm/kvm.ko] undefined!
 
  or when kvm is built into the kernel image:
  lapic.c:(.text+0x19276): undefined reference to `__moddi3'
 
 
  I guess it's this line:
  ns = ktime_to_ns(remaining) % apic->timer.period;
 
 
  dammit, I just spent N minutes hunting down the same thing in
  the Jan 26 linux-next :(
 
 
  Yeah, Clemens Foss posted a real fix on k...@vger already.

Probably wasn't the best mailing list, as this bug affects all
developers and testers..  But whatever - thanks.

 commit efe4ff38b2bbc3944d0b2c9b6ec669b67cb7acbc
 Author: Clemens Noss cn...@gmx.de
 Date:   Mon Jan 26 02:55:16 2009 +0100
 
 KVM: x86: fix __moddi3 undefined build failure in lapic.c
 
 use mod_64 from arch/x86/kvm/i8254.c to fix
 ERROR: __moddi3 [arch/x86/kvm/kvm.ko] undefined!
 
 Signed-off-by: Clemens Noss cn...@gmx.de
 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
 
 diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
 index d8adc50..f0b67f2 100644
 --- a/arch/x86/kvm/lapic.c
 +++ b/arch/x86/kvm/lapic.c
 @@ -35,6 +35,12 @@
  #include "kvm_cache_regs.h"
  #include "irq.h"
  
 +#ifndef CONFIG_X86_64
 +#define mod_64(x, y) ((x) - (y) * div64_u64(x, y))
 +#else
 +#define mod_64(x, y) ((x) % (y))
 +#endif
 +
  #define PRId64 "d"
  #define PRIx64 "llx"
  #define PRIu64 "u"
 @@ -525,7 +531,7 @@ static u32 apic_get_tmcct(struct kvm_lapic *apic)
   if (ktime_to_ns(remaining) < 0)
   remaining = ktime_set(0, 0);
  
 - ns = ktime_to_ns(remaining) % apic->timer.period;
 + ns = mod_64(ktime_to_ns(remaining), apic->timer.period);
   tmcct = div64_u64(ns, (APIC_BUS_CYCLE_NS * apic->timer.divide_count));
  
   return tmcct;

Rather than cooking up a private implementation I think it would be
better to implement a kernel-wide mod64_u64() (?) facility and then to
use that in KVM.  There will presumably be other sites which will (or
already do) need such a thing.

However it'll probably be a bit hard, getting the signednesses of the
args and the return value right.
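
(Such a helper would be a near one-liner next to div64_u64() in
include/linux/math64.h -- a sketch for the all-unsigned case, which
sidesteps the signedness question:)

        #include <linux/math64.h>

        /* hypothetical kernel-wide helper; same identity the mod_64() macro uses */
        static inline u64 mod64_u64(u64 dividend, u64 divisor)
        {
                return dividend - divisor * div64_u64(dividend, divisor);
        }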



Re: [PATCH 0/4] ksm - dynamic page sharing driver for linux

2008-11-11 Thread Andrew Morton
On Tue, 11 Nov 2008 15:21:37 +0200 Izik Eidus [EMAIL PROTECTED] wrote:

 KSM is a linux driver that allows dynamicly sharing identical memory pages
 between one or more processes.
 
 unlike tradtional page sharing that is made at the allocation of the
 memory, ksm do it dynamicly after the memory was created.
 Memory is periodically scanned; identical pages are identified and merged.
 the sharing is unnoticeable by the process that use this memory.
 (the shared pages are marked as readonly, and in case of write
 do_wp_page() take care to create new copy of the page)
 
 this driver is very useful for KVM as in cases of runing multiple guests
 operation system of the same type, many pages are sharable.
 this driver can be useful by OpenVZ as well.

These benefits should be quantified, please.  Also any benefits to any
other workloads should be identified and quantified.

The whole approach seems wrong to me.  The kernel lost track of these
pages and then we run around post-facto trying to fix that up again. 
Please explain (for the changelog) why the kernel cannot get this right
via the usual sharing, refcounting and COWing approaches.


Re: [PATCH 0/4] ksm - dynamic page sharing driver for linux

2008-11-11 Thread Andrew Morton
On Tue, 11 Nov 2008 20:48:16 +0200
Avi Kivity [EMAIL PROTECTED] wrote:

 Andrew Morton wrote:
  The whole approach seems wrong to me.  The kernel lost track of these
  pages and then we run around post-facto trying to fix that up again. 
  Please explain (for the changelog) why the kernel cannot get this right
  via the usual sharing, refcounting and COWing approaches.

 
 For kvm, the kernel never knew those pages were shared.  They are loaded 
 from independent (possibly compressed and encrypted) disk images.  These 
 images are different; but some pages happen to be the same because they 
 came from the same installation media.

What userspace-only changes could fix this?  Identify the common data,
write it to a flat file and mmap it, something like that?

 For OpenVZ the situation is less clear, but if you allow users to 
 independently upgrade their chroots you will eventually arrive at the 
 same scenario (unless of course you apply the same merging strategy at 
 the filesystem level).

hm.

There has been the occasional discussion about idenfifying all-zeroes
pages and scavenging them, repointing them at the zero page.  Could
this infrastructure be used for that?  (And how much would we gain from
it?)

[I'm looking for reasons why this is more than a muck-up-the-vm-for-kvm
thing here ;) ]


Re: [PATCH 0/4] ksm - dynamic page sharing driver for linux

2008-11-11 Thread Andrew Morton
On Tue, 11 Nov 2008 21:07:10 +0200
Izik Eidus [EMAIL PROTECTED] wrote:

 we have used KSM in production for about half year and the numbers that 
 came from our QA is:
 using KSM for desktop (KSM was tested just for windows desktop workload) 
 you can run as many as
 52 windows xp with 1 giga ram each on server with just 16giga ram. (this 
 is more than 300% overcommit)
 the reason is that most of the kernel/dlls of this guests is shared and 
 in addition we are sharing the windows zero
 (windows keep making all its free memory as zero, so every time windows 
 release memory we take the page back to the host)
 there is slide that give this numbers you can find at:
 http://kvm.qumranet.com/kvmwiki/KvmForum2008?action=AttachFile&do=get&target=kdf2008_3.pdf
  
 (slide 27)
 beside more i gave presentation about ksm that can be found at:
 http://kvm.qumranet.com/kvmwiki/KvmForum2008?action=AttachFile&do=get&target=kdf2008_12.pdf

OK, 300% isn't chicken feed.

It is quite important that information such as this be prepared, added to
the patch changelogs and maintained.  For a start, without this basic
information, there is no reason for anyone to look at any of the code!


Re: [PATCH 1/4] rmap: add page_wrprotect() function,

2008-11-11 Thread Andrew Morton
On Tue, 11 Nov 2008 15:21:38 +0200
Izik Eidus [EMAIL PROTECTED] wrote:

 From: Izik Eidus [EMAIL PROTECTED]
 
 this function is useful for cases you want to compare page and know
 that its value wont change during you compare it.
 
 this function is working by walking over the whole rmap of a page
 and mark every pte related to the page as write_protect.
 
 the odirect_sync paramter is used to notify the caller of
 page_wrprotect() if one pte or more was not marked readonly
 in order to avoid race with odirect.

The grammar is a bit mangled.  Missing apostrophes.  Sentences start
with capital letters.

Yeah, it's a nit, but we don't actually gain anything from deliberately
mangling the language.


 ...

 --- a/mm/rmap.c
 +++ b/mm/rmap.c
 @@ -576,6 +576,103 @@ int page_mkclean(struct page *page)
  }
  EXPORT_SYMBOL_GPL(page_mkclean);
  
 +static int page_wrprotect_one(struct page *page, struct vm_area_struct *vma,
 +   int *odirect_sync)
 +{
 + struct mm_struct *mm = vma->vm_mm;
 + unsigned long address;
 + pte_t *pte;
 + spinlock_t *ptl;
 + int ret = 0;
 +
 + address = vma_address(page, vma);
 + if (address == -EFAULT)
 + goto out;
 +
 + pte = page_check_address(page, mm, address, &ptl, 0);
 + if (!pte)
 + goto out;
 +
 + if (pte_write(*pte)) {
 + pte_t entry;
 +
 + if (page_mapcount(page) != page_count(page)) {
 + *odirect_sync = 0;
 + goto out_unlock;
 + }
 + flush_cache_page(vma, address, pte_pfn(*pte));
 + entry = ptep_clear_flush_notify(vma, address, pte);
 + entry = pte_wrprotect(entry);
 + set_pte_at(mm, address, pte, entry);
 + }
 + ret = 1;
 +
 +out_unlock:
 + pte_unmap_unlock(pte, ptl);
 +out:
 + return ret;
 +}

OK.  I think.  We need to find a way of provoking Hugh to look at it.

 +static int page_wrprotect_file(struct page *page, int *odirect_sync)
 +{
 + struct address_space *mapping;
 + struct prio_tree_iter iter;
 + struct vm_area_struct *vma;
 + pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
 + int ret = 0;
 +
 + mapping = page_mapping(page);

What pins *mapping in memory?  Usually this is done by requiring that
the caller has locked the page.  But no such precondition is documented
here.

 + if (!mapping)
 + return ret;
 +
 + spin_lock(&mapping->i_mmap_lock);
 +
 + vma_prio_tree_foreach(vma, iter, &mapping->i_mmap, pgoff, pgoff)
 + ret += page_wrprotect_one(page, vma, odirect_sync);
 +
 + spin_unlock(&mapping->i_mmap_lock);
 +
 + return ret;
 +}
 +
 +static int page_wrprotect_anon(struct page *page, int *odirect_sync)
 +{
 + struct vm_area_struct *vma;
 + struct anon_vma *anon_vma;
 + int ret = 0;
 +
 + anon_vma = page_lock_anon_vma(page);
 + if (!anon_vma)
 + return ret;
 +
 + list_for_each_entry(vma, &anon_vma->head, anon_vma_node)
 + ret += page_wrprotect_one(page, vma, odirect_sync);
 +
 + page_unlock_anon_vma(anon_vma);
 +
 + return ret;
 +}
 +
 +/**

This token means this is a kerneldoc comment.

 + * set all the ptes pointed to a page as read only,
 + * odirect_sync is set to 0 in case we cannot protect against race with 
 odirect
 + * return the number of ptes that were set as read only
 + * (ptes that were read only before this function was called are couned as 
 well)
 + */

But it isn't.

I don't understand this odirect_sync thing.  What race?  Please expand
this comment to make the function of odirect_sync more understandable.
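
(Judging from the code above, the race appears to involve O_DIRECT I/O:
get_user_pages() for an in-flight direct transfer raises the page count
without raising the map count, which is what the
page_mapcount() != page_count() test detects -- write-protecting and
sharing such a page could lose data still being DMAed into it.  Whatever
the precise scenario, it belongs in that comment.)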

 +int page_wrprotect(struct page *page, int *odirect_sync)
 +{
 + int ret =0;
 +
 + *odirect_sync = 1;
 + if (PageAnon(page))
 + ret = page_wrprotect_anon(page, odirect_sync);
 + else
 + ret = page_wrprotect_file(page, odirect_sync);
 +
 + return ret;
 +}
 +EXPORT_SYMBOL(page_wrprotect);

What do you think about making all this new code dependent upon some
CONFIG_ switch which CONFIG_KVM can select?
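
(Along the lines of the silent "config KSM / bool" pattern this series
already uses in mm/Kconfig -- the symbol name here is invented for
illustration:)

        # mm/Kconfig: a silent symbol, off unless something selects it
        config PAGE_WRPROTECT
                bool

        # arch/x86/kvm/Kconfig: the user of the code pulls it in
        config KVM
                tristate "Kernel-based Virtual Machine (KVM) support"
                select PAGE_WRPROTECT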



Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-11 Thread Andrew Morton
On Tue, 11 Nov 2008 15:21:39 +0200
Izik Eidus [EMAIL PROTECTED] wrote:

 From: Izik Eidus [EMAIL PROTECTED]
 
 this function is needed in cases you want to change the userspace
 virtual mapping into diffrent physical page,

Not sure that I understand that description.  We want to replace a live
page in an anonymous VMA with a different one?

It looks that way.

page migration already kinda does that.  Is there common ground?

 KSM need this for merging the identical pages.
 
 this function is working by removing the oldpage from the rmap and
 calling put_page on it, and by setting the virtual address pte
 to point into the new page.
 (note that the new page (the page that we change the pte to map to)
 cannot be anonymous page)
 

I don't understand the restrictions on anonymous pages.  Please expand
the changelog so that reviewers can understand the reasons for this
restriction.


 ---
  include/linux/mm.h |3 ++
  mm/memory.c|   68 
 
  2 files changed, 71 insertions(+), 0 deletions(-)
 
 diff --git a/include/linux/mm.h b/include/linux/mm.h
 index ffee2f7..4da7fa8 100644
 --- a/include/linux/mm.h
 +++ b/include/linux/mm.h
 @@ -1207,6 +1207,9 @@ int vm_insert_pfn(struct vm_area_struct *vma, unsigned 
 long addr,
  int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
   unsigned long pfn);
  
 +int replace_page(struct vm_area_struct *vma, struct page *oldpage,
 +  struct page *newpage, pte_t orig_pte, pgprot_t prot);
 +
  struct page *follow_page(struct vm_area_struct *, unsigned long address,
   unsigned int foll_flags);
  #define FOLL_WRITE   0x01/* check pte is writable */
 diff --git a/mm/memory.c b/mm/memory.c
 index 164951c..b2c542c 100644
 --- a/mm/memory.c
 +++ b/mm/memory.c
 @@ -1472,6 +1472,74 @@ int vm_insert_mixed(struct vm_area_struct *vma, 
 unsigned long addr,
  }
  EXPORT_SYMBOL(vm_insert_mixed);
  
 +/**
 + * replace _page - replace the pte mapping related to vm area between two 
 pages

s/replace _page/replace_page/

 + * (from oldpage to newpage)
 + * NOTE: you should take into consideration the impact on the VM when 
 replacing
 + * anonymous pages with kernel non swappable pages.
 + */

This _is_ a kerneldoc comment, but kernedoc comments conventionally
document the arguments and the return value also.

 +int replace_page(struct vm_area_struct *vma, struct page *oldpage,
 +  struct page *newpage, pte_t orig_pte, pgprot_t prot)
 +{
 + struct mm_struct *mm = vma->vm_mm;
 + pgd_t *pgd;
 + pud_t *pud;
 + pmd_t *pmd;
 + pte_t *ptep;
 + spinlock_t *ptl;
 + unsigned long addr;
 + int ret;
 +
 + BUG_ON(PageAnon(newpage));
 +
 + ret = -EFAULT;
 + addr = page_address_in_vma(oldpage, vma);
 + if (addr == -EFAULT)
 + goto out;
 +
 + pgd = pgd_offset(mm, addr);
 + if (!pgd_present(*pgd))
 + goto out;
 +
 + pud = pud_offset(pgd, addr);
 + if (!pud_present(*pud))
 + goto out;
 +
 + pmd = pmd_offset(pud, addr);
 + if (!pmd_present(*pmd))
 + goto out;
 +
 + ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
 + if (!ptep)
 + goto out;
 +
 + if (!pte_same(*ptep, orig_pte)) {
 + pte_unmap_unlock(ptep, ptl);
 + goto out;
 + }
 +
 + ret = 0;
 + get_page(newpage);
 + page_add_file_rmap(newpage);
 +
 + flush_cache_page(vma, addr, pte_pfn(*ptep));
 + ptep_clear_flush(vma, addr, ptep);
 + set_pte_at(mm, addr, ptep, mk_pte(newpage, prot));
 +
 + page_remove_rmap(oldpage, vma);
 + if (PageAnon(oldpage)) {
 + dec_mm_counter(mm, anon_rss);
 + inc_mm_counter(mm, file_rss);
 + }
 + put_page(oldpage);
 +
 + pte_unmap_unlock(ptep, ptl);
 +
 +out:
 + return ret;
 +}
 +EXPORT_SYMBOL(replace_page);

Again, we could make the presence of this code selectable by subsystems
which want it.



Re: [PATCH 3/4] add ksm kernel shared memory driver

2008-11-11 Thread Andrew Morton
On Tue, 11 Nov 2008 15:21:40 +0200
Izik Eidus [EMAIL PROTECTED] wrote:

 From: Izik Eidus [EMAIL PROTECTED]
 
 ksm is driver that allow merging identical pages between one or more
 applications in way unvisible to the application that use it.
 pages that are merged are marked as readonly and are COWed when any 
 application
 try to change them.
 
 ksm is working by walking over the memory pages of the applications it scan
 in order to find identical pages.
 it uses an hash table to find in effective way the identical pages.
 
 when ksm find two identical pages, it marked them as readonly and merge them
 into single one page,
 after the pages are marked as readonly and merged into one page, linux
 will treat this pages as normal copy_on_write pages and will fork them
 when write access will happen to them.
 
 ksm scan just memory areas that were registred to be scanned by it.
 

This driver appears to implement a new userspace interface, in /dev/ksm

Please fully document that interface in the changelog so that we can
review your decisions here.  This is by far the most important
consideration - we can change all the code, but interfaces are for
ever.

 diff --git a/drivers/Kconfig b/drivers/Kconfig
 index d38f43f..c1c701f 100644
 --- a/drivers/Kconfig
 +++ b/drivers/Kconfig
 @@ -105,4 +105,9 @@ source drivers/uio/Kconfig
  source drivers/xen/Kconfig
  
  source drivers/staging/Kconfig
 +
 +config KSM
 + bool KSM driver support
 + help
 + ksm is driver for merging identical pages between applciations

s/is/is a/

applications is misspelt.

  endmenu
 diff --git a/include/linux/ksm.h b/include/linux/ksm.h
 new file mode 100644
 index 000..f873502
 --- /dev/null
 +++ b/include/linux/ksm.h
 @@ -0,0 +1,53 @@
 +#ifndef __LINUX_KSM_H
 +#define __LINUX_KSM_H
 +
 +/*
 + * Userspace interface for /dev/ksm - kvm shared memory
 + */
 +
 +#include <asm/types.h>
 +#include <linux/ioctl.h>
 +
 +#define KSM_API_VERSION 1
 +
 +/* for KSM_REGISTER_MEMORY_REGION */
 +struct ksm_memory_region {
 + __u32 npages; /* number of pages to share */
 + __u32 pad;
 + __u64 addr; /* the begining of the virtual address */
 +};
 +
 +struct ksm_user_scan {
 + __u32 pages_to_scan;
 + __u32 max_pages_to_merge;
 +};
 +
 +struct ksm_kthread_info {
 + __u32 sleep; /* number of microsecoends to sleep */
 + __u32 pages_to_scan; /* number of pages to scan */
 + __u32 max_pages_to_merge;
 + __u32 running;
 +};
 +
 +#define KSMIO 0xAB
 +
 +/* ioctls for /dev/ksm */
 +#define KSM_GET_API_VERSION  _IO(KSMIO,   0x00)
 +#define KSM_CREATE_SHARED_MEMORY_AREA _IO(KSMIO,   0x01) /* return SMA fd */
 +#define KSM_CREATE_SCAN  _IO(KSMIO,   0x02) /* return SCAN fd */
 +#define KSM_START_STOP_KTHREAD _IOW(KSMIO,  0x03,\
 +   struct ksm_kthread_info)
 +#define KSM_GET_INFO_KTHREAD  _IOW(KSMIO,  0x04,\
 +   struct ksm_kthread_info) 
 +
 +
 +/* ioctls for SMA fds */
 +#define KSM_REGISTER_MEMORY_REGION   _IOW(KSMIO,  0x20,\
 +   struct ksm_memory_region)
 +#define KSM_REMOVE_MEMORY_REGION _IO(KSMIO,   0x21)
 +
 +/* ioctls for SCAN fds */
 +#define KSM_SCAN _IOW(KSMIO,  0x40,\
 +   struct ksm_user_scan)
 +
 +#endif

uh-oh, ioctls.

Please do document that interface for us.

 diff --git a/include/linux/miscdevice.h b/include/linux/miscdevice.h
 index 26433ec..adc2435 100644
 --- a/include/linux/miscdevice.h
 +++ b/include/linux/miscdevice.h
 @@ -30,6 +30,7 @@
  #define TUN_MINOR 200
  #define  HPET_MINOR   228
 #define KVM_MINOR        232
 +#define KSM_MINOR        233
  
  struct device;
  
 diff --git a/mm/Kconfig b/mm/Kconfig
 index 5b5790f..e7f0061 100644
 --- a/mm/Kconfig
 +++ b/mm/Kconfig
 @@ -222,3 +222,6 @@ config UNEVICTABLE_LRU
  
  config MMU_NOTIFIER
   bool
 +
 +config KSM
 + bool
 diff --git a/mm/Makefile b/mm/Makefile
 index c06b45a..9722afe 100644
 --- a/mm/Makefile
 +++ b/mm/Makefile
 @@ -26,6 +26,7 @@ obj-$(CONFIG_TMPFS_POSIX_ACL) += shmem_acl.o
  obj-$(CONFIG_TINY_SHMEM) += tiny-shmem.o
  obj-$(CONFIG_SLOB) += slob.o
  obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
 +obj-$(CONFIG_KSM) += ksm.o
  obj-$(CONFIG_SLAB) += slab.o
  obj-$(CONFIG_SLUB) += slub.o
  obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
 diff --git a/mm/ksm.c b/mm/ksm.c
 new file mode 100644
 index 000..977eb37
 --- /dev/null
 +++ b/mm/ksm.c
 @@ -0,0 +1,1202 @@
 +/*
 + * Memory merging driver for Linux
 + *
 + * This module enables dynamic sharing of identical pages found in different
 + * memory areas, even if they are not shared by fork()
 + *
 + * Copyright (C) 2008 Red Hat, Inc.
 + * Authors:
 + *   Izik Eidus
 + *   Andrea Arcangeli
 + *   Chris Wright
 + *
 + * This work is licensed under the terms of the GNU GPL, version 2.
 + 

Re: [PATCH 1/4] rmap: add page_wrprotect() function,

2008-11-11 Thread Andrew Morton
On Tue, 11 Nov 2008 21:38:06 +0100
Andrea Arcangeli [EMAIL PROTECTED] wrote:

   + * set all the ptes pointed to a page as read only,
   + * odirect_sync is set to 0 in case we cannot protect against race with 
   odirect
   + * return the number of ptes that were set as read only
   + * (ptes that were read only before this function was called are couned 
   as well)
   + */
  
  But it isn't.
 
 What isn't?

This code comment had the kerneldoc marker (/**) but it isn't a
kerneldoc comment.

  I don't understand this odirect_sync thing.  What race?  Please expand
  this comment to make the function of odirect_sync more understandable.
 
 I should have answered this one with the above 3 links.

OK, well can we please update the code so these things are clearer.

(It's a permanent problem I have.  I ask "what is this?", but I really
mean "the code should be changed so that readers will know what this is".)

  What do you think about making all this new code dependent upon some
  CONFIG_ switch which CONFIG_KVM can select?
 
 I like that too.


Re: mmotm 2008-09-22-01-36 uploaded (kvm)

2008-09-22 Thread Andrew Morton
On Mon, 22 Sep 2008 11:19:47 -0700
Randy Dunlap [EMAIL PROTECTED] wrote:

 On Mon, 22 Sep 2008 01:38:58 -0700 [EMAIL PROTECTED] wrote:
 
  The mm-of-the-moment snapshot 2008-09-22-01-36 has been uploaded to
  
 http://userweb.kernel.org/~akpm/mmotm/
  
  It contains the following patches against 2.6.27-rc7:
 
 
 
 ERROR: intel_iommu_found [arch/x86/kvm/kvm.ko] undefined!
 
 when CONFIG_DMAR=n.
 

Yes, I got that too.  i386 allmodconfig fails similarly.  It's still
unclear which tree in linux-next broke it.  PCI or async_tx, perhaps.


Re: mmotm 2008-09-22-01-36 uploaded (kvm)

2008-09-22 Thread Andrew Morton
On Mon, 22 Sep 2008 14:58:52 -0700
Jesse Barnes [EMAIL PROTECTED] wrote:

 On Monday, September 22, 2008 12:52 pm Andrew Morton wrote:
  On Mon, 22 Sep 2008 11:19:47 -0700
 
  Randy Dunlap [EMAIL PROTECTED] wrote:
   On Mon, 22 Sep 2008 01:38:58 -0700 [EMAIL PROTECTED] wrote:
The mm-of-the-moment snapshot 2008-09-22-01-36 has been uploaded to
   
   http://userweb.kernel.org/~akpm/mmotm/
   
It contains the following patches against 2.6.27-rc7:
  
   ERROR: intel_iommu_found [arch/x86/kvm/kvm.ko] undefined!
  
   when CONFIG_DMAR=n.
 
  Yes, I got that too.  i386 allmodconfig fails similarly.  It's still
  unclear which tree in linux-next broke it.  PCI or async_tx, perhaps.
 
 I don't have that symbol in my tree (yet).  It's part of one of the IOMMU KVM 
 patch sets that's currently in-flight.  I'll add it to my linux-next tree 
 today or tomorrow, assuming I get an ack from David about it (I'll bounce it 
 over to him now).
 

It's already there, and exported in linux-next.  But it just doesn't get
compiled in due to some Kconfig snafu.



linux-next: ia64 allmodconfig

2008-07-30 Thread Andrew Morton
arch/ia64/kvm/../../../virt/kvm/ioapic.c:42:17: irq.h: No such file or directory
arch/ia64/kvm/../../../virt/kvm/ioapic.c: In function `__kvm_ioapic_update_eoi':
arch/ia64/kvm/../../../virt/kvm/ioapic.c:296: error: implicit declaration of 
function `kvm_notify_acked_irq'
arch/ia64/kvm/../../../virt/kvm/ioapic.c: In function `ioapic_mmio_write':
arch/ia64/kvm/../../../virt/kvm/ioapic.c:389: error: too few arguments to functi