Re: [PATCH 1/2] KVM: MMU: fix SMAP virtualization
On Wed, 10 Jun 2015 11:02:46 -0700 Davidlohr Bueso d...@stgolabs.net wrote: On Wed, 2015-05-27 at 10:53 +0800, Xiao Guangrong wrote: On 05/26/2015 10:48 PM, Paolo Bonzini wrote: On 26/05/2015 16:45, Edward Cree wrote: This breaks older compilers that can't initialize anon structures. How old ? Even gcc 3.1 says you can use unnamed struct/union fields and 3.2 is the minimum version required to compile the kernel as mentioned in the README. We could simply just name the structure, but I doubt this is the only place in the kernel code where it's being used this way :) This appears to be GCC bug #10676, see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=10676 Says it was fixed in 4.6, but I believe the kernel supports GCCs much older than that (back to 3.2). I personally hit it on 4.4.7, the version shipped with RHEL6.6. Yes, it will be fixed soon(ish). Probably before you can get rid of the obnoxious disclaimer... :) It has been fixed by Andrew: From: Andrew Morton a...@linux-foundation.org Subject: arch/x86/kvm/mmu.c: work around gcc-4.4.4 bug Folks, this fix is missing in both -tip and Linus' (4.1-rc7) tree. At least I'm running into it. hm, yes, it's in linux-next so I dropped my copy. I'll send it in to Linus today. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 22/23] userfaultfd: avoid mmap_sem read recursion in mcopy_atomic
On Thu, 14 May 2015 19:31:19 +0200 Andrea Arcangeli aarca...@redhat.com wrote: If the rwsem starves writers it wasn't strictly a bug but lockdep doesn't like it and this avoids depending on lowlevel implementation details of the lock. ... @@ -229,13 +246,33 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm, if (!zeropage) err = mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma, -dst_addr, src_addr); +dst_addr, src_addr, page); else err = mfill_zeropage_pte(dst_mm, dst_pmd, dst_vma, dst_addr); cond_resched(); + if (unlikely(err == -EFAULT)) { + void *page_kaddr; + + BUILD_BUG_ON(zeropage); I'm not sure what this is trying to do. BUILD_BUG_ON(local_variable)? It goes bang in my build. I'll just delete it. + up_read(&dst_mm->mmap_sem); + BUG_ON(!page); + + page_kaddr = kmap(page); + err = copy_from_user(page_kaddr, + (const void __user *) src_addr, + PAGE_SIZE); + kunmap(page); + if (unlikely(err)) { + err = -EFAULT; + goto out; + } + goto retry; + } else + BUG_ON(page); +
Re: [PATCH 22/23] userfaultfd: avoid mmap_sem read recursion in mcopy_atomic
There's a more serious failure with i386 allmodconfig: fs/userfaultfd.c:145:2: note: in expansion of macro 'BUILD_BUG_ON' BUILD_BUG_ON(sizeof(struct uffd_msg) != 32); I'm surprised the feature is even reachable on i386 builds?
Re: [PATCH 2/4] mm: introduce mm_forbids_zeropage function
On Wed, 22 Oct 2014 13:09:28 +0200 Dominik Dingel din...@linux.vnet.ibm.com wrote: Add a new function stub to allow architectures to disable, for an mm_struct, the backing of non-present, anonymous pages with read-only empty zero pages. ... --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -56,6 +56,10 @@ extern int sysctl_legacy_va_layout; #define __pa_symbol(x) __pa(RELOC_HIDE((unsigned long)(x), 0)) #endif +#ifndef mm_forbids_zeropage +#define mm_forbids_zeropage(X) (0) +#endif Can we document this please? What it does, why it does it. We should also specify precisely which arch header file is responsible for defining mm_forbids_zeropage.
Re: [PATCH 2/4] mm: introduce mm_forbids_zeropage function
On Wed, 22 Oct 2014 21:45:52 +0200 Dominik Dingel din...@linux.vnet.ibm.com wrote: +#ifndef mm_forbids_zeropage +#define mm_forbids_zeropage(X) (0) +#endif Can we document this please? What it does, why it does it. We should also specify precisely which arch header file is responsible for defining mm_forbids_zeropage. I will add a comment like: /* * To prevent common memory management code establishing * a zero page mapping on a read fault. * This function should be implemented within asm/pgtable.h. s/function should be implemented/macro should be defined/ * s390 does this to prevent multiplexing of hardware bits * related to the physical page in case of virtualization. */ Okay? Looks great, thanks.
Re: [PATCH v2] kvm: Faults which trigger IO release the mmap_sem
On Mon, 22 Sep 2014 23:32:36 +0200 Paolo Bonzini pbonz...@redhat.com wrote: Il 22/09/2014 22:49, Andres Lagar-Cavilla ha scritto: Paolo, should I recut including the recent Reviewed-by's? No, I'll add them myself. Paolo, is this patch waiting for something? Is Gleb's Reviewed-by enough? It's waiting for an Acked-by on the mm/ changes. The MM changes look good to me.
Re: [PATCH] mm: export symbol dependencies of is_zero_pfn()
On Fri, 12 Sep 2014 22:17:23 +0200 Ard Biesheuvel ard.biesheu...@linaro.org wrote: In order to make the static inline function is_zero_pfn() callable by modules, export its symbol dependencies 'zero_pfn' and (for s390 and mips) 'zero_page_mask'. So hexagon and score get the export if/when needed. We need this for KVM, as CONFIG_KVM is a tristate for all supported architectures except ARM and arm64, and testing whether a pfn refers to the zero page is required to correctly distinguish the zero page from other special RAM ranges that may also have the PG_reserved bit set, but need to be treated as MMIO memory. Signed-off-by: Ard Biesheuvel ard.biesheu...@linaro.org --- arch/mips/mm/init.c | 1 + arch/s390/mm/init.c | 1 + mm/memory.c | 2 ++ Looks OK to me. Please include the patch in whichever tree it is that needs it, and merge it up via that tree.
Re: [PATCH] CMA: generalize CMA reserved area management functionality (fixup)
On Thu, 17 Jul 2014 11:36:07 +0200 Marek Szyprowski m.szyprow...@samsung.com wrote: MAX_CMA_AREAS is used by other subsystems (i.e. arch/arm/mm/dma-mapping.c), so we need to provide a correct definition even if CMA is disabled. This patch fixes this issue. Reported-by: Sylwester Nawrocki s.nawro...@samsung.com Signed-off-by: Marek Szyprowski m.szyprow...@samsung.com --- include/linux/cma.h | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/include/linux/cma.h b/include/linux/cma.h index 9a18a2b1934c..c077635cad76 100644 --- a/include/linux/cma.h +++ b/include/linux/cma.h @@ -5,7 +5,11 @@ * There is always at least global CMA area and a few optional * areas configured in kernel .config. */ +#ifdef CONFIG_CMA #define MAX_CMA_AREAS (1 + CONFIG_CMA_AREAS) +#else +#define MAX_CMA_AREAS (0) +#endif struct cma; Joonsoo already fixed this up, a bit differently: http://ozlabs.org/~akpm/mmots/broken-out/cma-generalize-cma-reserved-area-management-functionality-fix.patch Which approach makes more sense? From: Joonsoo Kim iamjoonsoo@lge.com Subject: CMA: fix ARM build failure related to MAX_CMA_AREAS definition If CMA is disabled, CONFIG_CMA_AREAS isn't defined, so a compile error happens. To fix it, define MAX_CMA_AREAS even if CONFIG_CMA_AREAS isn't defined. Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com Reported-by: Stephen Rothwell s...@canb.auug.org.au Signed-off-by: Andrew Morton a...@linux-foundation.org --- include/linux/cma.h | 6 ++++++ 1 file changed, 6 insertions(+) diff -puN include/linux/cma.h~cma-generalize-cma-reserved-area-management-functionality-fix include/linux/cma.h --- a/include/linux/cma.h~cma-generalize-cma-reserved-area-management-functionality-fix +++ a/include/linux/cma.h @@ -5,8 +5,14 @@ * There is always at least global CMA area and a few optional * areas configured in kernel .config.
*/ +#ifdef CONFIG_CMA_AREAS #define MAX_CMA_AREAS (1 + CONFIG_CMA_AREAS) +#else +#define MAX_CMA_AREAS (0) + +#endif + struct cma; extern phys_addr_t cma_get_base(struct cma *cma); _
Re: [PATCH v3 -next 0/9] CMA: generalize CMA reserved area management code
On Wed, 25 Jun 2014 14:33:56 +0200 Marek Szyprowski m.szyprow...@samsung.com wrote: That's probably easier. Marek, I'll merge these into -mm (and hence -next and git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git) and shall hold them pending you review/ack/test/etc, OK? Ok. I've tested them and they work fine. I'm sorry that you had to wait for me for a few days. You can now add: Acked-and-tested-by: Marek Szyprowski m.szyprow...@samsung.com Thanks. I've also rebased my pending patches onto this set (I will send them soon). The question is now if you want to keep the discussed patches in your -mm tree, or should I take them to my -next branch. If you like to keep them, I assume you will also take the patches which depend on the discussed changes. Yup, that works.
Re: [PATCH v3 -next 4/9] DMA, CMA: support arbitrary bitmap granularity
On Mon, 16 Jun 2014 14:40:46 +0900 Joonsoo Kim iamjoonsoo@lge.com wrote: PPC KVM's CMA area management requires arbitrary bitmap granularity, since they want to reserve very large memory and manage this region with a bitmap in which one bit stands for several pages, to reduce management overheads. So support arbitrary bitmap granularity for the following generalization. ... --- a/drivers/base/dma-contiguous.c +++ b/drivers/base/dma-contiguous.c @@ -38,6 +38,7 @@ struct cma { unsigned long base_pfn; unsigned long count; unsigned long *bitmap; + unsigned int order_per_bit; /* Order of pages represented by one bit */ struct mutex lock; }; @@ -157,9 +158,37 @@ void __init dma_contiguous_reserve(phys_addr_t limit) static DEFINE_MUTEX(cma_mutex); +static unsigned long cma_bitmap_aligned_mask(struct cma *cma, int align_order) +{ + return (1 << (align_order - cma->order_per_bit)) - 1; +} Might want a 1UL ... here. +static unsigned long cma_bitmap_maxno(struct cma *cma) +{ + return cma->count >> cma->order_per_bit; +} + +static unsigned long cma_bitmap_pages_to_bits(struct cma *cma, + unsigned long pages) +{ + return ALIGN(pages, 1 << cma->order_per_bit) >> cma->order_per_bit; +} Ditto. I'm not really sure what the compiler will do in these cases, but would prefer not to rely on it anyway!
--- a/drivers/base/dma-contiguous.c~dma-cma-support-arbitrary-bitmap-granularity-fix +++ a/drivers/base/dma-contiguous.c @@ -160,7 +160,7 @@ static DEFINE_MUTEX(cma_mutex); static unsigned long cma_bitmap_aligned_mask(struct cma *cma, int align_order) { - return (1 << (align_order - cma->order_per_bit)) - 1; + return (1UL << (align_order - cma->order_per_bit)) - 1; } static unsigned long cma_bitmap_maxno(struct cma *cma) @@ -171,7 +171,7 @@ static unsigned long cma_bitmap_maxno(st static unsigned long cma_bitmap_pages_to_bits(struct cma *cma, unsigned long pages) { - return ALIGN(pages, 1 << cma->order_per_bit) >> cma->order_per_bit; + return ALIGN(pages, 1UL << cma->order_per_bit) >> cma->order_per_bit; } static void cma_clear_bitmap(struct cma *cma, unsigned long pfn, int count) _
Re: [PATCH v3 -next 0/9] CMA: generalize CMA reserved area management code
On Tue, 17 Jun 2014 10:25:07 +0900 Joonsoo Kim iamjoonsoo@lge.com wrote: v2: - Although this patchset looks very different from v1, the end result, that is, mm/cma.c, is the same as v1's. So I carry the Acks to patches 6-7. This patchset is based on linux-next 20140610. Thanks for taking care of this. I will test it with my setup and if everything goes well, I will take it to my -next tree. If any branch is required for anyone to continue his works on top of those patches, let me know, I will also prepare it. Hello, I'm glad to hear that. :) But, there is one concern. As you already know, I am preparing further patches (Aggressively allocate the pages on CMA reserved memory). It may be highly related to the MM branch and also slightly depends on these CMA changes. In this case, what is the best strategy to merge this patchset? IMHO, Andrew's tree is the more appropriate branch. If there is no issue in this case, I am willing to develop further patches based on your tree. That's probably easier. Marek, I'll merge these into -mm (and hence -next and git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git) and shall hold them pending you review/ack/test/etc, OK?
Re: [PATCH -mm 0/3] fix numa vs kvm scalability issue
On Wed, 19 Feb 2014 09:59:17 +0100 Peter Zijlstra pet...@infradead.org wrote: On Tue, Feb 18, 2014 at 05:12:43PM -0500, r...@redhat.com wrote: The NUMA scanning code can end up iterating over many gigabytes of unpopulated memory, especially in the case of a freshly started KVM guest with lots of memory. This results in the mmu notifier code being called even when there are no mapped pages in a virtual address range. The amount of time wasted can be enough to trigger soft lockup warnings with very large (2TB) KVM guests. This patch moves the mmu notifier call to the pmd level, which represents 1GB areas of memory on x86-64. Furthermore, the mmu notifier code is only called from the address in the PMD where present mappings are first encountered. The hugetlbfs code is left alone for now; hugetlb mappings are not relocatable, and as such are left alone by the NUMA code, and should never trigger this problem to begin with. The series also adds a cond_resched to task_numa_work, to fix another potential latency issue. Andrew, I'll pick up the first kernel/sched/ patch; do you want the other two mm/ patches? That works, thanks.
Re: [PATCH 01/14] asmlinkage, kvm: Make kvm_rebooting visible
On Sat, 8 Feb 2014 08:51:57 +0100 Andi Kleen a...@linux.intel.com wrote: kvm_rebooting is referenced from assembler code, thus needs to be visible. So I read the gcc page and looked at the __visible definition but I still don't really get it. What goes wrong if the __visible isn't present on these referenced-from-asm identifiers? Cc: g...@redhat.com Cc: pbonz...@redhat.com Grumble. Email addresses go into commits with display names and <>s. At least, 89.3% of them do. Some sucker has to go through these and fix them up. I'd prefer it not be me ;) ... --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -102,7 +102,7 @@ static void kvm_release_pfn_dirty(pfn_t pfn); static void mark_page_dirty_in_slot(struct kvm *kvm, struct kvm_memory_slot *memslot, gfn_t gfn); -bool kvm_rebooting; +__visible bool kvm_rebooting; EXPORT_SYMBOL_GPL(kvm_rebooting); static bool largepages_enabled = true;
Re: [PATCH-v3 1/4] idr: Percpu ida
On Wed, 28 Aug 2013 12:53:17 -0700 Kent Overstreet k...@daterainc.com wrote: + while (1) { + spin_lock(&pool->lock); + + /* + * prepare_to_wait() must come before steal_tags(), in case + * percpu_ida_free() on another cpu flips a bit in + * cpus_have_tags + * + * global lock held and irqs disabled, don't need percpu lock + */ + prepare_to_wait(&pool->wait, &wait, TASK_UNINTERRUPTIBLE); + + if (!tags->nr_free) + alloc_global_tags(pool, tags); + if (!tags->nr_free) + steal_tags(pool, tags); + + if (tags->nr_free) { + tag = tags->freelist[--tags->nr_free]; + if (tags->nr_free) + set_bit(smp_processor_id(), + pool->cpus_have_tags); + } + + spin_unlock(&pool->lock); + local_irq_restore(flags); + + if (tag >= 0 || !(gfp & __GFP_WAIT)) + break; + + schedule(); + + local_irq_save(flags); + tags = this_cpu_ptr(pool->tag_cpu); + } What guarantees that this wait will terminate? It seems fairly clear to me from the break statement a couple lines up; if we were passed __GFP_WAIT we terminate iff we successfully allocated a tag. If we weren't passed __GFP_WAIT we never actually sleep. OK ;) Let me rephrase. What guarantees that a tag will become available? If what we have here is an open-coded __GFP_NOFAIL then that is potentially problematic.
Re: [PATCH] percpu ida: Switch to cpumask_t, add some comments
On Wed, 28 Aug 2013 12:55:17 -0700 Kent Overstreet k...@daterainc.com wrote: Fixup patch, addressing Andrew's review feedback: Looks reasonable. lib/idr.c | 38 +- I still don't think it should be in this file. You say that some as-yet-unmerged patches will tie the new code into the old ida code. But will it do it in a manner which requires that the two reside in the same file?
Re: [PATCH-v3 1/4] idr: Percpu ida
On Wed, 28 Aug 2013 13:44:54 -0700 Kent Overstreet k...@daterainc.com wrote: What guarantees that this wait will terminate? It seems fairly clear to me from the break statement a couple lines up; if we were passed __GFP_WAIT we terminate iff we successfully allocated a tag. If we weren't passed __GFP_WAIT we never actually sleep. OK ;) Let me rephrase. What guarantees that a tag will become available? If what we have here is an open-coded __GFP_NOFAIL then that is potentially problematic. It's the same semantics as a mempool, really - it'll succeed when a tag gets freed. OK, that's reasonable if the code is being used to generate IO tags - we expect the in-flight tags to eventually be returned. But if a client of this code is using the allocator for something totally different, there is no guarantee that the act of waiting will result in any tags being returned. (These are core design principles/constraints which should be explicitly documented in a place where future readers will see them!)
Re: [PATCH] percpu ida: Switch to cpumask_t, add some comments
On Wed, 28 Aug 2013 14:00:10 -0700 Kent Overstreet k...@daterainc.com wrote: On Wed, Aug 28, 2013 at 01:25:50PM -0700, Andrew Morton wrote: On Wed, 28 Aug 2013 12:55:17 -0700 Kent Overstreet k...@daterainc.com wrote: Fixup patch, addressing Andrew's review feedback: Looks reasonable. lib/idr.c | 38 +- I still don't think it should be in this file. You say that some as-yet-unmerged patches will tie the new code into the old ida code. But will it do it in a manner which requires that the two reside in the same file? Not require, no - but it's just intimate enough with my ida rewrite that I think it makes sense; it makes some use of stuff that should be internal to the ida code. Mostly just sharing the lock though, since I got rid of the ida interfaces that don't do locking, but percpu ida needs a lock that also covers what ida needs. It also makes use of a ganged allocation interface, but there's no real reason ida can't expose that, it's just unlikely to be useful to anything but percpu ida. The other reason I think it makes sense to live in idr.c is more for users of the code; as you pointed out as far as the user's perspective percpu ida isn't doing anything fundamentally different from ida, so I think it makes sense for the code to live in the same place as a kindness to future kernel developers who are trying to find their way around the various library code. I found things to be quite the opposite - it took 5 minutes of staring, head-scratching, double-checking and penny-dropping before I was confident that the newly-added code actually has nothing at all to do with the current code. Putting it in the same file was misleading, and I got misled.
Re: [PATCH-v3 1/4] idr: Percpu ida
On Wed, 28 Aug 2013 14:12:17 -0700 Kent Overstreet k...@daterainc.com wrote: How's this look? diff --git a/lib/idr.c b/lib/idr.c index 15c021c..a3f8e9a 100644 --- a/lib/idr.c +++ b/lib/idr.c @@ -1288,6 +1288,11 @@ static inline unsigned alloc_local_tag(struct percpu_ida *pool, * Safe to be called from interrupt context (assuming it isn't passed * __GFP_WAIT, of course). * + * @gfp indicates whether or not to wait until a free id is available (it's not + * used for internal memory allocations); thus if passed __GFP_WAIT we may sleep + * however long it takes until another thread frees an id (same semantics as a + * mempool). Looks good. Mentioning the mempool thing is effective - people understand that.
Re: [PATCH] percpu ida: Switch to cpumask_t, add some comments
On Wed, 28 Aug 2013 14:23:58 -0700 Kent Overstreet k...@daterainc.com wrote: I found things to be quite the opposite - it took 5 minutes of staring, head-scratching, double-checking and penny-dropping before I was confident that the newly-added code actually has nothing at all to do with the current code. Putting it in the same file was misleading, and I got misled. Ok... and I could see how the fact that it currently _doesn't_ have anything to do with the existing code would be confusing... Do you think that if/when it's making use of the ida rewrite it'll be ok? Or would you still prefer to have it in a new file I'm constitutionally reluctant to ever assume that any out-of-tree code will be merged. Maybe you'll get hit by a bus, and maybe the code sucks ;) Are you sure that the two things are so tangled together that they must live in the same file? If there's some nice layering between ida and percpu_ida then perhaps such a physical separation would remain appropriate? (and if so, any preference on the naming?) percpu_ida.c?
Re: [PATCH-v3 1/4] idr: Percpu ida
On Fri, 16 Aug 2013 23:09:06 + Nicholas A. Bellinger n...@linux-iscsi.org wrote: From: Kent Overstreet k...@daterainc.com Percpu frontend for allocating ids. With percpu allocation (that works), it's impossible to guarantee it will always be possible to allocate all nr_tags - typically, some will be stuck on a remote percpu freelist where the current job can't get to them. We do guarantee that it will always be possible to allocate at least (nr_tags / 2) tags - this is done by keeping track of which and how many cpus have tags on their percpu freelists. On allocation failure if enough cpus have tags that there could potentially be (nr_tags / 2) tags stuck on remote percpu freelists, we then pick a remote cpu at random to steal from. Note that there's no cpu hotplug notifier - we don't care, because steal_tags() will eventually get the down cpu's tags. We _could_ satisfy more allocations if we had a notifier - but we'll still meet our guarantees and it's absolutely not a correctness issue, so I don't think it's worth the extra code. ... include/linux/idr.h | 53 + lib/idr.c | 316 +-- I don't think this should be in idr.[ch] at all. It has no relationship with the existing code. Apart from duplicating its functionality :( ... @@ -243,4 +245,55 @@ static inline int ida_get_new(struct ida *ida, int *p_id) void __init idr_init_cache(void); +/* Percpu IDA/tag allocator */ + +struct percpu_ida_cpu; + +struct percpu_ida { + /* + * number of tags available to be allocated, as passed to + * percpu_ida_init() + */ + unsignednr_tags; + + struct percpu_ida_cpu __percpu *tag_cpu; + + /* + * Bitmap of cpus that (may) have tags on their percpu freelists: + * steal_tags() uses this to decide when to steal tags, and which cpus + * to try stealing from. + * + * It's ok for a freelist to be empty when its bit is set - steal_tags() + * will just keep looking - but the bitmap _must_ be set whenever a + * percpu freelist does have tags. 
+	 */
+	unsigned long			*cpus_have_tags;

Why not cpumask_t?

+	struct {
+		spinlock_t		lock;
+		/*
+		 * When we go to steal tags from another cpu (see steal_tags()),
+		 * we want to pick a cpu at random. Cycling through them every
+		 * time we steal is a bit easier and more or less equivalent:
+		 */
+		unsigned		cpu_last_stolen;
+
+		/* For sleeping on allocation failure */
+		wait_queue_head_t	wait;
+
+		/*
+		 * Global freelist - it's a stack where nr_free points to the
+		 * top
+		 */
+		unsigned		nr_free;
+		unsigned		*freelist;
+	} ____cacheline_aligned_in_smp;

Why the ____cacheline_aligned_in_smp?

+};

...

+
+/* Percpu IDA */
+
+/*
+ * Number of tags we move between the percpu freelist and the global freelist at
+ * a time

"between a percpu freelist" would be more accurate?

+ */
+#define IDA_PCPU_BATCH_MOVE	32U
+
+/* Max size of percpu freelist, */
+#define IDA_PCPU_SIZE		((IDA_PCPU_BATCH_MOVE * 3) / 2)
+
+struct percpu_ida_cpu {
+	spinlock_t		lock;
+	unsigned		nr_free;
+	unsigned		freelist[];
+};

Data structure needs documentation. There's one of these per cpu. I guess nr_free and freelist are clear enough. The presence of a lock in a percpu data structure is a surprise. It's for cross-cpu stealing, I assume?

+static inline void move_tags(unsigned *dst, unsigned *dst_nr,
+			     unsigned *src, unsigned *src_nr,
+			     unsigned nr)
+{
+	*src_nr -= nr;
+	memcpy(dst + *dst_nr, src + *src_nr, sizeof(unsigned) * nr);
+	*dst_nr += nr;
+}
+
...
+static inline void alloc_global_tags(struct percpu_ida *pool,
+				     struct percpu_ida_cpu *tags)
+{
+	move_tags(tags->freelist, &tags->nr_free,
+		  pool->freelist, &pool->nr_free,
+		  min(pool->nr_free, IDA_PCPU_BATCH_MOVE));
+}

Document this function?

+static inline unsigned alloc_local_tag(struct percpu_ida *pool,
+				       struct percpu_ida_cpu *tags)
+{
+	int tag = -ENOSPC;
+
+	spin_lock(&tags->lock);
+	if (tags->nr_free)
+		tag = tags->freelist[--tags->nr_free];
+	spin_unlock(&tags->lock);
+
+	return tag;
+}

I guess this one's clear enough, if the data structure relationships are understood.
+/**
+ * percpu_ida_alloc - allocate a tag
+ * @pool: pool to allocate from
+ * @gfp: gfp flags
+ *
+ * Returns a tag - an integer in the
Re: [PATCH 1/2] mm: add support for discard of unused ptes
On Thu, 25 Jul 2013 10:54:20 +0200 Martin Schwidefsky schwidef...@de.ibm.com wrote: From: Konstantin Weitz konstantin.we...@gmail.com

In a virtualized environment and given an appropriate interface the guest can mark pages as unused while they are free (for the s390 implementation see git commit 45e576b1c3d00206 "guest page hinting light"). For the host the unused state is a property of the pte. This patch adds the primitive 'pte_unused' and code to the host swap out handler so that pages marked as unused by all mappers are not swapped out but discarded instead, thus saving one IO for swap out and potentially another one for swap in.

...

--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -193,6 +193,19 @@ static inline int pte_same(pte_t pte_a, pte_t pte_b)
 }
 #endif
+#ifndef __HAVE_ARCH_PTE_UNUSED
+/*
+ * Some architectures provide facilities to virtualization guests
+ * so that they can flag allocated pages as unused. This allows the
+ * host to transparently reclaim unused pages. This function returns
+ * whether the pte's page is unused.
+ */
+static inline int pte_unused(pte_t pte)
+{
+	return 0;
+}
+#endif
+
 #ifndef __HAVE_ARCH_PMD_SAME
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b)
diff --git a/mm/rmap.c b/mm/rmap.c
index cd356df..2291f25 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1234,6 +1234,16 @@ int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		}
 		set_pte_at(mm, address, pte,
 			   swp_entry_to_pte(make_hwpoison_entry(page)));
+	} else if (pte_unused(pteval)) {
+		/*
+		 * The guest indicated that the page content is of no
+		 * interest anymore. Simply discard the pte, vmscan
+		 * will take care of the rest.
+		 */
+		if (PageAnon(page))
+			dec_mm_counter(mm, MM_ANONPAGES);
+		else
+			dec_mm_counter(mm, MM_FILEPAGES);
 	} else if (PageAnon(page)) {
 		swp_entry_t entry = { .val = page_private(page) };

Obviously harmless.
Please include this in whatever tree carries [PATCH 2/2] s390/kvm: support collaborative memory management.
Re: [PATCH 04/10] powerpc: Prepare to support kernel handling of IOMMU map/unmap
On Tue, 23 Jul 2013 12:22:59 +1000 Alexey Kardashevskiy a...@ozlabs.ru wrote: Ping, anyone, please? ew, you top-posted. Ben needs ack from any of MM people before proceeding with this patch. Thanks! For what? The three lines of comment in page-flags.h? ack :) Manipulating page->_count directly is considered poor form. Don't blame us if we break your code ;) Actually, the manipulation in realmode_get_page() duplicates the existing get_page_unless_zero() and the one in realmode_put_page() could perhaps be placed in mm.h with a suitable name and some documentation. That would improve your form and might protect the code from getting broken later on.
Re: [PATCH 8/9] mm: compaction: Cache if a pageblock was scanned and no pages were isolated
On Tue, 25 Sep 2012 10:12:07 +0100 Mel Gorman mgor...@suse.de wrote: First, we'd introduce a variant of get_pageblock_migratetype() that returns all the bits for the pageblock flags and then helpers to extract either the migratetype or the PG_migrate_skip. We already are incurring the cost of get_pageblock_migratetype() so it will not be much more expensive than what is already there. If there is an allocation or free within a pageblock that has the PG_migrate_skip bit set then we increment a counter. When the counter reaches some to-be-decided threshold then compaction may clear all the bits. This would match the criteria of the clearing being based on activity. There are four potential problems with this

1. The logic to retrieve all the bits and split them up will be a little convoluted but maybe it would not be that bad.
2. The counter is a shared-writable cache line but obviously it could be moved to vmstat and incremented with inc_zone_page_state to offset the cost a little.
3. The biggest weakness is that there is no way to know if the counter is incremented based on activity in a small subset of blocks.
4. What should the threshold be?

The first problem is minor but the other three are potentially a mess. Adding another vmstat counter is bad enough in itself but if the counter is incremented based on a small subset of pageblocks, the hint becomes potentially useless. However, does this match what you have in mind or am I over-complicating things?

Sounds complicated. Using wall time really does suck. Are you sure you can't think of something more logical? How would we demonstrate the suckage? What would be the observable downside of switching that 5 seconds to 5 hours?

+	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
+		struct page *page;
+		if (!pfn_valid(pfn))
+			continue;
+
+		page = pfn_to_page(pfn);
+		if (zone != page_zone(page))
+			continue;
+
+		clear_pageblock_skip(page);
+	}

What's the worst-case loop count here?
zone->spanned_pages >> pageblock_order What's the worst-case value of (zone->spanned_pages >> pageblock_order) :) Lets take an unlikely case - 128G single-node machine. That loop count on x86-64 would be 65536. It'll be fast enough, particularly in this path. That could easily exceed a millisecond. Can/should we stick a cond_resched() in there?
Re: [PATCH 5/9] mm: compaction: Acquire the zone->lru_lock as late as possible
On Tue, 25 Sep 2012 17:13:27 +0900 Minchan Kim minc...@kernel.org wrote: I see. To me, your saying is better than current comment. I hope comment could be more explicit.

diff --git a/mm/compaction.c b/mm/compaction.c
index df01b4e..f1d2cc7 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -542,8 +542,9 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 		 * splitting and collapsing (collapsing has already happened
 		 * if PageLRU is set) but the lock is not necessarily taken
 		 * here and it is wasteful to take it just to check transhuge.
-		 * Check transhuge without lock and skip if it's either a
-		 * transhuge or hugetlbfs page.
+		 * Check transhuge without lock and *skip* if it's either a
+		 * transhuge or hugetlbfs page because it's not safe to call
+		 * compound_order.
 		 */
 		if (PageTransHuge(page)) {
 			if (!locked)

Going a bit further:

--- a/mm/compaction.c~mm-compaction-acquire-the-zone-lru_lock-as-late-as-possible-fix
+++ a/mm/compaction.c
@@ -415,7 +415,8 @@ isolate_migratepages_range(struct zone *
 		 * if PageLRU is set) but the lock is not necessarily taken
 		 * here and it is wasteful to take it just to check transhuge.
 		 * Check transhuge without lock and skip if it's either a
-		 * transhuge or hugetlbfs page.
+		 * transhuge or hugetlbfs page because calling compound_order()
+		 * requires lru_lock to exclude isolation and splitting.
 		 */
 		if (PageTransHuge(page)) {
 			if (!locked)
_

but... the requirement to hold lru_lock for compound_order() is news to me. It doesn't seem to be written down or explained anywhere, and one wonders why the cheerily undocumented compound_lock() doesn't have this effect. What's going on here??
Re: [PATCH 8/9] mm: compaction: Cache if a pageblock was scanned and no pages were isolated
On Mon, 24 Sep 2012 10:39:38 +0100 Mel Gorman mgor...@suse.de wrote: On Fri, Sep 21, 2012 at 02:36:56PM -0700, Andrew Morton wrote: Also, what has to be done to avoid the polling altogether? eg/ie, zap a pageblock's PB_migrate_skip synchronously, when something was done to that pageblock which justifies repolling it? The something event you are looking for is pages being freed or allocated in the page allocator. A movable page being allocated in block or a page being freed should clear the PB_migrate_skip bit if it's set. Unfortunately this would impact the fast path of the alloc and free paths of the page allocator. I felt that that was too high a price to pay.

We already do a similar thing in the page allocator: clearing of ->all_unreclaimable and ->pages_scanned. But that isn't on the fast path really - it happens once per pcp unload. Can we do something like that? Drop some hint into the zone without having to visit each page?

...

+static void reset_isolation_suitable(struct zone *zone)
+{
+	unsigned long start_pfn = zone->zone_start_pfn;
+	unsigned long end_pfn = zone->zone_start_pfn + zone->spanned_pages;
+	unsigned long pfn;
+
+	/*
+	 * Do not reset more than once every five seconds. If allocations are
+	 * failing sufficiently quickly to allow this to happen then continually
+	 * scanning for compaction is not going to help. The choice of five
+	 * seconds is arbitrary but will mitigate excessive scanning.
+	 */
+	if (time_before(jiffies, zone->compact_blockskip_expire))
+		return;
+	zone->compact_blockskip_expire = jiffies + (HZ * 5);
+
+	/* Walk the zone and mark every pageblock as suitable for isolation */
+	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
+		struct page *page;
+		if (!pfn_valid(pfn))
+			continue;
+
+		page = pfn_to_page(pfn);
+		if (zone != page_zone(page))
+			continue;
+
+		clear_pageblock_skip(page);
+	}

What's the worst-case loop count here?
zone->spanned_pages >> pageblock_order What's the worst-case value of (zone->spanned_pages >> pageblock_order) :)
Re: [PATCH 4/9] mm: compaction: Abort compaction loop if lock is contended or run too long
On Fri, 21 Sep 2012 11:46:18 +0100 Mel Gorman mgor...@suse.de wrote: Changelog since V2 o Fix BUG_ON triggered due to pages left on cc.migratepages o Make compact_zone_order() require non-NULL arg `contended' Changelog since V1 o only abort the compaction if lock is contended or run too long o Rearranged the code by Andrea Arcangeli. isolate_migratepages_range() might isolate no pages if for example when zone->lru_lock is contended and running asynchronous compaction. In this case, we should abort compaction, otherwise, compact_zone will run a useless loop and make zone->lru_lock even more contended. hm, this appears to be identical to

mm-compaction-abort-compaction-loop-if-lock-is-contended-or-run-too-long.patch
mm-compaction-abort-compaction-loop-if-lock-is-contended-or-run-too-long-fix.patch
mm-compaction-abort-compaction-loop-if-lock-is-contended-or-run-too-long-fix-2.patch

so I simply omitted patches 2, 3 and 4.
Re: [PATCH 6/9] mm: compaction: Acquire the zone->lock as late as possible
On Fri, 21 Sep 2012 11:46:20 +0100 Mel Gorman mgor...@suse.de wrote: Compaction's free scanner acquires the zone->lock when checking for PageBuddy pages and isolating them. It does this even if there are no PageBuddy pages in the range. This patch defers acquiring the zone lock for as long as possible. In the event there are no free pages in the pageblock then the lock will not be acquired at all which reduces contention on zone->lock.

...

--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -93,6 +93,28 @@ static inline bool compact_trylock_irqsave(spinlock_t *lock,
 	return compact_checklock_irqsave(lock, flags, false, cc);
 }
+/* Returns true if the page is within a block suitable for migration to */
+static bool suitable_migration_target(struct page *page)
+{
+

stray newline

+	int migratetype = get_pageblock_migratetype(page);
+
+	/* Don't interfere with memory hot-remove or the min_free_kbytes blocks */
+	if (migratetype == MIGRATE_ISOLATE || migratetype == MIGRATE_RESERVE)
+		return false;
+
+	/* If the page is a large free page, then allow migration */
+	if (PageBuddy(page) && page_order(page) >= pageblock_order)
+		return true;
+
+	/* If the block is MIGRATE_MOVABLE or MIGRATE_CMA, allow migration */
+	if (migrate_async_suitable(migratetype))
+		return true;
+
+	/* Otherwise skip the block */
+	return false;
+}
+
...
@@ -168,23 +193,38 @@ static unsigned long isolate_freepages_block(unsigned long blockpfn,
 		int isolated, i;
 		struct page *page = cursor;
-		if (!pfn_valid_within(blockpfn)) {
-			if (strict)
-				return 0;
-			continue;
-		}
+		if (!pfn_valid_within(blockpfn))
+			goto strict_check;
 		nr_scanned++;
-		if (!PageBuddy(page)) {
-			if (strict)
-				return 0;
-			continue;
-		}
+		if (!PageBuddy(page))
+			goto strict_check;
+
+		/*
+		 * The zone lock must be held to isolate freepages. This
+		 * unfortunately is a very coarse lock and can be
+		 * heavily contended if there are parallel allocations
+		 * or parallel compactions.
		 * For async compaction do not
+		 * spin on the lock and we acquire the lock as late as
+		 * possible.
+		 */
+		locked = compact_checklock_irqsave(&cc->zone->lock, flags,
+								locked, cc);
+		if (!locked)
+			break;
+
+		/* Recheck this is a suitable migration target under lock */
+		if (!strict && !suitable_migration_target(page))
+			break;
+
+		/* Recheck this is a buddy page under lock */
+		if (!PageBuddy(page))
+			goto strict_check;
 		/* Found a free page, break it into order-0 pages */
 		isolated = split_free_page(page);
 		if (!isolated && strict)
-			return 0;
+			goto strict_check;
 		total_isolated += isolated;
 		for (i = 0; i < isolated; i++) {
 			list_add(&page->lru, freelist);
@@ -196,9 +236,23 @@ static unsigned long isolate_freepages_block(unsigned long blockpfn,
 			blockpfn += isolated - 1;
 			cursor += isolated - 1;
 		}
+
+		continue;
+
+strict_check:
+		/* Abort isolation if the caller requested strict isolation */
+		if (strict) {
+			total_isolated = 0;
+			goto out;
+		}
 	}
 	trace_mm_compaction_isolate_freepages(nr_scanned, total_isolated);
+
+out:
+	if (locked)
+		spin_unlock_irqrestore(&cc->zone->lock, flags);
+
 	return total_isolated;
 }

A few things about this function. Would it be cleaner if we did

	if (!strict) {
		if (!suitable_migration_target(page))
			break;
	} else {
		if (!PageBuddy(page))
			goto strict_check;
	}

and then remove the test of `strict' from strict_check (which then should be renamed)? Which perhaps means that the whole `strict_check' block can go away:

	if (!strict) {
		if (!suitable_migration_target(page))
			break;
	} else {
		if (!PageBuddy(page)) {
			total_isolated = 0;
			goto out;
		}
	}

Have a think about it? The function is a little straggly. Secondly, is it correct/desirable to skip the (now misnamed
Re: [PATCH 8/9] mm: compaction: Cache if a pageblock was scanned and no pages were isolated
On Fri, 21 Sep 2012 11:46:22 +0100 Mel Gorman mgor...@suse.de wrote: When compaction was implemented it was known that scanning could potentially be excessive. The ideal was that a counter be maintained for each pageblock but maintaining this information would incur a severe penalty due to a shared writable cache line. It has reached the point where the scanning costs are a serious problem, particularly on long-lived systems where a large process starts and allocates a large number of THPs at the same time.

Instead of using a shared counter, this patch adds another bit to the pageblock flags called PG_migrate_skip. If a pageblock is scanned by either migrate or free scanner and 0 pages were isolated, the pageblock is marked to be skipped in the future. When scanning, this bit is checked before any scanning takes place and the block skipped if set.

The main difficulty with a patch like this is when to ignore the cached information? If it's ignored too often, the scanning rates will still be excessive. If the information is too stale then allocations will fail that might have otherwise succeeded. In this patch

o CMA always ignores the information
o If the migrate and free scanner meet then the cached information will be discarded if it's at least 5 seconds since the last time the cache was discarded
o If there are a large number of allocation failures, discard the cache.

The time-based heuristic is very clumsy but there are few choices for a better event. Depending solely on multiple allocation failures still allows excessive scanning when THP allocations are failing in quick succession due to memory pressure. Waiting until memory pressure is relieved would cause compaction to continually fail instead of using reclaim/compaction to try to allocate the page. The time-based mechanism is clumsy but a better option is not obvious. ick. Wall time has sooo little relationship to what's happening in there.
If we *have* to use polling, cannot we clock the poll with some metric which is at least vaguely related to the amount of activity? Number (or proportion) of pages scanned, for example? Or reset everything on the Nth trip around the zone? Or even a combination of one of these *and* of wall time, so the system will at least work harder when MM is under load. Also, what has to be done to avoid the polling altogether? eg/ie, zap a pageblock's PB_migrate_skip synchronously, when something was done to that pageblock which justifies repolling it?

...

+static void reset_isolation_suitable(struct zone *zone)
+{
+	unsigned long start_pfn = zone->zone_start_pfn;
+	unsigned long end_pfn = zone->zone_start_pfn + zone->spanned_pages;
+	unsigned long pfn;
+
+	/*
+	 * Do not reset more than once every five seconds. If allocations are
+	 * failing sufficiently quickly to allow this to happen then continually
+	 * scanning for compaction is not going to help. The choice of five
+	 * seconds is arbitrary but will mitigate excessive scanning.
+	 */
+	if (time_before(jiffies, zone->compact_blockskip_expire))
+		return;
+	zone->compact_blockskip_expire = jiffies + (HZ * 5);
+
+	/* Walk the zone and mark every pageblock as suitable for isolation */
+	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
+		struct page *page;
+		if (!pfn_valid(pfn))
+			continue;
+
+		page = pfn_to_page(pfn);
+		if (zone != page_zone(page))
+			continue;
+
+		clear_pageblock_skip(page);
+	}

What's the worst-case loop count here?

+}
+
...
Re: qemu-kvm loops after kernel update
On Tue, 18 Sep 2012 21:40:31 +0200 Jiri Slaby jsl...@suse.cz wrote: On 09/13/2012 11:59 AM, Avi Kivity wrote: On 09/12/2012 09:11 PM, Jiri Slaby wrote: On 09/12/2012 10:18 AM, Avi Kivity wrote: On 09/12/2012 11:13 AM, Jiri Slaby wrote: Please provide the output of vmxcap (http://goo.gl/c5lUO), Unrestricted guest no The big real mode fixes. and a snapshot of kvm_stat while the guest is hung.

kvm statistics

 exits              6778198  615942
 host_state_reload     1988     187
 irq_exits             1523     138
 mmu_cache_miss           4       0
 fpu_reload               1       0

Please run this as root so we get the tracepoint based output; and press 'x' when it's running so we get more detailed output.

kvm statistics

 kvm_exit                       13798699  330708
 kvm_entry                      13799110  330708
 kvm_page_fault                 13793650  330604
 kvm_exit(EXCEPTION_NMI)         6188458  330604
 kvm_exit(EXTERNAL_INTERRUPT)       2169     105
 kvm_exit(TPR_BELOW_THRESHOLD)        82       0
 kvm_exit(IO_INSTRUCTION)              6       0

Strange, it's unable to fault in the very first page. I bisected that. Note the bisection log. I have never seen something like that :D:

git bisect start
git bisect bad 3de9d1a1500472bc80478bd75e33fa9c1eba1422
git bisect good fea7a08acb13524b47711625eebea40a0ede69a0
git bisect good 95a2fe4baa1ad444df5f94bfc9416fc6b4b34cef
git bisect good f42c0d57a5a60da03c705bdea9fbba381112dd60
git bisect good 31a2e241a9e37a133278959044960c229acc5714
git bisect good f15fb01c5593fa1b58cc7a8a9c59913e2625bf2e
git bisect good 16d21ff46f5d50e311d07406c31f96916e5e8e1a
git bisect good 0b84592f458b4e8567aa7d803aff382c1d3b64fd
git bisect bad b955428e7f14cd29fe9d8059efa3ea4be679c83d
git bisect bad 20c4da4f68fcade05eda9c9b7dbad0a78cc5efe8
git bisect bad 31b90ed2a90f80fb528ac55ee357a815e1dedc36
git bisect bad b273fe14ee5b38cecc7bce94ff35a0bf9ee4
git bisect bad de426dbe9a60706b91b40397f69f819a39a06b6b
git bisect bad 6b998094ec50248e72b9f251d0607b58b18dba38
git bisect bad cf9b81d47a89f5d404a0cd8013b461617751e520

=== 8 ===

Reverting cf9b81d47a89 (mm: wrap calls to set_pte_at_notify with invalidate_range_start and invalidate_range_end) on
the top of today's -next fixes the issue. hm, thanks. This will probably take some time to resolve so I think I'll drop

mm-move-all-mmu-notifier-invocations-to-be-done-outside-the-pt-lock.patch
mm-move-all-mmu-notifier-invocations-to-be-done-outside-the-pt-lock-fix.patch
mm-move-all-mmu-notifier-invocations-to-be-done-outside-the-pt-lock-fix-fix.patch
mm-wrap-calls-to-set_pte_at_notify-with-invalidate_range_start-and-invalidate_range_end.patch
Re: qemu-kvm loops after kernel update
On Wed, 19 Sep 2012 10:00:34 +1000 Stephen Rothwell s...@canb.auug.org.au wrote: Hi Andrew, On Tue, 18 Sep 2012 12:46:46 -0700 Andrew Morton a...@linux-foundation.org wrote: hm, thanks. This will probably take some time to resolve so I think I'll drop

mm-move-all-mmu-notifier-invocations-to-be-done-outside-the-pt-lock.patch
mm-move-all-mmu-notifier-invocations-to-be-done-outside-the-pt-lock-fix.patch
mm-move-all-mmu-notifier-invocations-to-be-done-outside-the-pt-lock-fix-fix.patch
mm-wrap-calls-to-set_pte_at_notify-with-invalidate_range_start-and-invalidate_range_end.patch

Should I attempt to remove these from the akpm tree in linux-next today? That would be best - there's no point in having people test (and debug) dead stuff. Or should I just wait for a new mmotm? You could be brave and test http://ozlabs.org/~akpm/mmots/ for me :)
Re: [PATCH] mm: mmu_notifier: fix inconsistent memory between secondary MMU and host
On Wed, 22 Aug 2012 18:29:55 +0200 Andrea Arcangeli aarca...@redhat.com wrote: On Wed, Aug 22, 2012 at 02:03:41PM +0800, Xiao Guangrong wrote: On 08/21/2012 11:06 PM, Andrea Arcangeli wrote: CPU0 CPU1 oldpage[1] == 0 (both guest & host) oldpage[0] = 1 trigger do_wp_page We always do ptep_clear_flush before set_pte_at_notify(), at this point, we have done: pte = 0 and flush all tlbs mmu_notifier_change_pte spte = newpage + writable guest does newpage[1] = 1 vmexit host read oldpage[1] == 0 It can not happen, at this point pte = 0, host can not access oldpage anymore, host read can generate #PF, it will be blocked on page table lock until CPU 0 releases the lock. Agreed, this is why your fix is safe. ... Thanks a lot for fixing this subtle race! I'll take that as an ack. Unfortunately we weren't told the user-visible effects of the bug, which often makes it hard to determine which kernel versions should be patched. Please do always provide this information when fixing a bug.
Re: [PATCH] mm: mmu_notifier: fix inconsistent memory between secondary MMU and host
On Wed, 22 Aug 2012 21:50:43 +0200 Andrea Arcangeli aarca...@redhat.com wrote: Hi Andrew, On Wed, Aug 22, 2012 at 12:15:35PM -0700, Andrew Morton wrote: On Wed, 22 Aug 2012 18:29:55 +0200 Andrea Arcangeli aarca...@redhat.com wrote: On Wed, Aug 22, 2012 at 02:03:41PM +0800, Xiao Guangrong wrote: On 08/21/2012 11:06 PM, Andrea Arcangeli wrote: CPU0 CPU1 oldpage[1] == 0 (both guest & host) oldpage[0] = 1 trigger do_wp_page We always do ptep_clear_flush before set_pte_at_notify(), at this point, we have done: pte = 0 and flush all tlbs mmu_notifier_change_pte spte = newpage + writable guest does newpage[1] = 1 vmexit host read oldpage[1] == 0 It can not happen, at this point pte = 0, host can not access oldpage anymore, host read can generate #PF, it will be blocked on page table lock until CPU 0 releases the lock. Agreed, this is why your fix is safe. ... Thanks a lot for fixing this subtle race! I'll take that as an ack. Yes thanks! I'd also like a comment that explains why in that case the order is reversed. The reverse order immediately rings an alarm bell otherwise ;). But the comment can be added with an incremental patch. If you can suggest some text I'll type it in right now. Unfortunately we weren't told the user-visible effects of the bug, which often makes it hard to determine which kernel versions should be patched. Please do always provide this information when fixing a bug. This is best answered by Xiao who said it's a testcase triggering this. It requires the guest reading memory on CPU0 while the host writes to the same memory on CPU1, while CPU2 triggers the copy on write fault on another part of the same page (slightly before CPU1 writes). The host writes of CPU1 would need to happen in a microsecond window, and they wouldn't be immediately propagated to the guest in CPU0.
They would still appear in the guest but with a microsecond delay (the guest has the spte mapped readonly when this happens so it's only a guest microsecond delayed reading problem as far as I can tell). I guess most of the time it would fall into the undefined by timing scenario so it's hard to tell how the side effect could escalate. OK, thanks. For this sort of thing I am dependent upon KVM people to suggest whether the fix should be in 3.6 and whether -stable needs it.
Re: [PATCH] KVM: Avoid wasting pages for small lpage_info arrays
On Tue, 15 May 2012 11:02:17 +0300 Avi Kivity a...@redhat.com wrote: On 05/14/2012 04:29 PM, Takuya Yoshikawa wrote: On Sun, 13 May 2012 13:20:46 +0300 Avi Kivity a...@redhat.com wrote: I don't feel that the savings is worth the extra complication. We save two pages per memslot here. Using a 4KB vmalloced page for a 16B array is ... Actually I felt like you before and did not do this, but recently there was a talk about creating hundreds of memslots. What about using kvmalloc() instead of vmalloc()? It's in security/apparmor now, but can be made generic. Andrew once, maybe some times, rejected making such an API generic saying that there should not be a generic criterion by which we can decide which function - vmalloc() or kmalloc() - to use. So each caller should decide by its own criteria. In this case, we need to implement kvm specific kvmalloc(). BTW, we are already doing this for dirty_bitmap. Okay, a local kvmalloc() is better than open-coding the logic. Andrew, prepare yourself for some code duplication. There are reasons for avoiding vmalloc(). The kernel does not run in a virtual memory environment. It is a harsh, low-level environment and kernel code should be robust. Assuming that you can allocate vast amounts of contiguous memory is not robust. Robust code will implement data structures which avoid this weakness.
Re: [PATCH] bitops: add _local bitops
On Wed, 9 May 2012 16:45:29 +0300 Michael S. Tsirkin m...@redhat.com wrote: kvm needs to update some hypervisor variables atomically in a sense that the operation can't be interrupted in the middle. However the hypervisor always runs on the same CPU so it does not need any memory barrier or lock prefix. Well. It adds more complexity, makes the kernel harder to understand and maintain and introduces more opportunities for developers to add bugs. So from that point of view, the best way of handling this patch is to delete it. Presumably the patch offers some benefit to offset all those costs. But you didn't tell us what that benefit is, so we cannot make a decision. IOW: numbers, please. Convincing ones, for realistic test cases. Secondly: can KVM just use __set_bit() and friends? I suspect those interfaces happen to meet your requirements. At least on architectures you care about.
Re: [PATCH] kvm: don't call mmu_shrinker w/o used_mmu_pages
On Fri, 13 Apr 2012 15:38:41 -0700 Ying Han ying...@google.com wrote: The mmu_shrink() is heavy by itself by iterating all kvms and holding the kvm_lock. spotted the code w/ Rik during LSF, and it turns out we don't need to call the shrinker if nothing to shrink. We should probably tell the kvm maintainers about this ;)

--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -188,6 +188,11 @@ static u64 __read_mostly shadow_mmio_mask;
 static void mmu_spte_set(u64 *sptep, u64 spte);
+static inline int get_kvm_total_used_mmu_pages()
+{
+	return percpu_counter_read_positive(&kvm_total_used_mmu_pages);
+}
+
 void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask)
 {
 	shadow_mmio_mask = mmio_mask;
@@ -3900,6 +3905,9 @@ static int mmu_shrink(struct shrinker *shrink, struct shrink_control *sc)
 	if (nr_to_scan == 0)
 		goto out;
+	if (!get_kvm_total_used_mmu_pages())
+		return 0;
+
 	raw_spin_lock(&kvm_lock);
 	list_for_each_entry(kvm, &vm_list, vm_list) {
@@ -3926,7 +3934,7 @@ static int mmu_shrink(struct shrinker *shrink, struct shrink_control *sc)
 	raw_spin_unlock(&kvm_lock);
 out:
-	return percpu_counter_read_positive(&kvm_total_used_mmu_pages);
+	return get_kvm_total_used_mmu_pages();
 }
 static struct shrinker mmu_shrinker = {

There's a small functional change: percpu_counter_read_positive() is an approximate thing, so there will be cases where there will be some pages which are accounted for only in the percpu_counter's per-cpu accumulators. In that case mmu_shrink() will bale out when there are in fact some freeable pages available. This is hopefully unimportant. Do we actually know that this patch helps anything? Any measurements? Is kvm_total_used_mmu_pages==0 at all common?
Re: [PATCH 0/3] Unmapped page cache control (v5)
On Wed, 30 Mar 2011 11:00:26 +0530 Balbir Singh bal...@linux.vnet.ibm.com wrote: Data from the previous patchsets can be found at https://lkml.org/lkml/2010/11/30/79 It would be nice if the data for the current patchset was present in the current patchset's changelog!
Re: [PATCH 3/3] Provide control over unmapped pages (v5)
On Wed, 30 Mar 2011 11:02:38 +0530 Balbir Singh bal...@linux.vnet.ibm.com wrote: Changelog v4 1. Added documentation for max_unmapped_pages 2. Better #ifdef'ing of max_unmapped_pages and min_unmapped_pages Changelog v2 1. Use a config option to enable the code (Andrew Morton) 2. Explain the magic tunables in the code or at-least attempt to explain them (General comment) 3. Hint uses of the boot parameter with unlikely (Andrew Morton) 4. Use better names (balanced is not a good naming convention) Provide control using zone_reclaim() and a boot parameter. The code reuses functionality from zone_reclaim() to isolate unmapped pages and reclaim them as a priority, ahead of other mapped pages. This: akpm:/usr/src/25 grep '^+#' patches/provide-control-over-unmapped-pages-v5.patch +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA) +#endif +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) +#endif +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA) +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) +#endif +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) +#else +#endif +#ifdef CONFIG_NUMA +#else +#define zone_reclaim_mode 0 +#endif +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA) +#endif +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) +#endif +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA) +#endif +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) +#endif +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA) +#endif +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) +#endif +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) +#else /* !CONFIG_UNMAPPED_PAGECACHE_CONTROL */ +#endif +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA) +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) +#endif +#endif +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) is getting out of control. What happens if we just make the feature non-configurable? 
+static int __init unmapped_page_control_parm(char *str) +{ + unmapped_page_control = 1; + /* + * XXX: Should we tweak swappiness here? + */ + return 1; +} +__setup("unmapped_page_control", unmapped_page_control_parm); That looks like a pain - it requires a reboot to change the option, which makes testing harder and slower. Methinks you're being a bit virtualization-centric here! +#else /* !CONFIG_UNMAPPED_PAGECACHE_CONTROL */ +static inline void reclaim_unmapped_pages(int priority, + struct zone *zone, struct scan_control *sc) +{ + return 0; +} +#endif + static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone, struct scan_control *sc) { @@ -2371,6 +2394,12 @@ loop_again: shrink_active_list(SWAP_CLUSTER_MAX, zone, &sc, priority, 0); + /* + * We do unmapped page reclaim once here and once + * below, so that we don't lose out + */ + reclaim_unmapped_pages(priority, zone, &sc); Doing this here seems wrong. balance_pgdat() does two passes across the zones. The first pass is a read-only work-out-what-to-do pass and the second pass is a now-reclaim-some-stuff pass. But here we've stuck a do-some-reclaiming operation inside the first, work-out-what-to-do pass. @@ -2408,6 +2437,11 @@ loop_again: continue; sc.nr_scanned = 0; + /* + * Reclaim unmapped pages upfront, this should be + * really cheap Comment is mysterious. Why is it cheap? + */ + reclaim_unmapped_pages(priority, zone, &sc); I dunno, the whole thing seems rather nasty to me. It sticks a magical reclaim-unmapped-pages operation right in the middle of regular page reclaim. This means that reclaim will walk the LRU looking at mapped and unmapped pages. Then it will walk some more, looking at only unmapped pages and moving the mapped ones to the head of the LRU. Then it goes back to looking at mapped and unmapped pages. So it rather screws up the LRU ordering and page aging, does it not? Also, the special-case handling sticks out like a sore thumb. 
Would it not be better to manage the mapped/unmapped bias within the core of the regular scanning? ie: in shrink_page_list().
Re: [PATCH 0/3] Unmapped page cache control (v5)
On Thu, 31 Mar 2011 10:57:03 +0530 Balbir Singh bal...@linux.vnet.ibm.com wrote: * Andrew Morton a...@linux-foundation.org [2011-03-30 16:36:07]: On Wed, 30 Mar 2011 11:00:26 +0530 Balbir Singh bal...@linux.vnet.ibm.com wrote: Data from the previous patchsets can be found at https://lkml.org/lkml/2010/11/30/79 It would be nice if the data for the current patchset was present in the current patchset's changelog! Sure, since there were no major changes, I put in a URL. The main change was the documentation update. Well some poor schmuck has to copy and paste the data into the changelog so it's still there in five years time. It's better to carry this info around in the patch's own metadata, and to maintain and update it.
Re: [RFC PATCH 1/3] Weight-balanced tree
On Tue, 22 Feb 2011 11:55:05 -0700 Alex Williamson alex.william...@redhat.com wrote: Signed-off-by: Alex Williamson alex.william...@redhat.com --- include/linux/wbtree.h | 55 lib/Makefile |3 + lib/wbtree.c | 170 3 files changed, 227 insertions(+), 1 deletions(-) create mode 100644 include/linux/wbtree.h create mode 100644 lib/wbtree.c diff --git a/include/linux/wbtree.h b/include/linux/wbtree.h new file mode 100644 index 000..ef70f6f --- /dev/null +++ b/include/linux/wbtree.h @@ -0,0 +1,55 @@ +/* + * Weight-balanced binary tree + * + * The balance of this tree is based on search probability. The + * heaviest weighted nodes (the ones we're most likely to hit), are + * at the top of each subtree. + * + * Copyright (C) 2011 Red Hat, Inc. + * + * Authors: + * Alex Williamson alex.william...@redhat.com + * + * This work is licensed under the terms of the GNU GPL, version 2. See + * the COPYING file in the top-level directory. + */ +#ifndef _LINUX_WBTREE_H +#define _LINUX_WBTREE_H + +#include <linux/kernel.h> +#include <linux/stddef.h> + +struct wb_node { + struct wb_node *wb_parent; + struct wb_node *wb_left; + struct wb_node *wb_right; + unsigned long wb_weight; +}; + +struct wb_root { + struct wb_node *wb_node; +}; + +#define WB_ROOT (struct wb_root) { NULL, } +#define wb_entry(ptr, type, member) container_of(ptr, type, member) Please implement this as a C function. That way we get typechecking which container_of() is unable to provide. Otherwise the code looks OK to me, apart from the total lack of documentation. Please document at least all the exported interfaces. The documentation should be designed to tell people how to maintain this code, and how to use this code reliably and well from other subsystems. Don't fall into the trap of just dumbly filling out templates. The documentation should also carefully explain the locking requirements which these functions place upon callers. (rwlocks used how? rcu?) 
Once that is all squared away, please merge the code via the KVM tree.
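The typechecking point can be illustrated with a sketch: instead of the generic wb_entry() container_of macro, each user of the tree supplies a small typed helper, so passing the wrong pointer type becomes a compile error. The kvm_memslot_entry structure below is hypothetical, invented here purely as an example embedder of a wb_node.

```c
#include <stddef.h>

struct wb_node {
	struct wb_node *wb_parent, *wb_left, *wb_right;
	unsigned long wb_weight;
};

/* Hypothetical user of the tree: embeds a wb_node in its own struct. */
struct kvm_memslot_entry {
	unsigned long base_gfn;
	struct wb_node node;
};

/* The typed wrapper being asked for: the compiler checks that the
 * argument really is a struct wb_node *, and the return type is
 * concrete -- neither is true of a bare container_of() macro. */
static inline struct kvm_memslot_entry *memslot_from_wb(struct wb_node *n)
{
	return (struct kvm_memslot_entry *)((char *)n -
			offsetof(struct kvm_memslot_entry, node));
}
```

Each embedding type writes its own two-line helper; the generic macro can stay for the tree's internal use.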
Re: [PATCH 3/3][RESEND] Provide control over unmapped pages (v4)
On Tue, 01 Feb 2011 22:25:45 +0530 Balbir Singh bal...@linux.vnet.ibm.com wrote: Changelog v4 1. Add max_unmapped_ratio and use that as the upper limit to check when to shrink the unmapped page cache (Christoph Lameter) Changelog v2 1. Use a config option to enable the code (Andrew Morton) 2. Explain the magic tunables in the code or at-least attempt to explain them (General comment) 3. Hint uses of the boot parameter with unlikely (Andrew Morton) 4. Use better names (balanced is not a good naming convention) Provide control using zone_reclaim() and a boot parameter. The code reuses functionality from zone_reclaim() to isolate unmapped pages and reclaim them as a priority, ahead of other mapped pages. A new sysctl for max_unmapped_ratio is provided and set to 16, indicating that when 16% of the total zone pages are unmapped, we start shrinking unmapped page cache. We'll need some documentation for sysctl_max_unmapped_ratio, please. In Documentation/sysctl/vm.txt, I suppose. It will be interesting to find out what this ratio refers to. It appears to be a percentage. We've had problems in the past where 1% was way too much and we had to change the kernel to provide much finer-grained control. ... --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -306,7 +306,10 @@ struct zone { /* * zone reclaim becomes active if more unmapped pages exist. */ +#if defined(CONFIG_UNMAPPED_PAGE_CONTROL) || defined(CONFIG_NUMA) unsigned long min_unmapped_pages; + unsigned long max_unmapped_pages; +#endif This change breaks the connection between min_unmapped_pages and its documentation, and fails to document max_unmapped_pages. Also, afaict if CONFIG_NUMA=y and CONFIG_UNMAPPED_PAGE_CONTROL=n, max_unmapped_pages will be present in the kernel image and will appear in /proc but it won't actually do anything. Seems screwed up and misleading. ... 
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) +/* + * Routine to reclaim unmapped pages, inspired from the code under + * CONFIG_NUMA that does unmapped page and slab page control by keeping + * min_unmapped_pages in the zone. We currently reclaim just unmapped + * pages, slab control will come in soon, at which point this routine + * should be called reclaim cached pages + */ +unsigned long reclaim_unmapped_pages(int priority, struct zone *zone, + struct scan_control *sc) +{ + if (unlikely(unmapped_page_control) && + (zone_unmapped_file_pages(zone) > zone->min_unmapped_pages)) { + struct scan_control nsc; + unsigned long nr_pages; + + nsc = *sc; + + nsc.swappiness = 0; + nsc.may_writepage = 0; + nsc.may_unmap = 0; + nsc.nr_reclaimed = 0; + + nr_pages = zone_unmapped_file_pages(zone) - + zone->min_unmapped_pages; + /* + * We don't want to be too aggressive with our + * reclaim, it is our best effort to control + * unmapped pages + */ + nr_pages >>= 3; + + zone_reclaim_pages(zone, &nsc, nr_pages); + return nsc.nr_reclaimed; + } + return 0; +} This returns an undocumented ulong which is never used by callers.
Re: [PATCH 1/2] mm, Make __get_user_pages return -EHWPOISON for HWPOISON page optionally
On Thu, 27 Jan 2011 15:29:05 +0800 Huang Ying ying.hu...@intel.com wrote: Hi, Andrew, On Thu, 2011-01-20 at 23:50 +0800, Marcelo Tosatti wrote: On Mon, Jan 17, 2011 at 08:47:39AM +0800, Huang Ying wrote: Hi, Andrew, On Sun, 2011-01-16 at 23:35 +0800, Avi Kivity wrote: On 01/14/2011 03:37 AM, Huang Ying wrote: On Thu, 2011-01-13 at 18:43 +0800, Avi Kivity wrote: On 01/13/2011 10:42 AM, Huang Ying wrote: Make __get_user_pages return -EHWPOISON for HWPOISON page only if FOLL_HWPOISON is specified. With this patch, the interested callers can distinguish HWPOISON page from general FAULT page, while other callers will still get -EFAULT for pages, so the user space interface need not to be changed. get_user_pages_hwpoison is added as a variant of get_user_pages that can return -EHWPOISON for HWPOISON page. This feature is needed by KVM, where UCR MCE should be relayed to guest for HWPOISON page, while instruction emulation and MMIO will be tried for general FAULT page. The idea comes from Andrew Morton. Signed-off-by: Huang Yingying.hu...@intel.com Cc: Andrew Mortona...@linux-foundation.org +#ifdef CONFIG_MEMORY_FAILURE +int get_user_pages_hwpoison(struct task_struct *tsk, struct mm_struct *mm, + unsigned long start, int nr_pages, int write, + int force, struct page **pages, + struct vm_area_struct **vmas); +#else Since we'd also like to add get_user_pages_noio(), perhaps adding a flags field to get_user_pages() is better. Or export __get_user_pages()? That's better, yes. Do you think it is a good idea to export __get_user_pages() instead of adding get_user_pages_noio() and get_user_pages_hwpoison()? Better Andrew and/or Linus should decide it. We really need your comments about this. Export __get_user_pages(), I guess. We don't need hwpoison/kvm-specific stuff in mm/memory.c and get_user_pages_hwpoison()/get_user_pages_noio() are very thin wrappers. 
The number of args to these functions is getting nutty - you'll probably find that it is beneficial to inline these wrapper functions, if the number of callsites is small.
Re: [PATCH 1/2] mm, Make __get_user_pages return -EHWPOISON for HWPOISON page optionally
On Fri, 28 Jan 2011 01:57:11 +0100 Andi Kleen a...@firstfloor.org wrote: I personally would consider it cleaner to have clearly defined wrappers instead of complicated flags in the caller. The number of args to these functions is getting nutty - you'll probably find that it is beneficial to inline these wrapper functions, if the number of callsites is small. Really the functions are so heavy weight it should not matter. The problem with inlining is that you end up with the code in the header file and I personally find that much harder to browse instead of having everything in one file. You'll cope. Also longer term we'll get compilers that can do cross-file inlining for optimized builds. Which we'll probably need to turn off all over the place :( So please better avoid these kinds of micro optimizations unless it's a really really extremely speed critical path. It's not just speed and it's not just .text size, either. Calling a ten-arg function consumes stack space.
Re: [RFC 1/3] mm, Make __get_user_pages return -EHWPOISON for HWPOISON page optionally
On Wed, 22 Dec 2010 10:51:55 +0800 Huang Ying ying.hu...@intel.com wrote: Make __get_user_pages return -EHWPOISON for HWPOISON page only if FOLL_HWPOISON is specified. With this patch, the interested callers can distinguish HWPOISON page from general FAULT page, while other callers will still get -EFAULT for pages, so the user space interface need not be changed. get_user_pages_hwpoison is added as a variant of get_user_pages that can return -EHWPOISON for HWPOISON page. This feature is needed by KVM, where UCR MCE should be relayed to guest for HWPOISON page, while instruction emulation and MMIO will be tried for general FAULT page. The idea comes from Andrew Morton. hm, I don't remember that. I suspect it came from someone else? Signed-off-by: Huang Ying ying.hu...@intel.com Cc: Andrew Morton a...@linux-foundation.org --- include/asm-generic/errno.h |2 + include/linux/mm.h |5 mm/memory.c | 53 +--- 3 files changed, 57 insertions(+), 3 deletions(-) --- a/include/asm-generic/errno.h +++ b/include/asm-generic/errno.h @@ -108,4 +108,6 @@ #define ERFKILL 132 /* Operation not possible due to RF-kill */ +#define EHWPOISON 133 /* Memory page has hardware error */ So this can be propagated to userspace? If so, which syscalls might return EHWPOISON and are there any manpage implications? 
#endif --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -860,6 +860,10 @@ int get_user_pages(struct task_struct *t struct page **pages, struct vm_area_struct **vmas); int get_user_pages_fast(unsigned long start, int nr_pages, int write, struct page **pages); +int get_user_pages_hwpoison(struct task_struct *tsk, struct mm_struct *mm, + unsigned long start, int nr_pages, int write, + int force, struct page **pages, + struct vm_area_struct **vmas); struct page *get_dump_page(unsigned long addr); extern int try_to_release_page(struct page * page, gfp_t gfp_mask); @@ -1415,6 +1419,7 @@ struct page *follow_page(struct vm_area_ #define FOLL_GET 0x04 /* do get_page on page */ #define FOLL_DUMP 0x08 /* give error on hole if it would be zero */ #define FOLL_FORCE 0x10 /* get_user_pages read/write w/o permission */ +#define FOLL_HWPOISON 0x20 /* check page is hwpoisoned */ typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr, void *data); --- a/mm/memory.c +++ b/mm/memory.c @@ -1449,9 +1449,16 @@ int __get_user_pages(struct task_struct if (ret & VM_FAULT_ERROR) { if (ret & VM_FAULT_OOM) return i ? i : -ENOMEM; - if (ret & - (VM_FAULT_HWPOISON|VM_FAULT_HWPOISON_LARGE| - VM_FAULT_SIGBUS)) + if (ret & (VM_FAULT_HWPOISON | + VM_FAULT_HWPOISON_LARGE)) { + if (i) + return i; + else if (gup_flags & FOLL_HWPOISON) + return -EHWPOISON; + else + return -EFAULT; + } + if (ret & VM_FAULT_SIGBUS) hm, that function is getting a bit unwieldy. /** + * get_user_pages_hwpoison() - pin user pages in memory, return hwpoison status + * @tsk: task_struct of target task + * @mm: mm_struct of target mm + * @start: starting user address + * @nr_pages: number of pages from start to pin + * @write: whether pages will be written to by the caller + * @force: whether to force write access even if user mapping is + * readonly. This will result in the page being COWed even + * in MAP_SHARED mappings. You do not want this. + * @pages: array that receives pointers to the pages pinned. 
+ * Should be at least nr_pages long. Or NULL, if caller + * only intends to ensure the pages are faulted in. + * @vmas: array of pointers to vmas corresponding to each page. + * Or NULL if the caller does not require them. + * + * Returns number of pages pinned. + * + * If the page table or memory page is hwpoisoned, return -EHWPOISON. + * + * Otherwise, same as get_user_pages. + */ +int get_user_pages_hwpoison(struct task_struct *tsk, struct mm_struct *mm, + unsigned long start, int nr_pages, int write
Re: [RFC 1/3] mm, Make __get_user_pages return -EHWPOISON for HWPOISON page optionally
On Thu, 23 Dec 2010 08:39:59 +0800 Huang Ying ying.hu...@intel.com wrote: --- a/mm/memory.c +++ b/mm/memory.c @@ -1449,9 +1449,16 @@ int __get_user_pages(struct task_struct if (ret & VM_FAULT_ERROR) { if (ret & VM_FAULT_OOM) return i ? i : -ENOMEM; - if (ret & - (VM_FAULT_HWPOISON|VM_FAULT_HWPOISON_LARGE| - VM_FAULT_SIGBUS)) + if (ret & (VM_FAULT_HWPOISON | + VM_FAULT_HWPOISON_LARGE)) { + if (i) + return i; + else if (gup_flags & FOLL_HWPOISON) + return -EHWPOISON; + else + return -EFAULT; + } + if (ret & VM_FAULT_SIGBUS) hm, that function is getting a bit unwieldy. Yes. Do you think the following code is better? return i ? i : (gup_flags & FOLL_HWPOISON) ? -EHWPOISON : -EFAULT; nope ;) The function just needs to be ripped up and redone somehow. That's not an appropriate activity in the context of this patch though.
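One possible shape for the rip-it-up-and-redo cleanup is to pull the fault-flag-to-errno mapping out into a small helper instead of growing the nested conditionals. This is only a sketch: the VM_FAULT_* and FOLL_* values below are illustrative userspace stand-ins, not the kernel's definitions, and fault_error_to_errno() is an invented name.

```c
#include <errno.h>

/* Illustrative stand-ins for the kernel's fault/gup flags. */
#define VM_FAULT_OOM            0x01
#define VM_FAULT_SIGBUS         0x02
#define VM_FAULT_HWPOISON       0x04
#define VM_FAULT_HWPOISON_LARGE 0x08
#define FOLL_HWPOISON           0x20

#ifndef EHWPOISON
#define EHWPOISON 133	/* matches the errno the patch introduces */
#endif

/* Map a VM_FAULT_* error to what __get_user_pages() should return,
 * given how many pages were already pinned (partial success wins). */
static int fault_error_to_errno(unsigned int ret, unsigned int gup_flags,
				int pages_done)
{
	if (pages_done)
		return pages_done;
	if (ret & VM_FAULT_OOM)
		return -ENOMEM;
	if (ret & (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE))
		return (gup_flags & FOLL_HWPOISON) ? -EHWPOISON : -EFAULT;
	return -EFAULT;	/* VM_FAULT_SIGBUS and friends */
}
```

The caller then collapses to a single `return fault_error_to_errno(ret, gup_flags, i);`, which keeps the policy in one greppable place.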
Re: [PATCH 1/3] Move zone_reclaim() outside of CONFIG_NUMA
On Tue, 30 Nov 2010 15:45:12 +0530 Balbir Singh bal...@linux.vnet.ibm.com wrote: This patch moves zone_reclaim and associated helpers outside CONFIG_NUMA. This infrastructure is reused in the patches for page cache control that follow. Thereby adding a nice dollop of bloat to everyone's kernel. I don't think that is justifiable given that the audience for this feature is about eight people :( How's about CONFIG_UNMAPPED_PAGECACHE_CONTROL? Also this patch instantiates sysctl_min_unmapped_ratio and sysctl_min_slab_ratio on non-NUMA builds but fails to make those tunables actually tunable in procfs. Changes to sysctl.c are needed. Reviewed-by: Christoph Lameter c...@linux.com More careful reviewers, please.
Re: [PATCH 3/3] Provide control over unmapped pages
On Tue, 30 Nov 2010 15:46:31 +0530 Balbir Singh bal...@linux.vnet.ibm.com wrote: Provide control using zone_reclaim() and a boot parameter. The code reuses functionality from zone_reclaim() to isolate unmapped pages and reclaim them as a priority, ahead of other mapped pages. Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com --- include/linux/swap.h |5 ++- mm/page_alloc.c |7 +++-- mm/vmscan.c | 72 +- 3 files changed, 79 insertions(+), 5 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index eba53e7..78b0830 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -252,11 +252,12 @@ extern int vm_swappiness; extern int remove_mapping(struct address_space *mapping, struct page *page); extern long vm_total_pages; -#ifdef CONFIG_NUMA -extern int zone_reclaim_mode; extern int sysctl_min_unmapped_ratio; extern int sysctl_min_slab_ratio; This change will need to be moved into the first patch. extern int zone_reclaim(struct zone *, gfp_t, unsigned int); +extern bool should_balance_unmapped_pages(struct zone *zone); +#ifdef CONFIG_NUMA +extern int zone_reclaim_mode; #else #define zone_reclaim_mode 0 static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 62b7280..4228da3 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1662,6 +1662,9 @@ zonelist_scan: unsigned long mark; int ret; + if (should_balance_unmapped_pages(zone)) + wakeup_kswapd(zone, order); gack, this is on the page allocator fastpath, isn't it? So 99.% of the world's machines end up doing a pointless call to a pointless function which pointlessly tests a pointless global and pointlessly returns? All because of some whacky KSM thing? The speed and space overhead of this code should be *zero* if !CONFIG_UNMAPPED_PAGECACHE_CONTROL and should be minimal if CONFIG_UNMAPPED_PAGECACHE_CONTROL=y. 
The way to do the latter is to inline the test of unmapped_page_control into callers and only if it is true (and use unlikely(), please) do we call into the KSM gunk. --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -145,6 +145,21 @@ static DECLARE_RWSEM(shrinker_rwsem); #define scanning_global_lru(sc) (1) #endif +static unsigned long balance_unmapped_pages(int priority, struct zone *zone, + struct scan_control *sc); +static int unmapped_page_control __read_mostly; + +static int __init unmapped_page_control_parm(char *str) +{ + unmapped_page_control = 1; + /* + * XXX: Should we tweak swappiness here? + */ + return 1; +} +__setup("unmapped_page_control", unmapped_page_control_parm); aw c'mon guys, everybody knows that when you add a kernel parameter you document it in Documentation/kernel-parameters.txt. static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone, struct scan_control *sc) { @@ -2223,6 +2238,12 @@ loop_again: shrink_active_list(SWAP_CLUSTER_MAX, zone, &sc, priority, 0); + /* + * We do unmapped page balancing once here and once + * below, so that we don't lose out + */ + balance_unmapped_pages(priority, zone, &sc); + if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone), 0, 0)) { end_zone = i; @@ -2258,6 +2279,11 @@ loop_again: continue; sc.nr_scanned = 0; + /* + * Balance unmapped pages upfront, this should be + * really cheap + */ + balance_unmapped_pages(priority, zone, &sc); More unjustifiable overhead on a commonly-executed codepath. /* * Call soft limit reclaim before calling shrink_zone. 
@@ -2491,7 +2517,8 @@ void wakeup_kswapd(struct zone *zone, int order) pgdat->kswapd_max_order = order; if (!waitqueue_active(&pgdat->kswapd_wait)) return; - if (zone_watermark_ok_safe(zone, order, low_wmark_pages(zone), 0, 0)) + if (zone_watermark_ok_safe(zone, order, low_wmark_pages(zone), 0, 0) && + !should_balance_unmapped_pages(zone)) return; trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order); @@ -2740,6 +2767,49 @@ zone_reclaim_unmapped_pages(struct zone *zone, struct scan_control *sc, } /* + * Routine to balance unmapped pages, inspired from the code
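The zero-overhead shape being asked for can be sketched in userspace C: the call site tests a (normally zero) flag inline, and only then calls the out-of-line slow path. unlikely() is redefined below because this is not kernel code, and the variable and function names are stand-ins for the patch's.

```c
#define unlikely(x) __builtin_expect(!!(x), 0)

static int unmapped_page_control;	/* boot parameter; almost always 0 */
static int slowpath_calls;		/* stands in for the real reclaim work */

/* Out-of-line slow path: only reached when the feature is enabled. */
static void balance_unmapped_pages_slowpath(void)
{
	slowpath_calls++;
}

/* Inlined at every call site: the fast path is a single well-predicted
 * test on a global, with no function call and no stack traffic. */
static inline void maybe_balance_unmapped_pages(void)
{
	if (unlikely(unmapped_page_control))
		balance_unmapped_pages_slowpath();
}
```

With the config option off, the wrapper can compile to nothing at all, which is the "*zero* overhead" case demanded above.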
Re: [PATCH 1/2] Add vzalloc shortcut
On Sat, 16 Oct 2010 12:33:31 +0800 Dave Young hidave.darks...@gmail.com wrote: Add vzalloc for convenience of the vmalloc-then-memset-zero case Use __GFP_ZERO in vzalloc to zero fill the allocated memory. Signed-off-by: Dave Young hidave.darks...@gmail.com --- include/linux/vmalloc.h |1 + mm/vmalloc.c| 13 + 2 files changed, 14 insertions(+) --- linux-2.6.orig/include/linux/vmalloc.h 2010-08-22 15:31:38.0 +0800 +++ linux-2.6/include/linux/vmalloc.h 2010-10-16 10:50:54.739996121 +0800 @@ -53,6 +53,7 @@ static inline void vmalloc_init(void) #endif extern void *vmalloc(unsigned long size); +extern void *vzalloc(unsigned long size); extern void *vmalloc_user(unsigned long size); extern void *vmalloc_node(unsigned long size, int node); extern void *vmalloc_exec(unsigned long size); --- linux-2.6.orig/mm/vmalloc.c 2010-08-22 15:31:39.0 +0800 +++ linux-2.6/mm/vmalloc.c 2010-10-16 10:51:57.126665918 +0800 @@ -1604,6 +1604,19 @@ void *vmalloc(unsigned long size) EXPORT_SYMBOL(vmalloc); /** + * vzalloc - allocate virtually contiguous memory with zero filled s/filled/fill/ + * @size: allocation size + * Allocate enough pages to cover @size from the page level + * allocator and map them into contiguous kernel virtual space. + */ +void *vzalloc(unsigned long size) +{ + return __vmalloc_node(size, 1, GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO, + PAGE_KERNEL, -1, __builtin_return_address(0)); +} +EXPORT_SYMBOL(vzalloc); We'd need to add the same interface to nommu, please. Also, a slightly better implementation would be static inline void *__vmalloc_node_flags(unsigned long size, gfp_t flags) { return __vmalloc_node(size, 1, flags, PAGE_KERNEL, -1, __builtin_return_address(0)); } void *vzalloc(unsigned long size) { return __vmalloc_node_flags(size, GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO); } void *vmalloc(unsigned long size) { return __vmalloc_node_flags(size, GFP_KERNEL | __GFP_HIGHMEM); } just to avoid code duplication (and possible later errors derived from it). 
Perhaps it should be always_inline, so the __builtin_return_address() can't get broken. Or just leave it the way you had it :)
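As a hedged userspace illustration of the shared-helper shape suggested above: one common allocator, thin entry points, so the two variants cannot drift apart. my_vmalloc, my_vzalloc, and alloc_common are invented names, and the kernel version zeroes via __GFP_ZERO rather than an explicit memset().

```c
#include <stdlib.h>
#include <string.h>

/* One shared helper carries all the real logic... */
static void *alloc_common(size_t size, int zero)
{
	void *p = malloc(size);

	if (p && zero)
		memset(p, 0, size);
	return p;
}

/* ...and the public entry points are one-liners, mirroring the
 * vmalloc()/vzalloc() refactoring sketched in the review. */
static void *my_vmalloc(size_t size) { return alloc_common(size, 0); }
static void *my_vzalloc(size_t size) { return alloc_common(size, 1); }
```

A later fix to the common path automatically covers both variants, which is the "possible later errors" point being made.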
Re: [PATCH 1/2] Add vzalloc shortcut
On Tue, 19 Oct 2010 09:55:17 +0800 Dave Young hidave.darks...@gmail.com wrote: On Tue, Oct 19, 2010 at 9:27 AM, Dave Young hidave.darks...@gmail.com wrote: On Tue, Oct 19, 2010 at 7:46 AM, Andrew Morton Also, a slightly better implementation would be static inline void *__vmalloc_node_flags(unsigned long size, gfp_t flags) { return __vmalloc_node(size, 1, flags, PAGE_KERNEL, -1, __builtin_return_address(0)); } Is this better? Might __vmalloc_node_flags be used by callers other than vmalloc? static inline void *__vmalloc_node_flags(unsigned long size, int node, gfp_t flags) I have no strong opinions, really. If we add more and more arguments to __vmalloc_node_flags() it ends up looking like __vmalloc_node(), so we may as well just call __vmalloc_node(). Do whatever feels good ;)
Re: [PATCH] cgroups: fix API thinko
On Fri, 06 Aug 2010 10:38:24 -0600 Alex Williamson alex.william...@redhat.com wrote: On Fri, 2010-08-06 at 09:34 -0700, Sridhar Samudrala wrote: On 8/5/2010 3:59 PM, Michael S. Tsirkin wrote: The cgroup_attach_task_current_cg API that we have upstream is backwards: we really need an API to attach to the cgroups from another process A to the current one. In our case (vhost), a privileged user wants to attach its task to cgroups from a less privileged one, the API makes us run it in the other task's context, and this fails. So let's make the API generic and just pass in 'from' and 'to' tasks. Add an inline wrapper for cgroup_attach_task_current_cg to avoid breaking bisect. Signed-off-by: Michael S. Tsirkin m...@redhat.com --- Paul, Li, Sridhar, could you please review the following patch? I only compile-tested it due to travel, but looks straight-forward to me. Alex Williamson volunteered to test and report the results. Sending out now for review as I might be offline for a bit. Will only try to merge when done, obviously. If OK, I would like to merge this through -net tree, together with the patch fixing vhost-net. Let me know if that sounds ok. Thanks! This patch is on top of net-next, it is needed to fix a vhost-net regression in net-next, where a non-privileged process can't enable the device anymore: when qemu uses vhost, inside the ioctl call it creates a thread, and tries to add this thread to the groups of current, and it fails. But we control the thread, so to solve the problem, we really should tell it 'connect to our cgroups'. So am I correct to assume that this change is now needed in 2.6.36, and unneeded in 2.6.35? Can it affect the userspace-kernel API in any manner? If so, it should be backported into earlier kernels to reduce the number of incompatible kernels out there. Paul, did you have any comments? I didn't see any update in response to the minor review comments, so... 
include/linux/cgroup.h |1 + kernel/cgroup.c|6 +++--- 2 files changed, 4 insertions(+), 3 deletions(-) diff -puN include/linux/cgroup.h~cgroups-fix-api-thinko-fix include/linux/cgroup.h --- a/include/linux/cgroup.h~cgroups-fix-api-thinko-fix +++ a/include/linux/cgroup.h @@ -579,6 +579,7 @@ void cgroup_iter_end(struct cgroup *cgrp int cgroup_scan_tasks(struct cgroup_scanner *scan); int cgroup_attach_task(struct cgroup *, struct task_struct *); int cgroup_attach_task_all(struct task_struct *from, struct task_struct *); + static inline int cgroup_attach_task_current_cg(struct task_struct *tsk) { return cgroup_attach_task_all(current, tsk); diff -puN kernel/cgroup.c~cgroups-fix-api-thinko-fix kernel/cgroup.c --- a/kernel/cgroup.c~cgroups-fix-api-thinko-fix +++ a/kernel/cgroup.c @@ -1798,13 +1798,13 @@ out: int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk) { struct cgroupfs_root *root; - struct cgroup *cur_cg; int retval = 0; cgroup_lock(); for_each_active_root(root) { - cur_cg = task_cgroup_from_root(from, root); - retval = cgroup_attach_task(cur_cg, tsk); + struct cgroup *from_cg = task_cgroup_from_root(from, root); + + retval = cgroup_attach_task(from_cg, tsk); if (retval) break; } _ -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: + kvm-ia64-dereference-of-null-pointer-in-set_pal_result.patch added to -mm tree
On Wed, 13 Jan 2010 11:22:39 +0200 Avi Kivity a...@redhat.com wrote: On 01/13/2010 12:11 AM, a...@linux-foundation.org wrote: Subject: kvm/ia64: dereference of NULL pointer in set_pal_result() From: Roel Kluin roel.kl...@gmail.com Do not dereference a NULL pointer diff -puN arch/ia64/kvm/kvm_fw.c~kvm-ia64-dereference-of-null-pointer-in-set_pal_result arch/ia64/kvm/kvm_fw.c --- a/arch/ia64/kvm/kvm_fw.c~kvm-ia64-dereference-of-null-pointer-in-set_pal_result +++ a/arch/ia64/kvm/kvm_fw.c @@ -75,9 +75,11 @@ static void set_pal_result(struct kvm_vc struct exit_ctl_data *p; p = kvm_get_exit_data(vcpu); - if (p && p->exit_reason == EXIT_REASON_PAL_CALL) { + if (!p) + return; + if (p->exit_reason == EXIT_REASON_PAL_CALL) { p->u.pal_data.ret = result; - return ; + return; } INIT_PAL_STATUS_UNIMPLEMENTED(p->u.pal_data.ret); } kvm_get_exit_data() cannot return a NULL pointer. In that case set_pal_result() doesn't need to test for that. Roel looks for code along the lines of if (p) ... *p; Where did this come from? I got it off linux-kernel.
Re: vga_arb warning [was: mmotm 2009-11-01-10-01 uploaded]
Please cc me on mmotm bug reports!

On Sun, 01 Nov 2009 22:47:05 +0100 Jiri Slaby jirisl...@gmail.com wrote: On 11/01/2009 07:07 PM, a...@linux-foundation.org wrote: The mm-of-the-moment snapshot 2009-11-01-10-01 has been uploaded to

Hi, I got the following warning while booting an image in qemu-kvm:

WARNING: at fs/attr.c:158 notify_change+0x2da/0x310()
Hardware name:
Modules linked in:
Pid: 1, comm: swapper Not tainted 2.6.32-rc5-mm1_64 #862
Call Trace:
[81038008] warn_slowpath_common+0x78/0xb0
[8103804f] warn_slowpath_null+0xf/0x20
[810d32ba] notify_change+0x2da/0x310
[810c5b88] ? fsnotify_create+0x48/0x60
[810c6d2b] ? vfs_mknod+0xbb/0xe0
[812487b6] devtmpfs_create_node+0x1e6/0x270
[811170d0] ? sysfs_addrm_finish+0x20/0x280
[811175d6] ? __sysfs_add_one+0x26/0xf0
[81117b6c] ? sysfs_do_create_link+0xcc/0x160
[81241cf0] device_add+0x1e0/0x5b0
[8124adb1] ? pm_runtime_init+0xa1/0xb0
[81248f05] ? device_pm_init+0x65/0x70
[812420d9] device_register+0x19/0x20
[81242290] device_create_vargs+0xf0/0x120
[812422ec] device_create+0x2c/0x30
[810c0516] ? __register_chrdev+0x86/0xf0
[81245599] ? __class_create+0x69/0xa0
[814326e9] ? mutex_lock+0x19/0x50
[811d4e23] misc_register+0x93/0x170
[818994a0] ? vga_arb_device_init+0x0/0x77
[818994b3] vga_arb_device_init+0x13/0x77
[818994a0] ? vga_arb_device_init+0x0/0x77
[810001e7] do_one_initcall+0x37/0x190
[8187d6ce] kernel_init+0x172/0x1c8
[81003c7a] child_rip+0xa/0x20
[8187d55c] ? kernel_init+0x0/0x1c8
[81003c70] ? child_rip+0x0/0x20

There's a -mm-only debug patch: http://userweb.kernel.org/~akpm/mmotm/broken-out/notify_change-callers-must-hold-i_mutex.patch

--- a/fs/attr.c~notify_change-callers-must-hold-i_mutex
+++ a/fs/attr.c
@@ -155,6 +155,8 @@ int notify_change(struct dentry * dentry
 		return -EPERM;
 	}
 
+	WARN_ON_ONCE(!mutex_is_locked(&inode->i_mutex));
+
 	now = current_fs_time(inode->i_sb);
 	attr->ia_ctime = now;
_

I forget why it was added. It looks like it's blaming the open-coded notify_change() call in devtmpfs_create_node().
Re: [was: mmotm 2009-10-09-01-07 uploaded]
On Fri, 9 Oct 2009 08:30:28 -0700 Randy Dunlap rdun...@xenotime.net wrote:

From: Randy Dunlap randy.dun...@oracle.com

When CONFIG_CPU_FREQ is disabled, cpufreq_get() needs a stub. It is used by kvm (although it looks like a bit of the kvm code could be omitted when CONFIG_CPU_FREQ is disabled).

arch/x86/built-in.o: In function `kvm_arch_init':
(.text+0x10de7): undefined reference to `cpufreq_get'

Signed-off-by: Randy Dunlap randy.dun...@oracle.com
Tested-by: Eric Paris epa...@redhat.com
---
 include/linux/cpufreq.h |    7 +++
 1 file changed, 7 insertions(+)

--- linux-next-20091006.orig/include/linux/cpufreq.h
+++ linux-next-20091006/include/linux/cpufreq.h
@@ -291,8 +291,15 @@ struct global_attr {
 int cpufreq_get_policy(struct cpufreq_policy *policy, unsigned int cpu);
 int cpufreq_update_policy(unsigned int cpu);
 
+#ifdef CONFIG_CPU_FREQ
 /* query the current CPU frequency (in kHz). If zero, cpufreq couldn't detect it */
 unsigned int cpufreq_get(unsigned int cpu);
+#else
+static inline unsigned int cpufreq_get(unsigned int cpu)
+{
+	return 0;
+}
+#endif

Thanks. I'll merge this into mainline in the next batch, I think. It's only needed by the KVM development tree but it's the correct thing to do anyway, and having it in mainline will simplify life for everyone.
Re: [PATCHv2 1/2] mm: export use_mm/unuse_mm to modules
On Thu, 17 Sep 2009 08:38:18 +0300 Michael S. Tsirkin m...@redhat.com wrote: Hi Andrew, On Tue, Aug 11, 2009 at 03:10:10PM -0700, Andrew Morton wrote: On Wed, 12 Aug 2009 00:27:52 +0300 Michael S. Tsirkin m...@redhat.com wrote: vhost net module wants to do copy to/from user from a kernel thread, which needs use_mm (like what fs/aio has). Move that into mm/ and export to modules.

OK by me. Please include this change in the virtio patchset. Which I shall cheerfully not be looking at :)

The virtio patches are somewhat delayed as we are ironing out the kernel/user interface with Rusty. Can the patch moving use_mm to mm/ be applied without exporting to modules for now? This will make it easier for virtio, which will only have to patch in the EXPORT line.

That was 10,000 patches ago.

I also have a small patch optimizing atomic usage in use_mm (which I did for virtio) and it's easier to apply it if the code is in the new place. If ok, pls let me know and I'll post the patch without the EXPORT line.

Please just send them all out.
Re: [PATCHv2 1/2] mm: export use_mm/unuse_mm to modules
On Wed, 12 Aug 2009 00:27:52 +0300 Michael S. Tsirkin m...@redhat.com wrote: vhost net module wants to do copy to/from user from a kernel thread, which needs use_mm (like what fs/aio has). Move that into mm/ and export to modules. OK by me. Please include this change in the virtio patchset. Which I shall cheerfully not be looking at :)
Re: [patch] eventfd - revised interface and cleanups (2nd rev)
On Wed, 24 Jun 2009 15:47:47 -0700 (PDT) Davide Libenzi davi...@xmailserver.org wrote: On Tue, 23 Jun 2009, Andrew Morton wrote: Maybe there is such a reason, and it hasn't yet been beaten into my skull. But I still think it should be done in a separate patch. One which comes with a full description of the reasons for the change, btw. Since your skull seems pretty hard to beat, would you like me to split it for ya? :) Split what? My skull? umm, yes please, I believe the patches should be split. And I'm still not seeing the justification for forcing CONFIG_EVENTFD onto all CONFIG_AIO users!
Re: [patch] eventfd - revised interface and cleanups (2nd rev)
On Wed, 24 Jun 2009 16:52:06 -0700 (PDT) Davide Libenzi davi...@xmailserver.org wrote: umm, yes please, I believe the patches should be split. And I'm still not seeing the justification for forcing CONFIG_EVENTFD onto all CONFIG_AIO users!

Eventfd notifications became part of the AIO API (it's not even delivered through a new syscall, from the AIO side - same existing aiocb struct and io_submit syscall) once we merged it,

That was a regression for existing embedded AIO users ;)

so IMHO (AIO && !EVENTFD) would be similar to splitting AIO into AIO_READ and AIO_WRITE and having (AIO && !AIO_WRITE). Considering that the kernel config, once you unleash the CONFIG_EMBEDDED pandora box, allows you to select (AIO && !EVENTFD) w/out even a warning about possible userspace breakages, this makes it rather a confusing configuration if you ask me.

Sure. But we do assume that once someone sets CONFIG_EMBEDDED, they know what they're doing. We prefer to give them maximum flexibility and foot-shooting ability. As long as the maintenance cost of doing so in each case is reasonable, of course.

It's not a biggie from the kernel side, just a few ugly error wrappers around functions. For me AIO (or whatever userspace-visible kernel subsystem) should select all the components that are part of the userspace API, but my argument ends here.

Sure, it's not a biggie either way. Doubtful if many tiny systems are using AIO anyway. Heck, lots of them are running 2.4.18... But from the general this-is-the-way-we-do-things POV, it's preferred that the two features be separable under CONFIG_EMBEDDED if poss.
Re: [patch] eventfd - revised interface and cleanups (2nd rev)
On Tue, 23 Jun 2009 12:25:36 -0700 (PDT) Davide Libenzi davi...@xmailserver.org wrote: The following patch changes the eventfd interface to de-couple the eventfd memory context from the file pointer instance. Without such change, there is no clean, race-free way to handle the POLLHUP event sent when the last instance of the file* goes away. Also, now the internal eventfd APIs are using the eventfd context instead of the file*. Another cleanup this patch does, is making AIO select EVENTFD, instead of adding a bunch of empty function stubs inside eventfd.h in order to handle the (AIO && !EVENTFD) case.

If we really want to squeeze this into 2.6.31 then it would be helpful to have justification in the changelog, please. I see that some KVM feature needs it, but what and why and why now, etc?

The patch conflicts with similar changes in the KVM tree, but that'll just ruin sfr's life for a day.

Your patch has:

 static int eventfd_release(struct inode *inode, struct file *file)
 {
-	kfree(file->private_data);
+	struct eventfd_ctx *ctx = file->private_data;
+
+	wake_up_poll(&ctx->wqh, POLLHUP);
+	eventfd_ctx_put(ctx);
 	return 0;
 }

but the patch in linux-next has:

static int eventfd_release(struct inode *inode, struct file *file)
{
	struct eventfd_ctx *ctx = file->private_data;

	/*
	 * No need to hold the lock here, since we are on the file cleanup
	 * path and the ones still attached to the wait queue will be
	 * serialized by wake_up_locked_poll().
	 */
	wake_up_locked_poll(&ctx->wqh, POLLHUP);
	kfree(ctx);
	return 0;
}

which looks quite different (and has a nice comment!) What's going on here?
Re: [patch] eventfd - revised interface and cleanups (2nd rev)
On Tue, 23 Jun 2009 12:25:36 -0700 (PDT) Davide Libenzi davi...@xmailserver.org wrote: Another cleanup this patch does, is making AIO select EVENTFD, instead of adding a bunch of empty function stubs inside eventfd.h in order to handle the (AIO && !EVENTFD) case.

Given that we're trying to squeak this patch into 2.6.31, it's quite unfortunate to include unrelated changes. Especially ones which involve the always-problematic select. Although this one looks less than usually deadly due to the simple dependencies of AIO and eventfd. However... Is this a good way of fixing the (AIO && !EVENTFD) case? Surely if the developer selected this combination, he doesn't want to carry the overhead of including eventfd. IOW, if AIO can work acceptably without eventfd being compiled into the kernel then adding the stubs is a superior solution.
Re: [patch] eventfd - revised interface and cleanups (2nd rev)
On Tue, 23 Jun 2009 12:25:36 -0700 (PDT) Davide Libenzi davi...@xmailserver.org wrote: The following patch changes the eventfd interface to de-couple the eventfd memory context from the file pointer instance. Without such change, there is no clean, race-free way to handle the POLLHUP event sent when the last instance of the file* goes away. Also, now the internal eventfd APIs are using the eventfd context instead of the file*. Another cleanup this patch does, is making AIO select EVENTFD, instead of adding a bunch of empty function stubs inside eventfd.h in order to handle the (AIO && !EVENTFD) case.

...

+/**
+ * eventfd_ctx_get - Acquires a reference to the internal eventfd context.
+ * @ctx: [in] Pointer to the eventfd context.
+ *
+ * Returns: In case of success, returns a pointer to the eventfd context,
+ * otherwise a proper error code.
+ */
+struct eventfd_ctx *eventfd_ctx_get(struct eventfd_ctx *ctx)
+{
+	kref_get(&ctx->kref);
+	return ctx;
+}
+EXPORT_SYMBOL_GPL(eventfd_ctx_get);

The description of the return value doesn't match the code.

Also...

+ * Returns: A pointer to the eventfd file structure in case of success, or a
+ * proper error pointer in case of failure.

+ * Returns: In case of success, it returns a pointer to the internal eventfd
+ * context, otherwise a proper error code.
+ */

I'm unsure what the word proper means in this context. The term proper error pointer is understandable enough - something you run IS_ERR() against. error pointer would suffice. But the term proper error code is getting a bit remote from reality. Unfortunately the kernel doesn't have a simple and agreed-to term for an ERR_PTR() thingy. Perhaps we should invent one. err_ptr?
Re: [patch] eventfd - revised interface and cleanups (2nd rev)
On Tue, 23 Jun 2009 13:59:07 -0700 (PDT) Davide Libenzi davi...@xmailserver.org wrote: On Tue, 23 Jun 2009, Andrew Morton wrote: On Tue, 23 Jun 2009 12:25:36 -0700 (PDT) Davide Libenzi davi...@xmailserver.org wrote: Another cleanup this patch does, is making AIO select EVENTFD, instead of adding a bunch of empty function stubs inside eventfd.h in order to handle the (AIO && !EVENTFD) case.

Given that we're trying to squeak this patch into 2.6.31, it's quite unfortunate to include unrelated changes. Especially ones which involve the always-problematic select. Although this one looks less than usually deadly due to the simple dependencies of AIO and eventfd. However... Is this a good way of fixing the (AIO && !EVENTFD) case? Surely if the developer selected this combination, he doesn't want to carry the overhead of including eventfd. IOW, if AIO can work acceptably without eventfd being compiled into the kernel then adding the stubs is a superior solution.

IMO when someone says AIO is included in the kernel, they should get the whole AIO functionality and not part of it. People have already started using KAIO+eventfd, and an (AIO && !EVENTFD) config will make their code fail.

That isn't for us to decide. Entire syscalls can be disabled in config.
Re: [patch] eventfd - revised interface and cleanups (2nd rev)
On Tue, 23 Jun 2009 14:03:22 -0700 (PDT) Davide Libenzi davi...@xmailserver.org wrote: On Tue, 23 Jun 2009, Andrew Morton wrote: On Tue, 23 Jun 2009 12:25:36 -0700 (PDT) Davide Libenzi davi...@xmailserver.org wrote: The following patch changes the eventfd interface to de-couple the eventfd memory context from the file pointer instance. Without such change, there is no clean, race-free way to handle the POLLHUP event sent when the last instance of the file* goes away. Also, now the internal eventfd APIs are using the eventfd context instead of the file*. Another cleanup this patch does, is making AIO select EVENTFD, instead of adding a bunch of empty function stubs inside eventfd.h in order to handle the (AIO && !EVENTFD) case.

...

+/**
+ * eventfd_ctx_get - Acquires a reference to the internal eventfd context.
+ * @ctx: [in] Pointer to the eventfd context.
+ *
+ * Returns: In case of success, returns a pointer to the eventfd context,
+ * otherwise a proper error code. The description of the return value

Should functions be describing all the returned error codes, ala man pages? I think so.

+ */
+struct eventfd_ctx *eventfd_ctx_get(struct eventfd_ctx *ctx)
+{
+	kref_get(&ctx->kref);
+	return ctx;
+}
+EXPORT_SYMBOL_GPL(eventfd_ctx_get);

doesn't match the code.

You appear to have not seen the above sentence.

Also...

+ * Returns: A pointer to the eventfd file structure in case of success, or a
+ * proper error pointer in case of failure.
+ * Returns: In case of success, it returns a pointer to the internal eventfd
+ * context, otherwise a proper error code.
+ */

I'm unsure what the word proper means in this context. The term proper error pointer is understandable enough - something you run IS_ERR() against. error pointer would suffice. But the term proper error code is getting a bit remote from reality. Unfortunately the kernel doesn't have a simple and agreed-to term for an ERR_PTR() thingy. Perhaps we should invent one. err_ptr?
OK, but you tricked me once again :) You posted your comments/changes while you merged the old version in -mm already. yeah, I never trust people. You might lose the email or jump on a plane and disappear for three weeks, then it all gets forgotten about and lost. If the code doesn't have any apparent showstoppers I'll often merge it with a note that it isn't finalised.
Re: [patch] eventfd - revised interface and cleanups (2nd rev)
On Tue, 23 Jun 2009 14:25:05 -0700 (PDT) Davide Libenzi davi...@xmailserver.org wrote: On Tue, 23 Jun 2009, Andrew Morton wrote: That isn't for us to decide. Entire syscalls can be disabled in config. That is not a well defined separate syscall though. It's a member/feature of the aiocb. I don't know what this means, really. AIO eventfd support is a relatively recently added enhancement to AIO, is it not? Applications which continue to use the pre-May07 AIO features will continue to work as before (they darn well better). So for such applications, AIO=y/EVENTFD=n kernels are usable and useful, and eliminating this option is a loss? Either way, I believe that this change should be unbundled from the unrelated KVM one so we can handle it separately.
Re: [patch] eventfd - revised interface and cleanups (2nd rev)
On Tue, 23 Jun 2009 14:34:50 -0700 (PDT) Davide Libenzi davi...@xmailserver.org wrote: On Tue, 23 Jun 2009, Andrew Morton wrote: Should functions be describing all the returned error codes, ala man pages? I think so. This becomes pretty painful when the function calls other functions, for which just relays the error code. Should we be just documenting the error codes introduced by the function code, and say that the function returns errors A, B, C plus all the ones returned by the called functions X, Y, Z? If not, it becomes hell in maintaining the comments... Well. Don't worry about making rules. Taste and common sense apply. Will it be useful to readers if I explicitly document the return value. If yes then document away. If no then don't.
Re: [patch] eventfd - revised interface and cleanups (2nd rev)
On Tue, 23 Jun 2009 14:48:51 -0700 (PDT) Davide Libenzi davi...@xmailserver.org wrote: This becomes pretty painful when the function calls other functions, for which just relays the error code. Should we be just documenting the error codes introduced by the function code, and say that the function returns errors A, B, C plus all the ones returned by the called functions X, Y, Z? If not, it becomes hell in maintaining the comments... Well. Don't worry about making rules. Taste and common sense apply. Will it be useful to readers if I explicitly document the return value. If yes then document away. If no then don't. Are you OK with the format in the patch below? Looks great to me. Obviously the cost of maintaining this level of detail is fairly high, and the chances of bitrot are also high. So I wouldn't be expecting people to document things at this level in general. But if you're prepared to maintain this then good for you.
Re: [patch] eventfd - revised interface and cleanups (2nd rev)
On Tue, 23 Jun 2009 15:49:53 -0700 (PDT) Davide Libenzi davi...@xmailserver.org wrote: On Tue, 23 Jun 2009, Andrew Morton wrote: On Tue, 23 Jun 2009 14:25:05 -0700 (PDT) Davide Libenzi davi...@xmailserver.org wrote: On Tue, 23 Jun 2009, Andrew Morton wrote: That isn't for us to decide. Entire syscalls can be disabled in config. That is not a well defined separate syscall though. It's a member/feature of the aiocb. I don't know what this means, really.

This is the struct iocb:

struct iocb {
	...
	u_int32_t aio_resfd;
};

And the only interface to access KAIO is io_submit(). IMO the end user perceives the KAIO functionality as the full deployment of the iocb struct and the io_submit() accessory.

Can code not using eventfd work in (AIO && !EVENTFD) mode? Sure. It is a kinda confusing configuration from the end user POV IMO.

What's confusing about it? Most programmers who are using AIO probably don't even know that the eventfd hookup is available. And even if they did, they might not want to use it, to remain compatible with older kernels. I'm still not seeing any compelling reason for unconditionally adding kernel text which some users don't need. Maybe there is such a reason, and it hasn't yet been beaten into my skull. But I still think it should be done in a separate patch. One which comes with a full description of the reasons for the change, btw.
Re: [PATCH 5/5] add ksm kernel shared memory driver.
On Mon, 20 Apr 2009 04:36:06 +0300 Izik Eidus iei...@redhat.com wrote: Ksm is a driver that allows merging identical pages between one or more applications, in a way invisible to the applications that use it. Pages that are merged are marked as read-only and are COWed when any application tries to change them. Ksm is used for cases where using fork() is not suitable; one of these cases is where the pages of the application keep changing dynamically and the application cannot know in advance what pages are going to be identical. Ksm works by walking over the memory pages of the applications it scans in order to find identical pages. It uses two sorted data structures, called the stable and unstable trees, to find identical pages in an effective way. When ksm finds two identical pages, it marks them as read-only and merges them into a single page; after that, linux will treat these pages as normal copy-on-write pages and will fork them when write access happens to them. Ksm scans just the memory areas that were registered to be scanned by it.

...

+	copy_user_highpage(kpage, page1, addr1, vma);

...

Breaks ppc64 allmodconfig because that architecture doesn't export its copy_user_page() to modules. Architectures are inconsistent about this. x86 _does_ export it, because it bounces it to the exported copy_page(). So can I ask that you sit down and work out upon which architectures it really makes sense to offer KSM? Disallow the others in Kconfig and arrange for copy_user_highpage() to be available on the allowed architectures? Thanks.
Re: [PATCH 5/5] add ksm kernel shared memory driver.
On Thu, 30 Apr 2009 20:46:24 +0300 Izik Eidus iei...@redhat.com wrote: The following patches change the api to be more robust. The change of the api came after a conversation I had with Andrea and Chris about how to make the api as stable as we can. In addition I hope this patchset fixes the cross compilation problems; I compiled it on itanium (which doesn't have _PAGE_RW) and it seems to work. eek, please don't send multiple patches per email - it's surprisingly disruptive to everything. What is the relationship between these patches and the ones I merged the other day?
Re: [KVM PATCH v3 0/2] irqfd
On Mon, 27 Apr 2009 14:33:24 -0400 Gregory Haskins ghask...@novell.com wrote: (Applies to kvm.git 41b76d8d0487c26d6d4d3fe53c1ff59b3236f096) This series implements a mechanism called irqfd. It lets you create an eventfd-based file descriptor to inject interrupts into a kvm guest. We associate one gsi per fd for fine-grained routing.

It'd be nice if the KVM weenies amongst us were to be told why one would want to inject interrupts into a KVM guest. Monosyllabic words would be preferred ;)

We do not have a user of this interface in this series, though note future versions of virtual-bus (v4 and above) will be based on this.

So I assume that this patchset will be merged if/when a user of it is merged?

The first patch will require mainline buy-in, particularly from Davide (cc'd). The last patch is kvm specific.

Three EXPORT_SYMBOL_GPL()s. Once the shouting has subsided I'd suggest that this be merged via the KVM tree.
Re: [PATCH 5/5] add ksm kernel shared memory driver.
On Mon, 20 Apr 2009 04:36:06 +0300 Izik Eidus iei...@redhat.com wrote: Ksm is a driver that allows merging identical pages between one or more applications, in a way invisible to the applications that use it. Pages that are merged are marked as read-only and are COWed when any application tries to change them.

Breaks sparc64 and probably lots of other architectures:

mm/ksm.c: In function `try_to_merge_two_pages_alloc':
mm/ksm.c:697: error: `_PAGE_RW' undeclared (first use in this function)

There should be an official arch-independent way of manipulating vma->vm_page_prot, but I'm not immediately finding it. An alternative (and quite inferior) fix would be to disable ksm on architectures which don't implement _PAGE_RW. That's most of them.
Re: [PATCH 4/4] add ksm kernel shared memory driver.
On Thu, 16 Apr 2009 01:37:25 +0300 Izik Eidus iei...@redhat.com wrote: Andrew Morton wrote: On Thu, 9 Apr 2009 06:58:41 +0300 Izik Eidus iei...@redhat.com wrote: Confused. In the covering email you indicated that v2 of the patchset had abandoned ioctls and had moved the interface to sysfs.

We have abandoned the ioctls that control the ksm behavior (how much cpu it takes, how many kernel pages it may allocate and so on...) But we still use ioctls to register the application memory to be used with ksm.

hm. ioctls make kernel people weep and gnash teeth. An appropriate interface would be to add new syscalls. But as ksm is an optional thing and can even be modprobed, that doesn't work. And having a driver in mm/ which can be modprobed is kinda neat. I can't immediately think of a nicer interface. You could always poke numbers into some pseudo-file but to me that seems as ugly, or uglier than an ioctl (others seem to disagree). Ho hum. Please design the ioctl interface so that it doesn't need any compat handling if poss.
Re: [PATCH 0/4] ksm - dynamic page sharing driver for linux v3
On Thu, 9 Apr 2009 06:58:37 +0300 Izik Eidus iei...@redhat.com wrote: KSM is a linux driver that allows dynamically sharing identical memory pages between one or more processes.

Generally looks OK to me. But that doesn't mean much. We should rub bottles with words like hugh and nick on them to be sure.

...

 include/linux/ksm.h          |   48 ++
 include/linux/miscdevice.h   |    1 +
 include/linux/mm.h           |    5 +
 include/linux/mmu_notifier.h |   34 +
 include/linux/rmap.h         |   11 +
 mm/Kconfig                   |    6 +
 mm/Makefile                  |    1 +
 mm/ksm.c                     | 1674 ++
 mm/memory.c                  |   90 +++-
 mm/mmu_notifier.c            |   20 +
 mm/rmap.c                    |  139

And it's pretty unobtrusive for what it is. I expect we can get this into 2.6.31 unless there are some pratfalls which I missed.
Re: [PATCH 3/4] add replace_page(): change the page pte is pointing to.
On Thu, 9 Apr 2009 06:58:40 +0300 Izik Eidus iei...@redhat.com wrote: replace_page() allows changing the mapping of a pte from one physical page to a different physical page.

At a high level, this is very similar to what page migration does. Yet this implementation shares nothing with the page migration code. Can this situation be improved?

This function works by removing oldpage from the rmap and calling put_page on it, and by setting the pte to point to newpage and inserting it into the rmap using page_add_file_rmap().

note: newpage must be a non-anonymous page. The reason for this is: replace_page() is built to allow mapping one page at more than one virtual address, the mapping of this page can happen at different offsets inside each vma, and therefore we cannot trust page->index anymore. The side effect of this is that newpage cannot be anything but a kernel-allocated page that is not swappable.
Re: [PATCH 4/4] add ksm kernel shared memory driver.
On Thu, 9 Apr 2009 06:58:41 +0300 Izik Eidus iei...@redhat.com wrote: Ksm is a driver that allows merging identical pages between one or more applications, in a way invisible to the applications that use it. Pages that are merged are marked as read-only and are COWed when any application tries to change them. Ksm is used for cases where using fork() is not suitable; one of these cases is where the pages of the application keep changing dynamically and the application cannot know in advance what pages are going to be identical. Ksm works by walking over the memory pages of the applications it scans in order to find identical pages. It uses two sorted data structures, called the stable and unstable trees, to find identical pages in an effective way. When ksm finds two identical pages, it marks them as read-only and merges them into a single page; after that, linux will treat these pages as normal copy-on-write pages and will fork them when write access happens to them. Ksm scans just the memory areas that were registered to be scanned by it.

Ksm api:

KSM_GET_API_VERSION: Give the userspace the api version of the module.

KSM_CREATE_SHARED_MEMORY_AREA: Create a shared memory region fd, which later allows the user to register the memory region to be scanned, using KSM_REGISTER_MEMORY_REGION and KSM_REMOVE_MEMORY_REGION.

KSM_REGISTER_MEMORY_REGION: Register a userspace virtual address range to be scanned by ksm. This ioctl uses the ksm_memory_region structure:

ksm_memory_region:
	__u32 npages;         /* number of pages to share inside this memory region */
	__u32 pad;
	__u64 addr;           /* the beginning of the virtual address of this region */
	__u64 reserved_bits;  /* reserved bits for future usage */

KSM_REMOVE_MEMORY_REGION: Remove a memory region from ksm.

...

+/* ioctls for /dev/ksm */

Confused. In the covering email you indicated that v2 of the patchset had abandoned ioctls and had moved the interface to sysfs.
It would be good to completely (and briefly) describe KSM's proposed userspace interfaces in the changelog or somewhere. I'm a bit confused. ... +/* + * slots_lock protect against removing and adding memory regions while a scanner + * is in the middle of scanning. + */ protects +static DECLARE_RWSEM(slots_lock); + +/* The stable and unstable trees heads. */ +struct rb_root root_stable_tree = RB_ROOT; +struct rb_root root_unstable_tree = RB_ROOT; + + +/* The number of linked list members inside the hash table */ +static int nrmaps_hash; A signed type doesn't seem appropriate. +/* rmap_hash hash table */ +static struct hlist_head *rmap_hash; + +static struct kmem_cache *tree_item_cache; +static struct kmem_cache *rmap_item_cache; + +/* the number of nodes inside the stable tree */ +static unsigned long nnodes_stable_tree; + +/* the number of kernel allocated pages outside the stable tree */ +static unsigned long nkpage_out_tree; + +static int kthread_sleep; /* sleep time of the kernel thread */ +static int kthread_pages_to_scan; /* npages to scan for the kernel thread */ +static int kthread_max_kernel_pages; /* number of unswappable pages allowed */ The kthread_max_kernel_pages isn't very illuminating. The use of kthread in the identifier makes it look like part of the kthread subsystem. +static unsigned long ksm_pages_shared; +static struct ksm_scan kthread_ksm_scan; +static int ksmd_flags; +static struct task_struct *kthread; +static DECLARE_WAIT_QUEUE_HEAD(kthread_wait); +static DECLARE_RWSEM(kthread_lock); + + ... +static pte_t *get_pte(struct mm_struct *mm, unsigned long addr) +{ + pgd_t *pgd; + pud_t *pud; + pmd_t *pmd; + pte_t *ptep = NULL; + + pgd = pgd_offset(mm, addr); + if (!pgd_present(*pgd)) + goto out; + + pud = pud_offset(pgd, addr); + if (!pud_present(*pud)) + goto out; + + pmd = pmd_offset(pud, addr); + if (!pmd_present(*pmd)) + goto out; + + ptep = pte_offset_map(pmd, addr); +out: + return ptep; +} hm, this looks very generic.
Does it duplicate anything which core kernel already provides? If not, perhaps core kernel should provide this (perhaps after some reorganisation). ... +static int rmap_hash_init(void) +{ + if (!rmap_hash_size) { + struct sysinfo sinfo; + + si_meminfo(&sinfo); + rmap_hash_size = sinfo.totalram / 10; One slot per ten pages of physical memory? Is this too large, too small or just right? + } + nrmaps_hash = rmap_hash_size; + rmap_hash = vmalloc(nrmaps_hash * sizeof(struct hlist_head)); + if (!rmap_hash) + return -ENOMEM; + memset(rmap_hash, 0, nrmaps_hash * sizeof(struct hlist_head)); + return 0; +} + ... +static void break_cow(struct mm_struct *mm, unsigned long addr) +{ + struct page *page[1]; + +
Re: [net-next PATCH 1/2] igbvf: add new driver to support 82576 virtual functions
On Wed, 18 Mar 2009 08:22:46 -0700 Alexander Duyck alexander.h.du...@intel.com wrote: +static int igbvf_set_ringparam(struct net_device *netdev, + struct ethtool_ringparam *ring) +{ + struct igbvf_adapter *adapter = netdev_priv(netdev); + struct igbvf_ring *tx_ring, *tx_old; + struct igbvf_ring *rx_ring, *rx_old; + int err; + + if ((ring->rx_mini_pending) || (ring->rx_jumbo_pending)) + return -EINVAL; + + while (test_and_set_bit(__IGBVF_RESETTING, &adapter->state)) + msleep(1); No timeout needed here? Interrupts might not be working, for example.. This bit isn't set in interrupt context. This is always used out of interrupt context and is just to prevent multiple setting changes at the same time. Oh. Can't use plain old mutex_lock()? We have one or two spots that actually check to see if the bit is set and just report a warning instead of actually waiting on the bit to clear. mutex_is_locked()?
Re: [net-next PATCH 1/2] igbvf: add new driver to support 82576 virtual functions
On Tue, 10 Mar 2009 19:09:28 -0700 Jeff Kirsher jeffrey.t.kirs...@intel.com wrote: From: Alexander Duyck alexander.h.du...@intel.com This adds an igbvf driver to handle virtual functions provided by the igb driver. The drive-by reader is now wondering what a virtual function is. ... +#define IGBVF_STAT(m, b) sizeof(((struct igbvf_adapter *)0)->m), \ + offsetof(struct igbvf_adapter, m), \ + offsetof(struct igbvf_adapter, b) +static const struct igbvf_stats igbvf_gstrings_stats[] = { + { "rx_packets", IGBVF_STAT(stats.gprc, stats.base_gprc) }, + { "tx_packets", IGBVF_STAT(stats.gptc, stats.base_gptc) }, + { "rx_bytes", IGBVF_STAT(stats.gorc, stats.base_gorc) }, + { "tx_bytes", IGBVF_STAT(stats.gotc, stats.base_gotc) }, + { "multicast", IGBVF_STAT(stats.mprc, stats.base_mprc) }, + { "lbrx_bytes", IGBVF_STAT(stats.gorlbc, stats.base_gorlbc) }, + { "lbrx_packets", IGBVF_STAT(stats.gprlbc, stats.base_gprlbc) }, + { "tx_restart_queue", IGBVF_STAT(restart_queue, zero_base) }, + { "rx_long_byte_count", IGBVF_STAT(stats.gorc, stats.base_gorc) }, + { "rx_csum_offload_good", IGBVF_STAT(hw_csum_good, zero_base) }, + { "rx_csum_offload_errors", IGBVF_STAT(hw_csum_err, zero_base) }, + { "rx_header_split", IGBVF_STAT(rx_hdr_split, zero_base) }, + { "alloc_rx_buff_failed", IGBVF_STAT(alloc_rx_buff_failed, zero_base) }, +}; stares at it for a while It would be clearer if `m' and `b' were (much) more meaningful identifiers. +#define IGBVF_GLOBAL_STATS_LEN \ + (sizeof(igbvf_gstrings_stats) / sizeof(struct igbvf_stats)) This is ARRAY_SIZE(). +#define IGBVF_STATS_LEN (IGBVF_GLOBAL_STATS_LEN) Why does this need to exist? ...
+static int igbvf_set_ringparam(struct net_device *netdev, + struct ethtool_ringparam *ring) +{ + struct igbvf_adapter *adapter = netdev_priv(netdev); + struct igbvf_ring *tx_ring, *tx_old; + struct igbvf_ring *rx_ring, *rx_old; + int err; + + if ((ring->rx_mini_pending) || (ring->rx_jumbo_pending)) + return -EINVAL; + + while (test_and_set_bit(__IGBVF_RESETTING, &adapter->state)) + msleep(1); No timeout needed here? Interrupts might not be working, for example.. + if (netif_running(adapter->netdev)) + igbvf_down(adapter); + + tx_old = adapter->tx_ring; + rx_old = adapter->rx_ring; + + err = -ENOMEM; + tx_ring = kzalloc(sizeof(struct igbvf_ring), GFP_KERNEL); + if (!tx_ring) + goto err_alloc_tx; + /* + * use a memcpy to save any previously configured + * items like napi structs from having to be + * reinitialized + */ + memcpy(tx_ring, tx_old, sizeof(struct igbvf_ring)); + + rx_ring = kzalloc(sizeof(struct igbvf_ring), GFP_KERNEL); + if (!rx_ring) + goto err_alloc_rx; + memcpy(rx_ring, rx_old, sizeof(struct igbvf_ring)); + + adapter->tx_ring = tx_ring; + adapter->rx_ring = rx_ring; + + rx_ring->count = max(ring->rx_pending, (u32)IGBVF_MIN_RXD); + rx_ring->count = min(rx_ring->count, (u32)(IGBVF_MAX_RXD)); + rx_ring->count = ALIGN(rx_ring->count, REQ_RX_DESCRIPTOR_MULTIPLE); + + tx_ring->count = max(ring->tx_pending, (u32)IGBVF_MIN_TXD); + tx_ring->count = min(tx_ring->count, (u32)(IGBVF_MAX_TXD)); + tx_ring->count = ALIGN(tx_ring->count, REQ_TX_DESCRIPTOR_MULTIPLE); + + if (netif_running(adapter->netdev)) { + /* Try to get new resources before deleting old */ + err = igbvf_setup_rx_resources(adapter); + if (err) + goto err_setup_rx; + err = igbvf_setup_tx_resources(adapter); + if (err) + goto err_setup_tx; + + /* + * restore the old in order to free it, + * then add in the new + */ + adapter->rx_ring = rx_old; + adapter->tx_ring = tx_old; + igbvf_free_rx_resources(adapter); + igbvf_free_tx_resources(adapter); + kfree(tx_old); + kfree(rx_old); That's odd-looking.
Why take a copy of rx_old and tx_old when we're about to free them? + adapter->rx_ring = rx_ring; + adapter->tx_ring = tx_ring; + err = igbvf_up(adapter); + if (err) + goto err_setup; + } + + clear_bit(__IGBVF_RESETTING, &adapter->state); + return 0; +err_setup_tx: + igbvf_free_rx_resources(adapter); +err_setup_rx: + adapter->rx_ring = rx_old; + adapter->tx_ring = tx_old; + kfree(rx_ring); +err_alloc_rx: + kfree(tx_ring); +err_alloc_tx: + igbvf_up(adapter); +err_setup: + clear_bit(__IGBVF_RESETTING, &adapter->state); + return err; +} + ... +static void igbvf_diag_test(struct net_device *netdev, +struct ethtool_test
Re: linux-next: Tree for January 23 (kvm)
(cc mailing lists) On Wed, 28 Jan 2009 12:26:16 -0200 Marcelo Tosatti mtosa...@redhat.com wrote: On Wed, Jan 28, 2009 at 11:20:16AM +0100, Alexander Graf wrote: On 28.01.2009, at 10:51, Andrew Morton wrote: On Fri, 23 Jan 2009 12:34:24 -0800 Randy Dunlap randy.dun...@oracle.com wrote: Stephen Rothwell wrote: Hi all, News: I will be on leave next week, so there will probably be no linux-next release until Feb 2. Changes since 20090122: kvm changes cause this build error on i386: when kvm is built as module: ERROR: __moddi3 [arch/x86/kvm/kvm.ko] undefined! or when kvm is built into the kernel image: lapic.c:(.text+0x19276): undefined reference to `__moddi3' I guess it's this line: ns = ktime_to_ns(remaining) % apic->timer.period; dammit, I just spent N minutes hunting down the same thing in the Jan 26 linux-next :( Yeah, Clemens Noss posted a real fix on k...@vger already. Probably wasn't the best mailing list, as this bug affects all developers and testers.. But whatever - thanks. commit efe4ff38b2bbc3944d0b2c9b6ec669b67cb7acbc Author: Clemens Noss cn...@gmx.de Date: Mon Jan 26 02:55:16 2009 +0100 KVM: x86: fix __moddi3 undefined build failure in lapic.c use mod_64 from arch/x86/kvm/i8254.c to fix ERROR: __moddi3 [arch/x86/kvm/kvm.ko] undefined!
Signed-off-by: Clemens Noss cn...@gmx.de Signed-off-by: Marcelo Tosatti mtosa...@redhat.com diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index d8adc50..f0b67f2 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -35,6 +35,12 @@ #include "kvm_cache_regs.h" #include "irq.h" +#ifndef CONFIG_X86_64 +#define mod_64(x, y) ((x) - (y) * div64_u64(x, y)) +#else +#define mod_64(x, y) ((x) % (y)) +#endif + #define PRId64 "d" #define PRIx64 "llx" #define PRIu64 "u" @@ -525,7 +531,7 @@ static u32 apic_get_tmcct(struct kvm_lapic *apic) if (ktime_to_ns(remaining) < 0) remaining = ktime_set(0, 0); - ns = ktime_to_ns(remaining) % apic->timer.period; + ns = mod_64(ktime_to_ns(remaining), apic->timer.period); tmcct = div64_u64(ns, (APIC_BUS_CYCLE_NS * apic->timer.divide_count)); return tmcct; Rather than cooking up a private implementation I think it would be better to implement a kernel-wide mod64_u64() (?) facility and then to use that in KVM. There will presumably be other sites which will (or already do) need such a thing. However it'll probably be a bit hard, getting the signednesses of the args and the return value right.
Re: [PATCH 0/4] ksm - dynamic page sharing driver for linux
On Tue, 11 Nov 2008 15:21:37 +0200 Izik Eidus [EMAIL PROTECTED] wrote: KSM is a linux driver that allows dynamically sharing identical memory pages between one or more processes. Unlike traditional page sharing, which is set up when the memory is allocated, ksm does it dynamically after the memory has been created. Memory is periodically scanned; identical pages are identified and merged. The sharing is unnoticeable by the processes that use this memory. (The shared pages are marked as read-only, and in case of a write do_wp_page() takes care of creating a new copy of the page.) This driver is very useful for KVM, as when running multiple guest operating systems of the same type, many pages are sharable. This driver can be useful for OpenVZ as well. These benefits should be quantified, please. Also any benefits to any other workloads should be identified and quantified. The whole approach seems wrong to me. The kernel lost track of these pages and then we run around post-facto trying to fix that up again. Please explain (for the changelog) why the kernel cannot get this right via the usual sharing, refcounting and COWing approaches.
Re: [PATCH 0/4] ksm - dynamic page sharing driver for linux
On Tue, 11 Nov 2008 20:48:16 +0200 Avi Kivity [EMAIL PROTECTED] wrote: Andrew Morton wrote: The whole approach seems wrong to me. The kernel lost track of these pages and then we run around post-facto trying to fix that up again. Please explain (for the changelog) why the kernel cannot get this right via the usual sharing, refcounting and COWing approaches. For kvm, the kernel never knew those pages were shared. They are loaded from independent (possibly compressed and encrypted) disk images. These images are different; but some pages happen to be the same because they came from the same installation media. What userspace-only changes could fix this? Identify the common data, write it to a flat file and mmap it, something like that? For OpenVZ the situation is less clear, but if you allow users to independently upgrade their chroots you will eventually arrive at the same scenario (unless of course you apply the same merging strategy at the filesystem level). hm. There has been the occasional discussion about identifying all-zeroes pages and scavenging them, repointing them at the zero page. Could this infrastructure be used for that? (And how much would we gain from it?) [I'm looking for reasons why this is more than a muck-up-the-vm-for-kvm thing here ;) ]
Re: [PATCH 0/4] ksm - dynamic page sharing driver for linux
On Tue, 11 Nov 2008 21:07:10 +0200 Izik Eidus [EMAIL PROTECTED] wrote: We have used KSM in production for about half a year, and the numbers that came from our QA are: using KSM for desktops (KSM was tested just for Windows desktop workloads) you can run as many as 52 Windows XP guests with 1 GB of RAM each on a server with just 16 GB of RAM (this is more than 300% overcommit). The reason is that most of the kernel/DLLs of these guests are shared, and in addition we share the Windows zero pages (Windows keeps zeroing all of its free memory, so every time Windows releases memory we take the page back to the host). There is a slide that gives these numbers at: http://kvm.qumranet.com/kvmwiki/KvmForum2008?action=AttachFiledo=gettarget=kdf2008_3.pdf (slide 27). Besides that, I gave a presentation about ksm that can be found at: http://kvm.qumranet.com/kvmwiki/KvmForum2008?action=AttachFiledo=gettarget=kdf2008_12.pdf OK, 300% isn't chicken feed. It is quite important that information such as this be prepared, added to the patch changelogs and maintained. For a start, without this basic information, there is no reason for anyone to look at any of the code!
Re: [PATCH 1/4] rmap: add page_wrprotect() function,
On Tue, 11 Nov 2008 15:21:38 +0200 Izik Eidus [EMAIL PROTECTED] wrote: From: Izik Eidus [EMAIL PROTECTED] this function is useful for cases you want to compare page and know that its value wont change during you compare it. this function is working by walking over the whole rmap of a page and mark every pte related to the page as write_protect. the odirect_sync paramter is used to notify the caller of page_wrprotect() if one pte or more was not marked readonly in order to avoid race with odirect. The grammar is a bit mangled. Missing apostrophes. Sentences start with capital letters. Yeah, it's a nit, but we don't actually gain anything from deliberately mangling the language. ... --- a/mm/rmap.c +++ b/mm/rmap.c @@ -576,6 +576,103 @@ int page_mkclean(struct page *page) } EXPORT_SYMBOL_GPL(page_mkclean); +static int page_wrprotect_one(struct page *page, struct vm_area_struct *vma, + int *odirect_sync) +{ + struct mm_struct *mm = vma->vm_mm; + unsigned long address; + pte_t *pte; + spinlock_t *ptl; + int ret = 0; + + address = vma_address(page, vma); + if (address == -EFAULT) + goto out; + + pte = page_check_address(page, mm, address, &ptl, 0); + if (!pte) + goto out; + + if (pte_write(*pte)) { + pte_t entry; + + if (page_mapcount(page) != page_count(page)) { + *odirect_sync = 0; + goto out_unlock; + } + flush_cache_page(vma, address, pte_pfn(*pte)); + entry = ptep_clear_flush_notify(vma, address, pte); + entry = pte_wrprotect(entry); + set_pte_at(mm, address, pte, entry); + } + ret = 1; + +out_unlock: + pte_unmap_unlock(pte, ptl); +out: + return ret; +} OK. I think. We need to find a way of provoking Hugh to look at it. +static int page_wrprotect_file(struct page *page, int *odirect_sync) +{ + struct address_space *mapping; + struct prio_tree_iter iter; + struct vm_area_struct *vma; + pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT); + int ret = 0; + + mapping = page_mapping(page); What pins *mapping in memory?
Usually this is done by requiring that the caller has locked the page. But no such precondition is documented here. + if (!mapping) + return ret; + + spin_lock(&mapping->i_mmap_lock); + + vma_prio_tree_foreach(vma, iter, &mapping->i_mmap, pgoff, pgoff) + ret += page_wrprotect_one(page, vma, odirect_sync); + + spin_unlock(&mapping->i_mmap_lock); + + return ret; +} + +static int page_wrprotect_anon(struct page *page, int *odirect_sync) +{ + struct vm_area_struct *vma; + struct anon_vma *anon_vma; + int ret = 0; + + anon_vma = page_lock_anon_vma(page); + if (!anon_vma) + return ret; + + list_for_each_entry(vma, &anon_vma->head, anon_vma_node) + ret += page_wrprotect_one(page, vma, odirect_sync); + + page_unlock_anon_vma(anon_vma); + + return ret; +} + +/** This token means this is a kerneldoc comment. + * set all the ptes pointed to a page as read only, + * odirect_sync is set to 0 in case we cannot protect against race with odirect + * return the number of ptes that were set as read only + * (ptes that were read only before this function was called are couned as well) + */ But it isn't. I don't understand this odirect_sync thing. What race? Please expand this comment to make the function of odirect_sync more understandable. +int page_wrprotect(struct page *page, int *odirect_sync) +{ + int ret = 0; + + *odirect_sync = 1; + if (PageAnon(page)) + ret = page_wrprotect_anon(page, odirect_sync); + else + ret = page_wrprotect_file(page, odirect_sync); + + return ret; +} +EXPORT_SYMBOL(page_wrprotect); What do you think about making all this new code dependent upon some CONFIG_ switch which CONFIG_KVM can select?
Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another
On Tue, 11 Nov 2008 15:21:39 +0200 Izik Eidus [EMAIL PROTECTED] wrote: From: Izik Eidus [EMAIL PROTECTED] this function is needed in cases you want to change the userspace virtual mapping into diffrent physical page, Not sure that I understand that description. We want to replace a live page in an anonymous VMA with a different one? It looks that way. page migration already kinda does that. Is there common ground? KSM need this for merging the identical pages. this function is working by removing the oldpage from the rmap and calling put_page on it, and by setting the virtual address pte to point into the new page. (note that the new page (the page that we change the pte to map to) cannot be anonymous page) I don't understand the restrictions on anonymous pages. Please expand the changelog so that reviewers can understand the reasons for this restriction. --- include/linux/mm.h | 3 ++ mm/memory.c | 68 2 files changed, 71 insertions(+), 0 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index ffee2f7..4da7fa8 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1207,6 +1207,9 @@ int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr, int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn); +int replace_page(struct vm_area_struct *vma, struct page *oldpage, + struct page *newpage, pte_t orig_pte, pgprot_t prot); + struct page *follow_page(struct vm_area_struct *, unsigned long address, unsigned int foll_flags); #define FOLL_WRITE 0x01 /* check pte is writable */ diff --git a/mm/memory.c b/mm/memory.c index 164951c..b2c542c 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1472,6 +1472,74 @@ int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr, } EXPORT_SYMBOL(vm_insert_mixed); +/** + * replace _page - replace the pte mapping related to vm area between two pages s/replace _page/replace_page/ + * (from oldpage to newpage) + * NOTE: you should take into consideration the impact on the VM when
replacing + * anonymous pages with kernel non swappable pages. + */ This _is_ a kerneldoc comment, but kerneldoc comments conventionally document the arguments and the return value also. +int replace_page(struct vm_area_struct *vma, struct page *oldpage, + struct page *newpage, pte_t orig_pte, pgprot_t prot) +{ + struct mm_struct *mm = vma->vm_mm; + pgd_t *pgd; + pud_t *pud; + pmd_t *pmd; + pte_t *ptep; + spinlock_t *ptl; + unsigned long addr; + int ret; + + BUG_ON(PageAnon(newpage)); + + ret = -EFAULT; + addr = page_address_in_vma(oldpage, vma); + if (addr == -EFAULT) + goto out; + + pgd = pgd_offset(mm, addr); + if (!pgd_present(*pgd)) + goto out; + + pud = pud_offset(pgd, addr); + if (!pud_present(*pud)) + goto out; + + pmd = pmd_offset(pud, addr); + if (!pmd_present(*pmd)) + goto out; + + ptep = pte_offset_map_lock(mm, pmd, addr, &ptl); + if (!ptep) + goto out; + + if (!pte_same(*ptep, orig_pte)) { + pte_unmap_unlock(ptep, ptl); + goto out; + } + + ret = 0; + get_page(newpage); + page_add_file_rmap(newpage); + + flush_cache_page(vma, addr, pte_pfn(*ptep)); + ptep_clear_flush(vma, addr, ptep); + set_pte_at(mm, addr, ptep, mk_pte(newpage, prot)); + + page_remove_rmap(oldpage, vma); + if (PageAnon(oldpage)) { + dec_mm_counter(mm, anon_rss); + inc_mm_counter(mm, file_rss); + } + put_page(oldpage); + + pte_unmap_unlock(ptep, ptl); + +out: + return ret; +} +EXPORT_SYMBOL(replace_page); Again, we could make the presence of this code selectable by subsystems which want it.
Re: [PATCH 3/4] add ksm kernel shared memory driver
On Tue, 11 Nov 2008 15:21:40 +0200 Izik Eidus [EMAIL PROTECTED] wrote: From: Izik Eidus [EMAIL PROTECTED] ksm is driver that allow merging identical pages between one or more applications in way unvisible to the application that use it. pages that are merged are marked as readonly and are COWed when any application try to change them. ksm is working by walking over the memory pages of the applications it scan in order to find identical pages. it uses an hash table to find in effective way the identical pages. when ksm find two identical pages, it marked them as readonly and merge them into single one page, after the pages are marked as readonly and merged into one page, linux will treat this pages as normal copy_on_write pages and will fork them when write access will happen to them. ksm scan just memory areas that were registred to be scanned by it. This driver appears to implement a new userspace interface, in /dev/ksm Please fully document that interface in the changelog so that we can review your decisions here. This is by far the most important consideration - we can change all the code, but interfaces are for ever. diff --git a/drivers/Kconfig b/drivers/Kconfig index d38f43f..c1c701f 100644 --- a/drivers/Kconfig +++ b/drivers/Kconfig @@ -105,4 +105,9 @@ source drivers/uio/Kconfig source drivers/xen/Kconfig source drivers/staging/Kconfig + +config KSM + bool KSM driver support + help + ksm is driver for merging identical pages between applciations s/is/is a/ applications is misspelt.
endmenu diff --git a/include/linux/ksm.h b/include/linux/ksm.h new file mode 100644 index 000..f873502 --- /dev/null +++ b/include/linux/ksm.h @@ -0,0 +1,53 @@ +#ifndef __LINUX_KSM_H +#define __LINUX_KSM_H + +/* + * Userspace interface for /dev/ksm - kvm shared memory + */ + +#include <asm/types.h> +#include <linux/ioctl.h> + +#define KSM_API_VERSION 1 + +/* for KSM_REGISTER_MEMORY_REGION */ +struct ksm_memory_region { + __u32 npages; /* number of pages to share */ + __u32 pad; + __u64 addr; /* the begining of the virtual address */ +}; + +struct ksm_user_scan { + __u32 pages_to_scan; + __u32 max_pages_to_merge; +}; + +struct ksm_kthread_info { + __u32 sleep; /* number of microsecoends to sleep */ + __u32 pages_to_scan; /* number of pages to scan */ + __u32 max_pages_to_merge; + __u32 running; +}; + +#define KSMIO 0xAB + +/* ioctls for /dev/ksm */ +#define KSM_GET_API_VERSION _IO(KSMIO, 0x00) +#define KSM_CREATE_SHARED_MEMORY_AREA _IO(KSMIO, 0x01) /* return SMA fd */ +#define KSM_CREATE_SCAN _IO(KSMIO, 0x02) /* return SCAN fd */ +#define KSM_START_STOP_KTHREAD _IOW(KSMIO, 0x03,\ + struct ksm_kthread_info) +#define KSM_GET_INFO_KTHREAD _IOW(KSMIO, 0x04,\ + struct ksm_kthread_info) + + +/* ioctls for SMA fds */ +#define KSM_REGISTER_MEMORY_REGION _IOW(KSMIO, 0x20,\ + struct ksm_memory_region) +#define KSM_REMOVE_MEMORY_REGION _IO(KSMIO, 0x21) + +/* ioctls for SCAN fds */ +#define KSM_SCAN _IOW(KSMIO, 0x40,\ + struct ksm_user_scan) + +#endif uh-oh, ioctls. Please do document that interface for us.
diff --git a/include/linux/miscdevice.h b/include/linux/miscdevice.h index 26433ec..adc2435 100644 --- a/include/linux/miscdevice.h +++ b/include/linux/miscdevice.h @@ -30,6 +30,7 @@ #define TUN_MINOR 200 #define HPET_MINOR 228 #define KVM_MINOR 232 +#define KSM_MINOR 233 struct device; diff --git a/mm/Kconfig b/mm/Kconfig index 5b5790f..e7f0061 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -222,3 +222,6 @@ config UNEVICTABLE_LRU config MMU_NOTIFIER bool + +config KSM + bool diff --git a/mm/Makefile b/mm/Makefile index c06b45a..9722afe 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -26,6 +26,7 @@ obj-$(CONFIG_TMPFS_POSIX_ACL) += shmem_acl.o obj-$(CONFIG_TINY_SHMEM) += tiny-shmem.o obj-$(CONFIG_SLOB) += slob.o obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o +obj-$(CONFIG_KSM) += ksm.o obj-$(CONFIG_SLAB) += slab.o obj-$(CONFIG_SLUB) += slub.o obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o diff --git a/mm/ksm.c b/mm/ksm.c new file mode 100644 index 000..977eb37 --- /dev/null +++ b/mm/ksm.c @@ -0,0 +1,1202 @@ +/* + * Memory merging driver for Linux + * + * This module enables dynamic sharing of identical pages found in different + * memory areas, even if they are not shared by fork() + * + * Copyright (C) 2008 Red Hat, Inc. + * Authors: + * Izik Eidus + * Andrea Arcangeli + * Chris Wright + * + * This work is licensed under the terms of the GNU GPL, version 2. +
Re: [PATCH 1/4] rmap: add page_wrprotect() function,
On Tue, 11 Nov 2008 21:38:06 +0100 Andrea Arcangeli [EMAIL PROTECTED] wrote: + * set all the ptes pointed to a page as read only, + * odirect_sync is set to 0 in case we cannot protect against race with odirect + * return the number of ptes that were set as read only + * (ptes that were read only before this function was called are couned as well) + */ But it isn't. What isn't? This code comment had the kerneldoc marker (/**) but it isn't a kerneldoc comment. I don't understand this odirect_sync thing. What race? Please expand this comment to make the function of odirect_sync more understandable. I should have answered this one with the above 3 links. OK, well can we please update the code so these things are clearer. (It's a permanent problem I have. I ask what is this, but I really mean the code should be changed so that readers will know what this is) What do you think about making all this new code dependent upon some CONFIG_ switch which CONFIG_KVM can select? I like that too.
Re: mmotm 2008-09-22-01-36 uploaded (kvm)
On Mon, 22 Sep 2008 11:19:47 -0700 Randy Dunlap [EMAIL PROTECTED] wrote: On Mon, 22 Sep 2008 01:38:58 -0700 [EMAIL PROTECTED] wrote: The mm-of-the-moment snapshot 2008-09-22-01-36 has been uploaded to http://userweb.kernel.org/~akpm/mmotm/ It contains the following patches against 2.6.27-rc7: ERROR: intel_iommu_found [arch/x86/kvm/kvm.ko] undefined! when CONFIG_DMAR=n. Yes, I got that too. i386 allmodconfig fails similarly. It's still unclear which tree in linux-next broke it. PCI or async_tx, perhaps.
Re: mmotm 2008-09-22-01-36 uploaded (kvm)
On Mon, 22 Sep 2008 14:58:52 -0700 Jesse Barnes [EMAIL PROTECTED] wrote: On Monday, September 22, 2008 12:52 pm Andrew Morton wrote: On Mon, 22 Sep 2008 11:19:47 -0700 Randy Dunlap [EMAIL PROTECTED] wrote: On Mon, 22 Sep 2008 01:38:58 -0700 [EMAIL PROTECTED] wrote: The mm-of-the-moment snapshot 2008-09-22-01-36 has been uploaded to http://userweb.kernel.org/~akpm/mmotm/ It contains the following patches against 2.6.27-rc7: ERROR: intel_iommu_found [arch/x86/kvm/kvm.ko] undefined! when CONFIG_DMAR=n. Yes, I got that too. i386 allmodconfig fails similarly. It's still unclear which tree in linux-next broke it. PCI or async_tx, perhaps. I don't have that symbol in my tree (yet). It's part of one of the IOMMU KVM patch sets that's currently in-flight. I'll add it to my linux-next tree today or tomorrow, assuming I get an ack from David about it (I'll bounce it over to him now). It's already there, and exported in linux-next. But it just doesn't get compiled in due to some Kconfig snafu.
linux-next: ia64 allmodconfig
arch/ia64/kvm/../../../virt/kvm/ioapic.c:42:17: irq.h: No such file or directory arch/ia64/kvm/../../../virt/kvm/ioapic.c: In function `__kvm_ioapic_update_eoi': arch/ia64/kvm/../../../virt/kvm/ioapic.c:296: error: implicit declaration of function `kvm_notify_acked_irq' arch/ia64/kvm/../../../virt/kvm/ioapic.c: In function `ioapic_mmio_write': arch/ia64/kvm/../../../virt/kvm/ioapic.c:389: error: too few arguments to functi