Re: [PATCH v3] mm: thp: Set the accessed flag for old pages on access fault.
On 10/26/2012 05:34 PM, Will Deacon wrote:
> On Fri, Oct 26, 2012 at 07:19:55AM +0100, Ni zhan Chen wrote:
>> On 10/26/2012 12:44 AM, Will Deacon wrote:
>>> On x86, memory accesses to pages without the ACCESSED flag set result
>>> in the ACCESSED flag being set automatically. With the ARM architecture
>>> a page access fault is raised instead (and it will continue to be
>>> raised until the ACCESSED flag is set for the appropriate PTE/PMD).
>>>
>>> For normal memory pages, handle_pte_fault will call pte_mkyoung
>>> (effectively setting the ACCESSED flag). For transparent huge pages,
>>> pmd_mkyoung will only be called for a write fault.
>>>
>>> This patch ensures that faults on transparent hugepages which do not
>>> result in a CoW update the access flags for the faulting pmd.
>>
>> Could you write a changelog?
>
> From v2? I included something below my SoB. The code should do exactly
> the same as before, it's just rebased onto next so that I can play
> nicely with Peter's patches.
>
>>> Cc: Chris Metcalf
>>> Cc: Kirill A. Shutemov
>>> Cc: Andrea Arcangeli
>>> Signed-off-by: Will Deacon
>>> ---
>>> Ok chaps, I rebased this thing onto today's next (which basically
>>> necessitated a rewrite) so I've reluctantly dropped my acks and kindly
>>> ask if you could eyeball the new code, especially where the locking is
>>> concerned. In the numa code (do_huge_pmd_prot_none), Peter checks again
>>> that the page is not splitting, but I can't see why that is required.
>>>
>>> Cheers,
>>>
>>> Will
>>
>> Could you explain why you do not call pmd_trans_huge_lock to confirm
>> whether the pmd is splitting or stable, as Andrea pointed out?
>
> The way handle_mm_fault is now structured after the numa changes means
> that we only enter the huge pmd page aging code if the entry wasn't
> splitting

Why do you call it huge pmd page *aging* code?

Regards,
Chen

> before taking the lock, so it seemed a bit gratuitous to jump through
> those hoops again in pmd_trans_huge_lock.
> Will

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state
On 10/26/2012 04:02 PM, Fengguang Wu wrote:
> On Fri, Oct 26, 2012 at 03:47:19PM +0800, Ni zhan Chen wrote:
>> On 10/26/2012 03:36 PM, Fengguang Wu wrote:
>>> On Fri, Oct 26, 2012 at 03:19:57PM +0800, Ni zhan Chen wrote:
>>>> On 10/26/2012 03:09 PM, Fengguang Wu wrote:
>>>>> On Fri, Oct 26, 2012 at 03:03:12PM +0800, Ni zhan Chen wrote:
>>>>>> On 10/26/2012 02:58 PM, Fengguang Wu wrote:
>>>>>>>>> static void shrink_readahead_size_eio(struct file *filp,
>>>>>>>>>                                       struct file_ra_state *ra)
>>>>>>>>> {
>>>>>>>>> -       ra->ra_pages /= 4;
>>>>>>>>> +       spin_lock(&filp->f_lock);
>>>>>>>>> +       filp->f_mode |= FMODE_RANDOM;
>>>>>>>>> +       spin_unlock(&filp->f_lock);
>>>>>>>>
>>>>>>>> As the example in the comment above this function shows, the read
>>>>>>>> may still be sequential, and it will waste IO bandwidth if we
>>>>>>>> switch to FMODE_RANDOM directly.
>>>>>>>
>>>>>>> Yes, immediately disabling readahead may hurt IO performance; the
>>>>>>> original '/ 4' may perform better when there are only 1-3 IO errors
>>>>>>> encountered.
>>>>>>
>>>>>> Hi Fengguang,
>>>>>>
>>>>>> Why should the number be 1-3?
>>>>>
>>>>> The original behavior is '/= 4' on each error.
>>>>>
>>>>> After 1 error:  readahead size is shrunk to 1/4
>>>>> After 2 errors: readahead size is shrunk to 1/16
>>>>> After 3 errors: readahead size is shrunk to 1/64
>>>>> After 4 errors: readahead size is effectively 0 (disabled)
>>>>
>>>> But from the function shrink_readahead_size_eio and its caller
>>>> filemap_fault I can't find the behavior you mentioned. How did you
>>>> figure it out?
>>>
>>> It's this line in shrink_readahead_size_eio():
>>>
>>>         ra->ra_pages /= 4;
>>
>> Yeah, I mean why will the 4th readahead size be 0 (disabled)? What's
>> the original value of ra->ra_pages? How can we guarantee the 4th shrunk
>> readahead size will be 0?
>
> Ah OK, I'm talking about the typical case. The default readahead size
> is 128k, which will become 0 after '/ 256'. The reasonably good ra size
> for hard disks is 1MB = 256 pages, which also becomes 1 page after 4
> errors.

Then why is the default size not set to the reasonable size?
Regards,
Chen

> Thanks,
> Fengguang
Re: MMTests 0.06
On 10/12/2012 10:51 PM, Mel Gorman wrote:
> MMTests 0.06 is a configurable test suite that runs a number of common
> workloads of interest to MM developers.
>
> There are multiple additions but in many respects the most useful will
> be automatic package installation. The package names are based on
> openSUSE but it's easy to create mappings in bin/install-depends where
> the package names differ. The very basics of monitoring NUMA efficiency
> are there as well and the autonuma benchmark has a test. The stats it
> reports for NUMA need significant improvement but for the most part that
> should be straightforward.
>
> Changelog since v0.05
> o Automatically install packages (need name mappings for other distros)
> o Add benchmark for autonumabench
> o Add support for benchmarking NAS with MPI
> o Add pgbench for autonumabench (may need a bit more work)
> o Upgrade postgres version to 9.2.1
> o Upgrade kernel version used for kernbench to 3.0 for newer toolchains
> o Alter mailserver config to finish in a reasonable time
> o Add monitor for perf sched
> o Add monitor that gathers ftrace information with trace-cmd
> o Add preliminary monitors for NUMA stats (very basic)
> o Specify ftrace events to monitor from config file
> o Remove the bulk of what's left of VMRegress
> o Convert shellpacks to a template format to auto-generate boilerplate code
> o Collect lock_stat information if enabled
> o Run multiple iterations of aim9
> o Add basic regression tests for Cross Memory Attach
> o Cope with preempt being enabled in highalloc stress tests
> o Have largedd cope with a missing large file to work with
> o Add a monitor-only mode to just capture logs
> o Report receive-side throughput in netperf results
>
> At LSF/MM at some point a request was made that a series of tests be
> identified that were of interest to MM developers and that could be used
> for testing the Linux memory management subsystem. There is renewed
> interest in some sort of general testing framework during discussions
> for Kernel Summit 2012 so here is what I use.
> http://www.csn.ul.ie/~mel/projects/mmtests/
> http://www.csn.ul.ie/~mel/projects/mmtests/mmtests-0.06-mmtests-0.01.tar.gz
>
> There are a number of stock configurations stored in configs/. For
> example, config-global-dhp__pagealloc-performance runs a number of tests
> that may be able to identify performance regressions or gains in the
> page allocator. Similarly there are network and scheduler configs. There
> are also more complex options: config-global-dhp__parallelio-memcachetest
> will run memcachetest in the foreground while doing IO of different
> sizes in the background to measure how much unrelated IO affects the
> throughput of an in-memory database.
>
> This release is also a little rough and the extraction scripts could
> have been tidier but they were mostly written in an airport and for the
> most part they work as advertised. I'll fix bugs as they are brought to
> my attention. The stats reporting still needs work because while some
> tests know how to make a better estimate of the mean by filtering
> outliers, it is not being handled consistently and the methodology needs
> work. I know filtering statistics like this is a major flaw in the
> methodology but the decision was made in the interest of having
> benchmarks with unstable results complete in a reasonable time.

Hi Gorman,

Could MMTests 0.07 auto-download related packages for different
distributions?

Regards,
Chen
Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state
On 10/26/2012 03:36 PM, Fengguang Wu wrote:
> On Fri, Oct 26, 2012 at 03:19:57PM +0800, Ni zhan Chen wrote:
>> On 10/26/2012 03:09 PM, Fengguang Wu wrote:
>>> On Fri, Oct 26, 2012 at 03:03:12PM +0800, Ni zhan Chen wrote:
>>>> On 10/26/2012 02:58 PM, Fengguang Wu wrote:
>>>>>>> static void shrink_readahead_size_eio(struct file *filp,
>>>>>>>                                       struct file_ra_state *ra)
>>>>>>> {
>>>>>>> -       ra->ra_pages /= 4;
>>>>>>> +       spin_lock(&filp->f_lock);
>>>>>>> +       filp->f_mode |= FMODE_RANDOM;
>>>>>>> +       spin_unlock(&filp->f_lock);
>>>>>>
>>>>>> As the example in the comment above this function shows, the read
>>>>>> may still be sequential, and it will waste IO bandwidth if we
>>>>>> switch to FMODE_RANDOM directly.
>>>>>
>>>>> Yes, immediately disabling readahead may hurt IO performance; the
>>>>> original '/ 4' may perform better when there are only 1-3 IO errors
>>>>> encountered.
>>>>
>>>> Hi Fengguang,
>>>>
>>>> Why should the number be 1-3?
>>>
>>> The original behavior is '/= 4' on each error.
>>>
>>> After 1 error:  readahead size is shrunk to 1/4
>>> After 2 errors: readahead size is shrunk to 1/16
>>> After 3 errors: readahead size is shrunk to 1/64
>>> After 4 errors: readahead size is effectively 0 (disabled)
>>
>> But from the function shrink_readahead_size_eio and its caller
>> filemap_fault I can't find the behavior you mentioned. How did you
>> figure it out?
>
> It's this line in shrink_readahead_size_eio():
>
>         ra->ra_pages /= 4;

Yeah, I mean why will the 4th readahead size be 0 (disabled)? What's the
original value of ra->ra_pages? How can we guarantee the 4th shrunk
readahead size will be 0?

Regards,
Chen

> That ra_pages will keep shrinking by 4 on each error. The only way to
> restore it is to reopen the file, or POSIX_FADV_SEQUENTIAL.
>
> Thanks,
> Fengguang
Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state
On 10/26/2012 03:09 PM, Fengguang Wu wrote:
> On Fri, Oct 26, 2012 at 03:03:12PM +0800, Ni zhan Chen wrote:
>> On 10/26/2012 02:58 PM, Fengguang Wu wrote:
>>>>> static void shrink_readahead_size_eio(struct file *filp,
>>>>>                                       struct file_ra_state *ra)
>>>>> {
>>>>> -       ra->ra_pages /= 4;
>>>>> +       spin_lock(&filp->f_lock);
>>>>> +       filp->f_mode |= FMODE_RANDOM;
>>>>> +       spin_unlock(&filp->f_lock);
>>>>
>>>> As the example in the comment above this function shows, the read may
>>>> still be sequential, and it will waste IO bandwidth if we switch to
>>>> FMODE_RANDOM directly.
>>>
>>> Yes, immediately disabling readahead may hurt IO performance; the
>>> original '/ 4' may perform better when there are only 1-3 IO errors
>>> encountered.
>>
>> Hi Fengguang,
>>
>> Why should the number be 1-3?
>
> The original behavior is '/= 4' on each error.
>
> After 1 error:  readahead size is shrunk to 1/4
> After 2 errors: readahead size is shrunk to 1/16
> After 3 errors: readahead size is shrunk to 1/64
> After 4 errors: readahead size is effectively 0 (disabled)

But from the function shrink_readahead_size_eio and its caller
filemap_fault I can't find the behavior you mentioned. How did you
figure it out?

Regards,
Chen

> Thanks,
> Fengguang
Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state
On 10/26/2012 02:58 PM, Fengguang Wu wrote:
>>> static void shrink_readahead_size_eio(struct file *filp,
>>>                                       struct file_ra_state *ra)
>>> {
>>> -       ra->ra_pages /= 4;
>>> +       spin_lock(&filp->f_lock);
>>> +       filp->f_mode |= FMODE_RANDOM;
>>> +       spin_unlock(&filp->f_lock);
>>
>> As the example in the comment above this function shows, the read may
>> still be sequential, and it will waste IO bandwidth if we switch to
>> FMODE_RANDOM directly.
>
> Yes, immediately disabling readahead may hurt IO performance; the
> original '/ 4' may perform better when there are only 1-3 IO errors
> encountered.

Hi Fengguang,

Why should the number be 1-3?

Regards,
Chen

> Thanks,
> Fengguang

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majord...@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: em...@kvack.org
Re: [PATCH v3] mm: thp: Set the accessed flag for old pages on access fault.
On 10/26/2012 12:44 AM, Will Deacon wrote:
> On x86, memory accesses to pages without the ACCESSED flag set result in
> the ACCESSED flag being set automatically. With the ARM architecture a
> page access fault is raised instead (and it will continue to be raised
> until the ACCESSED flag is set for the appropriate PTE/PMD).
>
> For normal memory pages, handle_pte_fault will call pte_mkyoung
> (effectively setting the ACCESSED flag). For transparent huge pages,
> pmd_mkyoung will only be called for a write fault.
>
> This patch ensures that faults on transparent hugepages which do not
> result in a CoW update the access flags for the faulting pmd.

Could you write a changelog?

> Cc: Chris Metcalf
> Cc: Kirill A. Shutemov
> Cc: Andrea Arcangeli
> Signed-off-by: Will Deacon
> ---
> Ok chaps, I rebased this thing onto today's next (which basically
> necessitated a rewrite) so I've reluctantly dropped my acks and kindly
> ask if you could eyeball the new code, especially where the locking is
> concerned. In the numa code (do_huge_pmd_prot_none), Peter checks again
> that the page is not splitting, but I can't see why that is required.
>
> Cheers,
>
> Will

Could you explain why you do not call pmd_trans_huge_lock to confirm
whether the pmd is splitting or stable, as Andrea pointed out?
>  include/linux/huge_mm.h |    4
>  mm/huge_memory.c        |   22 ++
>  mm/memory.c             |    7 ++-
>  3 files changed, 32 insertions(+), 1 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 4f0f948..766fb27 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -8,6 +8,10 @@ extern int do_huge_pmd_anonymous_page(struct mm_struct *mm,
>  extern int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  			 pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
>  			 struct vm_area_struct *vma);
> +extern void huge_pmd_set_accessed(struct mm_struct *mm,
> +				  struct vm_area_struct *vma,
> +				  unsigned long address, pmd_t *pmd,
> +				  pmd_t orig_pmd, int dirty);
>  extern int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
> 			       unsigned long address, pmd_t *pmd,
> 			       pmd_t orig_pmd);
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 3c14a96..f024d98 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -932,6 +932,28 @@ out:
>  	return ret;
>  }
>
> +void huge_pmd_set_accessed(struct mm_struct *mm,
> +			   struct vm_area_struct *vma,
> +			   unsigned long address,
> +			   pmd_t *pmd, pmd_t orig_pmd,
> +			   int dirty)
> +{
> +	pmd_t entry;
> +	unsigned long haddr;
> +
> +	spin_lock(&mm->page_table_lock);
> +	if (unlikely(!pmd_same(*pmd, orig_pmd)))
> +		goto unlock;
> +
> +	entry = pmd_mkyoung(orig_pmd);
> +	haddr = address & HPAGE_PMD_MASK;
> +	if (pmdp_set_access_flags(vma, haddr, pmd, entry, dirty))
> +		update_mmu_cache_pmd(vma, address, pmd);
> +
> +unlock:
> +	spin_unlock(&mm->page_table_lock);
> +}
> +
>  static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
>  					struct vm_area_struct *vma,
>  					unsigned long address,
> diff --git a/mm/memory.c b/mm/memory.c
> index f21ac1c..bcbc084 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3650,12 +3650,14 @@ retry:
>  		barrier();
>  		if (pmd_trans_huge(orig_pmd) && !pmd_trans_splitting(orig_pmd)) {
> +			unsigned int dirty = flags & FAULT_FLAG_WRITE;
> +
>  			if (pmd_numa(vma, orig_pmd)) {
>  				do_huge_pmd_numa_page(mm, vma, address, pmd,
>  						      flags, orig_pmd);
>  			}
>
> -			if ((flags & FAULT_FLAG_WRITE) && !pmd_write(orig_pmd)) {
> +			if (dirty && !pmd_write(orig_pmd)) {
>  				ret = do_huge_pmd_wp_page(mm, vma, address, pmd,
>  							  orig_pmd);
>  				/*
> @@ -3665,6 +3667,9 @@ retry:
>  				 */
>  				if (unlikely(ret & VM_FAULT_OOM))
>  					goto retry;
> +			} else {
> +				huge_pmd_set_accessed(mm, vma, address, pmd,
> +						      orig_pmd, dirty);
>  			}
>
>  			return ret;
Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state
On 10/26/2012 11:28 AM, YingHang Zhu wrote: On Fri, Oct 26, 2012 at 10:30 AM, Ni zhan Chen wrote: On 10/26/2012 09:27 AM, Fengguang Wu wrote: On Fri, Oct 26, 2012 at 11:25:44AM +1100, Dave Chinner wrote: On Thu, Oct 25, 2012 at 10:58:26AM +0800, Fengguang Wu wrote: Hi Chen, But how can the bdi-wide ra_pages reflect different files' readahead windows? These different files may be read sequentially, randomly, and so on. It's simple: sequential reads will get ra_pages readahead size while random reads will not get readahead at all. Talking about the chunk below, it might hurt someone that explicitly takes advantage of the behavior, however the ra_pages*2 seems more like a hack than a general solution to me: if the user needs POSIX_FADV_SEQUENTIAL to double the max readahead window size for improving IO performance, then why not just increase bdi->ra_pages and benefit all reads? One may argue that it offers some differential behavior to specific applications, however it may also present as a counter-optimization: if the root already tuned bdi->ra_pages to the optimal size, the doubled readahead size will only cost more memory and perhaps IO latency. --- a/mm/fadvise.c +++ b/mm/fadvise.c @@ -87,7 +86,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice) spin_unlock(&file->f_lock); break; case POSIX_FADV_SEQUENTIAL: - file->f_ra.ra_pages = bdi->ra_pages * 2; I think we really have to reset file->f_ra.ra_pages here as it is not a set-and-forget value. e.g. shrink_readahead_size_eio() can reduce ra_pages as a result of IO errors. Hence if you have had IO errors, telling the kernel that you are now going to do sequential IO should reset the readahead to the maximum supported ra_pages value. Good point! But wait: this patch removes file->f_ra.ra_pages in all the other places too, so there will be no file->f_ra.ra_pages left to reset here... 
In his patch, static void shrink_readahead_size_eio(struct file *filp, struct file_ra_state *ra) { - ra->ra_pages /= 4; + spin_lock(&filp->f_lock); + filp->f_mode |= FMODE_RANDOM; + spin_unlock(&filp->f_lock); As the example in the comment above this function shows, the read may still be sequential, and it will waste IO bandwidth if we switch to FMODE_RANDOM directly. I've considered this. On the first try I modified file_ra_state.size and file_ra_state.async_size directly, like file_ra_state.async_size = 0; file_ra_state.size /= 4; but as I commented here, we cannot predict whether the bad sectors will trash the readahead window; maybe the sectors following the current one are fine to go through normal readahead. It's hard to know, and the current approach gives us a chance to slow down softly. Then when will filp->f_mode & FMODE_RANDOM be checked? Will it influence ra->ra_pages? Thanks, Ying Zhu Thanks, Fengguang
Re: [PATCH v3] mm: thp: Set the accessed flag for old pages on access fault.
On 10/26/2012 03:51 AM, Johannes Weiner wrote: On Thu, Oct 25, 2012 at 05:44:31PM +0100, Will Deacon wrote: On x86, memory accesses to pages without the ACCESSED flag set result in the ACCESSED flag being set automatically. With the ARM architecture a page access fault is raised instead (and it will continue to be raised until the ACCESSED flag is set for the appropriate PTE/PMD). For normal memory pages, handle_pte_fault will call pte_mkyoung (effectively setting the ACCESSED flag). For transparent huge pages, pmd_mkyoung will only be called for a write fault. This patch ensures that faults on transparent hugepages which do not result in a CoW update the access flags for the faulting pmd. Cc: Chris Metcalf cmetc...@tilera.com Cc: Kirill A. Shutemov kir...@shutemov.name Cc: Andrea Arcangeli aarca...@redhat.com Signed-off-by: Will Deacon will.dea...@arm.com Acked-by: Johannes Weiner han...@cmpxchg.org Ok chaps, I rebased this thing onto today's next (which basically necessitated a rewrite) so I've reluctantly dropped my acks and kindly ask if you could eyeball the new code, especially where the locking is concerned. In the numa code (do_huge_pmd_prot_none), Peter checks again that the page is not splitting, but I can't see why that is required. I don't either. If the thing was splitting when the fault happened, that path is not taken. And the locked pmd_same() check should rule out splitting setting in after testing pmd_trans_huge_splitting(). Why can't I find the function pmd_trans_huge_splitting() you mentioned in the latest mainline code or in linux-next? Peter?
Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state
On 10/26/2012 09:27 AM, Fengguang Wu wrote: On Fri, Oct 26, 2012 at 11:25:44AM +1100, Dave Chinner wrote: On Thu, Oct 25, 2012 at 10:58:26AM +0800, Fengguang Wu wrote: Hi Chen, But how can the bdi-wide ra_pages reflect different files' readahead windows? These different files may be read sequentially, randomly, and so on. It's simple: sequential reads will get ra_pages readahead size while random reads will not get readahead at all. Talking about the chunk below, it might hurt someone that explicitly takes advantage of the behavior, however the ra_pages*2 seems more like a hack than a general solution to me: if the user needs POSIX_FADV_SEQUENTIAL to double the max readahead window size for improving IO performance, then why not just increase bdi->ra_pages and benefit all reads? One may argue that it offers some differential behavior to specific applications, however it may also present as a counter-optimization: if the root already tuned bdi->ra_pages to the optimal size, the doubled readahead size will only cost more memory and perhaps IO latency. --- a/mm/fadvise.c +++ b/mm/fadvise.c @@ -87,7 +86,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice) spin_unlock(&file->f_lock); break; case POSIX_FADV_SEQUENTIAL: - file->f_ra.ra_pages = bdi->ra_pages * 2; I think we really have to reset file->f_ra.ra_pages here as it is not a set-and-forget value. e.g. shrink_readahead_size_eio() can reduce ra_pages as a result of IO errors. Hence if you have had IO errors, telling the kernel that you are now going to do sequential IO should reset the readahead to the maximum supported ra_pages value. Good point! But wait: this patch removes file->f_ra.ra_pages in all the other places too, so there will be no file->f_ra.ra_pages left to reset here... 
In his patch, static void shrink_readahead_size_eio(struct file *filp, struct file_ra_state *ra) { - ra->ra_pages /= 4; + spin_lock(&filp->f_lock); + filp->f_mode |= FMODE_RANDOM; + spin_unlock(&filp->f_lock); As the example in the comment above this function shows, the read may still be sequential, and it will waste IO bandwidth if we switch to FMODE_RANDOM directly. Thanks, Fengguang
Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state
On 10/26/2012 08:25 AM, Dave Chinner wrote: On Thu, Oct 25, 2012 at 10:58:26AM +0800, Fengguang Wu wrote: Hi Chen, But how can the bdi-wide ra_pages reflect different files' readahead windows? These different files may be read sequentially, randomly, and so on. It's simple: sequential reads will get ra_pages readahead size while random reads will not get readahead at all. Talking about the chunk below, it might hurt someone that explicitly takes advantage of the behavior, however the ra_pages*2 seems more like a hack than a general solution to me: if the user needs POSIX_FADV_SEQUENTIAL to double the max readahead window size for improving IO performance, then why not just increase bdi->ra_pages and benefit all reads? One may argue that it offers some differential behavior to specific applications, however it may also present as a counter-optimization: if the root already tuned bdi->ra_pages to the optimal size, the doubled readahead size will only cost more memory and perhaps IO latency. --- a/mm/fadvise.c +++ b/mm/fadvise.c @@ -87,7 +86,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice) spin_unlock(&file->f_lock); break; case POSIX_FADV_SEQUENTIAL: - file->f_ra.ra_pages = bdi->ra_pages * 2; I think we really have to reset file->f_ra.ra_pages here as it is not a set-and-forget value. e.g. shrink_readahead_size_eio() can reduce ra_pages as a result of IO errors. Hence if you have had IO errors, telling the kernel that you are now going to do sequential IO should reset the readahead to the maximum supported ra_pages value. Good catch! Cheers, Dave.
Re: shmem_getpage_gfp VM_BUG_ON triggered. [3.7rc2]
On 10/26/2012 05:48 AM, Hugh Dickins wrote: On Thu, 25 Oct 2012, Johannes Weiner wrote: On Wed, Oct 24, 2012 at 09:36:27PM -0700, Hugh Dickins wrote: On Wed, 24 Oct 2012, Dave Jones wrote: Machine under significant load (4gb memory used, swap usage fluctuating) triggered this... WARNING: at mm/shmem.c:1151 shmem_getpage_gfp+0xa5c/0xa70() Pid: 29795, comm: trinity-child4 Not tainted 3.7.0-rc2+ #49 1148 error = shmem_add_to_page_cache(page, mapping, index, 1149 gfp, swp_to_radix_entry(swap)); 1150 /* We already confirmed swap, and make no allocation */ 1151 VM_BUG_ON(error); 1152 } That's very surprising. Easy enough to handle an error there, but of course I made it a VM_BUG_ON because it violates my assumptions: I rather need to understand how this can be, and I've no idea. Could it be concurrent truncation clearing out the entry between shmem_confirm_swap() and shmem_add_to_page_cache()? I don't see anything preventing that. The empty slot would not match the expected swap entry this call passes in and the returned error would be -ENOENT. Excellent notion, many thanks Hannes, I believe you've got it. I've hit that truncation problem in swapoff (and commented on it in shmem_unuse_inode), but never hit it or considered it here. I think of the page lock as holding it stable, but truncation's free_swap_and_cache only does a trylock on the swapcache page, so we're not secured against that possibility. Hi Hugh, Even though free_swap_and_cache only does a trylock on the swapcache page, it doesn't call delete_from_swap_cache, so the associated entry should still be there; I am interested in what protection you have already introduced here. So I'd like to change it to VM_BUG_ON(error && error != -ENOENT), but there's a little tidying up to do in the -ENOENT case, which Do you mean radix_tree_insert will return -ENOENT if the associated entry is not present? Why can't I find this return value in the function radix_tree_insert? needs more thought. 
A delete_from_swap_cache(page) - though we can be lazy and leave that to reclaim for such a rare occurrence - and probably a mem_cgroup uncharge; but the memcg hooks are always the hardest to get right, so I'll have to think about that one carefully. Hugh
Re: shmem_getpage_gfp VM_BUG_ON triggered. [3.7rc2]
On 10/26/2012 05:27 AM, Hugh Dickins wrote: On Thu, 25 Oct 2012, Ni zhan Chen wrote: On 10/25/2012 02:59 PM, Hugh Dickins wrote: On Thu, 25 Oct 2012, Ni zhan Chen wrote: I think it may be caused by your commit [d189922862e03ce: shmem: fix negative rss in memcg memory.stat], one question: Well, yes, I added the VM_BUG_ON in that commit. If the function shmem_confirm_swap confirms the entry has already been brought back from swap by a racing thread, The reverse: true confirms that the swap entry has not been brought back from swap by a racing thread; false indicates that there has been a race. then why call shmem_add_to_page_cache to add the page from swapcache to pagecache again? Adding it to pagecache again, after such a race, would set error to -EEXIST (originating from radix_tree_insert); but we don't do that, we add it to pagecache when it has not already been added. Or that's the intention: but Dave seems to have found an unexpected exception, despite us holding the page lock across all this. (But if it weren't for the memcg and replace_page issues, I'd much prefer to let shmem_add_to_page_cache discover the race as before.) Hugh Hi Hugh, Thanks for your response. You mean the -EEXIST originating from radix_tree_insert, in radix_tree_insert: if (slot != NULL) return -EEXIST; But why should slot be NULL? If there is no race, the pagecache-related radix tree entry should be RADIX_TREE_EXCEPTIONAL_ENTRY+swap_entry_t.val; what am I missing? I was describing what would happen in a case that should not exist, which you had thought was the common case. In actuality, the entry should not be NULL; it should be as you say there. Thanks for your patience. So in the common case the entry should be the value I mentioned; then why does this check exist? if (slot != NULL) return -EEXIST; It looks like the common case will return -EEXIST. 
Hugh
Re: shmem_getpage_gfp VM_BUG_ON triggered. [3.7rc2]
On 10/25/2012 02:59 PM, Hugh Dickins wrote: On Thu, 25 Oct 2012, Ni zhan Chen wrote: On 10/25/2012 12:36 PM, Hugh Dickins wrote: On Wed, 24 Oct 2012, Dave Jones wrote: Machine under significant load (4gb memory used, swap usage fluctuating) triggered this... WARNING: at mm/shmem.c:1151 shmem_getpage_gfp+0xa5c/0xa70() Pid: 29795, comm: trinity-child4 Not tainted 3.7.0-rc2+ #49 1148 error = shmem_add_to_page_cache(page, mapping, index, 1149 gfp, swp_to_radix_entry(swap)); 1150 /* We already confirmed swap, and make no allocation */ 1151 VM_BUG_ON(error); 1152 } That's very surprising. Easy enough to handle an error there, but of course I made it a VM_BUG_ON because it violates my assumptions: I rather need to understand how this can be, and I've no idea. Clutching at straws, I expect this is entirely irrelevant, but: there isn't a warning on line 1151 of mm/shmem.c in 3.7.0-rc2 nor in current linux.git; rather, there's a VM_BUG_ON on line 1149. So you've inserted a couple of lines for some reason (more useful trinity behaviour, perhaps)? And have some config option I'm unfamiliar with, that mutates a BUG_ON or VM_BUG_ON into a warning? Hi Hugh, I think it may be caused by your commit [d189922862e03ce: shmem: fix negative rss in memcg memory.stat], one question: Well, yes, I added the VM_BUG_ON in that commit. If the function shmem_confirm_swap confirms the entry has already been brought back from swap by a racing thread, The reverse: true confirms that the swap entry has not been brought back from swap by a racing thread; false indicates that there has been a race. then why call shmem_add_to_page_cache to add the page from swapcache to pagecache again? Adding it to pagecache again, after such a race, would set error to -EEXIST (originating from radix_tree_insert); but we don't do that, we add it to pagecache when it has not already been added. Or that's the intention: but Dave seems to have found an unexpected exception, despite us holding the page lock across all this. 
(But if it weren't for the memcg and replace_page issues, I'd much prefer to let shmem_add_to_page_cache discover the race as before.) Hugh Hi Hugh, Thanks for your response. You mean the -EEXIST originating from radix_tree_insert, in radix_tree_insert: if (slot != NULL) return -EEXIST; But why should slot be NULL? If there is no race, the pagecache-related radix tree entry should be RADIX_TREE_EXCEPTIONAL_ENTRY+swap_entry_t.val; what am I missing? Regards, Chen Otherwise, will it goto unlock and then go to repeat? What am I missing? Regards, Chen
Re: shmem_getpage_gfp VM_BUG_ON triggered. [3.7rc2]
On 10/25/2012 02:59 PM, Hugh Dickins wrote: On Thu, 25 Oct 2012, Ni zhan Chen wrote: On 10/25/2012 12:36 PM, Hugh Dickins wrote: On Wed, 24 Oct 2012, Dave Jones wrote: Machine under significant load (4gb memory used, swap usage fluctuating) triggered this... WARNING: at mm/shmem.c:1151 shmem_getpage_gfp+0xa5c/0xa70() Pid: 29795, comm: trinity-child4 Not tainted 3.7.0-rc2+ #49 1148 error = shmem_add_to_page_cache(page, mapping, index, 1149 gfp, swp_to_radix_entry(swap)); 1150 /* We already confirmed swap, and make no allocation */ 1151 VM_BUG_ON(error); 1152 } That's very surprising. Easy enough to handle an error there, but of course I made it a VM_BUG_ON because it violates my assumptions: I rather need to understand how this can be, and I've no idea. Clutching at straws, I expect this is entirely irrelevant, but: there isn't a warning on line 1151 of mm/shmem.c in 3.7.0-rc2 nor in current linux.git; rather, there's a VM_BUG_ON on line 1149. So you've inserted a couple of lines for some reason (more useful trinity behaviour, perhaps)? And have some config option I'm unfamiliar with, that mutates a BUG_ON or VM_BUG_ON into a warning? Hi Hugh, I think it maybe caused by your commit [d189922862e03ce: shmem: fix negative rss in memcg memory.stat], one question: Well, yes, I added the VM_BUG_ON in that commit. if function shmem_confirm_swap confirm the entry has already brought back from swap by a racing thread, The reverse: true confirms that the swap entry has not been brought back from swap by a racing thread; false indicates that there has been a race. then why call shmem_add_to_page_cache to add page from swapcache to pagecache again? Adding it to pagecache again, after such a race, would set error to -EEXIST (originating from radix_tree_insert); but we don't do that, we add it to pagecache when it has not already been added. Or that's the intention: but Dave seems to have found an unexpected exception, despite us holding the page lock across all this. 
(But if it weren't for the memcg and replace_page issues, I'd much prefer to let shmem_add_to_page_cache discover the race as before.) Hugh Hi Hugh Thanks for your response. You mean the -EEXIST originating from radix_tree_insert, in radix_tree_insert: if (slot != NULL) return -EEXIST; But why slot should be NULL? if no race, the pagecache related radix tree entry should be RADIX_TREE_EXCEPTIONAL_ENTRY+swap_entry_t.val, where I miss? Regards, Chen otherwise, will goto unlock and then go to repeat? where I miss? Regards, Chen -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: shmem_getpage_gfp VM_BUG_ON triggered. [3.7rc2]
On 10/26/2012 05:27 AM, Hugh Dickins wrote: On Thu, 25 Oct 2012, Ni zhan Chen wrote: On 10/25/2012 02:59 PM, Hugh Dickins wrote: On Thu, 25 Oct 2012, Ni zhan Chen wrote: I think it maybe caused by your commit [d189922862e03ce: shmem: fix negative rss in memcg memory.stat], one question: Well, yes, I added the VM_BUG_ON in that commit. if function shmem_confirm_swap confirm the entry has already brought back from swap by a racing thread, The reverse: true confirms that the swap entry has not been brought back from swap by a racing thread; false indicates that there has been a race. then why call shmem_add_to_page_cache to add page from swapcache to pagecache again? Adding it to pagecache again, after such a race, would set error to -EEXIST (originating from radix_tree_insert); but we don't do that, we add it to pagecache when it has not already been added. Or that's the intention: but Dave seems to have found an unexpected exception, despite us holding the page lock across all this. (But if it weren't for the memcg and replace_page issues, I'd much prefer to let shmem_add_to_page_cache discover the race as before.) Hugh Hi Hugh Thanks for your response. You mean the -EEXIST originating from radix_tree_insert, in radix_tree_insert: if (slot != NULL) return -EEXIST; But why slot should be NULL? if no race, the pagecache related radix tree entry should be RADIX_TREE_EXCEPTIONAL_ENTRY+swap_entry_t.val, where I miss? I was describing what would happen in a case that should not exist, that you had thought the common case. In actuality, the entry should not be NULL, it should be as you say there. Thanks for your patience. So in the common case, the entry should be the value I mentioned, then why has this check? if (slot != NULL) return -EEXIST; the common case will return -EEXIST. 
Hugh -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: shmem_getpage_gfp VM_BUG_ON triggered. [3.7rc2]
On 10/26/2012 05:48 AM, Hugh Dickins wrote: On Thu, 25 Oct 2012, Johannes Weiner wrote: On Wed, Oct 24, 2012 at 09:36:27PM -0700, Hugh Dickins wrote: On Wed, 24 Oct 2012, Dave Jones wrote: Machine under significant load (4gb memory used, swap usage fluctuating) triggered this... WARNING: at mm/shmem.c:1151 shmem_getpage_gfp+0xa5c/0xa70() Pid: 29795, comm: trinity-child4 Not tainted 3.7.0-rc2+ #49 1148 error = shmem_add_to_page_cache(page, mapping, index, 1149 gfp, swp_to_radix_entry(swap)); 1150 /* We already confirmed swap, and make no allocation */ 1151 VM_BUG_ON(error); 1152 } That's very surprising. Easy enough to handle an error there, but of course I made it a VM_BUG_ON because it violates my assumptions: I rather need to understand how this can be, and I've no idea. Could it be concurrent truncation clearing out the entry between shmem_confirm_swap() and shmem_add_to_page_cache()? I don't see anything preventing that. The empty slot would not match the expected swap entry this call passes in and the returned error would be -ENOENT. Excellent notion, many thanks Hannes, I believe you've got it. I've hit that truncation problem in swapoff (and commented on it in shmem_unuse_inode), but never hit it or considered it here. I think of the page lock as holding it stable, but truncation's free_swap_and_cache only does a trylock on the swapcache page, so we're not secured against that possibility. Hi Hugh, Even though free_swap_and_cache only does a trylock on the swapcache page, but it doens't call delete_from_swap_cache and the associated entry should still be there, I am interested in what you have already introduce to protect it? So I'd like to change it to VM_BUG_ON(error error != -ENOENT), but there's a little tidying up to do in the -ENOENT case, which Do you mean radix_tree_insert will return -ENOENT if the associated entry is not present? Why I can't find this return value in the function radix_tree_insert? needs more thought. 
A delete_from_swap_cache(page) - though we can be lazy and leave that to reclaim for such a rare occurrence - and probably a mem_cgroup uncharge; but the memcg hooks are always the hardest to get right, I'll have think about that one carefully. Hugh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majord...@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: a href=mailto:d...@kvack.org; em...@kvack.org /a -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state
On 10/26/2012 08:25 AM, Dave Chinner wrote: On Thu, Oct 25, 2012 at 10:58:26AM +0800, Fengguang Wu wrote: Hi Chen, But how can bdi related ra_pages reflect different files' readahead window? Maybe these different files are sequential read, random read and so on. It's simple: sequential reads will get ra_pages readahead size while random reads will not get readahead at all. Talking about the below chunk, it might hurt someone that explicitly takes advantage of the behavior, however the ra_pages*2 seems more like a hack than general solution to me: if the user will need POSIX_FADV_SEQUENTIAL to double the max readahead window size for improving IO performance, then why not just increase bdi-ra_pages and benefit all reads? One may argue that it offers some differential behavior to specific applications, however it may also present as a counter-optimization: if the root already tuned bdi-ra_pages to the optimal size, the doubled readahead size will only cost more memory and perhaps IO latency. --- a/mm/fadvise.c +++ b/mm/fadvise.c @@ -87,7 +86,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice) spin_unlock(file-f_lock); break; case POSIX_FADV_SEQUENTIAL: - file-f_ra.ra_pages = bdi-ra_pages * 2; I think we really have to reset file-f_ra.ra_pages here as it is not a set-and-forget value. e.g. shrink_readahead_size_eio() can reduce ra_pages as a result of IO errors. Hence if you have had io errors, telling the kernel that you are now going to do sequential IO should reset the readahead to the maximum ra_pages value supported Good catch! Cheers, Dave. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state
On 10/26/2012 09:27 AM, Fengguang Wu wrote: On Fri, Oct 26, 2012 at 11:25:44AM +1100, Dave Chinner wrote: On Thu, Oct 25, 2012 at 10:58:26AM +0800, Fengguang Wu wrote: Hi Chen, But how can bdi related ra_pages reflect different files' readahead window? Maybe these different files are sequential read, random read and so on. It's simple: sequential reads will get ra_pages readahead size while random reads will not get readahead at all. Talking about the below chunk, it might hurt someone that explicitly takes advantage of the behavior, however the ra_pages*2 seems more like a hack than general solution to me: if the user will need POSIX_FADV_SEQUENTIAL to double the max readahead window size for improving IO performance, then why not just increase bdi-ra_pages and benefit all reads? One may argue that it offers some differential behavior to specific applications, however it may also present as a counter-optimization: if the root already tuned bdi-ra_pages to the optimal size, the doubled readahead size will only cost more memory and perhaps IO latency. --- a/mm/fadvise.c +++ b/mm/fadvise.c @@ -87,7 +86,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice) spin_unlock(file-f_lock); break; case POSIX_FADV_SEQUENTIAL: - file-f_ra.ra_pages = bdi-ra_pages * 2; I think we really have to reset file-f_ra.ra_pages here as it is not a set-and-forget value. e.g. shrink_readahead_size_eio() can reduce ra_pages as a result of IO errors. Hence if you have had io errors, telling the kernel that you are now going to do sequential IO should reset the readahead to the maximum ra_pages value supported Good point! but wait this patch removes file-f_ra.ra_pages in all other places too, so there will be no file-f_ra.ra_pages to be reset here... 
In his patch, static void shrink_readahead_size_eio(struct file *filp, struct file_ra_state *ra) { - ra-ra_pages /= 4; + spin_lock(filp-f_lock); + filp-f_mode |= FMODE_RANDOM; + spin_unlock(filp-f_lock); As the example in comment above this function, the read maybe still sequential, and it will waste IO bandwith if modify to FMODE_RANDOM directly. Thanks, Fengguang -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3] mm: thp: Set the accessed flag for old pages on access fault.
On 10/26/2012 03:51 AM, Johannes Weiner wrote: On Thu, Oct 25, 2012 at 05:44:31PM +0100, Will Deacon wrote: On x86 memory accesses to pages without the ACCESSED flag set result in the ACCESSED flag being set automatically. With the ARM architecture a page access fault is raised instead (and it will continue to be raised until the ACCESSED flag is set for the appropriate PTE/PMD). For normal memory pages, handle_pte_fault will call pte_mkyoung (effectively setting the ACCESSED flag). For transparent huge pages, pmd_mkyoung will only be called for a write fault. This patch ensures that faults on transparent hugepages which do not result in a CoW update the access flags for the faulting pmd. Cc: Chris Metcalf cmetc...@tilera.com Cc: Kirill A. Shutemov kir...@shutemov.name Cc: Andrea Arcangeli aarca...@redhat.com Signed-off-by: Will Deacon will.dea...@arm.com Acked-by: Johannes Weiner han...@cmpxchg.org Ok chaps, I rebased this thing onto today's next (which basically necessitated a rewrite) so I've reluctantly dropped my acks and kindly ask if you could eyeball the new code, especially where the locking is concerned. In the numa code (do_huge_pmd_prot_none), Peter checks again that the page is not splitting, but I can't see why that is required. I don't either. If the thing was splitting when the fault happened, that path is not taken. And the locked pmd_same() check should rule out splitting setting in after testing pmd_trans_huge_splitting(). Why I can't find function pmd_trans_huge_splitting() you mentioned in latest mainline codes and linux-next? Peter? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majord...@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: a href=mailto:d...@kvack.org; em...@kvack.org /a -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state
On 10/26/2012 11:28 AM, YingHang Zhu wrote: On Fri, Oct 26, 2012 at 10:30 AM, Ni zhan Chen nizhan.c...@gmail.com wrote: On 10/26/2012 09:27 AM, Fengguang Wu wrote: On Fri, Oct 26, 2012 at 11:25:44AM +1100, Dave Chinner wrote: On Thu, Oct 25, 2012 at 10:58:26AM +0800, Fengguang Wu wrote: Hi Chen, But how can bdi related ra_pages reflect different files' readahead window? Maybe these different files are sequential read, random read and so on. It's simple: sequential reads will get ra_pages readahead size while random reads will not get readahead at all. Talking about the below chunk, it might hurt someone that explicitly takes advantage of the behavior, however the ra_pages*2 seems more like a hack than general solution to me: if the user will need POSIX_FADV_SEQUENTIAL to double the max readahead window size for improving IO performance, then why not just increase bdi-ra_pages and benefit all reads? One may argue that it offers some differential behavior to specific applications, however it may also present as a counter-optimization: if the root already tuned bdi-ra_pages to the optimal size, the doubled readahead size will only cost more memory and perhaps IO latency. --- a/mm/fadvise.c +++ b/mm/fadvise.c @@ -87,7 +86,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice) spin_unlock(file-f_lock); break; case POSIX_FADV_SEQUENTIAL: - file-f_ra.ra_pages = bdi-ra_pages * 2; I think we really have to reset file-f_ra.ra_pages here as it is not a set-and-forget value. e.g. shrink_readahead_size_eio() can reduce ra_pages as a result of IO errors. Hence if you have had io errors, telling the kernel that you are now going to do sequential IO should reset the readahead to the maximum ra_pages value supported Good point! but wait this patch removes file-f_ra.ra_pages in all other places too, so there will be no file-f_ra.ra_pages to be reset here... 
In his patch,

 static void shrink_readahead_size_eio(struct file *filp,
 					struct file_ra_state *ra)
 {
-	ra->ra_pages /= 4;
+	spin_lock(&filp->f_lock);
+	filp->f_mode |= FMODE_RANDOM;
+	spin_unlock(&filp->f_lock);

As the example in the comment above this function shows, the read may still be sequential, and it will waste IO bandwidth if we switch to FMODE_RANDOM directly.

I've considered this. On the first try I modified file_ra_state.size and file_ra_state.async_size directly, like

	file_ra_state.async_size = 0;
	file_ra_state.size /= 4;

but as I commented here, we cannot predict whether the bad sectors will trash the readahead window; maybe the sectors following the current one are fine to read with normal readahead. It's hard to know, so the current approach gives us a chance to slow down softly.

Then where will filp->f_mode & FMODE_RANDOM be checked? Will it influence ra->ra_pages?

Thanks, Ying Zhu

Thanks, Fengguang
Re: shmem_getpage_gfp VM_BUG_ON triggered. [3.7rc2]
On 10/25/2012 12:36 PM, Hugh Dickins wrote: On Wed, 24 Oct 2012, Dave Jones wrote: Machine under significant load (4gb memory used, swap usage fluctuating) triggered this... WARNING: at mm/shmem.c:1151 shmem_getpage_gfp+0xa5c/0xa70() Pid: 29795, comm: trinity-child4 Not tainted 3.7.0-rc2+ #49 Call Trace: [] warn_slowpath_common+0x7f/0xc0 [] warn_slowpath_null+0x1a/0x20 [] shmem_getpage_gfp+0xa5c/0xa70 [] ? shmem_getpage_gfp+0x29e/0xa70 [] shmem_fault+0x4f/0xa0 [] __do_fault+0x71/0x5c0 [] ? __lock_acquire+0x306/0x1ba0 [] ? local_clock+0x89/0xa0 [] handle_pte_fault+0x97/0xae0 [] ? sub_preempt_count+0x79/0xd0 [] ? delay_tsc+0xae/0x120 [] ? __const_udelay+0x28/0x30 [] handle_mm_fault+0x289/0x350 [] __do_page_fault+0x18e/0x530 [] ? local_clock+0x89/0xa0 [] ? get_parent_ip+0x11/0x50 [] ? get_parent_ip+0x11/0x50 [] ? sub_preempt_count+0x79/0xd0 [] ? rcu_user_exit+0xc9/0xf0 [] do_page_fault+0x2b/0x50 [] page_fault+0x28/0x30 [] ? copy_user_enhanced_fast_string+0x9/0x20 [] ? sys_futimesat+0x41/0xe0 [] ? syscall_trace_enter+0x25/0x2c0 [] ? tracesys+0x7e/0xe6 [] tracesys+0xe1/0xe6 1148 error = shmem_add_to_page_cache(page, mapping, index, 1149 gfp, swp_to_radix_entry(swap)); 1150 /* We already confirmed swap, and make no allocation */ 1151 VM_BUG_ON(error); 1152 } That's very surprising. Easy enough to handle an error there, but of course I made it a VM_BUG_ON because it violates my assumptions: I rather need to understand how this can be, and I've no idea. Clutching at straws, I expect this is entirely irrelevant, but: there isn't a warning on line 1151 of mm/shmem.c in 3.7.0-rc2 nor in current linux.git; rather, there's a VM_BUG_ON on line 1149. So you've inserted a couple of lines for some reason (more useful trinity behaviour, perhaps)? And have some config option I'm unfamiliar with, that mutates a BUG_ON or VM_BUG_ON into a warning? 
Hi Hugh, I think it may be caused by your commit [d189922862e03ce: shmem: fix negative rss in memcg memory.stat]. One question: if the function shmem_confirm_swap confirms the entry has already been brought back from swap by a racing thread, then why call shmem_add_to_page_cache to add the page from swapcache to pagecache again, instead of going to unlock and then to repeat? What am I missing?

Regards, Chen

Hugh

             total       used       free     shared    buffers     cached
Mem:       3885528    2854064    1031464          0       9624      19208
-/+ buffers/cache:    2825232    1060296
Swap:      6029308      30656    5998652
Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state
On 10/25/2012 10:04 AM, YingHang Zhu wrote: On Thu, Oct 25, 2012 at 9:50 AM, Dave Chinner wrote: On Thu, Oct 25, 2012 at 08:17:05AM +0800, YingHang Zhu wrote: On Thu, Oct 25, 2012 at 4:19 AM, Dave Chinner wrote: On Wed, Oct 24, 2012 at 07:53:59AM +0800, YingHang Zhu wrote: Hi Dave, On Wed, Oct 24, 2012 at 6:47 AM, Dave Chinner wrote: On Tue, Oct 23, 2012 at 08:46:51PM +0800, Ying Zhu wrote: Hi,

Recently we ran into the bug that an opened file's ra_pages does not synchronize with its backing device's when the latter is changed with blockdev --setra; the application needs to reopen the file to know the change, or simply call fadvise(fd, POSIX_FADV_NORMAL) to reset the readahead window to the (new) bdi default, which is inappropriate under our circumstances.

Which are? We don't know your circumstances, so you need to tell us why you need this and why existing methods of handling such changes are insufficient... Optimal readahead windows tend to be a physical property of the storage and that does not tend to change dynamically. Hence block device readahead should only need to be set up once, and generally that can be done before the filesystem is mounted and files are opened (e.g. via udev rules). Hence you need to explain why you need to change the default block device readahead on the fly, and why fadvise(POSIX_FADV_NORMAL) is "inappropriate" to set readahead windows to the new defaults.

Our system is a fuse-based file system; fuse creates a pseudo backing device for the user space file systems. The default readahead size is 128KB and it can't fully utilize the backing storage's read ability, so we should tune it.

Sure, but that doesn't tell me anything about why you can't do this at mount time before the application opens any files. i.e. you've simply stated the reason why readahead is tunable, not why you need to be fully dynamic.
We store our file system's data on different disks, so we need to change ra_pages dynamically according to where the data resides; it can't be fixed at mount time or when we open files.

That doesn't make a whole lot of sense to me. Let me try to get this straight. There is data that resides on two devices (A + B), and a fuse filesystem to access that data. There is a single file in the fuse fs that has data on both devices. An app has the file open, and when the data it is accessing is on device A you need to set the readahead to what is best for device A? And when the app tries to access data for that file that is on device B, you need to set the readahead to what is best for device B? And you are changing the fuse BDI readahead settings according to where the data in the back end lies? It seems to me that you should be setting the fuse readahead to the maximum of the readahead windows the data devices have configured at mount time and leaving it at that.

Then it may not fully utilize some device's read IO bandwidth and put too much burden on other devices. The abstract bdi of fuse and btrfs provides some dynamically changing bdi.ra_pages based on the real backing device. IMHO this should not be ignored.

btrfs simply takes into account the number of disks it has for a given storage pool when setting up the default bdi ra_pages during mount. This is basically doing what I suggested above. Same with the generic fuse code - it's simply setting a sensible default value for the given fuse configuration. Neither are dynamic in the sense you are talking about, though.

Actually I've talked about it with Fengguang, he advised we should unify the

But how can bdi related ra_pages reflect different files' readahead window? Maybe these different files are sequential read, random read and so on.

ra_pages in struct bdi and file_ra_state and leave the issue of spreading data across disks as it is. Fengguang, what's your opinion about this?

Thanks, Ying Zhu

Cheers, Dave.
-- Dave Chinner da...@fromorbit.com
Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state
On 10/25/2012 08:17 AM, YingHang Zhu wrote: On Thu, Oct 25, 2012 at 4:19 AM, Dave Chinner wrote: On Wed, Oct 24, 2012 at 07:53:59AM +0800, YingHang Zhu wrote: Hi Dave, On Wed, Oct 24, 2012 at 6:47 AM, Dave Chinner wrote: On Tue, Oct 23, 2012 at 08:46:51PM +0800, Ying Zhu wrote: Hi,

Recently we ran into the bug that an opened file's ra_pages does not synchronize with its backing device's when the latter is changed with blockdev --setra; the application needs to reopen the file to know the change, or simply call fadvise(fd, POSIX_FADV_NORMAL) to reset the readahead window to the (new) bdi default, which is inappropriate under our circumstances.

Which are? We don't know your circumstances, so you need to tell us why you need this and why existing methods of handling such changes are insufficient... Optimal readahead windows tend to be a physical property of the storage and that does not tend to change dynamically. Hence block device readahead should only need to be set up once, and generally that can be done before the filesystem is mounted and files are opened (e.g. via udev rules). Hence you need to explain why you need to change the default block device readahead on the fly, and why fadvise(POSIX_FADV_NORMAL) is "inappropriate" to set readahead windows to the new defaults.

Our system is a fuse-based file system; fuse creates a pseudo backing device for the user space file systems. The default readahead size is 128KB and it can't fully utilize the backing storage's read ability, so we should tune it.

Sure, but that doesn't tell me anything about why you can't do this at mount time before the application opens any files. i.e. you've simply stated the reason why readahead is tunable, not why you need to be fully dynamic.

We store our file system's data on different disks, so we need to change ra_pages dynamically according to where the data resides; it can't be fixed at mount time or when we open files.
The abstract bdi of fuse and btrfs provides some dynamically changing bdi.ra_pages based on the real backing device. IMHO this should not be ignored. And how to tune ra_pages if one big file is distributed across different disks? I think Fengguang Wu can answer these questions. Hi Fengguang,

The above third-party application using our file system maintains some long-opened files; we do not have any chance to force them to call fadvise(POSIX_FADV_NORMAL). :(

So raise a bug/feature request with the third party. Modifying kernel code because you can't directly modify the application isn't the best solution for anyone. This really is an application problem - the kernel already provides the mechanisms to solve this problem... :/

Thanks for the advice, I will consult the above application's developers for more information. Now, from the code itself, should we merge the gap between the real device's ra_pages and the file's? Obviously the ra_pages is duplicated; otherwise each time we run into this problem, someone will do the same work as I have done here.

Thanks, Ying Zhu

Cheers, Dave.

-- Dave Chinner da...@fromorbit.com
Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state
On 10/23/2012 09:41 PM, YingHang Zhu wrote: Sorry for the annoyance, I forgot ccs in the previous mail. Thanks, Ying Zhu

Hi Chen, On Tue, Oct 23, 2012 at 9:21 PM, Ni zhan Chen wrote: On 10/23/2012 08:46 PM, Ying Zhu wrote: Hi,

Recently we ran into the bug that an opened file's ra_pages does not synchronize with its backing device's when the latter is changed with blockdev --setra; the application needs to reopen the file to know the change, which is inappropriate under our circumstances.

Could you tell me in which function this synchronization happens?

With this patch we use bdi.ra_pages directly, so changing bdi.ra_pages also changes an opened file's ra_pages.

This bug is also mentioned in scst (generic SCSI target subsystem for Linux)'s README file. This patch tries to unify the ra_pages in struct file_ra_state and struct backing_dev_info. Basically the current readahead algorithm will ramp file_ra_state.ra_pages up to bdi.ra_pages once it detects the

You mean the ondemand readahead algorithm will do this? I don't think so. file_ra_state_init is only called in the btrfs path, correct?

No, it's also called in do_dentry_open.

read mode is sequential. Then all files sharing the same backing device have the same max value bdi.ra_pages set in file_ra_state.

Why remove file_ra_state? If one file is read sequentially and another is read randomly, how can the global bdi.ra_pages indicate the max readahead window of each file?

This patch does not remove file_ra_state; a file's readahead window is determined by its backing device. As Dave said, a backing device's readahead window doesn't tend to change dynamically, but a file's readahead window does: it will change when sequential read, random read, thrash, interleaved read and so on occur. Applying this means the flags POSIX_FADV_NORMAL and POSIX_FADV_SEQUENTIAL in fadvise will only set the file reading mode without signifying the max readahead size of the file.
The current approach adds no additional overhead in the read IO path; IMHO it is the simplest solution. Any comments are welcome, thanks in advance.

Could you show me how you test this patch?

This patch brings no performance gain, it just fixes some functional bugs. By reading a 500MB file: the default max readahead size of the backing device was 128KB; after applying this patch, the read file's max ra_pages changed when I tuned the device's readahead size with blockdev.

Thanks, Ying Zhu

Signed-off-by: Ying Zhu
---
 include/linux/fs.h |    1 -
 mm/fadvise.c       |    2 --
 mm/filemap.c       |   17 +++--
 mm/readahead.c     |    8
 4 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 17fd887..36303a5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -991,7 +991,6 @@ struct file_ra_state {
 	unsigned int async_size;	/* do asynchronous readahead when
 					   there are only # of pages ahead */
-	unsigned int ra_pages;		/* Maximum readahead window */
 	unsigned int mmap_miss;		/* Cache miss stat for mmap accesses */
 	loff_t prev_pos;		/* Cache last read() position */
 };
diff --git a/mm/fadvise.c b/mm/fadvise.c
index 469491e..75e2378 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -76,7 +76,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
 	switch (advice) {
 	case POSIX_FADV_NORMAL:
-		file->f_ra.ra_pages = bdi->ra_pages;
 		spin_lock(&file->f_lock);
 		file->f_mode &= ~FMODE_RANDOM;
 		spin_unlock(&file->f_lock);
@@ -87,7 +86,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
 		spin_unlock(&file->f_lock);
 		break;
 	case POSIX_FADV_SEQUENTIAL:
-		file->f_ra.ra_pages = bdi->ra_pages * 2;
 		spin_lock(&file->f_lock);
 		file->f_mode &= ~FMODE_RANDOM;
 		spin_unlock(&file->f_lock);
diff --git a/mm/filemap.c b/mm/filemap.c
index a4a5260..e7e4409 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1058,11 +1058,15 @@ EXPORT_SYMBOL(grab_cache_page_nowait);
  * readahead(R+4...B+3) => bang => read(R+4) => read(R+5) => ..
  *
  * It is going insane. Fix it by quickly scaling down the readahead size.
+ * It's hard to estimate how the bad sectors lay out, so to be conservative,
+ * set the read mode to random.
  */
 static void shrink_readahead_size_eio(struct file *filp,
 					struct file_ra_state *ra)
 {
-	ra->ra_pages /= 4;
+	spin_lock(&filp->f_lock);
+	filp->f_mode |= FMODE_RANDOM;
+	spin_unlock(&filp->f_lock);
 }

 /**
@@ -1527,12 +1531,12 @@ static void do_sync_mmap_readahead(struct vm_area_struct *vma,
 	/* If we don't want any read-ahead, don't bother */
Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state
On 10/23/2012 08:46 PM, Ying Zhu wrote: Hi,

Recently we ran into the bug that an opened file's ra_pages does not synchronize with its backing device's when the latter is changed with blockdev --setra; the application needs to reopen the file to know the change, which is inappropriate under our circumstances.

Could you tell me in which function this synchronization happens?

This bug is also mentioned in scst (generic SCSI target subsystem for Linux)'s README file. This patch tries to unify the ra_pages in struct file_ra_state and struct backing_dev_info. Basically the current readahead algorithm will ramp file_ra_state.ra_pages up to bdi.ra_pages once it detects the

You mean the ondemand readahead algorithm will do this? I don't think so. file_ra_state_init is only called in the btrfs path, correct?

read mode is sequential. Then all files sharing the same backing device have the same max value bdi.ra_pages set in file_ra_state.

Why remove file_ra_state? If one file is read sequentially and another is read randomly, how can the global bdi.ra_pages indicate the max readahead window of each file?

Applying this means the flags POSIX_FADV_NORMAL and POSIX_FADV_SEQUENTIAL in fadvise will only set the file reading mode without signifying the max readahead size of the file. The current approach adds no additional overhead in the read IO path; IMHO it is the simplest solution. Any comments are welcome, thanks in advance.

Could you show me how you test this patch?
Thanks, Ying Zhu

Signed-off-by: Ying Zhu
---
 include/linux/fs.h |    1 -
 mm/fadvise.c       |    2 --
 mm/filemap.c       |   17 +++--
 mm/readahead.c     |    8
 4 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 17fd887..36303a5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -991,7 +991,6 @@ struct file_ra_state {
 	unsigned int async_size;	/* do asynchronous readahead when
 					   there are only # of pages ahead */
-	unsigned int ra_pages;		/* Maximum readahead window */
 	unsigned int mmap_miss;		/* Cache miss stat for mmap accesses */
 	loff_t prev_pos;		/* Cache last read() position */
 };
diff --git a/mm/fadvise.c b/mm/fadvise.c
index 469491e..75e2378 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -76,7 +76,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
 	switch (advice) {
 	case POSIX_FADV_NORMAL:
-		file->f_ra.ra_pages = bdi->ra_pages;
 		spin_lock(&file->f_lock);
 		file->f_mode &= ~FMODE_RANDOM;
 		spin_unlock(&file->f_lock);
@@ -87,7 +86,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
 		spin_unlock(&file->f_lock);
 		break;
 	case POSIX_FADV_SEQUENTIAL:
-		file->f_ra.ra_pages = bdi->ra_pages * 2;
 		spin_lock(&file->f_lock);
 		file->f_mode &= ~FMODE_RANDOM;
 		spin_unlock(&file->f_lock);
diff --git a/mm/filemap.c b/mm/filemap.c
index a4a5260..e7e4409 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1058,11 +1058,15 @@ EXPORT_SYMBOL(grab_cache_page_nowait);
  * readahead(R+4...B+3) => bang => read(R+4) => read(R+5) => ..
  *
  * It is going insane. Fix it by quickly scaling down the readahead size.
+ * It's hard to estimate how the bad sectors lay out, so to be conservative,
+ * set the read mode to random.
  */
 static void shrink_readahead_size_eio(struct file *filp,
 					struct file_ra_state *ra)
 {
-	ra->ra_pages /= 4;
+	spin_lock(&filp->f_lock);
+	filp->f_mode |= FMODE_RANDOM;
+	spin_unlock(&filp->f_lock);
 }

 /**
@@ -1527,12 +1531,12 @@ static void do_sync_mmap_readahead(struct vm_area_struct *vma,
 	/* If we don't want any read-ahead, don't bother */
 	if (VM_RandomReadHint(vma))
 		return;
-	if (!ra->ra_pages)
+	if (!mapping->backing_dev_info->ra_pages)
 		return;

 	if (VM_SequentialReadHint(vma)) {
-		page_cache_sync_readahead(mapping, ra, file, offset,
-					  ra->ra_pages);
+		page_cache_sync_readahead(mapping, ra, file, offset,
+					  mapping->backing_dev_info->ra_pages);
 		return;
 	}

@@ -1550,7 +1554,7 @@ static void do_sync_mmap_readahead(struct vm_area_struct *vma,
 	/*
 	 * mmap read-around
 	 */
-	ra_pages = max_sane_readahead(ra->ra_pages);
+	ra_pages = max_sane_readahead(mapping->backing_dev_info->ra_pages);
 	ra->start = max_t(long, 0, offset - ra_pages / 2);
 	ra->size = ra_pages;
 	ra->async_size = ra_pages / 4;
@@ -1576,7 +1580,8 @@ static void do_async_mmap_readahead(struct vm_area_struct *vma,
 	ra->mmap_miss--;
 	if
Re: [PATCH v2 00/12] memory-hotplug: hot-remove physical memory
On 10/23/2012 06:30 PM, we...@cn.fujitsu.com wrote: From: Wen Congyang

The patchset doesn't support kernel memory hot-remove, correct? If the answer is yes, you should point that out in your patchset changelog.

The patch-set was split out from the following thread's patch-set:
https://lkml.org/lkml/2012/9/5/201

The last version of this patchset:
https://lkml.org/lkml/2012/10/5/469

If you want to know the reason, please read the following thread:
https://lkml.org/lkml/2012/10/2/83

The patch-set covers only the kernel-core side of physical memory hot-remove. So if you use the patch, please apply the following patches as well:
- bug fix for memory hot remove: https://lkml.org/lkml/2012/10/19/56
- acpi framework: https://lkml.org/lkml/2012/10/19/156

The patches can free/remove the following things:
- /sys/firmware/memmap/X/{end, start, type} : [PATCH 2/10]
- mem_section and related sysfs files : [PATCH 3-4/10]
- memmap of sparse-vmemmap : [PATCH 5-7/10]
- page table of removed memory : [RFC PATCH 8/10]
- node and related sysfs files : [RFC PATCH 9-10/10]

* [PATCH 2/10] checks whether the memory can be removed or not.

If you find functionality missing for physical memory hot-remove, please let me know.

How to test this patchset?
1. Apply this patchset and build the kernel. MEMORY_HOTPLUG, MEMORY_HOTREMOVE and ACPI_HOTPLUG_MEMORY must be selected.
2. Load the module acpi_memhotplug.
3. Hotplug the memory device (it depends on your hardware). You will see the memory device under the directory /sys/bus/acpi/devices/. Its name is PNP0C80:XX.
4. Online/offline pages provided by this memory device. You can write online/offline to /sys/devices/system/memory/memoryX/state to online/offline the pages provided by this memory device.
5. Hot-remove the memory device. You can hot-remove the memory device via the hardware, or by writing 1 to /sys/bus/acpi/devices/PNP0C80:XX/eject.

Note: if the memory provided by the memory device is used by the kernel, it can't be offlined. It is not a bug.

Known problems:
1. Hot-removing a memory device may cause a kernel panic. This bug will be fixed by Liu Jiang's patch:
   https://lkml.org/lkml/2012/7/3/1

Changelog from v1 to v2:
 Patch1: new patch, offline memory twice. 1st iteration: offline every non-primary memory block. 2nd iteration: offline the primary (i.e. first added) memory block.
 Patch3: new patch, no logical change, just remove redundant code.
 Patch9: merge the patch from wujianguo into this patch. Flush the TLB on all CPUs after the page table is changed.
 Patch12: new patch, free node_data when a node is offlined.

Wen Congyang (6):
  memory-hotplug: try to offline the memory twice to avoid dependence
  memory-hotplug: remove redundant codes
  memory-hotplug: introduce new function arch_remove_memory() for removing page table depends on architecture
  memory-hotplug: remove page table of x86_64 architecture
  memory-hotplug: remove sysfs file of node
  memory-hotplug: free node_data when a node is offlined

Yasuaki Ishimatsu (6):
  memory-hotplug: check whether all memory blocks are offlined or not when removing memory
  memory-hotplug: remove /sys/firmware/memmap/X sysfs
  memory-hotplug: unregister memory section on SPARSEMEM_VMEMMAP
  memory-hotplug: implement register_page_bootmem_info_section of sparse-vmemmap
  memory-hotplug: remove memmap of sparse-vmemmap
  memory-hotplug: memory_hotplug: clear zone when removing the memory

 arch/ia64/mm/discontig.c             |  14 ++
 arch/ia64/mm/init.c                  |  18 ++
 arch/powerpc/mm/init_64.c            |  14 ++
 arch/powerpc/mm/mem.c                |  12 +
 arch/s390/mm/init.c                  |  12 +
 arch/s390/mm/vmem.c                  |  14 ++
 arch/sh/mm/init.c                    |  17 ++
 arch/sparc/mm/init_64.c              |  14 ++
 arch/tile/mm/init.c                  |   8 +
 arch/x86/include/asm/pgtable_types.h |   1 +
 arch/x86/mm/init_32.c                |  12 +
 arch/x86/mm/init_64.c                | 409 ++
 arch/x86/mm/pageattr.c               |  47 ++--
 drivers/acpi/acpi_memhotplug.c       |   8 +-
 drivers/base/memory.c                |   6 +
 drivers/firmware/memmap.c            |  98 -
 include/linux/firmware-map.h         |   6 +
 include/linux/memory_hotplug.h       |  15 +-
 include/linux/mm.h                   |   5 +-
 mm/memory_hotplug.c                  | 409 --
 mm/sparse.c                          |   5 +-
 21 files changed, 1087 insertions(+), 57 deletions(-)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majord...@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: em...@kvack.org
Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state
On 10/23/2012 08:46 PM, Ying Zhu wrote:

Hi, recently we ran into the bug that an opened file's ra_pages does not synchronize with its backing device's when the latter is changed with blockdev --setra; the application needs to reopen the file to see the change, which is inappropriate under our circumstances.

Could you tell me which function does this synchronization?

This bug is also mentioned in scst (generic SCSI target subsystem for Linux)'s README file. This patch tries to unify the ra_pages in struct file_ra_state and struct backing_dev_info. Basically the current readahead algorithm will ramp file_ra_state.ra_pages up to bdi.ra_pages once it detects the

You mean the ondemand readahead algorithm will do this? I don't think so. file_ra_state_init is only called in the btrfs path, correct?

read mode is sequential. Then all files sharing the same backing device have the same max value bdi.ra_pages set in file_ra_state.

Why remove file_ra_state? If one file is read sequentially and another file is read randomly, how can the global bdi.ra_pages indicate the max readahead window of each file?

Applying this means the flags POSIX_FADV_NORMAL and POSIX_FADV_SEQUENTIAL in fadvise will only set the file's reading mode without signifying the file's max readahead size. The current approach adds no additional overhead in the read IO path, and IMHO is the simplest solution. Any comments are welcome, thanks in advance.

Could you show me how you test this patch?
Thanks,
Ying Zhu

Signed-off-by: Ying Zhu <casualfis...@gmail.com>
---
 include/linux/fs.h |    1 -
 mm/fadvise.c       |    2 --
 mm/filemap.c       |   17 +++--
 mm/readahead.c     |    8
 4 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 17fd887..36303a5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -991,7 +991,6 @@ struct file_ra_state {
 	unsigned int async_size;	/* do asynchronous readahead when
 					   there are only # of pages ahead */
-	unsigned int ra_pages;		/* Maximum readahead window */
 	unsigned int mmap_miss;		/* Cache miss stat for mmap accesses */
 	loff_t prev_pos;		/* Cache last read() position */
 };
diff --git a/mm/fadvise.c b/mm/fadvise.c
index 469491e..75e2378 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -76,7 +76,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
 	switch (advice) {
 	case POSIX_FADV_NORMAL:
-		file->f_ra.ra_pages = bdi->ra_pages;
 		spin_lock(&file->f_lock);
 		file->f_mode &= ~FMODE_RANDOM;
 		spin_unlock(&file->f_lock);
@@ -87,7 +86,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
 		spin_unlock(&file->f_lock);
 		break;
 	case POSIX_FADV_SEQUENTIAL:
-		file->f_ra.ra_pages = bdi->ra_pages * 2;
 		spin_lock(&file->f_lock);
 		file->f_mode &= ~FMODE_RANDOM;
 		spin_unlock(&file->f_lock);
diff --git a/mm/filemap.c b/mm/filemap.c
index a4a5260..e7e4409 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1058,11 +1058,15 @@ EXPORT_SYMBOL(grab_cache_page_nowait);
  * readahead(R+4...B+3) => bang => read(R+4) => read(R+5) => ..
  *
  * It is going insane. Fix it by quickly scaling down the readahead size.
+ * It's hard to estimate how the bad sectors lay out, so to be conservative,
+ * set the read mode to random.
  */
 static void shrink_readahead_size_eio(struct file *filp,
 					struct file_ra_state *ra)
 {
-	ra->ra_pages /= 4;
+	spin_lock(&filp->f_lock);
+	filp->f_mode |= FMODE_RANDOM;
+	spin_unlock(&filp->f_lock);
 }

 /**
@@ -1527,12 +1531,12 @@ static void do_sync_mmap_readahead(struct vm_area_struct *vma,
 	/* If we don't want any read-ahead, don't bother */
 	if (VM_RandomReadHint(vma))
 		return;
-	if (!ra->ra_pages)
+	if (!mapping->backing_dev_info->ra_pages)
 		return;

 	if (VM_SequentialReadHint(vma)) {
-		page_cache_sync_readahead(mapping, ra, file, offset,
-					  ra->ra_pages);
+		page_cache_sync_readahead(mapping, ra, file, offset,
+					  mapping->backing_dev_info->ra_pages);
 		return;
 	}
@@ -1550,7 +1554,7 @@ static void do_sync_mmap_readahead(struct vm_area_struct *vma,
 	/*
 	 * mmap read-around
 	 */
-	ra_pages = max_sane_readahead(ra->ra_pages);
+	ra_pages = max_sane_readahead(mapping->backing_dev_info->ra_pages);
 	ra->start = max_t(long, 0, offset - ra_pages / 2);
 	ra->size = ra_pages;
 	ra->async_size = ra_pages / 4;
@@ -1576,7 +1580,8 @@ static void do_async_mmap_readahead(struct vm_area_struct *vma,
 	ra->mmap_miss--;
Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state
On 10/23/2012 09:41 PM, YingHang Zhu wrote:

Sorry for the noise, I forgot the ccs in the previous mail.

Thanks, Ying Zhu

Hi Chen,

On Tue, Oct 23, 2012 at 9:21 PM, Ni zhan Chen nizhan.c...@gmail.com wrote: On 10/23/2012 08:46 PM, Ying Zhu wrote:

Hi, recently we ran into the bug that an opened file's ra_pages does not synchronize with its backing device's when the latter is changed with blockdev --setra; the application needs to reopen the file to see the change, which is inappropriate under our circumstances.

Could you tell me which function does this synchronization?

With this patch we use bdi.ra_pages directly, so changing bdi.ra_pages also changes an opened file's ra_pages.

This bug is also mentioned in scst (generic SCSI target subsystem for Linux)'s README file. This patch tries to unify the ra_pages in struct file_ra_state and struct backing_dev_info. Basically the current readahead algorithm will ramp file_ra_state.ra_pages up to bdi.ra_pages once it detects the

You mean the ondemand readahead algorithm will do this? I don't think so. file_ra_state_init is only called in the btrfs path, correct?

No, it's also called in do_dentry_open.

read mode is sequential. Then all files sharing the same backing device have the same max value bdi.ra_pages set in file_ra_state.

Why remove file_ra_state? If one file is read sequentially and another file is read randomly, how can the global bdi.ra_pages indicate the max readahead window of each file?

This patch does not remove file_ra_state; a file's readahead window is determined by its backing device. As Dave said, the backing device's readahead window doesn't tend to change dynamically, but a file's readahead window does: it will change when sequential reads, random reads, thrashing, interleaved reads and so on occur.

Applying this means the flags POSIX_FADV_NORMAL and POSIX_FADV_SEQUENTIAL in fadvise will only set the file's reading mode without signifying the file's max readahead size.
The current approach adds no additional overhead in the read IO path, and IMHO is the simplest solution. Any comments are welcome, thanks in advance.

Could you show me how you test this patch?

This patch brings no performance gain, it just fixes some functional bugs. Reading a 500MB file: the default max readahead size of the backing device was 128KB, and after applying this patch the opened file's max ra_pages changed when I tuned the device's readahead size with blockdev.

Thanks,
Ying Zhu

Signed-off-by: Ying Zhu <casualfis...@gmail.com>
---
 include/linux/fs.h |    1 -
 mm/fadvise.c       |    2 --
 mm/filemap.c       |   17 +++--
 mm/readahead.c     |    8
 4 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 17fd887..36303a5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -991,7 +991,6 @@ struct file_ra_state {
 	unsigned int async_size;	/* do asynchronous readahead when
 					   there are only # of pages ahead */
-	unsigned int ra_pages;		/* Maximum readahead window */
 	unsigned int mmap_miss;		/* Cache miss stat for mmap accesses */
 	loff_t prev_pos;		/* Cache last read() position */
 };
diff --git a/mm/fadvise.c b/mm/fadvise.c
index 469491e..75e2378 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -76,7 +76,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
 	switch (advice) {
 	case POSIX_FADV_NORMAL:
-		file->f_ra.ra_pages = bdi->ra_pages;
 		spin_lock(&file->f_lock);
 		file->f_mode &= ~FMODE_RANDOM;
 		spin_unlock(&file->f_lock);
@@ -87,7 +86,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
 		spin_unlock(&file->f_lock);
 		break;
 	case POSIX_FADV_SEQUENTIAL:
-		file->f_ra.ra_pages = bdi->ra_pages * 2;
 		spin_lock(&file->f_lock);
 		file->f_mode &= ~FMODE_RANDOM;
 		spin_unlock(&file->f_lock);
diff --git a/mm/filemap.c b/mm/filemap.c
index a4a5260..e7e4409 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1058,11 +1058,15 @@ EXPORT_SYMBOL(grab_cache_page_nowait);
  * readahead(R+4...B+3) => bang => read(R+4) => read(R+5) => ..
  *
  * It is going insane. Fix it by quickly scaling down the readahead size.
+ * It's hard to estimate how the bad sectors lay out, so to be conservative,
+ * set the read mode to random.
  */
 static void shrink_readahead_size_eio(struct file *filp,
 					struct file_ra_state *ra)
 {
-	ra->ra_pages /= 4;
+	spin_lock(&filp->f_lock);
+	filp->f_mode |= FMODE_RANDOM;
+	spin_unlock(&filp->f_lock);
 }

 /**
@@ -1527,12 +1531,12 @@ static void do_sync_mmap_readahead(struct vm_area_struct *vma,
 	/* If we don't want any read-ahead, don't bother
Re: question on NUMA page migration
On 10/19/2012 11:53 PM, Rik van Riel wrote:

Hi Andrea, Peter, I have a question on page refcounting in your NUMA page migration code. In Peter's case, I wonder why you introduce a new MIGRATE_FAULT migration mode. If the normal page migration / compaction logic can do without taking an extra reference count, why does your code need it?

Hi Rik van Riel, which part of the code is this? Why can't I find MIGRATE_FAULT in the latest v3.7-rc2?

Regards, Chen

In Andrea's case, we have a comment suggesting an extra refcount is needed, immediately followed by a put_page:

	/*
	 * Pin the head subpage at least until the first
	 * __isolate_lru_page succeeds (__isolate_lru_page pins it
	 * again when it succeeds). If we unpin before
	 * __isolate_lru_page succeeds, the page could be freed and
	 * reallocated out from under us. Thus our previous checks on
	 * the page, and the split_huge_page, would be worthless.
	 *
	 * We really only need to do this if "ret > 0" but it doesn't
	 * hurt to do it unconditionally as nobody can reference
	 * "page" anymore after this and so we can avoid an "if (ret >
	 * 0)" branch here.
	 */
	put_page(page);

This also confuses me. If we do not need the extra refcount (and I do not understand why NUMA migrate-on-fault needs one more refcount than normal page migration), we can get rid of the MIGRATE_FAULT mode. If we do need the extra refcount, why is normal page migration safe? :)
Re: [PATCH v2 0/5] bugfix for memory hotplug
On 10/17/2012 08:08 PM, we...@cn.fujitsu.com wrote: From: Wen Congyang

Wen Congyang (5):
  memory-hotplug: skip HWPoisoned page when offlining pages
  memory-hotplug: update mce_bad_pages when removing the memory
  memory-hotplug: auto offline page_cgroup when onlining memory block failed
  memory-hotplug: fix NR_FREE_PAGES mismatch
  memory-hotplug: allocate zone's pcp before onlining pages

Oops, why didn't you write a changelog?

 include/linux/page-isolation.h | 10 ++
 mm/memory-failure.c            |  2 +-
 mm/memory_hotplug.c            | 14 --
 mm/page_alloc.c                | 37 -
 mm/page_cgroup.c               |  3 +++
 mm/page_isolation.c            | 27 ---
 mm/sparse.c                    | 21 +
 7 files changed, 87 insertions(+), 27 deletions(-)
Re: [PATCH v3 00/10] Introduce huge zero page
On 10/03/2012 08:04 AM, Kirill A. Shutemov wrote: On Tue, Oct 02, 2012 at 03:31:48PM -0700, Andrew Morton wrote: On Tue, 2 Oct 2012 18:19:22 +0300 "Kirill A. Shutemov" wrote:

During testing I noticed big (up to 2.5 times) memory consumption overhead on some workloads (e.g. ft.A from NPB) if THP is enabled. The main reason for that big difference is the lack of a zero page in the THP case: we have to allocate a real page on a read page fault. A program to demonstrate the issue:

#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

#define MB 1024*1024

int main(int argc, char **argv)
{
        char *p;
        int i;

        posix_memalign((void **)&p, 2 * MB, 200 * MB);
        for (i = 0; i < 200 * MB; i += 4096)
                assert(p[i] == 0);
        pause();
        return 0;
}

With thp-never RSS is about 400k, but with thp-always it's 200M. After the patchset thp-always RSS is 400k too.

I'd like to see a full description of the design, please.

Okay. Design overview.

Huge zero page (hzp) is a non-movable huge page (2M on x86-64) filled with zeros. The way we allocate it changes over the patchset:
- [01/10] simplest way: hzp allocated at boot time in hugepage_init();
- [09/10] lazy allocation on first use;
- [10/10] lockless refcounting + shrinker-reclaimable hzp.

We set it up in do_huge_pmd_anonymous_page() if the area around the fault address is suitable for THP and we've got a read page fault. If we fail to set up the hzp (ENOMEM) we fall back to handle_pte_fault() as we normally do in THP.

On a wp fault to the hzp we allocate real memory for the huge page and clear it. On ENOMEM, graceful fallback: we create a new pmd table and set the pte around the fault address to a newly allocated normal (4k) page. All other ptes in the pmd are set to the normal zero page.

We cannot split the hzp (and it's a bug if we try), but we can split the pmd which points to it. On splitting the pmd we create a table with all ptes set to the normal zero page.
The patchset is organized in a bisect-friendly way:
 Patches 01-07: prepare all code paths for hzp
 Patch 08: all code paths are covered: safe to setup hzp
 Patch 09: lazy allocation
 Patch 10: lockless refcounting for hzp

--

By hpa's request I've tried an alternative approach for the hzp implementation (see the "Virtual huge zero page" patchset): a pmd table with all entries set to the zero page. This way should be more cache friendly, but it increases TLB pressure. The problem with the virtual huge zero page: it requires per-arch enabling. We need a way to mark that a pmd table has all ptes set to the zero page.

Some numbers to compare the two implementations (on 4s Westmere-EX):

Microbenchmark1
===============

test:
        posix_memalign((void **)&p, 2 * MB, 8 * GB);
        for (i = 0; i < 100; i++) {
                assert(memcmp(p, p + 4*GB, 4*GB) == 0);
                asm volatile ("": : :"memory");
        }

hzp:
 Performance counter stats for './test_memcmp' (5 runs):

      32356.272845 task-clock              # 0.998 CPUs utilized          ( +- 0.13% )
                40 context-switches        # 0.001 K/sec                  ( +- 0.94% )
                 0 CPU-migrations          # 0.000 K/sec
             4,218 page-faults             # 0.130 K/sec                  ( +- 0.00% )
    76,712,481,765 cycles                  # 2.371 GHz                    ( +- 0.13% ) [83.31%]
    36,279,577,636 stalled-cycles-frontend # 47.29% frontend cycles idle  ( +- 0.28% ) [83.35%]
     1,684,049,110 stalled-cycles-backend  # 2.20% backend cycles idle    ( +- 2.96% ) [66.67%]
   134,355,715,816 instructions            # 1.75 insns per cycle
                                           # 0.27 stalled cycles per insn ( +- 0.10% ) [83.35%]
    13,526,169,702 branches                # 418.039 M/sec                ( +- 0.10% ) [83.31%]
         1,058,230 branch-misses           # 0.01% of all branches        ( +- 0.91% ) [83.36%]

      32.413866442 seconds time elapsed                                   ( +- 0.13% )

vhzp:
 Performance counter stats for './test_memcmp' (5 runs):

      30327.183829 task-clock              # 0.998 CPUs utilized          ( +- 0.13% )
                38 context-switches        # 0.001 K/sec                  ( +- 1.53% )
                 0 CPU-migrations          # 0.000 K/sec
             4,218 page-faults             # 0.139 K/sec                  ( +- 0.01% )
    71,964,773,660 cycles                  # 2.373 GHz                    ( +- 0.13% ) [83.35%]
    31,191,284,231 stalled-cycles-frontend # 43.34% frontend cycles idle  ( +- 0.40% ) [83.32%]
       773,484,474 stalled-cycles-backend  # 1.07% backend cycles idle    ( +- 6.61% ) [66.67%]
   134,982,215,437 instructions            # 1.88 insns per cycle
                                           # 0.23 stalled
Re: [PATCH v4 00/10, REBASED] Introduce huge zero page
On 10/16/2012 07:28 PM, Kirill A. Shutemov wrote: On Tue, Oct 16, 2012 at 07:13:07PM +0800, Ni zhan Chen wrote: On 10/16/2012 06:54 PM, Kirill A. Shutemov wrote: On Tue, Oct 16, 2012 at 05:53:07PM +0800, Ni zhan Chen wrote:

By hpa's request I've tried an alternative approach for the hzp implementation (see the "Virtual huge zero page" patchset): a pmd table with all entries set to the zero page. This way should be more cache friendly, but it increases TLB pressure.

Thanks for your excellent work. But could you explain why the current implementation is not cache friendly while hpa's proposal is? Thanks in advance.

In workloads like microbenchmark1 you need N * size(zero page) cache space to get the zero page fully cached, where N is the cache associativity. If the zero page is 2M, cache pressure is significant. On the other hand, a table of 4k zero pages (hpa's proposal) will increase pressure on the TLB, since we have more pages for the same memory area, so we have to do more page translations in this case. On my test machine with a simple memcmp() the virtual huge zero page is faster, but it highly depends on TLB size, cache size, memory access and page translation costs. It looks like cache size in modern processors grows faster than TLB size.

Oh, I see, thanks for your quick response. One more question below.

The problem with the virtual huge zero page: it requires per-arch enabling. We need a way to mark that a pmd table has all ptes set to the zero page.
Some numbers to compare the two implementations (on 4s Westmere-EX):

Microbenchmark1
===============

test:
        posix_memalign((void **)&p, 2 * MB, 8 * GB);
        for (i = 0; i < 100; i++) {
                assert(memcmp(p, p + 4*GB, 4*GB) == 0);
                asm volatile ("": : :"memory");
        }

hzp:
 Performance counter stats for './test_memcmp' (5 runs):

      32356.272845 task-clock              # 0.998 CPUs utilized          ( +- 0.13% )
                40 context-switches        # 0.001 K/sec                  ( +- 0.94% )
                 0 CPU-migrations          # 0.000 K/sec
             4,218 page-faults             # 0.130 K/sec                  ( +- 0.00% )
    76,712,481,765 cycles                  # 2.371 GHz                    ( +- 0.13% ) [83.31%]
    36,279,577,636 stalled-cycles-frontend # 47.29% frontend cycles idle  ( +- 0.28% ) [83.35%]
     1,684,049,110 stalled-cycles-backend  # 2.20% backend cycles idle    ( +- 2.96% ) [66.67%]
   134,355,715,816 instructions            # 1.75 insns per cycle
                                           # 0.27 stalled cycles per insn ( +- 0.10% ) [83.35%]
    13,526,169,702 branches                # 418.039 M/sec                ( +- 0.10% ) [83.31%]
         1,058,230 branch-misses           # 0.01% of all branches        ( +- 0.91% ) [83.36%]

      32.413866442 seconds time elapsed                                   ( +- 0.13% )

vhzp:
 Performance counter stats for './test_memcmp' (5 runs):

      30327.183829 task-clock              # 0.998 CPUs utilized          ( +- 0.13% )
                38 context-switches        # 0.001 K/sec                  ( +- 1.53% )
                 0 CPU-migrations          # 0.000 K/sec
             4,218 page-faults             # 0.139 K/sec                  ( +- 0.01% )
    71,964,773,660 cycles                  # 2.373 GHz                    ( +- 0.13% ) [83.35%]
    31,191,284,231 stalled-cycles-frontend # 43.34% frontend cycles idle  ( +- 0.40% ) [83.32%]
       773,484,474 stalled-cycles-backend  # 1.07% backend cycles idle    ( +- 6.61% ) [66.67%]
   134,982,215,437 instructions            # 1.88 insns per cycle
                                           # 0.23 stalled cycles per insn ( +- 0.11% ) [83.32%]
    13,509,150,683 branches                # 445.447 M/sec                ( +- 0.11% ) [83.34%]
         1,017,667 branch-misses           # 0.01% of all branches        ( +- 1.07% ) [83.32%]

      30.381324695 seconds time elapsed                                   ( +- 0.13% )

Could you tell me which numbers I should look at in these performance counter stats? And what's the benefit of your current implementation compared to hpa's proposal? Sorry for my ignorance.

Could you tell me which numbers I should look at in these performance counter stats? The same question about the second benchmark's counter stats, thanks in advance. :-)

I've missed relevant counters in this run, you can see them in the second benchmark. Relevant counters:
 L1-dcache-*, LLC-*: show cache-related stats (hits/misses);
 dTLB-*: show data TLB hits and misses.
Indirectly relevant counters:
 stalled-cycles-*: how long the CPU pipeline has to wait for data.

Oh, I see, thanks for your patience. :-)

Microbenchmark2
===============

test:
        posix_memalign((v
Re: [PATCH v4 00/10, REBASED] Introduce huge zero page
On 10/16/2012 06:54 PM, Kirill A. Shutemov wrote: On Tue, Oct 16, 2012 at 05:53:07PM +0800, Ni zhan Chen wrote:

By hpa's request I've tried an alternative approach for the hzp implementation (see the "Virtual huge zero page" patchset): a pmd table with all entries set to the zero page. This way should be more cache friendly, but it increases TLB pressure.

Thanks for your excellent work. But could you explain why the current implementation is not cache friendly while hpa's proposal is? Thanks in advance.

In workloads like microbenchmark1 you need N * size(zero page) cache space to get the zero page fully cached, where N is the cache associativity. If the zero page is 2M, cache pressure is significant. On the other hand, a table of 4k zero pages (hpa's proposal) will increase pressure on the TLB, since we have more pages for the same memory area, so we have to do more page translations in this case. On my test machine with a simple memcmp() the virtual huge zero page is faster, but it highly depends on TLB size, cache size, memory access and page translation costs. It looks like cache size in modern processors grows faster than TLB size.

Oh, I see, thanks for your quick response. One more question below.

The problem with the virtual huge zero page: it requires per-arch enabling. We need a way to mark that a pmd table has all ptes set to the zero page.
Some numbers to compare the two implementations (on 4s Westmere-EX):

Microbenchmark1
===============

test:
        posix_memalign((void **)&p, 2 * MB, 8 * GB);
        for (i = 0; i < 100; i++) {
                assert(memcmp(p, p + 4*GB, 4*GB) == 0);
                asm volatile ("": : :"memory");
        }

hzp:
 Performance counter stats for './test_memcmp' (5 runs):

    32356.272845 task-clock                #   0.998 CPUs utilized          ( +- 0.13% )
              40 context-switches          #   0.001 K/sec                  ( +- 0.94% )
               0 CPU-migrations            #   0.000 K/sec
           4,218 page-faults               #   0.130 K/sec                  ( +- 0.00% )
  76,712,481,765 cycles                    #   2.371 GHz                    ( +- 0.13% ) [83.31%]
  36,279,577,636 stalled-cycles-frontend   #  47.29% frontend cycles idle   ( +- 0.28% ) [83.35%]
   1,684,049,110 stalled-cycles-backend    #   2.20% backend cycles idle    ( +- 2.96% ) [66.67%]
 134,355,715,816 instructions              #   1.75 insns per cycle
                                           #   0.27 stalled cycles per insn ( +- 0.10% ) [83.35%]
  13,526,169,702 branches                  # 418.039 M/sec                  ( +- 0.10% ) [83.31%]
       1,058,230 branch-misses             #   0.01% of all branches        ( +- 0.91% ) [83.36%]

    32.413866442 seconds time elapsed                                       ( +- 0.13% )

vhzp:
 Performance counter stats for './test_memcmp' (5 runs):

    30327.183829 task-clock                #   0.998 CPUs utilized          ( +- 0.13% )
              38 context-switches          #   0.001 K/sec                  ( +- 1.53% )
               0 CPU-migrations            #   0.000 K/sec
           4,218 page-faults               #   0.139 K/sec                  ( +- 0.01% )
  71,964,773,660 cycles                    #   2.373 GHz                    ( +- 0.13% ) [83.35%]
  31,191,284,231 stalled-cycles-frontend   #  43.34% frontend cycles idle   ( +- 0.40% ) [83.32%]
     773,484,474 stalled-cycles-backend    #   1.07% backend cycles idle    ( +- 6.61% ) [66.67%]
 134,982,215,437 instructions              #   1.88 insns per cycle
                                           #   0.23 stalled cycles per insn ( +- 0.11% ) [83.32%]
  13,509,150,683 branches                  # 445.447 M/sec                  ( +- 0.11% ) [83.34%]
       1,017,667 branch-misses             #   0.01% of all branches        ( +- 1.07% ) [83.32%]

    30.381324695 seconds time elapsed                                       ( +- 0.13% )

Could you tell me which data I should care about in these performance counter stats? And what's the benefit of your current implementation compared to hpa's proposal? Sorry for my ignorance.

The same question about the second benchmark counter stats, thanks in advance. :-)

Microbenchmark2
===============

test:
        posix_memalign((void **)&p, 2 * MB, 8 * GB);
        for (i = 0; i < 1000; i++) {
                char *_p = p;
                while (_p < p + 4*GB) {
                        assert(*_p == *(_p + 4*GB));
                        _p += 4096;
                        asm volatile ("": : :"memory");
                }
        }

hzp:
 Performance counter stats for 'taskset -c 0 ./test_memcmp2' (5 runs):

     3505.727639 task-clock
Re: [PATCH] oom, memcg: handle sysctl oom_kill_allocating_task while memcg oom happening
On 10/16/2012 06:12 PM, Sha Zhengju wrote: From: Sha Zhengju

The sysctl oom_kill_allocating_task enables or disables killing the OOM-triggering task in out-of-memory situations, but it only works for system-wide oom. It is also a useful knob in memcg, so take it into consideration when an oom happens in a memcg. Other sysctls such as panic_on_oom are already memcg-aware.

Is this a resend or a new version? Could you add a changelog if it is the latter?

Signed-off-by: Sha Zhengju
---
 mm/memcontrol.c | 9 +++++++++
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e4e9b18..c329940 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1486,6 +1486,15 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, order, NULL);
 	totalpages = mem_cgroup_get_limit(memcg) >> PAGE_SHIFT ? : 1;
+	if (sysctl_oom_kill_allocating_task && current->mm &&
+	    !oom_unkillable_task(current, memcg, NULL) &&
+	    current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
+		get_task_struct(current);
+		oom_kill_process(current, gfp_mask, order, 0, totalpages, memcg, NULL,
+				 "Memory cgroup out of memory (oom_kill_allocating_task)");
+		return;
+	}
 	for_each_mem_cgroup_tree(iter, memcg) {
 		struct cgroup *cgroup = iter->css.cgroup;
 		struct cgroup_iter it;

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v4 00/10, REBASED] Introduce huge zero page
On 10/15/2012 02:00 PM, Kirill A. Shutemov wrote: From: "Kirill A. Shutemov"

Hi Andrew, here's the huge zero page patchset rebased to v3.7-rc1.

Andrea, I've dropped your Reviewed-by due to not-so-trivial conflicts during the rebase. Could you look through it again? Patches 2, 3, 4, 7, 10 had conflicts, mostly due to the new MMU notifiers interface.

=

During testing I noticed big (up to 2.5 times) memory consumption overhead on some workloads (e.g. ft.A from NPB) if THP is enabled. The main reason for the big difference is the lack of a zero page in the THP case: we have to allocate a real page on a read page fault.

A program to demonstrate the issue:

#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

#define MB 1024*1024

int main(int argc, char **argv)
{
        char *p;
        int i;

        posix_memalign((void **)&p, 2 * MB, 200 * MB);
        for (i = 0; i < 200 * MB; i += 4096)
                assert(p[i] == 0);
        pause();
        return 0;
}

With thp-never RSS is about 400k, but with thp-always it's 200M. After the patchset thp-always RSS is 400k too.

Design overview.

Huge zero page (hzp) is a non-movable huge page (2M on x86-64) filled with zeros. The way we allocate it changes over the patchset:

- [01/10] simplest way: hzp allocated at boot time in hugepage_init();
- [09/10] lazy allocation on first use;
- [10/10] lockless refcounting + shrinker-reclaimable hzp.

We set it up in do_huge_pmd_anonymous_page() if the area around the fault address is suitable for THP and we've got a read page fault. If we fail to set up the hzp (ENOMEM) we fall back to handle_pte_fault() as we normally do in THP.

On a wp fault to the hzp we allocate real memory for the huge page and clear it. On ENOMEM, graceful fallback: we create a new pmd table and set the pte around the fault address to a newly allocated normal (4k) page. All other ptes in the pmd are set to the normal zero page.

We cannot split the hzp itself (and it's a bug if we try), but we can split the pmd which points to it. On splitting the pmd we create a table with all ptes set to the normal zero page.
The patchset is organized in a bisect-friendly way:

Patches 01-07: prepare all code paths for hzp
Patch 08: all code paths are covered: safe to set up hzp
Patch 09: lazy allocation
Patch 10: lockless refcounting for hzp

v4:
- Rebase to v3.7-rc1;
- Update commit message;
v3:
- fix potential deadlock in refcounting code on preemptive kernel.
- do not mark huge zero page as movable.
- fix typo in comment.
- Reviewed-by tag from Andrea Arcangeli.
v2:
- Avoid find_vma() if we've already got the vma on the stack. Suggested by Andrea Arcangeli.
- Implement refcounting for huge zero page.

--

By hpa's request I've tried an alternative approach to the hzp implementation (see the Virtual huge zero page patchset): a pmd table with all entries set to the zero page. This way should be more cache friendly, but it increases TLB pressure.

Thanks for your excellent work. But could you explain why the current implementation is not cache friendly while hpa's proposal is? Thanks in advance.

The problem with the virtual huge zero page: it requires per-arch enabling. We need a way to mark that a pmd table has all ptes set to the zero page.
Some numbers to compare the two implementations (on 4s Westmere-EX):

Microbenchmark1
===============

test:
        posix_memalign((void **)&p, 2 * MB, 8 * GB);
        for (i = 0; i < 100; i++) {
                assert(memcmp(p, p + 4*GB, 4*GB) == 0);
                asm volatile ("": : :"memory");
        }

hzp:
 Performance counter stats for './test_memcmp' (5 runs):

    32356.272845 task-clock                #   0.998 CPUs utilized          ( +- 0.13% )
              40 context-switches          #   0.001 K/sec                  ( +- 0.94% )
               0 CPU-migrations            #   0.000 K/sec
           4,218 page-faults               #   0.130 K/sec                  ( +- 0.00% )
  76,712,481,765 cycles                    #   2.371 GHz                    ( +- 0.13% ) [83.31%]
  36,279,577,636 stalled-cycles-frontend   #  47.29% frontend cycles idle   ( +- 0.28% ) [83.35%]
   1,684,049,110 stalled-cycles-backend    #   2.20% backend cycles idle    ( +- 2.96% ) [66.67%]
 134,355,715,816 instructions              #   1.75 insns per cycle
                                           #   0.27 stalled cycles per insn ( +- 0.10% ) [83.35%]
  13,526,169,702 branches                  # 418.039 M/sec                  ( +- 0.10% ) [83.31%]
       1,058,230 branch-misses             #   0.01% of all branches        ( +- 0.91% ) [83.36%]

    32.413866442 seconds time elapsed                                       ( +- 0.13% )

vhzp:
 Performance counter stats for './test_memcmp' (5 runs):

    30327.183829 task-clock                #   0.998 CPUs utilized          ( +- 0.13% )
              38 context-switches          #   0.001
Re: [PATCH] mm: thp: Set the accessed flag for old pages on access fault.
On 10/01/2012 10:59 PM, Andrea Arcangeli wrote: Hi Will, On Mon, Oct 01, 2012 at 02:51:45PM +0100, Will Deacon wrote:

+void huge_pmd_set_accessed(struct mm_struct *mm, struct vm_area_struct *vma,
+			   unsigned long address, pmd_t *pmd, pmd_t orig_pmd)
+{
+	pmd_t entry;
+
+	spin_lock(&mm->page_table_lock);
+	entry = pmd_mkyoung(orig_pmd);
+	if (pmdp_set_access_flags(vma, address & HPAGE_PMD_MASK, pmd, entry, 0))
+		update_mmu_cache(vma, address, pmd);

If the pmd is being split, this may not be a transhuge pmd anymore by the time you obtained the lock. (orig_pmd could be stale, and it wasn't verified with pmd_same either.)

Could you tell me when pmd_same should be called in general?

The lock should be obtained through pmd_trans_huge_lock:

	if (pmd_trans_huge_lock(orig_pmd, vma) == 1) {
		/* set young bit */
		spin_unlock(&mm->page_table_lock);
	}

On x86:

int pmdp_set_access_flags(struct vm_area_struct *vma,
			  unsigned long address, pmd_t *pmdp,
			  pmd_t entry, int dirty)
{
	int changed = !pmd_same(*pmdp, entry);

	VM_BUG_ON(address & ~HPAGE_PMD_MASK);

	if (changed && dirty) {
		*pmdp = entry;

With dirty == 0 it looks like it won't make any difference, but I guess your arm pmdp_set_access_flags is different. However it seems "dirty" means write access, and so the invocation would better match the pte case:

	if (pmdp_set_access_flags(vma, address & HPAGE_PMD_MASK, pmd, entry,
				  flags & FAULT_FLAG_WRITE))

But note, you still have to update it even when "dirty" == 0, or it'll still infinite-loop for read accesses.
+	spin_unlock(&mm->page_table_lock);
+}
+
 int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long address, pmd_t *pmd, pmd_t orig_pmd)
 {

diff --git a/mm/memory.c b/mm/memory.c
index 5736170..d5c007d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3537,7 +3537,11 @@ retry:
 			if (unlikely(ret & VM_FAULT_OOM))
 				goto retry;
 			return ret;
+		} else {
+			huge_pmd_set_accessed(mm, vma, address, pmd,
+					      orig_pmd);
 		}
+
 		return 0;

Thanks, Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majord...@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: em...@kvack.org
Re: [PATCH v4 00/10, REBASED] Introduce huge zero page
On 10/16/2012 07:28 PM, Kirill A. Shutemov wrote: On Tue, Oct 16, 2012 at 07:13:07PM +0800, Ni zhan Chen wrote: On 10/16/2012 06:54 PM, Kirill A. Shutemov wrote: On Tue, Oct 16, 2012 at 05:53:07PM +0800, Ni zhan Chen wrote:

By hpa's request I've tried an alternative approach to the hzp implementation (see the Virtual huge zero page patchset): a pmd table with all entries set to the zero page. This way should be more cache friendly, but it increases TLB pressure.

Thanks for your excellent work. But could you explain why the current implementation is not cache friendly while hpa's proposal is? Thanks in advance.

In workloads like microbenchmark1 you need N * size(zero page) of cache space to get the zero page fully cached, where N is the cache associativity. If the zero page is 2M, the cache pressure is significant. On the other hand, a table of 4k zero pages (hpa's proposal) will increase pressure on the TLB, since we have more pages for the same memory area, so we have to do more page translations. On my test machine, with a simple memcmp() the virtual huge zero page is faster. But it highly depends on TLB size, cache size, memory access and page translation costs. It looks like cache size in modern processors grows faster than TLB size.

Oh, I see, thanks for your quick response. One more question below.

The problem with the virtual huge zero page: it requires per-arch enabling. We need a way to mark that a pmd table has all ptes set to the zero page.
Some numbers to compare the two implementations (on 4s Westmere-EX):

Microbenchmark1
===============

test:
        posix_memalign((void **)&p, 2 * MB, 8 * GB);
        for (i = 0; i < 100; i++) {
                assert(memcmp(p, p + 4*GB, 4*GB) == 0);
                asm volatile ("": : :"memory");
        }

hzp:
 Performance counter stats for './test_memcmp' (5 runs):

    32356.272845 task-clock                #   0.998 CPUs utilized          ( +- 0.13% )
              40 context-switches          #   0.001 K/sec                  ( +- 0.94% )
               0 CPU-migrations            #   0.000 K/sec
           4,218 page-faults               #   0.130 K/sec                  ( +- 0.00% )
  76,712,481,765 cycles                    #   2.371 GHz                    ( +- 0.13% ) [83.31%]
  36,279,577,636 stalled-cycles-frontend   #  47.29% frontend cycles idle   ( +- 0.28% ) [83.35%]
   1,684,049,110 stalled-cycles-backend    #   2.20% backend cycles idle    ( +- 2.96% ) [66.67%]
 134,355,715,816 instructions              #   1.75 insns per cycle
                                           #   0.27 stalled cycles per insn ( +- 0.10% ) [83.35%]
  13,526,169,702 branches                  # 418.039 M/sec                  ( +- 0.10% ) [83.31%]
       1,058,230 branch-misses             #   0.01% of all branches        ( +- 0.91% ) [83.36%]

    32.413866442 seconds time elapsed                                       ( +- 0.13% )

vhzp:
 Performance counter stats for './test_memcmp' (5 runs):

    30327.183829 task-clock                #   0.998 CPUs utilized          ( +- 0.13% )
              38 context-switches          #   0.001 K/sec                  ( +- 1.53% )
               0 CPU-migrations            #   0.000 K/sec
           4,218 page-faults               #   0.139 K/sec                  ( +- 0.01% )
  71,964,773,660 cycles                    #   2.373 GHz                    ( +- 0.13% ) [83.35%]
  31,191,284,231 stalled-cycles-frontend   #  43.34% frontend cycles idle   ( +- 0.40% ) [83.32%]
     773,484,474 stalled-cycles-backend    #   1.07% backend cycles idle    ( +- 6.61% ) [66.67%]
 134,982,215,437 instructions              #   1.88 insns per cycle
                                           #   0.23 stalled cycles per insn ( +- 0.11% ) [83.32%]
  13,509,150,683 branches                  # 445.447 M/sec                  ( +- 0.11% ) [83.34%]
       1,017,667 branch-misses             #   0.01% of all branches        ( +- 1.07% ) [83.32%]

    30.381324695 seconds time elapsed                                       ( +- 0.13% )

Could you tell me which data I should care about in these performance counter stats? And what's the benefit of your current implementation compared to hpa's proposal? Sorry for my ignorance.

The same question about the second benchmark counter stats, thanks in advance. :-)

I've missed the relevant counters in this run; you can see them in the second benchmark. Relevant counters: L1-dcache-*, LLC-*: cache-related stats (hits/misses); dTLB-*: data TLB hits and misses. Indirectly relevant counters: stalled-cycles-*: how long the CPU pipeline has to wait for data.

Oh, I see, thanks for your patience. :-)

Microbenchmark2
===============

test: posix_memalign((void **)&p, 2 * MB, 8 * GB
Re: [PATCH v3 00/10] Introduce huge zero page
On 10/03/2012 08:04 AM, Kirill A. Shutemov wrote: On Tue, Oct 02, 2012 at 03:31:48PM -0700, Andrew Morton wrote: On Tue, 2 Oct 2012 18:19:22 +0300 Kirill A. Shutemov kirill.shute...@linux.intel.com wrote:

During testing I noticed big (up to 2.5 times) memory consumption overhead on some workloads (e.g. ft.A from NPB) if THP is enabled. The main reason for the big difference is the lack of a zero page in the THP case: we have to allocate a real page on a read page fault.

A program to demonstrate the issue:

#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

#define MB 1024*1024

int main(int argc, char **argv)
{
        char *p;
        int i;

        posix_memalign((void **)&p, 2 * MB, 200 * MB);
        for (i = 0; i < 200 * MB; i += 4096)
                assert(p[i] == 0);
        pause();
        return 0;
}

With thp-never RSS is about 400k, but with thp-always it's 200M. After the patchset thp-always RSS is 400k too.

I'd like to see a full description of the design, please.

Okay.

Design overview.

Huge zero page (hzp) is a non-movable huge page (2M on x86-64) filled with zeros. The way we allocate it changes over the patchset:

- [01/10] simplest way: hzp allocated at boot time in hugepage_init();
- [09/10] lazy allocation on first use;
- [10/10] lockless refcounting + shrinker-reclaimable hzp.

We set it up in do_huge_pmd_anonymous_page() if the area around the fault address is suitable for THP and we've got a read page fault. If we fail to set up the hzp (ENOMEM) we fall back to handle_pte_fault() as we normally do in THP.

On a wp fault to the hzp we allocate real memory for the huge page and clear it. On ENOMEM, graceful fallback: we create a new pmd table and set the pte around the fault address to a newly allocated normal (4k) page. All other ptes in the pmd are set to the normal zero page.

We cannot split the hzp itself (and it's a bug if we try), but we can split the pmd which points to it. On splitting the pmd we create a table with all ptes set to the normal zero page.
The patchset is organized in a bisect-friendly way:

Patches 01-07: prepare all code paths for hzp
Patch 08: all code paths are covered: safe to set up hzp
Patch 09: lazy allocation
Patch 10: lockless refcounting for hzp

--

By hpa's request I've tried an alternative approach to the hzp implementation (see the Virtual huge zero page patchset): a pmd table with all entries set to the zero page. This way should be more cache friendly, but it increases TLB pressure.

The problem with the virtual huge zero page: it requires per-arch enabling. We need a way to mark that a pmd table has all ptes set to the zero page.

Some numbers to compare the two implementations (on 4s Westmere-EX):

Microbenchmark1
===============

test:
        posix_memalign((void **)&p, 2 * MB, 8 * GB);
        for (i = 0; i < 100; i++) {
                assert(memcmp(p, p + 4*GB, 4*GB) == 0);
                asm volatile ("": : :"memory");
        }

hzp:
 Performance counter stats for './test_memcmp' (5 runs):

    32356.272845 task-clock                #   0.998 CPUs utilized          ( +- 0.13% )
              40 context-switches          #   0.001 K/sec                  ( +- 0.94% )
               0 CPU-migrations            #   0.000 K/sec
           4,218 page-faults               #   0.130 K/sec                  ( +- 0.00% )
  76,712,481,765 cycles                    #   2.371 GHz                    ( +- 0.13% ) [83.31%]
  36,279,577,636 stalled-cycles-frontend   #  47.29% frontend cycles idle   ( +- 0.28% ) [83.35%]
   1,684,049,110 stalled-cycles-backend    #   2.20% backend cycles idle    ( +- 2.96% ) [66.67%]
 134,355,715,816 instructions              #   1.75 insns per cycle
                                           #   0.27 stalled cycles per insn ( +- 0.10% ) [83.35%]
  13,526,169,702 branches                  # 418.039 M/sec                  ( +- 0.10% ) [83.31%]
       1,058,230 branch-misses             #   0.01% of all branches        ( +- 0.91% ) [83.36%]

    32.413866442 seconds time elapsed                                       ( +- 0.13% )

vhzp:
 Performance counter stats for './test_memcmp' (5 runs):

    30327.183829 task-clock                #   0.998 CPUs utilized          ( +- 0.13% )
              38 context-switches          #   0.001 K/sec                  ( +- 1.53% )
               0 CPU-migrations            #   0.000 K/sec
           4,218 page-faults               #   0.139 K/sec                  ( +- 0.01% )
  71,964,773,660 cycles                    #   2.373 GHz                    ( +- 0.13% ) [83.35%]
  31,191,284,231 stalled-cycles-frontend   #  43.34% frontend cycles idle   ( +- 0.40% ) [83.32%]
     773,484,474 stalled-cycles-backend    #   1.07% backend cycles idle    ( +- 6.61% ) [66.67%]
 134,982,215,437 instructions              #   1.88 insns per cycle
Re: [PATCH v3 07/10] thp: implement splitting pmd for huge zero page
On 10/12/2012 12:13 PM, Kirill A. Shutemov wrote: On Fri, Oct 12, 2012 at 11:23:37AM +0800, Ni zhan Chen wrote: On 10/02/2012 11:19 PM, Kirill A. Shutemov wrote: From: "Kirill A. Shutemov"

We can't split the huge zero page itself, but we can split the pmd which points to it. On splitting the pmd we create a table with all ptes set to the normal zero page.

Signed-off-by: Kirill A. Shutemov
Reviewed-by: Andrea Arcangeli
---
 mm/huge_memory.c | 32 ++++++++++++++++++++++++++++++++
 1 files changed, 32 insertions(+), 0 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 95032d3..3f1c59c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1600,6 +1600,7 @@ int split_huge_page(struct page *page)
 	struct anon_vma *anon_vma;
 	int ret = 1;

+	BUG_ON(is_huge_zero_pfn(page_to_pfn(page)));
 	BUG_ON(!PageAnon(page));
 	anon_vma = page_lock_anon_vma(page);
 	if (!anon_vma)
@@ -2503,6 +2504,32 @@ static int khugepaged(void *none)
 	return 0;
 }

+static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
+		unsigned long haddr, pmd_t *pmd)
+{
+	pgtable_t pgtable;
+	pmd_t _pmd;
+	int i;
+
+	pmdp_clear_flush_notify(vma, haddr, pmd);

Why can't I find the function pmdp_clear_flush_notify in the kernel source code? Do you mean pmdp_clear_flush_young_notify or something like that?

It was changed recently. See commit 2ec74c3 ("mm: move all mmu notifier invocations to be done outside the PT lock").

Oh, thanks!
Re: [PATCH v3 07/10] thp: implement splitting pmd for huge zero page
On 10/02/2012 11:19 PM, Kirill A. Shutemov wrote: From: "Kirill A. Shutemov" We can't split huge zero page itself, but we can split the pmd which points to it. On splitting the pmd we create a table with all ptes set to normal zero page. Signed-off-by: Kirill A. Shutemov Reviewed-by: Andrea Arcangeli --- mm/huge_memory.c | 32 1 files changed, 32 insertions(+), 0 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 95032d3..3f1c59c 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1600,6 +1600,7 @@ int split_huge_page(struct page *page) struct anon_vma *anon_vma; int ret = 1; + BUG_ON(is_huge_zero_pfn(page_to_pfn(page))); BUG_ON(!PageAnon(page)); anon_vma = page_lock_anon_vma(page); if (!anon_vma) @@ -2503,6 +2504,32 @@ static int khugepaged(void *none) return 0; } +static void __split_huge_zero_page_pmd(struct vm_area_struct *vma, + unsigned long haddr, pmd_t *pmd) +{ + pgtable_t pgtable; + pmd_t _pmd; + int i; + + pmdp_clear_flush_notify(vma, haddr, pmd); why I can't find function pmdp_clear_flush_notify in kernel source code? Do you mean pmdp_clear_flush_young_notify or something like that? 
+	/* leave pmd empty until pte is filled */
+
+	pgtable = get_pmd_huge_pte(vma->vm_mm);
+	pmd_populate(vma->vm_mm, &_pmd, pgtable);
+
+	for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
+		pte_t *pte, entry;
+		entry = pfn_pte(my_zero_pfn(haddr), vma->vm_page_prot);
+		entry = pte_mkspecial(entry);
+		pte = pte_offset_map(&_pmd, haddr);
+		VM_BUG_ON(!pte_none(*pte));
+		set_pte_at(vma->vm_mm, haddr, pte, entry);
+		pte_unmap(pte);
+	}
+	smp_wmb(); /* make pte visible before pmd */
+	pmd_populate(vma->vm_mm, pmd, pgtable);
+}
+
 void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
 		pmd_t *pmd)
 {
@@ -2516,6 +2543,11 @@ void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
 		spin_unlock(&vma->vm_mm->page_table_lock);
 		return;
 	}
+	if (is_huge_zero_pmd(*pmd)) {
+		__split_huge_zero_page_pmd(vma, haddr, pmd);
+		spin_unlock(&vma->vm_mm->page_table_lock);
+		return;
+	}
 	page = pmd_page(*pmd);
 	VM_BUG_ON(!page_count(page));
 	get_page(page);
Re: [PATCH 8/10] memory-hotplug : remove page table of x86_64 architecture
On 10/08/2012 01:23 PM, Wen Congyang wrote: At 10/08/2012 12:37 PM, Andi Kleen Wrote: Yasuaki Ishimatsu writes:

+	}
+
+	/*
+	 * We use 2M page, but we need to remove part of them,
+	 * so split 2M page to 4K page.
+	 */
+	pte = alloc_low_page(&pte_phys);

What happens when the allocation fails? alloc_low_page seems to be buggy there too, it would __pa a NULL pointer.

Yes, it will cause a kernel panic in __pa() if CONFIG_DEBUG_VIRTUAL is set. Otherwise, it will return a NULL pointer. I will update this patch to deal with the NULL pointer.

+	if (pud_large(*pud)) {
+		if ((addr & ~PUD_MASK) == 0 && next <= end) {
+			set_pud(pud, __pud(0));
+			pages++;
+			continue;
+		}
+
+		/*
+		 * We use 1G page, but we need to remove part of them,
+		 * so split 1G page to 2M page.
+		 */
+		pmd = alloc_low_page(&pmd_phys);

Same here

+		__split_large_page((pte_t *)pud, addr, (pte_t *)pmd);
+
+		spin_lock(&init_mm.page_table_lock);
+		pud_populate(&init_mm, pud, __va(pmd_phys));
+		spin_unlock(&init_mm.page_table_lock);
+	}
+
+	pmd = map_low_page(pmd_offset(pud, 0));
+	phys_pmd_remove(pmd, addr, end);
+	unmap_low_page(pmd);
+	__flush_tlb_all();
+	}
+	__flush_tlb_all();

Hi Congyang, I see you call __flush_tlb_all() each time a pud entry (and all the pmd/pte entries under it) is modified; how do you determine the flush frequency? Why not flush per pmd entry? Regards, Chen

This doesn't flush the other CPUs, does it?

How to flush the other CPUs' TLBs? Use on_each_cpu() to run __flush_tlb_all() on each online cpu? Thanks Wen Congyang -Andi -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majord...@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: em...@kvack.org
Re: [PATCH] mm: memmap_init_zone() performance improvement
On 10/08/2012 11:16 PM, Mel Gorman wrote: On Wed, Oct 03, 2012 at 08:56:14AM -0600, Mike Yoknis wrote: memmap_init_zone() loops through every Page Frame Number (pfn), including pfn values that are within the gaps between existing memory sections. The unneeded looping will become a boot performance issue when machines configure larger memory ranges that will contain larger and more numerous gaps. The code will skip across invalid sections to reduce the number of loops executed. Signed-off-by: Mike Yoknis

This only helps SPARSEMEM and changes more headers than should be necessary. It would have been easier to do something simple like

	if (!early_pfn_valid(pfn)) {
		pfn = ALIGN(pfn + MAX_ORDER_NR_PAGES, MAX_ORDER_NR_PAGES) - 1;
		continue;
	}

So can a present memory section in sparsemem have a MAX_ORDER_NR_PAGES-aligned range that is all invalid? If the answer is yes, when will this happen?

because that would obey the expectation that pages within a MAX_ORDER_NR_PAGES-aligned range are all valid or all invalid (ARM is the exception that breaks this rule). It would be less efficient on SPARSEMEM than what you're trying to merge but I do not see the need for the additional complexity unless you can show it makes a big difference to boot times.
Re: [PATCH] mm: memmap_init_zone() performance improvement
On 10/03/2012 10:56 PM, Mike Yoknis wrote: memmap_init_zone() loops through every Page Frame Number (pfn), including pfn values that are within the gaps between existing memory sections. The unneeded looping will become a boot performance issue when machines configure larger memory ranges that will contain larger and more numerous gaps. The code will skip across invalid sections to reduce the number of loops executed. looks reasonable to me. Signed-off-by: Mike Yoknis --- arch/x86/include/asm/mmzone_32.h |2 ++ arch/x86/include/asm/page_32.h |1 + arch/x86/include/asm/page_64_types.h |3 ++- include/asm-generic/page.h |1 + include/linux/mmzone.h |6 ++ mm/page_alloc.c |5 - 6 files changed, 16 insertions(+), 2 deletions(-) diff --git a/arch/x86/include/asm/mmzone_32.h b/arch/x86/include/asm/mmzone_32.h index eb05fb3..73c5c74 100644 --- a/arch/x86/include/asm/mmzone_32.h +++ b/arch/x86/include/asm/mmzone_32.h @@ -48,6 +48,8 @@ static inline int pfn_to_nid(unsigned long pfn) #endif } +#define next_pfn_try(pfn) ((pfn)+1) + static inline int pfn_valid(int pfn) { int nid = pfn_to_nid(pfn); diff --git a/arch/x86/include/asm/page_32.h b/arch/x86/include/asm/page_32.h index da4e762..e2c4cfc 100644 --- a/arch/x86/include/asm/page_32.h +++ b/arch/x86/include/asm/page_32.h @@ -19,6 +19,7 @@ extern unsigned long __phys_addr(unsigned long); #ifdef CONFIG_FLATMEM #define pfn_valid(pfn)((pfn) < max_mapnr) +#define next_pfn_try(pfn) ((pfn)+1) #endif /* CONFIG_FLATMEM */ #ifdef CONFIG_X86_USE_3DNOW diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h index 320f7bb..02d82e5 100644 --- a/arch/x86/include/asm/page_64_types.h +++ b/arch/x86/include/asm/page_64_types.h @@ -69,7 +69,8 @@ extern void init_extra_mapping_wb(unsigned long phys, unsigned long size); #endif/* !__ASSEMBLY__ */ #ifdef CONFIG_FLATMEM -#define pfn_valid(pfn) ((pfn) < max_pfn) +#define pfn_valid(pfn) ((pfn) < max_pfn) +#define next_pfn_try(pfn) ((pfn)+1) #endif #endif /* 
_ASM_X86_PAGE_64_DEFS_H */ diff --git a/include/asm-generic/page.h b/include/asm-generic/page.h index 37d1fe2..316200d 100644 --- a/include/asm-generic/page.h +++ b/include/asm-generic/page.h @@ -91,6 +91,7 @@ extern unsigned long memory_end; #endif #define pfn_valid(pfn) ((pfn) >= ARCH_PFN_OFFSET && ((pfn) - ARCH_PFN_OFFSET) < max_mapnr) +#define next_pfn_try(pfn) ((pfn)+1) #define virt_addr_valid(kaddr) (((void *)(kaddr) >= (void *)PAGE_OFFSET) && \ ((void *)(kaddr) < (void *)memory_end)) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index f7d88ba..04d3c39 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -1166,6 +1166,12 @@ static inline int pfn_valid(unsigned long pfn) return 0; return valid_section(__nr_to_section(pfn_to_section_nr(pfn))); } + +static inline unsigned long next_pfn_try(unsigned long pfn) +{ + /* Skip entire section, because all of it is invalid. */ + return section_nr_to_pfn(pfn_to_section_nr(pfn) + 1); +} #endif static inline int pfn_present(unsigned long pfn) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 5b6b6b1..dd2af8b 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3798,8 +3798,11 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, * exist on hotplugged memory. */ if (context == MEMMAP_EARLY) { - if (!early_pfn_valid(pfn)) + if (!early_pfn_valid(pfn)) { + pfn = next_pfn_try(pfn); + pfn--; continue; + } if (!early_pfn_in_nid(pfn, nid)) continue; } -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/4] acpi,memory-hotplug : implement framework for hot removing memory
On 10/03/2012 05:52 PM, Yasuaki Ishimatsu wrote:
> We are trying to implement a physical memory hot removing function as
> following thread.
>
> https://lkml.org/lkml/2012/9/5/201
>
> But there is not enough review to merge into linux kernel.
>
> I think there are following blockades.
> 1. no physical memory hot removable system

Which kind of special machine supports physical memory hot-remove now?

> 2. huge patch-set
>
> If you have a KVM system, we can get rid of 1st blockade. Because
> applying following patch, we can create memory hot removable system
> on KVM guest.
>
> http://lists.gnu.org/archive/html/qemu-devel/2012-07/msg01389.html
>
> 2nd blockade is own problem. So we try to divide huge patch into
> a small patch in each function as follows:
>
> - bug fix
> - acpi framework
> - kernel core
>
> We had already sent bug fix patches.
> https://lkml.org/lkml/2012/9/27/39
> https://lkml.org/lkml/2012/10/2/83
>
> The patch-set implements a framework for hot removing memory.
>
> The memory device can be removed by 2 ways:
> 1. send eject request by SCI
> 2. echo 1 >/sys/bus/pci/devices/PNP0C80:XX/eject
>
> In the 1st case, acpi_memory_disable_device() will be called.
> In the 2nd case, acpi_memory_device_remove() will be called.
> acpi_memory_device_remove() will also be called when we unbind the
> memory device from the driver acpi_memhotplug.
>
> acpi_memory_disable_device() has already implemented a code which
> offlines memory and releases the acpi_memory_info struct. But
> acpi_memory_device_remove() has not implemented it yet.
>
> So the patch prepares the framework for hot removing memory and
> adds the framework into acpi_memory_device_remove(). And it prepares
> remove_memory(). But the function does nothing because we cannot
> support memory hot remove.
Re: [RFC v9 PATCH 16/21] memory-hotplug: free memmap of sparse-vmemmap
On 10/04/2012 02:26 PM, Yasuaki Ishimatsu wrote: Hi Chen, Sorry for late reply. 2012/10/02 13:21, Ni zhan Chen wrote: On 09/05/2012 05:25 PM, we...@cn.fujitsu.com wrote: From: Yasuaki Ishimatsu All pages of virtual mapping in removed memory cannot be freed, since some pages used as PGD/PUD include not only removed memory but also other memory. So the patch checks whether page can be freed or not. How to check whether page can be freed or not? 1. When removing memory, the page structs of the removed memory are filled with 0xFD. 2. When all page structs on a PT/PMD page are filled with 0xFD, the PT/PMD can be cleared. In this case, the page used as PT/PMD can be freed. Applying patch, __remove_section() of CONFIG_SPARSEMEM_VMEMMAP is integrated into one. So __remove_section() of CONFIG_SPARSEMEM_VMEMMAP is deleted. Note: vmemmap_kfree() and vmemmap_free_bootmem() are not implemented for ia64, ppc, s390, and sparc. CC: David Rientjes CC: Jiang Liu CC: Len Brown CC: Benjamin Herrenschmidt CC: Paul Mackerras CC: Christoph Lameter Cc: Minchan Kim CC: Andrew Morton CC: KOSAKI Motohiro CC: Wen Congyang Signed-off-by: Yasuaki Ishimatsu --- arch/ia64/mm/discontig.c |8 +++ arch/powerpc/mm/init_64.c |8 +++ arch/s390/mm/vmem.c |8 +++ arch/sparc/mm/init_64.c |8 +++ arch/x86/mm/init_64.c | 119 + include/linux/mm.h|2 + mm/memory_hotplug.c | 17 +-- mm/sparse.c |5 +- 8 files changed, 158 insertions(+), 17 deletions(-) diff --git a/arch/ia64/mm/discontig.c b/arch/ia64/mm/discontig.c index 33943db..0d23b69 100644 --- a/arch/ia64/mm/discontig.c +++ b/arch/ia64/mm/discontig.c @@ -823,6 +823,14 @@ int __meminit vmemmap_populate(struct page *start_page, return vmemmap_populate_basepages(start_page, size, node); } +void vmemmap_kfree(struct page *memmap, unsigned long nr_pages) +{ +} + +void vmemmap_free_bootmem(struct page *memmap, unsigned long nr_pages) +{ +} + void register_page_bootmem_memmap(unsigned long section_nr, struct page *start_page, unsigned long size) { diff --git
a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c index 3690c44..835a2b3 100644 --- a/arch/powerpc/mm/init_64.c +++ b/arch/powerpc/mm/init_64.c @@ -299,6 +299,14 @@ int __meminit vmemmap_populate(struct page *start_page, return 0; } +void vmemmap_kfree(struct page *memmap, unsigned long nr_pages) +{ +} + +void vmemmap_free_bootmem(struct page *memmap, unsigned long nr_pages) +{ +} + void register_page_bootmem_memmap(unsigned long section_nr, struct page *start_page, unsigned long size) { diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c index eda55cd..4b42b0b 100644 --- a/arch/s390/mm/vmem.c +++ b/arch/s390/mm/vmem.c @@ -227,6 +227,14 @@ out: return ret; } +void vmemmap_kfree(struct page *memmap, unsigned long nr_pages) +{ +} + +void vmemmap_free_bootmem(struct page *memmap, unsigned long nr_pages) +{ +} + void register_page_bootmem_memmap(unsigned long section_nr, struct page *start_page, unsigned long size) { diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c index add1cc7..1384826 100644 --- a/arch/sparc/mm/init_64.c +++ b/arch/sparc/mm/init_64.c @@ -2078,6 +2078,14 @@ void __meminit vmemmap_populate_print_last(void) } } +void vmemmap_kfree(struct page *memmap, unsigned long nr_pages) +{ +} + +void vmemmap_free_bootmem(struct page *memmap, unsigned long nr_pages) +{ +} + void register_page_bootmem_memmap(unsigned long section_nr, struct page *start_page, unsigned long size) { diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c index 0075592..4e8f8a4 100644 --- a/arch/x86/mm/init_64.c +++ b/arch/x86/mm/init_64.c @@ -1138,6 +1138,125 @@ vmemmap_populate(struct page *start_page, unsigned long size, int node) return 0; } +#define PAGE_INUSE 0xFD + +unsigned long find_and_clear_pte_page(unsigned long addr, unsigned long end, +struct page **pp, int *page_size) +{ +pgd_t *pgd; +pud_t *pud; +pmd_t *pmd; +pte_t *pte; +void *page_addr; +unsigned long next; + +*pp = NULL; + +pgd = pgd_offset_k(addr); +if (pgd_none(*pgd)) +return 
pgd_addr_end(addr, end); + +pud = pud_offset(pgd, addr); +if (pud_none(*pud)) +return pud_addr_end(addr, end); + +if (!cpu_has_pse) { +next = (addr + PAGE_SIZE) & PAGE_MASK; +pmd = pmd_offset(pud, addr); +if (pmd_none(*pmd)) +return next; + +pte = pte_offset_kernel(pmd, addr); +if (pte_none(*pte)) +return next; + +*page_size = PAGE_SIZE; +*pp = pte_page(*pte); +} else { +next = pmd_addr_end(addr, end); + +pmd = pmd_offset(pud, addr); +if (pmd_none(*pmd)) +return next; + +*page_size = PMD_
Re: [PATCH] mm: memmap_init_zone() performance improvement
On 10/03/2012 10:56 PM, Mike Yoknis wrote: memmap_init_zone() loops through every Page Frame Number (pfn), including pfn values that are within the gaps between existing memory sections. The unneeded looping will become a boot performance issue when machines configure larger memory ranges that will contain larger and more numerous gaps. The code will skip across invalid sections to reduce the number of loops executed. looks reasonable to me. Signed-off-by: Mike Yoknis mike.yok...@hp.com --- arch/x86/include/asm/mmzone_32.h |2 ++ arch/x86/include/asm/page_32.h |1 + arch/x86/include/asm/page_64_types.h |3 ++- include/asm-generic/page.h |1 + include/linux/mmzone.h |6 ++ mm/page_alloc.c |5 - 6 files changed, 16 insertions(+), 2 deletions(-) diff --git a/arch/x86/include/asm/mmzone_32.h b/arch/x86/include/asm/mmzone_32.h index eb05fb3..73c5c74 100644 --- a/arch/x86/include/asm/mmzone_32.h +++ b/arch/x86/include/asm/mmzone_32.h @@ -48,6 +48,8 @@ static inline int pfn_to_nid(unsigned long pfn) #endif } +#define next_pfn_try(pfn) ((pfn)+1) + static inline int pfn_valid(int pfn) { int nid = pfn_to_nid(pfn); diff --git a/arch/x86/include/asm/page_32.h b/arch/x86/include/asm/page_32.h index da4e762..e2c4cfc 100644 --- a/arch/x86/include/asm/page_32.h +++ b/arch/x86/include/asm/page_32.h @@ -19,6 +19,7 @@ extern unsigned long __phys_addr(unsigned long); #ifdef CONFIG_FLATMEM #define pfn_valid(pfn)((pfn) max_mapnr) +#define next_pfn_try(pfn) ((pfn)+1) #endif /* CONFIG_FLATMEM */ #ifdef CONFIG_X86_USE_3DNOW diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h index 320f7bb..02d82e5 100644 --- a/arch/x86/include/asm/page_64_types.h +++ b/arch/x86/include/asm/page_64_types.h @@ -69,7 +69,8 @@ extern void init_extra_mapping_wb(unsigned long phys, unsigned long size); #endif/* !__ASSEMBLY__ */ #ifdef CONFIG_FLATMEM -#define pfn_valid(pfn) ((pfn) max_pfn) +#define pfn_valid(pfn) ((pfn) max_pfn) +#define next_pfn_try(pfn) ((pfn)+1) #endif #endif /* 
_ASM_X86_PAGE_64_DEFS_H */ diff --git a/include/asm-generic/page.h b/include/asm-generic/page.h index 37d1fe2..316200d 100644 --- a/include/asm-generic/page.h +++ b/include/asm-generic/page.h @@ -91,6 +91,7 @@ extern unsigned long memory_end; #endif #define pfn_valid(pfn) ((pfn) >= ARCH_PFN_OFFSET && ((pfn) - ARCH_PFN_OFFSET) < max_mapnr) +#define next_pfn_try(pfn) ((pfn)+1) #define virt_addr_valid(kaddr) (((void *)(kaddr) >= (void *)PAGE_OFFSET) && \ ((void *)(kaddr) < (void *)memory_end)) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index f7d88ba..04d3c39 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -1166,6 +1166,12 @@ static inline int pfn_valid(unsigned long pfn) return 0; return valid_section(__nr_to_section(pfn_to_section_nr(pfn))); } + +static inline unsigned long next_pfn_try(unsigned long pfn) +{ + /* Skip entire section, because all of it is invalid. */ + return section_nr_to_pfn(pfn_to_section_nr(pfn) + 1); +} #endif static inline int pfn_present(unsigned long pfn) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 5b6b6b1..dd2af8b 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3798,8 +3798,11 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, * exist on hotplugged memory. */ if (context == MEMMAP_EARLY) { - if (!early_pfn_valid(pfn)) + if (!early_pfn_valid(pfn)) { + pfn = next_pfn_try(pfn); + pfn--; continue; + } if (!early_pfn_in_nid(pfn, nid)) continue; } -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] CPU hotplug, writeback: Don't call writeback_set_ratelimit() too often during hotplug
On 09/28/2012 08:27 PM, Fengguang Wu wrote: On Tue, Sep 25, 2012 at 02:18:20AM +0530, Srivatsa S. Bhat wrote: From: Srivatsa S. Bhat The CPU hotplug callback related to writeback calls writeback_set_ratelimit() during every state change in the hotplug sequence. This is unnecessary since num_online_cpus() changes only once during the entire hotplug operation. So invoke the function only once per hotplug, thereby avoiding the unnecessary repetition of those costly calculations. Signed-off-by: Srivatsa S. Bhat --- Looks good to me. I'll include it in the writeback tree. Hi Fengguang, Could you tell me when inode->i_state & I_DIRTY will be set? thanks. Regards, Chen Thanks, Fengguang -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majord...@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: em...@kvack.org
Re: [PATCH 0/2] memory-hotplug : notification of memory block's state
On 10/03/2012 09:21 AM, Yasuaki Ishimatsu wrote: Hi Andrew, 2012/10/03 6:42, Andrew Morton wrote: On Tue, 2 Oct 2012 17:25:06 +0900 Yasuaki Ishimatsu wrote: remove_memory() offlines memory. And it is called in the following two cases: 1. echo offline >/sys/devices/system/memory/memoryXX/state 2. hot remove a memory device In the 1st case, the memory block's state is changed and the notification that the memory block's state changed is sent to userland after calling offline_memory(). So the user can notice the memory block is changed. But in the 2nd case, the memory block's state is not changed and the notification is not sent to userspace either, even if offline_memory() is called. So the user cannot notice the memory block is changed. We should also notify userspace in the 2nd case. These two little patches look reasonable to me. There's a lot of recent activity with memory hotplug! We're in the 3.7 merge window now so it is not a good time to be merging new material. Also there appear to be two teams working on it and it's unclear to me how well coordinated this work is? As you know, there are two teams developing memory hotplug. - Wen's patch-set https://lkml.org/lkml/2012/9/5/201 - Lai's patch-set https://lkml.org/lkml/2012/9/10/180 Wen's patch-set is for removing physical memory. Now, I'm splitting the patch-set to make it easier to review. If the patch-set is merged into the linux kernel, I believe that linux on x86 can hot remove a physical memory device. But it is not enough, since we cannot remove memory which contains kernel memory. To guarantee memory hot-remove, the memory must belong to ZONE_MOVABLE. So Lai's patch-set tries to create a movable node whose memory all belongs to ZONE_MOVABLE. I think there are two chances for creating the movable node. - boot time - after hot add memory - boot time For creating movable memory, linux has two kernel parameters (kernelcore and movablecore). 
But it is not enough, since even if we set the kernel parameter, the movable memory is distributed evenly in each node. So we introduce the kernelcore_max_addr boot parameter. The parameter limits the range of the memory used as kernel memory. For example, the system has the following nodes. node0 : 0x4000 - 0x8000 node1 : 0x8000 - 0xc000 And when we want to hot remove node1, we set "kernelcore_max_addr=0x8000". In doing so, kernel memory is limited within 0x8000 and node1's memory belongs to ZONE_MOVABLE. As a result, we can guarantee that node1 is a movable node and we can always hot remove node1. - after hot add memory When hot adding memory, the memory belongs to ZONE_NORMAL and is offline. If we online the memory, the memory may have kernel memory. In this case, we cannot hot remove the memory. So we introduce the online_movable function. If we use the function as follows, the memory belongs to ZONE_MOVABLE. echo online_movable > /sys/devices/system/node/nodeX/memoryX/state So when a new node is hot added and I echo "online_movable" to all hot added memory, the node's memory belongs to ZONE_MOVABLE. As a result, we can guarantee that the node is a movable node and we can always hot remove the node. Hi Yasuaki, Can kernel memory be allocated from ZONE_MOVABLE this time? # I hope to help your understanding about our works by the information. Thanks, Yasuaki Ishimatsu However these two patches are pretty simple and do fix a problem, so I added them to the 3.7 MM queue.
Re: [RFC v9 PATCH 13/21] memory-hotplug: check page type in get_page_bootmem
On 10/01/2012 11:03 AM, Yasuaki Ishimatsu wrote: Hi Chen, 2012/09/29 11:15, Ni zhan Chen wrote: On 09/05/2012 05:25 PM, we...@cn.fujitsu.com wrote: From: Yasuaki Ishimatsu The function get_page_bootmem() may be called more than once for the same page. There is no need to set the page's type and private fields if the function is not called on the page for the first time. Note: the patch is just an optimization and does not fix any problem. Hi Yasuaki, this patch is reasonable to me. I have another question associated with get_page_bootmem(); the question is from another Fujitsu guy's patch changelog [commit : 04753278769f3], which said that: 1) When the memmap of a removing section is allocated on another section by bootmem, it should/can be freed. 2) When the memmap of a removing section is allocated on the same section, it shouldn't be freed. Because the section has to be logical-memory-offlined already and all pages must be isolated against the page allocator. If it is freed, the page allocator may use it, and it will be removed physically soon. But I don't see his patch guarantee 2); it means that his patch doesn't guarantee that the memmap of a removing section which is allocated on another section by bootmem is not freed. I hope to get your explanation in detail, thanks in advance. :-) In my understanding, the patch does not guarantee it. Please see [commit : 0c0a4a517a31e]. free_map_bootmem() in the commit guarantees it. Thanks Yasuaki, I have already seen the commit you mentioned. But about the changelog of the commit where I point out 2): why does it say "If it is freed, page allocator may use it which will be removed physically soon"? Does it mean use-after-free? AFAIK, the isolated pages will be freed if no users use them, so why not free the associated memmap? 
Thanks, Yasuaki Ishimatsu CC: David Rientjes CC: Jiang Liu CC: Len Brown CC: Benjamin Herrenschmidt CC: Paul Mackerras CC: Christoph Lameter Cc: Minchan Kim CC: Andrew Morton CC: KOSAKI Motohiro CC: Wen Congyang Signed-off-by: Yasuaki Ishimatsu --- mm/memory_hotplug.c | 15 +++ 1 files changed, 11 insertions(+), 4 deletions(-) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index d736df3..26a5012 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -95,10 +95,17 @@ static void release_memory_resource(struct resource *res) static void get_page_bootmem(unsigned long info, struct page *page, unsigned long type) { -page->lru.next = (struct list_head *) type; -SetPagePrivate(page); -set_page_private(page, info); -atomic_inc(&page->_count); +unsigned long page_type; + +page_type = (unsigned long)page->lru.next; +if (page_type < MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE || +page_type > MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE){ +page->lru.next = (struct list_head *)type; +SetPagePrivate(page); +set_page_private(page, info); +atomic_inc(&page->_count); +} else +atomic_inc(&page->_count); } /* reference to __meminit __free_pages_bootmem is valid
Re: [PATCH 0/2] memory-hotplug : notification of memory block's state
On 10/02/2012 04:25 PM, Yasuaki Ishimatsu wrote: > We are trying to implement a physical memory hot removing function as > following thread. > > https://lkml.org/lkml/2012/9/5/201 > > But there is not enough review to merge into linux kernel. > > I think there are following blockades. > 1. no physical memory hot removable system > 2. huge patch-set > > If you have a KVM system, we can get rid of 1st blockade. Because > applying following patch, we can create memory hot removable system > on KVM guest. > > http://lists.gnu.org/archive/html/qemu-devel/2012-07/msg01389.html > > 2nd blockade is own problem. So we try to divide huge patch into > a small patch in each function as follows: > > - bug fix > - acpi framework > - kernel core > > We had already sent bug fix patches. > > https://lkml.org/lkml/2012/9/27/39 > > And the patch fixes following bug. > > remove_memory() offlines memory. And it is called by following two cases: > > 1. echo offline >/sys/devices/system/memory/memoryXX/state > 2. hot remove a memory device > > In the 1st case, the memory block's state is changed and the notification > that memory block's state changed is sent to userland after calling > offline_memory(). So user can notice memory block is changed. Hi Yasuaki, Thanks for splitting the patchset, it's easier to review this time. One question: how is userspace notified? Do you mean the function node_memory_callback? But this function basically does nothing. > > But in the 2nd case, the memory block's state is not changed and the > notification is not also sent to userspace even if calling offline_memory(). > So user cannot notice memory block is changed. > > We should also notify userspace in the 2nd case.
Re: [RFC v9 PATCH 16/21] memory-hotplug: free memmap of sparse-vmemmap
On 09/05/2012 05:25 PM, we...@cn.fujitsu.com wrote: From: Yasuaki Ishimatsu All pages of virtual mapping in removed memory cannot be freed, since some pages used as PGD/PUD include not only removed memory but also other memory. So the patch checks whether a page can be freed or not. How to check whether a page can be freed or not? 1. When removing memory, the page structs of the removed memory are filled with 0xFD. 2. When all page structs on a PT/PMD page are filled with 0xFD, the PT/PMD can be cleared. In this case, the page used as PT/PMD can be freed. Applying the patch, __remove_section() of CONFIG_SPARSEMEM_VMEMMAP is integrated into one. So __remove_section() of CONFIG_SPARSEMEM_VMEMMAP is deleted. Note: vmemmap_kfree() and vmemmap_free_bootmem() are not implemented for ia64, ppc, s390, and sparc. CC: David Rientjes CC: Jiang Liu CC: Len Brown CC: Benjamin Herrenschmidt CC: Paul Mackerras CC: Christoph Lameter Cc: Minchan Kim CC: Andrew Morton CC: KOSAKI Motohiro CC: Wen Congyang Signed-off-by: Yasuaki Ishimatsu --- arch/ia64/mm/discontig.c |8 +++ arch/powerpc/mm/init_64.c |8 +++ arch/s390/mm/vmem.c |8 +++ arch/sparc/mm/init_64.c |8 +++ arch/x86/mm/init_64.c | 119 + include/linux/mm.h|2 + mm/memory_hotplug.c | 17 +-- mm/sparse.c |5 +- 8 files changed, 158 insertions(+), 17 deletions(-) diff --git a/arch/ia64/mm/discontig.c b/arch/ia64/mm/discontig.c index 33943db..0d23b69 100644 --- a/arch/ia64/mm/discontig.c +++ b/arch/ia64/mm/discontig.c @@ -823,6 +823,14 @@ int __meminit vmemmap_populate(struct page *start_page, return vmemmap_populate_basepages(start_page, size, node); } +void vmemmap_kfree(struct page *memmap, unsigned long nr_pages) +{ +} + +void vmemmap_free_bootmem(struct page *memmap, unsigned long nr_pages) +{ +} + void register_page_bootmem_memmap(unsigned long section_nr, struct page *start_page, unsigned long size) { diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c index 3690c44..835a2b3 100644 --- a/arch/powerpc/mm/init_64.c +++ 
b/arch/powerpc/mm/init_64.c @@ -299,6 +299,14 @@ int __meminit vmemmap_populate(struct page *start_page, return 0; } +void vmemmap_kfree(struct page *memmap, unsigned long nr_pages) +{ +} + +void vmemmap_free_bootmem(struct page *memmap, unsigned long nr_pages) +{ +} + void register_page_bootmem_memmap(unsigned long section_nr, struct page *start_page, unsigned long size) { diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c index eda55cd..4b42b0b 100644 --- a/arch/s390/mm/vmem.c +++ b/arch/s390/mm/vmem.c @@ -227,6 +227,14 @@ out: return ret; } +void vmemmap_kfree(struct page *memmap, unsigned long nr_pages) +{ +} + +void vmemmap_free_bootmem(struct page *memmap, unsigned long nr_pages) +{ +} + void register_page_bootmem_memmap(unsigned long section_nr, struct page *start_page, unsigned long size) { diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c index add1cc7..1384826 100644 --- a/arch/sparc/mm/init_64.c +++ b/arch/sparc/mm/init_64.c @@ -2078,6 +2078,14 @@ void __meminit vmemmap_populate_print_last(void) } } +void vmemmap_kfree(struct page *memmap, unsigned long nr_pages) +{ +} + +void vmemmap_free_bootmem(struct page *memmap, unsigned long nr_pages) +{ +} + void register_page_bootmem_memmap(unsigned long section_nr, struct page *start_page, unsigned long size) { diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c index 0075592..4e8f8a4 100644 --- a/arch/x86/mm/init_64.c +++ b/arch/x86/mm/init_64.c @@ -1138,6 +1138,125 @@ vmemmap_populate(struct page *start_page, unsigned long size, int node) return 0; } +#define PAGE_INUSE 0xFD + +unsigned long find_and_clear_pte_page(unsigned long addr, unsigned long end, + struct page **pp, int *page_size) +{ + pgd_t *pgd; + pud_t *pud; + pmd_t *pmd; + pte_t *pte; + void *page_addr; + unsigned long next; + + *pp = NULL; + + pgd = pgd_offset_k(addr); + if (pgd_none(*pgd)) + return pgd_addr_end(addr, end); + + pud = pud_offset(pgd, addr); + if (pud_none(*pud)) + return pud_addr_end(addr, end); + + if 
(!cpu_has_pse) { + next = (addr + PAGE_SIZE) & PAGE_MASK; + pmd = pmd_offset(pud, addr); + if (pmd_none(*pmd)) + return next; + + pte = pte_offset_kernel(pmd, addr); + if (pte_none(*pte)) + return next; + + *page_size = PAGE_SIZE; + *pp = pte_page(*pte); + } else { + next = pmd_addr_end(addr, end); + + pmd = pmd_offset(pud,
Re: [RFC v9 PATCH 06/21] memory-hotplug: export the function acpi_bus_remove()
On 09/05/2012 05:25 PM, we...@cn.fujitsu.com wrote: From: Wen Congyang The function acpi_bus_remove() can remove a acpi device from acpi device. IIUC, s/acpi device/acpi bus When a acpi device is removed, we need to call this function to remove the acpi device from acpi bus. So export this function. CC: David Rientjes CC: Jiang Liu CC: Len Brown CC: Benjamin Herrenschmidt CC: Paul Mackerras CC: Christoph Lameter Cc: Minchan Kim CC: Andrew Morton CC: KOSAKI Motohiro CC: Yasuaki Ishimatsu Signed-off-by: Wen Congyang --- drivers/acpi/scan.c |3 ++- include/acpi/acpi_bus.h |1 + 2 files changed, 3 insertions(+), 1 deletions(-) diff --git a/drivers/acpi/scan.c b/drivers/acpi/scan.c index d1ecca2..1cefc34 100644 --- a/drivers/acpi/scan.c +++ b/drivers/acpi/scan.c @@ -1224,7 +1224,7 @@ static int acpi_device_set_context(struct acpi_device *device) return -ENODEV; } -static int acpi_bus_remove(struct acpi_device *dev, int rmdevice) +int acpi_bus_remove(struct acpi_device *dev, int rmdevice) { if (!dev) return -EINVAL; @@ -1246,6 +1246,7 @@ static int acpi_bus_remove(struct acpi_device *dev, int rmdevice) return 0; } +EXPORT_SYMBOL(acpi_bus_remove); static int acpi_add_single_object(struct acpi_device **child, acpi_handle handle, int type, diff --git a/include/acpi/acpi_bus.h b/include/acpi/acpi_bus.h index bde976e..2ccf109 100644 --- a/include/acpi/acpi_bus.h +++ b/include/acpi/acpi_bus.h @@ -360,6 +360,7 @@ bool acpi_bus_power_manageable(acpi_handle handle); bool acpi_bus_can_wakeup(acpi_handle handle); int acpi_power_resource_register_device(struct device *dev, acpi_handle handle); void acpi_power_resource_unregister_device(struct device *dev, acpi_handle handle); +int acpi_bus_remove(struct acpi_device *dev, int rmdevice); #ifdef CONFIG_ACPI_PROC_EVENT int acpi_bus_generate_proc_event(struct acpi_device *device, u8 type, int data); int acpi_bus_generate_proc_event4(const char *class, const char *bid, u8 type, int data);
Re: [RFC v9 PATCH 00/21] memory-hotplug: hot-remove physical memory
On 10/01/2012 12:44 PM, Yasuaki Ishimatsu wrote: Hi Chen, 2012/09/29 17:19, Ni zhan Chen wrote: On 09/05/2012 05:25 PM, we...@cn.fujitsu.com wrote: From: Wen Congyang This patch series aims to support physical memory hot-remove. The patches can free/remove the following things: - acpi_memory_info : [RFC PATCH 4/19] - /sys/firmware/memmap/X/{end, start, type} : [RFC PATCH 8/19] - iomem_resource: [RFC PATCH 9/19] - mem_section and related sysfs files : [RFC PATCH 10-11, 13-16/19] - page table of removed memory : [RFC PATCH 12/19] - node and related sysfs files : [RFC PATCH 18-19/19] If you find lack of function for physical memory hot-remove, please let me know. How to test this patchset? 1. apply this patchset and build the kernel. MEMORY_HOTPLUG, MEMORY_HOTREMOVE, ACPI_HOTPLUG_MEMORY must be selected. 2. load the module acpi_memhotplug Hi Yasuaki, where is the acpi_memhotplug module? If you build acpi_memhotplug as a module, it is created under the /lib/modules//driver/acpi/ directory. It depends on the config ACPI_HOTPLUG_MEMORY. If the config is [*], it becomes a built-in function, so you don't need to care about it. Thanks, Yasuaki Ishimatsu Hi Yasuaki, I built the kernel with MEMORY_HOTPLUG, MEMORY_HOTREMOVE, ACPI_HOTPLUG_MEMORY selected as [*], but I can't find PNP0C80:XX under the directory /sys/bus/acpi/devices/. 
[root@localhost ~]# ls /sys/bus/acpi/devices/ device:00 device:07 device:0e device:15 device:1c device:23 device:2a LNXCPU:00 LNXCPU:07 PNP0501:00 PNP0C02:00 PNP0C0F:02 PNP0C14:01 device:01 device:08 device:0f device:16 device:1d device:24 device:2b LNXCPU:01 LNXPWRBN:00 PNP0800:00 PNP0C02:01 PNP0C0F:03 PNP0C31:00 device:02 device:09 device:10 device:17 device:1e device:25 device:2c LNXCPU:02 LNXSYSTM:00 PNP0A08:00 PNP0C02:02 PNP0C0F:04 device:03 device:0a device:11 device:18 device:1f device:26 device:2d LNXCPU:03 PNP:00 PNP0B00:00 PNP0C04:00 PNP0C0F:05 device:04 device:0b device:12 device:19 device:20 device:27 device:2e LNXCPU:04 PNP0100:00 PNP0C01:00 PNP0C0C:00 PNP0C0F:06 device:05 device:0c device:13 device:1a device:21 device:28 device:2f LNXCPU:05 PNP0103:00 PNP0C01:01 PNP0C0F:00 PNP0C0F:07 device:06 device:0d device:14 device:1b device:22 device:29 INT3F0D:00 LNXCPU:06 PNP0200:00 PNP0C01:02 PNP0C0F:01 PNP0C14:00 then what am I missing? Thanks. 3. hotplug the memory device(it depends on your hardware) You will see the memory device under the directory /sys/bus/acpi/devices/. Its name is PNP0C80:XX. 4. online/offline pages provided by this memory device You can write online/offline to /sys/devices/system/memory/memoryX/state to online/offline pages provided by this memory device 5. hotremove the memory device You can hotremove the memory device by the hardware, or by writing 1 to /sys/bus/acpi/devices/PNP0C80:XX/eject. Note: if the memory provided by the memory device is used by the kernel, it can't be offlined. It is not a bug. Known problems: 1. memory can't be offlined when CONFIG_MEMCG is selected. For example: there is a memory device on node 1. The address range is [1G, 1.5G). You will find 4 new directories memory8, memory9, memory10, and memory11 under the directory /sys/devices/system/memory/. If CONFIG_MEMCG is selected, we will allocate memory to store page cgroup when we online pages. 
When we online memory8, the memory that stores the page cgroup is not provided by this memory device. But when we online memory9, the memory that stores the page cgroup may be provided by memory8. So we can't offline memory8 now. We should offline the memory in the reverse order. When the memory device is hotremoved, we will auto offline memory provided by this memory device. But we don't know which memory is onlined first, so offlining memory may fail. In that case, you should offline the memory by hand before hotremoving the memory device. 2. hotremoving a memory device may cause a kernel panic This bug will be fixed by Liu Jiang's patch: https://lkml.org/lkml/2012/7/3/1 change log of v9: [RFC PATCH v9 8/21] * add a lock to protect the list map_entries * add an indicator to firmware_map_entry to remember whether the memory is allocated from bootmem [RFC PATCH v9 10/21] * change the macro to an inline function [RFC PATCH v9 19/21] * don't offline the node if a cpu on the node is onlined [RFC PATCH v9 21/21] * create new patch: auto offline page_cgroup when onlining a memory block failed change log of v8: [RFC PATCH v8 17/20] * fix problems when one node's range includes the other nodes [RFC PATCH v8 18/20] * fix a build error when CONFIG_MEMORY_HOTPLUG_SPARSE or CONFIG_HUGETLBFS is not defined. [RFC PATCH v8 19/20] * don't offline
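The test procedure above (steps 3-5) boils down to a few sysfs writes; a minimal sketch, assuming a hotplugged device enumerated as PNP0C80:00 that backs memory block memory8 (both names are illustrative — your system will differ):

```shell
# Check that the ACPI memory device was enumerated
ls /sys/bus/acpi/devices/ | grep PNP0C80

# Online the pages backed by the device, one memory block at a time
echo online > /sys/devices/system/memory/memory8/state

# Offline in reverse order before removal (see the CONFIG_MEMCG caveat above)
echo offline > /sys/devices/system/memory/memory8/state

# Finally eject the device through ACPI
echo 1 > /sys/bus/acpi/devices/PNP0C80:00/eject
```

These writes only work on a kernel built with the config options listed in step 1, and offlining fails (by design, not a bug) if the kernel itself uses pages in the block.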
Re: [RFC v9 PATCH 06/21] memory-hotplug: export the function acpi_bus_remove()
On 09/05/2012 05:25 PM, we...@cn.fujitsu.com wrote: From: Wen Congyang we...@cn.fujitsu.com The function acpi_bus_remove() can remove a acpi device from acpi device. IIUC, s/acpi device/acpi bus When a acpi device is removed, we need to call this function to remove the acpi device from acpi bus. So export this function. CC: David Rientjes rient...@google.com CC: Jiang Liu liu...@gmail.com CC: Len Brown len.br...@intel.com CC: Benjamin Herrenschmidt b...@kernel.crashing.org CC: Paul Mackerras pau...@samba.org CC: Christoph Lameter c...@linux.com Cc: Minchan Kim minchan@gmail.com CC: Andrew Morton a...@linux-foundation.org CC: KOSAKI Motohiro kosaki.motoh...@jp.fujitsu.com CC: Yasuaki Ishimatsu isimatu.yasu...@jp.fujitsu.com Signed-off-by: Wen Congyang we...@cn.fujitsu.com --- drivers/acpi/scan.c |3 ++- include/acpi/acpi_bus.h |1 + 2 files changed, 3 insertions(+), 1 deletions(-) diff --git a/drivers/acpi/scan.c b/drivers/acpi/scan.c index d1ecca2..1cefc34 100644 --- a/drivers/acpi/scan.c +++ b/drivers/acpi/scan.c @@ -1224,7 +1224,7 @@ static int acpi_device_set_context(struct acpi_device *device) return -ENODEV; } -static int acpi_bus_remove(struct acpi_device *dev, int rmdevice) +int acpi_bus_remove(struct acpi_device *dev, int rmdevice) { if (!dev) return -EINVAL; @@ -1246,6 +1246,7 @@ static int acpi_bus_remove(struct acpi_device *dev, int rmdevice) return 0; } +EXPORT_SYMBOL(acpi_bus_remove); static int acpi_add_single_object(struct acpi_device **child, acpi_handle handle, int type, diff --git a/include/acpi/acpi_bus.h b/include/acpi/acpi_bus.h index bde976e..2ccf109 100644 --- a/include/acpi/acpi_bus.h +++ b/include/acpi/acpi_bus.h @@ -360,6 +360,7 @@ bool acpi_bus_power_manageable(acpi_handle handle); bool acpi_bus_can_wakeup(acpi_handle handle); int acpi_power_resource_register_device(struct device *dev, acpi_handle handle); void acpi_power_resource_unregister_device(struct device *dev, acpi_handle handle); +int acpi_bus_remove(struct acpi_device *dev, int 
rmdevice); #ifdef CONFIG_ACPI_PROC_EVENT int acpi_bus_generate_proc_event(struct acpi_device *device, u8 type, int data); int acpi_bus_generate_proc_event4(const char *class, const char *bid, u8 type, int data);
Re: [RFC v9 PATCH 16/21] memory-hotplug: free memmap of sparse-vmemmap
On 09/05/2012 05:25 PM, we...@cn.fujitsu.com wrote: From: Yasuaki Ishimatsu isimatu.yasu...@jp.fujitsu.com Not all pages of the virtual mapping in removed memory can be freed, since some pages used as PGD/PUD include not only removed memory but also other memory. So the patch checks whether a page can be freed or not. How to check whether a page can be freed or not? 1. When removing memory, the page structs of the removed memory are filled with 0xFD. 2. When all page structs on a PT/PMD page are filled with 0xFD, the PT/PMD can be cleared; in this case, the page used as the PT/PMD can be freed. With the patch applied, the __remove_section() variants are integrated into one, so __remove_section() for CONFIG_SPARSEMEM_VMEMMAP is deleted. Note: vmemmap_kfree() and vmemmap_free_bootmem() are not implemented for ia64, ppc, s390, and sparc. CC: David Rientjes rient...@google.com CC: Jiang Liu liu...@gmail.com CC: Len Brown len.br...@intel.com CC: Benjamin Herrenschmidt b...@kernel.crashing.org CC: Paul Mackerras pau...@samba.org CC: Christoph Lameter c...@linux.com Cc: Minchan Kim minchan@gmail.com CC: Andrew Morton a...@linux-foundation.org CC: KOSAKI Motohiro kosaki.motoh...@jp.fujitsu.com CC: Wen Congyang we...@cn.fujitsu.com Signed-off-by: Yasuaki Ishimatsu isimatu.yasu...@jp.fujitsu.com --- arch/ia64/mm/discontig.c |8 +++ arch/powerpc/mm/init_64.c |8 +++ arch/s390/mm/vmem.c |8 +++ arch/sparc/mm/init_64.c |8 +++ arch/x86/mm/init_64.c | 119 + include/linux/mm.h|2 + mm/memory_hotplug.c | 17 +-- mm/sparse.c |5 +- 8 files changed, 158 insertions(+), 17 deletions(-) diff --git a/arch/ia64/mm/discontig.c b/arch/ia64/mm/discontig.c index 33943db..0d23b69 100644 --- a/arch/ia64/mm/discontig.c +++ b/arch/ia64/mm/discontig.c @@ -823,6 +823,14 @@ int __meminit vmemmap_populate(struct page *start_page, return vmemmap_populate_basepages(start_page, size, node); } +void vmemmap_kfree(struct page *memmap, unsigned long nr_pages) +{ +} + +void vmemmap_free_bootmem(struct page *memmap, unsigned long 
nr_pages) +{ +} + void register_page_bootmem_memmap(unsigned long section_nr, struct page *start_page, unsigned long size) { diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c index 3690c44..835a2b3 100644 --- a/arch/powerpc/mm/init_64.c +++ b/arch/powerpc/mm/init_64.c @@ -299,6 +299,14 @@ int __meminit vmemmap_populate(struct page *start_page, return 0; } +void vmemmap_kfree(struct page *memmap, unsigned long nr_pages) +{ +} + +void vmemmap_free_bootmem(struct page *memmap, unsigned long nr_pages) +{ +} + void register_page_bootmem_memmap(unsigned long section_nr, struct page *start_page, unsigned long size) { diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c index eda55cd..4b42b0b 100644 --- a/arch/s390/mm/vmem.c +++ b/arch/s390/mm/vmem.c @@ -227,6 +227,14 @@ out: return ret; } +void vmemmap_kfree(struct page *memmap, unsigned long nr_pages) +{ +} + +void vmemmap_free_bootmem(struct page *memmap, unsigned long nr_pages) +{ +} + void register_page_bootmem_memmap(unsigned long section_nr, struct page *start_page, unsigned long size) { diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c index add1cc7..1384826 100644 --- a/arch/sparc/mm/init_64.c +++ b/arch/sparc/mm/init_64.c @@ -2078,6 +2078,14 @@ void __meminit vmemmap_populate_print_last(void) } } +void vmemmap_kfree(struct page *memmap, unsigned long nr_pages) +{ +} + +void vmemmap_free_bootmem(struct page *memmap, unsigned long nr_pages) +{ +} + void register_page_bootmem_memmap(unsigned long section_nr, struct page *start_page, unsigned long size) { diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c index 0075592..4e8f8a4 100644 --- a/arch/x86/mm/init_64.c +++ b/arch/x86/mm/init_64.c @@ -1138,6 +1138,125 @@ vmemmap_populate(struct page *start_page, unsigned long size, int node) return 0; } +#define PAGE_INUSE 0xFD + +unsigned long find_and_clear_pte_page(unsigned long addr, unsigned long end, + struct page **pp, int *page_size) +{ + pgd_t *pgd; + pud_t *pud; + pmd_t *pmd; 
+ pte_t *pte; + void *page_addr; + unsigned long next; + + *pp = NULL; + + pgd = pgd_offset_k(addr); + if (pgd_none(*pgd)) + return pgd_addr_end(addr, end); + + pud = pud_offset(pgd, addr); + if (pud_none(*pud)) + return pud_addr_end(addr, end); + + if (!cpu_has_pse) { + next = (addr + PAGE_SIZE) & PAGE_MASK; + pmd = pmd_offset(pud, addr); + if (pmd_none(*pmd)) + return next; + + pte = pte_offset_kernel(pmd, addr);
Re: [RFC v9 PATCH 00/21] memory-hotplug: hot-remove physical memory
On 09/05/2012 05:25 PM, we...@cn.fujitsu.com wrote: From: Wen Congyang This patch series aims to support physical memory hot-remove. The patches can free/remove the following things: - acpi_memory_info : [RFC PATCH 4/19] - /sys/firmware/memmap/X/{end, start, type} : [RFC PATCH 8/19] - iomem_resource: [RFC PATCH 9/19] - mem_section and related sysfs files : [RFC PATCH 10-11, 13-16/19] - page table of removed memory : [RFC PATCH 12/19] - node and related sysfs files : [RFC PATCH 18-19/19] If you find any missing function for physical memory hot-remove, please let me know. Since the patchset is big, could you add more changelog describing how this patchset works, so that it is easier to review? How to test this patchset? 1. apply this patchset and build the kernel. MEMORY_HOTPLUG, MEMORY_HOTREMOVE, ACPI_HOTPLUG_MEMORY must be selected. 2. load the module acpi_memhotplug 3. hotplug the memory device (it depends on your hardware) You will see the memory device under the directory /sys/bus/acpi/devices/. Its name is PNP0C80:XX. 4. online/offline pages provided by this memory device You can write online/offline to /sys/devices/system/memory/memoryX/state to online/offline pages provided by this memory device 5. hotremove the memory device You can hotremove the memory device by the hardware, or by writing 1 to /sys/bus/acpi/devices/PNP0C80:XX/eject. Note: if the memory provided by the memory device is used by the kernel, it can't be offlined. It is not a bug. Known problems: 1. memory can't be offlined when CONFIG_MEMCG is selected. For example: there is a memory device on node 1. The address range is [1G, 1.5G). You will find 4 new directories memory8, memory9, memory10, and memory11 under the directory /sys/devices/system/memory/. If CONFIG_MEMCG is selected, we will allocate memory to store the page cgroup when we online pages. When we online memory8, the memory that stores the page cgroup is not provided by this memory device. 
But when we online memory9, the memory that stores the page cgroup may be provided by memory8. So we can't offline memory8 now. We should offline the memory in the reverse order. When the memory device is hotremoved, we will auto offline memory provided by this memory device. But we don't know which memory is onlined first, so offlining memory may fail. In that case, you should offline the memory by hand before hotremoving the memory device. 2. hotremoving a memory device may cause a kernel panic This bug will be fixed by Liu Jiang's patch: https://lkml.org/lkml/2012/7/3/1 change log of v9: [RFC PATCH v9 8/21] * add a lock to protect the list map_entries * add an indicator to firmware_map_entry to remember whether the memory is allocated from bootmem [RFC PATCH v9 10/21] * change the macro to an inline function [RFC PATCH v9 19/21] * don't offline the node if a cpu on the node is onlined [RFC PATCH v9 21/21] * create new patch: auto offline page_cgroup when onlining a memory block failed change log of v8: [RFC PATCH v8 17/20] * fix problems when one node's range includes the other nodes [RFC PATCH v8 18/20] * fix a build error when CONFIG_MEMORY_HOTPLUG_SPARSE or CONFIG_HUGETLBFS is not defined. [RFC PATCH v8 19/20] * don't offline the node when some memory sections are not removed [RFC PATCH v8 20/20] * create new patch: clear hwpoisoned flag when onlining pages change log of v7: [RFC PATCH v7 4/19] * do not continue if acpi_memory_device_remove_memory() fails. [RFC PATCH v7 15/19] * handle usemap in register_page_bootmem_info_section() too. change log of v6: [RFC PATCH v6 12/19] * fix build errors on architectures other than x86 [RFC PATCH v6 15-16/19] * fix build errors on architectures other than x86 change log of v5: * merge the patchset to clear page tables and the patchset to hot remove memory (from Ishimatsu) into one big patchset. [RFC PATCH v5 1/19] * rename remove_memory() to offline_memory()/offline_pages() [RFC PATCH v5 2/19] * new patch: implement offline_memory(). 
This function offlines pages, updates the memory block's state, and notifies userspace that the memory block's state has changed. [RFC PATCH v5 4/19] * offline and remove memory in acpi_memory_disable_device() too. [RFC PATCH v5 17/19] * new patch: add a new function __remove_zone() to revert the things done in the function __add_zone(). [RFC PATCH v5 18/19] * flush work before resetting the node device. change log of v4: * remove "memory-hotplug : unify argument of firmware_map_add_early/hotplug" from the patch series, since the patch is a bugfix. It is being discussed in another thread. But for testing the patch series, the patch is needed. So I added the patch as
Re: [PATCH 0/4] bugfix for memory hotplug
On 09/27/2012 01:45 PM, we...@cn.fujitsu.com wrote: From: Wen Congyang Wen Congyang (2): memory-hotplug: clear hwpoisoned flag when onlining pages memory-hotplug: auto offline page_cgroup when onlining memory block failed Again, you should explain that these two patches are the new version of memory-hotplug: hot-remove physical memory [20/21,21/21] Yasuaki Ishimatsu (2): memory-hotplug: add memory_block_release memory-hotplug: add node_device_release drivers/base/memory.c |9 - drivers/base/node.c | 11 +++ mm/memory_hotplug.c |8 mm/page_cgroup.c |3 +++ 4 files changed, 30 insertions(+), 1 deletions(-)
Re: [RFC v9 PATCH 13/21] memory-hotplug: check page type in get_page_bootmem
On 09/05/2012 05:25 PM, we...@cn.fujitsu.com wrote: From: Yasuaki Ishimatsu The function get_page_bootmem() may be called more than once on the same page. There is no need to set the page's type and private fields if the function has already been called on the page. Note: the patch is just an optimization and does not fix any problem. Hi Yasuaki, this patch is reasonable to me. I have another question associated with get_page_bootmem(); the question is from another Fujitsu engineer's patch changelog [commit 04753278769f3], which said: 1) When the memmap of a removed section is allocated on another section by bootmem, it should/can be freed. 2) When the memmap of a removed section is allocated on the same section, it shouldn't be freed, because the section has to be logically memory-offlined already and all pages must be isolated against the page allocator. If it is freed, the page allocator may use the page, which will be removed physically soon. But I don't see his patch guarantee 2); that is, his patch doesn't seem to guarantee that the memmap of a removed section, allocated on another section by bootmem, is not freed. Hoping to get your explanation in detail; thanks in advance. 
:-) CC: David Rientjes CC: Jiang Liu CC: Len Brown CC: Benjamin Herrenschmidt CC: Paul Mackerras CC: Christoph Lameter Cc: Minchan Kim CC: Andrew Morton CC: KOSAKI Motohiro CC: Wen Congyang Signed-off-by: Yasuaki Ishimatsu --- mm/memory_hotplug.c | 15 +++ 1 files changed, 11 insertions(+), 4 deletions(-) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index d736df3..26a5012 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -95,10 +95,17 @@ static void release_memory_resource(struct resource *res) static void get_page_bootmem(unsigned long info, struct page *page, unsigned long type) { - page->lru.next = (struct list_head *) type; - SetPagePrivate(page); - set_page_private(page, info); - atomic_inc(&page->_count); + unsigned long page_type; + + page_type = (unsigned long)page->lru.next; + if (page_type < MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE || + page_type > MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE) { + page->lru.next = (struct list_head *)type; + SetPagePrivate(page); + set_page_private(page, info); + atomic_inc(&page->_count); + } else + atomic_inc(&page->_count); } /* reference to __meminit __free_pages_bootmem is valid
Re: [PATCH 3/3] memory_hotplug: Don't modify the zone_start_pfn outside of zone_span_writelock()
On 09/28/2012 03:29 PM, Lai Jiangshan wrote: Hi, Chen, On 09/27/2012 09:19 PM, Ni zhan Chen wrote: On 09/27/2012 02:47 PM, Lai Jiangshan wrote: __add_zone() may call the sleepable init_currently_empty_zone() to init wait_table, but that function also modifies zone_start_pfn without any lock. It is buggy. So we move this modification out, and we ensure the modification of zone_start_pfn is only done with zone_span_writelock() held or during boot. Since zone_start_pfn is no longer modified by init_currently_empty_zone(), grow_zone_span() needs to check zone_start_pfn before updating it. CC: Mel Gorman Signed-off-by: Lai Jiangshan Reported-by: Yasuaki ISIMATU Tested-by: Wen Congyang --- mm/memory_hotplug.c |2 +- mm/page_alloc.c |3 +-- 2 files changed, 2 insertions(+), 3 deletions(-) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index b62d429b..790561f 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -205,7 +205,7 @@ static void grow_zone_span(struct zone *zone, unsigned long start_pfn, zone_span_writelock(zone); old_zone_end_pfn = zone->zone_start_pfn + zone->spanned_pages; -if (start_pfn < zone->zone_start_pfn) +if (!zone->zone_start_pfn || start_pfn < zone->zone_start_pfn) zone->zone_start_pfn = start_pfn; zone->spanned_pages = max(old_zone_end_pfn, end_pfn) - diff --git a/mm/page_alloc.c b/mm/page_alloc.c index c13ea75..2545013 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3997,8 +3997,6 @@ int __meminit init_currently_empty_zone(struct zone *zone, return ret; pgdat->nr_zones = zone_idx(zone) + 1; -zone->zone_start_pfn = zone_start_pfn; - then how can mminit_dprintk print zone->zone_start_pfn ? Always printing 0 makes no sense. The full code here: mminit_dprintk(MMINIT_TRACE, "memmap_init", "Initialising map node %d zone %lu pfns %lu -> %lu\n", pgdat->node_id, (unsigned long)zone_idx(zone), zone_start_pfn, (zone_start_pfn + size)); It doesn't always print 0, it still behaves as I expected. Could you elaborate? Yeah, you are right. 
I mean mminit_dprintk is called after zone->zone_start_pfn is initialized, to show the initialising state; but after this patch is applied, zone->zone_start_pfn will not be initialized before this print point. Thanks, Lai mminit_dprintk(MMINIT_TRACE, "memmap_init", "Initialising map node %d zone %lu pfns %lu -> %lu\n", pgdat->node_id, @@ -4465,6 +4463,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat, ret = init_currently_empty_zone(zone, zone_start_pfn, size, MEMMAP_EARLY); BUG_ON(ret); +zone->zone_start_pfn = zone_start_pfn; memmap_init(size, nid, j, zone_start_pfn); zone_start_pfn += size; } -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
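The arithmetic under discussion can be modelled in plain C. Below is only an illustrative userspace sketch of the patched grow_zone_span() (field names follow the kernel, the span seqlock is elided): the extra `!zone->zone_start_pfn` test is what lets the first hot-added range initialise a zone whose start pfn is still zero, now that init_currently_empty_zone() no longer sets it.

```c
#include <assert.h>

/* Userspace model of grow_zone_span() after the patch. Not kernel code:
 * locking is elided, and zone_start_pfn == 0 stands for "uninitialised". */
struct zone_model {
	unsigned long zone_start_pfn;
	unsigned long spanned_pages;
};

static unsigned long max_ul(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

static void grow_zone_span(struct zone_model *zone,
			   unsigned long start_pfn, unsigned long end_pfn)
{
	unsigned long old_zone_end_pfn;

	/* zone_span_writelock(zone) would be held here */
	old_zone_end_pfn = zone->zone_start_pfn + zone->spanned_pages;

	/* without the !zone->zone_start_pfn test, a fresh zone (start 0)
	 * would keep start 0 and compute a bogus, oversized span */
	if (!zone->zone_start_pfn || start_pfn < zone->zone_start_pfn)
		zone->zone_start_pfn = start_pfn;

	zone->spanned_pages = max_ul(old_zone_end_pfn, end_pfn) -
			      zone->zone_start_pfn;
	/* zone_span_writeunlock(zone) */
}
```

Growing an empty zone with [0x1000, 0x2000) yields start 0x1000 and span 0x1000; without the extra test the span would wrongly be 0x2000 starting at pfn 0.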
Re: [PATCH 1/4] memory-hotplug: add memory_block_release
On 09/28/2012 11:45 AM, Yasuaki Ishimatsu wrote: Hi Kosaki-san, 2012/09/28 10:35, KOSAKI Motohiro wrote: On Thu, Sep 27, 2012 at 8:24 PM, Yasuaki Ishimatsu wrote: Hi Chen, 2012/09/27 19:20, Ni zhan Chen wrote: Hi Congyang, 2012/9/27 From: Yasuaki Ishimatsu When calling remove_memory_block(), the function shows the following message at device_release(). Device 'memory528' does not have a release() function, it is broken and must be fixed. What's the difference between this patch and the original implementation? The implementation is for removing a memory_block, so the purpose is the same as the original one. But the original code is bad practice: kobject_cleanup() is called by remove_memory_block() at the end, but a release function for releasing the memory_block is not registered. As a result, the kernel message is shown. IMHO, the memory_block should be released by the release function. But your patch introduced a use-after-free bug, if I understand correctly. See the unregister_memory() function. After your patch, kobject_put() calls release_memory_block() and kfree(), and then device_unregister() will touch freed memory. It is not correct. The kobject_put() is paired with find_memory_block() in remove_memory_block(), since kobject->kref is incremented there. So release_memory_block() is called by device_unregister() correctly, as follows: Another issue in memory hotplug, not associated with this patch, to report to you: IIUC, the function register_mem_sect_under_node should be renamed to register_mem_block_under_node, since this function registers a memory block instead of a memory section. 
[ 1014.589008] Pid: 126, comm: kworker/0:2 Not tainted 3.6.0-rc3-enable-memory-hotremove-and-root-bridge #3 [ 1014.702437] Call Trace: [ 1014.731684] [] release_memory_block+0x16/0x30 [ 1014.803581] [] device_release+0x27/0xa0 [ 1014.869312] [] kobject_cleanup+0x82/0x1b0 [ 1014.937062] [] kobject_release+0xd/0x10 [ 1015.002718] [] kobject_put+0x2c/0x60 [ 1015.065271] [] put_device+0x17/0x20 [ 1015.126794] [] device_unregister+0x2a/0x60 [ 1015.195578] [] remove_memory_block+0xbb/0xf0 [ 1015.266434] [] unregister_memory_section+0x1f/0x30 [ 1015.343532] [] __remove_section+0x68/0x110 [ 1015.412318] [] __remove_pages+0xe7/0x120 [ 1015.479021] [] arch_remove_memory+0x2c/0x80 [ 1015.548845] [] remove_memory+0x6b/0xd0 [ 1015.613474] [] acpi_memory_device_remove_memory+0x48/0x73 [ 1015.697834] [] acpi_memory_device_remove+0x2b/0x44 [ 1015.774922] [] acpi_device_remove+0x90/0xb2 [ 1015.844796] [] __device_release_driver+0x7c/0xf0 [ 1015.919814] [] device_release_driver+0x2f/0x50 [ 1015.992753] [] acpi_bus_remove+0x32/0x6d [ 1016.059462] [] acpi_bus_trim+0x91/0x102 [ 1016.125128] [] acpi_bus_hot_remove_device+0x88/0x16b [ 1016.204295] [] acpi_os_execute_deferred+0x27/0x34 [ 1016.280350] [] process_one_work+0x219/0x680 [ 1016.350173] [] ? process_one_work+0x1b8/0x680 [ 1016.422072] [] ? acpi_os_wait_events_complete+0x23/0x23 [ 1016.504357] [] worker_thread+0x12e/0x320 [ 1016.571064] [] ? manage_workers+0x110/0x110 [ 1016.640886] [] kthread+0xc6/0xd0 [ 1016.699290] [] kernel_thread_helper+0x4/0x10 [ 1016.770149] [] ? retint_restore_args+0x13/0x13 [ 1016.843165] [] ? __init_kthread_worker+0x70/0x70 [ 1016.918200] [] ? gs_change+0x13/0x13 Thanks, Yasuaki Ishimatsu static void unregister_memory(struct memory_block *memory) { BUG_ON(memory->dev.bus != &memory_subsys); /* drop the ref. we got in remove_memory_block() */ kobject_put(&memory->dev.kobj); device_unregister(&memory->dev); }
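The pattern being debated above — registering a release() callback so the driver core frees the containing structure on the final reference drop — can be sketched in userspace. This is only an illustrative model (struct and function names mirror the kernel but kref accounting and the bus plumbing are elided; `device_model`, `alloc_block`, and the `releases` counter are stand-ins invented for the example):

```c
#include <stddef.h>
#include <stdlib.h>

/* Minimal model of a struct device with an optional release() callback. */
struct device_model {
	void (*release)(struct device_model *dev);
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct memory_block_model {
	unsigned long start_section_nr;
	struct device_model dev;	/* embedded like memory_block.dev */
};

static int releases;			/* test hook: counts release() calls */

/* The fix under discussion: a release() that frees the memory_block. */
static void memory_block_release(struct device_model *dev)
{
	struct memory_block_model *mem =
	    container_of(dev, struct memory_block_model, dev);
	releases++;
	free(mem);			/* kfree() lives here, not in the caller */
}

/* Stand-in for what the driver core does on the final put_device():
 * with a release() registered, no "broken and must be fixed" warning. */
static void device_release(struct device_model *dev)
{
	if (dev->release)
		dev->release(dev);
}

static struct memory_block_model *alloc_block(void)
{
	struct memory_block_model *mem = calloc(1, sizeof(*mem));
	mem->dev.release = memory_block_release;
	return mem;
}
```

The use-after-free concern in the thread is exactly about ordering: once release() frees the block, nothing may touch `mem` afterwards, so the kobject_put()/device_unregister() pair must be sequenced so the freeing put comes last.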
Re: [PATCH 1/4] memory-hotplug: add memory_block_release
On 09/28/2012 11:45 AM, Yasuaki Ishimatsu wrote: Hi Kosaki-san, 2012/09/28 10:35, KOSAKI Motohiro wrote: On Thu, Sep 27, 2012 at 8:24 PM, Yasuaki Ishimatsu wrote: Hi Chen, 2012/09/27 19:20, Ni zhan Chen wrote: Hi Congyang, 2012/9/27 From: Yasuaki Ishimatsu When calling remove_memory_block(), the function shows the following message at device_release(). Device 'memory528' does not have a release() function, it is broken and must be fixed. What's the difference between this patch and the original implementation? The implementation is for removing a memory_block, so the purpose is the same as the original one. But the original code is bad practice: kobject_cleanup() is called by remove_memory_block() at the end, but a release function for releasing the memory_block is not registered. As a result, the kernel message is shown. IMHO, the memory_block should be released by the release function. But your patch introduced a use-after-free bug, if I understand correctly. See the unregister_memory() function. After your patch, kobject_put() calls release_memory_block() and kfree(), and then device_unregister() will touch freed memory. This patch is similar to [RFC v9 PATCH 10/21] memory-hotplug: add memory_block_release; they handle the same issue. Can these two patches be folded into one? It is not correct. The kobject_put() is paired with find_memory_block() in remove_memory_block(), since kobject->kref is incremented there. 
So release_memory_block() is called by device_unregister() correctly, as follows: [ 1014.589008] Pid: 126, comm: kworker/0:2 Not tainted 3.6.0-rc3-enable-memory-hotremove-and-root-bridge #3 [ 1014.702437] Call Trace: [ 1014.731684] [] release_memory_block+0x16/0x30 [ 1014.803581] [] device_release+0x27/0xa0 [ 1014.869312] [] kobject_cleanup+0x82/0x1b0 [ 1014.937062] [] kobject_release+0xd/0x10 [ 1015.002718] [] kobject_put+0x2c/0x60 [ 1015.065271] [] put_device+0x17/0x20 [ 1015.126794] [] device_unregister+0x2a/0x60 [ 1015.195578] [] remove_memory_block+0xbb/0xf0 [ 1015.266434] [] unregister_memory_section+0x1f/0x30 [ 1015.343532] [] __remove_section+0x68/0x110 [ 1015.412318] [] __remove_pages+0xe7/0x120 [ 1015.479021] [] arch_remove_memory+0x2c/0x80 [ 1015.548845] [] remove_memory+0x6b/0xd0 [ 1015.613474] [] acpi_memory_device_remove_memory+0x48/0x73 [ 1015.697834] [] acpi_memory_device_remove+0x2b/0x44 [ 1015.774922] [] acpi_device_remove+0x90/0xb2 [ 1015.844796] [] __device_release_driver+0x7c/0xf0 [ 1015.919814] [] device_release_driver+0x2f/0x50 [ 1015.992753] [] acpi_bus_remove+0x32/0x6d [ 1016.059462] [] acpi_bus_trim+0x91/0x102 [ 1016.125128] [] acpi_bus_hot_remove_device+0x88/0x16b [ 1016.204295] [] acpi_os_execute_deferred+0x27/0x34 [ 1016.280350] [] process_one_work+0x219/0x680 [ 1016.350173] [] ? process_one_work+0x1b8/0x680 [ 1016.422072] [] ? acpi_os_wait_events_complete+0x23/0x23 [ 1016.504357] [] worker_thread+0x12e/0x320 [ 1016.571064] [] ? manage_workers+0x110/0x110 [ 1016.640886] [] kthread+0xc6/0xd0 [ 1016.699290] [] kernel_thread_helper+0x4/0x10 [ 1016.770149] [] ? retint_restore_args+0x13/0x13 [ 1016.843165] [] ? __init_kthread_worker+0x70/0x70 [ 1016.918200] [] ? gs_change+0x13/0x13 Thanks, Yasuaki Ishimatsu static void unregister_memory(struct memory_block *memory) { BUG_ON(memory->dev.bus != &memory_subsys); /* drop the ref. we got in remove_memory_block() */ kobject_put(&memory->dev.kobj); device_unregister(&memory->dev); }
Re: [RFC v9 PATCH 13/21] memory-hotplug: check page type in get_page_bootmem
On 09/05/2012 05:25 PM, we...@cn.fujitsu.com wrote: From: Yasuaki Ishimatsu isimatu.yasu...@jp.fujitsu.com The function get_page_bootmem() may be called more than once on the same page. There is no need to set the page's type and private fields if the function has already been called on the page. Note: the patch is just an optimization and does not fix any problem. Hi Yasuaki, this patch is reasonable to me. I have another question associated with get_page_bootmem(); the question is from another Fujitsu guy's patch changelog [commit 04753278769f3]. The changelog said that: 1) When the memmap of the removed section was allocated on another section by bootmem, it should/can be freed. 2) When the memmap of the removed section was allocated on the same section, it shouldn't be freed, because the section has to be logically memory-offlined already and all pages must be isolated against the page allocator. If it were freed, the page allocator might use it, although it will be removed physically soon. But I don't see his patch guarantee 2); it means that his patch doesn't guarantee that the memmap of the removed section which is allocated on another section by bootmem won't be freed. Hopefully I can get your explanation in detail; thanks in advance. 
:-) CC: David Rientjes rient...@google.com CC: Jiang Liu liu...@gmail.com CC: Len Brown len.br...@intel.com CC: Benjamin Herrenschmidt b...@kernel.crashing.org CC: Paul Mackerras pau...@samba.org CC: Christoph Lameter c...@linux.com Cc: Minchan Kim minchan@gmail.com CC: Andrew Morton a...@linux-foundation.org CC: KOSAKI Motohiro kosaki.motoh...@jp.fujitsu.com CC: Wen Congyang we...@cn.fujitsu.com Signed-off-by: Yasuaki Ishimatsu isimatu.yasu...@jp.fujitsu.com --- mm/memory_hotplug.c | 15 +++ 1 files changed, 11 insertions(+), 4 deletions(-) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index d736df3..26a5012 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -95,10 +95,17 @@ static void release_memory_resource(struct resource *res) static void get_page_bootmem(unsigned long info, struct page *page, unsigned long type) { - page->lru.next = (struct list_head *) type; - SetPagePrivate(page); - set_page_private(page, info); - atomic_inc(&page->_count); + unsigned long page_type; + + page_type = (unsigned long)page->lru.next; + if (page_type < MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE || + page_type > MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE) { + page->lru.next = (struct list_head *)type; + SetPagePrivate(page); + set_page_private(page, info); + atomic_inc(&page->_count); + } else + atomic_inc(&page->_count); } /* reference to __meminit __free_pages_bootmem is valid
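The optimization in the diff above — tag the page on the first call, only bump the refcount on later calls — can be sketched as a userspace model. This is only an illustration: the struct fields and the type-range constants are stand-ins for `page->lru.next`, `page_private`, `page->_count`, and the MEMORY_HOTPLUG_{MIN,MAX}_BOOTMEM_TYPE bounds, not the real kernel definitions.

```c
/* Stand-in range for MEMORY_HOTPLUG_MIN/MAX_BOOTMEM_TYPE. */
enum { BOOTMEM_TYPE_MIN = 1, BOOTMEM_TYPE_MAX = 3 };

struct page_model {
	unsigned long type;	/* plays the role of page->lru.next */
	unsigned long private;	/* plays the role of page_private(page) */
	int count;		/* plays the role of page->_count */
};

static void get_page_bootmem(unsigned long info, struct page_model *page,
			     unsigned long type)
{
	unsigned long page_type = page->type;

	/* first call only: the page is not yet tagged with a bootmem type */
	if (page_type < BOOTMEM_TYPE_MIN || page_type > BOOTMEM_TYPE_MAX) {
		page->type = type;
		page->private = info;
	}
	page->count++;		/* every call takes a reference */
}
```

A second call with a different info/type leaves the tag untouched and only increments the count, which is exactly why the patch is a no-op semantically and safe as pure optimization.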
Re: [PATCH 0/4] bugfix for memory hotplug
On 09/27/2012 01:45 PM, we...@cn.fujitsu.com wrote: From: Wen Congyang we...@cn.fujitsu.com Wen Congyang (2): memory-hotplug: clear hwpoisoned flag when onlining pages memory-hotplug: auto offline page_cgroup when onlining memory block failed Again, you should explain that these two patches are the new version of memory-hotplug: hot-remove physical memory [20/21, 21/21]. Yasuaki Ishimatsu (2): memory-hotplug: add memory_block_release memory-hotplug: add node_device_release drivers/base/memory.c |9 - drivers/base/node.c | 11 +++ mm/memory_hotplug.c |8 mm/page_cgroup.c |3 +++ 4 files changed, 30 insertions(+), 1 deletions(-) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majord...@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: em...@kvack.org
Re: [RFC v9 PATCH 00/21] memory-hotplug: hot-remove physical memory
On 09/05/2012 05:25 PM, we...@cn.fujitsu.com wrote: From: Wen Congyang we...@cn.fujitsu.com This patch series aims to support physical memory hot-remove. The patches can free/remove the following things: - acpi_memory_info : [RFC PATCH 4/19] - /sys/firmware/memmap/X/{end, start, type} : [RFC PATCH 8/19] - iomem_resource : [RFC PATCH 9/19] - mem_section and related sysfs files : [RFC PATCH 10-11, 13-16/19] - page table of removed memory : [RFC PATCH 12/19] - node and related sysfs files : [RFC PATCH 18-19/19] If you find any missing functionality for physical memory hot-remove, please let me know. Since the patchset is big, could you add a more detailed changelog describing how this patchset works, so that it is easier to review? How to test this patchset? 1. Apply this patchset and build the kernel. MEMORY_HOTPLUG, MEMORY_HOTREMOVE, ACPI_HOTPLUG_MEMORY must be selected. 2. Load the module acpi_memhotplug. 3. Hotplug the memory device (it depends on your hardware). You will see the memory device under the directory /sys/bus/acpi/devices/. Its name is PNP0C80:XX. 4. Online/offline pages provided by this memory device. You can write online/offline to /sys/devices/system/memory/memoryX/state to online/offline pages provided by this memory device. 5. Hot-remove the memory device. You can hot-remove the memory device via the hardware, or by writing 1 to /sys/bus/acpi/devices/PNP0C80:XX/eject. Note: if the memory provided by the memory device is used by the kernel, it can't be offlined. It is not a bug. Known problems: 1. Memory can't be offlined when CONFIG_MEMCG is selected. For example: there is a memory device on node 1. The address range is [1G, 1.5G). You will find 4 new directories memory8, memory9, memory10, and memory11 under the directory /sys/devices/system/memory/. If CONFIG_MEMCG is selected, we will allocate memory to store the page cgroup when we online pages. When we online memory8, the memory storing the page cgroup is not provided by this memory device. 
But when we online memory9, the memory storing the page cgroup may be provided by memory8, so we can't offline memory8 now. We should offline the memory in the reversed order. When the memory device is hot-removed, we will auto-offline the memory provided by this memory device. But we don't know which memory was onlined first, so offlining memory may fail. In such a case, you should offline the memory by hand before hot-removing the memory device. 2. Hot-removing a memory device may cause a kernel panic. This bug will be fixed by Liu Jiang's patch: https://lkml.org/lkml/2012/7/3/1 change log of v9: [RFC PATCH v9 8/21] * add a lock to protect the list map_entries * add an indicator to firmware_map_entry to remember whether the memory is allocated from bootmem [RFC PATCH v9 10/21] * change the macro to an inline function [RFC PATCH v9 19/21] * don't offline the node if a cpu on the node is onlined [RFC PATCH v9 21/21] * create new patch: auto offline page_cgroup when onlining memory block failed change log of v8: [RFC PATCH v8 17/20] * fix problems when one node's range includes the other nodes [RFC PATCH v8 18/20] * fix building error when CONFIG_MEMORY_HOTPLUG_SPARSE or CONFIG_HUGETLBFS is not defined [RFC PATCH v8 19/20] * don't offline the node when some memory sections are not removed [RFC PATCH v8 20/20] * create new patch: clear hwpoisoned flag when onlining pages change log of v7: [RFC PATCH v7 4/19] * do not continue if acpi_memory_device_remove_memory() fails [RFC PATCH v7 15/19] * handle usemap in register_page_bootmem_info_section() too change log of v6: [RFC PATCH v6 12/19] * fix building error on architectures other than x86 [RFC PATCH v6 15-16/19] * fix building error on architectures other than x86 change log of v5: * merge the patchset to clear the page table and the patchset to hot-remove memory (from ishimatsu) into one big patchset. [RFC PATCH v5 1/19] * rename remove_memory() to offline_memory()/offline_pages() [RFC PATCH v5 2/19] * new patch: implement offline_memory(). 
This function offlines pages, updates the memory block's state, and notifies userspace that the memory block's state has changed. [RFC PATCH v5 4/19] * offline and remove memory in acpi_memory_disable_device() too. [RFC PATCH v5 17/19] * new patch: add a new function __remove_zone() to revert the things done in the function __add_zone(). [RFC PATCH v5 18/19] * flush work before resetting the node device. change log of v4: * remove memory-hotplug : unify argument of firmware_map_add_early/hotplug from the patch series, since the patch is a bugfix. It is being discussed on another thread. But for testing the patch series, the patch is needed. So I added
Re: [RFC v9 PATCH 04/21] memory-hotplug: offline and remove memory when removing the memory device
On 09/05/2012 05:25 PM, we...@cn.fujitsu.com wrote: From: Yasuaki Ishimatsu We should offline and remove memory when removing the memory device. The memory device can be removed in 2 ways: 1. send an eject request by SCI 2. echo 1 >/sys/bus/pci/devices/PNP0C80:XX/eject In the 1st case, acpi_memory_disable_device() will be called. In the 2nd case, acpi_memory_device_remove() will be called. acpi_memory_device_remove() will also be called when we unbind the memory device from the driver acpi_memhotplug. If the type is ACPI_BUS_REMOVAL_EJECT, it means that the user wants to eject the memory device, and we should offline and remove memory in acpi_memory_device_remove(). The function remove_memory() is not implemented now; it only checks whether all memory has been offlined. CC: David Rientjes CC: Jiang Liu CC: Len Brown CC: Benjamin Herrenschmidt CC: Paul Mackerras CC: Christoph Lameter Cc: Minchan Kim CC: Andrew Morton CC: KOSAKI Motohiro Signed-off-by: Yasuaki Ishimatsu Signed-off-by: Wen Congyang --- drivers/acpi/acpi_memhotplug.c | 45 +-- drivers/base/memory.c | 39 ++ include/linux/memory.h |5 include/linux/memory_hotplug.h |5 mm/memory_hotplug.c| 22 +++ 5 files changed, 109 insertions(+), 7 deletions(-) diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c index 7873832..9d47458 100644 --- a/drivers/acpi/acpi_memhotplug.c +++ b/drivers/acpi/acpi_memhotplug.c @@ -29,6 +29,7 @@ #include #include #include +#include #include #include #include @@ -310,25 +311,44 @@ static int acpi_memory_powerdown_device(struct acpi_memory_device *mem_device) return 0; } -static int acpi_memory_disable_device(struct acpi_memory_device *mem_device) +static int +acpi_memory_device_remove_memory(struct acpi_memory_device *mem_device) { int result; struct acpi_memory_info *info, *n; + int node = mem_device->nid; - - /* - * Ask the VM to offline this memory range. 
- * Note: Assume that this function returns zero on success - */ list_for_each_entry_safe(info, n, &mem_device->res_list, list) { if (info->enabled) { result = offline_memory(info->start_addr, info->length); if (result) return result; + + result = remove_memory(node, info->start_addr, + info->length); + if (result) + return result; } + + list_del(&info->list); kfree(info); } + return 0; +} + +static int acpi_memory_disable_device(struct acpi_memory_device *mem_device) +{ + int result; + + /* + * Ask the VM to offline this memory range. + * Note: Assume that this function returns zero on success + */ + result = acpi_memory_device_remove_memory(mem_device); + if (result) + return result; + /* Power-off and eject the device */ result = acpi_memory_powerdown_device(mem_device); if (result) { @@ -477,12 +497,23 @@ static int acpi_memory_device_add(struct acpi_device *device) static int acpi_memory_device_remove(struct acpi_device *device, int type) { struct acpi_memory_device *mem_device = NULL; - + int result; if (!device || !acpi_driver_data(device)) return -EINVAL; mem_device = acpi_driver_data(device); + + if (type == ACPI_BUS_REMOVAL_EJECT) { + /* + * offline and remove memory only when the memory device is + * ejected. 
+*/ + result = acpi_memory_device_remove_memory(mem_device); + if (result) + return result; + } + kfree(mem_device); return 0; diff --git a/drivers/base/memory.c b/drivers/base/memory.c index 86c8821..038be73 100644 --- a/drivers/base/memory.c +++ b/drivers/base/memory.c @@ -70,6 +70,45 @@ void unregister_memory_isolate_notifier(struct notifier_block *nb) } EXPORT_SYMBOL(unregister_memory_isolate_notifier); +bool is_memblk_offline(unsigned long start, unsigned long size) +{ + struct memory_block *mem = NULL; + struct mem_section *section; + unsigned long start_pfn, end_pfn; + unsigned long pfn, section_nr; + + start_pfn = PFN_DOWN(start); + end_pfn = PFN_UP(start + size); + + for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) { + section_nr = pfn_to_section_nr(pfn); + if (!present_section_nr(section_nr)) + continue; + + section = __nr_to_section(section_nr); + /* same memblock? */ + if (mem) +
Re: [RFC v9 PATCH 05/21] memory-hotplug: check whether memory is present or not
On 09/11/2012 10:24 AM, Yasuaki Ishimatsu wrote:
> Hi Wen,
>
> 2012/09/11 11:15, Wen Congyang wrote:
>> Hi, ishimatsu
>>
>> At 09/05/2012 05:25 PM, we...@cn.fujitsu.com wrote:
>>> From: Yasuaki Ishimatsu
>>>
>>> If the system supports memory hot-remove, online_pages() may online
>>> removed pages. So online_pages() needs to check whether the pages
>>> being onlined are present or not.
>>
>> Because we use memory_block_change_state() to hot-remove memory,
>> I think this patch can be removed. What do you think?
>
> Please teach me the details a little more. If we use
> memory_block_change_state(), does the conflict never occur? Why?

Since memory hot-add and hot-remove are based on memblock, can the check
in memory_block_change_state() guarantee that the conflict never occurs?

> Thanks,
> Yasuaki Ishimatsu

>> Thanks
>> Wen Congyang

CC: David Rientjes
CC: Jiang Liu
CC: Len Brown
CC: Benjamin Herrenschmidt
CC: Paul Mackerras
CC: Christoph Lameter
Cc: Minchan Kim
CC: Andrew Morton
CC: KOSAKI Motohiro
CC: Wen Congyang
Signed-off-by: Yasuaki Ishimatsu
---
 include/linux/mmzone.h |   19 +++
 mm/memory_hotplug.c    |   13 +
 2 files changed, 32 insertions(+), 0 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 2daa54f..ac3ae30 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1180,6 +1180,25 @@ void sparse_init(void);
 #define sparse_index_init(_sec, _nid) do {} while (0)
 #endif /* CONFIG_SPARSEMEM */
 
+#ifdef CONFIG_SPARSEMEM
+static inline int pfns_present(unsigned long pfn, unsigned long nr_pages)
+{
+	int i;
+	for (i = 0; i < nr_pages; i++) {
+		if (pfn_present(pfn + i))
+			continue;
+		else
+			return -EINVAL;
+	}
+	return 0;
+}
+#else
+static inline int pfns_present(unsigned long pfn, unsigned long nr_pages)
+{
+	return 0;
+}
+#endif /* CONFIG_SPARSEMEM */
+
 #ifdef CONFIG_NODES_SPAN_OTHER_NODES
 bool early_pfn_in_nid(unsigned long pfn, int nid);
 #else
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 49f7747..299747d 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -467,6 +467,19 @@ int __ref online_pages(unsigned
long pfn, unsigned long nr_pages)
 	struct memory_notify arg;
 
 	lock_memory_hotplug();
+	/*
+	 * If system supports memory hot-remove, the memory may have been
+	 * removed. So we check whether the memory has been removed or not.
+	 *
+	 * Note: When CONFIG_SPARSEMEM is defined, pfns_present() become
+	 * effective. If CONFIG_SPARSEMEM is not defined, pfns_present()
+	 * always returns 0.
+	 */
+	ret = pfns_present(pfn, nr_pages);
+	if (ret) {
+		unlock_memory_hotplug();
+		return ret;
+	}
 	arg.start_pfn = pfn;
 	arg.nr_pages = nr_pages;
 	arg.status_change_nid = -1;