Re: [PATCH v3] mm: thp: Set the accessed flag for old pages on access fault.

2012-10-26 Thread Ni zhan Chen

On 10/26/2012 05:34 PM, Will Deacon wrote:

On Fri, Oct 26, 2012 at 07:19:55AM +0100, Ni zhan Chen wrote:

On 10/26/2012 12:44 AM, Will Deacon wrote:

On x86 memory accesses to pages without the ACCESSED flag set result in the
ACCESSED flag being set automatically. With the ARM architecture a page access
fault is raised instead (and it will continue to be raised until the ACCESSED
flag is set for the appropriate PTE/PMD).

For normal memory pages, handle_pte_fault will call pte_mkyoung (effectively
setting the ACCESSED flag). For transparent huge pages, pmd_mkyoung will only
be called for a write fault.

This patch ensures that faults on transparent hugepages which do not result
in a CoW update the access flags for the faulting pmd.
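
(Editor's aside, paraphrased from memory of mm/memory.c of that era rather than quoted from the patch: the normal-page path referred to above ends roughly as follows, and this is the behaviour the patch brings to huge pmds via huge_pmd_set_accessed().)

	/* Approximate tail of handle_pte_fault() for an already-present pte:
	 * mark the entry young (and dirty on a write fault), then push the
	 * update out so architectures that fault on old entries stop faulting.
	 */
	if (flags & FAULT_FLAG_WRITE) {
		if (!pte_write(entry))
			return do_wp_page(mm, vma, address, pte, pmd, ptl, entry);
		entry = pte_mkdirty(entry);
	}
	entry = pte_mkyoung(entry);
	if (ptep_set_access_flags(vma, address, pte, entry,
				  flags & FAULT_FLAG_WRITE))
		update_mmu_cache(vma, address, pte);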

Could you write a changelog?

From v2? I included something below my SoB. The code should do exactly the
same as before, it's just rebased onto next so that I can play nicely with
Peter's patches.


Cc: Chris Metcalf <cmetc...@tilera.com>
Cc: Kirill A. Shutemov <kir...@shutemov.name>
Cc: Andrea Arcangeli <aarca...@redhat.com>
Signed-off-by: Will Deacon <will.dea...@arm.com>
---

Ok chaps, I rebased this thing onto today's next (which basically
necessitated a rewrite) so I've reluctantly dropped my acks and kindly
ask if you could eyeball the new code, especially where the locking is
concerned. In the numa code (do_huge_pmd_prot_none), Peter checks again
that the page is not splitting, but I can't see why that is required.

Cheers,

Will

Could you explain why you do not call pmd_trans_huge_lock to confirm that the
pmd is splitting or stable, as Andrea pointed out?

The way handle_mm_fault is now structured after the numa changes means that
we only enter the huge pmd page aging code if the entry wasn't splitting


Why do you call it the huge pmd page *aging* code?

Regards,
Chen


before taking the lock, so it seemed a bit gratuitous to jump through those
hoops again in pmd_trans_huge_lock.

Will





Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state

2012-10-26 Thread Ni zhan Chen

On 10/26/2012 04:02 PM, Fengguang Wu wrote:

On Fri, Oct 26, 2012 at 03:47:19PM +0800, Ni zhan Chen wrote:

On 10/26/2012 03:36 PM, Fengguang Wu wrote:

On Fri, Oct 26, 2012 at 03:19:57PM +0800, Ni zhan Chen wrote:

On 10/26/2012 03:09 PM, Fengguang Wu wrote:

On Fri, Oct 26, 2012 at 03:03:12PM +0800, Ni zhan Chen wrote:

On 10/26/2012 02:58 PM, Fengguang Wu wrote:

  static void shrink_readahead_size_eio(struct file *filp,
 struct file_ra_state *ra)
  {
-   ra->ra_pages /= 4;
+   spin_lock(&filp->f_lock);
+   filp->f_mode |= FMODE_RANDOM;
+   spin_unlock(&filp->f_lock);

As the example in the comment above this function shows, the read may still
be sequential, and it will waste IO bandwidth if we switch to FMODE_RANDOM
directly.

Yes immediately disabling readahead may hurt IO performance, the
original '/ 4' may perform better when there are only 1-3 IO errors
encountered.

Hi Fengguang,

Why should the number be 1-3?

The original behavior is '/= 4' on each error.

After 1 error, the readahead size will be shrunk to 1/4
After 2 errors, the readahead size will be shrunk to 1/16
After 3 errors, the readahead size will be shrunk to 1/64
After 4 errors, the readahead size will be effectively 0 (disabled)

But from the function shrink_readahead_size_eio and its caller
filemap_fault I can't find the behavior you mentioned. How did you
figure that out?

It's this line in shrink_readahead_size_eio():

 ra->ra_pages /= 4;

Yeah, I mean why will the readahead size be 0 (disabled) after the 4th
error? What's the original value of ra->ra_pages? How can we guarantee that
the 4th shrink brings the readahead size to 0?

Ah OK, I'm talking about the typical case. The default readahead size
is 128k, which will become 0 after being divided by 256. A reasonably good ra
size for hard disks is 1MB = 256 pages, which also becomes 1 page after 4 errors.
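
(Editor's aside, not part of the thread: the following minimal user-space
sketch just plays out the '/= 4' arithmetic discussed above, assuming a
4 KiB page size, so 128 KiB is 32 pages and 1 MiB is 256 pages.)

#include <stdio.h>

int main(void)
{
	/* Starting readahead sizes in pages: the 128 KiB default and 1 MiB. */
	unsigned long starts[] = { 32, 256 };

	for (int i = 0; i < 2; i++) {
		unsigned long ra_pages = starts[i];

		printf("start: %lu pages\n", ra_pages);
		for (int err = 1; err <= 4; err++) {
			ra_pages /= 4;	/* what shrink_readahead_size_eio() does */
			printf("  after error %d: %lu pages\n", err, ra_pages);
		}
	}
	return 0;
}

With the 32-page default, the integer division already reaches 0 at the third
error; the 256-page setting is left with a single page after the fourth.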


Then why is the default size not set to that reasonable size?

Regards,
Chen



Thanks,
Fengguang





Re: MMTests 0.06

2012-10-26 Thread Ni zhan Chen

On 10/12/2012 10:51 PM, Mel Gorman wrote:

MMTests 0.06 is a configurable test suite that runs a number of common
workloads of interest to MM developers. There are multiple additions in all,
but in many respects the most useful will be automatic package installation.
The package names are based on openSUSE but it's easy to create mappings in
bin/install-depends where the package names differ. The very basics of
monitoring NUMA efficiency are there as well and the autonuma benchmark has a
test. The stats it reports for NUMA need significant improvement but for the
most part that should be straightforward.

Changelog since v0.05
o Automatically install packages (need name mappings for other distros)
o Add benchmark for autonumabench
o Add support for benchmarking NAS with MPI
o Add pgbench for autonumabench (may need a bit more work)
o Upgrade postgres version to 9.2.1
o Upgrade kernel version used for kernbench to 3.0 for newer toolchains
o Alter mailserver config to finish in a reasonable time
o Add monitor for perf sched
o Add monitor that gathers ftrace information with trace-cmd
o Add preliminary monitors for NUMA stats (very basic)
o Specify ftrace events to monitor from config file
o Remove the bulk of what's left of VMRegress
o Convert shellpacks to a template format to auto-generate boilerplate code
o Collect lock_stat information if enabled
o Run multiple iterations of aim9
o Add basic regression tests for Cross Memory Attach
o Cope with preempt being enabled in highalloc stress tests
o Have largedd cope with a missing large file to work with
o Add a monitor-only mode to just capture logs
o Report receive-side throughput in netperf for results

At LSF/MM at some point a request was made that a series of tests
be identified that were of interest to MM developers and that could be
used for testing the Linux memory management subsystem. There is renewed
interest in some sort of general testing framework during discussions for
Kernel Summit 2012 so here is what I use.

http://www.csn.ul.ie/~mel/projects/mmtests/
http://www.csn.ul.ie/~mel/projects/mmtests/mmtests-0.06-mmtests-0.01.tar.gz

There are a number of stock configurations stored in configs/.  For example
config-global-dhp__pagealloc-performance runs a number of tests that
may be able to identify performance regressions or gains in the page
allocator. Similarly there are network and scheduler configs. There are also
more complex options. config-global-dhp__parallelio-memcachetest will run
memcachetest in the foreground while doing IO of different sizes in the
background to measure how much unrelated IO affects the throughput of an
in-memory database.

This release is also a little rough and the extraction scripts could
have been tidier but they were mostly written in an airport and for the
most part they work as advertised. I'll fix bugs as they are
brought to my attention.

The stats reporting still needs work because while some tests know how
to make a better estimate of mean by filtering outliers it is not being
handled consistently and the methodology needs work. I know filtering
statistics like this is a major flaw in the methodology but the decision
was made in this case in the interest of the benchmarks with unstable
results completing in a reasonable time.


Hi Gorman,

Could MMTests 0.07 auto-download the related packages for different
distributions?


Regards,
Chen




Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state

2012-10-26 Thread Ni zhan Chen

On 10/26/2012 03:36 PM, Fengguang Wu wrote:

On Fri, Oct 26, 2012 at 03:19:57PM +0800, Ni zhan Chen wrote:

On 10/26/2012 03:09 PM, Fengguang Wu wrote:

On Fri, Oct 26, 2012 at 03:03:12PM +0800, Ni zhan Chen wrote:

On 10/26/2012 02:58 PM, Fengguang Wu wrote:

  static void shrink_readahead_size_eio(struct file *filp,
 struct file_ra_state *ra)
  {
-   ra->ra_pages /= 4;
+   spin_lock(&filp->f_lock);
+   filp->f_mode |= FMODE_RANDOM;
+   spin_unlock(&filp->f_lock);

As the example in the comment above this function shows, the read may still
be sequential, and it will waste IO bandwidth if we switch to FMODE_RANDOM
directly.

Yes immediately disabling readahead may hurt IO performance, the
original '/ 4' may perform better when there are only 1-3 IO errors
encountered.

Hi Fengguang,

Why should the number be 1-3?

The original behavior is '/= 4' on each error.

After 1 error, the readahead size will be shrunk to 1/4
After 2 errors, the readahead size will be shrunk to 1/16
After 3 errors, the readahead size will be shrunk to 1/64
After 4 errors, the readahead size will be effectively 0 (disabled)

But from the function shrink_readahead_size_eio and its caller
filemap_fault I can't find the behavior you mentioned. How did you
figure that out?

It's this line in shrink_readahead_size_eio():

 ra->ra_pages /= 4;


Yeah, I mean why will the readahead size be 0 (disabled) after the 4th
error? What's the original value of ra->ra_pages? How can we guarantee that
the 4th shrink brings the readahead size to 0?


Regards,
Chen



That ra_pages will keep shrinking by a factor of 4 on each error. The only way
to restore it is to reopen the file, or to use POSIX_FADV_SEQUENTIAL.

Thanks,
Fengguang





Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state

2012-10-26 Thread Ni zhan Chen

On 10/26/2012 03:09 PM, Fengguang Wu wrote:

On Fri, Oct 26, 2012 at 03:03:12PM +0800, Ni zhan Chen wrote:

On 10/26/2012 02:58 PM, Fengguang Wu wrote:

  static void shrink_readahead_size_eio(struct file *filp,
 struct file_ra_state *ra)
  {
-   ra->ra_pages /= 4;
+   spin_lock(&filp->f_lock);
+   filp->f_mode |= FMODE_RANDOM;
+   spin_unlock(&filp->f_lock);

As the example in the comment above this function shows, the read may still
be sequential, and it will waste IO bandwidth if we switch to FMODE_RANDOM
directly.

Yes immediately disabling readahead may hurt IO performance, the
original '/ 4' may perform better when there are only 1-3 IO errors
encountered.

Hi Fengguang,

Why should the number be 1-3?

The original behavior is '/= 4' on each error.

After 1 error, the readahead size will be shrunk to 1/4
After 2 errors, the readahead size will be shrunk to 1/16
After 3 errors, the readahead size will be shrunk to 1/64
After 4 errors, the readahead size will be effectively 0 (disabled)


But from the function shrink_readahead_size_eio and its caller filemap_fault
I can't find the behavior you mentioned. How did you figure that out?


Regards,
Chen



Thanks,
Fengguang





Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state

2012-10-26 Thread Ni zhan Chen

On 10/26/2012 02:58 PM, Fengguang Wu wrote:

  static void shrink_readahead_size_eio(struct file *filp,
 struct file_ra_state *ra)
  {
-   ra->ra_pages /= 4;
+   spin_lock(&filp->f_lock);
+   filp->f_mode |= FMODE_RANDOM;
+   spin_unlock(&filp->f_lock);

As the example in the comment above this function shows, the read may still
be sequential, and it will waste IO bandwidth if we switch to FMODE_RANDOM
directly.

Yes immediately disabling readahead may hurt IO performance, the
original '/ 4' may perform better when there are only 1-3 IO errors
encountered.


Hi Fengguang,

Why should the number be 1-3?

Regards,
Chen



Thanks,
Fengguang



Re: [PATCH v3] mm: thp: Set the accessed flag for old pages on access fault.

2012-10-26 Thread Ni zhan Chen

On 10/26/2012 12:44 AM, Will Deacon wrote:

On x86 memory accesses to pages without the ACCESSED flag set result in the
ACCESSED flag being set automatically. With the ARM architecture a page access
fault is raised instead (and it will continue to be raised until the ACCESSED
flag is set for the appropriate PTE/PMD).

For normal memory pages, handle_pte_fault will call pte_mkyoung (effectively
setting the ACCESSED flag). For transparent huge pages, pmd_mkyoung will only
be called for a write fault.

This patch ensures that faults on transparent hugepages which do not result
in a CoW update the access flags for the faulting pmd.


Could you write a changelog?



Cc: Chris Metcalf <cmetc...@tilera.com>
Cc: Kirill A. Shutemov <kir...@shutemov.name>
Cc: Andrea Arcangeli <aarca...@redhat.com>
Signed-off-by: Will Deacon <will.dea...@arm.com>
---

Ok chaps, I rebased this thing onto today's next (which basically
necessitated a rewrite) so I've reluctantly dropped my acks and kindly
ask if you could eyeball the new code, especially where the locking is
concerned. In the numa code (do_huge_pmd_prot_none), Peter checks again
that the page is not splitting, but I can't see why that is required.

Cheers,

Will


Could you explain why you do not call pmd_trans_huge_lock to confirm that the
pmd is splitting or stable, as Andrea pointed out?




  include/linux/huge_mm.h |    4 ++++
  mm/huge_memory.c        |   22 ++++++++++++++++++++++
  mm/memory.c             |    7 ++++++-
  3 files changed, 32 insertions(+), 1 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 4f0f948..766fb27 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -8,6 +8,10 @@ extern int do_huge_pmd_anonymous_page(struct mm_struct *mm,
  extern int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
 struct vm_area_struct *vma);
+extern void huge_pmd_set_accessed(struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmd,
+ pmd_t orig_pmd, int dirty);
  extern int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
   unsigned long address, pmd_t *pmd,
   pmd_t orig_pmd);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3c14a96..f024d98 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -932,6 +932,28 @@ out:
return ret;
  }
  
+void huge_pmd_set_accessed(struct mm_struct *mm,
+  struct vm_area_struct *vma,
+  unsigned long address,
+  pmd_t *pmd, pmd_t orig_pmd,
+  int dirty)
+{
+   pmd_t entry;
+   unsigned long haddr;
+
+   spin_lock(&mm->page_table_lock);
+   if (unlikely(!pmd_same(*pmd, orig_pmd)))
+   goto unlock;
+
+   entry = pmd_mkyoung(orig_pmd);
+   haddr = address & HPAGE_PMD_MASK;
+   if (pmdp_set_access_flags(vma, haddr, pmd, entry, dirty))
+   update_mmu_cache_pmd(vma, address, pmd);
+
+unlock:
+   spin_unlock(&mm->page_table_lock);
+}
+
  static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
struct vm_area_struct *vma,
unsigned long address,
diff --git a/mm/memory.c b/mm/memory.c
index f21ac1c..bcbc084 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3650,12 +3650,14 @@ retry:
  
  		barrier();

if (pmd_trans_huge(orig_pmd) && !pmd_trans_splitting(orig_pmd)) 
{
+   unsigned int dirty = flags & FAULT_FLAG_WRITE;
+
if (pmd_numa(vma, orig_pmd)) {
do_huge_pmd_numa_page(mm, vma, address, pmd,
  flags, orig_pmd);
}
  
-			if ((flags & FAULT_FLAG_WRITE) && !pmd_write(orig_pmd)) {

+   if (dirty && !pmd_write(orig_pmd)) {
ret = do_huge_pmd_wp_page(mm, vma, address, pmd,
  orig_pmd);
/*
@@ -3665,6 +3667,9 @@ retry:
 */
if (unlikely(ret & VM_FAULT_OOM))
goto retry;
+   } else {
+   huge_pmd_set_accessed(mm, vma, address, pmd,
+ orig_pmd, dirty);
}
  
  			return ret;




Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state

2012-10-25 Thread Ni zhan Chen

On 10/26/2012 11:28 AM, YingHang Zhu wrote:

On Fri, Oct 26, 2012 at 10:30 AM, Ni zhan Chen  wrote:

On 10/26/2012 09:27 AM, Fengguang Wu wrote:

On Fri, Oct 26, 2012 at 11:25:44AM +1100, Dave Chinner wrote:

On Thu, Oct 25, 2012 at 10:58:26AM +0800, Fengguang Wu wrote:

Hi Chen,


But how can bdi related ra_pages reflect different files' readahead
window? Maybe these different files are sequential read, random read
and so on.

It's simple: sequential reads will get ra_pages readahead size while
random reads will not get readahead at all.

Talking about the below chunk, it might hurt someone that explicitly
takes advantage of the behavior, however the ra_pages*2 seems more
like a hack than general solution to me: if the user will need
POSIX_FADV_SEQUENTIAL to double the max readahead window size for
improving IO performance, then why not just increase bdi->ra_pages and
benefit all reads? One may argue that it offers some differential
behavior to specific applications, however it may also present as a
counter-optimization: if the root already tuned bdi->ra_pages to the
optimal size, the doubled readahead size will only cost more memory
and perhaps IO latency.

--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -87,7 +86,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
  spin_unlock(&file->f_lock);
  break;
  case POSIX_FADV_SEQUENTIAL:
-   file->f_ra.ra_pages = bdi->ra_pages * 2;

I think we really have to reset file->f_ra.ra_pages here as it is
not a set-and-forget value. e.g.  shrink_readahead_size_eio() can
reduce ra_pages as a result of IO errors. Hence if you have had io
errors, telling the kernel that you are now going to do  sequential
IO should reset the readahead to the maximum ra_pages value
supported

Good point!

 but wait  this patch removes file->f_ra.ra_pages in all other
places too, so there will be no file->f_ra.ra_pages to be reset here...


In his patch,


  static void shrink_readahead_size_eio(struct file *filp,
 struct file_ra_state *ra)
  {
-   ra->ra_pages /= 4;
+   spin_lock(&filp->f_lock);
+   filp->f_mode |= FMODE_RANDOM;
+   spin_unlock(&filp->f_lock);

As the example in the comment above this function shows, the read may still
be sequential, and it will waste IO bandwidth if we switch to FMODE_RANDOM
directly.

I've considered this. On the first try I modified file_ra_state.size and
file_ra_state.async_size directly, like

file_ra_state.async_size = 0;
file_ra_state.size /= 4;

but as I commented here, we cannot predict whether the bad sectors will
trash the readahead window; maybe the sectors following the current one are
fine to go through normal readahead. It's hard to know, so the current
approach gives us a chance to slow down softly.


Then when will filp->f_mode |= FMODE_RANDOM be checked? Does it
influence ra->ra_pages?
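
(Editor's note: FMODE_RANDOM takes effect on the sync readahead path rather
than by changing ra->ra_pages; roughly paraphrased from mm/readahead.c of
that era, not quoted verbatim:)

void page_cache_sync_readahead(struct address_space *mapping,
			       struct file_ra_state *ra, struct file *filp,
			       pgoff_t offset, unsigned long req_size)
{
	/* no read-ahead at all */
	if (!ra->ra_pages)
		return;

	/* be dumb: read only what was asked for, no readahead window */
	if (filp && (filp->f_mode & FMODE_RANDOM)) {
		force_page_cache_readahead(mapping, filp, offset, req_size);
		return;
	}

	/* normal on-demand readahead */
	ondemand_readahead(mapping, ra, filp, false, offset, req_size);
}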




Thanks,
 Ying Zhu

Thanks,
Fengguang





Re: [PATCH v3] mm: thp: Set the accessed flag for old pages on access fault.

2012-10-25 Thread Ni zhan Chen

On 10/26/2012 03:51 AM, Johannes Weiner wrote:

On Thu, Oct 25, 2012 at 05:44:31PM +0100, Will Deacon wrote:

On x86 memory accesses to pages without the ACCESSED flag set result in the
ACCESSED flag being set automatically. With the ARM architecture a page access
fault is raised instead (and it will continue to be raised until the ACCESSED
flag is set for the appropriate PTE/PMD).

For normal memory pages, handle_pte_fault will call pte_mkyoung (effectively
setting the ACCESSED flag). For transparent huge pages, pmd_mkyoung will only
be called for a write fault.

This patch ensures that faults on transparent hugepages which do not result
in a CoW update the access flags for the faulting pmd.

Cc: Chris Metcalf <cmetc...@tilera.com>
Cc: Kirill A. Shutemov <kir...@shutemov.name>
Cc: Andrea Arcangeli <aarca...@redhat.com>
Signed-off-by: Will Deacon <will.dea...@arm.com>

Acked-by: Johannes Weiner 


Ok chaps, I rebased this thing onto today's next (which basically
necessitated a rewrite) so I've reluctantly dropped my acks and kindly
ask if you could eyeball the new code, especially where the locking is
concerned. In the numa code (do_huge_pmd_prot_none), Peter checks again
that the page is not splitting, but I can't see why that is required.

I don't either.  If the thing was splitting when the fault happened,
that path is not taken.  And the locked pmd_same() check should rule
out splitting setting in after testing pmd_trans_huge_splitting().


Why can't I find the function pmd_trans_huge_splitting() you mentioned in
the latest mainline code and linux-next?




Peter?



Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state

2012-10-25 Thread Ni zhan Chen

On 10/26/2012 09:27 AM, Fengguang Wu wrote:

On Fri, Oct 26, 2012 at 11:25:44AM +1100, Dave Chinner wrote:

On Thu, Oct 25, 2012 at 10:58:26AM +0800, Fengguang Wu wrote:

Hi Chen,


But how can bdi related ra_pages reflect different files' readahead
window? Maybe these different files are sequential read, random read
and so on.

It's simple: sequential reads will get ra_pages readahead size while
random reads will not get readahead at all.

Talking about the below chunk, it might hurt someone that explicitly
takes advantage of the behavior, however the ra_pages*2 seems more
like a hack than general solution to me: if the user will need
POSIX_FADV_SEQUENTIAL to double the max readahead window size for
improving IO performance, then why not just increase bdi->ra_pages and
benefit all reads? One may argue that it offers some differential
behavior to specific applications, however it may also present as a
counter-optimization: if the root already tuned bdi->ra_pages to the
optimal size, the doubled readahead size will only cost more memory
and perhaps IO latency.

--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -87,7 +86,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
  spin_unlock(&file->f_lock);
 break;
 case POSIX_FADV_SEQUENTIAL:
-   file->f_ra.ra_pages = bdi->ra_pages * 2;

I think we really have to reset file->f_ra.ra_pages here as it is
not a set-and-forget value. e.g.  shrink_readahead_size_eio() can
reduce ra_pages as a result of IO errors. Hence if you have had io
errors, telling the kernel that you are now going to do  sequential
IO should reset the readahead to the maximum ra_pages value
supported

Good point!

 but wait  this patch removes file->f_ra.ra_pages in all other
places too, so there will be no file->f_ra.ra_pages to be reset here...


In his patch,

 static void shrink_readahead_size_eio(struct file *filp,
struct file_ra_state *ra)
 {
-   ra->ra_pages /= 4;
+   spin_lock(&filp->f_lock);
+   filp->f_mode |= FMODE_RANDOM;
+   spin_unlock(&filp->f_lock);

As the example in the comment above this function shows, the read may still
be sequential, and it will waste IO bandwidth if we switch to FMODE_RANDOM
directly.
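
(Editor's aside: Dave's suggestion quoted above would amount to something
like the following in mm/fadvise.c; the exact hunk is an assumption, not
code from the thread.)

	case POSIX_FADV_SEQUENTIAL:
		/* Hypothetical: instead of doubling (the removed line above),
		 * reset to the device default so that any earlier shrink from
		 * shrink_readahead_size_eio() is undone.
		 */
		file->f_ra.ra_pages = bdi->ra_pages;
		break;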




Thanks,
Fengguang





Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state

2012-10-25 Thread Ni zhan Chen

On 10/26/2012 08:25 AM, Dave Chinner wrote:

On Thu, Oct 25, 2012 at 10:58:26AM +0800, Fengguang Wu wrote:

Hi Chen,


But how can bdi related ra_pages reflect different files' readahead
window? Maybe these different files are sequential read, random read
and so on.

It's simple: sequential reads will get ra_pages readahead size while
random reads will not get readahead at all.

Talking about the below chunk, it might hurt someone that explicitly
takes advantage of the behavior, however the ra_pages*2 seems more
like a hack than general solution to me: if the user will need
POSIX_FADV_SEQUENTIAL to double the max readahead window size for
improving IO performance, then why not just increase bdi->ra_pages and
benefit all reads? One may argue that it offers some differential
behavior to specific applications, however it may also present as a
counter-optimization: if the root already tuned bdi->ra_pages to the
optimal size, the doubled readahead size will only cost more memory
and perhaps IO latency.

--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -87,7 +86,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
  spin_unlock(&file->f_lock);
 break;
 case POSIX_FADV_SEQUENTIAL:
-   file->f_ra.ra_pages = bdi->ra_pages * 2;

I think we really have to reset file->f_ra.ra_pages here as it is
not a set-and-forget value. e.g.  shrink_readahead_size_eio() can
reduce ra_pages as a result of IO errors. Hence if you have had io
errors, telling the kernel that you are now going to do  sequential
IO should reset the readahead to the maximum ra_pages value
supported


Good catch!



Cheers,

Dave.




Re: shmem_getpage_gfp VM_BUG_ON triggered. [3.7rc2]

2012-10-25 Thread Ni zhan Chen

On 10/26/2012 05:48 AM, Hugh Dickins wrote:

On Thu, 25 Oct 2012, Johannes Weiner wrote:

On Wed, Oct 24, 2012 at 09:36:27PM -0700, Hugh Dickins wrote:

On Wed, 24 Oct 2012, Dave Jones wrote:


Machine under significant load (4gb memory used, swap usage fluctuating)
triggered this...

WARNING: at mm/shmem.c:1151 shmem_getpage_gfp+0xa5c/0xa70()
Pid: 29795, comm: trinity-child4 Not tainted 3.7.0-rc2+ #49

1148                 error = shmem_add_to_page_cache(page, mapping, index,
1149                                         gfp, swp_to_radix_entry(swap));
1150                 /* We already confirmed swap, and make no allocation */
1151                 VM_BUG_ON(error);
1152         }

That's very surprising.  Easy enough to handle an error there, but
of course I made it a VM_BUG_ON because it violates my assumptions:
I rather need to understand how this can be, and I've no idea.

Could it be concurrent truncation clearing out the entry between
shmem_confirm_swap() and shmem_add_to_page_cache()?  I don't see
anything preventing that.

The empty slot would not match the expected swap entry this call
passes in and the returned error would be -ENOENT.

Excellent notion, many thanks Hannes, I believe you've got it.

I've hit that truncation problem in swapoff (and commented on it
in shmem_unuse_inode), but never hit it or considered it here.
I think of the page lock as holding it stable, but truncation's
free_swap_and_cache only does a trylock on the swapcache page,
so we're not secured against that possibility.


Hi Hugh,

Even though free_swap_and_cache only does a trylock on the swapcache
page, it doesn't call delete_from_swap_cache, and the associated
entry should still be there. I am interested in what you have already
introduced to protect it?




So I'd like to change it to VM_BUG_ON(error && error != -ENOENT),
but there's a little tidying up to do in the -ENOENT case, which


Do you mean radix_tree_insert will return -ENOENT if the associated
entry is not present? Why can't I find this return value in the function
radix_tree_insert?
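
(Editor's note: the -ENOENT does not come from radix_tree_insert. When
shmem_add_to_page_cache is passed an expected swap entry it goes through
shmem's own replace helper, which, roughly paraphrased from mm/shmem.c of
that era, returns -ENOENT when the slot no longer holds the expected entry:)

static int shmem_radix_tree_replace(struct address_space *mapping,
				    pgoff_t index, void *expected,
				    void *replacement)
{
	void **pslot;
	void *item;

	pslot = radix_tree_lookup_slot(&mapping->page_tree, index);
	if (!pslot)
		return -ENOENT;
	item = radix_tree_deref_slot_protected(pslot, &mapping->tree_lock);
	if (item != expected)
		return -ENOENT;	/* e.g. a racing truncation emptied the slot */
	if (!replacement)
		radix_tree_delete(&mapping->page_tree, index);
	else
		radix_tree_replace_slot(pslot, replacement);
	return 0;
}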



needs more thought.  A delete_from_swap_cache(page) - though we
can be lazy and leave that to reclaim for such a rare occurrence -
and probably a mem_cgroup uncharge; but the memcg hooks are always
the hardest to get right, I'll have think about that one carefully.

Hugh



Re: shmem_getpage_gfp VM_BUG_ON triggered. [3.7rc2]

2012-10-25 Thread Ni zhan Chen

On 10/26/2012 05:27 AM, Hugh Dickins wrote:

On Thu, 25 Oct 2012, Ni zhan Chen wrote:

On 10/25/2012 02:59 PM, Hugh Dickins wrote:

On Thu, 25 Oct 2012, Ni zhan Chen wrote:

I think it may be caused by your commit [d189922862e03ce: shmem: fix
negative rss in memcg memory.stat], one question:

Well, yes, I added the VM_BUG_ON in that commit.


if the function shmem_confirm_swap confirms the entry has already been brought
back from swap by a racing thread,

The reverse: true confirms that the swap entry has not been brought back
from swap by a racing thread; false indicates that there has been a race.


then why call shmem_add_to_page_cache to add the
page from swapcache to pagecache again?

Adding it to pagecache again, after such a race, would set error to
-EEXIST (originating from radix_tree_insert); but we don't do that,
we add it to pagecache when it has not already been added.

Or that's the intention: but Dave seems to have found an unexpected
exception, despite us holding the page lock across all this.

(But if it weren't for the memcg and replace_page issues, I'd much
prefer to let shmem_add_to_page_cache discover the race as before.)

Hugh

Hi Hugh

Thanks for your response. You mean the -EEXIST originating from
radix_tree_insert, in radix_tree_insert:
if (slot != NULL)
 return -EEXIST;
But why should slot be NULL? If there is no race, the pagecache-related radix
tree entry should be RADIX_TREE_EXCEPTIONAL_ENTRY + swap_entry_t.val; what am I missing?

I was describing what would happen in a case that should not exist,
that you had thought the common case.  In actuality, the entry should
not be NULL, it should be as you say there.


Thanks for your patience. So in the common case the entry should be the
value I mentioned; then why does radix_tree_insert have this check?

if (slot != NULL)
return -EEXIST;

The common case would then return -EEXIST.



Hugh





Re: shmem_getpage_gfp VM_BUG_ON triggered. [3.7rc2]

2012-10-25 Thread Ni zhan Chen

On 10/25/2012 02:59 PM, Hugh Dickins wrote:

On Thu, 25 Oct 2012, Ni zhan Chen wrote:

On 10/25/2012 12:36 PM, Hugh Dickins wrote:

On Wed, 24 Oct 2012, Dave Jones wrote:


Machine under significant load (4gb memory used, swap usage fluctuating)
triggered this...

WARNING: at mm/shmem.c:1151 shmem_getpage_gfp+0xa5c/0xa70()
Pid: 29795, comm: trinity-child4 Not tainted 3.7.0-rc2+ #49

1148                 error = shmem_add_to_page_cache(page, mapping, index,
1149                                         gfp, swp_to_radix_entry(swap));
1150                 /* We already confirmed swap, and make no allocation */
1151                 VM_BUG_ON(error);
1152         }

That's very surprising.  Easy enough to handle an error there, but
of course I made it a VM_BUG_ON because it violates my assumptions:
I rather need to understand how this can be, and I've no idea.

Clutching at straws, I expect this is entirely irrelevant, but:
there isn't a warning on line 1151 of mm/shmem.c in 3.7.0-rc2 nor
in current linux.git; rather, there's a VM_BUG_ON on line 1149.

So you've inserted a couple of lines for some reason (more useful
trinity behaviour, perhaps)?  And have some config option I'm
unfamiliar with, that mutates a BUG_ON or VM_BUG_ON into a warning?

Hi Hugh,

I think it may be caused by your commit [d189922862e03ce: shmem: fix negative
rss in memcg memory.stat], one question:

Well, yes, I added the VM_BUG_ON in that commit.


if the function shmem_confirm_swap confirms the entry has already been brought
back from swap by a racing thread,

The reverse: true confirms that the swap entry has not been brought back
from swap by a racing thread; false indicates that there has been a race.


then why call shmem_add_to_page_cache to add the
page from swapcache to pagecache again?

Adding it to pagecache again, after such a race, would set error to
-EEXIST (originating from radix_tree_insert); but we don't do that,
we add it to pagecache when it has not already been added.

Or that's the intention: but Dave seems to have found an unexpected
exception, despite us holding the page lock across all this.

(But if it weren't for the memcg and replace_page issues, I'd much
prefer to let shmem_add_to_page_cache discover the race as before.)

Hugh


Hi Hugh

Thanks for your response. You mean the -EEXIST originating from 
radix_tree_insert, in radix_tree_insert:

if (slot != NULL)
return -EEXIST;
But why should slot be NULL? If there is no race, the pagecache-related radix
tree entry should be RADIX_TREE_EXCEPTIONAL_ENTRY + swap_entry_t.val;
what am I missing?


Regards,
Chen




otherwise, will it goto unlock and then go to repeat? What am I missing?

Regards,
Chen




Re: shmem_getpage_gfp VM_BUG_ON triggered. [3.7rc2]

2012-10-25 Thread Ni zhan Chen

On 10/25/2012 02:59 PM, Hugh Dickins wrote:

On Thu, 25 Oct 2012, Ni zhan Chen wrote:

On 10/25/2012 12:36 PM, Hugh Dickins wrote:

On Wed, 24 Oct 2012, Dave Jones wrote:


Machine under significant load (4gb memory used, swap usage fluctuating)
triggered this...

WARNING: at mm/shmem.c:1151 shmem_getpage_gfp+0xa5c/0xa70()
Pid: 29795, comm: trinity-child4 Not tainted 3.7.0-rc2+ #49

1148 error = shmem_add_to_page_cache(page,
mapping, index,
1149 gfp,
swp_to_radix_entry(swap));
1150 /* We already confirmed swap, and make no
allocation */
1151 VM_BUG_ON(error);
1152 }

That's very surprising.  Easy enough to handle an error there, but
of course I made it a VM_BUG_ON because it violates my assumptions:
I rather need to understand how this can be, and I've no idea.

Clutching at straws, I expect this is entirely irrelevant, but:
there isn't a warning on line 1151 of mm/shmem.c in 3.7.0-rc2 nor
in current linux.git; rather, there's a VM_BUG_ON on line 1149.

So you've inserted a couple of lines for some reason (more useful
trinity behaviour, perhaps)?  And have some config option I'm
unfamiliar with, that mutates a BUG_ON or VM_BUG_ON into a warning?

Hi Hugh,

I think it maybe caused by your commit [d189922862e03ce: shmem: fix negative
rss in memcg memory.stat], one question:

Well, yes, I added the VM_BUG_ON in that commit.


if function shmem_confirm_swap confirm the entry has already brought back
from swap by a racing thread,

The reverse: true confirms that the swap entry has not been brought back
from swap by a racing thread; false indicates that there has been a race.


then why call shmem_add_to_page_cache to add
page from swapcache to pagecache again?

Adding it to pagecache again, after such a race, would set error to
-EEXIST (originating from radix_tree_insert); but we don't do that,
we add it to pagecache when it has not already been added.

Or that's the intention: but Dave seems to have found an unexpected
exception, despite us holding the page lock across all this.

(But if it weren't for the memcg and replace_page issues, I'd much
prefer to let shmem_add_to_page_cache discover the race as before.)

Hugh


Hi Hugh

Thanks for your response. You mean the -EEXIST originating from 
radix_tree_insert, in radix_tree_insert:

if (slot != NULL)
return -EEXIST;
But why slot should be NULL? if no race, the pagecache related radix 
tree entry should be RADIX_TREE_EXCEPTIONAL_ENTRY+swap_entry_t.val, 
where I miss?


Regards,
Chen




otherwise, will goto unlock and then go to repeat? where I miss?

Regards,
Chen


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: shmem_getpage_gfp VM_BUG_ON triggered. [3.7rc2]

2012-10-25 Thread Ni zhan Chen

On 10/26/2012 05:27 AM, Hugh Dickins wrote:

On Thu, 25 Oct 2012, Ni zhan Chen wrote:

On 10/25/2012 02:59 PM, Hugh Dickins wrote:

On Thu, 25 Oct 2012, Ni zhan Chen wrote:

I think it maybe caused by your commit [d189922862e03ce: shmem: fix
negative
rss in memcg memory.stat], one question:

Well, yes, I added the VM_BUG_ON in that commit.


if function shmem_confirm_swap confirm the entry has already brought back
from swap by a racing thread,

The reverse: true confirms that the swap entry has not been brought back
from swap by a racing thread; false indicates that there has been a race.


then why call shmem_add_to_page_cache to add
page from swapcache to pagecache again?

Adding it to pagecache again, after such a race, would set error to
-EEXIST (originating from radix_tree_insert); but we don't do that,
we add it to pagecache when it has not already been added.

Or that's the intention: but Dave seems to have found an unexpected
exception, despite us holding the page lock across all this.

(But if it weren't for the memcg and replace_page issues, I'd much
prefer to let shmem_add_to_page_cache discover the race as before.)

Hugh

Hi Hugh

Thanks for your response. You mean the -EEXIST originating from
radix_tree_insert, in radix_tree_insert:
if (slot != NULL)
 return -EEXIST;
But why slot should be NULL? if no race, the pagecache related radix tree
entry should be RADIX_TREE_EXCEPTIONAL_ENTRY+swap_entry_t.val, where I miss?

I was describing what would happen in a case that should not exist,
that you had thought the common case.  In actuality, the entry should
not be NULL, it should be as you say there.


Thanks for your patience. So in the common case, the entry should be the 
value I mentioned, then why has this check?

if (slot != NULL)
return -EEXIST;

the common case will return -EEXIST.



Hugh



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: shmem_getpage_gfp VM_BUG_ON triggered. [3.7rc2]

2012-10-25 Thread Ni zhan Chen

On 10/26/2012 05:48 AM, Hugh Dickins wrote:

On Thu, 25 Oct 2012, Johannes Weiner wrote:

On Wed, Oct 24, 2012 at 09:36:27PM -0700, Hugh Dickins wrote:

On Wed, 24 Oct 2012, Dave Jones wrote:


Machine under significant load (4gb memory used, swap usage fluctuating)
triggered this...

WARNING: at mm/shmem.c:1151 shmem_getpage_gfp+0xa5c/0xa70()
Pid: 29795, comm: trinity-child4 Not tainted 3.7.0-rc2+ #49

1148 error = shmem_add_to_page_cache(page, mapping, 
index,
1149 gfp, 
swp_to_radix_entry(swap));
1150 /* We already confirmed swap, and make no 
allocation */
1151 VM_BUG_ON(error);
1152 }

That's very surprising.  Easy enough to handle an error there, but
of course I made it a VM_BUG_ON because it violates my assumptions:
I rather need to understand how this can be, and I've no idea.

Could it be concurrent truncation clearing out the entry between
shmem_confirm_swap() and shmem_add_to_page_cache()?  I don't see
anything preventing that.

The empty slot would not match the expected swap entry this call
passes in and the returned error would be -ENOENT.

Excellent notion, many thanks Hannes, I believe you've got it.

I've hit that truncation problem in swapoff (and commented on it
in shmem_unuse_inode), but never hit it or considered it here.
I think of the page lock as holding it stable, but truncation's
free_swap_and_cache only does a trylock on the swapcache page,
so we're not secured against that possibility.
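To illustrate the window described above, a rough sketch of the trylock in
free_swap_and_cache() (paraphrased from memory of mm/swapfile.c of that era, not the
exact code):

        /* truncation only *tries* to lock the swapcache page, so it does not
         * wait for whoever currently holds the page lock */
        page = find_get_page(&swapper_space, entry.val);
        if (page && !trylock_page(page)) {
                page_cache_release(page);
                page = NULL;    /* give up on the cache cleanup */
        }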


Hi Hugh,

Even though free_swap_and_cache only does a trylock on the swapcache 
page, it doesn't call delete_from_swap_cache, so the associated 
entry should still be there. I am interested in what you have already 
introduced to protect it?




So I'd like to change it to VM_BUG_ON(error && error != -ENOENT),
but there's a little tidying up to do in the -ENOENT case, which


Do you mean radix_tree_insert will return -ENOENT if the associated 
entry is not present? Why can't I find this return value in the function 
radix_tree_insert?



needs more thought.  A delete_from_swap_cache(page) - though we
can be lazy and leave that to reclaim for such a rare occurrence -
and probably a mem_cgroup uncharge; but the memcg hooks are always
the hardest to get right, I'll have think about that one carefully.

Hugh



Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state

2012-10-25 Thread Ni zhan Chen

On 10/26/2012 08:25 AM, Dave Chinner wrote:

On Thu, Oct 25, 2012 at 10:58:26AM +0800, Fengguang Wu wrote:

Hi Chen,


But how can bdi related ra_pages reflect different files' readahead
window? Maybe these different files are sequential read, random read
and so on.

It's simple: sequential reads will get ra_pages readahead size while
random reads will not get readahead at all.

Talking about the below chunk, it might hurt someone that explicitly
takes advantage of the behavior, however the ra_pages*2 seems more
like a hack than general solution to me: if the user will need
POSIX_FADV_SEQUENTIAL to double the max readahead window size for
improving IO performance, then why not just increase bdi->ra_pages and
benefit all reads? One may argue that it offers some differential
behavior to specific applications, however it may also present as a
counter-optimization: if the root already tuned bdi->ra_pages to the
optimal size, the doubled readahead size will only cost more memory
and perhaps IO latency.

--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -87,7 +86,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t 
len, int advice)
 spin_unlock(&file->f_lock);
 break;
 case POSIX_FADV_SEQUENTIAL:
-   file->f_ra.ra_pages = bdi->ra_pages * 2;

I think we really have to reset file->f_ra.ra_pages here as it is
not a set-and-forget value. e.g.  shrink_readahead_size_eio() can
reduce ra_pages as a result of IO errors. Hence if you have had io
errors, telling the kernel that you are now going to do  sequential
IO should reset the readahead to the maximum ra_pages value
supported


Good catch!



Cheers,

Dave.




Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state

2012-10-25 Thread Ni zhan Chen

On 10/26/2012 09:27 AM, Fengguang Wu wrote:

On Fri, Oct 26, 2012 at 11:25:44AM +1100, Dave Chinner wrote:

On Thu, Oct 25, 2012 at 10:58:26AM +0800, Fengguang Wu wrote:

Hi Chen,


But how can bdi related ra_pages reflect different files' readahead
window? Maybe these different files are sequential read, random read
and so on.

It's simple: sequential reads will get ra_pages readahead size while
random reads will not get readahead at all.

Talking about the below chunk, it might hurt someone that explicitly
takes advantage of the behavior, however the ra_pages*2 seems more
like a hack than general solution to me: if the user will need
POSIX_FADV_SEQUENTIAL to double the max readahead window size for
improving IO performance, then why not just increase bdi->ra_pages and
benefit all reads? One may argue that it offers some differential
behavior to specific applications, however it may also present as a
counter-optimization: if the root already tuned bdi->ra_pages to the
optimal size, the doubled readahead size will only cost more memory
and perhaps IO latency.

--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -87,7 +86,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t 
len, int advice)
 spin_unlock(&file->f_lock);
 break;
 case POSIX_FADV_SEQUENTIAL:
-   file->f_ra.ra_pages = bdi->ra_pages * 2;

I think we really have to reset file->f_ra.ra_pages here as it is
not a set-and-forget value. e.g.  shrink_readahead_size_eio() can
reduce ra_pages as a result of IO errors. Hence if you have had io
errors, telling the kernel that you are now going to do  sequential
IO should reset the readahead to the maximum ra_pages value
supported

Good point!

 but wait  this patch removes file->f_ra.ra_pages in all other
places too, so there will be no file->f_ra.ra_pages to be reset here...


In his patch,

 static void shrink_readahead_size_eio(struct file *filp,
struct file_ra_state *ra)
 {
-   ra->ra_pages /= 4;
+   spin_lock(&filp->f_lock);
+   filp->f_mode |= FMODE_RANDOM;
+   spin_unlock(&filp->f_lock);

As the example in the comment above this function shows, the read may still be 
sequential, and it will waste IO bandwidth if we switch to FMODE_RANDOM 
directly.




Thanks,
Fengguang





Re: [PATCH v3] mm: thp: Set the accessed flag for old pages on access fault.

2012-10-25 Thread Ni zhan Chen

On 10/26/2012 03:51 AM, Johannes Weiner wrote:

On Thu, Oct 25, 2012 at 05:44:31PM +0100, Will Deacon wrote:

On x86 memory accesses to pages without the ACCESSED flag set result in the
ACCESSED flag being set automatically. With the ARM architecture a page access
fault is raised instead (and it will continue to be raised until the ACCESSED
flag is set for the appropriate PTE/PMD).

For normal memory pages, handle_pte_fault will call pte_mkyoung (effectively
setting the ACCESSED flag). For transparent huge pages, pmd_mkyoung will only
be called for a write fault.

This patch ensures that faults on transparent hugepages which do not result
in a CoW update the access flags for the faulting pmd.
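For context, the fix described above boils down to a small helper along these lines (a
sketch of the approach from memory, not the exact patch; names and the exact flag
handling are approximate):

        /* Mark the huge pmd young on a non-CoW access fault.  orig_pmd is the
         * value observed before taking the lock, as described above. */
        static void huge_pmd_set_accessed(struct mm_struct *mm,
                                          struct vm_area_struct *vma,
                                          unsigned long address,
                                          pmd_t *pmd, pmd_t orig_pmd)
        {
                pmd_t entry;

                spin_lock(&mm->page_table_lock);
                if (unlikely(!pmd_same(*pmd, orig_pmd)))
                        goto unlock;    /* raced with split/zap: just retry the fault */

                entry = pmd_mkyoung(orig_pmd);
                if (pmdp_set_access_flags(vma, address & HPAGE_PMD_MASK, pmd, entry, 0))
                        update_mmu_cache_pmd(vma, address, pmd);
        unlock:
                spin_unlock(&mm->page_table_lock);
        }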

Cc: Chris Metcalf cmetc...@tilera.com
Cc: Kirill A. Shutemov kir...@shutemov.name
Cc: Andrea Arcangeli aarca...@redhat.com
Signed-off-by: Will Deacon will.dea...@arm.com

Acked-by: Johannes Weiner han...@cmpxchg.org


Ok chaps, I rebased this thing onto today's next (which basically
necessitated a rewrite) so I've reluctantly dropped my acks and kindly
ask if you could eyeball the new code, especially where the locking is
concerned. In the numa code (do_huge_pmd_prot_none), Peter checks again
that the page is not splitting, but I can't see why that is required.

I don't either.  If the thing was splitting when the fault happened,
that path is not taken.  And the locked pmd_same() check should rule
out splitting setting in after testing pmd_trans_huge_splitting().


Why can't I find the function pmd_trans_huge_splitting() you mentioned in 
the latest mainline code or in linux-next?




Peter?



Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state

2012-10-25 Thread Ni zhan Chen

On 10/26/2012 11:28 AM, YingHang Zhu wrote:

On Fri, Oct 26, 2012 at 10:30 AM, Ni zhan Chen nizhan.c...@gmail.com wrote:

On 10/26/2012 09:27 AM, Fengguang Wu wrote:

On Fri, Oct 26, 2012 at 11:25:44AM +1100, Dave Chinner wrote:

On Thu, Oct 25, 2012 at 10:58:26AM +0800, Fengguang Wu wrote:

Hi Chen,


But how can bdi related ra_pages reflect different files' readahead
window? Maybe these different files are sequential read, random read
and so on.

It's simple: sequential reads will get ra_pages readahead size while
random reads will not get readahead at all.

Talking about the below chunk, it might hurt someone that explicitly
takes advantage of the behavior, however the ra_pages*2 seems more
like a hack than general solution to me: if the user will need
POSIX_FADV_SEQUENTIAL to double the max readahead window size for
improving IO performance, then why not just increase bdi->ra_pages and
benefit all reads? One may argue that it offers some differential
behavior to specific applications, however it may also present as a
counter-optimization: if the root already tuned bdi->ra_pages to the
optimal size, the doubled readahead size will only cost more memory
and perhaps IO latency.

--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -87,7 +86,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset,
loff_t len, int advice)
  spin_unlock(&file->f_lock);
  break;
  case POSIX_FADV_SEQUENTIAL:
-   file->f_ra.ra_pages = bdi->ra_pages * 2;

I think we really have to reset file->f_ra.ra_pages here as it is
not a set-and-forget value. e.g.  shrink_readahead_size_eio() can
reduce ra_pages as a result of IO errors. Hence if you have had io
errors, telling the kernel that you are now going to do  sequential
IO should reset the readahead to the maximum ra_pages value
supported

Good point!

 but wait  this patch removes file->f_ra.ra_pages in all other
places too, so there will be no file->f_ra.ra_pages to be reset here...


In his patch,


  static void shrink_readahead_size_eio(struct file *filp,
 struct file_ra_state *ra)
  {
-   ra->ra_pages /= 4;
+   spin_lock(&filp->f_lock);
+   filp->f_mode |= FMODE_RANDOM;
+   spin_unlock(&filp->f_lock);

As the example in the comment above this function shows, the read may still be
sequential, and it will waste IO bandwidth if we switch to FMODE_RANDOM
directly.

I've considered about this. On the first try I modified file_ra_state.size and
file_ra_state.async_size directly, like

file_ra_state.async_size = 0;
file_ra_state.size /= 4;

but as what I comment here, we can not
predict whether the bad sectors will trash the readahead window, maybe the
following sectors after current one are ok to go in normal readahead,
it's hard to know,
the current approach gives us a chance to slow down softly.


Then when will the FMODE_RANDOM bit in filp->f_mode be checked? Will it 
influence ra->ra_pages?
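The FMODE_RANDOM check sits in the sync readahead entry point; roughly (a sketch from
memory of mm/readahead.c around that time, details approximate), it bypasses the
ra_pages-sized window entirely rather than shrinking it:

        void page_cache_sync_readahead(struct address_space *mapping,
                                       struct file_ra_state *ra, struct file *filp,
                                       pgoff_t offset, unsigned long req_size)
        {
                /* no readahead at all for this file/bdi */
                if (!ra->ra_pages)
                        return;

                /* random-read mode: read only what was asked for */
                if (filp && (filp->f_mode & FMODE_RANDOM)) {
                        force_page_cache_readahead(mapping, filp, offset, req_size);
                        return;
                }

                /* otherwise let the ondemand algorithm size the window */
                ondemand_readahead(mapping, ra, filp, false, offset, req_size);
        }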




Thanks,
 Ying Zhu

Thanks,
Fengguang





Re: shmem_getpage_gfp VM_BUG_ON triggered. [3.7rc2]

2012-10-24 Thread Ni zhan Chen

On 10/25/2012 12:36 PM, Hugh Dickins wrote:

On Wed, 24 Oct 2012, Dave Jones wrote:


Machine under significant load (4gb memory used, swap usage fluctuating)
triggered this...

WARNING: at mm/shmem.c:1151 shmem_getpage_gfp+0xa5c/0xa70()
Pid: 29795, comm: trinity-child4 Not tainted 3.7.0-rc2+ #49
Call Trace:
  [] warn_slowpath_common+0x7f/0xc0
  [] warn_slowpath_null+0x1a/0x20
  [] shmem_getpage_gfp+0xa5c/0xa70
  [] ? shmem_getpage_gfp+0x29e/0xa70
  [] shmem_fault+0x4f/0xa0
  [] __do_fault+0x71/0x5c0
  [] ? __lock_acquire+0x306/0x1ba0
  [] ? local_clock+0x89/0xa0
  [] handle_pte_fault+0x97/0xae0
  [] ? sub_preempt_count+0x79/0xd0
  [] ? delay_tsc+0xae/0x120
  [] ? __const_udelay+0x28/0x30
  [] handle_mm_fault+0x289/0x350
  [] __do_page_fault+0x18e/0x530
  [] ? local_clock+0x89/0xa0
  [] ? get_parent_ip+0x11/0x50
  [] ? get_parent_ip+0x11/0x50
  [] ? sub_preempt_count+0x79/0xd0
  [] ? rcu_user_exit+0xc9/0xf0
  [] do_page_fault+0x2b/0x50
  [] page_fault+0x28/0x30
  [] ? copy_user_enhanced_fast_string+0x9/0x20
  [] ? sys_futimesat+0x41/0xe0
  [] ? syscall_trace_enter+0x25/0x2c0
  [] ? tracesys+0x7e/0xe6
  [] tracesys+0xe1/0xe6



1148 error = shmem_add_to_page_cache(page, mapping, 
index,
1149 gfp, 
swp_to_radix_entry(swap));
1150 /* We already confirmed swap, and make no 
allocation */
1151 VM_BUG_ON(error);
1152 }

That's very surprising.  Easy enough to handle an error there, but
of course I made it a VM_BUG_ON because it violates my assumptions:
I rather need to understand how this can be, and I've no idea.

Clutching at straws, I expect this is entirely irrelevant, but:
there isn't a warning on line 1151 of mm/shmem.c in 3.7.0-rc2 nor
in current linux.git; rather, there's a VM_BUG_ON on line 1149.

So you've inserted a couple of lines for some reason (more useful
trinity behaviour, perhaps)?  And have some config option I'm
unfamiliar with, that mutates a BUG_ON or VM_BUG_ON into a warning?


Hi Hugh,

I think it may be caused by your commit [d189922862e03ce: shmem: fix 
negative rss in memcg memory.stat]; one question:


If the function shmem_confirm_swap confirms that the entry has already been 
brought back from swap by a racing thread, then why call shmem_add_to_page_cache 
to add the page from swapcache to pagecache again? Otherwise, will it goto 
unlock and then go to repeat? What am I missing?


Regards,
Chen



Hugh



             total       used       free     shared    buffers     cached
Mem:       3885528    2854064    1031464          0       9624      19208
-/+ buffers/cache:    2825232    1060296
Swap:      6029308      30656    5998652



Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state

2012-10-24 Thread Ni zhan Chen

On 10/25/2012 10:04 AM, YingHang Zhu wrote:

On Thu, Oct 25, 2012 at 9:50 AM, Dave Chinner  wrote:

On Thu, Oct 25, 2012 at 08:17:05AM +0800, YingHang Zhu wrote:

On Thu, Oct 25, 2012 at 4:19 AM, Dave Chinner  wrote:

On Wed, Oct 24, 2012 at 07:53:59AM +0800, YingHang Zhu wrote:

Hi Dave,
On Wed, Oct 24, 2012 at 6:47 AM, Dave Chinner  wrote:

On Tue, Oct 23, 2012 at 08:46:51PM +0800, Ying Zhu wrote:

Hi,
   Recently we ran into the bug that an opened file's ra_pages does not
synchronize with it's backing device's when the latter is changed
with blockdev --setra, the application needs to reopen the file
to know the change,

or simply call fadvise(fd, POSIX_FADV_NORMAL) to reset the readhead
window to the (new) bdi default.


which is inappropriate under our circumstances.

Which are? We don't know your circumstances, so you need to tell us
why you need this and why existing methods of handling such changes
are insufficient...

Optimal readahead windows tend to be a physical property of the
storage and that does not tend to change dynamically. Hence block
device readahead should only need to be set up once, and generally
that can be done before the filesystem is mounted and files are
opened (e.g. via udev rules). Hence you need to explain why you need
to change the default block device readahead on the fly, and why
fadvise(POSIX_FADV_NORMAL) is "inappropriate" to set readahead
windows to the new defaults.
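For completeness, the application-side reset Dave refers to is a one-liner; a minimal
example under the assumption that the application can be changed (the function name is
only illustrative, error handling omitted):

        #include <fcntl.h>

        /* After the admin changes the device readahead (e.g. blockdev --setra),
         * an already-open fd can pick up the new bdi default like this. */
        int reset_readahead(int fd)
        {
                return posix_fadvise(fd, 0, 0, POSIX_FADV_NORMAL);
        }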

Our system is a fuse-based file system, fuse creates a
pseudo backing device for the user space file systems, the default readahead
size is 128KB and it can't fully utilize the backing storage's read ability,
so we should tune it.

Sure, but that doesn't tell me anything about why you can't do this
at mount time before the application opens any files. i.e.  you've
simply stated the reason why readahead is tunable, not why you need
to be fully dynamic.

We store our file system's data on different disks so we need to change ra_pages
dynamically according to where the data resides, it can't be fixed at mount time
or when we open files.

That doesn't make a whole lot of sense to me. let me try to get this
straight.

There is data that resides on two devices (A + B), and a fuse
filesystem to access that data. There is a single file in the fuse
fs has data on both devices. An app has the file open, and when the
data it is accessing is on device A you need to set the readahead to
what is best for device A? And when the app tries to access data for
that file that is on device B, you need to set the readahead to what
is best for device B? And you are changing the fuse BDI readahead
settings according to where the data in the back end lies?

It seems to me that you should be setting the fuse readahead to the
maximum of the readahead windows the data devices have configured at
mount time and leaving it at that

Then it may not fully utilize some device's read IO bandwidth and put too much
burden on other devices.

The abstract bdi of fuse and btrfs provides some dynamically changing
bdi.ra_pages
based on the real backing device. IMHO this should not be ignored.

btrfs simply takes into account the number of disks it has for a
given storage pool when setting up the default bdi ra_pages during
mount.  This is basically doing what I suggested above.  Same with
the generic fuse code - it's simply setting a sensible default value
for the given fuse configuration.

Neither are dynamic in the sense you are talking about, though.

Actually I've talked about it with Fengguang, he advised we should unify the


But how can the bdi-related ra_pages reflect different files' readahead 
windows? These different files may be read sequentially, randomly, and 
so on.



ra_pages in struct bdi and file_ra_state and leave the issue that
spreading data
across disks as it is.
Fengguang, what's you opinion about this?

Thanks,
  Ying Zhu

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com



Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state

2012-10-24 Thread Ni zhan Chen

On 10/25/2012 08:17 AM, YingHang Zhu wrote:

On Thu, Oct 25, 2012 at 4:19 AM, Dave Chinner  wrote:

On Wed, Oct 24, 2012 at 07:53:59AM +0800, YingHang Zhu wrote:

Hi Dave,
On Wed, Oct 24, 2012 at 6:47 AM, Dave Chinner  wrote:

On Tue, Oct 23, 2012 at 08:46:51PM +0800, Ying Zhu wrote:

Hi,
   Recently we ran into the bug that an opened file's ra_pages does not
synchronize with it's backing device's when the latter is changed
with blockdev --setra, the application needs to reopen the file
to know the change,

or simply call fadvise(fd, POSIX_FADV_NORMAL) to reset the readhead
window to the (new) bdi default.


which is inappropriate under our circumstances.

Which are? We don't know your circumstances, so you need to tell us
why you need this and why existing methods of handling such changes
are insufficient...

Optimal readahead windows tend to be a physical property of the
storage and that does not tend to change dynamically. Hence block
device readahead should only need to be set up once, and generally
that can be done before the filesystem is mounted and files are
opened (e.g. via udev rules). Hence you need to explain why you need
to change the default block device readahead on the fly, and why
fadvise(POSIX_FADV_NORMAL) is "inappropriate" to set readahead
windows to the new defaults.

Our system is a fuse-based file system, fuse creates a
pseudo backing device for the user space file systems, the default readahead
size is 128KB and it can't fully utilize the backing storage's read ability,
so we should tune it.

Sure, but that doesn't tell me anything about why you can't do this
at mount time before the application opens any files. i.e.  you've
simply stated the reason why readahead is tunable, not why you need
to be fully dynamic.

We store our file system's data on different disks so we need to change ra_pages
dynamically according to where the data resides, it can't be fixed at mount time
or when we open files.
The abstract bdi of fuse and btrfs provides some dynamically changing
bdi.ra_pages
based on the real backing device. IMHO this should not be ignored.


And how do we tune ra_pages if one big file is distributed across different 
disks? I think Fengguang Wu can answer these questions.


Hi Fengguang,


The above third-party application using our file system maintains
some long-opened files, we does not have any chances
to force them to call fadvise(POSIX_FADV_NORMAL). :(

So raise a bug/feature request with the third party.  Modifying
kernel code because you can't directly modify the application isn't
the best solution for anyone. This really is an application problem
- the kernel already provides the mechanisms to solve this
problem...  :/

Thanks for advice, I will consult the above application's developers
for more information.
Now from the code itself should we merge the gap between the real
device's ra_pages and the file's?
Obviously the ra_pages is duplicated, otherwise each time we run into this
problem, someone will do the same work as I have done here.

Thanks,
  Ying Zhu

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com



Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state

2012-10-24 Thread Ni zhan Chen

On 10/25/2012 08:17 AM, YingHang Zhu wrote:

On Thu, Oct 25, 2012 at 4:19 AM, Dave Chinner da...@fromorbit.com wrote:

On Wed, Oct 24, 2012 at 07:53:59AM +0800, YingHang Zhu wrote:

Hi Dave,
On Wed, Oct 24, 2012 at 6:47 AM, Dave Chinner da...@fromorbit.com wrote:

On Tue, Oct 23, 2012 at 08:46:51PM +0800, Ying Zhu wrote:

Hi,
   Recently we ran into the bug that an opened file's ra_pages does not
synchronize with it's backing device's when the latter is changed
with blockdev --setra, the application needs to reopen the file
to know the change,

or simply call fadvise(fd, POSIX_FADV_NORMAL) to reset the readhead
window to the (new) bdi default.


which is inappropriate under our circumstances.

Which are? We don't know your circumstances, so you need to tell us
why you need this and why existing methods of handling such changes
are insufficient...

Optimal readahead windows tend to be a physical property of the
storage and that does not tend to change dynamically. Hence block
device readahead should only need to be set up once, and generally
that can be done before the filesystem is mounted and files are
opened (e.g. via udev rules). Hence you need to explain why you need
to change the default block device readahead on the fly, and why
fadvise(POSIX_FADV_NORMAL) is inappropriate to set readahead
windows to the new defaults.

Our system is a fuse-based file system, fuse creates a
pseudo backing device for the user space file systems, the default readahead
size is 128KB and it can't fully utilize the backing storage's read ability,
so we should tune it.

Sure, but that doesn't tell me anything about why you can't do this
at mount time before the application opens any files. i.e.  you've
simply stated the reason why readahead is tunable, not why you need
to be fully dynamic.

We store our file system's data on different disks so we need to change ra_pages
dynamically according to where the data resides, it can't be fixed at mount time
or when we open files.
The abstract bdi of fuse and btrfs provides some dynamically changing
bdi.ra_pages
based on the real backing device. IMHO this should not be ignored.


And how do we tune ra_pages if one big file is distributed across different 
disks? I think Fengguang Wu can answer these questions.


Hi Fengguang,


The above third-party application using our file system maintains
some long-opened files, we does not have any chances
to force them to call fadvise(POSIX_FADV_NORMAL). :(

So raise a bug/feature request with the third party.  Modifying
kernel code because you can't directly modify the application isn't
the best solution for anyone. This really is an application problem
- the kernel already provides the mechanisms to solve this
problem...  :/

Thanks for advice, I will consult the above application's developers
for more information.
Now from the code itself should we merge the gap between the real
device's ra_pages and the file's?
Obviously the ra_pages is duplicated, otherwise each time we run into this
problem, someone will do the same work as I have done here.

Thanks,
  Ying Zhu

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com



Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state

2012-10-24 Thread Ni zhan Chen

On 10/25/2012 10:04 AM, YingHang Zhu wrote:

On Thu, Oct 25, 2012 at 9:50 AM, Dave Chinner da...@fromorbit.com wrote:

On Thu, Oct 25, 2012 at 08:17:05AM +0800, YingHang Zhu wrote:

On Thu, Oct 25, 2012 at 4:19 AM, Dave Chinner da...@fromorbit.com wrote:

On Wed, Oct 24, 2012 at 07:53:59AM +0800, YingHang Zhu wrote:

Hi Dave,
On Wed, Oct 24, 2012 at 6:47 AM, Dave Chinner da...@fromorbit.com wrote:

On Tue, Oct 23, 2012 at 08:46:51PM +0800, Ying Zhu wrote:

Hi,
   Recently we ran into the bug that an opened file's ra_pages does not
synchronize with it's backing device's when the latter is changed
with blockdev --setra, the application needs to reopen the file
to know the change,

or simply call fadvise(fd, POSIX_FADV_NORMAL) to reset the readhead
window to the (new) bdi default.


which is inappropriate under our circumstances.

Which are? We don't know your circumstances, so you need to tell us
why you need this and why existing methods of handling such changes
are insufficient...

Optimal readahead windows tend to be a physical property of the
storage and that does not tend to change dynamically. Hence block
device readahead should only need to be set up once, and generally
that can be done before the filesystem is mounted and files are
opened (e.g. via udev rules). Hence you need to explain why you need
to change the default block device readahead on the fly, and why
fadvise(POSIX_FADV_NORMAL) is inappropriate to set readahead
windows to the new defaults.

Our system is a fuse-based file system, fuse creates a
pseudo backing device for the user space file systems, the default readahead
size is 128KB and it can't fully utilize the backing storage's read ability,
so we should tune it.

Sure, but that doesn't tell me anything about why you can't do this
at mount time before the application opens any files. i.e.  you've
simply stated the reason why readahead is tunable, not why you need
to be fully dynamic.

We store our file system's data on different disks so we need to change ra_pages
dynamically according to where the data resides, it can't be fixed at mount time
or when we open files.

That doesn't make a whole lot of sense to me. let me try to get this
straight.

There is data that resides on two devices (A + B), and a fuse
filesystem to access that data. There is a single file in the fuse
fs has data on both devices. An app has the file open, and when the
data it is accessing is on device A you need to set the readahead to
what is best for device A? And when the app tries to access data for
that file that is on device B, you need to set the readahead to what
is best for device B? And you are changing the fuse BDI readahead
settings according to where the data in the back end lies?

It seems to me that you should be setting the fuse readahead to the
maximum of the readahead windows the data devices have configured at
mount time and leaving it at that

Then it may not fully utilize some device's read IO bandwidth and put too much
burden on other devices.

The abstract bdi of fuse and btrfs provides some dynamically changing
bdi.ra_pages
based on the real backing device. IMHO this should not be ignored.

btrfs simply takes into account the number of disks it has for a
given storage pool when setting up the default bdi ra_pages during
mount.  This is basically doing what I suggested above.  Same with
the generic fuse code - it's simply setting a sensible default value
for the given fuse configuration.

Neither are dynamic in the sense you are talking about, though.

Actually I've talked about it with Fengguang, he advised we should unify the


But how can the bdi-related ra_pages reflect different files' readahead 
windows? These different files may be read sequentially, randomly, and 
so on.



ra_pages in struct bdi and file_ra_state and leave the issue that
spreading data
across disks as it is.
Fengguang, what's you opinion about this?

Thanks,
  Ying Zhu

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com



Re: shmem_getpage_gfp VM_BUG_ON triggered. [3.7rc2]

2012-10-24 Thread Ni zhan Chen

On 10/25/2012 12:36 PM, Hugh Dickins wrote:

On Wed, 24 Oct 2012, Dave Jones wrote:


Machine under significant load (4gb memory used, swap usage fluctuating)
triggered this...

WARNING: at mm/shmem.c:1151 shmem_getpage_gfp+0xa5c/0xa70()
Pid: 29795, comm: trinity-child4 Not tainted 3.7.0-rc2+ #49
Call Trace:
  [8107100f] warn_slowpath_common+0x7f/0xc0
  [8107106a] warn_slowpath_null+0x1a/0x20
  [811903fc] shmem_getpage_gfp+0xa5c/0xa70
  [8118fc3e] ? shmem_getpage_gfp+0x29e/0xa70
  [81190e4f] shmem_fault+0x4f/0xa0
  [8119f391] __do_fault+0x71/0x5c0
  [810e1ac6] ? __lock_acquire+0x306/0x1ba0
  [810b6ff9] ? local_clock+0x89/0xa0
  [811a2767] handle_pte_fault+0x97/0xae0
  [816d1069] ? sub_preempt_count+0x79/0xd0
  [8136d68e] ? delay_tsc+0xae/0x120
  [8136d578] ? __const_udelay+0x28/0x30
  [811a4a39] handle_mm_fault+0x289/0x350
  [816d091e] __do_page_fault+0x18e/0x530
  [810b6ff9] ? local_clock+0x89/0xa0
  [810b0e51] ? get_parent_ip+0x11/0x50
  [810b0e51] ? get_parent_ip+0x11/0x50
  [816d1069] ? sub_preempt_count+0x79/0xd0
  [8112d389] ? rcu_user_exit+0xc9/0xf0
  [816d0ceb] do_page_fault+0x2b/0x50
  [816cd3b8] page_fault+0x28/0x30
  [8136d259] ? copy_user_enhanced_fast_string+0x9/0x20
  [8121c181] ? sys_futimesat+0x41/0xe0
  [8102bf35] ? syscall_trace_enter+0x25/0x2c0
  [816d5625] ? tracesys+0x7e/0xe6
  [816d5688] tracesys+0xe1/0xe6



1148 error = shmem_add_to_page_cache(page, mapping, 
index,
1149 gfp, 
swp_to_radix_entry(swap));
1150 /* We already confirmed swap, and make no 
allocation */
1151 VM_BUG_ON(error);
1152 }

That's very surprising.  Easy enough to handle an error there, but
of course I made it a VM_BUG_ON because it violates my assumptions:
I rather need to understand how this can be, and I've no idea.

Clutching at straws, I expect this is entirely irrelevant, but:
there isn't a warning on line 1151 of mm/shmem.c in 3.7.0-rc2 nor
in current linux.git; rather, there's a VM_BUG_ON on line 1149.

So you've inserted a couple of lines for some reason (more useful
trinity behaviour, perhaps)?  And have some config option I'm
unfamiliar with, that mutates a BUG_ON or VM_BUG_ON into a warning?


Hi Hugh,

I think it may be caused by your commit [d189922862e03ce: shmem: fix 
negative rss in memcg memory.stat]; one question:


If the function shmem_confirm_swap confirms that the entry has already been 
brought back from swap by a racing thread, then why call shmem_add_to_page_cache 
to add the page from swapcache to pagecache again? Otherwise, will it goto 
unlock and then go to repeat? What am I missing?


Regards,
Chen



Hugh



             total       used       free     shared    buffers     cached
Mem:       3885528    2854064    1031464          0       9624      19208
-/+ buffers/cache:    2825232    1060296
Swap:      6029308      30656    5998652



Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state

2012-10-23 Thread Ni zhan Chen

On 10/23/2012 09:41 PM, YingHang Zhu wrote:

Sorry for the annoyance, I forgot the ccs in the previous mail.
Thanks,
  Ying Zhu
Hi Chen,

On Tue, Oct 23, 2012 at 9:21 PM, Ni zhan Chen  wrote:

On 10/23/2012 08:46 PM, Ying Zhu wrote:

Hi,
Recently we ran into the bug that an opened file's ra_pages does not
synchronize with it's backing device's when the latter is changed
with blockdev --setra, the application needs to reopen the file
to know the change, which is inappropriate under our circumstances.


Could you tell me in which function do this synchronize stuff?

With this patch we use bdi.ra_pages directly, so change bdi.ra_pages also
change an opened file's ra_pages.



This bug is also mentioned in scst (generic SCSI target subsystem for
Linux)'s
README file.
This patch tries to unify the ra_pages in struct file_ra_state
and struct backing_dev_info. Basically current readahead algorithm
will ramp file_ra_state.ra_pages up to bdi.ra_pages once it detects the


You mean the ondemand readahead algorithm will do this? I don't think so.
file_ra_state_init is only called in the btrfs path, correct?

No, it's also called in do_dentry_open.
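(For reference, file_ra_state_init just seeds the per-file window from the bdi default
at open time; roughly, from memory of mm/readahead.c:

        void file_ra_state_init(struct file_ra_state *ra,
                                struct address_space *mapping)
        {
                ra->ra_pages = mapping->backing_dev_info->ra_pages;
                ra->prev_pos = -1;
        }

so a later blockdev --setra on the device is not reflected in already-open files.)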



read mode is sequential. Then all files sharing the same backing device
have the same max value bdi.ra_pages set in file_ra_state.


Why remove file_ra_state? If one file is read sequentially and another file is
read randomly, how can the global bdi.ra_pages indicate the max
readahead window of each file?

This patch does not remove file_ra_state, an file's readahead window
is determined
by it's backing device.


As Dave said, the backing device readahead window doesn't tend to change 
dynamically, but the file readahead window does; it will change when 
sequential reads, random reads, thrashing, interleaved reads and so on occur.
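That per-file ramping is visible in the window-growth helper; roughly (from memory of
mm/readahead.c, so treat it as a sketch), the per-file size climbs toward the maximum
on each sequential hit:

        /* max is the ra_pages ceiling for this file */
        static unsigned long get_next_ra_size(struct file_ra_state *ra,
                                              unsigned long max)
        {
                unsigned long cur = ra->size;
                unsigned long newsize;

                if (cur < max / 16)
                        newsize = 4 * cur;
                else
                        newsize = 2 * cur;

                return min(newsize, max);
        }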





Applying this means the flags POSIX_FADV_NORMAL and
POSIX_FADV_SEQUENTIAL
in fadvise will only set file reading mode without signifying the
max readahead size of the file. The current approach adds no additional
overhead in read IO path, IMHO is the simplest solution.
Any comments are welcome, thanks in advance.


Could you show me how you tested this patch?

This patch brings no performance gain, just fixes some functional bugs.
By reading a 500MB file, the default max readahead size of the
backing device was 128KB, after applying this patch, the read file's
max ra_pages
changed when I tuned the device's read ahead size with blockdev.



Thanks,
 Ying Zhu

Signed-off-by: Ying Zhu 
---
   include/linux/fs.h |1 -
   mm/fadvise.c   |2 --
   mm/filemap.c   |   17 +++--
   mm/readahead.c |8 
   4 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 17fd887..36303a5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -991,7 +991,6 @@ struct file_ra_state {
 unsigned int async_size;/* do asynchronous readahead when
there are only # of pages ahead
*/
   - unsigned int ra_pages;  /* Maximum readahead window */
 unsigned int mmap_miss; /* Cache miss stat for mmap
accesses */
 loff_t prev_pos;/* Cache last read() position */
   };
diff --git a/mm/fadvise.c b/mm/fadvise.c
index 469491e..75e2378 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -76,7 +76,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset,
loff_t len, int advice)
 switch (advice) {
 case POSIX_FADV_NORMAL:
-   file->f_ra.ra_pages = bdi->ra_pages;
 spin_lock(&file->f_lock);
 file->f_mode &= ~FMODE_RANDOM;
 spin_unlock(&file->f_lock);
@@ -87,7 +86,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset,
loff_t len, int advice)
 spin_unlock(&file->f_lock);
 break;
 case POSIX_FADV_SEQUENTIAL:
-   file->f_ra.ra_pages = bdi->ra_pages * 2;
 spin_lock(&file->f_lock);
 file->f_mode &= ~FMODE_RANDOM;
 spin_unlock(&file->f_lock);
diff --git a/mm/filemap.c b/mm/filemap.c
index a4a5260..e7e4409 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1058,11 +1058,15 @@ EXPORT_SYMBOL(grab_cache_page_nowait);
* readahead(R+4...B+3) => bang => read(R+4) => read(R+5) => ..
*
* It is going insane. Fix it by quickly scaling down the readahead
size.
+ * It's hard to estimate how the bad sectors lay out, so to be
conservative,
+ * set the read mode in random.
*/
   static void shrink_readahead_size_eio(struct file *filp,
 struct file_ra_state *ra)
   {
-   ra->ra_pages /= 4;
+   spin_lock(&filp->f_lock);
+   filp->f_mode |= FMODE_RANDOM;
+   spin_unlock(&filp->f_lock);
   }
 /**
@@ -1527,12 +1531,12 @@ static void do_sync_mmap_readahead(struct
vm_area_struct *vma,
 /* If we don't want any read-ahead, don't bothe

Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state

2012-10-23 Thread Ni zhan Chen

On 10/23/2012 08:46 PM, Ying Zhu wrote:

Hi,
   Recently we ran into the bug that an opened file's ra_pages does not
synchronize with it's backing device's when the latter is changed
with blockdev --setra, the application needs to reopen the file
to know the change, which is inappropriate under our circumstances.


Could you tell me which function does this synchronization?


This bug is also mentioned in scst (generic SCSI target subsystem for Linux)'s
README file.
   This patch tries to unify the ra_pages in struct file_ra_state
and struct backing_dev_info. Basically current readahead algorithm
will ramp file_ra_state.ra_pages up to bdi.ra_pages once it detects the


You mean the ondemand readahead algorithm will do this? I don't think so. 
file_ra_state_init is only called in the btrfs path, correct?



read mode is sequential. Then all files sharing the same backing device
have the same max value bdi.ra_pages set in file_ra_state.


Why remove file_ra_state? If one file is read sequentially and another 
file is read randomly, how can the global bdi.ra_pages indicate the 
max readahead window of each file?



   Applying this means the flags POSIX_FADV_NORMAL and POSIX_FADV_SEQUENTIAL
in fadvise will only set file reading mode without signifying the
max readahead size of the file. The current approach adds no additional
overhead in read IO path, IMHO is the simplest solution.
Any comments are welcome, thanks in advance.


Could you show me how you tested this patch?



Thanks,
Ying Zhu

Signed-off-by: Ying Zhu 
---
  include/linux/fs.h |1 -
  mm/fadvise.c   |2 --
  mm/filemap.c   |   17 +++--
  mm/readahead.c |8 
  4 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 17fd887..36303a5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -991,7 +991,6 @@ struct file_ra_state {
unsigned int async_size;/* do asynchronous readahead when
   there are only # of pages ahead */
  
-	unsigned int ra_pages;		/* Maximum readahead window */

unsigned int mmap_miss; /* Cache miss stat for mmap accesses */
loff_t prev_pos;/* Cache last read() position */
  };
diff --git a/mm/fadvise.c b/mm/fadvise.c
index 469491e..75e2378 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -76,7 +76,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t 
len, int advice)
  
  	switch (advice) {

case POSIX_FADV_NORMAL:
-   file->f_ra.ra_pages = bdi->ra_pages;
spin_lock(&file->f_lock);
file->f_mode &= ~FMODE_RANDOM;
spin_unlock(&file->f_lock);
@@ -87,7 +86,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t 
len, int advice)
spin_unlock(&file->f_lock);
break;
case POSIX_FADV_SEQUENTIAL:
-   file->f_ra.ra_pages = bdi->ra_pages * 2;
spin_lock(&file->f_lock);
file->f_mode &= ~FMODE_RANDOM;
spin_unlock(&file->f_lock);
diff --git a/mm/filemap.c b/mm/filemap.c
index a4a5260..e7e4409 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1058,11 +1058,15 @@ EXPORT_SYMBOL(grab_cache_page_nowait);
   * readahead(R+4...B+3) => bang => read(R+4) => read(R+5) => ..
   *
   * It is going insane. Fix it by quickly scaling down the readahead size.
+ * It's hard to estimate how the bad sectors lay out, so to be conservative,
+ * set the read mode in random.
   */
  static void shrink_readahead_size_eio(struct file *filp,
struct file_ra_state *ra)
  {
-   ra->ra_pages /= 4;
+   spin_lock(&filp->f_lock);
+   filp->f_mode |= FMODE_RANDOM;
+   spin_unlock(&filp->f_lock);
  }
  
  /**

@@ -1527,12 +1531,12 @@ static void do_sync_mmap_readahead(struct 
vm_area_struct *vma,
/* If we don't want any read-ahead, don't bother */
if (VM_RandomReadHint(vma))
return;
-   if (!ra->ra_pages)
+   if (!mapping->backing_dev_info->ra_pages)
return;
  
  	if (VM_SequentialReadHint(vma)) {

-   page_cache_sync_readahead(mapping, ra, file, offset,
- ra->ra_pages);
+   page_cache_sync_readahead(mapping, ra, file, offset,
+   mapping->backing_dev_info->ra_pages);
return;
}
  
@@ -1550,7 +1554,7 @@ static void do_sync_mmap_readahead(struct vm_area_struct *vma,

/*
 * mmap read-around
 */
-   ra_pages = max_sane_readahead(ra->ra_pages);
+   ra_pages = max_sane_readahead(mapping->backing_dev_info->ra_pages);
ra->start = max_t(long, 0, offset - ra_pages / 2);
ra->size = ra_pages;
ra->async_size = ra_pages / 4;
@@ -1576,7 +1580,8 @@ static void do_async_mmap_readahead(struct vm_area_struct 
*vma,
ra->mmap_miss--;
if 

Re: [PATCH v2 00/12] memory-hotplug: hot-remove physical memory

2012-10-23 Thread Ni zhan Chen

On 10/23/2012 06:30 PM, we...@cn.fujitsu.com wrote:

From: Wen Congyang 


The patchset doesn't support kernel memory hot-remove, correct? If the 
answer is yes, you should point that out in your patchset changelog.




The patch-set was divided from following thread's patch-set.

https://lkml.org/lkml/2012/9/5/201

The last version of this patchset:
https://lkml.org/lkml/2012/10/5/469

If you want to know the reason, please read following thread.

https://lkml.org/lkml/2012/10/2/83

The patch-set has only the function of kernel core side for physical
memory hot remove. So if you use the patch, please apply following
patches.

- bug fix for memory hot remove
   https://lkml.org/lkml/2012/10/19/56
   
- acpi framework

   https://lkml.org/lkml/2012/10/19/156

The patches can free/remove the following things:

   - /sys/firmware/memmap/X/{end, start, type} : [PATCH 2/10]
   - mem_section and related sysfs files   : [PATCH 3-4/10]
   - memmap of sparse-vmemmap  : [PATCH 5-7/10]
   - page table of removed memory  : [RFC PATCH 8/10]
   - node and related sysfs files  : [RFC PATCH 9-10/10]

* [PATCH 2/10] checks whether the memory can be removed or not.

If you find lack of function for physical memory hot-remove, please let me
know.

How to test this patchset?
1. apply this patchset and build the kernel. MEMORY_HOTPLUG, MEMORY_HOTREMOVE,
ACPI_HOTPLUG_MEMORY must be selected.
2. load the module acpi_memhotplug
3. hotplug the memory device(it depends on your hardware)
You will see the memory device under the directory /sys/bus/acpi/devices/.
Its name is PNP0C80:XX.
4. online/offline pages provided by this memory device
You can write online/offline to /sys/devices/system/memory/memoryX/state to
online/offline pages provided by this memory device
5. hotremove the memory device
You can hotremove the memory device by the hardware, or writing 1 to
/sys/bus/acpi/devices/PNP0C80:XX/eject.

Note: if the memory provided by the memory device is used by the kernel, it
can't be offlined. It is not a bug.
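A minimal sketch of step 4 above in C (the block number 32 is only an example and error
handling is trimmed; it simply mirrors writing "offline" into the sysfs state file):

        #include <stdio.h>

        /* Write "online" or "offline" to a memory block's sysfs state file. */
        static int set_memory_block_state(int block, const char *state)
        {
                char path[128];
                FILE *f;

                snprintf(path, sizeof(path),
                         "/sys/devices/system/memory/memory%d/state", block);
                f = fopen(path, "w");
                if (!f)
                        return -1;
                fprintf(f, "%s\n", state);
                return fclose(f);
        }

        int main(void)
        {
                return set_memory_block_state(32, "offline") ? 1 : 0;
        }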

Known problems:
1. hotremoving memory device may cause kernel panicked
This bug will be fixed by Liu Jiang's patch:
https://lkml.org/lkml/2012/7/3/1


Changelogs from v1 to v2:
  Patch1: new patch, offline memory twice. 1st iterate: offline every non 
primary
  memory block. 2nd iterate: offline primary (i.e. first added) memory
  block.

  Patch3: new patch, no logical change, just remove redundant code.

  Patch9: merge the patch from wujianguo into this patch. flush tlb on all cpu
  after the pagetable is changed.

  Patch12: new patch, free node_data when a node is offlined

Wen Congyang (6):
   memory-hotplug: try to offline the memory twice to avoid dependence
   memory-hotplug: remove redundant codes
   memory-hotplug: introduce new function arch_remove_memory() for
 removing page table depends on architecture
   memory-hotplug: remove page table of x86_64 architecture
   memory-hotplug: remove sysfs file of node
   memory-hotplug: free node_data when a node is offlined

Yasuaki Ishimatsu (6):
   memory-hotplug: check whether all memory blocks are offlined or not
 when removing memory
   memory-hotplug: remove /sys/firmware/memmap/X sysfs
   memory-hotplug: unregister memory section on SPARSEMEM_VMEMMAP
   memory-hotplug: implement register_page_bootmem_info_section of
 sparse-vmemmap
   memory-hotplug: remove memmap of sparse-vmemmap
   memory-hotplug: memory_hotplug: clear zone when removing the memory

  arch/ia64/mm/discontig.c |   14 ++
  arch/ia64/mm/init.c  |   18 ++
  arch/powerpc/mm/init_64.c|   14 ++
  arch/powerpc/mm/mem.c|   12 +
  arch/s390/mm/init.c  |   12 +
  arch/s390/mm/vmem.c  |   14 ++
  arch/sh/mm/init.c|   17 ++
  arch/sparc/mm/init_64.c  |   14 ++
  arch/tile/mm/init.c  |8 +
  arch/x86/include/asm/pgtable_types.h |1 +
  arch/x86/mm/init_32.c|   12 +
  arch/x86/mm/init_64.c|  409 ++
  arch/x86/mm/pageattr.c   |   47 ++--
  drivers/acpi/acpi_memhotplug.c   |8 +-
  drivers/base/memory.c|6 +
  drivers/firmware/memmap.c|   98 -
  include/linux/firmware-map.h |6 +
  include/linux/memory_hotplug.h   |   15 +-
  include/linux/mm.h   |5 +-
  mm/memory_hotplug.c  |  409 --
  mm/sparse.c  |5 +-
  21 files changed, 1087 insertions(+), 57 deletions(-)


Re: [PATCH v2 00/12] memory-hotplug: hot-remove physical memory

2012-10-23 Thread Ni zhan Chen

On 10/23/2012 06:30 PM, we...@cn.fujitsu.com wrote:

From: Wen Congyang we...@cn.fujitsu.com


The patchset doesn't support kernel memory hot-remove, correct? If the 
answer is yes, you should point that out in your patchset changelog.




The patch-set was divided from following thread's patch-set.

https://lkml.org/lkml/2012/9/5/201

The last version of this patchset:
https://lkml.org/lkml/2012/10/5/469

If you want to know the reason, please read following thread.

https://lkml.org/lkml/2012/10/2/83

The patch-set has only the function of kernel core side for physical
memory hot remove. So if you use the patch, please apply following
patches.

- bug fix for memory hot remove
   https://lkml.org/lkml/2012/10/19/56
   
- acpi framework

   https://lkml.org/lkml/2012/10/19/156

The patches can free/remove the following things:

   - /sys/firmware/memmap/X/{end, start, type} : [PATCH 2/10]
   - mem_section and related sysfs files   : [PATCH 3-4/10]
   - memmap of sparse-vmemmap  : [PATCH 5-7/10]
   - page table of removed memory  : [RFC PATCH 8/10]
   - node and related sysfs files  : [RFC PATCH 9-10/10]

* [PATCH 2/10] checks whether the memory can be removed or not.

If you find lack of function for physical memory hot-remove, please let me
know.

How to test this patchset?
1. apply this patchset and build the kernel. MEMORY_HOTPLUG, MEMORY_HOTREMOVE,
ACPI_HOTPLUG_MEMORY must be selected.
2. load the module acpi_memhotplug
3. hotplug the memory device(it depends on your hardware)
You will see the memory device under the directory /sys/bus/acpi/devices/.
Its name is PNP0C80:XX.
4. online/offline pages provided by this memory device
You can write online/offline to /sys/devices/system/memory/memoryX/state to
online/offline pages provided by this memory device
5. hotremove the memory device
You can hotremove the memory device by the hardware, or writing 1 to
/sys/bus/acpi/devices/PNP0C80:XX/eject.

Note: if the memory provided by the memory device is used by the kernel, it
can't be offlined. It is not a bug.

Known problems:
1. hotremoving memory device may cause kernel panicked
This bug will be fixed by Liu Jiang's patch:
https://lkml.org/lkml/2012/7/3/1


Changelogs from v1 to v2:
  Patch1: new patch, offline memory twice. 1st iterate: offline every non 
primary
  memory block. 2nd iterate: offline primary (i.e. first added) memory
  block.

  Patch3: new patch, no logical change, just remove reduntant codes.

  Patch9: merge the patch from wujianguo into this patch. flush tlb on all cpu
  after the pagetable is changed.

  Patch12: new patch, free node_data when a node is offlined

Wen Congyang (6):
   memory-hotplug: try to offline the memory twice to avoid dependence
   memory-hotplug: remove redundant codes
   memory-hotplug: introduce new function arch_remove_memory() for
 removing page table depends on architecture
   memory-hotplug: remove page table of x86_64 architecture
   memory-hotplug: remove sysfs file of node
   memory-hotplug: free node_data when a node is offlined

Yasuaki Ishimatsu (6):
   memory-hotplug: check whether all memory blocks are offlined or not
 when removing memory
   memory-hotplug: remove /sys/firmware/memmap/X sysfs
   memory-hotplug: unregister memory section on SPARSEMEM_VMEMMAP
   memory-hotplug: implement register_page_bootmem_info_section of
 sparse-vmemmap
   memory-hotplug: remove memmap of sparse-vmemmap
   memory-hotplug: memory_hotplug: clear zone when removing the memory

  arch/ia64/mm/discontig.c |   14 ++
  arch/ia64/mm/init.c  |   18 ++
  arch/powerpc/mm/init_64.c|   14 ++
  arch/powerpc/mm/mem.c|   12 +
  arch/s390/mm/init.c  |   12 +
  arch/s390/mm/vmem.c  |   14 ++
  arch/sh/mm/init.c|   17 ++
  arch/sparc/mm/init_64.c  |   14 ++
  arch/tile/mm/init.c  |8 +
  arch/x86/include/asm/pgtable_types.h |1 +
  arch/x86/mm/init_32.c|   12 +
  arch/x86/mm/init_64.c|  409 ++
  arch/x86/mm/pageattr.c   |   47 ++--
  drivers/acpi/acpi_memhotplug.c   |8 +-
  drivers/base/memory.c|6 +
  drivers/firmware/memmap.c|   98 -
  include/linux/firmware-map.h |6 +
  include/linux/memory_hotplug.h   |   15 +-
  include/linux/mm.h   |5 +-
  mm/memory_hotplug.c  |  409 --
  mm/sparse.c  |5 +-
  21 files changed, 1087 insertions(+), 57 deletions(-)


Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state

2012-10-23 Thread Ni zhan Chen

On 10/23/2012 08:46 PM, Ying Zhu wrote:

Hi,
   Recently we ran into the bug that an opened file's ra_pages does not
synchronize with it's backing device's when the latter is changed
with blockdev --setra, the application needs to reopen the file
to know the change, which is inappropriate under our circumstances.


Could you tell me in which function do this synchronize stuff?


This bug is also mentioned in scst (generic SCSI target subsystem for Linux)'s
README file.
   This patch tries to unify the ra_pages in struct file_ra_state
and struct backing_dev_info. Basically current readahead algorithm
will ramp file_ra_state.ra_pages up to bdi.ra_pages once it detects the


You mean ondemand readahead algorithm will do this? I don't think so. 
file_ra_state_init only called in btrfs path, correct?



read mode is sequential. Then all files sharing the same backing device
have the same max value bdi.ra_pages set in file_ra_state.


why remove file_ra_state? If one file is read sequential and another 
file is read ramdom, how can use the global bdi.ra_pages to indicate the 
max readahead window of each file?



   Applying this means the flags POSIX_FADV_NORMAL and POSIX_FADV_SEQUENTIAL
in fadivse will only set file reading mode without signifying the
max readahead size of the file. The current apporach adds no additional
overhead in read IO path, IMHO is the simplest solution.
Any comments are welcome, thanks in advance.


Could you show me how you test this patch?



Thanks,
Ying Zhu

Signed-off-by: Ying Zhu casualfis...@gmail.com
---
  include/linux/fs.h |1 -
  mm/fadvise.c   |2 --
  mm/filemap.c   |   17 +++--
  mm/readahead.c |8 
  4 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 17fd887..36303a5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -991,7 +991,6 @@ struct file_ra_state {
unsigned int async_size;/* do asynchronous readahead when
   there are only # of pages ahead */
  
-	unsigned int ra_pages;		/* Maximum readahead window */

unsigned int mmap_miss; /* Cache miss stat for mmap accesses */
loff_t prev_pos;/* Cache last read() position */
  };
diff --git a/mm/fadvise.c b/mm/fadvise.c
index 469491e..75e2378 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -76,7 +76,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t 
len, int advice)
  
  	switch (advice) {

case POSIX_FADV_NORMAL:
-   file->f_ra.ra_pages = bdi->ra_pages;
spin_lock(&file->f_lock);
file->f_mode &= ~FMODE_RANDOM;
spin_unlock(&file->f_lock);
@@ -87,7 +86,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t 
len, int advice)
spin_unlock(&file->f_lock);
break;
case POSIX_FADV_SEQUENTIAL:
-   file->f_ra.ra_pages = bdi->ra_pages * 2;
spin_lock(&file->f_lock);
file->f_mode &= ~FMODE_RANDOM;
spin_unlock(&file->f_lock);
diff --git a/mm/filemap.c b/mm/filemap.c
index a4a5260..e7e4409 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1058,11 +1058,15 @@ EXPORT_SYMBOL(grab_cache_page_nowait);
   * readahead(R+4...B+3) => bang => read(R+4) => read(R+5) => ..
   *
   * It is going insane. Fix it by quickly scaling down the readahead size.
+ * It's hard to estimate how the bad sectors lay out, so to be conservative,
+ * set the read mode in random.
   */
  static void shrink_readahead_size_eio(struct file *filp,
struct file_ra_state *ra)
  {
-   ra->ra_pages /= 4;
+   spin_lock(&filp->f_lock);
+   filp->f_mode |= FMODE_RANDOM;
+   spin_unlock(&filp->f_lock);
  }
  
  /**

@@ -1527,12 +1531,12 @@ static void do_sync_mmap_readahead(struct 
vm_area_struct *vma,
/* If we don't want any read-ahead, don't bother */
if (VM_RandomReadHint(vma))
return;
-   if (!ra-ra_pages)
+   if (!mapping-backing_dev_info-ra_pages)
return;
  
  	if (VM_SequentialReadHint(vma)) {

-		page_cache_sync_readahead(mapping, ra, file, offset,
-					  ra->ra_pages);
+		page_cache_sync_readahead(mapping, ra, file, offset,
+				mapping->backing_dev_info->ra_pages);
return;
}
  
@@ -1550,7 +1554,7 @@ static void do_sync_mmap_readahead(struct vm_area_struct *vma,

/*
 * mmap read-around
 */
-	ra_pages = max_sane_readahead(ra->ra_pages);
+	ra_pages = max_sane_readahead(mapping->backing_dev_info->ra_pages);
	ra->start = max_t(long, 0, offset - ra_pages / 2);
	ra->size = ra_pages;
	ra->async_size = ra_pages / 4;
@@ -1576,7 +1580,8 @@ static void do_async_mmap_readahead(struct vm_area_struct 
*vma,
	ra->mmap_miss--;
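
A side note on the FMODE_RANDOM change above: the reason setting the flag
effectively disables readahead is visible in the sync readahead entry point,
which in mainline of roughly this era looks like the sketch below. Once
FMODE_RANDOM is set, only the pages the caller actually asked for are read,
instead of a gradually shrinking window.

void page_cache_sync_readahead(struct address_space *mapping,
			       struct file_ra_state *ra, struct file *filp,
			       pgoff_t offset, unsigned long req_size)
{
	/* no read-ahead configured at all */
	if (!ra->ra_pages)
		return;

	/*
	 * FMODE_RANDOM: skip the ondemand heuristics and only read the
	 * pages that were explicitly requested.
	 */
	if (filp && (filp->f_mode & FMODE_RANDOM)) {
		force_page_cache_readahead(mapping, filp, offset, req_size);
		return;
	}

	/* otherwise run the normal ondemand readahead state machine */
	ondemand_readahead(mapping, ra, filp, false, offset, req_size);
}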
 

Re: [PATCH] mm: readahead: remove redundant ra_pages in file_ra_state

2012-10-23 Thread Ni zhan Chen

On 10/23/2012 09:41 PM, YingHang Zhu wrote:

Sorry for the noise, I forgot the CCs in the previous mail.
Thanks,
  Ying Zhu
Hi Chen,

On Tue, Oct 23, 2012 at 9:21 PM, Ni zhan Chen nizhan.c...@gmail.com wrote:

On 10/23/2012 08:46 PM, Ying Zhu wrote:

Hi,
Recently we ran into the bug that an opened file's ra_pages does not
synchronize with its backing device's when the latter is changed
with blockdev --setra, the application needs to reopen the file
to know the change, which is inappropriate under our circumstances.


Could you tell me which function does this synchronization?

With this patch we use bdi.ra_pages directly, so changing bdi.ra_pages also
changes an opened file's ra_pages.



This bug is also mentioned in scst (generic SCSI target subsystem for
Linux)'s
README file.
This patch tries to unify the ra_pages in struct file_ra_state
and struct backing_dev_info. Basically current readahead algorithm
will ramp file_ra_state.ra_pages up to bdi.ra_pages once it detects the


You mean the ondemand readahead algorithm will do this? I don't think so.
file_ra_state_init is only called in the btrfs path, correct?

No, it's also called in do_dentry_open.



read mode is sequential. Then all files sharing the same backing device
have the same max value bdi.ra_pages set in file_ra_state.


Why remove file_ra_state? If one file is read sequentially and another file is
read randomly, how can the global bdi.ra_pages indicate the max
readahead window of each file?

This patch does not remove file_ra_state; a file's readahead window
is determined by its backing device.


As Dave said, the backing device's readahead window doesn't tend to change 
dynamically, but a file's readahead window does: it changes when sequential 
reads, random reads, thrashing, interleaved reads and so on occur.





Applying this means the flags POSIX_FADV_NORMAL and
POSIX_FADV_SEQUENTIAL
in fadvise will only set the file reading mode without signifying the
max readahead size of the file. The current approach adds no additional
overhead in the read IO path and IMHO is the simplest solution.
Any comments are welcome, thanks in advance.


Could you show me how you tested this patch?

This patch brings no performance gain, it just fixes some functional bugs.
I tested by reading a 500MB file; the default max readahead size of the
backing device was 128KB, and after applying this patch the file's max
ra_pages changed when I tuned the device's readahead size with blockdev.



Thanks,
 Ying Zhu

Signed-off-by: Ying Zhu casualfis...@gmail.com
---
   include/linux/fs.h |1 -
   mm/fadvise.c   |2 --
   mm/filemap.c   |   17 +++--
   mm/readahead.c |8 
   4 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 17fd887..36303a5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -991,7 +991,6 @@ struct file_ra_state {
 unsigned int async_size;/* do asynchronous readahead when
there are only # of pages ahead
*/
   - unsigned int ra_pages;  /* Maximum readahead window */
 unsigned int mmap_miss; /* Cache miss stat for mmap
accesses */
 loff_t prev_pos;/* Cache last read() position */
   };
diff --git a/mm/fadvise.c b/mm/fadvise.c
index 469491e..75e2378 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -76,7 +76,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset,
loff_t len, int advice)
 switch (advice) {
 case POSIX_FADV_NORMAL:
-		file->f_ra.ra_pages = bdi->ra_pages;
		spin_lock(&file->f_lock);
		file->f_mode &= ~FMODE_RANDOM;
		spin_unlock(&file->f_lock);
@@ -87,7 +86,6 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset,
loff_t len, int advice)
		spin_unlock(&file->f_lock);
		break;
	case POSIX_FADV_SEQUENTIAL:
-		file->f_ra.ra_pages = bdi->ra_pages * 2;
		spin_lock(&file->f_lock);
		file->f_mode &= ~FMODE_RANDOM;
		spin_unlock(&file->f_lock);
diff --git a/mm/filemap.c b/mm/filemap.c
index a4a5260..e7e4409 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1058,11 +1058,15 @@ EXPORT_SYMBOL(grab_cache_page_nowait);
* readahead(R+4...B+3) => bang => read(R+4) => read(R+5) => ......
*
* It is going insane. Fix it by quickly scaling down the readahead
size.
+ * It's hard to estimate how the bad sectors lay out, so to be
conservative,
+ * set the read mode in random.
*/
   static void shrink_readahead_size_eio(struct file *filp,
 struct file_ra_state *ra)
   {
-	ra->ra_pages /= 4;
+	spin_lock(&filp->f_lock);
+	filp->f_mode |= FMODE_RANDOM;
+	spin_unlock(&filp->f_lock);
   }
 /**
@@ -1527,12 +1531,12 @@ static void do_sync_mmap_readahead(struct
vm_area_struct *vma,
 /* If we don't want any read-ahead, don't bother
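
As a side note on the fadvise.c hunks in this thread: from userspace the two
advice values are issued as in the illustrative snippet below (not from the
patch; the file path is made up). Under the proposed change they would only
set the normal/sequential read mode (clearing FMODE_RANDOM) without also
adjusting f_ra.ra_pages.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/path/to/datafile", O_RDONLY);	/* hypothetical file */
	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* hint: the whole file will be read sequentially */
	posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
	/* ... sequential read loop ... */

	/* back to the default readahead behaviour */
	posix_fadvise(fd, 0, 0, POSIX_FADV_NORMAL);
	close(fd);
	return 0;
}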

Re: question on NUMA page migration

2012-10-20 Thread Ni zhan Chen

On 10/19/2012 11:53 PM, Rik van Riel wrote:

Hi Andrea, Peter,

I have a question on page refcounting in your NUMA
page migration code.

In Peter's case, I wonder why you introduce a new
MIGRATE_FAULT migration mode. If the normal page
migration / compaction logic can do without taking
an extra reference count, why does your code need it?


Hi Rik van Riel,

Which part of the code is this? Why can't I find MIGRATE_FAULT in the latest 
v3.7-rc2?


Regards,
Chen



In Andrea's case, we have a comment suggesting an
extra refcount is needed, immediately followed by
a put_page:

/*
 * Pin the head subpage at least until the first
 * __isolate_lru_page succeeds (__isolate_lru_page pins it
 * again when it succeeds). If we unpin before
 * __isolate_lru_page successd, the page could be freed and
 * reallocated out from under us. Thus our previous checks on
 * the page, and the split_huge_page, would be worthless.
 *
 * We really only need to do this if "ret > 0" but it doesn't
 * hurt to do it unconditionally as nobody can reference
 * "page" anymore after this and so we can avoid an "if (ret >
 * 0)" branch here.
 */
put_page(page);

This also confuses me.

If we do not need the extra refcount (and I do not
understand why NUMA migrate-on-fault needs one more
refcount than normal page migration), we can get
rid of the MIGRATE_FAULT mode.

If we do need the extra refcount, why is normal
page migration safe? :)



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
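
A note on the refcounting pattern discussed above: the quoted comment is
describing the usual pin-then-isolate sequence. A schematic sketch of that
sequence (not taken from either patchset; helper names as in mainline):

/* Schematic only: pin the page across the LRU isolation attempt. */
static int pin_and_isolate(struct page *page)
{
	get_page(page);			/* pin: page can't be freed under us */

	if (isolate_lru_page(page)) {	/* returns 0 on success, takes its own ref */
		put_page(page);		/* isolation failed: drop the pin */
		return -EBUSY;
	}

	put_page(page);			/* isolation holds a reference now */
	return 0;			/* caller hands the page to migration */
}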


Re: [PATCH v2 0/5] bugfix for memory hotplug

2012-10-17 Thread Ni zhan Chen

On 10/17/2012 08:08 PM, we...@cn.fujitsu.com wrote:

From: Wen Congyang 

Wen Congyang (5):
   memory-hotplug: skip HWPoisoned page when offlining pages
   memory-hotplug: update mce_bad_pages when removing the memory
   memory-hotplug: auto offline page_cgroup when onlining memory block
 failed
   memory-hotplug: fix NR_FREE_PAGES mismatch
   memory-hotplug: allocate zone's pcp before onlining pages


Oops, why didn't you write a changelog?



  include/linux/page-isolation.h |   10 ++
  mm/memory-failure.c|2 +-
  mm/memory_hotplug.c|   14 --
  mm/page_alloc.c|   37 -
  mm/page_cgroup.c   |3 +++
  mm/page_isolation.c|   27 ---
  mm/sparse.c|   21 +
  7 files changed, 87 insertions(+), 27 deletions(-)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majord...@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: em...@kvack.org



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 00/10] Introduce huge zero page

2012-10-16 Thread Ni zhan Chen

On 10/03/2012 08:04 AM, Kirill A. Shutemov wrote:

On Tue, Oct 02, 2012 at 03:31:48PM -0700, Andrew Morton wrote:

On Tue,  2 Oct 2012 18:19:22 +0300
"Kirill A. Shutemov"  wrote:


During testing I noticed big (up to 2.5 times) memory consumption overhead
on some workloads (e.g. ft.A from NPB) if THP is enabled.

The main reason for that big difference is lacking zero page in THP case.
We have to allocate a real page on read page fault.

A program to demonstrate the issue:
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

#define MB 1024*1024

int main(int argc, char **argv)
{
 char *p;
 int i;

 posix_memalign((void **)&p, 2 * MB, 200 * MB);
 for (i = 0; i < 200 * MB; i+= 4096)
 assert(p[i] == 0);
 pause();
 return 0;
}

With thp-never RSS is about 400k, but with thp-always it's 200M.
After the patchset thp-always RSS is 400k too.

I'd like to see a full description of the design, please.

Okay. Design overview.

Huge zero page (hzp) is a non-movable huge page (2M on x86-64) filled with
zeros.  The way how we allocate it changes in the patchset:

- [01/10] simplest way: hzp allocated on boot time in hugepage_init();
- [09/10] lazy allocation on first use;
- [10/10] lockless refcounting + shrinker-reclaimable hzp;

We setup it in do_huge_pmd_anonymous_page() if area around fault address
is suitable for THP and we've got read page fault.
If we fail to setup hzp (ENOMEM) we fallback to handle_pte_fault() as we
normally do in THP.

On wp fault to hzp we allocate real memory for the huge page and clear it.
If ENOMEM, graceful fallback: we create a new pmd table and set pte around
fault address to newly allocated normal (4k) page. All other ptes in the
pmd set to normal zero page.

We cannot split hzp (and it's bug if we try), but we can split the pmd
which points to it. On splitting the pmd we create a table with all ptes
set to normal zero page.

Patchset organized in bisect-friendly way:
  Patches 01-07: prepare all code paths for hzp
  Patch 08: all code paths are covered: safe to setup hzp
  Patch 09: lazy allocation
  Patch 10: lockless refcounting for hzp
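
To make the read-fault part of the design concrete, a rough sketch of what
the do_huge_pmd_anonymous_page() hook amounts to (simplified, not the literal
patch; set_huge_zero_page() stands in for whatever helper the series uses to
install the zero pmd, and mm/vma/haddr/pmd/flags are the fault-handler
arguments):

	/* read fault in a THP-suitable area: map the shared huge zero page */
	if (!(flags & FAULT_FLAG_WRITE)) {
		pgtable_t pgtable;

		pgtable = pte_alloc_one(mm, haddr);	/* deposited for a later split */
		if (unlikely(!pgtable))
			goto out;	/* ENOMEM: fall back to handle_pte_fault() */

		spin_lock(&mm->page_table_lock);
		set_huge_zero_page(pgtable, mm, vma, haddr, pmd);
		spin_unlock(&mm->page_table_lock);
		return 0;
	}
	/* write fault: allocate and clear a real huge page as before */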

--

By hpa request I've tried alternative approach for hzp implementation (see
Virtual huge zero page patchset): pmd table with all entries set to zero
page. This way should be more cache friendly, but it increases TLB
pressure.

The problem with virtual huge zero page: it requires per-arch enabling.
We need a way to mark that pmd table has all ptes set to zero page.

Some numbers to compare two implementations (on 4s Westmere-EX):

Mirobenchmark1
==

test:
 posix_memalign((void **)&p, 2 * MB, 8 * GB);
 for (i = 0; i < 100; i++) {
 assert(memcmp(p, p + 4*GB, 4*GB) == 0);
 asm volatile ("": : :"memory");
 }

hzp:
  Performance counter stats for './test_memcmp' (5 runs):

   32356.272845 task-clock#0.998 CPUs utilized  
  ( +-  0.13% )
 40 context-switches  #0.001 K/sec  
  ( +-  0.94% )
  0 CPU-migrations#0.000 K/sec
  4,218 page-faults   #0.130 K/sec  
  ( +-  0.00% )
 76,712,481,765 cycles#2.371 GHz
  ( +-  0.13% ) [83.31%]
 36,279,577,636 stalled-cycles-frontend   #   47.29% frontend cycles idle   
  ( +-  0.28% ) [83.35%]
  1,684,049,110 stalled-cycles-backend#2.20% backend  cycles idle   
  ( +-  2.96% ) [66.67%]
134,355,715,816 instructions  #1.75  insns per cycle
  #0.27  stalled cycles per 
insn  ( +-  0.10% ) [83.35%]
 13,526,169,702 branches  #  418.039 M/sec  
  ( +-  0.10% ) [83.31%]
  1,058,230 branch-misses #0.01% of all branches
  ( +-  0.91% ) [83.36%]

   32.413866442 seconds time elapsed
  ( +-  0.13% )

vhzp:
  Performance counter stats for './test_memcmp' (5 runs):

   30327.183829 task-clock#0.998 CPUs utilized  
  ( +-  0.13% )
 38 context-switches  #0.001 K/sec  
  ( +-  1.53% )
  0 CPU-migrations#0.000 K/sec
  4,218 page-faults   #0.139 K/sec  
  ( +-  0.01% )
 71,964,773,660 cycles#2.373 GHz
  ( +-  0.13% ) [83.35%]
 31,191,284,231 stalled-cycles-frontend   #   43.34% frontend cycles idle   
  ( +-  0.40% ) [83.32%]
773,484,474 stalled-cycles-backend#1.07% backend  cycles idle   
  ( +-  6.61% ) [66.67%]
134,982,215,437 instructions  #1.88  insns per cycle
  #0.23  stalled 

Re: [PATCH v4 00/10, REBASED] Introduce huge zero page

2012-10-16 Thread Ni zhan Chen

On 10/16/2012 07:28 PM, Kirill A. Shutemov wrote:

On Tue, Oct 16, 2012 at 07:13:07PM +0800, Ni zhan Chen wrote:

On 10/16/2012 06:54 PM, Kirill A. Shutemov wrote:

On Tue, Oct 16, 2012 at 05:53:07PM +0800, Ni zhan Chen wrote:

By hpa request I've tried alternative approach for hzp implementation (see
Virtual huge zero page patchset): pmd table with all entries set to zero
page. This way should be more cache friendly, but it increases TLB
pressure.

Thanks for your excellent work. But could you explain to me why the
current implementation is not cache friendly while hpa's proposed
approach is? Thanks in advance.

In workloads like microbenchmark1 you need N * size(zero page) cache
space to get zero page fully cached, where N is cache associativity.
If zero page is 2M, cache pressure is significant.

On other hand with table of 4k zero pages (hpa's proposal) will increase
pressure on TLB, since we have more pages for the same memory area. So we
have to do more page translation in this case.

On my test machine with simple memcmp() virtual huge zero page is faster.
But it highly depends on TLB size, cache size, memory access and page
translation costs.

It looks like cache size in modern processors grows faster than TLB size.
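
(Rough arithmetic for scale, my numbers rather than anything from the thread:
in microbenchmark1 the 8GB buffer is only ever read, so every mapping points
at a zero page. With the 2MB hzp that is 8GB / 2MB = 4096 TLB entries, but a
2MB physical page that the loop keeps touching and therefore competes for
cache; with a table of 4K zero pages it is 8GB / 4KB = ~2M TLB entries, but
only a single 4KB physical page to keep cached. That is the cache-vs-TLB
trade-off described above.)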

Oh, I see, thanks for your quick response. One more question below.


The problem with virtual huge zero page: it requires per-arch enabling.
We need a way to mark that pmd table has all ptes set to zero page.

Some numbers to compare two implementations (on 4s Westmere-EX):

Mirobenchmark1
==

test:
 posix_memalign((void **)&p, 2 * MB, 8 * GB);
 for (i = 0; i < 100; i++) {
 assert(memcmp(p, p + 4*GB, 4*GB) == 0);
 asm volatile ("": : :"memory");
 }

hzp:
  Performance counter stats for './test_memcmp' (5 runs):

   32356.272845 task-clock#0.998 CPUs utilized  
  ( +-  0.13% )
 40 context-switches  #0.001 K/sec  
  ( +-  0.94% )
  0 CPU-migrations#0.000 K/sec
  4,218 page-faults   #0.130 K/sec  
  ( +-  0.00% )
 76,712,481,765 cycles#2.371 GHz
  ( +-  0.13% ) [83.31%]
 36,279,577,636 stalled-cycles-frontend   #   47.29% frontend cycles idle   
  ( +-  0.28% ) [83.35%]
  1,684,049,110 stalled-cycles-backend#2.20% backend  cycles idle   
  ( +-  2.96% ) [66.67%]
134,355,715,816 instructions  #1.75  insns per cycle
  #0.27  stalled cycles per 
insn  ( +-  0.10% ) [83.35%]
 13,526,169,702 branches  #  418.039 M/sec  
  ( +-  0.10% ) [83.31%]
  1,058,230 branch-misses #0.01% of all branches
  ( +-  0.91% ) [83.36%]

   32.413866442 seconds time elapsed
  ( +-  0.13% )

vhzp:
  Performance counter stats for './test_memcmp' (5 runs):

   30327.183829 task-clock#0.998 CPUs utilized  
  ( +-  0.13% )
 38 context-switches  #0.001 K/sec  
  ( +-  1.53% )
  0 CPU-migrations#0.000 K/sec
  4,218 page-faults   #0.139 K/sec  
  ( +-  0.01% )
 71,964,773,660 cycles#2.373 GHz
  ( +-  0.13% ) [83.35%]
 31,191,284,231 stalled-cycles-frontend   #   43.34% frontend cycles idle   
  ( +-  0.40% ) [83.32%]
773,484,474 stalled-cycles-backend#1.07% backend  cycles idle   
  ( +-  6.61% ) [66.67%]
134,982,215,437 instructions  #1.88  insns per cycle
  #0.23  stalled cycles per 
insn  ( +-  0.11% ) [83.32%]
 13,509,150,683 branches  #  445.447 M/sec  
  ( +-  0.11% ) [83.34%]
  1,017,667 branch-misses #0.01% of all branches
  ( +-  1.07% ) [83.32%]

   30.381324695 seconds time elapsed
  ( +-  0.13% )

Could you tell me which data I should look at in these performance
counters? And what's the benefit of your current implementation
compared to hpa's approach?

Sorry for my ignorance. Could you tell me which data I should
look at in these performance counter stats? The same question applies to
the second benchmark's counter stats, thanks in advance. :-)

I've missed relevant counters in this run, you can see them in the second
benchmark.

Relevant counters:
L1-dcache-*, LLC-*: shows cache related stats (hits/misses);
dTLB-*: shows data TLB hits and misses.

Indirect relevant counters:
stalled-cycles-*: how long CPU pipeline has to wait for data.


Oh, I see, thanks for your patience. :-)




Mirobenchmark2
==

test:
 posix_memalign((v

Re: [PATCH v4 00/10, REBASED] Introduce huge zero page

2012-10-16 Thread Ni zhan Chen

On 10/16/2012 06:54 PM, Kirill A. Shutemov wrote:

On Tue, Oct 16, 2012 at 05:53:07PM +0800, Ni zhan Chen wrote:

By hpa request I've tried alternative approach for hzp implementation (see
Virtual huge zero page patchset): pmd table with all entries set to zero
page. This way should be more cache friendly, but it increases TLB
pressure.

Thanks for your excellent work. But could you explain to me why the
current implementation is not cache friendly while hpa's proposed
approach is? Thanks in advance.

In workloads like microbenchmark1 you need N * size(zero page) cache
space to get zero page fully cached, where N is cache associativity.
If zero page is 2M, cache pressure is significant.

On other hand with table of 4k zero pages (hpa's proposal) will increase
pressure on TLB, since we have more pages for the same memory area. So we
have to do more page translation in this case.

On my test machine with simple memcmp() virtual huge zero page is faster.
But it highly depends on TLB size, cache size, memory access and page
translation costs.

It looks like cache size in modern processors grows faster than TLB size.


Oh, I see, thanks for your quick response. One more question below.




The problem with virtual huge zero page: it requires per-arch enabling.
We need a way to mark that pmd table has all ptes set to zero page.

Some numbers to compare two implementations (on 4s Westmere-EX):

Mirobenchmark1
==

test:
 posix_memalign((void **)&p, 2 * MB, 8 * GB);
 for (i = 0; i < 100; i++) {
 assert(memcmp(p, p + 4*GB, 4*GB) == 0);
 asm volatile ("": : :"memory");
 }

hzp:
  Performance counter stats for './test_memcmp' (5 runs):

   32356.272845 task-clock#0.998 CPUs utilized  
  ( +-  0.13% )
 40 context-switches  #0.001 K/sec  
  ( +-  0.94% )
  0 CPU-migrations#0.000 K/sec
  4,218 page-faults   #0.130 K/sec  
  ( +-  0.00% )
 76,712,481,765 cycles#2.371 GHz
  ( +-  0.13% ) [83.31%]
 36,279,577,636 stalled-cycles-frontend   #   47.29% frontend cycles idle   
  ( +-  0.28% ) [83.35%]
  1,684,049,110 stalled-cycles-backend#2.20% backend  cycles idle   
  ( +-  2.96% ) [66.67%]
134,355,715,816 instructions  #1.75  insns per cycle
  #0.27  stalled cycles per 
insn  ( +-  0.10% ) [83.35%]
 13,526,169,702 branches  #  418.039 M/sec  
  ( +-  0.10% ) [83.31%]
  1,058,230 branch-misses #0.01% of all branches
  ( +-  0.91% ) [83.36%]

   32.413866442 seconds time elapsed
  ( +-  0.13% )

vhzp:
  Performance counter stats for './test_memcmp' (5 runs):

   30327.183829 task-clock#0.998 CPUs utilized  
  ( +-  0.13% )
 38 context-switches  #0.001 K/sec  
  ( +-  1.53% )
  0 CPU-migrations#0.000 K/sec
  4,218 page-faults   #0.139 K/sec  
  ( +-  0.01% )
 71,964,773,660 cycles#2.373 GHz
  ( +-  0.13% ) [83.35%]
 31,191,284,231 stalled-cycles-frontend   #   43.34% frontend cycles idle   
  ( +-  0.40% ) [83.32%]
773,484,474 stalled-cycles-backend#1.07% backend  cycles idle   
  ( +-  6.61% ) [66.67%]
134,982,215,437 instructions  #1.88  insns per cycle
  #0.23  stalled cycles per 
insn  ( +-  0.11% ) [83.32%]
 13,509,150,683 branches  #  445.447 M/sec  
  ( +-  0.11% ) [83.34%]
  1,017,667 branch-misses #0.01% of all branches
  ( +-  1.07% ) [83.32%]

   30.381324695 seconds time elapsed
  ( +-  0.13% )

Could you tell me which data I should look at in these performance
counters? And what's the benefit of your current implementation
compared to hpa's approach?


Sorry for my ignorance. Could you tell me which data I should look at 
in these performance counter stats? The same question applies to the second 
benchmark's counter stats, thanks in advance. :-)

Mirobenchmark2
==

test:
 posix_memalign((void **)&p, 2 * MB, 8 * GB);
 for (i = 0; i < 1000; i++) {
 char *_p = p;
 while (_p < p+4*GB) {
 assert(*_p == *(_p+4*GB));
 _p += 4096;
 asm volatile ("": : :"memory");
 }
 }

hzp:
  Performance counter stats for 'taskset -c 0 ./test_memcmp2' (5 runs):

3505.727639 task-clock  

Re: [PATCH] oom, memcg: handle sysctl oom_kill_allocating_task while memcg oom happening

2012-10-16 Thread Ni zhan Chen

On 10/16/2012 06:12 PM, Sha Zhengju wrote:

From: Sha Zhengju 

Sysctl oom_kill_allocating_task enables or disables killing the OOM-triggering
task in out-of-memory situations, but it only works on overall system-wide oom.
But it's also a useful indication in memcg so we take it into consideration
while oom happening in memcg. Other sysctl such as panic_on_oom has already
been memcg-aware.


Is this a resend or a new version? Could you add a changelog if it's 
the latter?




Signed-off-by: Sha Zhengju 
---
  mm/memcontrol.c |9 +
  1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e4e9b18..c329940 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1486,6 +1486,15 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup 
*memcg, gfp_t gfp_mask,
  
  	check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, order, NULL);

totalpages = mem_cgroup_get_limit(memcg) >> PAGE_SHIFT ? : 1;
+   if (sysctl_oom_kill_allocating_task && current->mm &&
+   !oom_unkillable_task(current, memcg, NULL) &&
+   current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
+   get_task_struct(current);
+   oom_kill_process(current, gfp_mask, order, 0, totalpages, 
memcg, NULL,
+"Memory cgroup out of memory 
(oom_kill_allocating_task)");
+   return;
+   }
+
for_each_mem_cgroup_tree(iter, memcg) {
struct cgroup *cgroup = iter->css.cgroup;
struct cgroup_iter it;


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
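
For reference, the knob this patch extends to memcg OOM is the ordinary
sysctl; it can be flipped via /proc as in the illustrative snippet below
(nothing memcg-specific here, just the standard sysctl path):

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/sys/vm/oom_kill_allocating_task", "w");

	if (!f) {
		perror("fopen");
		return 1;
	}
	fputs("1\n", f);	/* 1 = kill the allocating task on OOM */
	fclose(f);
	return 0;
}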


Re: [PATCH v4 00/10, REBASED] Introduce huge zero page

2012-10-16 Thread Ni zhan Chen

On 10/15/2012 02:00 PM, Kirill A. Shutemov wrote:

From: "Kirill A. Shutemov" 

Hi,

Andrew, here's huge zero page patchset rebased to v3.7-rc1.

Andrea, I've dropped your Reviewed-by due to non-trivial conflicts during the
rebase. Could you look through it again? Patches 2, 3, 4, 7, 10 had conflicts,
mostly due to the new MMU notifiers interface.

=

During testing I noticed big (up to 2.5 times) memory consumption overhead
on some workloads (e.g. ft.A from NPB) if THP is enabled.

The main reason for that big difference is lacking zero page in THP case.
We have to allocate a real page on read page fault.

A program to demonstrate the issue:
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

#define MB 1024*1024

int main(int argc, char **argv)
{
 char *p;
 int i;

 posix_memalign((void **)&p, 2 * MB, 200 * MB);
 for (i = 0; i < 200 * MB; i+= 4096)
 assert(p[i] == 0);
 pause();
 return 0;
}

With thp-never RSS is about 400k, but with thp-always it's 200M.
After the patchset thp-always RSS is 400k too.

Design overview.

Huge zero page (hzp) is a non-movable huge page (2M on x86-64) filled with
zeros.  The way how we allocate it changes in the patchset:

- [01/10] simplest way: hzp allocated on boot time in hugepage_init();
- [09/10] lazy allocation on first use;
- [10/10] lockless refcounting + shrinker-reclaimable hzp;

We setup it in do_huge_pmd_anonymous_page() if area around fault address
is suitable for THP and we've got read page fault.
If we fail to setup hzp (ENOMEM) we fallback to handle_pte_fault() as we
normally do in THP.

On wp fault to hzp we allocate real memory for the huge page and clear it.
If ENOMEM, graceful fallback: we create a new pmd table and set pte around
fault address to newly allocated normal (4k) page. All other ptes in the
pmd set to normal zero page.

We cannot split hzp (and it's bug if we try), but we can split the pmd
which points to it. On splitting the pmd we create a table with all ptes
set to normal zero page.

Patchset organized in bisect-friendly way:
  Patches 01-07: prepare all code paths for hzp
  Patch 08: all code paths are covered: safe to setup hzp
  Patch 09: lazy allocation
  Patch 10: lockless refcounting for hzp

v4:
  - Rebase to v3.7-rc1;
  - Update commit message;
v3:
  - fix potential deadlock in refcounting code on preemptive kernel.
  - do not mark huge zero page as movable.
  - fix typo in comment.
  - Reviewed-by tag from Andrea Arcangeli.
v2:
  - Avoid find_vma() if we've already had vma on stack.
Suggested by Andrea Arcangeli.
  - Implement refcounting for huge zero page.

--

By hpa request I've tried alternative approach for hzp implementation (see
Virtual huge zero page patchset): pmd table with all entries set to zero
page. This way should be more cache friendly, but it increases TLB
pressure.


Thanks for your excellent work. But could you explain to me why the current 
implementation is not cache friendly while hpa's proposed approach is? 
Thanks in advance.




The problem with virtual huge zero page: it requires per-arch enabling.
We need a way to mark that pmd table has all ptes set to zero page.

Some numbers to compare two implementations (on 4s Westmere-EX):

Mirobenchmark1
==

test:
 posix_memalign((void **)&p, 2 * MB, 8 * GB);
 for (i = 0; i < 100; i++) {
 assert(memcmp(p, p + 4*GB, 4*GB) == 0);
 asm volatile ("": : :"memory");
 }

hzp:
  Performance counter stats for './test_memcmp' (5 runs):

   32356.272845 task-clock#0.998 CPUs utilized  
  ( +-  0.13% )
 40 context-switches  #0.001 K/sec  
  ( +-  0.94% )
  0 CPU-migrations#0.000 K/sec
  4,218 page-faults   #0.130 K/sec  
  ( +-  0.00% )
 76,712,481,765 cycles#2.371 GHz
  ( +-  0.13% ) [83.31%]
 36,279,577,636 stalled-cycles-frontend   #   47.29% frontend cycles idle   
  ( +-  0.28% ) [83.35%]
  1,684,049,110 stalled-cycles-backend#2.20% backend  cycles idle   
  ( +-  2.96% ) [66.67%]
134,355,715,816 instructions  #1.75  insns per cycle
  #0.27  stalled cycles per 
insn  ( +-  0.10% ) [83.35%]
 13,526,169,702 branches  #  418.039 M/sec  
  ( +-  0.10% ) [83.31%]
  1,058,230 branch-misses #0.01% of all branches
  ( +-  0.91% ) [83.36%]

   32.413866442 seconds time elapsed
  ( +-  0.13% )

vhzp:
  Performance counter stats for './test_memcmp' (5 runs):

   30327.183829 task-clock#0.998 CPUs utilized  
  ( +-  0.13% )
 38 context-switches  #0.001 

Re: [PATCH] mm: thp: Set the accessed flag for old pages on access fault.

2012-10-16 Thread Ni zhan Chen

On 10/01/2012 10:59 PM, Andrea Arcangeli wrote:

Hi Will,

On Mon, Oct 01, 2012 at 02:51:45PM +0100, Will Deacon wrote:

+void huge_pmd_set_accessed(struct mm_struct *mm, struct vm_area_struct *vma,
+  unsigned long address, pmd_t *pmd, pmd_t orig_pmd)
+{
+   pmd_t entry;
+
+	spin_lock(&mm->page_table_lock);
+   entry = pmd_mkyoung(orig_pmd);
+   if (pmdp_set_access_flags(vma, address & HPAGE_PMD_MASK, pmd, entry, 0))
+   update_mmu_cache(vma, address, pmd);

If the pmd is being splitted, this may not be a trasnhuge pmd anymore
by the time you obtained the lock. (orig_pmd could be stale, and it
wasn't verified with pmd_same either)


Could you tell me when pmd_same should be called, in general?



The lock should be obtained through pmd_trans_huge_lock.

   if (pmd_trans_huge_lock(orig_pmd, vma) == 1)
   {
set young bit
		spin_unlock(&mm->page_table_lock);
   }


On x86:

int pmdp_set_access_flags(struct vm_area_struct *vma,
  unsigned long address, pmd_t *pmdp,
  pmd_t entry, int dirty)
{
int changed = !pmd_same(*pmdp, entry);

VM_BUG_ON(address & ~HPAGE_PMD_MASK);

if (changed && dirty) {
*pmdp = entry;

with dirty == 0 it looks like it won't make any difference, but I
guess your arm pmdp_set_access_flag is different.

However it seems "dirty" means write access and so the invocation
would better match the pte case:

if (pmdp_set_access_flags(vma, address & HPAGE_PMD_MASK, pmd, entry,
flags & FAULT_FLAG_WRITE))


But note, you still have to update it even when "dirty" == 0, or it'll
still infinite loop for read accesses.


+	spin_unlock(&mm->page_table_lock);
+}
+
  int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd, pmd_t orig_pmd)
  {
diff --git a/mm/memory.c b/mm/memory.c
index 5736170..d5c007d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3537,7 +3537,11 @@ retry:
if (unlikely(ret & VM_FAULT_OOM))
goto retry;
return ret;
+   } else {
+   huge_pmd_set_accessed(mm, vma, address, pmd,
+ orig_pmd);
}
+
return 0;

Thanks,
Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majord...@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: em...@kvack.org



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
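
Folding Andrea's two points above into the helper (re-check the pmd under the
page table lock, and pass the write flag through instead of hard-coding 0)
would give roughly the following sketch; it is written against the v1 patch
quoted in this thread, not against the rebased version:

void huge_pmd_set_accessed(struct mm_struct *mm, struct vm_area_struct *vma,
			   unsigned long address, pmd_t *pmd,
			   pmd_t orig_pmd, int dirty)
{
	pmd_t entry;
	unsigned long haddr;

	spin_lock(&mm->page_table_lock);
	/* the pmd may have been split or changed since orig_pmd was sampled */
	if (unlikely(!pmd_same(*pmd, orig_pmd)))
		goto unlock;

	entry = pmd_mkyoung(orig_pmd);
	haddr = address & HPAGE_PMD_MASK;
	/* "dirty" is flags & FAULT_FLAG_WRITE at the call site */
	if (pmdp_set_access_flags(vma, haddr, pmd, entry, dirty))
		update_mmu_cache(vma, address, pmd);

unlock:
	spin_unlock(&mm->page_table_lock);
}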


Re: [PATCH v3 07/10] thp: implement splitting pmd for huge zero page

2012-10-11 Thread Ni zhan Chen

On 10/12/2012 12:13 PM, Kirill A. Shutemov wrote:

On Fri, Oct 12, 2012 at 11:23:37AM +0800, Ni zhan Chen wrote:

On 10/02/2012 11:19 PM, Kirill A. Shutemov wrote:

From: "Kirill A. Shutemov" 

We can't split huge zero page itself, but we can split the pmd which
points to it.

On splitting the pmd we create a table with all ptes set to normal zero
page.

Signed-off-by: Kirill A. Shutemov 
Reviewed-by: Andrea Arcangeli 
---
  mm/huge_memory.c |   32 
  1 files changed, 32 insertions(+), 0 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 95032d3..3f1c59c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1600,6 +1600,7 @@ int split_huge_page(struct page *page)
struct anon_vma *anon_vma;
int ret = 1;
+   BUG_ON(is_huge_zero_pfn(page_to_pfn(page)));
BUG_ON(!PageAnon(page));
anon_vma = page_lock_anon_vma(page);
if (!anon_vma)
@@ -2503,6 +2504,32 @@ static int khugepaged(void *none)
return 0;
  }
+static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
+   unsigned long haddr, pmd_t *pmd)
+{
+   pgtable_t pgtable;
+   pmd_t _pmd;
+   int i;
+
+   pmdp_clear_flush_notify(vma, haddr, pmd);

Why can't I find the function pmdp_clear_flush_notify in the kernel source
code? Do you mean pmdp_clear_flush_young_notify or something like
that?

It was changed recently. See commit
2ec74c3 mm: move all mmu notifier invocations to be done outside the PT lock


Oh, thanks!


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 07/10] thp: implement splitting pmd for huge zero page

2012-10-11 Thread Ni zhan Chen

On 10/02/2012 11:19 PM, Kirill A. Shutemov wrote:

From: "Kirill A. Shutemov" 

We can't split huge zero page itself, but we can split the pmd which
points to it.

On splitting the pmd we create a table with all ptes set to normal zero
page.

Signed-off-by: Kirill A. Shutemov 
Reviewed-by: Andrea Arcangeli 
---
  mm/huge_memory.c |   32 
  1 files changed, 32 insertions(+), 0 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 95032d3..3f1c59c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1600,6 +1600,7 @@ int split_huge_page(struct page *page)
struct anon_vma *anon_vma;
int ret = 1;
  
+	BUG_ON(is_huge_zero_pfn(page_to_pfn(page)));

BUG_ON(!PageAnon(page));
anon_vma = page_lock_anon_vma(page);
if (!anon_vma)
@@ -2503,6 +2504,32 @@ static int khugepaged(void *none)
return 0;
  }
  
+static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,

+   unsigned long haddr, pmd_t *pmd)
+{
+   pgtable_t pgtable;
+   pmd_t _pmd;
+   int i;
+
+   pmdp_clear_flush_notify(vma, haddr, pmd);


Why can't I find the function pmdp_clear_flush_notify in the kernel source code? 
Do you mean pmdp_clear_flush_young_notify or something like that?



+   /* leave pmd empty until pte is filled */
+
+   pgtable = get_pmd_huge_pte(vma->vm_mm);
+   pmd_populate(vma->vm_mm, &_pmd, pgtable);
+
+   for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
+   pte_t *pte, entry;
+   entry = pfn_pte(my_zero_pfn(haddr), vma->vm_page_prot);
+   entry = pte_mkspecial(entry);
+   pte = pte_offset_map(&_pmd, haddr);
+   VM_BUG_ON(!pte_none(*pte));
+   set_pte_at(vma->vm_mm, haddr, pte, entry);
+   pte_unmap(pte);
+   }
+   smp_wmb(); /* make pte visible before pmd */
+   pmd_populate(vma->vm_mm, pmd, pgtable);
+}
+
  void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
pmd_t *pmd)
  {
@@ -2516,6 +2543,11 @@ void __split_huge_page_pmd(struct vm_area_struct *vma, 
unsigned long address,
	spin_unlock(&vma->vm_mm->page_table_lock);
return;
}
+   if (is_huge_zero_pmd(*pmd)) {
+   __split_huge_zero_page_pmd(vma, haddr, pmd);
+   spin_unlock(&vma->vm_mm->page_table_lock);
+   return;
+   }
page = pmd_page(*pmd);
VM_BUG_ON(!page_count(page));
get_page(page);


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 8/10] memory-hotplug : remove page table of x86_64 architecture

2012-10-10 Thread Ni zhan Chen

On 10/08/2012 01:23 PM, Wen Congyang wrote:

At 10/08/2012 12:37 PM, Andi Kleen Wrote:

Yasuaki Ishimatsu  writes:

+   }
+
+   /*
+* We use 2M page, but we need to remove part of them,
+* so split 2M page to 4K page.
+*/
+   pte = alloc_low_page(&pte_phys);

What happens when the allocation fails?

alloc_low_page seems to be buggy there too, it would __pa a NULL
pointer.

Yes, it will cause the kernel to panic in __pa() if CONFIG_DEBUG_VIRTUAL is set.
Otherwise, it will return a NULL pointer. I will update this patch to deal
with NULL pointer.


+   if (pud_large(*pud)) {
+   if ((addr & ~PUD_MASK) == 0 && next <= end) {
+   set_pud(pud, __pud(0));
+   pages++;
+   continue;
+   }
+
+   /*
+* We use 1G page, but we need to remove part of them,
+* so split 1G page to 2M page.
+*/
+   pmd = alloc_low_page(&pmd_phys);

Same here


+   __split_large_page((pte_t *)pud, addr, (pte_t *)pmd);
+
+   spin_lock(&init_mm.page_table_lock);
+   pud_populate(&init_mm, pud, __va(pmd_phys));
+   spin_unlock(&init_mm.page_table_lock);
+   }
+
+   pmd = map_low_page(pmd_offset(pud, 0));
+   phys_pmd_remove(pmd, addr, end);
+   unmap_low_page(pmd);
+   __flush_tlb_all();
+   }
+   __flush_tlb_all();


Hi Congyang,

I see you call __flush_tlb_all() after every pud entry (and all pmd/pte
entries related to it) is modified; how is the flush frequency determined?
Why not flush after every pmd entry?


Regards,
Chen


This doesn't flush the other CPUs doesn't it?

How do we flush the other CPUs' TLBs? Use on_each_cpu() to run __flush_tlb_all()
on each online CPU?

Thanks
Wen Congyang
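
For what it's worth, x86 already has a helper that broadcasts a full TLB flush
to every online CPU via on_each_cpu(); roughly (bookkeeping omitted, see
arch/x86/mm/tlb.c):

	static void do_flush_tlb_all(void *info)
	{
		__flush_tlb_all();
	}

	void flush_tlb_all(void)
	{
		/* run the flush on each online CPU and wait for completion */
		on_each_cpu(do_flush_tlb_all, NULL, 1);
	}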


-Andi


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majord...@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: em...@kvack.org



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: memmap_init_zone() performance improvement

2012-10-08 Thread Ni zhan Chen

On 10/08/2012 11:16 PM, Mel Gorman wrote:

On Wed, Oct 03, 2012 at 08:56:14AM -0600, Mike Yoknis wrote:

memmap_init_zone() loops through every Page Frame Number (pfn),
including pfn values that are within the gaps between existing
memory sections.  The unneeded looping will become a boot
performance issue when machines configure larger memory ranges
that will contain larger and more numerous gaps.

The code will skip across invalid sections to reduce the
number of loops executed.

Signed-off-by: Mike Yoknis 

This only helps SPARSEMEM and changes more headers than should be
necessary. It would have been easier to do something simple like

if (!early_pfn_valid(pfn)) {
pfn = ALIGN(pfn + MAX_ORDER_NR_PAGES, MAX_ORDER_NR_PAGES) - 1;
continue;
}


So can a present memory section in sparsemem have a
MAX_ORDER_NR_PAGES-aligned range that is entirely invalid?

If the answer is yes, when would this happen?



because that would obey the expectation that pages within a
MAX_ORDER_NR_PAGES-aligned range are all valid or all invalid (ARM is the
exception that breaks this rule). It would be less efficient on
SPARSEMEM than what you're trying to merge but I do not see the need for
the additional complexity unless you can show it makes a big difference
to boot times.
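
To make the suggestion concrete, here is roughly how it would sit inside the
memmap_init_zone() pfn loop (illustrative only; start_pfn, end_pfn, nid and
context are the names used by the existing code):

	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
		if (context == MEMMAP_EARLY) {
			if (!early_pfn_valid(pfn)) {
				/* jump to the last pfn of this MAX_ORDER block;
				 * the loop increment then lands on the next block */
				pfn = ALIGN(pfn + MAX_ORDER_NR_PAGES,
					    MAX_ORDER_NR_PAGES) - 1;
				continue;
			}
			if (!early_pfn_in_nid(pfn, nid))
				continue;
		}
		/* ... initialise the struct page for this pfn ... */
	}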



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: memmap_init_zone() performance improvement

2012-10-06 Thread Ni zhan Chen

On 10/03/2012 10:56 PM, Mike Yoknis wrote:

memmap_init_zone() loops through every Page Frame Number (pfn),
including pfn values that are within the gaps between existing
memory sections.  The unneeded looping will become a boot
performance issue when machines configure larger memory ranges
that will contain larger and more numerous gaps.

The code will skip across invalid sections to reduce the
number of loops executed.


looks reasonable to me.
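
To put rough numbers on the SPARSEMEM case (my own back-of-the-envelope
estimate, assuming the common x86_64 values SECTION_SIZE_BITS == 27 and
PAGE_SHIFT == 12), the patch's helper skips a whole section at a time:

	/* every pfn of an invalid section is invalid, so jump to the first
	 * pfn of the next section: with the values above that is
	 * 1 << (27 - 12) == 32768 pfns per hit, so a 1 GiB hole (262144 pfns)
	 * costs 8 loop iterations instead of 262144 */
	static inline unsigned long next_pfn_try(unsigned long pfn)
	{
		return section_nr_to_pfn(pfn_to_section_nr(pfn) + 1);
	}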



Signed-off-by: Mike Yoknis 
---
  arch/x86/include/asm/mmzone_32.h |2 ++
  arch/x86/include/asm/page_32.h   |1 +
  arch/x86/include/asm/page_64_types.h |3 ++-
  include/asm-generic/page.h   |1 +
  include/linux/mmzone.h   |6 ++
  mm/page_alloc.c  |5 -
  6 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/mmzone_32.h b/arch/x86/include/asm/mmzone_32.h
index eb05fb3..73c5c74 100644
--- a/arch/x86/include/asm/mmzone_32.h
+++ b/arch/x86/include/asm/mmzone_32.h
@@ -48,6 +48,8 @@ static inline int pfn_to_nid(unsigned long pfn)
  #endif
  }
  
+#define next_pfn_try(pfn)	((pfn)+1)

+
  static inline int pfn_valid(int pfn)
  {
int nid = pfn_to_nid(pfn);
diff --git a/arch/x86/include/asm/page_32.h b/arch/x86/include/asm/page_32.h
index da4e762..e2c4cfc 100644
--- a/arch/x86/include/asm/page_32.h
+++ b/arch/x86/include/asm/page_32.h
@@ -19,6 +19,7 @@ extern unsigned long __phys_addr(unsigned long);
  
  #ifdef CONFIG_FLATMEM

  #define pfn_valid(pfn)((pfn) < max_mapnr)
+#define next_pfn_try(pfn)  ((pfn)+1)
  #endif /* CONFIG_FLATMEM */
  
  #ifdef CONFIG_X86_USE_3DNOW

diff --git a/arch/x86/include/asm/page_64_types.h 
b/arch/x86/include/asm/page_64_types.h
index 320f7bb..02d82e5 100644
--- a/arch/x86/include/asm/page_64_types.h
+++ b/arch/x86/include/asm/page_64_types.h
@@ -69,7 +69,8 @@ extern void init_extra_mapping_wb(unsigned long phys, 
unsigned long size);
  #endif/* !__ASSEMBLY__ */
  
  #ifdef CONFIG_FLATMEM

-#define pfn_valid(pfn)  ((pfn) < max_pfn)
+#define pfn_valid(pfn) ((pfn) < max_pfn)
+#define next_pfn_try(pfn)  ((pfn)+1)
  #endif
  
  #endif /* _ASM_X86_PAGE_64_DEFS_H */

diff --git a/include/asm-generic/page.h b/include/asm-generic/page.h
index 37d1fe2..316200d 100644
--- a/include/asm-generic/page.h
+++ b/include/asm-generic/page.h
@@ -91,6 +91,7 @@ extern unsigned long memory_end;
  #endif
  
  #define pfn_valid(pfn)		((pfn) >= ARCH_PFN_OFFSET && ((pfn) - ARCH_PFN_OFFSET) < max_mapnr)

+#define next_pfn_try(pfn)  ((pfn)+1)
  
  #define	virt_addr_valid(kaddr)	(((void *)(kaddr) >= (void *)PAGE_OFFSET) && \

((void *)(kaddr) < (void *)memory_end))
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index f7d88ba..04d3c39 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1166,6 +1166,12 @@ static inline int pfn_valid(unsigned long pfn)
return 0;
return valid_section(__nr_to_section(pfn_to_section_nr(pfn)));
  }
+
+static inline unsigned long next_pfn_try(unsigned long pfn)
+{
+   /* Skip entire section, because all of it is invalid. */
+   return section_nr_to_pfn(pfn_to_section_nr(pfn) + 1);
+}
  #endif
  
  static inline int pfn_present(unsigned long pfn)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5b6b6b1..dd2af8b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3798,8 +3798,11 @@ void __meminit memmap_init_zone(unsigned long size, int 
nid, unsigned long zone,
 * exist on hotplugged memory.
 */
if (context == MEMMAP_EARLY) {
-   if (!early_pfn_valid(pfn))
+   if (!early_pfn_valid(pfn)) {
+   pfn = next_pfn_try(pfn);
+   pfn--;
continue;
+   }
if (!early_pfn_in_nid(pfn, nid))
continue;
}


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/4] acpi,memory-hotplug : implement framework for hot removing memory

2012-10-06 Thread Ni zhan Chen
On 10/03/2012 05:52 PM, Yasuaki Ishimatsu wrote:
> We are trying to implement a physical memory hot removing function as
> following thread.
>
> https://lkml.org/lkml/2012/9/5/201
>
> But there is not enough review to merge into linux kernel.
>
> I think there are following blockades.
>   1. no physical memory hot removable system

Which kind of special machine supports physical memory hot-remove now?

>   2. huge patch-set
>
> If you have a KVM system, we can get rid of 1st blockade. Because
> applying following patch, we can create memory hot removable system
> on KVM guest.
>
> http://lists.gnu.org/archive/html/qemu-devel/2012-07/msg01389.html
>
> 2nd blockade is own problem. So we try to divide huge patch into
> a small patch in each function as follows: 
>
>  - bug fix
>  - acpi framework
>  - kernel core
>
> We had already sent bug fix patches.
> https://lkml.org/lkml/2012/9/27/39
> https://lkml.org/lkml/2012/10/2/83
>
> The patch-set implements a framework for hot removing memory.
>
> The memory device can be removed by 2 ways:
> 1. send eject request by SCI
> 2. echo 1 >/sys/bus/pci/devices/PNP0C80:XX/eject
>
> In the 1st case, acpi_memory_disable_device() will be called.
> In the 2nd case, acpi_memory_device_remove() will be called.
> acpi_memory_device_remove() will also be called when we unbind the
> memory device from the driver acpi_memhotplug.
>
> acpi_memory_disable_device() has already implemented a code which
> offlines memory and releases acpi_memory_info struct . But
> acpi_memory_device_remove() has not implemented it yet.
>
> So the patch prepares the framework for hot removing memory and
> adds the framework intoacpi_memory_device_remove(). And it prepares
> remove_memory(). But the function does nothing because we cannot
> support memory hot remove.
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majord...@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: em...@kvack.org
>

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v9 PATCH 16/21] memory-hotplug: free memmap of sparse-vmemmap

2012-10-06 Thread Ni zhan Chen

On 10/04/2012 02:26 PM, Yasuaki Ishimatsu wrote:

Hi Chen,

Sorry for late reply.

2012/10/02 13:21, Ni zhan Chen wrote:

On 09/05/2012 05:25 PM, we...@cn.fujitsu.com wrote:

From: Yasuaki Ishimatsu 

All pages of virtual mapping in removed memory cannot be freed, 
since some pages
used as PGD/PUD includes not only removed memory but also other 
memory. So the

patch checks whether page can be freed or not.

How to check whether page can be freed or not?
  1. When removing memory, the page structs of the removed memory are filled
     with 0xFD.
  2. All page structs are filled with 0xFD on PT/PMD, PT/PMD can be 
cleared.

 In this case, the page used as PT/PMD can be freed.

Applying patch, __remove_section() of CONFIG_SPARSEMEM_VMEMMAP is 
integrated

into one. So __remove_section() of CONFIG_SPARSEMEM_VMEMMAP is deleted.

Note:  vmemmap_kfree() and vmemmap_free_bootmem() are not 
implemented for ia64,

ppc, s390, and sparc.

CC: David Rientjes 
CC: Jiang Liu 
CC: Len Brown 
CC: Benjamin Herrenschmidt 
CC: Paul Mackerras 
CC: Christoph Lameter 
Cc: Minchan Kim 
CC: Andrew Morton 
CC: KOSAKI Motohiro 
CC: Wen Congyang 
Signed-off-by: Yasuaki Ishimatsu 
---
  arch/ia64/mm/discontig.c  |8 +++
  arch/powerpc/mm/init_64.c |8 +++
  arch/s390/mm/vmem.c   |8 +++
  arch/sparc/mm/init_64.c   |8 +++
  arch/x86/mm/init_64.c |  119 
+

  include/linux/mm.h|2 +
  mm/memory_hotplug.c   |   17 +--
  mm/sparse.c   |5 +-
  8 files changed, 158 insertions(+), 17 deletions(-)

diff --git a/arch/ia64/mm/discontig.c b/arch/ia64/mm/discontig.c
index 33943db..0d23b69 100644
--- a/arch/ia64/mm/discontig.c
+++ b/arch/ia64/mm/discontig.c
@@ -823,6 +823,14 @@ int __meminit vmemmap_populate(struct page 
*start_page,

  return vmemmap_populate_basepages(start_page, size, node);
  }
+void vmemmap_kfree(struct page *memmap, unsigned long nr_pages)
+{
+}
+
+void vmemmap_free_bootmem(struct page *memmap, unsigned long nr_pages)
+{
+}
+
  void register_page_bootmem_memmap(unsigned long section_nr,
struct page *start_page, unsigned long size)
  {
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index 3690c44..835a2b3 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -299,6 +299,14 @@ int __meminit vmemmap_populate(struct page 
*start_page,

  return 0;
  }
+void vmemmap_kfree(struct page *memmap, unsigned long nr_pages)
+{
+}
+
+void vmemmap_free_bootmem(struct page *memmap, unsigned long nr_pages)
+{
+}
+
  void register_page_bootmem_memmap(unsigned long section_nr,
struct page *start_page, unsigned long size)
  {
diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c
index eda55cd..4b42b0b 100644
--- a/arch/s390/mm/vmem.c
+++ b/arch/s390/mm/vmem.c
@@ -227,6 +227,14 @@ out:
  return ret;
  }
+void vmemmap_kfree(struct page *memmap, unsigned long nr_pages)
+{
+}
+
+void vmemmap_free_bootmem(struct page *memmap, unsigned long nr_pages)
+{
+}
+
  void register_page_bootmem_memmap(unsigned long section_nr,
struct page *start_page, unsigned long size)
  {
diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index add1cc7..1384826 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2078,6 +2078,14 @@ void __meminit vmemmap_populate_print_last(void)
  }
  }
+void vmemmap_kfree(struct page *memmap, unsigned long nr_pages)
+{
+}
+
+void vmemmap_free_bootmem(struct page *memmap, unsigned long nr_pages)
+{
+}
+
  void register_page_bootmem_memmap(unsigned long section_nr,
struct page *start_page, unsigned long size)
  {
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 0075592..4e8f8a4 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1138,6 +1138,125 @@ vmemmap_populate(struct page *start_page, 
unsigned long size, int node)

  return 0;
  }
+#define PAGE_INUSE 0xFD
+
+unsigned long find_and_clear_pte_page(unsigned long addr, unsigned 
long end,

+struct page **pp, int *page_size)
+{
+pgd_t *pgd;
+pud_t *pud;
+pmd_t *pmd;
+pte_t *pte;
+void *page_addr;
+unsigned long next;
+
+*pp = NULL;
+
+pgd = pgd_offset_k(addr);
+if (pgd_none(*pgd))
+return pgd_addr_end(addr, end);
+
+pud = pud_offset(pgd, addr);
+if (pud_none(*pud))
+return pud_addr_end(addr, end);
+
+if (!cpu_has_pse) {
+next = (addr + PAGE_SIZE) & PAGE_MASK;
+pmd = pmd_offset(pud, addr);
+if (pmd_none(*pmd))
+return next;
+
+pte = pte_offset_kernel(pmd, addr);
+if (pte_none(*pte))
+return next;
+
+*page_size = PAGE_SIZE;
+*pp = pte_page(*pte);
+} else {
+next = pmd_addr_end(addr, end);
+
+pmd = pmd_offset(pud, addr);
+if (pmd_none(*pmd))
+return next;
+
+*page_size = PMD_


Re: [PATCH] CPU hotplug, writeback: Don't call writeback_set_ratelimit() too often during hotplug

2012-10-03 Thread Ni zhan Chen

On 09/28/2012 08:27 PM, Fengguang Wu wrote:

On Tue, Sep 25, 2012 at 02:18:20AM +0530, Srivatsa S. Bhat wrote:

From: Srivatsa S. Bhat 

The CPU hotplug callback related to writeback calls writeback_set_ratelimit()
during every state change in the hotplug sequence. This is unnecessary
since num_online_cpus() changes only once during the entire hotplug operation.

So invoke the function only once per hotplug, thereby avoiding the
unnecessary repetition of those costly calculations.

Signed-off-by: Srivatsa S. Bhat 
---

Looks good to me. I'll include it in the writeback tree.
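
The change boils down to something like the following in the hotplug callback
(simplified sketch, names as in mm/page-writeback.c): recompute the ratelimit
only when a CPU has actually come up or gone away, instead of on every
intermediate hotplug state.

	static int ratelimit_handler(struct notifier_block *self,
				     unsigned long action, void *hcpu)
	{
		switch (action & ~CPU_TASKS_FROZEN) {
		case CPU_ONLINE:
		case CPU_DEAD:
			/* num_online_cpus() has reached its final value here */
			writeback_set_ratelimit();
			return NOTIFY_OK;
		default:
			return NOTIFY_DONE;
		}
	}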


Hi Fengguang,

Could you tell me when inode->i_state & I_DIRTY will be set? Thanks.

Regards,
Chen



Thanks,
Fengguang

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majord...@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: em...@kvack.org



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/2] memory-hotplug : notification of memory block's state

2012-10-02 Thread Ni zhan Chen

On 10/03/2012 09:21 AM, Yasuaki Ishimatsu wrote:

Hi Andrew,

2012/10/03 6:42, Andrew Morton wrote:

On Tue, 2 Oct 2012 17:25:06 +0900
Yasuaki Ishimatsu  wrote:

remove_memory() offlines memory. And it is called by following two 
cases:


1. echo offline >/sys/devices/system/memory/memoryXX/state
2. hot remove a memory device

In the 1st case, the memory block's state is changed and the 
notification

that memory block's state changed is sent to userland after calling
offline_memory(). So user can notice memory block is changed.

But in the 2nd case, the memory block's state is not changed and the
notification is not also sent to userspcae even if calling 
offline_memory().

So user cannot notice memory block is changed.

We should also notify to userspace at 2nd case.


These two little patches look reasonable to me.

There's a lot of recent activity with memory hotplug!  We're in the 3.7
merge window now so it is not a good time to be merging new material.



Also there appear to be two teams working on it and it's unclear to me
how well coordinated this work is?


As you know, there are two teams for developing the memory hotplug.
  - Wen's patch-set
https://lkml.org/lkml/2012/9/5/201

  - Lai's patch-set
https://lkml.org/lkml/2012/9/10/180

Wen's patch-set is for removing physical memory. Now, I'm splitting the
patch-set to make review easier. If the patch-set is merged into
linux kernel, I believe that linux on x86 can hot remove a physical
memory device.

But it is not enough, since we cannot remove memory that contains kernel
memory. To guarantee memory hot-remove, the memory must belong
to ZONE_MOVABLE.

So Lai's patch-set tries to create a movable node in which all the memory
belongs to ZONE_MOVABLE.

I think there are two chances for creating the movable node.
  - boot time
  - after hot add memory

- boot time

For creating a movable memory, linux has two kernel parameters
(kernelcore and movablecore). But it is not enough, since even if we
set the kernel parameter, the movable memory is distributed evenly in
each node. So we introduce the kernelcore_max_addr boot parameter.
The parameter limits the range of the memory used as a kernel memory.

For example, the system has following nodes.

node0 : 0x4000 - 0x8000
node1 : 0x8000 - 0xc000

And when I want to hot remove a node1, we set 
"kernelcore_max_addr=0x8000".

In doing so, kernel memory is limited within 0x8000 and node1's
memory belongs to ZONE_MOVABLE. As a result, we can guarantee that
node1 is a movable node and we always hot remove node1.

- after hot add memory

When hot adding memory, the memory belongs to ZONE_NORMAL and is offline.
If we online the memory, the memory may have kernel memory. In this case,
we cannot hot remove the memory. So we introduce the online_movable
function. If we use the function as follows, the memory belongs to
ZONE_MOVABLE.

echo online_movable > /sys/devices/system/node/nodeX/memoryX/state

So when a new node is hot added and I echo "online_movable" to all hot added
memory, the node's memory belongs to ZONE_MOVABLE. As a result, we can
guarantee that the node is a movable node and we can always hot remove the node.


Hi Yasuaki,

This time, can kernel memory be allocated from ZONE_MOVABLE?



# I hope to help your understanding about our works by the information.

Thanks,
Yasuaki Ishimatsu



However these two patches are pretty simple and do fix a problem, so I
added them to the 3.7 MM queue.




--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majord...@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: em...@kvack.org



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v9 PATCH 13/21] memory-hotplug: check page type in get_page_bootmem

2012-10-02 Thread Ni zhan Chen

On 10/01/2012 11:03 AM, Yasuaki Ishimatsu wrote:

Hi Chen,

2012/09/29 11:15, Ni zhan Chen wrote:

On 09/05/2012 05:25 PM, we...@cn.fujitsu.com wrote:

From: Yasuaki Ishimatsu 

The function get_page_bootmem() may be called more than one time to 
the same
page. There is no need to set page's type, private if the function 
is not

the first time called to the page.

Note: the patch is just optimization and does not fix any problem.


Hi Yasuaki,

this patch is reasonable to me. I have another question associated with
get_page_bootmem(); the question comes from another Fujitsu developer's patch
changelog [commit: 04753278769f3], which says:


  1) When the memmap of removing section is allocated on other
  section by bootmem, it should/can be freed.
  2) When the memmap of removing section is allocated on the
  same section, it shouldn't be freed. Because the section has to be
  logical memory offlined already and all pages must be isolated against
  page allocator. If it is freed, page allocator may use it which will
  be removed physically soon.

but I don't see his patch guarantee 2), it means that his patch 
doesn't guarantee the memmap of removing section which is allocated 
on other section by bootmem doesn't be freed. Hopefully get your 
explaination in details, thanks in advance. :-)


In my understanding, the patch does not guarantee it.
Please see [commit : 0c0a4a517a31e]. free_map_bootmem() in the commit
guarantees it.


Thanks Yasuaki, I have already seen the commit you mentioned. But in the
changelog of the commit, point 2) says "If it is freed, page allocator may
use it which will be removed physically soon"; does that mean a
use-after-free? AFAIK, the isolated pages will only be freed if no users
use them, so why not free the associated memmap?




Thanks,
Yasuaki Ishimatsu





CC: David Rientjes 
CC: Jiang Liu 
CC: Len Brown 
CC: Benjamin Herrenschmidt 
CC: Paul Mackerras 
CC: Christoph Lameter 
Cc: Minchan Kim 
CC: Andrew Morton 
CC: KOSAKI Motohiro 
CC: Wen Congyang 
Signed-off-by: Yasuaki Ishimatsu 
---
  mm/memory_hotplug.c |   15 +++
  1 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index d736df3..26a5012 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -95,10 +95,17 @@ static void release_memory_resource(struct 
resource *res)

  static void get_page_bootmem(unsigned long info,  struct page *page,
   unsigned long type)
  {
-page->lru.next = (struct list_head *) type;
-SetPagePrivate(page);
-set_page_private(page, info);
-atomic_inc(&page->_count);
+unsigned long page_type;
+
+page_type = (unsigned long)page->lru.next;
+if (page_type < MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE ||
+page_type > MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE){
+page->lru.next = (struct list_head *)type;
+SetPagePrivate(page);
+set_page_private(page, info);
+atomic_inc(&page->_count);
+} else
+atomic_inc(&page->_count);
  }
  /* reference to __meminit __free_pages_bootmem is valid








--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/2] memory-hotplug : notification of memory block's state

2012-10-02 Thread Ni zhan Chen
On 10/02/2012 04:25 PM, Yasuaki Ishimatsu wrote:
> We are trying to implement a physical memory hot removing function as
> following thread.
>
> https://lkml.org/lkml/2012/9/5/201
>
> But there is not enough review to merge into linux kernel.
>
> I think there are following blockades.
>   1. no physical memory hot removable system
>   2. huge patch-set
>
> If you have a KVM system, we can get rid of 1st blockade. Because
> applying following patch, we can create memory hot removable system
> on KVM guest.
>
> http://lists.gnu.org/archive/html/qemu-devel/2012-07/msg01389.html
>
> 2nd blockade is own problem. So we try to divide huge patch into
> a small patch in each function as follows: 
>
>  - bug fix
>  - acpi framework
>  - kernel core
>
> We had already sent bug fix patches.
>
> https://lkml.org/lkml/2012/9/27/39
>
> And the patch fixes following bug.
>
> remove_memory() offlines memory. And it is called by following two cases:
>
> 1. echo offline >/sys/devices/system/memory/memoryXX/state
> 2. hot remove a memory device
>
> In the 1st case, the memory block's state is changed and the notification
> that memory block's state changed is sent to userland after calling
> offline_memory(). So user can notice memory block is changed.,

Hi Yasuaki,

Thanks for splitting the patchset, it's easier to review this time.
One question:

How is userspace notified? Do you mean the function node_memory_callback or
, but this function basically does nothing.

>
> But in the 2nd case, the memory block's state is not changed and the
> notification is not also sent to userspcae even if calling offline_memory().
> So user cannot notice memory block is changed.
>
> We should also notify to userspace at 2nd case.
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majord...@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: em...@kvack.org
>

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v9 PATCH 16/21] memory-hotplug: free memmap of sparse-vmemmap

2012-10-01 Thread Ni zhan Chen

On 09/05/2012 05:25 PM, we...@cn.fujitsu.com wrote:

From: Yasuaki Ishimatsu 

All pages of virtual mapping in removed memory cannot be freed, since some pages
used as PGD/PUD includes not only removed memory but also other memory. So the
patch checks whether page can be freed or not.

How to check whether page can be freed or not?
  1. When removing memory, the page structs of the removed memory are filled
     with 0xFD.
  2. When all the page structs mapped by a PT/PMD page are filled with 0xFD,
     the PT/PMD entry can be cleared. In this case, the page used as PT/PMD
     can be freed.
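Roughly, the marking works as in the sketch below (illustrative only, not the
patch itself; the real code does this inside the vmemmap page-table walk):

#define PAGE_INUSE 0xFD   /* same marker value the patch defines */

/* Poison the struct pages backing the removed section. */
static void mark_removed_memmap(struct page *memmap, unsigned long nr_pages)
{
        memset(memmap, PAGE_INUSE, nr_pages * sizeof(struct page));
}

/* A page holding memmap may be freed (and its PTE/PMD entry cleared) only
 * when every byte in it carries the marker, i.e. it no longer holds live
 * struct pages of any other section. */
static bool memmap_page_unused(void *page_addr)
{
        unsigned char *p = page_addr;
        unsigned long i;

        for (i = 0; i < PAGE_SIZE; i++)
                if (p[i] != PAGE_INUSE)
                        return false;
        return true;
}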

Applying patch, __remove_section() of CONFIG_SPARSEMEM_VMEMMAP is integrated
into one. So __remove_section() of CONFIG_SPARSEMEM_VMEMMAP is deleted.

Note:  vmemmap_kfree() and vmemmap_free_bootmem() are not implemented for ia64,
ppc, s390, and sparc.

CC: David Rientjes 
CC: Jiang Liu 
CC: Len Brown 
CC: Benjamin Herrenschmidt 
CC: Paul Mackerras 
CC: Christoph Lameter 
Cc: Minchan Kim 
CC: Andrew Morton 
CC: KOSAKI Motohiro 
CC: Wen Congyang 
Signed-off-by: Yasuaki Ishimatsu 
---
  arch/ia64/mm/discontig.c  |8 +++
  arch/powerpc/mm/init_64.c |8 +++
  arch/s390/mm/vmem.c   |8 +++
  arch/sparc/mm/init_64.c   |8 +++
  arch/x86/mm/init_64.c |  119 +
  include/linux/mm.h|2 +
  mm/memory_hotplug.c   |   17 +--
  mm/sparse.c   |5 +-
  8 files changed, 158 insertions(+), 17 deletions(-)

diff --git a/arch/ia64/mm/discontig.c b/arch/ia64/mm/discontig.c
index 33943db..0d23b69 100644
--- a/arch/ia64/mm/discontig.c
+++ b/arch/ia64/mm/discontig.c
@@ -823,6 +823,14 @@ int __meminit vmemmap_populate(struct page *start_page,
return vmemmap_populate_basepages(start_page, size, node);
  }
  
+void vmemmap_kfree(struct page *memmap, unsigned long nr_pages)

+{
+}
+
+void vmemmap_free_bootmem(struct page *memmap, unsigned long nr_pages)
+{
+}
+
  void register_page_bootmem_memmap(unsigned long section_nr,
  struct page *start_page, unsigned long size)
  {
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index 3690c44..835a2b3 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -299,6 +299,14 @@ int __meminit vmemmap_populate(struct page *start_page,
return 0;
  }
  
+void vmemmap_kfree(struct page *memmap, unsigned long nr_pages)

+{
+}
+
+void vmemmap_free_bootmem(struct page *memmap, unsigned long nr_pages)
+{
+}
+
  void register_page_bootmem_memmap(unsigned long section_nr,
  struct page *start_page, unsigned long size)
  {
diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c
index eda55cd..4b42b0b 100644
--- a/arch/s390/mm/vmem.c
+++ b/arch/s390/mm/vmem.c
@@ -227,6 +227,14 @@ out:
return ret;
  }
  
+void vmemmap_kfree(struct page *memmap, unsigned long nr_pages)

+{
+}
+
+void vmemmap_free_bootmem(struct page *memmap, unsigned long nr_pages)
+{
+}
+
  void register_page_bootmem_memmap(unsigned long section_nr,
  struct page *start_page, unsigned long size)
  {
diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index add1cc7..1384826 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2078,6 +2078,14 @@ void __meminit vmemmap_populate_print_last(void)
}
  }
  
+void vmemmap_kfree(struct page *memmap, unsigned long nr_pages)

+{
+}
+
+void vmemmap_free_bootmem(struct page *memmap, unsigned long nr_pages)
+{
+}
+
  void register_page_bootmem_memmap(unsigned long section_nr,
  struct page *start_page, unsigned long size)
  {
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 0075592..4e8f8a4 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1138,6 +1138,125 @@ vmemmap_populate(struct page *start_page, unsigned long 
size, int node)
return 0;
  }
  
+#define PAGE_INUSE 0xFD

+
+unsigned long find_and_clear_pte_page(unsigned long addr, unsigned long end,
+   struct page **pp, int *page_size)
+{
+   pgd_t *pgd;
+   pud_t *pud;
+   pmd_t *pmd;
+   pte_t *pte;
+   void *page_addr;
+   unsigned long next;
+
+   *pp = NULL;
+
+   pgd = pgd_offset_k(addr);
+   if (pgd_none(*pgd))
+   return pgd_addr_end(addr, end);
+
+   pud = pud_offset(pgd, addr);
+   if (pud_none(*pud))
+   return pud_addr_end(addr, end);
+
+   if (!cpu_has_pse) {
+   next = (addr + PAGE_SIZE) & PAGE_MASK;
+   pmd = pmd_offset(pud, addr);
+   if (pmd_none(*pmd))
+   return next;
+
+   pte = pte_offset_kernel(pmd, addr);
+   if (pte_none(*pte))
+   return next;
+
+   *page_size = PAGE_SIZE;
+   *pp = pte_page(*pte);
+   } else {
+   next = pmd_addr_end(addr, end);
+
+   pmd = pmd_offset(pud, 

Re: [RFC v9 PATCH 06/21] memory-hotplug: export the function acpi_bus_remove()

2012-10-01 Thread Ni zhan Chen

On 09/05/2012 05:25 PM, we...@cn.fujitsu.com wrote:

From: Wen Congyang 

The function acpi_bus_remove() can remove a acpi device from acpi device.


IIUC, s/acpi device/acpi bus

  
When a acpi device is removed, we need to call this function to remove

the acpi device from acpi bus. So export this function.
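For context, a sketch of the kind of caller this export enables (illustrative
only; the real user is the memory hot-remove path in this series, and the
second argument is just the rmdevice flag from the signature below):

/* Hedged sketch: once exported, a hot-remove path outside scan.c can detach
 * the ACPI device itself. */
static int example_detach_acpi_device(struct acpi_device *device)
{
        int ret;

        ret = acpi_bus_remove(device, 1 /* rmdevice */);
        if (ret)
                pr_err("acpi_bus_remove failed: %d\n", ret);
        return ret;
}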

CC: David Rientjes 
CC: Jiang Liu 
CC: Len Brown 
CC: Benjamin Herrenschmidt 
CC: Paul Mackerras 
CC: Christoph Lameter 
Cc: Minchan Kim 
CC: Andrew Morton 
CC: KOSAKI Motohiro 
CC: Yasuaki Ishimatsu 
Signed-off-by: Wen Congyang 
---
  drivers/acpi/scan.c |3 ++-
  include/acpi/acpi_bus.h |1 +
  2 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/drivers/acpi/scan.c b/drivers/acpi/scan.c
index d1ecca2..1cefc34 100644
--- a/drivers/acpi/scan.c
+++ b/drivers/acpi/scan.c
@@ -1224,7 +1224,7 @@ static int acpi_device_set_context(struct acpi_device 
*device)
return -ENODEV;
  }
  
-static int acpi_bus_remove(struct acpi_device *dev, int rmdevice)

+int acpi_bus_remove(struct acpi_device *dev, int rmdevice)
  {
if (!dev)
return -EINVAL;
@@ -1246,6 +1246,7 @@ static int acpi_bus_remove(struct acpi_device *dev, int 
rmdevice)
  
  	return 0;

  }
+EXPORT_SYMBOL(acpi_bus_remove);
  
  static int acpi_add_single_object(struct acpi_device **child,

  acpi_handle handle, int type,
diff --git a/include/acpi/acpi_bus.h b/include/acpi/acpi_bus.h
index bde976e..2ccf109 100644
--- a/include/acpi/acpi_bus.h
+++ b/include/acpi/acpi_bus.h
@@ -360,6 +360,7 @@ bool acpi_bus_power_manageable(acpi_handle handle);
  bool acpi_bus_can_wakeup(acpi_handle handle);
  int acpi_power_resource_register_device(struct device *dev, acpi_handle 
handle);
  void acpi_power_resource_unregister_device(struct device *dev, acpi_handle 
handle);
+int acpi_bus_remove(struct acpi_device *dev, int rmdevice);
  #ifdef CONFIG_ACPI_PROC_EVENT
  int acpi_bus_generate_proc_event(struct acpi_device *device, u8 type, int 
data);
  int acpi_bus_generate_proc_event4(const char *class, const char *bid, u8 
type, int data);




Re: [RFC v9 PATCH 00/21] memory-hotplug: hot-remove physical memory

2012-10-01 Thread Ni zhan Chen

On 10/01/2012 12:44 PM, Yasuaki Ishimatsu wrote:

Hi Chen,

2012/09/29 17:19, Ni zhan Chen wrote:

On 09/05/2012 05:25 PM, we...@cn.fujitsu.com wrote:

From: Wen Congyang 

This patch series aims to support physical memory hot-remove.

The patches can free/remove the following things:

   - acpi_memory_info  : [RFC PATCH 4/19]
   - /sys/firmware/memmap/X/{end, start, type} : [RFC PATCH 8/19]
   - iomem_resource: [RFC PATCH 9/19]
   - mem_section and related sysfs files   : [RFC PATCH 10-11, 
13-16/19]

   - page table of removed memory  : [RFC PATCH 12/19]
   - node and related sysfs files  : [RFC PATCH 18-19/19]

If you find lack of function for physical memory hot-remove, please 
let me

know.

How to test this patchset?
1. apply this patchset and build the kernel. MEMORY_HOTPLUG, 
MEMORY_HOTREMOVE,

ACPI_HOTPLUG_MEMORY must be selected.
2. load the module acpi_memhotplug


Hi Yasuaki,

where is the acpi_memhotplug module?


If you build acpi_memhotplug as a module, it is created under the
/lib/modules/<kernel-version>/driver/acpi/ directory. It depends
on the config ACPI_HOTPLUG_MEMORY. If the config is [*], it becomes a
built-in function. So you don't need to care about it.
Thanks,
Yasuaki Ishimatsu


Hi Yasuaki,

I built the kernel; MEMORY_HOTPLUG, MEMORY_HOTREMOVE, and
ACPI_HOTPLUG_MEMORY are selected as [*], but I can't find PNP0C80:XX
under the directory /sys/bus/acpi/devices/.


[root@localhost ~]# ls /sys/bus/acpi/devices/
device:00  device:07  device:0e  device:15  device:1c  device:23 
device:2a   LNXCPU:00  LNXCPU:07PNP0501:00  PNP0C02:00 PNP0C0F:02  
PNP0C14:01
device:01  device:08  device:0f  device:16  device:1d  device:24 
device:2b   LNXCPU:01  LNXPWRBN:00  PNP0800:00  PNP0C02:01 PNP0C0F:03  
PNP0C31:00
device:02  device:09  device:10  device:17  device:1e  device:25 
device:2c   LNXCPU:02  LNXSYSTM:00  PNP0A08:00  PNP0C02:02 PNP0C0F:04
device:03  device:0a  device:11  device:18  device:1f  device:26 
device:2d   LNXCPU:03  PNP:00   PNP0B00:00  PNP0C04:00 PNP0C0F:05
device:04  device:0b  device:12  device:19  device:20  device:27 
device:2e   LNXCPU:04  PNP0100:00   PNP0C01:00  PNP0C0C:00 PNP0C0F:06
device:05  device:0c  device:13  device:1a  device:21  device:28 
device:2f   LNXCPU:05  PNP0103:00   PNP0C01:01  PNP0C0F:00 PNP0C0F:07
device:06  device:0d  device:14  device:1b  device:22  device:29 
INT3F0D:00  LNXCPU:06  PNP0200:00   PNP0C01:02  PNP0C0F:01 PNP0C14:00


then what did I miss? Thanks.






3. hotplug the memory device(it depends on your hardware)
You will see the memory device under the directory 
/sys/bus/acpi/devices/.

Its name is PNP0C80:XX.
4. online/offline pages provided by this memory device
You can write online/offline to 
/sys/devices/system/memory/memoryX/state to

online/offline pages provided by this memory device
5. hotremove the memory device
You can hotremove the memory device by the hardware, or writing 
1 to

/sys/bus/acpi/devices/PNP0C80:XX/eject.

Note: if the memory provided by the memory device is used by the 
kernel, it

can't be offlined. It is not a bug.

Known problems:
1. memory can't be offlined when CONFIG_MEMCG is selected.
For example: there is a memory device on node 1. The address range
is [1G, 1.5G). You will find 4 new directories memory8, memory9, 
memory10,

and memory11 under the directory /sys/devices/system/memory/.
If CONFIG_MEMCG is selected, we will allocate memory to store 
page cgroup
when we online pages. When we online memory8, the memory stored 
page cgroup
is not provided by this memory device. But when we online 
memory9, the memory
stored page cgroup may be provided by memory8. So we can't 
offline memory8

now. We should offline the memory in the reversed order.
When the memory device is hotremoved, we will auto offline 
memory provided
by this memory device. But we don't know which memory is onlined 
first, so
offlining memory may fail. In such case, you should offline the 
memory by

hand before hotremoving the memory device.
2. hotremoving a memory device may cause a kernel panic
This bug will be fixed by Liu Jiang's patch:
https://lkml.org/lkml/2012/7/3/1

change log of v9:
  [RFC PATCH v9 8/21]
* add a lock to protect the list map_entries
* add an indicator to firmware_map_entry to remember whether the 
memory

  is allocated from bootmem
  [RFC PATCH v9 10/21]
* change the macro to inline function
  [RFC PATCH v9 19/21]
* don't offline the node if the cpu on the node is onlined
  [RFC PATCH v9 21/21]
* create new patch: auto offline page_cgroup when onlining 
memory block

  failed

change log of v8:
  [RFC PATCH v8 17/20]
* Fix problems when one node's range include the other nodes
  [RFC PATCH v8 18/20]
* fix building error when CONFIG_MEMORY_HOTPLUG_SPARSE or 
CONFIG_HUGETLBFS

  is not defined.
  [RFC PATCH v8 19/20]
* don't offline

Re: [RFC v9 PATCH 00/21] memory-hotplug: hot-remove physical memory

2012-09-29 Thread Ni zhan Chen

On 09/05/2012 05:25 PM, we...@cn.fujitsu.com wrote:

From: Wen Congyang 

This patch series aims to support physical memory hot-remove.

The patches can free/remove the following things:

   - acpi_memory_info  : [RFC PATCH 4/19]
   - /sys/firmware/memmap/X/{end, start, type} : [RFC PATCH 8/19]
   - iomem_resource: [RFC PATCH 9/19]
   - mem_section and related sysfs files   : [RFC PATCH 10-11, 13-16/19]
   - page table of removed memory  : [RFC PATCH 12/19]
   - node and related sysfs files  : [RFC PATCH 18-19/19]

If you find lack of function for physical memory hot-remove, please let me
know.

How to test this patchset?
1. apply this patchset and build the kernel. MEMORY_HOTPLUG, MEMORY_HOTREMOVE,
ACPI_HOTPLUG_MEMORY must be selected.
2. load the module acpi_memhotplug


Hi Yasuaki,

where is the acpi_memhotplug module?


3. hotplug the memory device(it depends on your hardware)
You will see the memory device under the directory /sys/bus/acpi/devices/.
Its name is PNP0C80:XX.
4. online/offline pages provided by this memory device
You can write online/offline to /sys/devices/system/memory/memoryX/state to
online/offline pages provided by this memory device
5. hotremove the memory device
You can hotremove the memory device by the hardware, or writing 1 to
/sys/bus/acpi/devices/PNP0C80:XX/eject.

Note: if the memory provided by the memory device is used by the kernel, it
can't be offlined. It is not a bug.

Known problems:
1. memory can't be offlined when CONFIG_MEMCG is selected.
For example: there is a memory device on node 1. The address range
is [1G, 1.5G). You will find 4 new directories memory8, memory9, memory10,
and memory11 under the directory /sys/devices/system/memory/.
If CONFIG_MEMCG is selected, we will allocate memory to store page cgroup
when we online pages. When we online memory8, the memory stored page cgroup
is not provided by this memory device. But when we online memory9, the 
memory
stored page cgroup may be provided by memory8. So we can't offline memory8
now. We should offline the memory in the reversed order.
When the memory device is hotremoved, we will auto offline memory provided
by this memory device. But we don't know which memory is onlined first, so
offlining memory may fail. In such case, you should offline the memory by
hand before hotremoving the memory device.
2. hotremoving a memory device may cause a kernel panic
This bug will be fixed by Liu Jiang's patch:
https://lkml.org/lkml/2012/7/3/1

change log of v9:
  [RFC PATCH v9 8/21]
* add a lock to protect the list map_entries
* add an indicator to firmware_map_entry to remember whether the memory
  is allocated from bootmem
  [RFC PATCH v9 10/21]
* change the macro to inline function
  [RFC PATCH v9 19/21]
* don't offline the node if the cpu on the node is onlined
  [RFC PATCH v9 21/21]
* create new patch: auto offline page_cgroup when onlining memory block
  failed

change log of v8:
  [RFC PATCH v8 17/20]
* Fix problems when one node's range include the other nodes
  [RFC PATCH v8 18/20]
* fix building error when CONFIG_MEMORY_HOTPLUG_SPARSE or CONFIG_HUGETLBFS
  is not defined.
  [RFC PATCH v8 19/20]
* don't offline node when some memory sections are not removed
  [RFC PATCH v8 20/20]
* create new patch: clear hwpoisoned flag when onlining pages

change log of v7:
  [RFC PATCH v7 4/19]
* do not continue if acpi_memory_device_remove_memory() fails.
  [RFC PATCH v7 15/19]
* handle usemap in register_page_bootmem_info_section() too.

change log of v6:
  [RFC PATCH v6 12/19]
* fix building error on other architectures than x86

  [RFC PATCH v6 15-16/19]
* fix building error on other architectures than x86

change log of v5:
  * merge the patchset to clear page table and the patchset to hot remove
memory(from ishimatsu) to one big patchset.

  [RFC PATCH v5 1/19]
* rename remove_memory() to offline_memory()/offline_pages()

  [RFC PATCH v5 2/19]
* new patch: implement offline_memory(). This function offlines pages,
  update memory block's state, and notify the userspace that the memory
  block's state is changed.

  [RFC PATCH v5 4/19]
* offline and remove memory in acpi_memory_disable_device() too.

  [RFC PATCH v5 17/19]
* new patch: add a new function __remove_zone() to revert the things done
  in the function __add_zone().

  [RFC PATCH v5 18/19]
* flush work before resetting node device.

change log of v4:
  * remove "memory-hotplug : unify argument of firmware_map_add_early/hotplug"
from the patch series, since the patch is a bugfix. It is being discussed
on other thread. But for testing the patch series, the patch is needed.
So I added the patch as [PATCH 0/13].

  [RFC PATCH v4 2/13]
* check memory is online or not at remove_memory()

Re: [RFC v9 PATCH 00/21] memory-hotplug: hot-remove physical memory

2012-09-28 Thread Ni zhan Chen

On 09/05/2012 05:25 PM, we...@cn.fujitsu.com wrote:

From: Wen Congyang 

This patch series aims to support physical memory hot-remove.

The patches can free/remove the following things:

   - acpi_memory_info  : [RFC PATCH 4/19]
   - /sys/firmware/memmap/X/{end, start, type} : [RFC PATCH 8/19]
   - iomem_resource: [RFC PATCH 9/19]
   - mem_section and related sysfs files   : [RFC PATCH 10-11, 13-16/19]
   - page table of removed memory  : [RFC PATCH 12/19]
   - node and related sysfs files  : [RFC PATCH 18-19/19]

If you find lack of function for physical memory hot-remove, please let me
know.


Since the patchset is quite big, could you add more changelog describing
how it works, so that it is easier to review?




How to test this patchset?
1. apply this patchset and build the kernel. MEMORY_HOTPLUG, MEMORY_HOTREMOVE,
ACPI_HOTPLUG_MEMORY must be selected.
2. load the module acpi_memhotplug
3. hotplug the memory device(it depends on your hardware)
You will see the memory device under the directory /sys/bus/acpi/devices/.
Its name is PNP0C80:XX.
4. online/offline pages provided by this memory device
You can write online/offline to /sys/devices/system/memory/memoryX/state to
online/offline pages provided by this memory device
5. hotremove the memory device
You can hotremove the memory device by the hardware, or writing 1 to
/sys/bus/acpi/devices/PNP0C80:XX/eject.

Note: if the memory provided by the memory device is used by the kernel, it
can't be offlined. It is not a bug.

Known problems:
1. memory can't be offlined when CONFIG_MEMCG is selected.
For example: there is a memory device on node 1. The address range
is [1G, 1.5G). You will find 4 new directories memory8, memory9, memory10,
and memory11 under the directory /sys/devices/system/memory/.
If CONFIG_MEMCG is selected, we will allocate memory to store page cgroup
when we online pages. When we online memory8, the memory stored page cgroup
is not provided by this memory device. But when we online memory9, the 
memory
stored page cgroup may be provided by memory8. So we can't offline memory8
now. We should offline the memory in the reversed order.
When the memory device is hotremoved, we will auto offline memory provided
by this memory device. But we don't know which memory is onlined first, so
offlining memory may fail. In such case, you should offline the memory by
hand before hotremoving the memory device.
2. hotremoving a memory device may cause a kernel panic
This bug will be fixed by Liu Jiang's patch:
https://lkml.org/lkml/2012/7/3/1

change log of v9:
  [RFC PATCH v9 8/21]
* add a lock to protect the list map_entries
* add an indicator to firmware_map_entry to remember whether the memory
  is allocated from bootmem
  [RFC PATCH v9 10/21]
* change the macro to inline function
  [RFC PATCH v9 19/21]
* don't offline the node if the cpu on the node is onlined
  [RFC PATCH v9 21/21]
* create new patch: auto offline page_cgroup when onlining memory block
  failed

change log of v8:
  [RFC PATCH v8 17/20]
* Fix problems when one node's range include the other nodes
  [RFC PATCH v8 18/20]
* fix building error when CONFIG_MEMORY_HOTPLUG_SPARSE or CONFIG_HUGETLBFS
  is not defined.
  [RFC PATCH v8 19/20]
* don't offline node when some memory sections are not removed
  [RFC PATCH v8 20/20]
* create new patch: clear hwpoisoned flag when onlining pages

change log of v7:
  [RFC PATCH v7 4/19]
* do not continue if acpi_memory_device_remove_memory() fails.
  [RFC PATCH v7 15/19]
* handle usemap in register_page_bootmem_info_section() too.

change log of v6:
  [RFC PATCH v6 12/19]
* fix building error on other architectures than x86

  [RFC PATCH v6 15-16/19]
* fix building error on other architectures than x86

change log of v5:
  * merge the patchset to clear page table and the patchset to hot remove
memory(from ishimatsu) to one big patchset.

  [RFC PATCH v5 1/19]
* rename remove_memory() to offline_memory()/offline_pages()

  [RFC PATCH v5 2/19]
* new patch: implement offline_memory(). This function offlines pages,
  update memory block's state, and notify the userspace that the memory
  block's state is changed.

  [RFC PATCH v5 4/19]
* offline and remove memory in acpi_memory_disable_device() too.

  [RFC PATCH v5 17/19]
* new patch: add a new function __remove_zone() to revert the things done
  in the function __add_zone().

  [RFC PATCH v5 18/19]
* flush work before resetting node device.

change log of v4:
  * remove "memory-hotplug : unify argument of firmware_map_add_early/hotplug"
from the patch series, since the patch is a bugfix. It is being discussed
on other thread. But for testing the patch series, the patch is needed.
So I added the patch as 

Re: [PATCH 0/4] bugfix for memory hotplug

2012-09-28 Thread Ni zhan Chen

On 09/27/2012 01:45 PM, we...@cn.fujitsu.com wrote:

From: Wen Congyang 

Wen Congyang (2):
   memory-hotplug: clear hwpoisoned flag when onlining pages
   memory-hotplug: auto offline page_cgroup when onlining memory block
 failed


Again, you should explain that these two patches are the new version of
"memory-hotplug: hot-remove physical memory" [20/21, 21/21].




Yasuaki Ishimatsu (2):
   memory-hotplug: add memory_block_release
   memory-hotplug: add node_device_release

  drivers/base/memory.c |9 -
  drivers/base/node.c   |   11 +++
  mm/memory_hotplug.c   |8 
  mm/page_cgroup.c  |3 +++
  4 files changed, 30 insertions(+), 1 deletions(-)



Re: [RFC v9 PATCH 13/21] memory-hotplug: check page type in get_page_bootmem

2012-09-28 Thread Ni zhan Chen

On 09/05/2012 05:25 PM, we...@cn.fujitsu.com wrote:

From: Yasuaki Ishimatsu 

The function get_page_bootmem() may be called more than one time to the same
page. There is no need to set page's type, private if the function is not
the first time called to the page.

Note: the patch is just optimization and does not fix any problem.
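For context, the paired put side looks roughly like the sketch below (written
from memory, so details may differ from the tree being patched); it shows why
only the first get needs to set the type and private fields:

/* Rough sketch of put_page_bootmem() for context only. */
void put_page_bootmem(struct page *page)
{
        unsigned long type = (unsigned long)page->lru.next;

        BUG_ON(type < MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE ||
               type > MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE);

        if (atomic_dec_return(&page->_count) == 1) {
                /* last reference: undo what get_page_bootmem() set up */
                ClearPagePrivate(page);
                set_page_private(page, 0);
                INIT_LIST_HEAD(&page->lru);
                __free_pages_bootmem(page, 0);
        }
}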


Hi Yasuaki,

this patch looks reasonable to me. I have another question associated with
get_page_bootmem(); the question comes from another Fujitsu guy's patch
changelog [commit 04753278769f3], which said that:


 1) When the memmap of removing section is allocated on other
 section by bootmem, it should/can be free.
 2) When the memmap of removing section is allocated on the
 same section, it shouldn't be freed. Because the section has to be
 logical memory offlined already and all pages must be isolated against
 page allocater. If it is freed, page allocator may use it which will
 be removed physically soon.

but I don't see how his patch guarantees 2); it seems his patch doesn't
guarantee that the memmap of the removed section, allocated on another
section by bootmem, is not freed. I hope to get your explanation in detail,
thanks in advance. :-)




CC: David Rientjes 
CC: Jiang Liu 
CC: Len Brown 
CC: Benjamin Herrenschmidt 
CC: Paul Mackerras 
CC: Christoph Lameter 
Cc: Minchan Kim 
CC: Andrew Morton 
CC: KOSAKI Motohiro 
CC: Wen Congyang 
Signed-off-by: Yasuaki Ishimatsu 
---
  mm/memory_hotplug.c |   15 +++
  1 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index d736df3..26a5012 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -95,10 +95,17 @@ static void release_memory_resource(struct resource *res)
  static void get_page_bootmem(unsigned long info,  struct page *page,
 unsigned long type)
  {
-   page->lru.next = (struct list_head *) type;
-   SetPagePrivate(page);
-   set_page_private(page, info);
-   atomic_inc(&page->_count);
+   unsigned long page_type;
+
+   page_type = (unsigned long)page->lru.next;
+   if (page_type < MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE ||
+   page_type > MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE){
+   page->lru.next = (struct list_head *)type;
+   SetPagePrivate(page);
+   set_page_private(page, info);
+   atomic_inc(&page->_count);
+   } else
+   atomic_inc(&page->_count);
  }
  
  /* reference to __meminit __free_pages_bootmem is valid




Re: [PATCH 3/3] memory_hotplug: Don't modify the zone_start_pfn outside of zone_span_writelock()

2012-09-28 Thread Ni zhan Chen

On 09/28/2012 03:29 PM, Lai Jiangshan wrote:

Hi, Chen,

On 09/27/2012 09:19 PM, Ni zhan Chen wrote:

On 09/27/2012 02:47 PM, Lai Jiangshan wrote:

__add_zone() may call the sleepable init_currently_empty_zone()
to init the wait_table.

But this function also modifies zone_start_pfn without any lock.
It is buggy.

So we move this modification out, and we ensure that zone_start_pfn is
only modified with zone_span_writelock() held or during boot.

Since zone_start_pfn is no longer modified by init_currently_empty_zone(),
grow_zone_span() needs to check zone_start_pfn before updating it.

CC: Mel Gorman 
Signed-off-by: Lai Jiangshan 
Reported-by: Yasuaki ISIMATU 
Tested-by: Wen Congyang 
---
   mm/memory_hotplug.c |2 +-
   mm/page_alloc.c |3 +--
   2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index b62d429b..790561f 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -205,7 +205,7 @@ static void grow_zone_span(struct zone *zone, unsigned long 
start_pfn,
   zone_span_writelock(zone);
 old_zone_end_pfn = zone->zone_start_pfn + zone->spanned_pages;
-if (start_pfn < zone->zone_start_pfn)
+if (!zone->zone_start_pfn || start_pfn < zone->zone_start_pfn)
   zone->zone_start_pfn = start_pfn;
 zone->spanned_pages = max(old_zone_end_pfn, end_pfn) -
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c13ea75..2545013 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3997,8 +3997,6 @@ int __meminit init_currently_empty_zone(struct zone *zone,
   return ret;
   pgdat->nr_zones = zone_idx(zone) + 1;
   -zone->zone_start_pfn = zone_start_pfn;
-

then how can mminit_dprintk print zone->zone_start_pfn? Always printing 0
makes no sense.


The full code here:

mminit_dprintk(MMINIT_TRACE, "memmap_init",
"Initialising map node %d zone %lu pfns %lu -> %lu\n",
pgdat->node_id,
(unsigned long)zone_idx(zone),
zone_start_pfn, (zone_start_pfn + size));


It doesn't always print 0, it still behaves as I expected.
Could you elaborate?


Yeah, you are right. I mean mminit_dprintk is called after
zone->zone_start_pfn is initialized, to show the initialising state, but
after this patch is applied zone->zone_start_pfn will not be initialized
before this print point.




Thanks,
Lai



   mminit_dprintk(MMINIT_TRACE, "memmap_init",
   "Initialising map node %d zone %lu pfns %lu -> %lu\n",
   pgdat->node_id,
@@ -4465,6 +4463,7 @@ static void __paginginit free_area_init_core(struct 
pglist_data *pgdat,
   ret = init_currently_empty_zone(zone, zone_start_pfn,
   size, MEMMAP_EARLY);
   BUG_ON(ret);
+zone->zone_start_pfn = zone_start_pfn;
   memmap_init(size, nid, j, zone_start_pfn);
   zone_start_pfn += size;
   }








Re: [PATCH 1/4] memory-hotplug: add memory_block_release

2012-09-28 Thread Ni zhan Chen

On 09/28/2012 11:45 AM, Yasuaki Ishimatsu wrote:

Hi Kosaki-san,

2012/09/28 10:35, KOSAKI Motohiro wrote:

On Thu, Sep 27, 2012 at 8:24 PM, Yasuaki Ishimatsu
 wrote:

Hi Chen,


2012/09/27 19:20, Ni zhan Chen wrote:


Hi Congyang,

2012/9/27 


From: Yasuaki Ishimatsu 

When calling remove_memory_block(), the function shows following message
at device_release().

Device 'memory528' does not have a release() function, it is broken and
must be fixed.



What's the difference between the patch and the original implementation?



The implementation is for removing a memory_block, so the purpose is the
same as the original one. But the original code has bad manners:
kobject_cleanup() is called by remove_memory_block() at the end, but a
release function for releasing the memory_block is not registered. As a
result, the kernel message is shown. IMHO, memory_block should be released
by the release function.
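In other words, the pattern being proposed is the usual device release
callback; a sketch only (the struct memory_block layout and the init helper
are assumptions):

/* Hedged sketch of the release-callback pattern under discussion. */
static void release_memory_block(struct device *dev)
{
        struct memory_block *mem = container_of(dev, struct memory_block, dev);

        kfree(mem);     /* run by the driver core when the last reference drops */
}

/* ...and it must be wired up before device_register(): */
static void memory_block_init_dev(struct memory_block *mem)
{
        mem->dev.release = release_memory_block;
}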


but your patch introduced a use-after-free bug, if I understand correctly.
See the unregister_memory() function. After your patch, kobject_put() calls
release_memory_block() and kfree(), and then device_unregister() will
touch freed memory.


It is not correct. The kobject_put() is paired with find_memory_block()
in remove_memory_block(), since the kobject's kref is incremented there.
So release_memory_block() is called by device_unregister() correctly,
as follows:


Another memory hotplug issue, not associated with this patch, to report to
you: IIUC, the function register_mem_sect_under_node should be renamed to
register_mem_block_under_node, since this function registers a memory block
instead of a memory section.



[ 1014.589008] Pid: 126, comm: kworker/0:2 Not tainted 
3.6.0-rc3-enable-memory-hotremove-and-root-bridge #3

[ 1014.702437] Call Trace:
[ 1014.731684]  [] release_memory_block+0x16/0x30
[ 1014.803581]  [] device_release+0x27/0xa0
[ 1014.869312]  [] kobject_cleanup+0x82/0x1b0
[ 1014.937062]  [] kobject_release+0xd/0x10
[ 1015.002718]  [] kobject_put+0x2c/0x60
[ 1015.065271]  [] put_device+0x17/0x20
[ 1015.126794]  [] device_unregister+0x2a/0x60
[ 1015.195578]  [] remove_memory_block+0xbb/0xf0
[ 1015.266434]  [] unregister_memory_section+0x1f/0x30
[ 1015.343532]  [] __remove_section+0x68/0x110
[ 1015.412318]  [] __remove_pages+0xe7/0x120
[ 1015.479021]  [] arch_remove_memory+0x2c/0x80
[ 1015.548845]  [] remove_memory+0x6b/0xd0
[ 1015.613474]  [] 
acpi_memory_device_remove_memory+0x48/0x73

[ 1015.697834]  [] acpi_memory_device_remove+0x2b/0x44
[ 1015.774922]  [] acpi_device_remove+0x90/0xb2
[ 1015.844796]  [] __device_release_driver+0x7c/0xf0
[ 1015.919814]  [] device_release_driver+0x2f/0x50
[ 1015.992753]  [] acpi_bus_remove+0x32/0x6d
[ 1016.059462]  [] acpi_bus_trim+0x91/0x102
[ 1016.125128]  [] 
acpi_bus_hot_remove_device+0x88/0x16b

[ 1016.204295]  [] acpi_os_execute_deferred+0x27/0x34
[ 1016.280350]  [] process_one_work+0x219/0x680
[ 1016.350173]  [] ? process_one_work+0x1b8/0x680
[ 1016.422072]  [] ? 
acpi_os_wait_events_complete+0x23/0x23

[ 1016.504357]  [] worker_thread+0x12e/0x320
[ 1016.571064]  [] ? manage_workers+0x110/0x110
[ 1016.640886]  [] kthread+0xc6/0xd0
[ 1016.699290]  [] kernel_thread_helper+0x4/0x10
[ 1016.770149]  [] ? retint_restore_args+0x13/0x13
[ 1016.843165]  [] ? __init_kthread_worker+0x70/0x70
[ 1016.918200]  [] ? gs_change+0x13/0x13

Thanks,
Yasuaki Ishimatsu



static void
unregister_memory(struct memory_block *memory)
{
BUG_ON(memory->dev.bus != &memory_subsys);

/* drop the ref. we got in remove_memory_block() */
kobject_put(&memory->dev.kobj);
device_unregister(&memory->dev);
}









Re: [PATCH 1/4] memory-hotplug: add memory_block_release

2012-09-28 Thread Ni zhan Chen

On 09/28/2012 11:45 AM, Yasuaki Ishimatsu wrote:

Hi Kosaki-san,

2012/09/28 10:35, KOSAKI Motohiro wrote:

On Thu, Sep 27, 2012 at 8:24 PM, Yasuaki Ishimatsu
 wrote:

Hi Chen,


2012/09/27 19:20, Ni zhan Chen wrote:


Hi Congyang,

2012/9/27 


From: Yasuaki Ishimatsu 

When calling remove_memory_block(), the function shows following message
at device_release().

Device 'memory528' does not have a release() function, it is broken and
must be fixed.



What's the difference between the patch and the original implementation?



The implementation is for removing a memory_block, so the purpose is the
same as the original one. But the original code has bad manners:
kobject_cleanup() is called by remove_memory_block() at the end, but a
release function for releasing the memory_block is not registered. As a
result, the kernel message is shown. IMHO, memory_block should be released
by the release function.


but your patch introduced a use-after-free bug, if I understand correctly.
See the unregister_memory() function. After your patch, kobject_put() calls
release_memory_block() and kfree(), and then device_unregister() will
touch freed memory.




this patch is similar to [RFC v9 PATCH 10/21] memory-hotplug: add
memory_block_release; they handle the same issue, so can these two patches
be folded into one?


It is not correct. The kobject_put() is paired with find_memory_block()
in remove_memory_block(), since the kobject's kref is incremented there.
So release_memory_block() is called by device_unregister() correctly,
as follows:


[ 1014.589008] Pid: 126, comm: kworker/0:2 Not tainted 
3.6.0-rc3-enable-memory-hotremove-and-root-bridge #3

[ 1014.702437] Call Trace:
[ 1014.731684]  [] release_memory_block+0x16/0x30
[ 1014.803581]  [] device_release+0x27/0xa0
[ 1014.869312]  [] kobject_cleanup+0x82/0x1b0
[ 1014.937062]  [] kobject_release+0xd/0x10
[ 1015.002718]  [] kobject_put+0x2c/0x60
[ 1015.065271]  [] put_device+0x17/0x20
[ 1015.126794]  [] device_unregister+0x2a/0x60
[ 1015.195578]  [] remove_memory_block+0xbb/0xf0
[ 1015.266434]  [] unregister_memory_section+0x1f/0x30
[ 1015.343532]  [] __remove_section+0x68/0x110
[ 1015.412318]  [] __remove_pages+0xe7/0x120
[ 1015.479021]  [] arch_remove_memory+0x2c/0x80
[ 1015.548845]  [] remove_memory+0x6b/0xd0
[ 1015.613474]  [] 
acpi_memory_device_remove_memory+0x48/0x73

[ 1015.697834]  [] acpi_memory_device_remove+0x2b/0x44
[ 1015.774922]  [] acpi_device_remove+0x90/0xb2
[ 1015.844796]  [] __device_release_driver+0x7c/0xf0
[ 1015.919814]  [] device_release_driver+0x2f/0x50
[ 1015.992753]  [] acpi_bus_remove+0x32/0x6d
[ 1016.059462]  [] acpi_bus_trim+0x91/0x102
[ 1016.125128]  [] 
acpi_bus_hot_remove_device+0x88/0x16b

[ 1016.204295]  [] acpi_os_execute_deferred+0x27/0x34
[ 1016.280350]  [] process_one_work+0x219/0x680
[ 1016.350173]  [] ? process_one_work+0x1b8/0x680
[ 1016.422072]  [] ? 
acpi_os_wait_events_complete+0x23/0x23

[ 1016.504357]  [] worker_thread+0x12e/0x320
[ 1016.571064]  [] ? manage_workers+0x110/0x110
[ 1016.640886]  [] kthread+0xc6/0xd0
[ 1016.699290]  [] kernel_thread_helper+0x4/0x10
[ 1016.770149]  [] ? retint_restore_args+0x13/0x13
[ 1016.843165]  [] ? __init_kthread_worker+0x70/0x70
[ 1016.918200]  [] ? gs_change+0x13/0x13

Thanks,
Yasuaki Ishimatsu



static void
unregister_memory(struct memory_block *memory)
{
BUG_ON(memory->dev.bus != &memory_subsys);

/* drop the ref. we got in remove_memory_block() */
kobject_put(&memory->dev.kobj);
device_unregister(&memory->dev);
}









Re: [PATCH 1/4] memory-hotplug: add memory_block_release

2012-09-28 Thread Ni zhan Chen

On 09/28/2012 11:45 AM, Yasuaki Ishimatsu wrote:

Hi Kosaki-san,

2012/09/28 10:35, KOSAKI Motohiro wrote:

On Thu, Sep 27, 2012 at 8:24 PM, Yasuaki Ishimatsu
isimatu.yasu...@jp.fujitsu.com wrote:

Hi Chen,


2012/09/27 19:20, Ni zhan Chen wrote:


Hi Congyang,

2012/9/27 we...@cn.fujitsu.com


From: Yasuaki Ishimatsu isimatu.yasu...@jp.fujitsu.com

When calling remove_memory_block(), the function shows following 
message

at
device_release().

Device 'memory528' does not have a release() function, it is 
broken and

must
be fixed.



What's the difference between the patch and original implemetation?



The implementation is for removing a memory_block. So the purpose is
same as original one. But original code is bad manner. 
kobject_cleanup()

is called by remove_memory_block() at last. But release function for
releasing memory_block is not registered. As a result, the kernel 
message

is shown. IMHO, memory_block should be release by the releae function.


but your patch introduced use after free bug, if i understand correctly.
See unregister_memory() function. After your patch, kobject_put() call
release_memory_block() and kfree(). and then device_unregister() will
touch freed memory.




this patch is similiar to [RFC v9 PATCH 10/21] memory-hotplug: add 
memory_block_release, they handle the same issue, can these two patches 
be fold to one?


It is not correct. The kobject_put() is prepared against 
find_memory_block()

in remove_memory_block() since kobject-kref is incremented in it.
So release_memory_block() is called by device_unregister() correctly 
as follows:


[ 1014.589008] Pid: 126, comm: kworker/0:2 Not tainted 
3.6.0-rc3-enable-memory-hotremove-and-root-bridge #3

[ 1014.702437] Call Trace:
[ 1014.731684]  [8144d096] release_memory_block+0x16/0x30
[ 1014.803581]  [81438587] device_release+0x27/0xa0
[ 1014.869312]  [8133e962] kobject_cleanup+0x82/0x1b0
[ 1014.937062]  [8133ea9d] kobject_release+0xd/0x10
[ 1015.002718]  [8133e7ec] kobject_put+0x2c/0x60
[ 1015.065271]  [81438107] put_device+0x17/0x20
[ 1015.126794]  [8143918a] device_unregister+0x2a/0x60
[ 1015.195578]  [8144d55b] remove_memory_block+0xbb/0xf0
[ 1015.266434]  [8144d5af] unregister_memory_section+0x1f/0x30
[ 1015.343532]  [811c0a58] __remove_section+0x68/0x110
[ 1015.412318]  [811c0be7] __remove_pages+0xe7/0x120
[ 1015.479021]  [81653d8c] arch_remove_memory+0x2c/0x80
[ 1015.548845]  [8165497b] remove_memory+0x6b/0xd0
[ 1015.613474]  [813d946c] 
acpi_memory_device_remove_memory+0x48/0x73

[ 1015.697834]  [813d94c2] acpi_memory_device_remove+0x2b/0x44
[ 1015.774922]  [813a61e4] acpi_device_remove+0x90/0xb2
[ 1015.844796]  [8143c2fc] __device_release_driver+0x7c/0xf0
[ 1015.919814]  [8143c47f] device_release_driver+0x2f/0x50
[ 1015.992753]  [813a70dc] acpi_bus_remove+0x32/0x6d
[ 1016.059462]  [813a71a8] acpi_bus_trim+0x91/0x102
[ 1016.125128]  [813a72a1] 
acpi_bus_hot_remove_device+0x88/0x16b

[ 1016.204295]  [813a2e57] acpi_os_execute_deferred+0x27/0x34
[ 1016.280350]  [81090599] process_one_work+0x219/0x680
[ 1016.350173]  [81090538] ? process_one_work+0x1b8/0x680
[ 1016.422072]  [813a2e30] ? 
acpi_os_wait_events_complete+0x23/0x23

[ 1016.504357]  [810923ce] worker_thread+0x12e/0x320
[ 1016.571064]  [810922a0] ? manage_workers+0x110/0x110
[ 1016.640886]  [810983a6] kthread+0xc6/0xd0
[ 1016.699290]  [8167b144] kernel_thread_helper+0x4/0x10
[ 1016.770149]  [81670bb0] ? retint_restore_args+0x13/0x13
[ 1016.843165]  [810982e0] ? __init_kthread_worker+0x70/0x70
[ 1016.918200]  [8167b140] ? gs_change+0x13/0x13

Thanks,
Yasuaki Ishimatsu



static void
unregister_memory(struct memory_block *memory)
{
	BUG_ON(memory->dev.bus != &memory_subsys);

	/* drop the ref. we got in remove_memory_block() */
	kobject_put(&memory->dev.kobj);
	device_unregister(&memory->dev);
}
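
For readers following the release()-callback discussion above, here is a
minimal sketch of the idea: register a release function on the memory_block's
struct device so that the final reference drop frees the structure instead of
triggering the "does not have a release() function" warning. The field and
helper names follow drivers/base/memory.c conventions, but this is an
illustration, not the exact patch.

static void memory_block_release(struct device *dev)
{
	/* the device is embedded in the memory_block, so recover the container */
	struct memory_block *mem = container_of(dev, struct memory_block, dev);

	kfree(mem);
}

static int register_memory(struct memory_block *memory)
{
	memory->dev.bus = &memory_subsys;
	memory->dev.id = memory->start_section_nr / sections_per_block;
	/* the hook under discussion: lets the driver core free the block */
	memory->dev.release = memory_block_release;

	return device_register(&memory->dev);
}

With such a hook in place, the final kobject_put()/device_unregister() causes
the driver core to invoke the release callback, which is what the call trace
quoted earlier in this thread (device_release -> release_memory_block) shows.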







--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/4] memory-hotplug: add memory_block_release

2012-09-28 Thread Ni zhan Chen

On 09/28/2012 11:45 AM, Yasuaki Ishimatsu wrote:

Hi Kosaki-san,

2012/09/28 10:35, KOSAKI Motohiro wrote:

On Thu, Sep 27, 2012 at 8:24 PM, Yasuaki Ishimatsu
isimatu.yasu...@jp.fujitsu.com wrote:

Hi Chen,


2012/09/27 19:20, Ni zhan Chen wrote:


Hi Congyang,

2012/9/27 we...@cn.fujitsu.com


From: Yasuaki Ishimatsu isimatu.yasu...@jp.fujitsu.com

When calling remove_memory_block(), the function shows the following message
at device_release():

Device 'memory528' does not have a release() function, it is broken and must be fixed.



What's the difference between the patch and the original implementation?



The implementation is for removing a memory_block, so the purpose is the
same as the original one. But the original code is bad practice:
kobject_cleanup() is called by remove_memory_block() at the end, yet no
release function for freeing the memory_block is registered. As a result,
the kernel message above is shown. IMHO, the memory_block should be released
by the release function.


but your patch introduced a use-after-free bug, if I understand correctly.
See the unregister_memory() function. After your patch, kobject_put() calls
release_memory_block() and kfree(), and then device_unregister() will
touch freed memory.


It is not correct. The kobject_put() pairs with the find_memory_block()
in remove_memory_block(), since kobject->kref is incremented there.
So release_memory_block() is called by device_unregister() correctly,
as follows:


Another memory hotplug issue, not associated with this patch, to report to
you: IIUC, the function register_mem_sect_under_node() should be renamed to
register_mem_block_under_node(), since this function registers a memory block
rather than a memory section.



[ 1014.589008] Pid: 126, comm: kworker/0:2 Not tainted 
3.6.0-rc3-enable-memory-hotremove-and-root-bridge #3

[ 1014.702437] Call Trace:
[ 1014.731684]  [8144d096] release_memory_block+0x16/0x30
[ 1014.803581]  [81438587] device_release+0x27/0xa0
[ 1014.869312]  [8133e962] kobject_cleanup+0x82/0x1b0
[ 1014.937062]  [8133ea9d] kobject_release+0xd/0x10
[ 1015.002718]  [8133e7ec] kobject_put+0x2c/0x60
[ 1015.065271]  [81438107] put_device+0x17/0x20
[ 1015.126794]  [8143918a] device_unregister+0x2a/0x60
[ 1015.195578]  [8144d55b] remove_memory_block+0xbb/0xf0
[ 1015.266434]  [8144d5af] unregister_memory_section+0x1f/0x30
[ 1015.343532]  [811c0a58] __remove_section+0x68/0x110
[ 1015.412318]  [811c0be7] __remove_pages+0xe7/0x120
[ 1015.479021]  [81653d8c] arch_remove_memory+0x2c/0x80
[ 1015.548845]  [8165497b] remove_memory+0x6b/0xd0
[ 1015.613474]  [813d946c] 
acpi_memory_device_remove_memory+0x48/0x73

[ 1015.697834]  [813d94c2] acpi_memory_device_remove+0x2b/0x44
[ 1015.774922]  [813a61e4] acpi_device_remove+0x90/0xb2
[ 1015.844796]  [8143c2fc] __device_release_driver+0x7c/0xf0
[ 1015.919814]  [8143c47f] device_release_driver+0x2f/0x50
[ 1015.992753]  [813a70dc] acpi_bus_remove+0x32/0x6d
[ 1016.059462]  [813a71a8] acpi_bus_trim+0x91/0x102
[ 1016.125128]  [813a72a1] 
acpi_bus_hot_remove_device+0x88/0x16b

[ 1016.204295]  [813a2e57] acpi_os_execute_deferred+0x27/0x34
[ 1016.280350]  [81090599] process_one_work+0x219/0x680
[ 1016.350173]  [81090538] ? process_one_work+0x1b8/0x680
[ 1016.422072]  [813a2e30] ? 
acpi_os_wait_events_complete+0x23/0x23

[ 1016.504357]  [810923ce] worker_thread+0x12e/0x320
[ 1016.571064]  [810922a0] ? manage_workers+0x110/0x110
[ 1016.640886]  [810983a6] kthread+0xc6/0xd0
[ 1016.699290]  [8167b144] kernel_thread_helper+0x4/0x10
[ 1016.770149]  [81670bb0] ? retint_restore_args+0x13/0x13
[ 1016.843165]  [810982e0] ? __init_kthread_worker+0x70/0x70
[ 1016.918200]  [8167b140] ? gs_change+0x13/0x13

Thanks,
Yasuaki Ishimatsu



static void
unregister_memory(struct memory_block *memory)
{
	BUG_ON(memory->dev.bus != &memory_subsys);

	/* drop the ref. we got in remove_memory_block() */
	kobject_put(&memory->dev.kobj);
	device_unregister(&memory->dev);
}







--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] memory_hotplug: Don't modify the zone_start_pfn outside of zone_span_writelock()

2012-09-28 Thread Ni zhan Chen

On 09/28/2012 03:29 PM, Lai Jiangshan wrote:

Hi, Chen,

On 09/27/2012 09:19 PM, Ni zhan Chen wrote:

On 09/27/2012 02:47 PM, Lai Jiangshan wrote:

The __add_zone() may call the sleepable init_currently_empty_zone()
to init the wait_table, but that function also modifies zone_start_pfn
without any lock. It is buggy.

So we move this modification out, and we ensure that zone_start_pfn is only
modified with zone_span_writelock() held or during boot.

Since zone_start_pfn is no longer modified by init_currently_empty_zone(),
grow_zone_span() needs to check zone_start_pfn before updating it.

CC: Mel Gorman m...@csn.ul.ie
Signed-off-by: Lai Jiangshan la...@cn.fujitsu.com
Reported-by: Yasuaki ISIMATU isimatu.yasu...@jp.fujitsu.com
Tested-by: Wen Congyang we...@cn.fujitsu.com
---
   mm/memory_hotplug.c |2 +-
   mm/page_alloc.c |3 +--
   2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index b62d429b..790561f 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -205,7 +205,7 @@ static void grow_zone_span(struct zone *zone, unsigned long start_pfn,
 	zone_span_writelock(zone);
 
 	old_zone_end_pfn = zone->zone_start_pfn + zone->spanned_pages;
-	if (start_pfn < zone->zone_start_pfn)
+	if (!zone->zone_start_pfn || start_pfn < zone->zone_start_pfn)
 		zone->zone_start_pfn = start_pfn;
 
 	zone->spanned_pages = max(old_zone_end_pfn, end_pfn) -
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c13ea75..2545013 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3997,8 +3997,6 @@ int __meminit init_currently_empty_zone(struct zone *zone,
 		return ret;
 	pgdat->nr_zones = zone_idx(zone) + 1;
 
-	zone->zone_start_pfn = zone_start_pfn;
-

then how can mminit_dprintk print zone->zone_start_pfn? Always printing 0
makes no sense.


The full code here:

	mminit_dprintk(MMINIT_TRACE, "memmap_init",
			"Initialising map node %d zone %lu pfns %lu -> %lu\n",
			pgdat->node_id,
			(unsigned long)zone_idx(zone),
			zone_start_pfn, (zone_start_pfn + size));


It doesn't always print 0, it still behaves as I expected.
Could you elaborate?


Yeah, you are right. I meant that mminit_dprintk is called after
zone->zone_start_pfn has been initialized, to show the initialising state;
but with this patch applied zone->zone_start_pfn will not be initialized
before this print point.




Thanks,
Lai



 	mminit_dprintk(MMINIT_TRACE, "memmap_init",
 			"Initialising map node %d zone %lu pfns %lu -> %lu\n",
 			pgdat->node_id,
@@ -4465,6 +4463,7 @@ static void __paginginit free_area_init_core(struct 
pglist_data *pgdat,
   ret = init_currently_empty_zone(zone, zone_start_pfn,
   size, MEMMAP_EARLY);
   BUG_ON(ret);
+	zone->zone_start_pfn = zone_start_pfn;
   memmap_init(size, nid, j, zone_start_pfn);
   zone_start_pfn += size;
   }
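
To make the locking point concrete, here is the resulting grow_zone_span()
with the lock shown explicitly (reconstructed from the hunk above plus the
unchanged lines around it; whitespace is mine). After this change,
zone_start_pfn is only updated between zone_span_writelock() and
zone_span_writeunlock(), or at boot in free_area_init_core():

static void grow_zone_span(struct zone *zone, unsigned long start_pfn,
			   unsigned long end_pfn)
{
	unsigned long old_zone_end_pfn;

	zone_span_writelock(zone);

	old_zone_end_pfn = zone->zone_start_pfn + zone->spanned_pages;
	/* an empty zone keeps zone_start_pfn == 0 until its first section arrives */
	if (!zone->zone_start_pfn || start_pfn < zone->zone_start_pfn)
		zone->zone_start_pfn = start_pfn;

	zone->spanned_pages = max(old_zone_end_pfn, end_pfn) -
				zone->zone_start_pfn;

	zone_span_writeunlock(zone);
}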






--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v9 PATCH 13/21] memory-hotplug: check page type in get_page_bootmem

2012-09-28 Thread Ni zhan Chen

On 09/05/2012 05:25 PM, we...@cn.fujitsu.com wrote:

From: Yasuaki Ishimatsu isimatu.yasu...@jp.fujitsu.com

The function get_page_bootmem() may be called more than once for the same
page. There is no need to set the page's type and private fields if this is
not the first call for that page.

Note: the patch is just an optimization and does not fix any problem.


Hi Yasuaki,

this patch is reasonable to me. I have another question associated with
get_page_bootmem(); it comes from the changelog of another Fujitsu patch
[commit 04753278769f3], which said that:


 1) When the memmap of the section being removed is allocated on another
 section by bootmem, it should/can be freed.
 2) When the memmap of the section being removed is allocated on the
 same section, it shouldn't be freed, because the section has to be
 logically memory-offlined already and all pages must be isolated against
 the page allocator. If it is freed, the page allocator may use it, and it
 will be removed physically soon.

but I don't see his patch guarantee 2); it seems his patch doesn't guarantee
that the memmap of the removed section, when allocated on another section by
bootmem, is not freed. Hopefully you can explain this in detail, thanks in
advance. :-)




CC: David Rientjes rient...@google.com
CC: Jiang Liu liu...@gmail.com
CC: Len Brown len.br...@intel.com
CC: Benjamin Herrenschmidt b...@kernel.crashing.org
CC: Paul Mackerras pau...@samba.org
CC: Christoph Lameter c...@linux.com
Cc: Minchan Kim minchan@gmail.com
CC: Andrew Morton a...@linux-foundation.org
CC: KOSAKI Motohiro kosaki.motoh...@jp.fujitsu.com
CC: Wen Congyang we...@cn.fujitsu.com
Signed-off-by: Yasuaki Ishimatsu isimatu.yasu...@jp.fujitsu.com
---
  mm/memory_hotplug.c |   15 +++
  1 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index d736df3..26a5012 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -95,10 +95,17 @@ static void release_memory_resource(struct resource *res)
  static void get_page_bootmem(unsigned long info,  struct page *page,
 unsigned long type)
  {
-	page->lru.next = (struct list_head *) type;
-	SetPagePrivate(page);
-	set_page_private(page, info);
-	atomic_inc(&page->_count);
+	unsigned long page_type;
+
+	page_type = (unsigned long)page->lru.next;
+	if (page_type < MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE ||
+	    page_type > MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE) {
+		page->lru.next = (struct list_head *)type;
+		SetPagePrivate(page);
+		set_page_private(page, info);
+		atomic_inc(&page->_count);
+	} else
+		atomic_inc(&page->_count);
  }
  
  /* reference to __meminit __free_pages_bootmem is valid
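
For context, the counterpart put_page_bootmem() (quoted from memory, roughly
as it looks in mm/memory_hotplug.c of this era; treat it as a sketch rather
than the exact source) shows where the stored bootmem type and the extra
_count references taken above are consumed:

void put_page_bootmem(struct page *page)
{
	unsigned long type;

	/* the type stashed in lru.next by get_page_bootmem() must be in range */
	type = (unsigned long) page->lru.next;
	BUG_ON(type < MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE ||
	       type > MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE);

	/* only dropping the last get_page_bootmem() reference cleans up */
	if (atomic_dec_return(&page->_count) == 1) {
		ClearPagePrivate(page);
		set_page_private(page, 0);
		INIT_LIST_HEAD(&page->lru);
		__free_pages_bootmem(page, 0);
	}
}

This is why the patched get_page_bootmem() above only needs to take another
reference on repeated calls: the type and private fields are already valid
from the first call.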


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/4] bugfix for memory hotplug

2012-09-28 Thread Ni zhan Chen

On 09/27/2012 01:45 PM, we...@cn.fujitsu.com wrote:

From: Wen Congyang we...@cn.fujitsu.com

Wen Congyang (2):
   memory-hotplug: clear hwpoisoned flag when onlining pages
   memory-hotplug: auto offline page_cgroup when onlining memory block
 failed


Again, you should explain that these two patches are the new version of
memory-hotplug: hot-remove physical memory [20/21, 21/21].




Yasuaki Ishimatsu (2):
   memory-hotplug: add memory_block_release
   memory-hotplug: add node_device_release

  drivers/base/memory.c |9 -
  drivers/base/node.c   |   11 +++
  mm/memory_hotplug.c   |8 
  mm/page_cgroup.c  |3 +++
  4 files changed, 30 insertions(+), 1 deletions(-)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majord...@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: em...@kvack.org



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v9 PATCH 00/21] memory-hotplug: hot-remove physical memory

2012-09-28 Thread Ni zhan Chen

On 09/05/2012 05:25 PM, we...@cn.fujitsu.com wrote:

From: Wen Congyang we...@cn.fujitsu.com

This patch series aims to support physical memory hot-remove.

The patches can free/remove the following things:

   - acpi_memory_info  : [RFC PATCH 4/19]
   - /sys/firmware/memmap/X/{end, start, type} : [RFC PATCH 8/19]
   - iomem_resource: [RFC PATCH 9/19]
   - mem_section and related sysfs files   : [RFC PATCH 10-11, 13-16/19]
   - page table of removed memory  : [RFC PATCH 12/19]
   - node and related sysfs files  : [RFC PATCH 18-19/19]

If you find lack of function for physical memory hot-remove, please let me
know.


Since the patchset is quite big, could you add more changelog to describe
how this patchset works, so that it is easier to review?




How to test this patchset?
1. apply this patchset and build the kernel. MEMORY_HOTPLUG, MEMORY_HOTREMOVE,
ACPI_HOTPLUG_MEMORY must be selected.
2. load the module acpi_memhotplug
3. hotplug the memory device (it depends on your hardware)
You will see the memory device under the directory /sys/bus/acpi/devices/.
Its name is PNP0C80:XX.
4. online/offline pages provided by this memory device
You can write online/offline to /sys/devices/system/memory/memoryX/state to
online/offline the pages provided by this memory device (a minimal C sketch
of this step follows after the list)
5. hotremove the memory device
You can hotremove the memory device by the hardware, or writing 1 to
/sys/bus/acpi/devices/PNP0C80:XX/eject.
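
As a concrete illustration of step 4, here is a minimal user-space sketch
that onlines one memory block through sysfs (the block number memory8 and the
error handling are illustrative only):

#include <errno.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	const char *path = "/sys/devices/system/memory/memory8/state";
	FILE *f = fopen(path, "w");

	if (!f) {
		fprintf(stderr, "open %s: %s\n", path, strerror(errno));
		return 1;
	}
	/* write "offline" here instead to take the block down again */
	if (fputs("online", f) == EOF)
		fprintf(stderr, "write %s: %s\n", path, strerror(errno));

	return fclose(f) ? 1 : 0;
}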

Note: if the memory provided by the memory device is used by the kernel, it
can't be offlined. It is not a bug.

Known problems:
1. memory can't be offlined when CONFIG_MEMCG is selected.
For example: there is a memory device on node 1. The address range
is [1G, 1.5G). You will find 4 new directories memory8, memory9, memory10,
and memory11 under the directory /sys/devices/system/memory/.
If CONFIG_MEMCG is selected, we will allocate memory to store the page cgroup
when we online pages. When we online memory8, the memory that stores its page
cgroup is not provided by this memory device. But when we online memory9, the
memory that stores its page cgroup may be provided by memory8. So we can't
offline memory8 now. We should offline the memory in the reverse order.
When the memory device is hotremoved, we will auto offline memory provided
by this memory device. But we don't know which memory is onlined first, so
offlining memory may fail. In such case, you should offline the memory by
hand before hotremoving the memory device.
2. hotremoving the memory device may cause a kernel panic
This bug will be fixed by Liu Jiang's patch:
https://lkml.org/lkml/2012/7/3/1

change log of v9:
  [RFC PATCH v9 8/21]
* add a lock to protect the list map_entries
* add an indicator to firmware_map_entry to remember whether the memory
  is allocated from bootmem
  [RFC PATCH v9 10/21]
* change the macro to inline function
  [RFC PATCH v9 19/21]
* don't offline the node if the cpu on the node is onlined
  [RFC PATCH v9 21/21]
* create new patch: auto offline page_cgroup when onlining memory block
  failed

change log of v8:
  [RFC PATCH v8 17/20]
* Fix problems when one node's range include the other nodes
  [RFC PATCH v8 18/20]
* fix building error when CONFIG_MEMORY_HOTPLUG_SPARSE or CONFIG_HUGETLBFS
  is not defined.
  [RFC PATCH v8 19/20]
* don't offline node when some memory sections are not removed
  [RFC PATCH v8 20/20]
* create new patch: clear hwpoisoned flag when onlining pages

change log of v7:
  [RFC PATCH v7 4/19]
* do not continue if acpi_memory_device_remove_memory() fails.
  [RFC PATCH v7 15/19]
* handle usemap in register_page_bootmem_info_section() too.

change log of v6:
  [RFC PATCH v6 12/19]
* fix build errors on architectures other than x86

  [RFC PATCH v6 15-16/19]
* fix build errors on architectures other than x86

change log of v5:
  * merge the patchset to clear page table and the patchset to hot remove
memory(from ishimatsu) to one big patchset.

  [RFC PATCH v5 1/19]
* rename remove_memory() to offline_memory()/offline_pages()

  [RFC PATCH v5 2/19]
* new patch: implement offline_memory(). This function offlines pages,
  update memory block's state, and notify the userspace that the memory
  block's state is changed.

  [RFC PATCH v5 4/19]
* offline and remove memory in acpi_memory_disable_device() too.

  [RFC PATCH v5 17/19]
* new patch: add a new function __remove_zone() to revert the things done
  in the function __add_zone().

  [RFC PATCH v5 18/19]
* flush work befor reseting node device.

change log of v4:
  * remove memory-hotplug : unify argument of firmware_map_add_early/hotplug
from the patch series, since the patch is a bugfix. It is being discussed
on another thread. But for testing the patch series, the patch is needed.
So I added 

Re: [RFC v9 PATCH 04/21] memory-hotplug: offline and remove memory when removing the memory device

2012-09-27 Thread Ni zhan Chen

On 09/05/2012 05:25 PM, we...@cn.fujitsu.com wrote:

From: Yasuaki Ishimatsu 

We should offline and remove memory when removing the memory device.
The memory device can be removed by 2 ways:
1. send eject request by SCI
2. echo 1 >/sys/bus/pci/devices/PNP0C80:XX/eject

In the 1st case, acpi_memory_disable_device() will be called. In the 2nd
case, acpi_memory_device_remove() will be called. acpi_memory_device_remove()
will also be called when we unbind the memory device from the driver
acpi_memhotplug. If the type is ACPI_BUS_REMOVAL_EJECT, it means
that the user wants to eject the memory device, and we should offline
and remove memory in acpi_memory_device_remove().

The function remove_memory() is not implemented yet. It only checks whether
all memory has been offlined.
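
A rough guess at the shape of that stub (not the exact hunk, which is only
partially quoted below; it relies on the is_memblk_offline() helper this
patch adds to drivers/base/memory.c):

int remove_memory(int nid, u64 start, u64 size)
{
	int ret = 0;

	lock_memory_hotplug();
	/* removal is refused until every section in the range is offline */
	if (!is_memblk_offline(start, size)) {
		pr_warn("memory range [%#llx-%#llx] is not offline\n",
			(unsigned long long)start,
			(unsigned long long)(start + size - 1));
		ret = -EAGAIN;
	}
	unlock_memory_hotplug();

	return ret;
}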

CC: David Rientjes 
CC: Jiang Liu 
CC: Len Brown 
CC: Benjamin Herrenschmidt 
CC: Paul Mackerras 
CC: Christoph Lameter 
Cc: Minchan Kim 
CC: Andrew Morton 
CC: KOSAKI Motohiro 
Signed-off-by: Yasuaki Ishimatsu 
Signed-off-by: Wen Congyang 
---
  drivers/acpi/acpi_memhotplug.c |   45 +--
  drivers/base/memory.c  |   39 ++
  include/linux/memory.h |5 
  include/linux/memory_hotplug.h |5 
  mm/memory_hotplug.c|   22 +++
  5 files changed, 109 insertions(+), 7 deletions(-)

diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
index 7873832..9d47458 100644
--- a/drivers/acpi/acpi_memhotplug.c
+++ b/drivers/acpi/acpi_memhotplug.c
@@ -29,6 +29,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  #include 
  #include 
@@ -310,25 +311,44 @@ static int acpi_memory_powerdown_device(struct 
acpi_memory_device *mem_device)
return 0;
  }
  
-static int acpi_memory_disable_device(struct acpi_memory_device *mem_device)

+static int
+acpi_memory_device_remove_memory(struct acpi_memory_device *mem_device)
  {
int result;
struct acpi_memory_info *info, *n;
+   int node = mem_device->nid;
  
-

-	/*
-	 * Ask the VM to offline this memory range.
-	 * Note: Assume that this function returns zero on success
-	 */
 	list_for_each_entry_safe(info, n, &mem_device->res_list, list) {
if (info->enabled) {
result = offline_memory(info->start_addr, info->length);
if (result)
return result;
+
+   result = remove_memory(node, info->start_addr,
+  info->length);
+   if (result)
+   return result;
}
+
+		list_del(&info->list);
kfree(info);
}
  
+	return 0;

+}
+
+static int acpi_memory_disable_device(struct acpi_memory_device *mem_device)
+{
+   int result;
+
+   /*
+* Ask the VM to offline this memory range.
+* Note: Assume that this function returns zero on success
+*/
+   result = acpi_memory_device_remove_memory(mem_device);
+   if (result)
+   return result;
+
/* Power-off and eject the device */
result = acpi_memory_powerdown_device(mem_device);
if (result) {
@@ -477,12 +497,23 @@ static int acpi_memory_device_add(struct acpi_device 
*device)
  static int acpi_memory_device_remove(struct acpi_device *device, int type)
  {
struct acpi_memory_device *mem_device = NULL;
-
+   int result;
  
  	if (!device || !acpi_driver_data(device))

return -EINVAL;
  
  	mem_device = acpi_driver_data(device);

+
+   if (type == ACPI_BUS_REMOVAL_EJECT) {
+   /*
+* offline and remove memory only when the memory device is
+* ejected.
+*/
+   result = acpi_memory_device_remove_memory(mem_device);
+   if (result)
+   return result;
+   }
+
kfree(mem_device);
  
  	return 0;

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 86c8821..038be73 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -70,6 +70,45 @@ void unregister_memory_isolate_notifier(struct 
notifier_block *nb)
  }
  EXPORT_SYMBOL(unregister_memory_isolate_notifier);
  
+bool is_memblk_offline(unsigned long start, unsigned long size)

+{
+   struct memory_block *mem = NULL;
+   struct mem_section *section;
+   unsigned long start_pfn, end_pfn;
+   unsigned long pfn, section_nr;
+
+   start_pfn = PFN_DOWN(start);
+   end_pfn = PFN_UP(start + size);
+
+   for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
+   section_nr = pfn_to_section_nr(pfn);
+   if (!present_section_nr(section_nr))
+   continue;
+
+   section = __nr_to_section(section_nr);
+   /* same memblock? */
+   if (mem)
+   

Re: [RFC v9 PATCH 05/21] memory-hotplug: check whether memory is present or not

2012-09-27 Thread Ni zhan Chen

On 09/11/2012 10:24 AM, Yasuaki Ishimatsu wrote:

Hi Wen,

2012/09/11 11:15, Wen Congyang wrote:

Hi, ishimatsu

At 09/05/2012 05:25 PM, we...@cn.fujitsu.com Wrote:

From: Yasuaki Ishimatsu 

If the system supports memory hot-remove, online_pages() may online
removed pages. So online_pages() needs to check whether the pages being
onlined are present or not.


Because we use memory_block_change_state() for hot-removing memory, I think
this patch can be removed. What do you think?


Please teach me the details a little more. If we use
memory_block_change_state(), does the conflict never occur? Why?


since memory hot-add or hot-remove is based on memblock, can a check in
memory_block_change_state() guarantee that the conflict never occurs?



Thansk,
Yasuaki Ishimatsu


Thanks
Wen Congyang



CC: David Rientjes 
CC: Jiang Liu 
CC: Len Brown 
CC: Benjamin Herrenschmidt 
CC: Paul Mackerras 
CC: Christoph Lameter 
Cc: Minchan Kim 
CC: Andrew Morton 
CC: KOSAKI Motohiro 
CC: Wen Congyang 
Signed-off-by: Yasuaki Ishimatsu 
---
  include/linux/mmzone.h |   19 +++
  mm/memory_hotplug.c|   13 +
  2 files changed, 32 insertions(+), 0 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 2daa54f..ac3ae30 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1180,6 +1180,25 @@ void sparse_init(void);
  #define sparse_index_init(_sec, _nid)  do {} while (0)
  #endif /* CONFIG_SPARSEMEM */

+#ifdef CONFIG_SPARSEMEM
+static inline int pfns_present(unsigned long pfn, unsigned long 
nr_pages)

+{
+	int i;
+	for (i = 0; i < nr_pages; i++) {
+		if (pfn_present(pfn + i))
+			continue;
+		else
+			return -EINVAL;
+	}
+	return 0;
+}
+#else
+static inline int pfns_present(unsigned long pfn, unsigned long 
nr_pages)

+{
+	return 0;
+}
+#endif /* CONFIG_SPARSEMEM*/
+
  #ifdef CONFIG_NODES_SPAN_OTHER_NODES
  bool early_pfn_in_nid(unsigned long pfn, int nid);
  #else
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 49f7747..299747d 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -467,6 +467,19 @@ int __ref online_pages(unsigned long pfn, 
unsigned long nr_pages)

  struct memory_notify arg;

  lock_memory_hotplug();
+	/*
+	 * If the system supports memory hot-remove, the memory may have been
+	 * removed. So we check whether the memory has been removed or not.
+	 *
+	 * Note: When CONFIG_SPARSEMEM is defined, pfns_present() becomes
+	 *	 effective. If CONFIG_SPARSEMEM is not defined, pfns_present()
+	 *	 always returns 0.
+	 */
+	ret = pfns_present(pfn, nr_pages);
+	if (ret) {
+		unlock_memory_hotplug();
+		return ret;
+	}
  arg.start_pfn = pfn;
  arg.nr_pages = nr_pages;
  arg.status_change_nid = -1;





--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majord...@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: em...@kvack.org



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

