Re: [PATCH] sysctl_panic_on_oom broken

2007-04-17 Thread Yasunori Goto
 On Tue, 17 Apr 2007, Larry Woodman wrote:
 
  out_of_memory() does not panic when sysctl_panic_on_oom is set
  if constrained_alloc() does not return CONSTRAINT_NONE.  Instead,
  out_of_memory() kills the current process whenever constrained_alloc()
  returns either CONSTRAINT_MEMORY_POLICY or CONSTRAINT_CPUSET.
  This patch fixes this problem:
 
 It recreates the old problem that we OOM while we still have memory 
 in other parts of the system.

Hmm. The user's expectation is fast failover of the cluster, triggered
by a panic.  Even if free memory remains due to a cpuset/mempolicy
setting, some people may want failover as soon as possible.

Of course, other people don't want a panic while free memory remains.
I think it depends on the user.

How about this: if panic_on_oom is 1, panic only when mempolicy/cpusets
are not in use; and if panic_on_oom is 2, panic in all cases.
This might be desirable.
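To make the proposal concrete, here is a minimal sketch of the intended
semantics, reusing sysctl_panic_on_oom and constrained_alloc() from the
existing out_of_memory() code.  This is an illustration of the idea, not
a tested patch:

	enum oom_constraint constraint = constrained_alloc(zonelist, gfp_mask);

	if (sysctl_panic_on_oom == 2)
		/* panic in all cases, for fast failover */
		panic("out of memory. Compulsory panic_on_oom is selected.\n");

	if (constraint == CONSTRAINT_NONE && sysctl_panic_on_oom == 1)
		/* no cpuset/mempolicy limit, so the whole system is OOM */
		panic("out of memory. panic_on_oom is selected.\n");

The forced check comes first, so a value of 2 panics even when the OOM
is constrained to a subset of nodes.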

Bye.


-- 
Yasunori Goto 




Re: [PATCH] Make new setting of panic_on_oom

2007-04-20 Thread Yasunori Goto

 read_lock(&tasklist_lock);

   + if (sysctl_panic_on_oom == 2)
   + panic("out of memory. Compulsory panic_on_oom is selected.\n");
   +
  
  Wouldn't it be safer to put the panic before the read_lock()?
 
 I agree. Otherwise the patch seems to be okay.

Ok. This is take 2.
Thanks for your comment.

-

The current panic_on_oom may not work if there is a process using
cpusets/mempolicy, because other nodes' memory may remain free.
But some people want failover by panic as soon as possible, even when
cpusets/mempolicy are used.  This patch adds a new setting for that
request.

This is not tested yet, but it should work.

Please apply.

Signed-off-by: Yasunori Goto [EMAIL PROTECTED]

---
 Documentation/sysctl/vm.txt |   23 +--
 mm/oom_kill.c   |3 +++
 2 files changed, 20 insertions(+), 6 deletions(-)

Index: panic_on_oom2/Documentation/sysctl/vm.txt
===================================================================
--- panic_on_oom2.orig/Documentation/sysctl/vm.txt	2007-04-21 12:39:09.000000000 +0900
+++ panic_on_oom2/Documentation/sysctl/vm.txt	2007-04-21 12:39:58.000000000 +0900
@@ -197,11 +197,22 @@
 
 panic_on_oom
 
-This enables or disables panic on out-of-memory feature.  If this is set to 1,
-the kernel panics when out-of-memory happens.  If this is set to 0, the kernel
-will kill some rogue process, called oom_killer.  Usually, oom_killer can kill
-rogue processes and system will survive.  If you want to panic the system
-rather than killing rogue processes, set this to 1.
+This enables or disables panic on out-of-memory feature.
 
-The default value is 0.
+If this is set to 0, the kernel will kill some rogue process,
+called oom_killer.  Usually, oom_killer can kill rogue processes and
+system will survive.
+
+If this is set to 1, the kernel panics when out-of-memory happens.
+However, if a process limits allocations to certain nodes by using
+mempolicy/cpusets, and those nodes reach memory exhaustion, one
+process may be killed by the oom-killer.  No panic occurs in this
+case, because other nodes' memory may be free and the system as a
+whole may not be in a fatal state yet.
 
+If this is set to 2, the kernel panics compulsorily even in the
+above-mentioned case.
+
+The default value is 0.
+1 and 2 are for failover of clustering.  Please select either
+according to your failover policy.
Index: panic_on_oom2/mm/oom_kill.c
===================================================================
--- panic_on_oom2.orig/mm/oom_kill.c	2007-04-21 12:39:09.000000000 +0900
+++ panic_on_oom2/mm/oom_kill.c	2007-04-21 12:40:31.000000000 +0900
@@ -409,6 +409,9 @@
show_mem();
}
 
+	if (sysctl_panic_on_oom == 2)
+		panic("out of memory. Compulsory panic_on_oom is selected.\n");
+
 	cpuset_lock();
 	read_lock(&tasklist_lock);
 


-- 
Yasunori Goto 




Re: [PATCH]Fix parsing kernelcore boot option for ia64

2007-04-23 Thread Yasunori Goto
 On Fri, 13 Apr 2007 14:26:22 +0900 Yasunori Goto [EMAIL PROTECTED] wrote:
 
  Hello.
  
  cmdline_parse_kernelcore() should return a pointer to the next part of
  the boot option string, as memparse() does.  If it does not, it causes
  an infinite loop on an ia64 box.
  This patch is for 2.6.21-rc6-mm1.
  
  Signed-off-by: Yasunori Goto [EMAIL PROTECTED]
  
  
  
   arch/ia64/kernel/efi.c |2 +-
   include/linux/mm.h |2 +-
   mm/page_alloc.c|4 ++--
   3 files changed, 4 insertions(+), 4 deletions(-)
  
  Index: current_test/arch/ia64/kernel/efi.c
  ===================================================================
  --- current_test.orig/arch/ia64/kernel/efi.c	2007-04-12 17:33:28.000000000 +0900
  +++ current_test/arch/ia64/kernel/efi.c	2007-04-13 12:13:21.000000000 +0900
  @@ -424,7 +424,7 @@ efi_init (void)
   		} else if (memcmp(cp, "max_addr=", 9) == 0) {
   			max_addr = GRANULEROUNDDOWN(memparse(cp + 9, &cp));
   		} else if (memcmp(cp, "kernelcore=", 11) == 0) {
  -			cmdline_parse_kernelcore(cp + 11);
  +			cmdline_parse_kernelcore(cp + 11, &cp);
   		} else if (memcmp(cp, "min_addr=", 9) == 0) {
   			min_addr = GRANULEROUNDDOWN(memparse(cp + 9, &cp));
   		} else {
  Index: current_test/mm/page_alloc.c
  ===================================================================
  --- current_test.orig/mm/page_alloc.c	2007-04-12 18:25:37.000000000 +0900
  +++ current_test/mm/page_alloc.c	2007-04-13 12:12:58.000000000 +0900
  @@ -3736,13 +3736,13 @@ void __init free_area_init_nodes(unsigne
    * kernelcore=size sets the amount of memory for use for allocations that
    * cannot be reclaimed or migrated.
    */
  -int __init cmdline_parse_kernelcore(char *p)
  +int __init cmdline_parse_kernelcore(char *p, char **retp)
   {
   	unsigned long long coremem;
   	if (!p)
   		return -EINVAL;
   
  -	coremem = memparse(p, &p);
  +	coremem = memparse(p, retp);
   	required_kernelcore = coremem >> PAGE_SHIFT;
   
   	/* Paranoid check that UL is enough for required_kernelcore */
  Index: current_test/include/linux/mm.h
  ===================================================================
  --- current_test.orig/include/linux/mm.h	2007-04-11 14:15:33.000000000 +0900
  +++ current_test/include/linux/mm.h	2007-04-13 12:12:20.000000000 +0900
  @@ -1051,7 +1051,7 @@ extern unsigned long find_max_pfn_with_a
   extern void free_bootmem_with_active_regions(int nid,
   					unsigned long max_low_pfn);
   extern void sparse_memory_present_with_active_regions(int nid);
  -extern int cmdline_parse_kernelcore(char *p);
  +extern int cmdline_parse_kernelcore(char *p, char **retp);
   #ifndef CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID
   extern int early_pfn_to_nid(unsigned long pfn);
   #endif /* CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID */
  
 
 This will cause all other architectures to crash when kernelcore= is used.
 
 I wasn't even aware of this kernelcore thing.  It's pretty nasty-looking.
 Yet another reminder that this code hasn't been properly reviewed in the
 past year or three.

Just now, I'm making memory-unplug patches with current MOVABLE_ZONE
code. So, I might be the first user of it on ia64.

Anyway, I'll try to fix it.


-- 
Yasunori Goto 




Re: [PATCH] Make new setting of panic_on_oom

2007-04-23 Thread Yasunori Goto

I tested this patch. It worked well.
So, I fixed its description.

Please apply.


--

The current panic_on_oom may not work if there is a process using
cpusets/mempolicy, because other nodes' memory may remain free.
But some people want failover by panic as soon as possible, even when
cpusets/mempolicy are used.  This patch adds a new setting for that
request.

This is tested on my ia64 box which has 3 nodes.

Please apply.

Signed-off-by: Yasunori Goto [EMAIL PROTECTED]
Signed-off-by: Benjamin LaHaise [EMAIL PROTECTED]


---
 Documentation/sysctl/vm.txt |   23 +--
 mm/oom_kill.c   |3 +++
 2 files changed, 20 insertions(+), 6 deletions(-)

Index: panic_on_oom2/Documentation/sysctl/vm.txt
===================================================================
--- panic_on_oom2.orig/Documentation/sysctl/vm.txt	2007-04-21 12:39:09.000000000 +0900
+++ panic_on_oom2/Documentation/sysctl/vm.txt	2007-04-21 12:39:58.000000000 +0900
@@ -197,11 +197,22 @@
 
 panic_on_oom
 
-This enables or disables panic on out-of-memory feature.  If this is set to 1,
-the kernel panics when out-of-memory happens.  If this is set to 0, the kernel
-will kill some rogue process, called oom_killer.  Usually, oom_killer can kill
-rogue processes and system will survive.  If you want to panic the system
-rather than killing rogue processes, set this to 1.
+This enables or disables panic on out-of-memory feature.
 
-The default value is 0.
+If this is set to 0, the kernel will kill some rogue process,
+called oom_killer.  Usually, oom_killer can kill rogue processes and
+system will survive.
+
+If this is set to 1, the kernel panics when out-of-memory happens.
+However, if a process limits allocations to certain nodes by using
+mempolicy/cpusets, and those nodes reach memory exhaustion, one
+process may be killed by the oom-killer.  No panic occurs in this
+case, because other nodes' memory may be free and the system as a
+whole may not be in a fatal state yet.
 
+If this is set to 2, the kernel panics compulsorily even in the
+above-mentioned case.
+
+The default value is 0.
+1 and 2 are for failover of clustering.  Please select either
+according to your failover policy.
Index: panic_on_oom2/mm/oom_kill.c
===================================================================
--- panic_on_oom2.orig/mm/oom_kill.c	2007-04-21 12:39:09.000000000 +0900
+++ panic_on_oom2/mm/oom_kill.c	2007-04-21 12:40:31.000000000 +0900
@@ -409,6 +409,9 @@
show_mem();
}
 
+	if (sysctl_panic_on_oom == 2)
+		panic("out of memory. Compulsory panic_on_oom is selected.\n");
+
 	cpuset_lock();
 	read_lock(&tasklist_lock);
 

-- 
Yasunori Goto 




Re: [PATCH]Fix parsing kernelcore boot option for ia64

2007-04-24 Thread Yasunori Goto
Mel-san.

I tested your patch (Thanks!). It worked. But..

 In my understanding, the reason why ia64 doesn't use the early_param()
 macro for mem= et al. is that
 it has to use the mem= option in the efi handling, which is called before
 parse_early_param().
 
 Current ia64's boot path is
  setup_arch()
   -> efi handling -> parse_early_param() -> numa handling -> pgdat/zone init
 
 The kernelcore= option is just used at pgdat/zone initialization. (no arch 
 dependent part...)
 
 So I think just adding
 ==
 early_param("kernelcore", cmdline_parse_kernelcore)
 ==
 to ia64 is ok.

Then, it can be common code.
How about this patch? I confirmed that this works well too.
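For reference, a sketch of the common-code version, pieced together from
the description above and the earlier patch; treat the exact placement
and linkage as assumptions, since the mm/page_alloc.c hunk of the diff
below is truncated in this archive:

	/* in mm/page_alloc.c: the parser goes back to the one-argument form */
	static int __init cmdline_parse_kernelcore(char *p)
	{
		unsigned long long coremem;

		if (!p)
			return -EINVAL;

		coremem = memparse(p, &p);
		required_kernelcore = coremem >> PAGE_SHIFT;
		return 0;
	}
	early_param("kernelcore", cmdline_parse_kernelcore);

With this, every architecture that calls parse_early_param() picks up
kernelcore= with no arch-dependent code.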



When the kernelcore boot option is specified, the kernel can't boot up
on ia64 because of an infinite loop.
In addition, this code can be common code. This patch fixes both.
I tested this patch on my ia64 box.


Signed-off-by: Yasunori Goto [EMAIL PROTECTED]

-

 arch/i386/kernel/setup.c   |1 -
 arch/ia64/kernel/efi.c |2 --
 arch/powerpc/kernel/prom.c |1 -
 arch/ppc/mm/init.c |2 --
 arch/x86_64/kernel/e820.c  |1 -
 include/linux/mm.h |1 -
 mm/page_alloc.c|3 +++
 7 files changed, 3 insertions(+), 8 deletions(-)

Index: kernelcore/arch/ia64/kernel/efi.c
===================================================================
--- kernelcore.orig/arch/ia64/kernel/efi.c	2007-04-24 15:09:37.000000000 +0900
+++ kernelcore/arch/ia64/kernel/efi.c	2007-04-24 15:25:22.000000000 +0900
@@ -423,8 +423,6 @@ efi_init (void)
 			mem_limit = memparse(cp + 4, &cp);
 		} else if (memcmp(cp, "max_addr=", 9) == 0) {
 			max_addr = GRANULEROUNDDOWN(memparse(cp + 9, &cp));
-		} else if (memcmp(cp, "kernelcore=", 11) == 0) {
-			cmdline_parse_kernelcore(cp + 11);
 		} else if (memcmp(cp, "min_addr=", 9) == 0) {
 			min_addr = GRANULEROUNDDOWN(memparse(cp + 9, &cp));
 		} else {
Index: kernelcore/arch/i386/kernel/setup.c
===================================================================
--- kernelcore.orig/arch/i386/kernel/setup.c	2007-04-24 15:29:20.000000000 +0900
+++ kernelcore/arch/i386/kernel/setup.c	2007-04-24 15:29:39.000000000 +0900
@@ -195,7 +195,6 @@ static int __init parse_mem(char *arg)
 	return 0;
 }
 early_param("mem", parse_mem);
-early_param("kernelcore", cmdline_parse_kernelcore);
 
 #ifdef CONFIG_PROC_VMCORE
 /* elfcorehdr= specifies the location of elf core header
Index: kernelcore/arch/powerpc/kernel/prom.c
===================================================================
--- kernelcore.orig/arch/powerpc/kernel/prom.c	2007-04-24 15:04:47.000000000 +0900
+++ kernelcore/arch/powerpc/kernel/prom.c	2007-04-24 15:30:25.000000000 +0900
@@ -431,7 +431,6 @@ static int __init early_parse_mem(char *
 	return 0;
 }
 early_param("mem", early_parse_mem);
-early_param("kernelcore", cmdline_parse_kernelcore);
 
 /*
  * The device tree may be allocated below our memory limit, or inside the
Index: kernelcore/arch/ppc/mm/init.c
===================================================================
--- kernelcore.orig/arch/ppc/mm/init.c	2007-04-24 15:04:47.000000000 +0900
+++ kernelcore/arch/ppc/mm/init.c	2007-04-24 15:30:56.000000000 +0900
@@ -214,8 +214,6 @@ void MMU_setup(void)
 	}
 }
 
-early_param("kernelcore", cmdline_parse_kernelcore);
-
 /*
  * MMU_init sets up the basic memory mappings for the kernel,
  * including both RAM and possibly some I/O regions,
Index: kernelcore/arch/x86_64/kernel/e820.c
===================================================================
--- kernelcore.orig/arch/x86_64/kernel/e820.c	2007-04-24 15:04:47.000000000 +0900
+++ kernelcore/arch/x86_64/kernel/e820.c	2007-04-24 15:34:02.000000000 +0900
@@ -604,7 +604,6 @@ static int __init parse_memopt(char *p)
 	return 0;
 }
 early_param("mem", parse_memopt);
-early_param("kernelcore", cmdline_parse_kernelcore);
 
 static int userdef __initdata;
 
Index: kernelcore/include/linux/mm.h
===================================================================
--- kernelcore.orig/include/linux/mm.h	2007-04-24 15:09:37.000000000 +0900
+++ kernelcore/include/linux/mm.h	2007-04-24 15:35:52.000000000 +0900
@@ -1051,7 +1051,6 @@ extern unsigned long find_max_pfn_with_a
 extern void free_bootmem_with_active_regions(int nid,
unsigned long max_low_pfn);
 extern void sparse_memory_present_with_active_regions(int nid);
-extern int cmdline_parse_kernelcore(char *p);
 #ifndef CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID
 extern int early_pfn_to_nid(unsigned long pfn);
 #endif /* CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID */
Index: kernelcore/mm/page_alloc.c
===================================================================
--- kernelcore.orig/mm/page_alloc.c	2007-04-24 15:09:37.000000000 +0900
+++ kernelcore/mm/page_alloc.c	2007-04-24 16:00:21.000000000 +0900

Re: [PATCH]Fix parsing kernelcore boot option for ia64

2007-04-24 Thread Yasunori Goto


 Subject: Check zone boundaries when freeing bootmem
 Zone boundaries do not have to be aligned to MAX_ORDER_NR_PAGES. 

Hmm. I don't understand here yet... Could you explain more?

This issue occurs only when ZONE_MOVABLE is specified.
If its boundary is aligned to MAX_ORDER automatically,
I guess users will not mind it.

From the memory hotplug view, I prefer section-size alignment, to keep
the code simple. :-P


 However,
 during boot, there is an implicit assumption that they are aligned to a
 BITS_PER_LONG boundary when freeing pages as quickly as possible. This
 patch checks the zone boundaries when freeing pages from the bootmem 
 allocator.

Anyway, the patch works well.

Bye.

-- 
Yasunori Goto 




Re: [PATCH 2/2] Align ZONE_MOVABLE to a MAX_ORDER_NR_PAGES boundary

2007-04-24 Thread Yasunori Goto
Looks good. :-)
Thanks.

Acked-by: Yasunori Goto [EMAIL PROTECTED]


 
 The boot memory allocator makes assumptions on the alignment of zone
 boundaries even though the buddy allocator has no requirements on the
 alignment of zones. This may cause boot problems in situations where
 ZONE_MOVABLE is populated because the bootmem allocator assumes zones are
 at least order-log2(BITS_PER_LONG) aligned. As the two potential users
 (huge pages and memory hot-remove) of ZONE_MOVABLE would prefer a higher
 alignment, this patch aligns the start of the zone instead of fixing the
 different assumptions made by the bootmem allocator.
 
 This patch rounds the start of ZONE_MOVABLE in each node to a
 MAX_ORDER_NR_PAGES boundary. If the rounding pushes the start of ZONE_MOVABLE
 above the end of the node then the zone will contain no memory and will not
 be used at runtime. The value is rounded up instead of down as it is
 better to have the kernel-portion of memory larger than requested instead
 of smaller. The impact is that the kernel-usable portion of memory becomes
 a minimum guarantee instead of the exact size requested by the user.
 
 
 Signed-off-by: Mel Gorman [EMAIL PROTECTED]
 Acked-by: Andy Whitcroft [EMAIL PROTECTED]
 ---
 
  page_alloc.c |5 +
  1 files changed, 5 insertions(+)
 
 diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.21-rc6-mm1-002_commonparse/mm/page_alloc.c linux-2.6.21-rc6-mm1-003_alignmovable/mm/page_alloc.c
 --- linux-2.6.21-rc6-mm1-002_commonparse/mm/page_alloc.c	2007-04-24 09:38:30.000000000 +0100
 +++ linux-2.6.21-rc6-mm1-003_alignmovable/mm/page_alloc.c	2007-04-24 11:15:40.000000000 +0100
 @@ -3642,6 +3642,11 @@ restart:
  	usable_nodes--;
  	if (usable_nodes && required_kernelcore > usable_nodes)
  		goto restart;
  +
  +	/* Align start of ZONE_MOVABLE on all nids to MAX_ORDER_NR_PAGES */
  +	for (nid = 0; nid < MAX_NUMNODES; nid++)
  +		zone_movable_pfn[nid] =
  +			roundup(zone_movable_pfn[nid], MAX_ORDER_NR_PAGES);
  }
  
  /**

-- 
Yasunori Goto 




Re: [PATCH] FRV: Fix unannotated variable declarations

2007-03-21 Thread Yasunori Goto
 From: David Howells [EMAIL PROTECTED]
 
 Fix unannotated variable declarations.  Variables that have allocation
 section annotations (such as __meminitdata) on their definitions must also
 have them on their declarations, as not doing so may affect the addressing
 mode used by the compiler and may result in a linker error.

Right. Thanks.

Acked-by: Yasunori Goto [EMAIL PROTECTED]


-- 
Yasunori Goto 




Re: 2.6.21-rc4-mm1 + 3 hot-fixes -- WARNING: could not find versions for .tmp_versions/built-in.mod

2007-03-23 Thread Yasunori Goto
Hello.

  WARNING: mm/built-in.o - Section mismatch: reference to
  .init.text:__alloc_bootmem_node from .text between 'sparse_init' (at
  offset 0x15c8f) and '__section_nr'
 I took a look at this one.
 You have SPARSEMEM enabled in your config.
 And then I see that in sparse.c we call alloc_bootmem_node()
 from a function I thought should be marked __devinit (it
 is used by memory_hotplug.c).
 But I am not familiar enough to judge whether __alloc_bootmem_node
 being marked __init, or __devinit (to say
 it is used in the HOTPLUG case), is more correct.
 Anyone?
 
  WARNING: mm/built-in.o - Section mismatch: reference to
  .init.text:__alloc_bootmem_node from .text between 'sparse_init' (at
  offset 0x15d02) and '__section_nr'
 Same as above

Memory hotplug code has __meminit for this purpose.
But, I suspect that many other places in the memory hotplug code may have
the same issue. I will chase them.

BTW, does -mm check code more strictly than the stock kernel? I can't see
these warnings in 2.6.21-rc4.

Bye.

-- 
Yasunori Goto 




[PATCH]Fix parsing kernelcore boot option for ia64

2007-04-12 Thread Yasunori Goto
Hello.

cmdline_parse_kernelcore() should return a pointer to the next part of
the boot option string, as memparse() does.  If it does not, it causes
an infinite loop on an ia64 box.
This patch is for 2.6.21-rc6-mm1.
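The contract matters because the caller scans the command line in a loop
and relies on each parser to advance the cursor.  A small user-space
sketch of the failure mode; parse_size() here is a stand-in for
memparse(), and the scan loop is simplified from the real efi_init():

	#include <stdlib.h>
	#include <string.h>

	/* like memparse(): parse a number and report where parsing stopped */
	static unsigned long long parse_size(const char *s, const char **retp)
	{
		char *end;
		unsigned long long v = strtoull(s, &end, 0);

		*retp = end;
		return v;
	}

	int main(void)
	{
		const char *cp = "kernelcore=32 max_addr=4096";

		while (*cp) {
			if (strncmp(cp, "kernelcore=", 11) == 0)
				parse_size(cp + 11, &cp);	/* cp must advance */
			else
				cp++;			/* skip other text */
		}
		/* If the parser ignored retp, cp would never advance past
		 * "kernelcore=" and this loop would spin forever; that is
		 * the hang this patch fixes. */
		return 0;
	}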

Signed-off-by: Yasunori Goto [EMAIL PROTECTED]



 arch/ia64/kernel/efi.c |2 +-
 include/linux/mm.h |2 +-
 mm/page_alloc.c|4 ++--
 3 files changed, 4 insertions(+), 4 deletions(-)

Index: current_test/arch/ia64/kernel/efi.c
===================================================================
--- current_test.orig/arch/ia64/kernel/efi.c	2007-04-12 17:33:28.000000000 +0900
+++ current_test/arch/ia64/kernel/efi.c	2007-04-13 12:13:21.000000000 +0900
@@ -424,7 +424,7 @@ efi_init (void)
 		} else if (memcmp(cp, "max_addr=", 9) == 0) {
 			max_addr = GRANULEROUNDDOWN(memparse(cp + 9, &cp));
 		} else if (memcmp(cp, "kernelcore=", 11) == 0) {
-			cmdline_parse_kernelcore(cp + 11);
+			cmdline_parse_kernelcore(cp + 11, &cp);
 		} else if (memcmp(cp, "min_addr=", 9) == 0) {
 			min_addr = GRANULEROUNDDOWN(memparse(cp + 9, &cp));
 		} else {
Index: current_test/mm/page_alloc.c
===================================================================
--- current_test.orig/mm/page_alloc.c	2007-04-12 18:25:37.000000000 +0900
+++ current_test/mm/page_alloc.c	2007-04-13 12:12:58.000000000 +0900
@@ -3736,13 +3736,13 @@ void __init free_area_init_nodes(unsigne
  * kernelcore=size sets the amount of memory for use for allocations that
  * cannot be reclaimed or migrated.
  */
-int __init cmdline_parse_kernelcore(char *p)
+int __init cmdline_parse_kernelcore(char *p, char **retp)
 {
 	unsigned long long coremem;
 	if (!p)
 		return -EINVAL;
 
-	coremem = memparse(p, &p);
+	coremem = memparse(p, retp);
 	required_kernelcore = coremem >> PAGE_SHIFT;
 
 	/* Paranoid check that UL is enough for required_kernelcore */
Index: current_test/include/linux/mm.h
===================================================================
--- current_test.orig/include/linux/mm.h	2007-04-11 14:15:33.000000000 +0900
+++ current_test/include/linux/mm.h	2007-04-13 12:12:20.000000000 +0900
@@ -1051,7 +1051,7 @@ extern unsigned long find_max_pfn_with_a
 extern void free_bootmem_with_active_regions(int nid,
 					unsigned long max_low_pfn);
 extern void sparse_memory_present_with_active_regions(int nid);
-extern int cmdline_parse_kernelcore(char *p);
+extern int cmdline_parse_kernelcore(char *p, char **retp);
 #ifndef CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID
 extern int early_pfn_to_nid(unsigned long pfn);
 #endif /* CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID */

-- 
Yasunori Goto 




[PATCH] fix BUG_ON check at move_freepages() (Re: 2.6.21-rc3-mm2)

2007-03-07 Thread Yasunori Goto

Hello.

The BUG_ON() check at move_freepages() is wrong.
Its end_page is start_page + MAX_ORDER_NR_PAGES, i.e. exclusive, so it
can point into the next zone.  BUG_ON() should check end_page - 1, the
last page actually moved.

This is a fix for 2.6.21-rc3-mm2.

Signed-off-by: Yasunori Goto [EMAIL PROTECTED]

---
 mm/page_alloc.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: current_test/mm/page_alloc.c
===================================================================
--- current_test.orig/mm/page_alloc.c	2007-03-08 15:44:10.000000000 +0900
+++ current_test/mm/page_alloc.c	2007-03-08 16:17:29.000000000 +0900
@@ -707,7 +707,7 @@ int move_freepages(struct zone *zone,
unsigned long order;
int blocks_moved = 0;
 
-	BUG_ON(page_zone(start_page) != page_zone(end_page));
+	BUG_ON(page_zone(start_page) != page_zone(end_page - 1));
 
 	for (page = start_page; page < end_page;) {
if (!PageBuddy(page)) {

-- 
Yasunori Goto 




[RFC:PATCH] Register memory init functions into white list of section mismatch.

2007-03-28 Thread Yasunori Goto

WARNING: mm/built-in.o - Section mismatch: reference to
.init.text:__alloc_bootmem_node from .text between 'sparse_init' (at
offset 0x15c8f) and '__section_nr'
   I took a look at this one.
   You have SPARSEMEM enabled in your config.
   And then I see that in sparse.c we call alloc_bootmem_node()
   from a function I thought should be marked __devinit (it
   is used by memory_hotplug.c).
   But I am not familiar enough to judge whether __alloc_bootmem_node
   being marked __init, or __devinit (to say
   it is used in the HOTPLUG case), is more correct.
   Anyone?
   
WARNING: mm/built-in.o - Section mismatch: reference to
.init.text:__alloc_bootmem_node from .text between 'sparse_init' (at
offset 0x15d02) and '__section_nr'
   Same as above
  
  Memory hotplug code has __meminit for its purpose.
  But, I suspect that many other places of memory hotplug code may have
  same issue. I will chase them.


Hello.

  I chased the section mismatch warnings in the memory hotplug code. Many
  of those functions should be defined as __meminit. (This check was a
  great help for finding them. Thanks!)
  But, I would like to add a new pattern to the white list for some of
  them. (I'll post another patch for the others.)

  sparse.c (sparse_index_alloc()) calls alloc_bootmem_node() as you mentioned.
  And, zone_wait_table_init() calls it too.
  These functions call it only at boot time, and call
  vmalloc()/kmalloc() at hotplug time. The two cases are distinguished by
  the system_state value or slab_is_available(). Only the references to
  the bootmem allocator remain in them after boot.

  The bootmem allocation functions are called by many functions and must
  be used only at boot time. I think their __init markings should be kept
  for the section mismatch check. So, I would like to register
  sparse_index_alloc() and zone_wait_table_init() in the white list.

  Please comment. If there is a better way, please let me know...
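  For reference, the boot-vs-hotplug split described above looks roughly
  like this; a simplified sketch with a hypothetical helper name, not the
  actual sparse.c/page_alloc.c code:

	/* one allocation helper reachable both at boot and at hot-add */
	static void *alloc_table(unsigned long size, int nid)
	{
		if (slab_is_available())	/* hotplug path */
			return kmalloc_node(size, GFP_KERNEL, nid);
		/* boot path: slab is not up yet, so use bootmem */
		return alloc_bootmem_node(NODE_DATA(nid), size);
	}

  Because the bootmem call is reachable from a .text function, modpost
  warns, even though that branch can only execute before the init
  sections are freed.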

Thanks.

P.S.
  Pattern 10 is for ia64 (not for memory hotplug).
  ia64's .machvec section is a mixed table of .init functions and normal text.
  It is defined for platform dependent functions. This is also a cause of
  warnings. I think this should be registered too.


Signed-off-by: Yasunori Goto [EMAIL PROTECTED]

---
 mm/page_alloc.c   |2 +-
 mm/sparse.c   |2 +-
 scripts/mod/modpost.c |   29 +
 3 files changed, 31 insertions(+), 2 deletions(-)

Index: current_test/scripts/mod/modpost.c
===================================================================
--- current_test.orig/scripts/mod/modpost.c	2007-03-27 20:21:20.000000000 +0900
+++ current_test/scripts/mod/modpost.c	2007-03-29 14:16:05.000000000 +0900
@@ -643,6 +643,17 @@ static int strrcmp(const char *s, const 
  *  The pattern is:
  *  tosec= .init.text
  *  fromsec  = __ksymtab*
+ *
+ * Pattern 9:
+ *  Some functions are common code between boot time and hotplug
+ *  time. The bootmem allocator is called only at boot time in these
+ *  functions. So it is ok to reference.
+ *  tosec    = .init.text
+ *
+ * Pattern 10:
+ *  ia64 has a machvec table for each platform. It is a mixture of
+ *  function pointers to .init.text and .text.
+ *  fromsec  = .machvec
  **/
 static int secref_whitelist(const char *modname, const char *tosec,
const char *fromsec, const char *atsym,
@@ -669,6 +680,12 @@ static int secref_whitelist(const char *
NULL
};
 
+	const char *pat4sym[] = {
+		"sparse_index_alloc",
+		"zone_wait_table_init",
+		NULL
+	};
 
 	/* Check for pattern 1 */
 	if (strcmp(tosec, ".init.data") != 0)
 		f1 = 0;
@@ -725,6 +742,18 @@ static int secref_whitelist(const char *
 	if ((strcmp(tosec, ".init.text") == 0) &&
 	    (strncmp(fromsec, "__ksymtab", strlen("__ksymtab")) == 0))
 		return 1;
+
+	/* Check for pattern 9 */
+	if ((strcmp(tosec, ".init.text") == 0) &&
+	    (strcmp(fromsec, ".text") == 0))
+		for (s = pat4sym; *s; s++)
+			if (strcmp(atsym, *s) == 0)
+				return 1;
+
+	/* Check for pattern 10 */
+	if (strcmp(fromsec, ".machvec") == 0)
+		return 1;
+
return 0;
 }
 
Index: current_test/mm/page_alloc.c
===================================================================
--- current_test.orig/mm/page_alloc.c	2007-03-27 16:04:41.000000000 +0900
+++ current_test/mm/page_alloc.c	2007-03-29 14:14:42.000000000 +0900
@@ -2673,7 +2673,7 @@ void __init setup_per_cpu_pageset(void)
 
 #endif
 
-static __meminit
+static __meminit noinline
 int zone_wait_table_init(struct zone *zone, unsigned long zone_size_pages)
 {
int i;
Index: current_test/mm/sparse.c
===================================================================
--- current_test.orig/mm/sparse.c	2007-03-27 16:04:41.000000000 +0900
+++ current_test/mm/sparse.c	2007-03-29 14:15:00.000000000 +0900
@@ -44,7 +44,7

[Patch] Fix section mismatch of memory hotplug related code.

2007-04-05 Thread Yasunori Goto
Hello.

This fixes many section mismatches in code related to memory hotplug.
I checked that it compiles with memory hotplug on/off on ia64 and x86-64 boxes.
This patch is for 2.6.21-rc5-mm4.

Please apply.

Signed-off-by: Yasunori Goto [EMAIL PROTECTED]

---

 arch/ia64/mm/discontig.c |2 ++
 arch/x86_64/mm/init.c|6 +++---
 drivers/acpi/numa.c  |4 ++--
 mm/page_alloc.c  |   30 +++---
 mm/sparse.c  |   12 +++-
 5 files changed, 29 insertions(+), 25 deletions(-)

Index: meminit/mm/sparse.c
===================================================================
--- meminit.orig/mm/sparse.c	2007-04-04 20:15:58.000000000 +0900
+++ meminit/mm/sparse.c	2007-04-04 20:55:44.000000000 +0900
@@ -61,7 +61,7 @@ static struct mem_section *sparse_index_
return section;
 }
 
-static int sparse_index_init(unsigned long section_nr, int nid)
+static int __meminit sparse_index_init(unsigned long section_nr, int nid)
 {
static DEFINE_SPINLOCK(index_init_lock);
unsigned long root = SECTION_NR_TO_ROOT(section_nr);
@@ -138,7 +138,7 @@ static inline int sparse_early_nid(struc
 }
 
 /* Record a memory area against a node. */
-void memory_present(int nid, unsigned long start, unsigned long end)
+void __init memory_present(int nid, unsigned long start, unsigned long end)
 {
unsigned long pfn;
 
@@ -197,7 +197,7 @@ struct page *sparse_decode_mem_map(unsig
return ((struct page *)coded_mem_map) + section_nr_to_pfn(pnum);
 }
 
-static int sparse_init_one_section(struct mem_section *ms,
+static int __meminit sparse_init_one_section(struct mem_section *ms,
unsigned long pnum, struct page *mem_map,
unsigned long *pageblock_bitmap)
 {
@@ -211,7 +211,7 @@ static int sparse_init_one_section(struc
return 1;
 }
 
-static struct page *sparse_early_mem_map_alloc(unsigned long pnum)
+static struct page __init *sparse_early_mem_map_alloc(unsigned long pnum)
 {
struct page *map;
struct mem_section *ms = __nr_to_section(pnum);
@@ -301,7 +301,7 @@ static unsigned long *sparse_early_usema
  * Allocate the accumulated non-linear sections, allocate a mem_map
  * for each and record the physical to section mapping.
  */
-void sparse_init(void)
+void __init sparse_init(void)
 {
unsigned long pnum;
struct page *map;
@@ -324,6 +324,7 @@ void sparse_init(void)
}
 }
 
+#ifdef CONFIG_MEMORY_HOTPLUG
 /*
  * returns the number of sections whose mem_maps were properly
  * set.  If this is =0, then that means that the passed-in
@@ -370,3 +371,4 @@ out:
__kfree_section_memmap(memmap, nr_pages);
return ret;
 }
+#endif
Index: meminit/arch/ia64/mm/discontig.c
===================================================================
--- meminit.orig/arch/ia64/mm/discontig.c	2007-04-04 20:15:58.000000000 +0900
+++ meminit/arch/ia64/mm/discontig.c	2007-04-04 20:16:02.000000000 +0900
@@ -696,6 +696,7 @@ void __init paging_init(void)
zero_page_memmap_ptr = virt_to_page(ia64_imva(empty_zero_page));
 }
 
+#ifdef CONFIG_MEMORY_HOTPLUG
 pg_data_t *arch_alloc_nodedata(int nid)
 {
unsigned long size = compute_pernodesize(nid);
@@ -713,3 +714,4 @@ void arch_refresh_nodedata(int update_no
pgdat_list[update_node] = update_pgdat;
scatter_node_data();
 }
+#endif
Index: meminit/mm/page_alloc.c
===================================================================
--- meminit.orig/mm/page_alloc.c	2007-04-04 20:15:58.000000000 +0900
+++ meminit/mm/page_alloc.c	2007-04-04 20:55:44.000000000 +0900
@@ -105,7 +105,7 @@ int min_free_kbytes = 1024;
 
 unsigned long __meminitdata nr_kernel_pages;
 unsigned long __meminitdata nr_all_pages;
-static unsigned long __initdata dma_reserve;
+static unsigned long __meminitdata dma_reserve;
 
 #ifdef CONFIG_ARCH_POPULATES_NODE_MAP
   /*
@@ -128,16 +128,16 @@ static unsigned long __initdata dma_rese
 #endif
   #endif
 
-  struct node_active_region __initdata early_node_map[MAX_ACTIVE_REGIONS];
-  int __initdata nr_nodemap_entries;
-  unsigned long __initdata arch_zone_lowest_possible_pfn[MAX_NR_ZONES];
-  unsigned long __initdata arch_zone_highest_possible_pfn[MAX_NR_ZONES];
+  struct node_active_region __meminitdata early_node_map[MAX_ACTIVE_REGIONS];
+  int __meminitdata nr_nodemap_entries;
+  unsigned long __meminitdata arch_zone_lowest_possible_pfn[MAX_NR_ZONES];
+  unsigned long __meminitdata arch_zone_highest_possible_pfn[MAX_NR_ZONES];
 #ifdef CONFIG_MEMORY_HOTPLUG_RESERVE
   unsigned long __initdata node_boundary_start_pfn[MAX_NUMNODES];
   unsigned long __initdata node_boundary_end_pfn[MAX_NUMNODES];
 #endif /* CONFIG_MEMORY_HOTPLUG_RESERVE */
   unsigned long __initdata required_kernelcore;
-  unsigned long __initdata zone_movable_pfn[MAX_NUMNODES];
+  unsigned long __meminitdata zone_movable_pfn[MAX_NUMNODES];
 
   /* movable_zone is the real zone pages in ZONE_MOVABLE are taken from

[Patch] Add white list into modpost.c for memory hotplug code and ia64's machvec section

2007-04-05 Thread Yasunori Goto
This patch adds a white list to modpost.c for some functions and for
ia64's .machvec section, to fix section mismatches.

  sparse_index_alloc() and zone_wait_table_init() call the bootmem allocator
  at boot time, and kmalloc/vmalloc at hotplug time. If config
  memory hotplug is on, there are references to the bootmem allocator (init
  text) from them (normal text). This is the cause of the section mismatches.

  The bootmem allocator is called by many functions and must be
  used only at boot time. I think their __init markings should be kept for
  the section mismatch check. So, I would like to register sparse_index_alloc()
  and zone_wait_table_init() in the white list.

  In addition, ia64's .machvec section is a function table for some platform
  dependent code. It is a mixture of .init.text and normal text. These
  references to __init functions are valid too.

  This is for 2.6.21-rc5-mm4.

Please apply.

Signed-off-by: Yasunori Goto [EMAIL PROTECTED]

---
 mm/page_alloc.c   |2 +-
 mm/sparse.c   |2 +-
 scripts/mod/modpost.c |   28 
 3 files changed, 30 insertions(+), 2 deletions(-)

Index: current_test/scripts/mod/modpost.c
===================================================================
--- current_test.orig/scripts/mod/modpost.c	2007-04-03 16:04:57.000000000 +0900
+++ current_test/scripts/mod/modpost.c	2007-04-03 16:09:59.000000000 +0900
@@ -649,6 +649,17 @@ static int strrcmp(const char *s, const 
  *  The pattern is:
  *  tosec   = .init.text
  *  fromsec  = .paravirtprobe
+ *
+ * Pattern 10:
+ *  Some functions are common code between boot time and hotplug
+ *  time. The bootmem allocator is called only at boot time in these
+ *  functions. So it is ok to reference.
+ *  tosec    = .init.text
+ *
+ * Pattern 11:
+ *  ia64 has a machvec table for each platform. It is a mixture of
+ *  function pointers to .init.text and .text.
+ *  fromsec  = .machvec
  **/
 static int secref_whitelist(const char *modname, const char *tosec,
const char *fromsec, const char *atsym,
@@ -675,6 +686,12 @@ static int secref_whitelist(const char *
NULL
};
 
+	const char *pat4sym[] = {
+		"sparse_index_alloc",
+		"zone_wait_table_init",
+		NULL
+	};
 
 	/* Check for pattern 1 */
 	if (strcmp(tosec, ".init.data") != 0)
 		f1 = 0;
@@ -738,6 +755,17 @@ static int secref_whitelist(const char *
 	    (strcmp(fromsec, ".paravirtprobe") == 0))
 		return 1;
 
+	/* Check for pattern 10 */
+	if ((strcmp(tosec, ".init.text") == 0) &&
+	    (strcmp(fromsec, ".text") == 0))
+		for (s = pat4sym; *s; s++)
+			if (strcmp(atsym, *s) == 0)
+				return 1;
+
+	/* Check for pattern 11 */
+	if (strcmp(fromsec, ".machvec") == 0)
+		return 1;
+
return 0;
 }
 
Index: current_test/mm/page_alloc.c
===================================================================
--- current_test.orig/mm/page_alloc.c	2007-04-03 16:04:57.000000000 +0900
+++ current_test/mm/page_alloc.c	2007-04-03 16:05:26.000000000 +0900
@@ -2667,7 +2667,7 @@ void __init setup_per_cpu_pageset(void)
 
 #endif
 
-static __meminit
+static __meminit noinline
 int zone_wait_table_init(struct zone *zone, unsigned long zone_size_pages)
 {
int i;
Index: current_test/mm/sparse.c
===================================================================
--- current_test.orig/mm/sparse.c	2007-04-03 16:04:57.000000000 +0900
+++ current_test/mm/sparse.c	2007-04-03 16:05:26.000000000 +0900
@@ -44,7 +44,7 @@ EXPORT_SYMBOL(page_to_nid);
 #endif
 
 #ifdef CONFIG_SPARSEMEM_EXTREME
-static struct mem_section *sparse_index_alloc(int nid)
+static struct mem_section noinline *sparse_index_alloc(int nid)
 {
struct mem_section *section = NULL;
unsigned long array_size = SECTIONS_PER_ROOT *

-- 
Yasunori Goto 




Re: [RFC PATCH 4/4] [RESEND] Recomputing msgmni on memory add / remove

2008-01-15 Thread Yasunori Goto

Hello Nadia-san.

 @@ -118,6 +122,10 @@ struct ipc_namespace {
   size_t  shm_ctlall;
   int shm_ctlmni;
   int shm_tot;
 +
 +#ifdef CONFIG_MEMORY_HOTPLUG
 + struct notifier_block ipc_memory_hotplug;
 +#endif
  };

I'm sorry, but I don't see why each ipc namespace must have its own
memory hotplug callback.
I would prefer only one callback per subsystem, not one per namespace.
In addition, the recompute_msgmni() calculation looks very similar for
all ipc namespaces.
Or do you want each ipc namespace to have a different callback in the future?
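A sketch of the alternative I have in mind: one subsystem-wide callback,
registered once at ipc init time, instead of a notifier_block embedded
in every ipc_namespace.  The callback name and body are hypothetical;
only the notifier interface itself is from the existing hotplug code:

	static int ipc_memory_callback(struct notifier_block *self,
				       unsigned long action, void *arg)
	{
		if (action == MEM_ONLINE || action == MEM_OFFLINE) {
			/* walk every ipc namespace and recompute msgmni;
			 * this is what needs the namespaces linked together */
		}
		return NOTIFY_OK;
	}

	static struct notifier_block ipc_memory_nb = {
		.notifier_call = ipc_memory_callback,
	};

	/* called once, e.g. from ipc_init() */
	register_memory_notifier(&ipc_memory_nb);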



BTW, have you ever tested this patch? If you don't have any test environment
for memory hotplug code, then I'll check it. :-)

Bye.

-- 
Yasunori Goto 




Re: [RFC PATCH 4/4] [RESEND] Recomputing msgmni on memory add / remove

2008-01-15 Thread Yasunori Goto
 Yasunori Goto wrote:
  Hello Nadia-san.
  
  
 @@ -118,6 +122,10 @@ struct ipc_namespace {
 size_t  shm_ctlall;
 int shm_ctlmni;
 int shm_tot;
 +
 +#ifdef CONFIG_MEMORY_HOTPLUG
 +   struct notifier_block ipc_memory_hotplug;
 +#endif
  };
  
  
  I'm sorry, but I don't see why each ipc namespace must have each callbacks
  of memory hotplug.
  I prefer only one callback for each subsystem, not for each namespace.
  In addition, the recompute_msgmni() calculation looks very similar for
  all ipc namespace.
  Or do you wish each ipc namespace have different callback for the future?
  
 
 Actually, this is what I wanted to do at the very beginning: have a 
 single callback that would recompute the msgmni for each ipc namespace. 
 But the issue here is that the namespaces are not linked to each other, 
 so I had no simple way to go through all the namespaces.
 I solved the issue by having a callback for any single ipc namespace and 
 make it recompute the msgmni value for itslef.

The recompute_msgmni() must be called when a new ipc_namespace is
created/removed, as you mentioned. I think the namespaces should be
linked to each other for that, in the end.



  
  BTW, have you ever tested this patch? If you don't have any test environment
  for memory hotplug code, then I'll check it. :-)
 
 Well, I tested it but not in a real configuration: what I did is that I 
 changed the status by hand under sysfs to offline. I also changed 
 remove_memory() in mm/memory_hotplug.c in the following way (instead of 
 returning EINVAL):
 1) decrease the total_ram pages
 2) call memory_notify(MEM_OFFLINE, NULL)
 
 and checked that the msgmni was recomputed.

You can also online it again after offlining, by writing to sysfs.

 But sure, if you are candidate to test it, that would be great!

Ok. I'll check it too.
Bye.

-- 
Yasunori Goto 




Re: [PATCH -mm] mm: Fix memory hotplug + sparsemem build.

2007-09-13 Thread Yasunori Goto
 On Tue, 11 Sep 2007 18:37:12 +0900 Yasunori Goto [EMAIL PROTECTED] wrote:
 
  
+   if (onlined_pages){
   
   Nit, needs a space there before the '{'.
  
  Ah, Ok. I attached fixed patch in this mail.
  
   The problem as I see it is that when we boot the system we start a
   kswapd on all nodes with memory.  If the hot-add adds memory to a
   pre-existing node with no memory we will not start one and we end up
   with a node with memory and no kswapd.  Bad.
   
   As kswapd_run is a no-op when a kswapd already exists this seems a safe
   way to fix that.  Paul's -zone conversion is obviously correct also.
   
   Acked-by: Andy Whitcroft [EMAIL PROTECTED]
  
  Thanks for your explanation.
  You mentioned all of my intention correctly. :-)
  
  
  
  
  Fix kswapd doesn't run when memory is added on memory-less-node.
  Fix compile error of zone-node when CONFIG_NUMA is off.
  
  Signed-off-by: Yasunori Goto [EMAIL PROTECTED]
  Signed-off-by: Paul Mundt [EMAIL PROTECTED]
  Acked-by: Andy Whitcroft [EMAIL PROTECTED]
  
  
  ---
   mm/memory_hotplug.c |9 -
   1 file changed, 4 insertions(+), 5 deletions(-)
  
  Index: current/mm/memory_hotplug.c
   ===================================================================
   --- current.orig/mm/memory_hotplug.c	2007-09-07 18:08:07.000000000 +0900
   +++ current/mm/memory_hotplug.c	2007-09-11 17:29:19.000000000 +0900
   @@ -211,10 +211,12 @@ int online_pages(unsigned long pfn, unsi
   					online_pages_range);
   	zone->present_pages += onlined_pages;
   	zone->zone_pgdat->node_present_pages += onlined_pages;
   -	if (onlined_pages)
   -		node_set_state(zone->node, N_HIGH_MEMORY);
    
   	setup_per_zone_pages_min();
   +	if (onlined_pages) {
   +		kswapd_run(zone_to_nid(zone));
   +		node_set_state(zone_to_nid(zone), N_HIGH_MEMORY);
   +	}
   
  if (need_zonelists_rebuild)
  build_all_zonelists();
  @@ -269,9 +271,6 @@ int add_memory(int nid, u64 start, u64 s
  if (!pgdat)
  return -ENOMEM;
  new_pgdat = 1;
  -   ret = kswapd_run(nid);
  -   if (ret)
  -   goto error;
  }
   
  /* call arch's memory hotadd */
  
 
 OK, we're getting into a mess here.  This patch fixes
 update-n_high_memory-node-state-for-memory-hotadd.patch, but which patch
 does update-n_high_memory-node-state-for-memory-hotadd.patch fix?
 
 At present I just whacked
 update-n_high_memory-node-state-for-memory-hotadd.patch at the end of
 everything, but that was lazy of me and it ends up making a mess.

It is enough. No more patches are necessary for these issues.
I already fixed it according to Andy-san's comment. :-)


Thanks.
-- 
Yasunori Goto 




Re: [PATCH -mm] mm: Fix memory hotplug + sparsemem build.

2007-09-13 Thread Yasunori Goto

 On Fri, 14 Sep 2007 11:02:43 +0900 Yasunori Goto [EMAIL PROTECTED] wrote:
 
/* call arch's memory hotadd */

   
   OK, we're getting into a mess here.  This patch fixes
   update-n_high_memory-node-state-for-memory-hotadd.patch, but which patch
   does update-n_high_memory-node-state-for-memory-hotadd.patch fix?
   
   At present I just whacked
   update-n_high_memory-node-state-for-memory-hotadd.patch at the end of
   everything, but that was lazy of me and it ends up making a mess.
  
  It is enough. No more patch is necessary for these issues.
  I already fixed about Andy-san's comment. :-)
 
 Now I'm more confused.  I have two separeate questions:
 
 a) Is the justr-added 
 update-n_high_memory-node-state-for-memory-hotadd-fix.patch
still needed?

I'm not sure of the exact meaning of "just-added".
But, update-n_high_memory-node-state-for-memory-hotadd-fix.patch is
necessary for 2.6.23-rc4-mm1.

 b) Which patch in 2.6.22-rc4-mm1 does

2.6.23-rc4-mm1?

update-n_high_memory-node-state-for-memory-hotadd.patch fix?  In other
words, into which patch should I fold
update-n_high_memory-node-state-for-memory-hotadd.patch prior to sending
to Linus?

In my understanding,
update-n_high_memory-node-state-for-memory-hotadd.patch should be folded
into the whole memoryless-nodes-*.patch series.
It sets N_HIGH_MEMORY for a new node-with-memory.

But if you need to specify a more exact patch: because N_HIGH_MEMORY is
set in memoryless-nodes-introduce-mask-of-nodes-with-memory.patch,
I suppose update-n_high_memory-node-state-for-memory-hotadd.patch
should be folded into it.


update-n_high_memory-node-state-for-memory-hotadd-fix.patch
  ^^^
fixes update-n_high_memory-node-state-for-memory-hotadd.patch
and memoryless-nodes-no-need-for-kswapd.patch.


Is that enough to answer your question? Or is it more confusing?


(I (usually) get to work this out for myself.  Sometimes it is painful).
 
 Generally, if people tell me which patch-in-mm their patch is fixing,
 it really helps.  Adrian does this all the time.

Sorry for the confusion...


-- 
Yasunori Goto 




Re: EIP is at device_shutdown+0x32/0x60

2007-11-15 Thread Yasunori Goto
 On Thu, 15 Nov 2007 12:11:58 +0300 Alexey Dobriyan [EMAIL PROTECTED] wrote:
 
  Three boxes rarely oops during reboot or poweroff with 2.6.24-rc2-mm1
  (1) and during 2.6.24 cycle (2):
  
  kernel_restart
  sys_reboot
  [garbage]
  Code: 8b 88 a8 00 00 00 85 c9 74 04 89
  EIP is at device_shutdown+0x32/0x60
 
 Yes, all my test boxes did that - it's what I referred to in the releaee
 notes.  Greg is pondering the problem - seem he's the only person who
 cannot reproduce it ;)

Fortunately, my ia64 box reproduces this oops every time. 
So, I could chase it.

The device_shutdown() function in drivers/base/power/shutdown.c
is the following.
---
/**
 * device_shutdown - call ->shutdown() on each device to shutdown.
 */
void device_shutdown(void)
{
	struct device *dev, *devn;

	list_for_each_entry_safe_reverse(dev, devn, &devices_kset->list,
				kobj.entry) {
		if (dev->bus && dev->bus->shutdown) {
			dev_dbg(dev, "shutdown\n");
			dev->bus->shutdown(dev);
		} else if (dev->driver && dev->driver->shutdown) {
			dev_dbg(dev, "shutdown\n");
			dev->driver->shutdown(dev);
		}
	}
}

When the oops occurred, dev->driver pointed at kset_ktype's address,
and dev->driver->shutdown was the address of bus_type_list.
So, the oops was caused by an illegal operation fault.
(kset_ktype is pointed to by system_kset.)

If my understanding is correct, this loop can't distinguish between
a struct device and a struct kset, but both are connected in this list,
right? That may be the cause of this.

Bye.

-- 
Yasunori Goto 




Re: EIP is at device_shutdown+0x32/0x60

2007-11-15 Thread Yasunori Goto
  
  Care to try this:
+	system_kset = kset_create_and_register("system", NULL,
+					       &devices_kset->kobj, NULL);
  
  We should not join the kset, only use it as a parent.
 
 Yes, that fixes the problem for me!
 
 Can anyone else verify this?

I confirmed it fixed the problem. :-)

Thanks.


-- 
Yasunori Goto 




Re: PS3: trouble with SPARSEMEM_VMEMMAP and kexec

2007-12-05 Thread Yasunori Goto

 I'll try Milton's suggestion to pre-allocate the memory early.  It seems
 that should work as long as nothing else before the hot-plug mem is added
 needs a large chunk.

Hello. Geoff-san. Sorry for late response.

Could you tell me the value of the following page_size calculation
in vmemmap_populate()? I think this page_size may be too big a value.

--
int __meminit vmemmap_populate(struct page *start_page,
			       unsigned long nr_pages, int node)
	:
	:
	unsigned long page_size = 1 << mmu_psize_defs[mmu_linear_psize].shift;
	:
---


In addition, I remember that the current add_memory() is designed for
adding only 1 section at a time. (See memory_probe_store() and
sparse_mem_map_populate();
they ask for only 1 section's mem_map by specifying
PAGES_PER_SECTION.)
The 1 section size for a normal powerpc box is only 16MB.
(ia64: 1GB, x86-64: 128MB.)

But, if my understanding is correct, PS3's add_memory() requests all
of the memory at once. I'm afraid some other problems might still be
hidden in this issue.

(However, I think Milton-san's suggestion is very desirable. 
 If preallocation of hotadd works on ia64 too, I'm very glad.)

Thanks.

-- 
Yasunori Goto 




Re: PS3: trouble with SPARSEMEM_VMEMMAP and kexec

2007-12-06 Thread Yasunori Goto
 On Thu, 6 Dec 2007, Geert Uytterhoeven wrote:
  On Thu, 6 Dec 2007, Yasunori Goto wrote:
I'll try Milton's suggestion to pre-allocate the memory early.  It seems
that should work as long as nothing else before the hot-plug mem is 
added
needs a large chunk.
   
   Hello. Geoff-san. Sorry for late response.
   
   Could you tell me the value of the following page_size calculation
   in vmemmap_populate()? I think this page_size may be too big a value.
   
   --
   int __meminit vmemmap_populate(struct page *start_page,
			       unsigned long nr_pages, int node)
	:
	:
   unsigned long page_size = 1 <<
   mmu_psize_defs[mmu_linear_psize].shift;
	:
   ---
  
  24 MiB
 
 Bummer, messing up bits and MiB.
 
 16 MiB of course.

16 MiB is not the page size. It is the section size.
IIRC, powerpc's page size must be 4K (or 64K).
If the page size is 4K, vmemmap_alloc_block() will ask for an order-12 page.

Is that really the correct value for vmemmap population?

 PS3 initially starts with 128 MiB.
 Later hotplug is used to add the remaining memory (96 or 112 MIB, IIRC).

Ok.
Then, add_memory() must be called 6 or 7 times, once for each section.
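A sketch of that per-section sequence; the base address and node id are
made-up values for illustration, and the real PS3 code would use its own
hotplug region bounds:

	/* add 96 MiB of hotplug memory in 16 MiB (one-section) steps */
	u64 base = 0x8000000;		/* hypothetical start address */
	u64 section_size = 16ULL << 20;	/* 16 MiB */
	int i, nid = 0;

	for (i = 0; i < 6; i++)
		add_memory(nid, base + i * section_size, section_size);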

Thanks.


-- 
Yasunori Goto 




Re: [Patch] mm/sparse.c: Improve the error handling for sparse_add_one_section()

2007-11-26 Thread Yasunori Goto
Hi, Cong-san.

   	ms->section_mem_map |= SECTION_MARKED_PRESENT;
  
   	ret = sparse_init_one_section(ms, section_nr, memmap, usemap);
  
   out:
   	pgdat_resize_unlock(pgdat, &flags);
  -	if (ret <= 0)
  -		__kfree_section_memmap(memmap, nr_pages);
  +
   	return ret;
  }
  #endif

Hmm. When sparse_init_one_section() returns an error, memmap and
usemap should be freed.

Thanks for your fixing.

-- 
Yasunori Goto 




Re: [Patch](Resend) mm/sparse.c: Improve the error handling for sparse_add_one_section()

2007-11-27 Thread Yasunori Goto
   	ret = sparse_init_one_section(ms, section_nr, memmap, usemap);
  @@ -414,7 +418,7 @@ int sparse_add_one_section(struct zone *
   out:
   	pgdat_resize_unlock(pgdat, &flags);
   	if (ret <= 0)
  -		__kfree_section_memmap(memmap, nr_pages);
  +		kfree(usemap);
   	return ret;
   }
   #endif
 

I guess you think __kfree_section_memmap() is not necessary because it
has no implementation. But, it is still available when
CONFIG_SPARSEMEM_VMEMMAP is off. So, it should not be removed.


Bye.

-- 
Yasunori Goto 




Re: [Patch](Resend) mm/sparse.c: Improve the error handling for sparse_add_one_section()

2007-11-28 Thread Yasunori Goto

Looks good to me.

Thanks.

Acked-by: Yasunori Goto [EMAIL PROTECTED]



 On Tue, Nov 27, 2007 at 10:53:45AM -0800, Dave Hansen wrote:
 On Tue, 2007-11-27 at 10:26 +0800, WANG Cong wrote:
  
 @@ -414,7 +418,7 @@ int sparse_add_one_section(struct zone *
  out:
  	pgdat_resize_unlock(pgdat, &flags);
  	if (ret <= 0)
 -		__kfree_section_memmap(memmap, nr_pages);
 +		kfree(usemap);
  	return ret;
  }
  #endif
 
 Why did you get rid of the memmap free here?  A bad return from
 sparse_init_one_section() indicates that we didn't use the memmap, so it
 will leak otherwise.
 
 Sorry, I was confused by the recursion. This one should be OK.
 
 Thanks.
 
 
 
 Improve the error handling for mm/sparse.c::sparse_add_one_section().  And I
 see no reason to check 'usemap' until holding the 'pgdat_resize_lock'.
 
 Cc: Christoph Lameter [EMAIL PROTECTED]
 Cc: Dave Hansen [EMAIL PROTECTED]
 Cc: Rik van Riel [EMAIL PROTECTED]
 Cc: Yasunori Goto [EMAIL PROTECTED]
 Cc: Andy Whitcroft [EMAIL PROTECTED]
 Signed-off-by: WANG Cong [EMAIL PROTECTED]
 
 ---
 Index: linux-2.6/mm/sparse.c
 ===
 --- linux-2.6.orig/mm/sparse.c
 +++ linux-2.6/mm/sparse.c
 @@ -391,9 +391,17 @@ int sparse_add_one_section(struct zone *
  	 * no locking for this, because it does its own
  	 * plus, it does a kmalloc
  	 */
 -	sparse_index_init(section_nr, pgdat->node_id);
 +	ret = sparse_index_init(section_nr, pgdat->node_id);
 +	if (ret < 0)
 +		return ret;
  	memmap = kmalloc_section_memmap(section_nr, pgdat->node_id, nr_pages);
 +	if (!memmap)
 +		return -ENOMEM;
  	usemap = __kmalloc_section_usemap();
 +	if (!usemap) {
 +		__kfree_section_memmap(memmap, nr_pages);
 +		return -ENOMEM;
 +	}
  
  	pgdat_resize_lock(pgdat, &flags);
  
 @@ -403,18 +411,16 @@ int sparse_add_one_section(struct zone *
  		goto out;
  	}
  
 -	if (!usemap) {
 -		ret = -ENOMEM;
 -		goto out;
 -	}
  	ms->section_mem_map |= SECTION_MARKED_PRESENT;
  
  	ret = sparse_init_one_section(ms, section_nr, memmap, usemap);
  
  out:
  	pgdat_resize_unlock(pgdat, &flags);
 -	if (ret <= 0)
 +	if (ret <= 0) {
 +		kfree(usemap);
  		__kfree_section_memmap(memmap, nr_pages);
 +	}
  	return ret;
  }
  #endif

-- 
Yasunori Goto 




Re: PS3: trouble with SPARSEMEM_VMEMMAP and kexec

2007-12-09 Thread Yasunori Goto
 Yasunori Goto wrote:
  On Thu, 6 Dec 2007, Geert Uytterhoeven wrote:
   On Thu, 6 Dec 2007, Yasunori Goto wrote:
 I'll try Milton's suggestion to pre-allocate the memory early.  It 
 seems
 that should work as long as nothing else before the hot-plug mem is 
 added
 needs a large chunk.

Hello. Geoff-san. Sorry for late response.

Could you tell me the value of the following page_size calculation
in vmemmap_populate()? I think this page_size may be too big a value.

--
int __meminit vmemmap_populate(struct page *start_page,
			       unsigned long nr_pages, int node)
	:
	:
unsigned long page_size = 1 << mmu_psize_defs[mmu_linear_psize].shift;
	:
---
  
  16 MiB of course.
  
  16 MiB is not page size. It is section size. 
  IIRC, powerpc's page size must be 4K (or 64K).
  If page size is 4k, vmemmap_alloc_block will call the order 12 page.
 
 
 By default PS3 uses 4K virtual pages, and 16M linear pages.
 
 
  Is it really correct value for vmemmap population?
 
 
 It seems vmemmap needs linear pages, so I think it is ok.

Oh, I see. Sorry for noise.

Bye.

-- 
Yasunori Goto 




Re: sparsemem: Make SPARSEMEM_VMEMMAP selectable

2007-12-09 Thread Yasunori Goto
Looks good to me.

Thanks.

Acked-by: Yasunori Goto [EMAIL PROTECTED]


 
 From: Geoff Levand [EMAIL PROTECTED]
 
 SPARSEMEM_VMEMMAP needs to be a selectable config option to
 support building the kernel both with and without sparsemem
 vmemmap support.  This selection is desirable for platforms
 which could be configured one way for platform specific
 builds and the other for multi-platform builds.
 
 Signed-off-by: Miguel Boton [EMAIL PROTECTED]
 Signed-off-by: Geoff Levand [EMAIL PROTECTED]
 ---
 
 Andrew, 
 
 Please consider for 2.6.24.
 
 -Geoff
 
 
  mm/Kconfig |   15 +++
  1 file changed, 7 insertions(+), 8 deletions(-)
 
 --- a/mm/Kconfig
 +++ b/mm/Kconfig
 @@ -112,18 +112,17 @@ config SPARSEMEM_EXTREME
   def_bool y
   depends on SPARSEMEM  !SPARSEMEM_STATIC
  
 -#
 -# SPARSEMEM_VMEMMAP uses a virtually mapped mem_map to optimise pfn_to_page
 -# and page_to_pfn.  The most efficient option where kernel virtual space is
 -# not under pressure.
 -#
  config SPARSEMEM_VMEMMAP_ENABLE
   def_bool n
  
  config SPARSEMEM_VMEMMAP
 - bool
 - depends on SPARSEMEM
 - default y if (SPARSEMEM_VMEMMAP_ENABLE)
  +	bool "Sparse Memory virtual memmap"
 + depends on SPARSEMEM  SPARSEMEM_VMEMMAP_ENABLE
 + default y
 + help
 +  SPARSEMEM_VMEMMAP uses a virtually mapped memmap to optimise
 +  pfn_to_page and page_to_pfn operations.  This is the most
 +  efficient option when sufficient kernel resources are available.
  
  # eventually, we can have this option just 'select SPARSEMEM'
  config MEMORY_HOTPLUG
 
 

-- 
Yasunori Goto 




[PATCH] Add IORESOURCE_BUSY flag for System RAM (Re: [Question] How to represent SYSTEM_RAM in kernel/resource.c)

2007-11-01 Thread Yasunori Goto
Hello.

I was asked from Kame-san to write this patch.

Please apply.

-
i386 and x86-64 register System RAM as IORESOURCE_MEM | IORESOURCE_BUSY.

But ia64 registers it as IORESOURCE_MEM only.
In addition, memory hotplug code registers new memory as IORESOURCE_MEM too.

This patch adds IORESOURCE_BUSY for them to avoid potential overlap mapping
by PCI device.
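
(For illustration of the effect, a sketch that is not part of the patch:
a busy resource cannot have a later request nested inside it, while a
plain IORESOURCE_MEM region can. The device name, start and len below
are made up.)

	/* hypothetical driver claiming MMIO space that overlaps System RAM: */
	if (!request_mem_region(start, len, "example-pci-dev"))
		return -EBUSY;	/* now fails, since the RAM range is marked busy */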

Signed-off-by: Yasunori Goto [EMAIL PROTECTED]

---
 arch/ia64/kernel/efi.c |6 ++
 mm/memory_hotplug.c|2 +-
 2 files changed, 3 insertions(+), 5 deletions(-)

Index: current/arch/ia64/kernel/efi.c
===
--- current.orig/arch/ia64/kernel/efi.c 2007-11-01 15:24:05.0 +0900
+++ current/arch/ia64/kernel/efi.c  2007-11-01 15:24:18.0 +0900
@@ -,7 +,7 @@ efi_initialize_iomem_resources(struct re
if (md->num_pages == 0) /* should not happen */
continue;
 
-   flags = IORESOURCE_MEM;
+   flags = IORESOURCE_MEM | IORESOURCE_BUSY;
switch (md-type) {
 
case EFI_MEMORY_MAPPED_IO:
@@ -1133,12 +1133,11 @@ efi_initialize_iomem_resources(struct re
 
case EFI_ACPI_MEMORY_NVS:
name = ACPI Non-volatile Storage;
-   flags |= IORESOURCE_BUSY;
break;
 
case EFI_UNUSABLE_MEMORY:
name = reserved;
-   flags |= IORESOURCE_BUSY | IORESOURCE_DISABLED;
+   flags |= IORESOURCE_DISABLED;
break;
 
case EFI_RESERVED_TYPE:
@@ -1147,7 +1146,6 @@ efi_initialize_iomem_resources(struct re
case EFI_ACPI_RECLAIM_MEMORY:
default:
name = reserved;
-   flags |= IORESOURCE_BUSY;
break;
}
 
Index: current/mm/memory_hotplug.c
===
--- current.orig/mm/memory_hotplug.c2007-11-01 15:24:16.0 +0900
+++ current/mm/memory_hotplug.c 2007-11-01 15:41:27.0 +0900
@@ -39,7 +39,7 @@ static struct resource *register_memory_
res->name = "System RAM";
res->start = start;
res->end = start + size - 1;
-   res->flags = IORESOURCE_MEM;
+   res->flags = IORESOURCE_MEM | IORESOURCE_BUSY;
if (request_resource(&iomem_resource, res) < 0) {
printk("System RAM resource %llx - %llx cannot be added\n",
(unsigned long long)res->start, (unsigned long long)res->end);

-- 
Yasunori Goto 




[Patch 000/002](memory hotplug) Rearrange notifier of memory hotplug (take 2)

2007-10-17 Thread Yasunori Goto
Hello.

This patch set rearranges the event notifier for memory hotplug,
because the old notifier has some defects. For example, there is no
information like the new memory's pfn and # of pages for callback functions.

Fortunately, nothing uses this notifier so far, so there is no impact from
this change. (SLUB will use this after this patch set to make
kmem_cache_node structures.)

In addition, a description of the notifier is added to the memory hotplug
document.

This patch was part of the patch set to make kmem_cache_node of SLUB
to avoid a panic on memory online. But I think this change is useful
not only for SLUB but also for others. So, I extracted it.

This patch set is for 2.6.23-mm1.
I tested this patch on my ia64 box.

Please apply.

Bye.

-- 
Yasunori Goto 




[Patch 001/002](memory hotplug) Make description of memory hotplug notifier in document

2007-10-17 Thread Yasunori Goto

Add description about event notification callback routine to the document.

Signed-off-by: Yasunori Goto [EMAIL PROTECTED]

---
 Documentation/memory-hotplug.txt |   58 ---
 1 file changed, 55 insertions(+), 3 deletions(-)

Index: current/Documentation/memory-hotplug.txt
===
--- current.orig/Documentation/memory-hotplug.txt   2007-10-17 15:57:50.0 +0900
+++ current/Documentation/memory-hotplug.txt        2007-10-17 21:26:30.0 +0900
@@ -2,7 +2,8 @@
 Memory Hotplug
 ==
 
-Last Updated: Jul 28 2007
+Created:   Jul 28 2007
+Add description of notifier of memory hotplug  Oct 11 2007
 
 This document is about memory hotplug including how-to-use and current status.
 Because Memory Hotplug is still under development, contents of this text will
@@ -24,7 +25,8 @@ be changed often.
   6.1 Memory offline and ZONE_MOVABLE
   6.2. How to offline memory
 7. Physical memory remove
-8. Future Work List
+8. Memory hotplug event notifier
+9. Future Work List
 
 Note(1): x86_64's has special implementation for memory hotplug.
  This text does not describe it.
@@ -307,8 +309,58 @@ Need more implementation yet
  - Notification completion of remove works by OS to firmware.
  - Guard from remove if not yet.
 
+
+8. Memory hotplug event notifier
+
+Memory hotplug has an event notifier. There are 6 types of notification.
+
+MEMORY_GOING_ONLINE
+  Generated before new memory becomes available in order to be able to
+  prepare subsystems to handle memory. The page allocator is still unable
+  to allocate from the new memory.
+
+MEMORY_CANCEL_ONLINE
+  Generated if MEMORY_GOING_ONLINE fails.
+
+MEMORY_ONLINE
+  Generated when memory has successfully been brought online. The callback may
+  allocate pages from the new memory.
+
+MEMORY_GOING_OFFLINE
+  Generated to begin the process of offlining memory. Allocations are no
+  longer possible from the memory but some of the memory to be offlined
+  is still in use. The callback can be used to free memory known to a
+  subsystem from the indicated memory section.
+
+MEMORY_CANCEL_OFFLINE
+  Generated if MEMORY_GOING_OFFLINE fails. Memory is available again from
+  the section that we attempted to offline.
+
+MEMORY_OFFLINE
+  Generated after offlining memory is complete.
+
+A callback routine can be registered by
+  hotplug_memory_notifier(callback_func, priority)
+
+The second argument of the callback function (action) is one of the event
+types above. The third argument is a pointer to struct memory_notify.
+
+struct memory_notify {
+   unsigned long start_pfn;
+   unsigned long nr_pages;
+   int status_change_nid;
+};
+
+start_pfn is start_pfn of online/offline memory.
+nr_pages is # of pages of online/offline memory.
+status_change_nid is set to the node id when N_HIGH_MEMORY of nodemask is
+(or will be) set/cleared. It means a new (formerly memoryless) node gets new
+memory by online, or a node loses all of its memory. If this is -1, then the
+nodemask status is not changed.
+If status_change_nid >= 0, the callback should create/discard structures for
+the node if necessary.
+
 --
-8. Future Work
+9. Future Work
 --
   - allowing memory hot-add to ZONE_MOVABLE. maybe we need some switch like
 sysctl or new control file.
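
(For illustration, a minimal user of the interface documented above
might look like the sketch below. The callback and its message are
hypothetical; the types and the registration macro are the ones this
patch describes.)

static int example_mem_callback(struct notifier_block *self,
				unsigned long action, void *arg)
{
	struct memory_notify *mn = arg;

	switch (action) {
	case MEM_GOING_ONLINE:
		/* prepare per-node structures; an error returned via
		 * notifier_from_errno() would cancel the online */
		printk(KERN_INFO "going online: pfn %lx, %lu pages, nid %d\n",
		       mn->start_pfn, mn->nr_pages, mn->status_change_nid);
		break;
	case MEM_CANCEL_ONLINE:
	case MEM_OFFLINE:
		/* roll back / free per-node structures */
		break;
	}
	return NOTIFY_OK;
}

static int __init example_init(void)
{
	hotplug_memory_notifier(example_mem_callback, 0);
	return 0;
}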

-- 
Yasunori Goto 




[Patch 002/002](memory hotplug) rearrange patch for notifier of memory hotplug

2007-10-17 Thread Yasunori Goto

Current memory notifier still has some defects. (Fortunately, nothing uses it.)
This patch fixes and rearranges them.

  - Add information of start_pfn, nr_pages, and node id (when node status
    changes from/to a memoryless node) for callback functions.
    Callbacks can't do anything without this information.
  - Add notification of going-online status.
    It is necessary for creating per node structures before the node's
    pages are available.
  - Move GOING_OFFLINE status notification after page isolation.
    It is a good place for a callback to return memory, like caches,
    because returned pages are not used again.
  - Make CANCEL events for rolling back when an error occurs.
  - Delete MEM_MAPPING_INVALID notification. It will not be used.
  - Fix compile error of (un)register_memory_notifier().


Signed-off-by: Yasunori Goto [EMAIL PROTECTED]


---
 drivers/base/memory.c  |9 +
 include/linux/memory.h |   27 +++
 mm/memory_hotplug.c|   48 +---
 3 files changed, 61 insertions(+), 23 deletions(-)

Index: current/drivers/base/memory.c
===
--- current.orig/drivers/base/memory.c  2007-10-17 21:17:54.0 +0900
+++ current/drivers/base/memory.c   2007-10-17 21:21:30.0 +0900
@@ -137,7 +137,7 @@ static ssize_t show_mem_state(struct sys
return len;
 }
 
-static inline int memory_notify(unsigned long val, void *v)
+int memory_notify(unsigned long val, void *v)
 {
return blocking_notifier_call_chain(memory_chain, val, v);
 }
@@ -183,7 +183,6 @@ memory_block_action(struct memory_block 
break;
case MEM_OFFLINE:
mem->state = MEM_GOING_OFFLINE;
-   memory_notify(MEM_GOING_OFFLINE, NULL);
start_paddr = page_to_pfn(first_page) << PAGE_SHIFT;
ret = remove_memory(start_paddr,
PAGES_PER_SECTION << PAGE_SHIFT);
@@ -191,7 +190,6 @@ memory_block_action(struct memory_block 
mem->state = old_state;
break;
}
-   memory_notify(MEM_MAPPING_INVALID, NULL);
break;
default:
printk(KERN_WARNING "%s(%p, %ld) unknown action: %ld\n",
@@ -199,11 +197,6 @@ memory_block_action(struct memory_block 
WARN_ON(1);
ret = -EINVAL;
}
-   /*
-* For now, only notify on successful memory operations
-*/
-   if (!ret)
-   memory_notify(action, NULL);
 
return ret;
 }
Index: current/include/linux/memory.h
===
--- current.orig/include/linux/memory.h 2007-10-17 21:17:54.0 +0900
+++ current/include/linux/memory.h  2007-10-17 21:21:30.0 +0900
@@ -41,18 +41,15 @@ struct memory_block {
 #define	MEM_ONLINE	(1<<0) /* exposed to userspace */
 #define	MEM_GOING_OFFLINE	(1<<1) /* exposed to userspace */
 #define	MEM_OFFLINE	(1<<2) /* exposed to userspace */
+#define	MEM_GOING_ONLINE	(1<<3)
+#define	MEM_CANCEL_ONLINE	(1<<4)
+#define	MEM_CANCEL_OFFLINE	(1<<5)
 
-/*
- * All of these states are currently kernel-internal for notifying
- * kernel components and architectures.
- *
- * For MEM_MAPPING_INVALID, all notifier chains with priority >0
- * are called before pfn_to_page() becomes invalid.  The priority=0
- * entry is reserved for the function that actually makes
- * pfn_to_page() stop working.  Any notifiers that want to be called
- * after that should have priority <0.
- */
-#define	MEM_MAPPING_INVALID	(1<<3)
+struct memory_notify {
+   unsigned long start_pfn;
+   unsigned long nr_pages;
+   int status_change_nid;
+};
 
 struct notifier_block;
 struct mem_section;
@@ -69,12 +66,18 @@ static inline int register_memory_notifi
 static inline void unregister_memory_notifier(struct notifier_block *nb)
 {
 }
+static inline int memory_notify(unsigned long val, void *v)
+{
+   return 0;
+}
 #else
+extern int register_memory_notifier(struct notifier_block *nb);
+extern void unregister_memory_notifier(struct notifier_block *nb);
 extern int register_new_memory(struct mem_section *);
 extern int unregister_memory_section(struct mem_section *);
 extern int memory_dev_init(void);
 extern int remove_memory_block(unsigned long, struct mem_section *, int);
-
+extern int memory_notify(unsigned long val, void *v);
#define CONFIG_MEM_BLOCK_SIZE	(PAGES_PER_SECTION<<PAGE_SHIFT)
 
 
Index: current/mm/memory_hotplug.c
===
--- current.orig/mm/memory_hotplug.c2007-10-17 21:17:54.0 +0900
+++ current/mm/memory_hotplug.c

[Patch](memory hotplug) Make kmem_cache_node for SLUB on memory online to avoid panic(take 3)

2007-10-17 Thread Yasunori Goto

This patch fixes a panic due to a NULL pointer access of
kmem_cache_node at discard_slab() after memory online.

When memory online is called, kmem_cache_nodes are created for
all SLUBs for the new node whose memory is now available.

slab_mem_going_online_callback() is called to make kmem_cache_node
in the callback of the memory online event. If it (or other callbacks) fails,
then slab_mem_offline_callback() is called for rollback.

In memory offline, slab_mem_going_offline_callback() is called to
shrink all slub cache, then slab_mem_offline_callback() is called later.

This patch is tested on my ia64 box.

Please apply.

Signed-off-by: Yasunori Goto [EMAIL PROTECTED]


---
 mm/slub.c |  115 ++
 1 file changed, 115 insertions(+)

Index: current/mm/slub.c
===
--- current.orig/mm/slub.c  2007-10-17 21:17:53.0 +0900
+++ current/mm/slub.c   2007-10-17 22:23:08.0 +0900
@@ -20,6 +20,7 @@
 #include <linux/mempolicy.h>
 #include <linux/ctype.h>
 #include <linux/kallsyms.h>
+#include <linux/memory.h>
 
 /*
  * Lock order:
@@ -2688,6 +2689,118 @@ int kmem_cache_shrink(struct kmem_cache 
 }
 EXPORT_SYMBOL(kmem_cache_shrink);
 
+#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG)
+static int slab_mem_going_offline_callback(void *arg)
+{
+   struct kmem_cache *s;
+
+   down_read(&slub_lock);
+   list_for_each_entry(s, &slab_caches, list)
+   kmem_cache_shrink(s);
+   up_read(&slub_lock);
+
+   return 0;
+}
+
+static void slab_mem_offline_callback(void *arg)
+{
+   struct kmem_cache_node *n;
+   struct kmem_cache *s;
+   struct memory_notify *marg = arg;
+   int offline_node;
+
+   offline_node = marg->status_change_nid;
+
+   /*
+* If the node still has available memory, we need the kmem_cache_node
+* for it yet.
+*/
+   if (offline_node < 0)
+   return;
+
+   down_read(&slub_lock);
+   list_for_each_entry(s, &slab_caches, list) {
+   n = get_node(s, offline_node);
+   if (n) {
+   /*
+* if n->nr_slabs > 0, slabs still exist on the node
+* that is going down. We were unable to free them,
+* and offline_pages() function shouldn't call this
+* callback. So, we must fail.
+*/
+   BUG_ON(atomic_read(&n->nr_slabs));
+
+   s->node[offline_node] = NULL;
+   kmem_cache_free(kmalloc_caches, n);
+   }
+   }
+   up_read(&slub_lock);
+}
+
+static int slab_mem_going_online_callback(void *arg)
+{
+   struct kmem_cache_node *n;
+   struct kmem_cache *s;
+   struct memory_notify *marg = arg;
+   int nid = marg->status_change_nid;
+
+   /*
+* If the node's memory is already available, then kmem_cache_node is
+* already created. Nothing to do.
+*/
+   if (nid < 0)
+   return 0;
+
+   /*
+* We are bringing a node online. No memory is available yet. We must
+* allocate a kmem_cache_node structure in order to bring the node
+* online.
+*/
+   down_read(&slub_lock);
+   list_for_each_entry(s, &slab_caches, list) {
+   /*
+* XXX: kmem_cache_alloc_node will fallback to other nodes
+*  since memory is not yet available from the node that
+*  is brought up.
+*/
+   n = kmem_cache_alloc(kmalloc_caches, GFP_KERNEL);
+   if (!n)
+   return -ENOMEM;
+   init_kmem_cache_node(n);
+   s->node[nid] = n;
+   }
+   up_read(&slub_lock);
+
+   return 0;
+}
+
+static int slab_memory_callback(struct notifier_block *self,
+   unsigned long action, void *arg)
+{
+   int ret = 0;
+
+   switch (action) {
+   case MEM_GOING_ONLINE:
+   ret = slab_mem_going_online_callback(arg);
+   break;
+   case MEM_GOING_OFFLINE:
+   ret = slab_mem_going_offline_callback(arg);
+   break;
+   case MEM_OFFLINE:
+   case MEM_CANCEL_ONLINE:
+   slab_mem_offline_callback(arg);
+   break;
+   case MEM_ONLINE:
+   case MEM_CANCEL_OFFLINE:
+   break;
+   }
+
+   ret = notifier_from_errno(ret);
+   return ret;
+}
+
+#endif /* CONFIG_MEMORY_HOTPLUG */
+
 /
  * Basic setup of slabs
  ***/
@@ -2709,6 +2822,8 @@ void __init kmem_cache_init(void)
sizeof(struct kmem_cache_node), GFP_KERNEL);
kmalloc_caches[0].refcount = -1;
caches

Re: [Patch](memory hotplug) Make kmem_cache_node for SLUB on memory online to avoid panic(take 3)

2007-10-18 Thread Yasunori Goto
 On Wed, 17 Oct 2007 23:25:58 -0700 (PDT) Christoph Lameter [EMAIL PROTECTED] wrote:
 
   So that's slub.  Does slab already have this functionality or are you
   not bothering to maintain slab in this area?
  
  Slab brings up a per node structure when the corresponding cpu is brought 
  up. That was sufficient as long as we did not have any memoryless nodes. 

Right. At least, I don't have any experience of panic with SLAB so far.
(If panic occurred, I already made a patch.).

  Now we may have to fix some things over there as well.

Though the fix may be better for it, my priority is very low for it
now.



-- 
Yasunori Goto 




Re: [Patch](memory hotplug) Make kmem_cache_node for SLUB on memory online to avoid panic(take 3)

2007-10-18 Thread Yasunori Goto
 On Thu, 18 Oct 2007 12:25:37 +0900 Yasunori Goto [EMAIL PROTECTED] wrote:
 
  
  This patch fixes panic due to access NULL pointer
  of kmem_cache_node at discard_slab() after memory online.
  
  When memory online is called, kmem_cache_nodes are created for
  all SLUBs for new node whose memory are available.
  
  slab_mem_going_online_callback() is called to make kmem_cache_node()
  in callback of memory online event. If it (or other callbacks) fails,
  then slab_mem_offline_callback() is called for rollback.
  
  In memory offline, slab_mem_going_offline_callback() is called to
  shrink all slub cache, then slab_mem_offline_callback() is called later.
  
  This patch is tested on my ia64 box.
  
  ...
   
  +#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG)
 
 hm.  There should be no linkage between memory hotpluggability and
 NUMA, surely?

Sure. IBM's powerpc boxes have to support memory hotplug even if it 
is non-numa machine. They have the Dynamic Logical Partitioning feature.

  +   down_read(&slub_lock);
  +   list_for_each_entry(s, &slab_caches, list) {
  +   n = get_node(s, offline_node);
  +   if (n) {
  +   /*
  +* if n->nr_slabs > 0, slabs still exist on the node
  +* that is going down. We were unable to free them,
  +* and offline_pages() function shouldn't call this
  +* callback. So, we must fail.
  +*/
  +   BUG_ON(atomic_read(&n->nr_slabs));
 
 Expereince tells us that WARN_ON is preferred for newly added code ;)

Oh... Ok!

  +   s->node[offline_node] = NULL;
  +   kmem_cache_free(kmalloc_caches, n);
  +   }
  +   }
  +   up_read(&slub_lock);
  +}
  +
  +static int slab_mem_going_online_callback(void *arg)
  +{
  +   struct kmem_cache_node *n;
  +   struct kmem_cache *s;
  +   struct memory_notify *marg = arg;
  +   int nid = marg->status_change_nid;
  +
  +   /*
  +* If the node's memory is already available, then kmem_cache_node is
  +* already created. Nothing to do.
  +*/
  +   if (nid < 0)
  +   return 0;
  +
  +   /*
  +* We are bringing a node online. No memory is available yet. We must
  +* allocate a kmem_cache_node structure in order to bring the node
  +* online.
  +*/
  +   down_read(&slub_lock);
  +   list_for_each_entry(s, &slab_caches, list) {
  +   /*
  +* XXX: kmem_cache_alloc_node will fallback to other nodes
  +*  since memory is not yet available from the node that
  +*  is brought up.
  +*/
  +   n = kmem_cache_alloc(kmalloc_caches, GFP_KERNEL);
  +   if (!n)
  +   return -ENOMEM;
 
 err, we forgot slub_lock.  I'll fix that.

Oops. Indeed. Thanks for your check.

Bye.

-- 
Yasunori Goto 




Re: [PATCH] Fix warning in mm/slub.c

2007-10-22 Thread Yasunori Goto
 "Make kmem_cache_node for SLUB on memory online to avoid panic" introduced
 the following:
 
 mm/slub.c:2737: warning: passing argument 1 of 'atomic_read' from
 incompatible pointer type
 
 
 Signed-off-by: Olof Johansson [EMAIL PROTECTED]
 
 
 diff --git a/mm/slub.c b/mm/slub.c
 index aac1dd3..bcdb2c8 100644
 --- a/mm/slub.c
 +++ b/mm/slub.c
 @@ -2734,7 +2734,7 @@ static void slab_mem_offline_callback(void *arg)
* and offline_pages() function shouldn't call this
* callback. So, we must fail.
*/
 - BUG_ON(atomic_read(&n->nr_slabs));
 + BUG_ON(atomic_long_read(&n->nr_slabs));
  
   s->node[offline_node] = NULL;
   kmem_cache_free(kmalloc_caches, n);


Oops, yes. Thanks.

Acked-by: Yasunori Goto [EMAIL PROTECTED]



-- 
Yasunori Goto 




Re: [-mm PATCH] register_memory/unregister_memory clean ups

2008-02-12 Thread Yasunori Goto
 On Mon, 2008-02-11 at 11:48 -0800, Andrew Morton wrote:
  On Mon, 11 Feb 2008 09:23:18 -0800
  Badari Pulavarty [EMAIL PROTECTED] wrote:
  
   Hi Andrew,
   
   While testing hotplug memory remove against -mm, I noticed
   that unregister_memory() is not cleaning up /sysfs entries
   correctly. It also de-references structures after destroying
   them (luckily in the code which never gets used). So, I cleaned
   up the code and fixed the extra reference issue.
   
   Could you please include it in -mm ?
   
   Thanks,
   Badari
   
   register_memory()/unregister_memory() never gets called with
   root. unregister_memory() is accessing kobject_name of
   the object just freed up. Since no one uses the code,
   lets take the code out. And also, make register_memory() static.  
   
   Another bug fix - before calling unregister_memory()
   remove_memory_block() gets a ref on kobject. unregister_memory()
   need to drop that ref before calling sysdev_unregister().
   
  
  I'd say this:
  
   Subject: [-mm PATCH] register_memory/unregister_memory clean ups
  
  is rather tame.  These are more than cleanups!  These sound like
  machine-crashing bugs.  Do they crash machines?  How come nobody noticed
  it?
  
 
 No, they don't crash the machine - mainly because they never get called
 with the root argument (where we have the bug). They were never tested
 before, since we don't have memory remove working yet. All it does
 is leave a /sysfs directory lying around, causing the next
 memory add to fail. 

Badari-san.

Which function calls unregister_memory() or unregister_memory_section()?
I can't find its caller in current 2.6.24-mm1.


???()
  |
  |nothing calls?
  |
  +--unregister_memory_section()
   |
   |call
   |
   +--- remove_memory_block()
  |
  |call
  |
  + unregister_memory()

unregister_memory_section() is only externed in linux/memory.h.

Do you have another patch that calls it?
I think it is necessary for physical memory removal.

If you have not posted it or it is not merged to -mm,
I can understand why this bug remains.
If you posted it, could you point it to me?

Or do I misunderstand something?


Thanks.

-- 
Yasunori Goto 




Re: [-mm PATCH] register_memory/unregister_memory clean ups

2008-02-12 Thread Yasunori Goto
Thanks Badari-san.

I understand what occurred. :-)

 On Tue, 2008-02-12 at 13:56 -0800, Badari Pulavarty wrote:
+   /*
+* It's ugly, but this is the best I can do - HELP !!
+* We don't know where the allocations for section memmap and usemap
+* came from. If they are allocated at boot time, they would come
+* from bootmem. If they are added through hot-memory-add they could be
+* from slab or vmalloc. If they are allocated as part of hot-mem-add
+* free them up properly. If they are allocated at boot, no easy way
+* to correctly free them :(
+*/
+   if (usemap) {
+   if (PageSlab(virt_to_page(usemap))) {
+   kfree(usemap);
+   if (memmap)
+   __kfree_section_memmap(memmap, nr_pages);
+   }
+   }
+}
   
   Do what we did with the memmap and store some of its origination
   information in the low bits.
  
  Hmm. my understanding of memmap is limited. Can you help me out here ?
 
 Never mind.  That was a bad suggestion.  I do think it would be a good
 idea to mark the 'struct page' of every page we use as bootmem in some
 way.  Perhaps page->private? 

I agree. page->private is not used by the bootmem allocator.

I would like to mark not only memmap but also pgdat (and so on)
for next step. It will be necessary for removing whole node. :-)


  Otherwise, you can simply try all of the
 possibilities and consider the remainder bootmem.  Did you ever find out
 if we properly initialize the bootmem 'struct page's?
 
 Please have mercy and put this in a helper, first of all.
 
 static void free_usemap(unsigned long *usemap)
 {
   if (!usemap)
   return;
 
   if (PageSlab(virt_to_page(usemap))) {
   kfree(usemap);
   } else if (is_vmalloc_addr(usemap)) {
   vfree(usemap);
   } else {
   int nid = page_to_nid(virt_to_page(usemap));
   bootmem_fun_here(NODE_DATA(nid), usemap);
   }
 }
 
 right?

It may work. But, to be honest, I feel there are TOO MANY allocation/free
ways for memmap (usemap and so on). If possible, I would like to
unify some of them. I would like to try it.
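
(A rough sketch of the marking idea discussed here; the helper names
and the type encoding are assumptions, not merged code.)

/* Tag pages handed out by bootmem through page->private, so a later
 * free path can tell bootmem apart from slab/vmalloc allocations. */
static void mark_page_bootmem(struct page *page, unsigned long type)
{
	SetPagePrivate(page);
	set_page_private(page, type);	/* e.g. distinguish memmap vs usemap */
}

static int page_was_bootmem(struct page *page)
{
	/* the bootmem allocator itself never sets PG_private */
	return PagePrivate(page);
}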

Bye.

-- 
Yasunori Goto 




Re: [PATCH] [5/8] Fix logic error in 64bit memory hotadd

2008-02-12 Thread Yasunori Goto
Hi Ingo-san.

  Does anyone even  
  use memory hotplug currently? 
 
 I don't know.

IBM's powerpc boxes can hot-add/remove memory by dynamic partitioning.
And our Fujitsu server has a memory hot-add feature (IA-64).
So, they are concrete users of memory hotplug.

In x86, E8500 chipset has the feature of memory-hotplug.
(I searched a data-sheet from intel site.)
http://download.intel.com/design/chipsets/e8500/datashts/30674501.pdf
(6.3.8 IMI Hot-Plug)

So, it depends on how many servers use it, I think.

Thanks.

-- 
Yasunori Goto 




[Patch / 000](memory hotplug) Fix NULL pointer access of kmem_cache_node when hot-add.

2007-10-01 Thread Yasunori Goto
Hello.

This patch set is to fix a panic due to a NULL pointer access in SLUB.

When new memory is hot-added on a new node (or a memory-less node),
the kmem_cache_node for the new node is not prepared,
and a panic occurs because of it. So, a new kmem_cache_node should be created
before the new memory is available on the node.

This is the first user of the callback of memory notifier.
So, the first patch is to change some defects of it.

This patch set is for 2.6.23-rc8-mm2.
I tested this patch on my ia64 box.

Please apply.

Bye.

-- 
Yasunori Goto 




[Patch / 001](memory hotplug) fix some defects of memory notifer callback interface.

2007-10-01 Thread Yasunori Goto

Current memory notifier still has some defects. (Nothing uses it.)
This patch fixes them.

  - Add information of start_pfn and nr_pages for callback functions.
    They can't do anything without this information.
  - Add notification of going-online status.
    It is necessary for creating per node structures before the node's
    pages are available.
  - Fix compile error of (un)register_memory_notifier().


Signed-off-by: Yasunori Goto [EMAIL PROTECTED]

---
 drivers/base/memory.c  |   10 +++---
 include/linux/memory.h |   16 
 2 files changed, 19 insertions(+), 7 deletions(-)

Index: current/drivers/base/memory.c
===
--- current.orig/drivers/base/memory.c  2007-09-28 11:21:00.0 +0900
+++ current/drivers/base/memory.c   2007-09-28 11:23:46.0 +0900
@@ -155,10 +155,13 @@ memory_block_action(struct memory_block 
struct page *first_page;
int ret;
int old_state = mem-state;
+   struct memory_notify arg;
 
psection = mem->phys_index;
first_page = pfn_to_page(psection << PFN_SECTION_SHIFT);
 
+   arg.start_pfn = page_to_pfn(first_page);
+   arg.nr_pages = PAGES_PER_SECTION;
/*
 * The probe routines leave the pages reserved, just
 * as the bootmem code does.  Make sure they're still
@@ -178,12 +181,13 @@ memory_block_action(struct memory_block 
 
switch (action) {
case MEM_ONLINE:
+   memory_notify(MEM_GOING_ONLINE, &arg);
start_pfn = page_to_pfn(first_page);
ret = online_pages(start_pfn, PAGES_PER_SECTION);
break;
case MEM_OFFLINE:
mem->state = MEM_GOING_OFFLINE;
-   memory_notify(MEM_GOING_OFFLINE, NULL);
+   memory_notify(MEM_GOING_OFFLINE, &arg);
start_paddr = page_to_pfn(first_page) << PAGE_SHIFT;
ret = remove_memory(start_paddr,
PAGES_PER_SECTION << PAGE_SHIFT);
@@ -191,7 +195,7 @@ memory_block_action(struct memory_block 
mem->state = old_state;
break;
}
-   memory_notify(MEM_MAPPING_INVALID, NULL);
+   memory_notify(MEM_MAPPING_INVALID, &arg);
break;
default:
printk(KERN_WARNING %s(%p, %ld) unknown action: %ld\n,
@@ -203,7 +207,7 @@ memory_block_action(struct memory_block 
 * For now, only notify on successful memory operations
 */
if (!ret)
-   memory_notify(action, NULL);
+   memory_notify(action, &arg);
 
return ret;
 }
Index: current/include/linux/memory.h
===
--- current.orig/include/linux/memory.h 2007-09-28 11:18:25.0 +0900
+++ current/include/linux/memory.h  2007-09-28 11:23:46.0 +0900
@@ -37,10 +37,16 @@ struct memory_block {
struct sys_device sysdev;
 };
 
+struct memory_notify {
+   unsigned long start_pfn;
+   unsigned long nr_pages;
+};
+
 /* These states are exposed to userspace as text strings in sysfs */
-#define	MEM_ONLINE	(1<<0) /* exposed to userspace */
-#define	MEM_GOING_OFFLINE	(1<<1) /* exposed to userspace */
-#define	MEM_OFFLINE	(1<<2) /* exposed to userspace */
+#define MEM_GOING_ONLINE	(1<<0) /* exposed to userspace */
+#define	MEM_ONLINE	(1<<1) /* exposed to userspace */
+#define	MEM_GOING_OFFLINE	(1<<2) /* exposed to userspace */
+#define	MEM_OFFLINE	(1<<3) /* exposed to userspace */
 
 /*
  * All of these states are currently kernel-internal for notifying
@@ -52,7 +58,7 @@ struct memory_block {
  * pfn_to_page() stop working.  Any notifiers that want to be called
  * after that should have priority <0.
  */
-#define	MEM_MAPPING_INVALID	(1<<3)
+#define	MEM_MAPPING_INVALID	(1<<4)
 
 struct notifier_block;
 struct mem_section;
@@ -70,6 +76,8 @@ static inline void unregister_memory_not
 {
 }
 #else
+extern int register_memory_notifier(struct notifier_block *nb);
+extern void unregister_memory_notifier(struct notifier_block *nb);
 extern int register_new_memory(struct mem_section *);
 extern int unregister_memory_section(struct mem_section *);
 extern int memory_dev_init(void);

-- 
Yasunori Goto 




[Patch / 002](memory hotplug) Callback function to create kmem_cache_node.

2007-10-01 Thread Yasunori Goto

This is to make kmem_cache_nodes of all SLUBs for a new node when
memory-hotadd is called. This fixes a panic due to a NULL pointer access at
discard_slab() after memory hot-add.

If pages on the new node are available, slub can use them before the
new kmem_cache_nodes are made. So, this callback should be called
BEFORE pages on the node are available.


Signed-off-by: Yasunori Goto [EMAIL PROTECTED]

---
 mm/slub.c |   79 ++
 1 file changed, 79 insertions(+)

Index: current/mm/slub.c
===
--- current.orig/mm/slub.c  2007-09-28 11:23:50.0 +0900
+++ current/mm/slub.c   2007-09-28 11:23:59.0 +0900
@@ -20,6 +20,7 @@
 #include <linux/mempolicy.h>
 #include <linux/ctype.h>
 #include <linux/kallsyms.h>
+#include <linux/memory.h>
 
 /*
  * Lock order:
@@ -2097,6 +2098,82 @@ static int init_kmem_cache_nodes(struct 
}
return 1;
 }
+
+#ifdef CONFIG_MEMORY_HOTPLUG
+static void __slab_callback_offline(int nid)
+{
+   struct kmem_cache_node *n;
+   struct kmem_cache *s;
+
+   list_for_each_entry(s, &slab_caches, list) {
+   if (s->node[nid]) {
+   n = get_node(s, nid);
+   s->node[nid] = NULL;
+   kmem_cache_free(kmalloc_caches, n);
+   }
+   }
+}
+
+static int slab_callback_going_online(void *arg)
+{
+   struct kmem_cache_node *n;
+   struct kmem_cache *s;
+   struct memory_notify *marg = arg;
+   int nid;
+
+   nid = page_to_nid(pfn_to_page(marg->start_pfn));
+
+   /* If the node already has memory, then nothing is necessary. */
+   if (node_state(nid, N_HIGH_MEMORY))
+   return 0;
+
+   /*
+* New memory will be onlined on the node which has no memory so far.
+* A new kmem_cache_node is necessary for it.
+*/
+   down_read(&slub_lock);
+   list_for_each_entry(s, &slab_caches, list) {
+   /*
+* XXX: The new node's memory can't be allocated yet, so
+*  kmem_cache_node will be allocated on another node.
+*/
+   n = kmem_cache_alloc(kmalloc_caches, GFP_KERNEL);
+   if (!n)
+   goto error;
+   init_kmem_cache_node(n);
+   s->node[nid] = n;
+   }
+   up_read(&slub_lock);
+
+   return 0;
+
+error:
+   __slab_callback_offline(nid);
+   up_read(&slub_lock);
+
+   return -ENOMEM;
+}
+
+static int slab_callback(struct notifier_block *self, unsigned long action,
+void *arg)
+{
+   int ret = 0;
+
+   switch (action) {
+   case MEM_GOING_ONLINE:
+   ret = slab_callback_going_online(arg);
+   break;
+   case MEM_ONLINE:
+   case MEM_GOING_OFFLINE:
+   case MEM_MAPPING_INVALID:
+   break;
+   }
+
+   ret = notifier_from_errno(ret);
+   return ret;
+}
+
+#endif /* CONFIG_MEMORY_HOTPLUG */
 #else
 static void free_kmem_cache_nodes(struct kmem_cache *s)
 {
@@ -2730,6 +2807,8 @@ void __init kmem_cache_init(void)
sizeof(struct kmem_cache_node), GFP_KERNEL);
kmalloc_caches[0].refcount = -1;
caches++;
+
+   hotplug_memory_notifier(slab_callback, 1);
 #endif
 
/* Able to allocate the per node structures */

-- 
Yasunori Goto 




Re: [Patch 000/002](memory hotplug) Fix NULL pointer access of kmem_cache_node when hot-add.

2007-10-01 Thread Yasunori Goto

I'm sorry. There are 2 patches for this fix. Subtitle should be
[Patch 000/002]. :-(


-- 
Yasunori Goto 




Re: [Patch / 002](memory hotplug) Callback function to create kmem_cache_node.

2007-10-01 Thread Yasunori Goto
 On Mon, 1 Oct 2007, Yasunori Goto wrote:
 
  +#ifdef CONFIG_MEMORY_HOTPLUG
  +static void __slab_callback_offline(int nid)
  +{
  +   struct kmem_cache_node *n;
  +   struct kmem_cache *s;
  +
   +   list_for_each_entry(s, &slab_caches, list) {
   +   if (s->node[nid]) {
   +   n = get_node(s, nid);
   +   s->node[nid] = NULL;
   +   kmem_cache_free(kmalloc_caches, n);
  +   }
  +   }
  +}
 
 I think we need to bug here if there are still objects on the node that 
 are in use. This will silently discard the objects.
 
This is just the rollback code for an allocation failure of
kmem_cache_node partway through.
So, there is a case where some of them are not allocated yet.
No slab uses the new kmem_cache_node before the new node's pages are
available --so far--.
But, in the future, this will be useful for node hot-unplug code,
and its check will be necessary.  Ok. I'll add the check.

Do you mean that just nr_slabs should be checked like followings?
I'm not sure this is enough.

:
if (s->node[nid]) {
n = get_node(s, nid);
if (!atomic_read(&n->nr_slabs)) {
s->node[nid] = NULL;
kmem_cache_free(kmalloc_caches, n);
}
}
:
:

Thanks.

-- 
Yasunori Goto 




Re: x86 patches was Re: -mm merge plans for 2.6.24

2007-10-02 Thread Yasunori Goto
 On Tue, 2 Oct 2007 00:43:24 -0700
 Andrew Morton [EMAIL PROTECTED] wrote:
 
  On Tue, 2 Oct 2007 16:36:24 +0900 KAMEZAWA Hiroyuki [EMAIL PROTECTED] 
Don't think so.  A node is a lump of circuitry which can have zero or 
more
CPUs, IO and memory.

It may initially have been conceived as a memory-only concept in the 
Linux
kernel, but that doesn't fully map onto reality (does it?)

There was a real-world need for this, I think from the Fujitsu guys.  
That
should be spelled out in the changelog but isn't.
   
   Yes, Fujitsu and HP guys really need this memory-less-node support. 
   
  
  For what reason, please?
  
 
 For Fujitsu, the problem is called "empty node".
 
 When ACPI's SRAT table includes possible nodes, the ia64 bootstrap
 (acpi_numa_init) creates nodes which include no memory and no cpu.
 
 I tried to remove empty-node in the past, but that was denied.
 It was because we can hot-add a cpu to the empty node.
 (node-hotplug triggered by cpu is not implemented now. and it will be ugly.)
 (node-hotplug triggered by cpu is not implemented now. and it will be ugly.)
 
 
 For HP, (Lee can comment on this later), they have memory-less-node.
 As far as I hear, HP's machine can have following configration.
 
 (example)
 Node0: CPU0   memory AAA MB
 Node1: CPU1   memory AAA MB
 Node2: CPU2   memory AAA MB
 Node3: CPU3   memory AAA MB
 Node4: Memory XXX GB
 
 AAA is a very small value (below 16MB) and will be omitted by the ia64 bootstrap.
 After boot, only Node 4 has valid memory (but has no cpu).
 
 Maybe this is memory-interleave by firmware config.

From the memory-hotplug point of view, memory-less nodes are very helpful.
They can represent some halfway conditions of node hot-plug.
I guess the node-unplugging code will be simpler because of them.


Bye.

-- 
Yasunori Goto 




Re: [Patch / 002](memory hotplug) Callback function to create kmem_cache_node.

2007-10-03 Thread Yasunori Goto
 On Tue, 2 Oct 2007, Yasunori Goto wrote:
 
  Do you mean that just nr_slabs should be checked like followings?
  I'm not sure this is enough.
  
  :
  if (s-node[nid]) {
  n = get_node(s, nid);
  if (!atomic_read(n-nr_slabs)) {
  s-node[nid] = NULL;
  kmem_cache_free(kmalloc_caches, n);
  }
  }
  :
  :
 
 That would work. But it would be better to shrink the cache first. The 
 first 2 slabs on a node may be empty and the shrinking will remove those. 
 If you do not shrink then the code may falsely assume that there are 
 objects on the node.

I'm sorry, but I don't think I understand what you mean... :-(
Could you explain more? 

Which slabs should be shrinked? kmem_cache_node and kmem_cache_cpu?

I think kmem_cache_cpu should be disabled by cpu hotplug,
not memory/node hotplug. Basically, cpu should be offlined before
memory offline on the node.

Sorry, I'm confused now...

-- 
Yasunori Goto 




Re: [Patch / 002](memory hotplug) Callback function to create kmem_cache_node.

2007-10-03 Thread Yasunori Goto
 On Wed, 3 Oct 2007, Yasunori Goto wrote:
 
   
   That would work. But it would be better to shrink the cache first. The 
   first 2 slabs on a node may be empty and the shrinking will remove those. 
   If you do not shrink then the code may falsely assume that there are 
   objects on the node.
  
  I'm sorry, but I don't think I understand what you mean... :-(
  Could you explain more? 
  
  Which slabs should be shrinked? kmem_cache_node and kmem_cache_cpu?
 
 The slab for which you are trying to set the kmem_cache_node pointer to 
 NULL needs to be shrunk.
  
  I think kmem_cache_cpu should be disabled by cpu hotplug,
  not memory/node hotplug. Basically, cpu should be offlined before
  memory offline on the node.
 
 Hmmm.. Ok for cpu hotplug you could simply disregard the per cpu 
 structure if the per cpu slab was flushed first.
 
 However, the per node structure may hold slabs with no objects even after 
 all objects were removed on a node. These need to be flushed by calling
 kmem_cache_shrink() on the slab cache.
 
 On the other hand: If you can guarantee that they will not be used and 
 that no objects are in them and that you can recover the pages used in 
 different ways then zapping the per node pointer like that is okay.

Thanks for your advice. I'll reconsider and fix my patches.
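
(For illustration, a sketch of the suggested ordering, not merged code:
shrink first so empty partial slabs on the departing node are freed,
then the nr_slabs check is meaningful.)

	struct kmem_cache *s;
	struct kmem_cache_node *n;

	down_read(&slub_lock);
	list_for_each_entry(s, &slab_caches, list) {
		kmem_cache_shrink(s);	/* flush empty slabs first */
		n = get_node(s, nid);
		if (n && !atomic_read(&n->nr_slabs)) {
			s->node[nid] = NULL;
			kmem_cache_free(kmalloc_caches, n);
		}
	}
	up_read(&slub_lock);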

Bye.

-- 
Yasunori Goto 




Re: [PATCH -mm] mm: Fix memory hotplug + sparsemem build.

2007-09-11 Thread Yasunori Goto
   if (onlined_pages)
 - node_set_state(zone->node, N_HIGH_MEMORY);
 + node_set_state(zone_to_nid(zone), N_HIGH_MEMORY);
  
   setup_per_zone_pages_min();

Thanks Paul-san. 

I also have another issue around here.
(Kswapd doesn't run on a memory-less node now. It should run when
 the node has memory.)

I would like to merge them like following if you don't mind.


Bye.

---

Fix that kswapd doesn't run when memory is added on a memory-less node.
Fix a compile error of zone->node when CONFIG_NUMA is off.

Signed-off-by: Yasunori Goto [EMAIL PROTECTED]
Signed-off-by: Paul Mundt [EMAIL PROTECTED]


---
 mm/memory_hotplug.c |9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

Index: current/mm/memory_hotplug.c
===
--- current.orig/mm/memory_hotplug.c2007-09-07 18:08:07.0 +0900
+++ current/mm/memory_hotplug.c 2007-09-11 17:29:19.0 +0900
@@ -211,10 +211,12 @@ int online_pages(unsigned long pfn, unsi
online_pages_range);
zone->present_pages += onlined_pages;
zone->zone_pgdat->node_present_pages += onlined_pages;
-   if (onlined_pages)
-   node_set_state(zone->node, N_HIGH_MEMORY);
 
setup_per_zone_pages_min();
+   if (onlined_pages){
+   kswapd_run(zone_to_nid(zone));
+   node_set_state(zone_to_nid(zone), N_HIGH_MEMORY);
+   }
 
if (need_zonelists_rebuild)
build_all_zonelists();
@@ -269,9 +271,6 @@ int add_memory(int nid, u64 start, u64 s
if (!pgdat)
return -ENOMEM;
new_pgdat = 1;
-   ret = kswapd_run(nid);
-   if (ret)
-   goto error;
}
 
/* call arch's memory hotadd */


-- 
Yasunori Goto 




Re: [PATCH -mm] mm: Fix memory hotplug + sparsemem build.

2007-09-11 Thread Yasunori Goto

  +   if (onlined_pages){
 
 Nit, needs a space there before the '{'.

Ah, OK. I attached the fixed patch in this mail.

 The problem as I see it is that when we boot the system we start a
 kswapd on all nodes with memory.  If the hot-add adds memory to a
 pre-existing node with no memory we will not start one and we end up
 with a node with memory and no kswapd.  Bad.
 
 As kswapd_run is a no-op when a kswapd already exists this seems a safe
 way to fix that.  Paul's -zone conversion is obviously correct also.
 
 Acked-by: Andy Whitcroft [EMAIL PROTECTED]

Thanks for your explanation.
You captured all of my intentions correctly. :-)
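
(For context: kswapd_run() in mm/vmscan.c of this era looks roughly
like the reconstruction below, abbreviated, which is why a second call
for a node that already has a kswapd simply returns.)

static int kswapd_run(int nid)
{
	pg_data_t *pgdat = NODE_DATA(nid);
	int ret = 0;

	if (pgdat->kswapd)
		return 0;	/* already running: the call is a no-op */

	pgdat->kswapd = kthread_run(kswapd, pgdat, "kswapd%d", nid);
	if (IS_ERR(pgdat->kswapd)) {
		printk("Failed to start kswapd on node %d\n", nid);
		ret = -1;
	}
	return ret;
}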




Fix that kswapd doesn't run when memory is added on a memory-less node.
Fix a compile error of zone->node when CONFIG_NUMA is off.

Signed-off-by: Yasunori Goto [EMAIL PROTECTED]
Signed-off-by: Paul Mundt [EMAIL PROTECTED]
Acked-by: Andy Whitcroft [EMAIL PROTECTED]


---
 mm/memory_hotplug.c |9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

Index: current/mm/memory_hotplug.c
===
--- current.orig/mm/memory_hotplug.c2007-09-07 18:08:07.0 +0900
+++ current/mm/memory_hotplug.c 2007-09-11 17:29:19.0 +0900
@@ -211,10 +211,12 @@ int online_pages(unsigned long pfn, unsi
online_pages_range);
zone->present_pages += onlined_pages;
zone->zone_pgdat->node_present_pages += onlined_pages;
-   if (onlined_pages)
-   node_set_state(zone->node, N_HIGH_MEMORY);
 
setup_per_zone_pages_min();
+   if (onlined_pages) {
+   kswapd_run(zone_to_nid(zone));
+   node_set_state(zone_to_nid(zone), N_HIGH_MEMORY);
+   }
 
if (need_zonelists_rebuild)
build_all_zonelists();
@@ -269,9 +271,6 @@ int add_memory(int nid, u64 start, u64 s
if (!pgdat)
return -ENOMEM;
new_pgdat = 1;
-   ret = kswapd_run(nid);
-   if (ret)
-   goto error;
}
 
/* call arch's memory hotadd */

-- 
Yasunori Goto 




[Patch] Fix panic of cpu online with memory less node

2007-09-11 Thread Yasunori Goto

When a cpu is onlined on a memory-less-node box, the kernel panics due to
touching a NULL pointer via pgdat->kswapd. The current kswapd runs only on
nodes which have memory. So, calling set_cpus_allowed()
is not necessary for a memory-less node.

This is fix for it.


Signed-off-by: Yasunori Goto [EMAIL PROTECTED]

---
 mm/vmscan.c |4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Index: current/mm/vmscan.c
===
--- current.orig/mm/vmscan.c2007-09-03 16:36:18.0 +0900
+++ current/mm/vmscan.c 2007-09-11 13:02:20.0 +0900
@@ -1843,9 +1843,11 @@ static int __devinit cpu_callback(struct
 {
pg_data_t *pgdat;
cpumask_t mask;
+   int nid;
 
if (action == CPU_ONLINE || action == CPU_ONLINE_FROZEN) {
-   for_each_online_pgdat(pgdat) {
+   for_each_node_state(nid, N_HIGH_MEMORY) {
+   pgdat = NODE_DATA(nid);
mask = node_to_cpumask(pgdat->node_id);
if (any_online_cpu(mask) != NR_CPUS)
/* One of our CPUs online: restore mask */

-- 
Yasunori Goto 




Re: [PATCH][22/37] Clean up duplicate includes in include/linux/memory_hotplug.h

2007-07-23 Thread Yasunori Goto

Oops. This should be 
Thanks!

Acked-by: Yasunori Goto [EMAIL PROTECTED]


 Hi,
 
 This patch cleans up duplicate includes in
   include/linux/memory_hotplug.h
 
 
 Signed-off-by: Jesper Juhl [EMAIL PROTECTED]
 ---
 
 diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
 index 7b54666..b573d1e 100644
 --- a/include/linux/memory_hotplug.h
 +++ b/include/linux/memory_hotplug.h
 @@ -3,7 +3,6 @@
  
  #include <linux/mmzone.h>
  #include <linux/spinlock.h>
 -#include <linux/mmzone.h>
  #include <linux/notifier.h>
  
  struct page;

-- 
Yasunori Goto 




[RFC][Doc] memory hotplug documentaion take 2.

2007-07-27 Thread Yasunori Goto
Hello.

This is a new version of the memory hotplug document.
At first, I was asked by Kame-san to review his new version, which was only
updated against previous comments. But after reviewing, I came to want to
change/add many descriptions. So, I'll post this. :-)
Please comment.

Change log from take 1.
- updates against comments from Randy-san (Thanks a lot!)
- mention about physical/logical phase of hotplug.
  change sections for it.
- add description of kernel config option.
- add description of relationship against ACPI node-hotplug.
- make patch style.
- etc.


---
This adds a document for memory hotplug to describe how to use it and its
current status.


---
Signed-off-by: KAMEZAWA Hiroyuki [EMAIL PROTECTED]
Signed-off-by: Yasunori Goto [EMAIL PROTECTED]


 Documentation/memory-hotplug.txt |  322 +++
 1 files changed, 322 insertions(+)

Index: makedocument/Documentation/memory-hotplug.txt
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ makedocument/Documentation/memory-hotplug.txt   2007-07-27 22:31:11.0 +0900
@@ -0,0 +1,322 @@
+==
+Memory Hotplug
+==
+
+Last Updated: Jul 27 2007
+
+This document is about memory hotplug including how-to-use and current status.
+Because Memory Hotplug is still under development, contents of this text will
+be changed often.
+
+1. Introduction
+  1.1 purpose of memory hotplug
+  1.2. Phases of memory hotplug
+  1.3. Unit of Memory online/offline operation
+2. Kernel Configuration
+3. sysfs files for memory hotplug
+4. Physical memory hot-add phase
+  4.1 Hardware(Firmware) Support
+  4.2 Notify memory hot-add event by hand
+5. Logical Memory hot-add phase
+  5.1. State of memory
+  5.2. How to online memory
+6. Logical memory remove
+  6.1 Memory offline and ZONE_MOVABLE
+  6.2. How to offline memory
+7. Physical memory remove
+8. Future Work List
+
+Note(1): x86_64's has special implementation for memory hotplug.
+ This test does not describe it.
+Note(2): This text assumes that sysfs is mounted at /sys.
+
+
+---
+1. Introduction
+---
+
+1.1 purpose of memory hotplug
+
+Memory Hotplug allows users to increase/decrease the amount of memory.
+Generally, there are two purposes.
+
+(A) For changing the amount of memory.
+This is to allow a feature like capacity on demand.
+(B) For installing/removing DIMMs or NUMA-nodes physically.
+This is to exchange DIMMs/NUMA-nodes, reduce power consumption, etc.
+
+(A) is required by highly virtualized environments and (B) is required by
+hardware which supports memory power management.
+
+Linux memory hotplug is designed for both purposes.
+
+
+1.2. Phases of memory hotplug
+---
+There are 2 phases in Memory Hotplug.
+  1) Physical Memory Hotplug phase
+  2) Logical Memory Hotplug phase.
+
+The First phase is to communicate hardware/firmware and make/erase
+environment for hotplugged memory. Basically, this phase is necessary
+for the purpose (B), but this is good phase for communication between
+highly virtulaized environments too.
+
+When memory is hotplugged, the kernel recognizes new memory, makes new memory
+management tables, and makes sysfs files for new memory's operation.
+
+If firmware supports notification of connection of new memory to OS,
+this phase is triggered automatically. ACPI can notify this event. If not,
+probe operation by system administration works instead of it.
+(see Section 4.).
+
+Logical Memory Hotplug phase is to change memory state into
+avaiable/unavailable for users. Amount of memory from user's view is
+changed by this phase. The kernel makes all memory in it as free pages
+when a memory range is into available.
+
+In this document, this phase is described online/offline.
+
+Logical Memory Hotplug phase is trigged by write of sysfs file by system
+administrator. When hot-add case, it must be executed after Physical Hotplug
+phase by hand.
+(However, if you writes udev's hotplug scripts for memory hotplug, these
+ phases can be execute in seamless way.)
+
+
+1.3. Unit of Memory online/offline operation
+
+Memory hotplug uses SPARSEMEM memory model. SPARSEMEM divides the whole memory
+into chunks of the same size. The chunk is called a section. The size of
+a section is architecture dependent. For example, power uses 16MiB, ia64 uses
+1GiB. The unit of online/offline operation is one section. (see Section 3.)
+
+To know the size of sections, please read this file:
+
+/sys/devices/system/memory/block_size_bytes
+
+This file shows the size of sections in byte.
+
+---
+2. Kernel Configuration
+---
+To use memory hotplug feature, kernel must be compiled with following
+config options.
+
+- For all memory hotplug
+Memory model -> Sparse Memory  (CONFIG_SPARSEMEM)
+Allow for memory hot-add   (CONFIG_MEMORY_HOTPLUG)
+
+- For using

Re: [RFC][Doc] memory hotplug documentaion take 2.

2007-07-27 Thread Yasunori Goto
Thanks for your comment.
Fixed patch is attached at the last of this mail.

  +
  +Note(1): x86_64's has special implementation for memory hotplug.
  + This test does not describe it.
 
  text (?)

Oops. Yes.

  +1.2. Phases of memory hotplug
  +---
  +There are 2 phases in Memory Hotplug.
  +  1) Physical Memory Hotplug phase
  +  2) Logical Memory Hotplug phase.
  +
  +The First phase is to communicate hardware/firmware and make/erase
  +environment for hotplugged memory. Basically, this phase is necessary
  +for the purpose (B), but this is good phase for communication between
  +highly virtulaized environments too.
 
   virtualized

Yes. fixed...

 
  +
  +When memory is hotplugged, the kernel recognizes new memory, makes new memory
  +management tables, and makes sysfs files for new memory's operation.
  +
  +If firmware supports notification of connection of new memory to OS,
  +this phase is triggered automatically. ACPI can notify this event. If not,
  +probe operation by system administration works instead of it.
 
   is used instead.

Ah, ok.


  +(see Section 4.).
  +
  +Logical Memory Hotplug phase is to change memory state into
  +avaiable/unavailable for users. Amount of memory from user's view is
  +changed by this phase. The kernel makes all memory in it as free pages
  +when a memory range is into available.
 
   ?? drop into ?
 or is a memory range always available?  Confusing.

Ok. I didn't know it was confusing. Thanks. I dropped it.

  +In this document, this phase is described online/offline.
 
described as online/offline.

OK.

  +
  +Logical Memory Hotplug phase is trigged by write of sysfs file by system
 
triggered

Oops. yes.

 
  +administrator. When hot-add case, it must be executed after Physical Hotplug
 
   For the hot-add case,

OK.

 
  +phase by hand.
  +(However, if you writes udev's hotplug scripts for memory hotplug, these
  + phases can be execute in seamless way.)
  +
  +
  +1.3. Unit of Memory online/offline operation
  +
  +Memory hotplug uses SPARSEMEM memory model. SPARSEMEM divides the whole memory
  +into chunks of the same size. The chunk is called a section. The size of
  +a section is architecture dependent. For example, power uses 16MiB, ia64 uses
  +1GiB. The unit of online/offline operation is one section. (see Section 3.)
  +
  +To know the size of sections, please read this file:
 
To determine the size ...

I didn't know "determine" could be used in this sentence.
I thought it meant just "decide", due to my English
vocabulary problem. Thanks. I changed it. :-)

  +- For using remove memory, followings are necessary too
 
  To enable memory removal, the following are also necessary


Ok.

 
  +Allow for memory hot remove(CONFIG_MEMORY_HOTREMOVE)
  +Page Migration (CONFIG_MIGRATION)
  +
  +- For ACPI memory hotplug, followings are necessary too
 
   the following are also necessary

Ok.

  +Now, XXX is defined as start_address_of_section / secion_size.
 
  section_size.

Yes. Thanks.

  +
  +For example, assume 1GiB section size. A device for a memory starts from address
 
for memory starting at

Ok.

  +
  +In general, the firmware (ACPI) which supports memory hotplug defines
  +memory class object of _HID PNP0C80. When a notify is asserted to PNP0C80,
  +Linux's ACPI handler does hot-add memory to the system and calls a hotplug udev
  +script. This will be done in automatically.
 
  drop in

Ok.


  +If firmware supports NUMA-node hotplug, and define object of _HID ACPI0004,
 
defines an object

Ok.

 
  +PNP0A05, or PNP0A06, notification is asserted to it, and ACPI hander
 
  handler

Ah, yes.

Thanks again!



---
This adds a document for memory hotplug to describe how to use it and its
current status.


---
Signed-off-by: KAMEZAWA Hiroyuki [EMAIL PROTECTED]
Signed-off-by: Yasunori Goto [EMAIL PROTECTED]


 Documentation/memory-hotplug.txt |  322 +++
 1 files changed, 322 insertions(+)

Index: makedocument/Documentation/memory-hotplug.txt
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ makedocument/Documentation/memory-hotplug.txt   2007-07-28 11:47:52.0 +0900
@@ -0,0 +1,322 @@
+==
+Memory Hotplug
+==
+
+Last Updated: Jul 28 2007
+
+This document is about memory hotplug including how-to-use and current status.
+Because Memory Hotplug is still under development

Re: [BUG] 2.6.23-rc3-mm1 Kernel panic - not syncing: DMA: Memory would be corrupted

2007-08-23 Thread Yasunori Goto
 On (22/08/07 16:27), Luck, Tony didst pronounce:
   The more ioc's you have, the more space you will use.
  
  Default SW IOTLB allocation is 64MB ... how much should we see
  used per ioc?
  
  Kamelesh: You could try increasing the amount of sw iotlb space
  available by booting with a swiotlb=131072 argument (argument
  value is the number of 2K slabs to allocate ... 131072 would
  give you four times as much space as the default allocation).
  
 
 I tried that value and just in case swiotlb=262144. An IA-64 machines I
 have here fails with the same message anyway. i.e.
 
 [   19.834906] mptbase: Initiating ioc1 bringup
 [   20.317152] ioc1: LSI53C1030 C0: Capabilities={Initiator}
 [   15.474303] scsi1 : ioc1: LSI53C1030 C0, FwRev=01032821h, Ports=1, 
 MaxQ=222, IRQ=72
 [   20.669730] GSI 142 (level, low) - CPU 5 (0x1200) vector 73
 [   20.675602] ACPI: PCI Interrupt :41:03.0[A] - GSI 142 (level, low) - 
 IRQ 73
 [   20.683508] mptbase: Initiating ioc2 bringup
 [   21.166796] ioc2: LSI53C1030 C0: Capabilities={Initiator}
 [   21.180539] DMA: Out of SW-IOMMU space for 263200 bytes at device ?
 [   21.187018] Kernel panic - not syncing: DMA: Memory would be corrupted

I saw the same trouble on my box, and I chased what was wrong.
Here is today's progress.

__get_free_pages() of swiotlb_alloc_coherent() fails in rc3-mm1.
(See the following patch.)
But it doesn't fail on rc2-mm2, and the kernel can boot up.

Hmmm


(2.6.23-rc3-mm1)
---
swiotlb_alloc_coherent flags=21 order=3 ret=
DMA: Out of SW-IOMMU space for 266368 bytes at device ?
Kernel panic - not syncing: DMA: Memory would be corrupted
---




(2.6.23-rc2-mm2)
---
swiotlb_alloc_coherent flags=21 order=3 ret=e0002008
   :
   (boot up continue...)

---
 lib/swiotlb.c |2 ++
 1 file changed, 2 insertions(+)

Index: current/lib/swiotlb.c
===
--- current.orig/lib/swiotlb.c  2007-08-23 22:27:01.0 +0900
+++ current/lib/swiotlb.c   2007-08-23 22:29:49.0 +0900
@@ -455,6 +455,8 @@ swiotlb_alloc_coherent(struct device *hw
flags |= GFP_DMA;
 
ret = (void *)__get_free_pages(flags, order);
+
+	printk("%s flags=%0x order=%d ret=%p\n", __func__, flags, order, ret);
	if (ret && address_needs_mapping(hwdev, virt_to_bus(ret))) {
/*
 * The allocated memory isn't reachable by the device.


-- 
Yasunori Goto 


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] Fix find_next_best_node (Re: [BUG] 2.6.23-rc3-mm1 Kernel panic - not syncing: DMA: Memory would be corrupted)

2007-08-24 Thread Yasunori Goto

I found that find_next_best_node() was wrong.
I confirmed boot-up with the following patch.
Mel-san, Kamalesh-san, could you try this?

Bye.
---

Fix the decision of a memoryless node in find_next_best_node().
This can be the cause of SW-IOMMU's allocation failure.

This patch is for 2.6.23-rc3-mm1.

Signed-off-by: Yasunori Goto [EMAIL PROTECTED]

---
 mm/page_alloc.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: current/mm/page_alloc.c
===
--- current.orig/mm/page_alloc.c2007-08-24 16:03:17.0 +0900
+++ current/mm/page_alloc.c 2007-08-24 16:04:06.0 +0900
@@ -2136,7 +2136,7 @@ static int find_next_best_node(int node,
 * Note:  N_HIGH_MEMORY state not guaranteed to be
 *populated yet.
 */
-	if (pgdat->node_present_pages)
+	if (!pgdat->node_present_pages)
continue;
 
/* Don't want a node to appear more than once */
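For readers skimming the hunk: the bug was an inverted test. find_next_best_node()
is meant to skip memoryless nodes when building the zonelist fallback order, but
the old check skipped exactly the populated nodes. A minimal sketch of the
intended shape (illustrative only, not the literal function body):

	for_each_node_state(n, N_ONLINE) {
		pg_data_t *pgdat = NODE_DATA(n);

		/*
		 * Skip nodes with no present pages. N_HIGH_MEMORY is not
		 * guaranteed to be populated yet, so test the pgdat directly.
		 */
		if (!pgdat->node_present_pages)
			continue;

		/* ... otherwise node n is a fallback candidate ... */
	}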

-- 
Yasunori Goto 


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[Patch 000/002] Rearrange notifier of memory hotplug

2007-10-11 Thread Yasunori Goto

Hello.

This patch set rearranges the event notifier for memory hotplug,
because the old notifier has some defects. For example, there is no
information like the new memory's pfn and # of pages for callback functions.

Fortunately, nothing uses this notifier so far, so there is no impact from
this change. (SLUB will use this after this patch set to make the
kmem_cache_node structure).

In addition, a description of the notifier is added to the memory hotplug
document.

This patch was part of a patch set to make kmem_cache_node of SLUB 
to avoid a panic on memory online. But I think this change is useful
not only for SLUB but also for others. So, I extracted it from that set.

This patch set is for 2.6.23-rc8-mm2.
I tested this patch on my ia64 box.

Please apply.

Bye.

-- 
Yasunori Goto 


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[Patch 001/002] Make description of memory hotplug notifier in document

2007-10-11 Thread Yasunori Goto

Add a description of the event notification callback routine to the document.

Signed-off-by: Yasunori Goto [EMAIL PROTECTED]

---
 Documentation/memory-hotplug.txt |   56 ---
 1 file changed, 53 insertions(+), 3 deletions(-)

Index: current/Documentation/memory-hotplug.txt
===
--- current.orig/Documentation/memory-hotplug.txt
+++ current/Documentation/memory-hotplug.txt
@@ -2,7 +2,8 @@
 Memory Hotplug
 ==
 
-Last Updated: Jul 28 2007
+Created:   Jul 28 2007
+Add description of notifier of memory hotplug  Oct 11 2007
 
 This document is about memory hotplug including how-to-use and current status.
 Because Memory Hotplug is still under development, contents of this text will
@@ -24,7 +25,8 @@ be changed often.
   6.1 Memory offline and ZONE_MOVABLE
   6.2. How to offline memory
 7. Physical memory remove
-8. Future Work List
+8. Memory hotplug event notifier
+9. Future Work List
 
 Note(1): x86_64 has a special implementation for memory hotplug.
  This text does not describe it.
@@ -307,8 +309,68 @@ Need more implementation yet
  - Notification completion of remove works by OS to firmware.
  - Guard from remove if not yet.
 
+
+8. Memory hotplug event notifier
+
+Memory hotplug has an event notifier. There are 6 types of notification:
+
+MEMORY_GOING_ONLINE
+  This is notified before memory online. If some structures must be prepared
+  for new memory, it should be done at this event's callback.
+  The new onlining memory can't be used yet.
+  
+MEMORY_CANCEL_ONLINE
+  If memory online fails, this event is notified for rollback of the setting
+  done at MEMORY_GOING_ONLINE.
+  (Currently, this event is notified only in the case where a callback routine
+   of MEMORY_GOING_ONLINE fails.)
+
+MEMORY_ONLINE
+  This event is called when memory online is completed. The page allocator
+  starts using the new memory area before this notification. In other words,
+  callback routines can use the new memory area via the page allocator.
+  Failures of callbacks of this notification will be ignored.
+
+MEMORY_GOING_OFFLINE
+  This is notified halfway through memory offline. The offlining pages are
+  isolated. In other words, the page allocator doesn't allocate new pages from
+  the offlining memory area at this time. If a callback routine frees some
+  pages, they are not used by the page allocator again.
+  This is a good place for shrinking caches. (If possible, it is desirable to
+  migrate pages to another area.)
+
+MEMORY_CANCEL_OFFLINE
+  If memory offline fails, this event is notified for rollback against
+  MEMORY_GOING_OFFLINE. The page allocator will use the target memory area
+  again after this callback.
+
+MEMORY_OFFLINE
+  This is notified after memory offline is completed. Failures of callbacks
+  of this notification will be ignored. A callback routine can release
+  structures for the offlined memory.
+  If the node which has the offlined memory loses all of its memory, per-node
+  structures can be discarded at this event (see status_change_nid below).
+
+A callback routine can be registered by
+  hotplug_memory_notifier(callback_func, priority).
+
+The second argument of the callback function (action) is one of the event
+types above. The third argument is a pointer to struct memory_notify.
+
+struct memory_notify {
+   unsigned long start_pfn;
+   unsigned long nr_pages;
+   int status_change_nid;
+};
+start_pfn is the start pfn of the online/offline memory.
+nr_pages is the # of pages of the online/offline memory.
+status_change_nid is set to the node id when N_HIGH_MEMORY of the nodemask
+is (or will be) set/cleared. It means a new (memoryless) node gets new memory
+by online, or a node loses all of its memory. If this is -1, then the nodemask
+status is not changed.
+If status_change_nid >= 0, the callback should create/discard structures for
+the node if necessary.
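To make the interface concrete, here is a minimal sketch of a subsystem
callback (the function name and priority value are invented for illustration;
the MEMORY_* events above correspond to the MEM_* constants in linux/memory.h):

static int example_mem_callback(struct notifier_block *self,
				unsigned long action, void *arg)
{
	struct memory_notify *marg = arg;

	switch (action) {
	case MEM_GOING_ONLINE:
		/* Prepare structures for [start_pfn, start_pfn + nr_pages);
		 * the new pages can't be allocated yet. */
		break;
	case MEM_GOING_OFFLINE:
		/* Shrink caches; pages freed here are not reused. */
		break;
	case MEM_OFFLINE:
		/* If marg->status_change_nid >= 0, the node lost all of
		 * its memory; discard per-node structures here. */
		break;
	case MEM_ONLINE:
	case MEM_CANCEL_ONLINE:
	case MEM_CANCEL_OFFLINE:
		break;
	}
	return NOTIFY_OK;
}

The registration itself would then be a single call, e.g.
hotplug_memory_notifier(example_mem_callback, 0);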
+
 --
-8. Future Work
+9. Future Work
 --
   - allowing memory hot-add to ZONE_MOVABLE. maybe we need some switch like
 sysctl or new control file.

-- 
Yasunori Goto 


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[Patch 002/002] rearrange patch for notifier of memory hotplug

2007-10-11 Thread Yasunori Goto

The current memory notifier still has some defects. (Fortunately, nothing uses
it.) This patch fixes and rearranges them.

  - Add information of start_pfn, nr_pages, and node id (if the node status
    changes from/to a memoryless node) for callback functions.
    Callbacks can't do anything without this information.
  - Add notification of the going-online status.
    It is necessary for creating per-node structures before the node's
    pages are available.
  - Move the GOING_OFFLINE status notification after page isolation.
    It is a good place for a callback to return memory such as caches,
    because returned pages are not used again.
  - Make CANCEL events for rolling back when an error occurs.
  - Delete the MEM_MAPPING_INVALID notification. It will not be used.
  - Fix a compile error of (un)register_memory_notifier().


Signed-off-by: Yasunori Goto [EMAIL PROTECTED]


---
 drivers/base/memory.c  |9 +
 include/linux/memory.h |   27 +++
 mm/memory_hotplug.c|   48 +---
 3 files changed, 61 insertions(+), 23 deletions(-)

Index: current/drivers/base/memory.c
===
--- current.orig/drivers/base/memory.c  2007-10-11 14:33:02.0 +0900
+++ current/drivers/base/memory.c   2007-10-11 14:33:07.0 +0900
@@ -137,7 +137,7 @@ static ssize_t show_mem_state(struct sys
return len;
 }
 
-static inline int memory_notify(unsigned long val, void *v)
+int memory_notify(unsigned long val, void *v)
 {
return blocking_notifier_call_chain(memory_chain, val, v);
 }
@@ -183,7 +183,6 @@ memory_block_action(struct memory_block 
break;
case MEM_OFFLINE:
		mem->state = MEM_GOING_OFFLINE;
-		memory_notify(MEM_GOING_OFFLINE, NULL);
		start_paddr = page_to_pfn(first_page) << PAGE_SHIFT;
		ret = remove_memory(start_paddr,
				PAGES_PER_SECTION << PAGE_SHIFT);
@@ -191,7 +190,6 @@ memory_block_action(struct memory_block 
			mem->state = old_state;
			break;
		}
-		memory_notify(MEM_MAPPING_INVALID, NULL);
		break;
	default:
		printk(KERN_WARNING "%s(%p, %ld) unknown action: %ld\n",
@@ -199,11 +197,6 @@ memory_block_action(struct memory_block 
WARN_ON(1);
ret = -EINVAL;
}
-   /*
-* For now, only notify on successful memory operations
-*/
-   if (!ret)
-   memory_notify(action, NULL);
 
return ret;
 }
Index: current/include/linux/memory.h
===
--- current.orig/include/linux/memory.h 2007-10-11 14:33:02.0 +0900
+++ current/include/linux/memory.h  2007-10-11 15:19:31.0 +0900
@@ -41,18 +41,15 @@ struct memory_block {
 #define	MEM_ONLINE	(1<<0) /* exposed to userspace */
 #define	MEM_GOING_OFFLINE	(1<<1) /* exposed to userspace */
 #define	MEM_OFFLINE	(1<<2) /* exposed to userspace */
+#define	MEM_GOING_ONLINE	(1<<3)
+#define	MEM_CANCEL_ONLINE	(1<<4)
+#define	MEM_CANCEL_OFFLINE	(1<<5)
 
-/*
- * All of these states are currently kernel-internal for notifying
- * kernel components and architectures.
- *
- * For MEM_MAPPING_INVALID, all notifier chains with priority >0
- * are called before pfn_to_page() becomes invalid.  The priority=0
- * entry is reserved for the function that actually makes
- * pfn_to_page() stop working.  Any notifiers that want to be called
- * after that should have priority <0.
- */
-#define	MEM_MAPPING_INVALID	(1<<3)
+struct memory_notify {
+   unsigned long start_pfn;
+   unsigned long nr_pages;
+   int status_change_nid;
+};
 
 struct notifier_block;
 struct mem_section;
@@ -69,12 +66,18 @@ static inline int register_memory_notifi
 static inline void unregister_memory_notifier(struct notifier_block *nb)
 {
 }
+static inline int memory_notify(unsigned long val, void *v)
+{
+   return 0;
+}
 #else
+extern int register_memory_notifier(struct notifier_block *nb);
+extern void unregister_memory_notifier(struct notifier_block *nb);
 extern int register_new_memory(struct mem_section *);
 extern int unregister_memory_section(struct mem_section *);
 extern int memory_dev_init(void);
 extern int remove_memory_block(unsigned long, struct mem_section *, int);
-
+extern int memory_notify(unsigned long val, void *v);
 #define CONFIG_MEM_BLOCK_SIZE	(PAGES_PER_SECTION<<PAGE_SHIFT)
 
 
Index: current/mm/memory_hotplug.c
===
--- current.orig/mm/memory_hotplug.c2007-10-11 14:33:02.0 +0900
+++ current/mm/memory_hotplug.c

[Patch 000/002] Make kmem_cache_node for SLUB on memory online to avoid panic(take 2)

2007-10-11 Thread Yasunori Goto
This patch set fixes a panic due to a NULL pointer access in SLUB.

When new memory is hot-added on a new node (or a memoryless node),
kmem_cache_node for the new node is not prepared,
and a panic occurs because of it. So, kmem_cache_node should be created
for the node before new memory is available on the node.
Incidentally, it is freed on memory offline if it becomes unnecessary.

This is the first user of the callback of memory notifier, and
requires its rearrange patch set.

This patch set is for 2.6.23-rc8-mm2.
I tested this patch on my ia64 box.

Please apply.

Bye.


-- 
Yasunori Goto 


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[Patch 001/002] extract kmem_cache_shrink

2007-10-11 Thread Yasunori Goto
Make kmem_cache_shrink_node() as a callback routine for the memory hotplug
notifier. This just extracts a part of kmem_cache_shrink().

Signed-off-by: Yasunori Goto [EMAIL PROTECTED]

---
 mm/slub.c |  111 ++
 1 file changed, 61 insertions(+), 50 deletions(-)

Index: current/mm/slub.c
===
--- current.orig/mm/slub.c  2007-10-11 20:30:45.0 +0900
+++ current/mm/slub.c   2007-10-11 21:58:47.0 +0900
@@ -2626,6 +2626,56 @@ void kfree(const void *x)
 }
 EXPORT_SYMBOL(kfree);
 
+static inline void __kmem_cache_shrink_node(struct kmem_cache *s, int node,
+					struct list_head *slabs_by_inuse)
+{
+	struct kmem_cache_node *n;
+	int i;
+	struct page *page;
+	struct page *t;
+	unsigned long flags;
+
+	n = get_node(s, node);
+
+	if (!n->nr_partial)
+		return;
+
+	for (i = 0; i < s->objects; i++)
+		INIT_LIST_HEAD(slabs_by_inuse + i);
+
+	spin_lock_irqsave(&n->list_lock, flags);
+
+	/*
+	 * Build lists indexed by the items in use in each slab.
+	 *
+	 * Note that concurrent frees may occur while we hold the
+	 * list_lock. page->inuse here is the upper limit.
+	 */
+	list_for_each_entry_safe(page, t, &n->partial, lru) {
+		if (!page->inuse && slab_trylock(page)) {
+			/*
+			 * Must hold slab lock here because slab_free
+			 * may have freed the last object and be
+			 * waiting to release the slab.
+			 */
+			list_del(&page->lru);
+			n->nr_partial--;
+			slab_unlock(page);
+			discard_slab(s, page);
+		} else
+			list_move(&page->lru, slabs_by_inuse + page->inuse);
+	}
+
+	/*
+	 * Rebuild the partial list with the slabs filled up most
+	 * first and the least used slabs at the end.
+	 */
+	for (i = s->objects - 1; i >= 0; i--)
+		list_splice(slabs_by_inuse + i, n->partial.prev);
+
+	spin_unlock_irqrestore(&n->list_lock, flags);
+}
+
 /*
  * kmem_cache_shrink removes empty slabs from the partial lists and sorts
  * the remaining slabs by the number of items in use. The slabs with the
@@ -2636,68 +2686,29 @@ EXPORT_SYMBOL(kfree);
  * being allocated from last increasing the chance that the last objects
  * are freed in them.
  */
-int kmem_cache_shrink(struct kmem_cache *s)
+int kmem_cache_shrink_node(struct kmem_cache *s, int node)
 {
-   int node;
-   int i;
-   struct kmem_cache_node *n;
-   struct page *page;
-   struct page *t;
 	struct list_head *slabs_by_inuse =
 		kmalloc(sizeof(struct list_head) * s->objects, GFP_KERNEL);
-	unsigned long flags;
 
 	if (!slabs_by_inuse)
 		return -ENOMEM;
 
 	flush_all(s);
-	for_each_node_state(node, N_NORMAL_MEMORY) {
-		n = get_node(s, node);
-
-		if (!n->nr_partial)
-			continue;
-
-		for (i = 0; i < s->objects; i++)
-			INIT_LIST_HEAD(slabs_by_inuse + i);
-
-		spin_lock_irqsave(&n->list_lock, flags);
-
-		/*
-		 * Build lists indexed by the items in use in each slab.
-		 *
-		 * Note that concurrent frees may occur while we hold the
-		 * list_lock. page->inuse here is the upper limit.
-		 */
-		list_for_each_entry_safe(page, t, &n->partial, lru) {
-			if (!page->inuse && slab_trylock(page)) {
-				/*
-				 * Must hold slab lock here because slab_free
-				 * may have freed the last object and be
-				 * waiting to release the slab.
-				 */
-				list_del(&page->lru);
-				n->nr_partial--;
-				slab_unlock(page);
-				discard_slab(s, page);
-			} else {
-				list_move(&page->lru,
-					slabs_by_inuse + page->inuse);
-			}
-		}
-
-		/*
-		 * Rebuild the partial list with the slabs filled up most
-		 * first and the least used slabs at the end.
-		 */
-		for (i = s->objects - 1; i >= 0; i--)
-			list_splice(slabs_by_inuse + i, n->partial.prev);
-
-		spin_unlock_irqrestore(&n->list_lock, flags);
-	}
+	if (node >= 0)
+		__kmem_cache_shrink_node(s, node, slabs_by_inuse);
+	else

[Patch 002/002] Create/delete kmem_cache_node for SLUB on memory online callback

2007-10-11 Thread Yasunori Goto

This makes kmem_cache_nodes of all SLUBs for the new node when 
memory hot-add is called. This fixes a panic due to a NULL pointer access at
discard_slab() after memory hot-add.

If pages on the new node are available, slub can use them before making
new kmem_cache_nodes. So, this callback should be called
BEFORE pages on the node are available.

When memory online is called, slab_mem_going_online_callback() is
called to make kmem_cache_node(). If it (or another callback) fails,
then slab_mem_offline_callback() is called for rollback.

In memory offline, slab_mem_going_offline_callback() is called to
shrink caches, then slab_mem_offline_callback() is called later.


Signed-off-by: Yasunori Goto [EMAIL PROTECTED]

---
 mm/slub.c |  117 ++
 1 file changed, 117 insertions(+)

Index: current/mm/slub.c
===
--- current.orig/mm/slub.c  2007-10-11 20:31:37.0 +0900
+++ current/mm/slub.c   2007-10-11 21:58:10.0 +0900
@@ -20,6 +20,7 @@
 #include <linux/mempolicy.h>
 #include <linux/ctype.h>
 #include <linux/kallsyms.h>
+#include <linux/memory.h>
 
 /*
  * Lock order:
@@ -2711,6 +2712,120 @@ int kmem_cache_shrink(struct kmem_cache 
 }
 EXPORT_SYMBOL(kmem_cache_shrink);
 
+#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG)
+static int slab_mem_going_offline_callback(void *arg)
+{
+	struct kmem_cache *s;
+	struct memory_notify *marg = arg;
+	int local_node, offline_node = marg->status_change_nid;
+
+	if (offline_node < 0)
+		/* node has memory yet. nothing to do. */
+		return 0;
+
+	down_read(&slub_lock);
+	list_for_each_entry(s, &slab_caches, list) {
+		local_node = page_to_nid(virt_to_page(s));
+		if (local_node == offline_node)
+			/* This slub is on the offline node. */
+			return -EBUSY;
+	}
+	up_read(&slub_lock);
+
+   kmem_cache_shrink_node(s, offline_node);
+
+   return 0;
+}
+
+static void slab_mem_offline_callback(void *arg)
+{
+   struct kmem_cache_node *n;
+   struct kmem_cache *s;
+   struct memory_notify *marg = arg;
+   int offline_node;
+
+	offline_node = marg->status_change_nid;
+
+	if (offline_node < 0)
+		/* node has memory yet. nothing to do. */
+		return;
+
+	down_read(&slub_lock);
+	list_for_each_entry(s, &slab_caches, list) {
+		n = get_node(s, offline_node);
+		if (n) {
+			/*
+			 * if n->nr_slabs > 0, offline_pages() must be fail,
+			 * because the node is used by slub yet.
+			 */
+			BUG_ON(atomic_read(&n->nr_slabs));
+
+			s->node[offline_node] = NULL;
+			kmem_cache_free(kmalloc_caches, n);
+		}
+	}
+	up_read(&slub_lock);
+}
+
+static int slab_mem_going_online_callback(void *arg)
+{
+   struct kmem_cache_node *n;
+   struct kmem_cache *s;
+   struct memory_notify *marg = arg;
+	int nid = marg->status_change_nid;
+
+	/* If the node already has memory, then nothing is necessary. */
+	if (nid < 0)
+		return 0;
+
+	/*
+	 * New memory will be onlined on the node which has no memory so far.
+	 * New kmem_cache_node is necessary for it.
+	 */
+	down_read(&slub_lock);
+	list_for_each_entry(s, &slab_caches, list) {
+		/*
+		 * XXX: The new node's memory can't be allocated yet,
+		 *	kmem_cache_node will be allocated on another node.
+		 */
+		n = kmem_cache_alloc(kmalloc_caches, GFP_KERNEL);
+		if (!n)
+			return -ENOMEM;
+		init_kmem_cache_node(n);
+		s->node[nid] = n;
+	}
+	up_read(&slub_lock);
+
+   return 0;
+}
+
+static int slab_memory_callback(struct notifier_block *self,
+   unsigned long action, void *arg)
+{
+   int ret = 0;
+
+   switch (action) {
+   case MEM_GOING_ONLINE:
+   ret = slab_mem_going_online_callback(arg);
+   break;
+   case MEM_GOING_OFFLINE:
+   ret = slab_mem_going_offline_callback(arg);
+   break;
+   case MEM_OFFLINE:
+   case MEM_CANCEL_ONLINE:
+   slab_mem_offline_callback(arg);
+   break;
+   case MEM_ONLINE:
+   case MEM_CANCEL_OFFLINE:
+   break;
+   }
+
+   ret = notifier_from_errno(ret);
+   return ret;
+}
+
+#endif /* CONFIG_MEMORY_HOTPLUG */
+
 /
  * Basic setup of slabs
  ***/
@@ -2741,6 +2856,8 @@ void __init kmem_cache_init(void
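(The hunk above is cut off in the archive. Judging from the hunk header it
touches kmem_cache_init(), where the callback is presumably registered with a
one-liner along the lines of

	hotplug_memory_notifier(slab_memory_callback, 1);

the priority value here is an assumption, since only the hunk header survives.)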

Re: [Patch 001/002] Make description of memory hotplug notifier in document

2007-10-11 Thread Yasunori Goto
 Looks good. Some suggestions on improving the wording.

Thanks! I'll fix them.

Bye.

 
 On Fri, 12 Oct 2007, Yasunori Goto wrote:
 
  +MEMORY_GOING_ONLINE
  +  This is notified before memory online. If some structures must be 
  prepared
  +  for new memory, it should be done at this event's callback.
  +  The new onlining memory can't be used yet.
 
 Generated before new memory becomes available in order to be able to 
 prepare subsystems to handle memory. The page allocator is still unable
 to allocate from the new memory.
 
  +MEMORY_CANCEL_ONLINE
  +  If memory online fails, this event is notified for rollback of setting at
  +  MEMORY_GOING_ONLINE.
  +  (Currently, this event is notified only the case which a callback routine
  +   of MEMORY_GOING_ONLINE fails).
 
 Generated if MEMORY_GOING_ONLINE fails.
 
  +MEMORY_ONLINE
  +  This event is called when memory online is completed. The page allocator 
  uses
  +  new memory area before this notification. In other words, callback 
  routine
  +  use new memory area via page allocator.
  +  The failures of callbacks of this notification will be ignored.
 
 Generated when memory has successfully been brought online. The callback may 
 allocate from the new memory.
 
  +MEMORY_GOING_OFFLINE
  +  This is notified on halfway of memory offline. The offlining pages are
  +  isolated. In other words, the page allocater doesn't allocate new pages 
  from
  +  offlining memory area at this time. If callback routine freed some pages,
  +  they are not used by the page allocator.
  +  This is good place for shrinking cache. (If possible, it is desirable to
  +  migrate to other area.)
 
 Generated to begin the process of offlining memory. Allocations are no 
 longer possible from the memory but some of the memory to be offlined
 is still in use. The callback can be used to free memory known to a 
 subsystem from the indicated node.
 
  +MEMORY_CANCEL_OFFLINE
  +  If memory offline fails, this event is notified for rollback against
  +  MEMORY_GOING_OFFLINE. The page allocator will use target memory area 
  after
  +  this callback again.
 
 Generated if MEMORY_GOING_OFFLINE fails. Memory is available again from 
 the node that we attempted to offline.
 
  + +MEMORY_OFFLINE
 
 Generated after offlining memory is complete.

-- 
Yasunori Goto 


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Patch 001/002] extract kmem_cache_shrink

2007-10-11 Thread Yasunori Goto
 On Fri, 12 Oct 2007, Yasunori Goto wrote:
 
  Make kmem_cache_shrink_node() as a callback routine for the memory hotplug
  notifier. This just extracts a part of kmem_cache_shrink().
 
 Could we just call kmem_cache_shrink? It will do the shrink on every node 
 but memory hotplug is rare?

Yes it is. Memory hotplug is rare.
Ok. I'll do it.

Thanks.
-- 
Yasunori Goto 


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Patch 002/002] Create/delete kmem_cache_node for SLUB on memory online callback

2007-10-12 Thread Yasunori Goto
 On Fri, 12 Oct 2007, Yasunori Goto wrote:
  
  If pages on the new node are available, slub can use them before making
  new kmem_cache_nodes. So, this callback should be called
  BEFORE pages on the node are available.
 
 If it's called before pages on the node are available then it must 
 fallback and cannot use the pages.

Hmm. My description may be wrong. I would just like to
mention that kmem_cache_node should be created before the node's pages
can be allocated. If not, it will cause a panic.


  +#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG)
  +static int slab_mem_going_offline_callback(void *arg)
  +{
  +   struct kmem_cache *s;
  +   struct memory_notify *marg = arg;
  +   int local_node, offline_node = marg->status_change_nid;
  +
  +   if (offline_node < 0)
  +   /* node has memory yet. nothing to do. */
 
 Please clarify the comment. This seems to indicate that we should not
 do anything because the node still has memory?

Yes. kmem_cache_node is still necessary for the remaining memory on the
node.

 Doesnt the node always have memory before offlining?

If the node doesn't have memory and offline_pages() is called for it,
that must be checked and fail, and this callback shouldn't be called.
If not, it is a bug of memory hotplug, I think.


  +   return 0;
  +
  +   down_read(&slub_lock);
  +   list_for_each_entry(s, &slab_caches, list) {
  +   local_node = page_to_nid(virt_to_page(s));
  +   if (local_node == offline_node)
  +   /* This slub is on the offline node. */
  +   return -EBUSY;
  +   }
  +   up_read(&slub_lock);
 
 So this checks if any kmem_cache structure is on the offlined node? If
 so then we cannot offline the node?

Right. If slabs' migration were possible, here would be a good place for
doing it. But it is not possible (at least for now).

  +   kmem_cache_shrink_node(s, offline_node);
 
 kmem_cache_shrink(s) would be okay here I would think. The function is
 reasonably fast. Offlining is a rare event.

Ok. I'll fix it.

  +static void slab_mem_offline_callback(void *arg)
 
 We call this after we have established that no kmem_cache structures are 
 on this node and after we have shrunk the slabs. Is there any guarantee that
 no slab operations have occurred since then?

If slabs still exist, they can't be migrated and offline_pages() has
to give up the offline. This means the MEM_OFFLINE event is not generated when
slabs are on the removing node.
In other words, when this event is generated, all of the pages in
this section are isolated and there are no used pages (slabs).


 
  +{
  +   struct kmem_cache_node *n;
  +   struct kmem_cache *s;
  +   struct memory_notify *marg = arg;
  +   int offline_node;
  +
  +   offline_node = marg->status_change_nid;
  +
  +   if (offline_node < 0)
  +   /* node has memory yet. nothing to do. */
  +   return;
 
 Does this mean that the node still has memory?

Yes.


  +   down_read(&slub_lock);
  +   list_for_each_entry(s, &slab_caches, list) {
  +   n = get_node(s, offline_node);
  +   if (n) {
  +   /*
  +* if n->nr_slabs > 0, offline_pages() must be fail,
  +* because the node is used by slub yet.
  +*/
 
 It may be clearer to say:
 
 If nr_slabs > 0 then slabs still exist on the node that is going down.
 We were unable to free them so we must fail.

Again: if nr_slabs > 0, offline_pages() must fail due to slabs
remaining on the node. So, this callback isn't called.

  +static int slab_mem_going_online_callback(void *arg)
  +{
  +   struct kmem_cache_node *n;
  +   struct kmem_cache *s;
  +   struct memory_notify *marg = arg;
  +   int nid = marg->status_change_nid;
  +
  +   /* If the node already has memory, then nothing is necessary. */
  +   if (nid < 0)
  +   return 0;
 
 The node must have memory  Or we have already brought up the code?

kmem_cache_node is created at boot time if the node has memory.
(Or, it is created by this callback on the first memory added to the node).

When nid = -1, kmem_cache_node was created before, because this node
already has memory.

 
  +   /*
  +* New memory will be onlined on the node which has no memory so far.
  +* New kmem_cache_node is necessary for it.
 
 We are bringing a node online. No memory is available yet. We must 
 allocate a kmem_cache_node structure in order to bring the node online. ?

Your wording might be OK.
But I would prefer to define the status of node hotplug precisely,
like the following:


A)Node online -- pgdat is created and can be accessed for this node,
 but there is no guarantee that cpu or memory is onlined.
 This status is very close to a memory-less node.
 But this might be a halfway status for node hotplug.
 The Node online bit is set, but N_HIGH_MEMORY
 (or N_NORMAL_MEMORY) might be not set.

B)Node has memory --
 one or more sections of memory are onlined on the node.
 N_HIGH_MEMORY (or N_NORMAL_MEMORY) is set.

Re: [Patch 002/002] Create/delete kmem_cache_node for SLUB on memory online callback

2007-10-12 Thread Yasunori Goto
 On Fri, 12 Oct 2007, Yasunori Goto wrote:
 
+   down_read(slub_lock);
+   list_for_each_entry(s, slab_caches, list) {
+   local_node = page_to_nid(virt_to_page(s));
+   if (local_node == offline_node)
+   /* This slub is on the offline node. */
+   return -EBUSY;
+   }
+   up_read(slub_lock);
   
    So this checks if any kmem_cache structure is on the offlined node? If
   so then we cannot offline the node?
  
   Right. If slabs' migration were possible, here would be a good place for
   doing it. But it is not possible (at least for now).
 
 I think you can avoid this check. The kmem_cache structures are allocated 
 from the kmalloc array. The check if the kmalloc slabs are empty will fail 
 if kmem_cache structures still exist on the node.

Ah, Ok.


 
+* because the node is used by slub yet.
+*/
   
   It may be clearer to say:
   
    If nr_slabs > 0 then slabs still exist on the node that is going down.
    We were unable to free them so we must fail.
   
   Again: if nr_slabs > 0, offline_pages() must fail due to slabs
   remaining on the node. So, this callback isn't called.
 
 Ok then we can remove these checks?

Hmm. Yes. I'll remove it.

 
+static int slab_mem_going_online_callback(void *arg)
+{
+   struct kmem_cache_node *n;
+   struct kmem_cache *s;
+   struct memory_notify *marg = arg;
 +   int nid = marg->status_change_nid;
 +
 +   /* If the node already has memory, then nothing is necessary. */
 +   if (nid < 0)
 +   return 0;
   
   The node must have memory  Or we have already brought up the code?
  
   kmem_cache_node is created at boot time if the node has memory.
   (Or, it is created by this callback on the first memory added to the node).
   
   When nid = -1, kmem_cache_node was created before, because this node
   already has memory.
 
 So the function can be called for a node that is already online?

already node memory available, accurately ;-)


 
 +* New memory will be onlined on the node which has no memory 
 so far.
 +* New kmem_cache_node is necessary for it.
   
   We are bringing a node online. No memory is available yet. We must 
   allocate a kmem_cache_node structure in order to bring the node online. ?
  
   Your wording might be OK.
   But I would prefer to define the status of node hotplug precisely,
   like the following:
  
  
   A)Node online -- pgdat is created and can be accessed for this node,
    but there is no guarantee that cpu or memory is onlined.
    This status is very close to a memory-less node.
    But this might be a halfway status for node hotplug.
    The Node online bit is set, but N_HIGH_MEMORY
    (or N_NORMAL_MEMORY) might be not set.
 
 Ahh.. Okay.
 
   B)Node has memory --
    one or more sections of memory are onlined on the node.
    N_HIGH_MEMORY (or N_NORMAL_MEMORY) is set.
  
  If first memory is onlined on the node, the node status changes
  from A) to B).
  
   I feel this is very useful to manage the halfway status of node
   hotplug. (So, the memory-less node patch is very helpful for me.)
   
   So, I would like to avoid using the word node online here.
   But, if the above definition is messy for others, I'll change it.
 
 Ok can we talk about this as
 
   node online
 
 and
 
   node memory available?

Yes. Thanks.


-- 
Yasunori Goto 


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[Doc] Memory hotplug document take 3

2007-08-02 Thread Yasunori Goto
Hello.

This is the newest version of document of memory hotplug.
Please apply.

--

Change log from take 2.

- updates against comments from Randy-san.
  (Take 3 is same as http://lkml.org/lkml/2007/7/27/432 )

Change log from take 1.
- updates against comments from Randy-san (Thanks a lot!)
- mention about physical/logical phase of hotplug.
  change sections for it.
- add description of kernel config option.
- add description of relationship against ACPI node-hotplug.
- make patch style.
- etc.

---
This adds a document for memory hotplug describing how to use it and its
current status.


---
Signed-off-by: KAMEZAWA Hiroyuki [EMAIL PROTECTED]
Signed-off-by: Yasunori Goto [EMAIL PROTECTED]


 Documentation/memory-hotplug.txt |  322 +++
 1 files changed, 322 insertions(+)

Index: makedocument/Documentation/memory-hotplug.txt
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ makedocument/Documentation/memory-hotplug.txt   2007-07-28 
11:47:52.0 +0900
@@ -0,0 +1,322 @@
+==
+Memory Hotplug
+==
+
+Last Updated: Jul 28 2007
+
+This document is about memory hotplug including how-to-use and current status.
+Because Memory Hotplug is still under development, contents of this text will
+be changed often.
+
+1. Introduction
+  1.1 purpose of memory hotplug
+  1.2. Phases of memory hotplug
+  1.3. Unit of Memory online/offline operation
+2. Kernel Configuration
+3. sysfs files for memory hotplug
+4. Physical memory hot-add phase
+  4.1 Hardware(Firmware) Support
+  4.2 Notify memory hot-add event by hand
+5. Logical Memory hot-add phase
+  5.1. State of memory
+  5.2. How to online memory
+6. Logical memory remove
+  6.1 Memory offline and ZONE_MOVABLE
+  6.2. How to offline memory
+7. Physical memory remove
+8. Future Work List
+
+Note(1): x86_64 has a special implementation for memory hotplug.
+ This text does not describe it.
+Note(2): This text assumes that sysfs is mounted at /sys.
+
+
+---
+1. Introduction
+---
+
+1.1 purpose of memory hotplug
+
+Memory Hotplug allows users to increase/decrease the amount of memory.
+Generally, there are two purposes.
+
+(A) For changing the amount of memory.
+This is to allow a feature like capacity on demand.
+(B) For installing/removing DIMMs or NUMA-nodes physically.
+This is to exchange DIMMs/NUMA-nodes, reduce power consumption, etc.
+
+(A) is required by highly virtualized environments and (B) is required by
+hardware which supports memory power management.
+
+Linux memory hotplug is designed for both purposes.
+
+
+1.2. Phases of memory hotplug
+---
+There are 2 phases in Memory Hotplug.
+  1) Physical Memory Hotplug phase
+  2) Logical Memory Hotplug phase.
+
+The first phase is to communicate with hardware/firmware and make/erase an
+environment for the hotplugged memory. Basically, this phase is necessary
+for purpose (B), but this is a good phase for communication between
+highly virtualized environments too.
+
+When memory is hotplugged, the kernel recognizes new memory, makes new memory
+management tables, and makes sysfs files for new memory's operation.
+
+If firmware supports notification of connection of new memory to the OS,
+this phase is triggered automatically. ACPI can notify this event. If not,
+the probe operation by the system administrator is used instead.
+(see Section 4.).
+
+Logical Memory Hotplug phase is to change memory state into
+available/unavailable for users. The amount of memory from the user's view is
+changed by this phase. The kernel makes all memory in it as free pages
+when a memory range is available.
+
+In this document, this phase is described as online/offline.
+
+Logical Memory Hotplug phase is triggered by a write to a sysfs file by the system
+administrator. For the hot-add case, it must be executed after Physical Hotplug
+phase by hand.
+(However, if you write udev's hotplug scripts for memory hotplug, these
+ phases can be executed in a seamless way.)
+
+
+1.3. Unit of Memory online/offline operation
+
+Memory hotplug uses SPARSEMEM memory model. SPARSEMEM divides the whole memory
+into chunks of the same size. The chunk is called a section. The size of
+a section is architecture dependent. For example, power uses 16MiB, ia64 uses
+1GiB. The unit of online/offline operation is one section. (see Section 3.)
+
+To determine the size of sections, please read this file:
+
+/sys/devices/system/memory/block_size_bytes
+
+This file shows the size of sections in bytes.
+
+---
+2. Kernel Configuration
+---
+To use the memory hotplug feature, the kernel must be compiled with the
+following config options.
+
+- For all memory hotplug
+Memory model -> Sparse Memory  (CONFIG_SPARSEMEM)
+Allow for memory hot-add   (CONFIG_MEMORY_HOTPLUG)
+
+- To enable memory removal, the following are also necessary
+Allow for memory hot remove  (CONFIG_MEMORY_HOTREMOVE)
+Page Migration   (CONFIG_MIGRATION)

Re: [2.6 patch] mm/migrate.c: cleanups

2007-08-02 Thread Yasunori Goto
Sorry for the late response.

But this patch is the cause of a compile error in the memory unplug code of
2.6.23-rc1-mm2, which uses putback_lru_pages(). 
Don't make it static please... :-(

Bye.


 CC  mm/memory_hotplug.o
mm/memory_hotplug.c: In function ‘do_migrate_range’:
mm/memory_hotplug.c:402: error: implicit declaration of function 
‘putback_lru_pages’
make[1]: *** [mm/memory_hotplug.o] Error 1




 This patch contains the following cleanups:
 - every file should include the headers containing the prototypes for
   its global functions
 - make the needlessly global putback_lru_pages() static
 
 Signed-off-by: Adrian Bunk [EMAIL PROTECTED]
 Acked-by: Christoph Lameter [EMAIL PROTECTED]
 
 ---
 
 This patch has been sent on:
 - 6 Jul 2007
 
  include/linux/migrate.h |2 --
  mm/migrate.c|3 ++-
  2 files changed, 2 insertions(+), 3 deletions(-)
 
 --- linux-2.6.22-rc6-mm1/include/linux/migrate.h.old  2007-07-05 
 17:10:01.0 +0200
 +++ linux-2.6.22-rc6-mm1/include/linux/migrate.h  2007-07-05 
 17:10:10.0 +0200
 @@ -26,7 +26,6 @@
  }
  
  extern int isolate_lru_page(struct page *p, struct list_head *pagelist);
 -extern int putback_lru_pages(struct list_head *l);
  extern int migrate_page(struct address_space *,
   struct page *, struct page *);
  extern int migrate_pages(struct list_head *l, new_page_t x, unsigned long);
 @@ -44,7 +43,6 @@
  
  static inline int isolate_lru_page(struct page *p, struct list_head *list)
   { return -ENOSYS; }
 -static inline int putback_lru_pages(struct list_head *l) { return 0; }
  static inline int migrate_pages(struct list_head *l, new_page_t x,
   unsigned long private) { return -ENOSYS; }
  
 --- linux-2.6.22-rc6-mm1/mm/migrate.c.old 2007-07-05 17:10:16.0 
 +0200
 +++ linux-2.6.22-rc6-mm1/mm/migrate.c 2007-07-05 17:11:43.0 +0200
 @@ -28,6 +28,7 @@
   #include <linux/mempolicy.h>
   #include <linux/vmalloc.h>
   #include <linux/security.h>
  +#include <linux/syscalls.h>
  
  #include internal.h
  
 @@ -101,7 +102,7 @@
   *
   * returns the number of pages put back.
   */
 -int putback_lru_pages(struct list_head *l)
 +static int putback_lru_pages(struct list_head *l)
  {
   struct page *page;
   struct page *page2;
 
 -
 To unsubscribe from this list: send the line unsubscribe linux-kernel in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 Please read the FAQ at  http://www.tux.org/lkml/

-- 
Yasunori Goto 


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] Avoiding fragmentation through different allocator

2005-01-15 Thread Yasunori Goto
Hello.

I'm also very interested in your patches, because I'm working on
memory hotplug too.

 On possibility is that we could say that the UserRclm and KernRclm pool
 are always eligible for hotplug and have hotplug banks only satisy those
 allocations pushing KernNonRclm allocations to fixed banks. How is it
 currently known if a bank of memory is hotplug? Is there a node for each
 hotplug bank? If yes, we could flag those nodes to only satisify UserRclm
 and KernRclm allocations and force fallback to other nodes. 

There are 2 types of memory hotplug.

a)SMP machine case
  Some part of memory will be added and removed.

b)NUMA machine case.
  The whole of a node will be able to be removed and added.
  However, if a block of memory like a DIMM is broken and disabled,
  it's close to a).

How to know where the hotpluggable banks are is a platform/architecture
dependent issue. 
 ex) Asking ACPI.
 Just node0 becomes unremovable, and other nodes are removable.
 etc...

In your current patch, the first attribute of all pages is NoRclm.
But if your patches had an interface to decide where Rclm will be for
each arch/platform, it might be good.


 The danger is
 that allocations would fail because non-hotplug banks were already full
 and pageout would not happen because the watermarks were satisified.

In this case, if the user can change the attribute of a Rclm area to 
NoRclm, it is better than nothing. 
In the hotplug patches, there will be a new zone, ZONE_REMOVABLE.
But in this patch, changing this attribute is a little bit difficult.
(At first remove the pages from the free_area of the removable zone, 
 then add them to the free_area of the un-removable zone.)
Probably this change is easier in your patch.


 (Bear in mind I can't test hotplug-related issues due to lack of suitable
 hardware)

I also don't have a real hotplug machine now. ;-)
I just use software emulation.

  It looks like you left the per_cpu_pages as-is.  Did you
  consider separating those as well to reflect kernel vs. user
  pools?
 
 
 I kept the per-cpu caches for UserRclm-style allocations only because
 otherwise a Kernel-nonreclaimable allocation could easily be taken from a
 UserRclm pool.

I agree that dividing the per-cpu caches is not a good way.
But if a Kernel-nonreclaimable allocation uses the UserRclm pool, 
the removable memory bank will suddenly become harder to remove.
Is it correct? If so, it is not good for memory hotplug.
Hmm.

Anyway, thank you for your patch. It is very interesting.

Bye.

-- 
Yasunori Goto ygoto at us.fujitsu.com


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] Avoiding fragmentation through different allocator

2005-01-17 Thread Yasunori Goto
  There are 2 types of memory hotplug.
 
  a)SMP machine case
    Some part of memory will be added and removed.
 
  b)NUMA machine case.
    The whole of a node will be able to be removed and added.
    However, if a block of memory like a DIMM is broken and disabled,
    it's close to a).
 
  How to know where the hotpluggable banks are is a platform/architecture
  dependent issue.
   ex) Asking ACPI.
   Just node0 becomes unremovable, and other nodes are removable.
   etc...
 
 
 Is there an architecture-independant way of finding this out?

  No. At least, I have no idea. :-(


  In current your patch, first attribute of all pages are NoRclm.
  But if your patches has interface to decide where will be Rclm for
  each arch/platform, it might be good.
 
 
 It doesn't have an API as such. In page_alloc.c, there is a function
 get_pageblock_type() that returns what type of allocation the block of
 memory is being used for. There is no guarentee there is only those type
 of allocations there though.

OK. I will write a patch of function to set it for some arch/platform.

 What's the current attidute for adding a new zone? I felt there would be
 resistence as a new zone would affect a lot of code paths and be yet
 another zone that needed balancing. For example, is there a HIGHMEM
 version of the ZONE_REMOVABLE or could normal and highmem be in this zone?

Yes. In my current patch of memory hotplug, Removable is like Highmem.
 ( http://sourceforge.net/mailarchive/forum.php?forum_id=223
 It is group B of the Hot Add patches for NUMA )

I tried to make a new removable zone which could be with normal and dma
before it. But it needs too much work, as you said. So, I gave it up.
I heard Matt-san has some ideas for it. So, I'm looking forward to 
seeing them.

   I agree that dividing the per-cpu caches is not a good way.
   But if a Kernel-nonreclaimable allocation uses the UserRclm pool,
   the removable memory bank will suddenly become harder to remove.
   Is it correct? If so, it is not good for memory hotplug.
  H.
 
 
 It is correct. However, this will only happen in low-memory conditions.
 For a kernel-nonreclaimable allocation to use the userrclm pool, three
 conditions have to be met;
 
 1. Kernel-nonreclaimable pool has no pages
 2. There are no global 2^MAX_ORDER pages
 3. Kern-reclaimable pool has no pages

I suppose if this patch has worked for one year,
an unlucky case might occur. Probably, enterprise systems will not
allow it. So, I will try disabling fallback for KernNoRclm.

Thanks.

-- 
Yasunori Goto ygoto at us.fujitsu.com


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[Patch]compile error of register_memory()

2006-12-19 Thread Yasunori Goto
Hello.

register_memory() becomes a double definition in 2.6.20-rc1.
It was defined in arch/i386/kernel/setup.c as a static definition in
2.6.19, but it has moved to arch/i386/kernel/e820.c in 2.6.20-rc1.
And a function of the same name is defined in drivers/base/memory.c too.
So, it becomes the cause of a compile error of duplicate definition if 
the memory hotplug option is on.

This patch fixes it. 

Signed-off-by: Yasunori Goto [EMAIL PROTECTED]


---
 arch/i386/kernel/e820.c  |2 +-
 arch/i386/kernel/setup.c |2 +-
 include/asm-i386/e820.h  |2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

Index: linux-2.6.20-rc1/arch/i386/kernel/e820.c
===
--- linux-2.6.20-rc1.orig/arch/i386/kernel/e820.c   2006-12-19 
21:52:36.0 +0900
+++ linux-2.6.20-rc1/arch/i386/kernel/e820.c2006-12-19 22:15:59.0 
+0900
@@ -668,7 +668,7 @@
}
 }
 
-void __init register_memory(void)
+void __init e820_register_memory(void)
 {
unsigned long gapstart, gapsize, round;
unsigned long long last;
Index: linux-2.6.20-rc1/arch/i386/kernel/setup.c
===
--- linux-2.6.20-rc1.orig/arch/i386/kernel/setup.c  2006-12-19 
21:52:36.0 +0900
+++ linux-2.6.20-rc1/arch/i386/kernel/setup.c   2006-12-19 22:15:59.0 
+0900
@@ -639,7 +639,7 @@
get_smp_config();
 #endif
 
-   register_memory();
+   e820_register_memory();
 
 #ifdef CONFIG_VT
 #if defined(CONFIG_VGA_CONSOLE)
Index: linux-2.6.20-rc1/include/asm-i386/e820.h
===
--- linux-2.6.20-rc1.orig/include/asm-i386/e820.h   2006-12-19 
21:52:36.0 +0900
+++ linux-2.6.20-rc1/include/asm-i386/e820.h2006-12-19 22:16:28.0 
+0900
@@ -40,7 +40,7 @@
   unsigned type);
 extern void find_max_pfn(void);
 extern void register_bootmem_low_pages(unsigned long max_low_pfn);
-extern void register_memory(void);
+extern void e820_register_memory(void);
 extern void limit_regions(unsigned long long size);
 extern void print_memory_map(char *who);
 

-- 
Yasunori Goto 


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[Patch](memory hotplug) fix compile error for i386 with NUMA config (take 3).

2006-12-20 Thread Yasunori Goto
Hello.

This is the take 3 patch to fix the compile error when configuring
memory hotplug with NUMA on i386.

The cause of the compile error was missing arch_add_memory(), remove_memory(),
and memory_add_physaddr_to_nid().

I fixed some bad points, and tested that it compiles without error.

This is for 2.6.20-rc1. 

Please apply.

Signed-off-by: Yasunori Goto [EMAIL PROTECTED]

---
 arch/i386/mm/discontig.c |   28 
 arch/i386/mm/init.c  |   10 ++
 2 files changed, 30 insertions(+), 8 deletions(-)

Index: linux-2.6.20-rc1/arch/i386/mm/init.c
===
--- linux-2.6.20-rc1.orig/arch/i386/mm/init.c   2006-12-20 22:12:07.0 
+0900
+++ linux-2.6.20-rc1/arch/i386/mm/init.c2006-12-20 22:12:09.0 
+0900
@@ -673,16 +673,10 @@
 #endif
 }
 
-/*
- * this is for the non-NUMA, single node SMP system case.
- * Specifically, in the case of x86, we will always add
- * memory to the highmem for now.
- */
 #ifdef CONFIG_MEMORY_HOTPLUG
-#ifndef CONFIG_NEED_MULTIPLE_NODES
 int arch_add_memory(int nid, u64 start, u64 size)
 {
-	struct pglist_data *pgdata = &contig_page_data;
+	struct pglist_data *pgdata = NODE_DATA(nid);
	struct zone *zone = pgdata->node_zones + ZONE_HIGHMEM;
	unsigned long start_pfn = start >> PAGE_SHIFT;
	unsigned long nr_pages = size >> PAGE_SHIFT;
@@ -694,7 +688,7 @@
 {
return -EINVAL;
 }
-#endif
+EXPORT_SYMBOL_GPL(remove_memory);
 #endif
 
 struct kmem_cache *pgd_cache;
Index: linux-2.6.20-rc1/arch/i386/mm/discontig.c
===
--- linux-2.6.20-rc1.orig/arch/i386/mm/discontig.c  2006-12-20 
22:12:07.0 +0900
+++ linux-2.6.20-rc1/arch/i386/mm/discontig.c   2006-12-20 22:37:54.0 
+0900
@@ -405,3 +405,31 @@
totalram_pages += totalhigh_pages;
 #endif
 }
+
+#ifdef CONFIG_MEMORY_HOTPLUG
+int paddr_to_nid(u64 addr)
+{
+   int nid;
+   unsigned long pfn = PFN_DOWN(addr);
+
+	for_each_node(nid)
+		if (node_start_pfn[nid] <= pfn &&
+		    pfn < node_end_pfn[nid])
+   return nid;
+
+   return -1;
+}
+
+/*
+ * This function is used to ask node id BEFORE memmap and mem_section's
+ * initialization (pfn_to_nid() can't be used yet).
+ * If _PXM is not defined on ACPI's DSDT, node id must be found by this.
+ */
+int memory_add_physaddr_to_nid(u64 addr)
+{
+   int nid = paddr_to_nid(addr);
+   return (nid = 0) ? nid : 0;
+}
+
+EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
+#endif
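For context, a hedged sketch of how a caller on the ACPI hot-add path might use
this helper when the firmware provides no _PXM (variable names are illustrative;
add_memory(nid, start, size) is the arch-independent interface of this era):

	/* Resolve the target node for the newly reported range,
	 * then add the memory to that node. */
	nid = memory_add_physaddr_to_nid(start_addr);
	result = add_memory(nid, start_addr, size);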

-- 
Yasunori Goto 


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[Patch](memory hotplug) Fix compile error for i386 with NUMA config

2006-12-09 Thread Yasunori Goto
Hello.

This patch fixes the compile error when configuring memory hotplug
with NUMA on i386.

The cause of the compile error was missing arch_add_memory(), remove_memory(),
and memory_add_physaddr_to_nid() when the NUMA config is on.

This is for 2.6.19, and I tested that it compiles without error.

Please apply.

Signed-off-by: Yasunori Goto [EMAIL PROTECTED]

---
 arch/i386/mm/discontig.c |   17 +
 arch/i386/mm/init.c  |4 +---
 2 files changed, 18 insertions(+), 3 deletions(-)

Index: linux-2.6.19/arch/i386/mm/init.c
===
--- linux-2.6.19.orig/arch/i386/mm/init.c   2006-12-04 20:06:32.0 
+0900
+++ linux-2.6.19/arch/i386/mm/init.c2006-12-04 21:09:49.0 +0900
@@ -681,10 +681,9 @@
  * memory to the highmem for now.
  */
 #ifdef CONFIG_MEMORY_HOTPLUG
-#ifndef CONFIG_NEED_MULTIPLE_NODES
 int arch_add_memory(int nid, u64 start, u64 size)
 {
-	struct pglist_data *pgdata = &contig_page_data;
+	struct pglist_data *pgdata = NODE_DATA(nid);
	struct zone *zone = pgdata->node_zones + ZONE_HIGHMEM;
	unsigned long start_pfn = start >> PAGE_SHIFT;
	unsigned long nr_pages = size >> PAGE_SHIFT;
@@ -697,7 +696,6 @@
return -EINVAL;
 }
 #endif
-#endif
 
 kmem_cache_t *pgd_cache;
 kmem_cache_t *pmd_cache;
Index: linux-2.6.19/arch/i386/mm/discontig.c
===
--- linux-2.6.19.orig/arch/i386/mm/discontig.c  2006-12-04 20:06:32.0 
+0900
+++ linux-2.6.19/arch/i386/mm/discontig.c   2006-12-09 17:30:24.0 
+0900
@@ -405,3 +405,20 @@
totalram_pages += totalhigh_pages;
 #endif
 }
+
+#ifdef CONFIG_MEMORY_HOTPLUG
+/* This is the case that there is no _PXM on DSDT for added memory */
+int memory_add_physaddr_to_nid(u64 addr)
+{
+   int nid;
+	unsigned long pfn = addr >> PAGE_SHIFT;
+
+	for (nid = 0; nid < MAX_NUMNODES; nid++){
+		if (node_start_pfn[nid] <= pfn &&
+		    pfn < node_end_pfn[nid])
+   return nid;
+   }
+
+   return 0;
+}
+#endif

-- 
Yasunori Goto 


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Patch](memory hotplug) Fix compile error for i386 with NUMA config

2006-12-10 Thread Yasunori Goto
Hi David-san.

 On Sat, 9 Dec 2006, Yasunori Goto wrote:
 
  Hello.
  
   This patch fixes the compile error when configuring memory hotplug
   with NUMA on i386.
   
   The cause of the compile error was missing arch_add_memory(), 
   remove_memory(),
   and memory_add_physaddr_to_nid() when the NUMA config is on.
   
   This is for 2.6.19, and I tested that it compiles without error.
  
  Please apply.
  
  Signed-off-by: Yasunori Goto [EMAIL PROTECTED]
  
  ---
   arch/i386/mm/discontig.c |   17 +
   arch/i386/mm/init.c  |4 +---
   2 files changed, 18 insertions(+), 3 deletions(-)
  
  Index: linux-2.6.19/arch/i386/mm/init.c
  ===
  --- linux-2.6.19.orig/arch/i386/mm/init.c   2006-12-04 20:06:32.0 
  +0900
  +++ linux-2.6.19/arch/i386/mm/init.c2006-12-04 21:09:49.0 
  +0900
  @@ -681,10 +681,9 @@
* memory to the highmem for now.
*/
   #ifdef CONFIG_MEMORY_HOTPLUG
  -#ifndef CONFIG_NEED_MULTIPLE_NODES
   int arch_add_memory(int nid, u64 start, u64 size)
   {
   -   struct pglist_data *pgdata = &contig_page_data;
   +   struct pglist_data *pgdata = NODE_DATA(nid);
   struct zone *zone = pgdata->node_zones + ZONE_HIGHMEM;
   unsigned long start_pfn = start >> PAGE_SHIFT;
   unsigned long nr_pages = size >> PAGE_SHIFT;
  @@ -697,7 +696,6 @@
  return -EINVAL;
   }
   #endif
  -#endif
   
   kmem_cache_t *pgd_cache;
   kmem_cache_t *pmd_cache;
 
 The reason for the #ifndef CONFIG_NEED_MULTIPLE_NODES check seems to 
 solely exist for excluding the NUMA case, so it doesn't appear as though 
 this is the correct fix since your changelog indicates a compile problem 
 with a NUMA build.  This hypothesis is supported by the comment which 
 conveniently appears just before arch_add_memory which _explicitly_ states 
 that the following is for non-NUMA cases.

No.
Other archs' arch_add_memory() and remove_memory() have already been
used for the NUMA case too. But i386 didn't do it because just 
contig_page_data is used. 
The current NODE_DATA() macro is defined appropriately for both cases.
So, this #ifdef is redundant now.

(See: http://marc.theaimsgroup.com/?l=linux-mmm=116494983531221w=2)

 
  Index: linux-2.6.19/arch/i386/mm/discontig.c
  ===
  --- linux-2.6.19.orig/arch/i386/mm/discontig.c  2006-12-04 
  20:06:32.0 +0900
  +++ linux-2.6.19/arch/i386/mm/discontig.c   2006-12-09 17:30:24.0 
  +0900
  @@ -405,3 +405,20 @@
  totalram_pages += totalhigh_pages;
   #endif
   }
  +
  +#ifdef CONFIG_MEMORY_HOTPLUG
  +/* This is the case that there is no _PXM on DSDT for added memory */
  +int memory_add_physaddr_to_nid(u64 addr)
  +{
  +   int nid;
   +   unsigned long pfn = addr >> PAGE_SHIFT;
   +
   +   for (nid = 0; nid < MAX_NUMNODES; nid++){
   +   if (node_start_pfn[nid] <= pfn &&
   +   pfn < node_end_pfn[nid])
  +   return nid;
  +   }
  +
  +   return 0;
  +}
  +#endif
  
 
 memory_add_physaddr_to_nid is only declared as extern in 
 include/linux/memory_hotplug.h in the CONFIG_NUMA case so this also 
 doesn't appear as the correct fix but probably worked for your compile 
 since you had CONFIG_MEMORY_HOTPLUG enabled.

memory_add_physaddr_to_nid() is used by memory hotplug to 
find the node id for new memory when ACPI's DSDT doesn't define
_PXM for the new memory. So, when CONFIG_MEMORY_HOTPLUG is not set,
this function is not used.

Bye.

-- 
Yasunori Goto 




Re: [Patch](memory hotplug) Fix compile error for i386 with NUMA config

2006-12-11 Thread Yasunori Goto

  No.
  Other arch's arch_add_memory() and remove_memory() have been already
  used for NUMA case too. But i386 didn't do it because just 
  contig_page_data is used. 
  Current NODE_DATA() macro is defined both case appropriately.
  So, this #ifdef is redundant now.
  
 
 Then I assume the comment directly above this change is also redundant 
 since it explicitly states that the following code is for the non-NUMA 
 case.

Ah. Yes indeed.

Here is the fixed patch. Thanks for your comment.

Bye.


---

This patch fixes a compile error when memory hotplug is configured
with NUMA on i386.

The cause of the compile error was missing arch_add_memory(), remove_memory(),
and memory_add_physaddr_to_nid().

This is for 2.6.19, and I tested that it compiles without error.

Please apply.

Signed-off-by: Yasunori Goto [EMAIL PROTECTED]

---
 arch/i386/mm/discontig.c |   17 +
 arch/i386/mm/init.c  |9 +
 2 files changed, 18 insertions(+), 8 deletions(-)

Index: linux-2.6.19/arch/i386/mm/init.c
===
--- linux-2.6.19.orig/arch/i386/mm/init.c   2006-12-09 17:42:06.0 
+0900
+++ linux-2.6.19/arch/i386/mm/init.c2006-12-11 16:58:49.0 +0900
@@ -675,16 +675,10 @@
 #endif
 }
 
-/*
- * this is for the non-NUMA, single node SMP system case.
- * Specifically, in the case of x86, we will always add
- * memory to the highmem for now.
- */
 #ifdef CONFIG_MEMORY_HOTPLUG
-#ifndef CONFIG_NEED_MULTIPLE_NODES
 int arch_add_memory(int nid, u64 start, u64 size)
 {
-   struct pglist_data *pgdata = &contig_page_data;
+   struct pglist_data *pgdata = NODE_DATA(nid);
struct zone *zone = pgdata->node_zones + ZONE_HIGHMEM;
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
@@ -697,7 +691,6 @@
return -EINVAL;
 }
 #endif
-#endif
 
 kmem_cache_t *pgd_cache;
 kmem_cache_t *pmd_cache;
Index: linux-2.6.19/arch/i386/mm/discontig.c
===
--- linux-2.6.19.orig/arch/i386/mm/discontig.c  2006-12-09 17:42:06.0 
+0900
+++ linux-2.6.19/arch/i386/mm/discontig.c   2006-12-09 17:58:32.0 
+0900
@@ -405,3 +405,20 @@
totalram_pages += totalhigh_pages;
 #endif
 }
+
+#ifdef CONFIG_MEMORY_HOTPLUG
+/* This is the case that there is no _PXM on DSDT for added memory */
+int memory_add_physaddr_to_nid(u64 addr)
+{
+   int nid;
+   unsigned long pfn = addr >> PAGE_SHIFT;
+
+   for (nid = 0; nid < MAX_NUMNODES; nid++){
+   if (node_start_pfn[nid] <= pfn &&
+   pfn < node_end_pfn[nid])
+   return nid;
+   }
+
+   return 0;
+}
+#endif

-- 
Yasunori Goto 




Re: memory hotplug function redefinition/confusion

2006-11-17 Thread Yasunori Goto
Hello.

 include/linux/memory_hotplug.h uses CONFIG_NUMA to decide:
(snip)
 but mm/init.c uses CONFIG_ACPI_NUMA to decide:
(snip)
 (sic: duplicate function above)

Indeed. It is strange. This is a patch for it.

Thanks for your report!

Bye.



This is to fix compile error of x86-64 memory hotplug without
any NUMA option.

  CC  arch/x86_64/mm/init.o
arch/x86_64/mm/init.c:501: error: redefinition of 'memory_add_physaddr_to_nid'
include/linux/memory_hotplug.h:71: error: previous definition of
'memory_add_physaddr_to_nid' was here
arch/x86_64/mm/init.c:509: error: redefinition of 'memory_add_physaddr_to_nid'
arch/x86_64/mm/init.c:501: error: previous definition of
'memory_add_physaddr_to_nid' was here
make[1]: *** [arch/x86_64/mm/init.o] Error 1

I confirmed compile completion with !NUMA, (NUMA && !ACPI_NUMA),
or (NUMA && ACPI_NUMA).

This patch is for 2.6.19-rc5-mm2.

Signed-off-by: Yasunori Goto [EMAIL PROTECTED]



 arch/x86_64/mm/init.c |9 +
 1 files changed, 1 insertion(+), 8 deletions(-)

Index: 19-rc5-mm2/arch/x86_64/mm/init.c
===
--- 19-rc5-mm2.orig/arch/x86_64/mm/init.c   2006-11-17 22:31:30.0 
+0900
+++ 19-rc5-mm2/arch/x86_64/mm/init.c2006-11-17 22:31:40.0 +0900
@@ -496,7 +496,7 @@ int remove_memory(u64 start, u64 size)
 }
 EXPORT_SYMBOL_GPL(remove_memory);
 
-#ifndef CONFIG_ACPI_NUMA
+#if !defined(CONFIG_ACPI_NUMA) && defined(CONFIG_NUMA)
 int memory_add_physaddr_to_nid(u64 start)
 {
return 0;
@@ -504,13 +504,6 @@ int memory_add_physaddr_to_nid(u64 start
 EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
 #endif
 
-#ifndef CONFIG_ACPI_NUMA
-int memory_add_physaddr_to_nid(u64 start)
-{
-   return 0;
-}
-#endif
-
 #endif /* CONFIG_MEMORY_HOTPLUG */
 
 #ifdef CONFIG_MEMORY_HOTPLUG_RESERVE

-- 
Yasunori Goto 




Re: [RFC][PATCH v2 7/7] Do not recompute msgmni anymore if explicitely set by user

2008-02-05 Thread Yasunori Goto
Thanks Nadia-san.

I tested this patch set on my box. It works well.
I have only one comment.


 ---
  ipc/ipc_sysctl.c |   43 +--
  1 file changed, 41 insertions(+), 2 deletions(-)
 
 Index: linux-2.6.24/ipc/ipc_sysctl.c
 ===
 --- linux-2.6.24.orig/ipc/ipc_sysctl.c2008-01-29 16:55:04.0 
 +0100
 +++ linux-2.6.24/ipc/ipc_sysctl.c 2008-01-31 13:13:14.0 +0100
 @@ -34,6 +34,24 @@ static int proc_ipc_dointvec(ctl_table *
   return proc_dointvec(&ipc_table, write, filp, buffer, lenp, ppos);
  }
  
 +static int proc_ipc_callback_dointvec(ctl_table *table, int write,
 + struct file *filp, void __user *buffer, size_t *lenp, loff_t *ppos)
 +{
 + size_t lenp_bef = *lenp;
 + int rc;
 +
 + rc = proc_ipc_dointvec(table, write, filp, buffer, lenp, ppos);
 +
  + if (write && !rc && lenp_bef == *lenp)
 + /*
 +  * Tunable has successfully been changed from userland:
 +  * disable its automatic recomputing.
 +  */
  + unregister_ipcns_notifier(current->nsproxy->ipc_ns);
 +
 + return rc;
 +}
 +


Hmmm. I suppose this may be a side effect which the user does not wish.

I would like to recommend a switch which can turn automatic
recomputing on/off.
If the user would like to change this value, the switch should be
turned off first. Otherwise, his request will be rejected with some
message.

Probably, the user can understand that more easily than this side effect.
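
For example, something like this rough, untested sketch (the auto_msgmni
flag and the message text are made up for illustration; proc_ipc_dointvec()
is the existing helper above):

static int proc_ipc_dointvec_msgmni(ctl_table *table, int write,
	struct file *filp, void __user *buffer, size_t *lenp, loff_t *ppos)
{
	struct ipc_namespace *ns = current->nsproxy->ipc_ns;

	/* reject manual writes while automatic recomputing is enabled */
	if (write && ns->auto_msgmni) {
		printk(KERN_INFO "msgmni: turn auto_msgmni off before "
				 "setting msgmni manually\n");
		return -EPERM;
	}

	return proc_ipc_dointvec(table, write, filp, buffer, lenp, ppos);
}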

Bye.

-- 
Yasunori Goto 




[RFC] Document about lowmem_reserve_ratio

2008-01-17 Thread Yasunori Goto
Hello.

I found that the documentation for lowmem_reserve_ratio is not written, and
the lower_zone_protection description still remains. I fixed it.

Some of it may be wrong due to my misunderstanding. And probably, some
sentences are not natural. (I'm not a native English speaker.)

So, please review it.

Thanks.

---

Though lower_zone_protection was changed to lowmem_reserve_ratio,
the document has not been changed.
The lowmem_reserve_ratio seems quite hard to estimate, but there is
no guidance. This patch updates the document for it.

Signed-off-by: Yasunori Goto [EMAIL PROTECTED]

---
 Documentation/filesystems/proc.txt |   76 +
 1 file changed, 61 insertions(+), 15 deletions(-)

Index: current/Documentation/filesystems/proc.txt
===
--- current.orig/Documentation/filesystems/proc.txt 2008-01-17 
20:01:37.0 +0900
+++ current/Documentation/filesystems/proc.txt  2008-01-18 12:22:10.0 
+0900
@@ -1311,7 +1311,7 @@
 If non-zero, this sysctl disables the new 32-bit mmap mmap layout - the kernel
 will use the legacy (2.4) layout for all processes.
 
-lower_zone_protection
+lowmem_reserve_ratio
 -
 
 For some specialised workloads on highmem machines it is dangerous for
@@ -1331,25 +1331,71 @@
 mechanism will also defend that region from allocations which could use
 highmem or lowmem).
 
-The `lower_zone_protection' tunable determines how aggressive the kernel is
-in defending these lower zones.  The default value is zero - no
-protection at all.
+The `lowmem_reserve_ratio' tunable determines how aggressive the kernel is
+in defending these lower zones.
 
 If you have a machine which uses highmem or ISA DMA and your
 applications are using mlock(), or if you are running with no swap then
-you probably should increase the lower_zone_protection setting.
+you probably should change the lowmem_reserve_ratio setting.
 
-The units of this tunable are fairly vague.  It is approximately equal
-to megabytes, so setting lower_zone_protection=100 will protect around 100
-megabytes of the lowmem zone from user allocations.  It will also make
-those 100 megabytes unavailable for use by applications and by
-pagecache, so there is a cost.
-
-The effects of this tunable may be observed by monitoring
-/proc/meminfo:LowFree.  Write a single huge file and observe the point
-at which LowFree ceases to fall.
+The lowmem_reserve_ratio is an array. You can see them by reading this file.
+-
+% cat /proc/sys/vm/lowmem_reserve_ratio
+256 256 32
+-
+Note: # of elements is one fewer than the number of zones, because the
+  highest zone's value is not needed for the following calculation.
+
+But these values are not used directly. The kernel calculates # of protection
+pages for each zone from them. These are shown as an array of protection pages
+in /proc/zoneinfo as follows. (This is an example from an x86-64 box.)
+Each zone has an array of protection pages like this.
+
+-
+Node 0, zone  DMA
+  pages free 1355
+min  3
+low  3
+high 4
+   :
+   :
+numa_other   0
+protection: (0, 2004, 2004, 2004)
+   ^
+  pagesets
+cpu: 0 pcp: 0
+:
+-
+These protections are added to the score used to judge whether this zone
+should be used for page allocation or should be reclaimed.
+
+In this example, if normal pages (index=2) are required of this DMA zone and
+pages_high is used as the watermark, the kernel judges that this zone should
+not be used because pages_free (1355) is smaller than watermark + protection[2]
+(4 + 2004 = 2008). If this protection value were 0, this zone would be used for
+a normal page requirement. If the requirement is for the DMA zone (index=0),
+protection[0] (=0) is used.
+
+zone[i]'s protection[j] is calculated by the following expression.
+
+(i < j):
+  zone[i]->protection[j]
+  = (total sums of present_pages from zone[i+1] to zone[j] on the node)
+/ lowmem_reserve_ratio[i];
+(i = j):
+   (should not be protected. = 0;)
+(i > j):
+   (not necessary, but looks 0)
+
+The default values of lowmem_reserve_ratio[i] are
+256 (if zone[i] means the DMA or DMA32 zone)
+32  (others).
+As in the above expression, they are the reciprocals of the ratio.
+256 means 1/256, so # of protection pages becomes about 0.39% of the total
+present pages of the higher zones on the node.
 
-A reasonable value for lower_zone_protection is 100.
+If you would like to protect more pages, smaller values are effective.
+The minimum value is 1 (1/1 -> 100%).
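+
+As a worked example (hypothetical sizes, for illustration): if the zones
+above DMA hold about 513,000 present pages in total, DMA's protection
+entry for normal-page allocations is 513000 / 256 = ~2004 pages, which
+matches the /proc/zoneinfo sample above.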
 
 page-cluster
 

-- 
Yasunori Goto 




Re: [RFC] Document about lowmem_reserve_ratio

2008-01-17 Thread Yasunori Goto
Oops. I sent to Andrea's old mail address.
Sorry for repost.


---

Hello.

I found the documentation about lowmem_reserve_ratio is not written, and
the lower_zone_protection's description remains yet. I fixed it.

I may be something wrong due to misunderstanding. And probably, sentence
is not natural. (I'm not native English speaker.)

So, please review it.

Thanks.

---

Though the lower_zone_protection was changed to lowmem_reserve_ratio,
the document has been not changed.
The lowmem_reserve_ratio seems quite hard to estimate, but there is
no guidance. This patch is to change document for it.

Signed-off-by: Yasunori Goto [EMAIL PROTECTED]

---
 Documentation/filesystems/proc.txt |   76 +
 1 file changed, 61 insertions(+), 15 deletions(-)

Index: current/Documentation/filesystems/proc.txt
===
--- current.orig/Documentation/filesystems/proc.txt 2008-01-17 
20:01:37.0 +0900
+++ current/Documentation/filesystems/proc.txt  2008-01-18 12:22:10.0 
+0900
@@ -1311,7 +1311,7 @@
 If non-zero, this sysctl disables the new 32-bit mmap mmap layout - the kernel
 will use the legacy (2.4) layout for all processes.
 
-lower_zone_protection
+lowmem_reserve_ratio
 -
 
 For some specialised workloads on highmem machines it is dangerous for
@@ -1331,25 +1331,71 @@
 mechanism will also defend that region from allocations which could use
 highmem or lowmem).
 
-The `lower_zone_protection' tunable determines how aggressive the kernel is
-in defending these lower zones.  The default value is zero - no
-protection at all.
+The `lowmem_reserve_ratio' tunable determines how aggressive the kernel is
+in defending these lower zones.
 
 If you have a machine which uses highmem or ISA DMA and your
 applications are using mlock(), or if you are running with no swap then
-you probably should increase the lower_zone_protection setting.
+you probably should change the lowmem_reserve_ratio setting.
 
-The units of this tunable are fairly vague.  It is approximately equal
-to megabytes, so setting lower_zone_protection=100 will protect around 100
-megabytes of the lowmem zone from user allocations.  It will also make
-those 100 megabytes unavailable for use by applications and by
-pagecache, so there is a cost.
-
-The effects of this tunable may be observed by monitoring
-/proc/meminfo:LowFree.  Write a single huge file and observe the point
-at which LowFree ceases to fall.
+The lowmem_reserve_ratio is an array. You can see them by reading this file.
+-
+% cat /proc/sys/vm/lowmem_reserve_ratio
+256 256 32
+-
+Note: # of this elements is one fewer than number of zones. Because the highest
+  zone's value is not necessary for following calculation.
+
+But, these values are not used directly. The kernel calculates # of protection
+pages for each zones from them. These are shown as array of protection pages
+in /proc/zoneinfo like followings. (This is an example of x86-64 box).
+Each zone has an array of protection pages like this.
+
+-
+Node 0, zone  DMA
+  pages free 1355
+min  3
+low  3
+high 4
+   :
+   :
+numa_other   0
+protection: (0, 2004, 2004, 2004)
+   ^
+  pagesets
+cpu: 0 pcp: 0
+:
+-
+These protections are added to score to judge whether this zone should be used
+for page allocation or should be reclaimed.
+
+In this example, if normal pages (index=2) are required to this DMA zone and
+pages_high is used for watermark, the kernel judges this zone should not be
+used because pages_free(1355) is smaller than watermark + protection[2]
+(4 + 2004 = 2008). If this protection value is 0, this zone would be used for
+normal page requirement. If requirement is DMA zone(index=0), protection[0]
+(=0) is used.
+
+zone[i]'s protection[j] is calculated by following exprssion.
+
+(i  j):
+  zone[i]-protection[j]
+  = (total sums of present_pages from zone[i+1] to zone[j] on the node)
+/ lowmem_reserve_ratio[i];
+(i = j):
+   (should not be protected. = 0;
+(i  j):
+   (not necessary, but looks 0)
+
+The default values of lowmem_reserve_ratio[i] are
+256 (if zone[i] means DMA or DMA32 zone)
+32  (others).
+As above expression, they are reciprocal number of ratio.
+256 means 1/256. # of protection pages becomes about 0.39% of total present
+pages of higher zones on the node.
 
-A reasonable value for lower_zone_protection is 100.
+If you would like to protect more pages, smaller values are effective.
+The minimum value is 1 (1/1 - 100%).
 
 page-cluster
 

-- 
Yasunori Goto 




Re: [RFC][PATCH 1/2]: MM: Make Paget Tables Relocatable--Conditional TLB Flush

2008-01-23 Thread Yasunori Goto

Hello.

This is a nitpick, but all of the architectures' code except generic uses
MMF_NNED_FLUSH at clear_bit()...
 ^
Please fix the misspelling.

Bye.

 
 diff -uprwNbB -X 2.6.23/Documentation/dontdiff 2.6.23/arch/alpha/kernel/smp.c 
 2.6.23a/arch/alpha/kernel/smp.c
 --- 2.6.23/arch/alpha/kernel/smp.c2007-10-09 13:31:38.0 -0700
 +++ 2.6.23a/arch/alpha/kernel/smp.c   2007-10-29 13:50:06.0 -0700
 @@ -850,6 +850,8 @@ flush_tlb_mm(struct mm_struct *mm)
  {
   preempt_disable();
  
  + clear_bit(MMF_NNED_FLUSH, &mm->flags);
 +
   if (mm == current->active_mm) {
   flush_tlb_current(mm);
   if (atomic_read(&mm->mm_users) <= 1) {


-- 
Yasunori Goto 




[PATCH] Add IORESOUCE_BUSY flag for System RAM take 2.

2007-11-05 Thread Yasunori Goto

Hello.

I merged Badari-san's patch and mine. This and Kame-san's
following patch are necessary for x86-64 memory unplug.

http://marc.info/?l=linux-mm&m=119399026017901&w=2

I heard Kame-san's patch is already included in -mm.
So, I'll repost the merged patch now.

This patch is tested on 2.6.23-mm1.

Please apply.

---

i386 and x86-64 register System RAM as IORESOURCE_MEM | IORESOURCE_BUSY.

But ia64 registers it as IORESOURCE_MEM only.
In addition, the memory hotplug code registers new memory as IORESOURCE_MEM too.

This difference causes a failure of memory unplug on x86-64.
This patch fixes it.

This patch adds IORESOURCE_BUSY to avoid potential overlap mapping
by a PCI device.
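
The reason the flags must match everywhere: the unplug path walks
resources with find_next_system_ram(), which compares flags exactly.
Roughly, from kernel/resource.c of this era (paraphrased from memory):

	for (p = iomem_resource.child; p; p = p->sibling) {
		/* system ram is just marked as IORESOURCE_MEM */
		if (p->flags != res->flags)
			continue;	/* any flag mismatch is skipped */
		...
	}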


Signed-off-by: Yasunori Goto [EMAIL PROTECTED]
Signed-off-by: Badari Pulavarty [EMAIL PROTECTED]

---
 arch/ia64/kernel/efi.c |6 ++
 kernel/resource.c  |2 +-
 mm/memory_hotplug.c|2 +-
 3 files changed, 4 insertions(+), 6 deletions(-)

Index: current/arch/ia64/kernel/efi.c
===
--- current.orig/arch/ia64/kernel/efi.c 2007-11-02 17:17:30.0 +0900
+++ current/arch/ia64/kernel/efi.c  2007-11-02 17:19:10.0 +0900
@@ -,7 +,7 @@ efi_initialize_iomem_resources(struct re
if (md->num_pages == 0) /* should not happen */
continue;
 
-   flags = IORESOURCE_MEM;
+   flags = IORESOURCE_MEM | IORESOURCE_BUSY;
switch (md->type) {
 
case EFI_MEMORY_MAPPED_IO:
@@ -1133,12 +1133,11 @@ efi_initialize_iomem_resources(struct re
 
case EFI_ACPI_MEMORY_NVS:
name = ACPI Non-volatile Storage;
-   flags |= IORESOURCE_BUSY;
break;
 
case EFI_UNUSABLE_MEMORY:
name = reserved;
-   flags |= IORESOURCE_BUSY | IORESOURCE_DISABLED;
+   flags |= IORESOURCE_DISABLED;
break;
 
case EFI_RESERVED_TYPE:
@@ -1147,7 +1146,6 @@ efi_initialize_iomem_resources(struct re
case EFI_ACPI_RECLAIM_MEMORY:
default:
name = reserved;
-   flags |= IORESOURCE_BUSY;
break;
}
 
Index: current/mm/memory_hotplug.c
===
--- current.orig/mm/memory_hotplug.c2007-11-02 17:19:09.0 +0900
+++ current/mm/memory_hotplug.c 2007-11-02 17:19:10.0 +0900
@@ -39,7 +39,7 @@ static struct resource *register_memory_
res->name = "System RAM";
res->start = start;
res->end = start + size - 1;
-   res->flags = IORESOURCE_MEM;
+   res->flags = IORESOURCE_MEM | IORESOURCE_BUSY;
if (request_resource(&iomem_resource, res) < 0) {
printk("System RAM resource %llx - %llx cannot be added\n",
(unsigned long long)res->start, (unsigned long long)res->end);
Index: current/kernel/resource.c
===
--- current.orig/kernel/resource.c  2007-11-02 17:19:15.0 +0900
+++ current/kernel/resource.c   2007-11-02 17:22:39.0 +0900
@@ -287,7 +287,7 @@ walk_memory_resource(unsigned long start
int ret = -1;
res.start = (u64) start_pfn << PAGE_SHIFT;
res.end = ((u64)(start_pfn + nr_pages) << PAGE_SHIFT) - 1;
-   res.flags = IORESOURCE_MEM;
+   res.flags = IORESOURCE_MEM | IORESOURCE_BUSY;
orig_end = res.end;
while ((res.start < res.end) && (find_next_system_ram(&res) >= 0)) {
pfn = (unsigned long)(res.start >> PAGE_SHIFT);

-- 
Yasunori Goto 




Re: mm: memory/cpu hotplug section mismatch.

2007-06-11 Thread Yasunori Goto
  
  If CONFIG_MEMORY_HOTPLUG=n __meminit == __init, and if
  CONFIG_HOTPLUG_CPU=n __cpuinit == __init. However, with one set and the
  other disabled, you end up with a reference between __init and a regular
  non-init function.
 
 My plan is to define dedicated sections for both __devinit and __meminit.
 Then we can apply the checks no matter the definition of CONFIG_HOTPLUG*

I prefer defining __nodeinit for the __cpuinit and __meminit case over
__devinit.  __devinit is used by many devices like I/O, and it is
useful for many desktop users. But a cpu/memory hotpluggable box
is very rare, and this code should stay in the init section for most people.

This kind of issue is caused by initialization of pgdat/zone.
I think __nodeinit is enough and desirable.
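
A rough sketch of what I mean (hypothetical; the names and their
placement in include/linux/init.h are only illustrative):

/* discard only when neither cpu nor memory hotplug can add a node */
#if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_MEMORY_HOTPLUG)
#define __nodeinit
#define __nodeinitdata
#else
#define __nodeinit	__init
#define __nodeinitdata	__initdata
#endif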

Bye.

-- 
Yasunori Goto 




Re: mm: Fix memory/cpu hotplug section mismatch and oops.

2007-06-14 Thread Yasunori Goto

Thanks. I tested compile with cpu/memory hotplug off/on.
It was OK.

Acked-by: Yasunori Goto [EMAIL PROTECTED]
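
For context, the colliding annotations are defined roughly like this
(paraphrased from include/linux/init.h of this era):

#ifdef CONFIG_HOTPLUG_CPU
#define __cpuinit
#else
#define __cpuinit	__init
#endif

#ifdef CONFIG_MEMORY_HOTPLUG
#define __meminit
#else
#define __meminit	__init
#endif

So with CONFIG_MEMORY_HOTPLUG=y and CONFIG_HOTPLUG_CPU=n, a kept
__meminit function ends up calling a discarded __init one.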


 (This is a resend of the earlier patch, this issue still needs to be
 fixed.)
 
 When building with memory hotplug enabled and cpu hotplug disabled, we
 end up with the following section mismatch:
 
 WARNING: mm/built-in.o(.text+0x4e58): Section mismatch: reference to
 .init.text: (between 'free_area_init_node' and '__build_all_zonelists')
 
 This happens as a result of:
 
  - free_area_init_node()
    - free_area_init_core()
      - zone_pcp_init() <-- all __meminit up to this point
        - zone_batchsize() <-- marked as __cpuinit
 
 This happens because CONFIG_HOTPLUG_CPU=n sets __cpuinit to __init, but
 CONFIG_MEMORY_HOTPLUG=y unsets __meminit.
 
 Changing zone_batchsize() to __devinit fixes this.
 
 __devinit is the only thing that is common between CONFIG_HOTPLUG_CPU=y and
 CONFIG_MEMORY_HOTPLUG=y. In the long run, perhaps this should be moved to
 another section identifier completely. Without this, memory hot-add
 of offline nodes (via hotadd_new_pgdat()) will oops if CPU hotplug is
 not also enabled.
 
 Signed-off-by: Paul Mundt [EMAIL PROTECTED]
 
 --
 
  mm/page_alloc.c |2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)
 
 diff --git a/mm/page_alloc.c b/mm/page_alloc.c
 index bd8e335..05ace44 100644
 --- a/mm/page_alloc.c
 +++ b/mm/page_alloc.c
 @@ -1968,7 +1968,7 @@ void zone_init_free_lists(struct pglist_data *pgdat, 
 struct zone *zone,
   memmap_init_zone((size), (nid), (zone), (start_pfn), MEMMAP_EARLY)
  #endif
  
 -static int __cpuinit zone_batchsize(struct zone *zone)
 +static int __devinit zone_batchsize(struct zone *zone)
  {
   int batch;
  

-- 
Yasunori Goto 




Re: [PATCH] mm: More __meminit annotations.

2007-06-17 Thread Yasunori Goto
Thanks for your checking.

 -void zone_init_free_lists(struct pglist_data *pgdat, struct zone *zone,
 - unsigned long size)
 +static void __meminit zone_init_free_lists(struct pglist_data *pgdat,
 + struct zone *zone, unsigned long size)
  {
   int order;
   for (order = 0; order  MAX_ORDER ; order++) {
 @@ -2431,7 +2431,7 @@ void __meminit get_pfn_range_for_nid(unsigned int nid,
   * Return the number of pages a zone spans in a node, including holes
   * present_pages = zone_spanned_pages_in_node() - zone_absent_pages_in_node()
   */
 -unsigned long __meminit zone_spanned_pages_in_node(int nid,
 +static unsigned long __meminit zone_spanned_pages_in_node(int nid,
   unsigned long zone_type,
   unsigned long *ignored)
  {
 @@ -2519,7 +2519,7 @@ unsigned long __init absent_pages_in_range(unsigned 
 long start_pfn,
  }
  
  /* Return the number of page frames in holes in a zone on a node */
 -unsigned long __meminit zone_absent_pages_in_node(int nid,
 +static unsigned long __meminit zone_absent_pages_in_node(int nid,
   unsigned long zone_type,
   unsigned long *ignored)
  {

Ah, Yes. Thanks. It is better.

 @@ -2536,14 +2536,14 @@ unsigned long __meminit zone_absent_pages_in_node(int 
 nid,
  }
  
  #else
 -static inline unsigned long zone_spanned_pages_in_node(int nid,
 +static inline unsigned long __meminit zone_spanned_pages_in_node(int nid,
   unsigned long zone_type,
   unsigned long *zones_size)
  {
   return zones_size[zone_type];
  }
  
 -static inline unsigned long zone_absent_pages_in_node(int nid,
 +static inline unsigned long __meminit zone_absent_pages_in_node(int nid,
   unsigned long zone_type,
   unsigned long *zholes_size)
  {

I thought __meminit is not effective for these static functions,
because they are inlined functions. So, it depends on the caller's
definition. Is it wrong?

Bye.

-- 
Yasunori Goto 




Re: [PATCH] mm: More __meminit annotations.

2007-06-18 Thread Yasunori Goto
 On Mon, Jun 18, 2007 at 02:49:24PM +0900, Yasunori Goto wrote:
   -static inline unsigned long zone_absent_pages_in_node(int nid,
   +static inline unsigned long __meminit zone_absent_pages_in_node(int nid,
 unsigned long zone_type,
 unsigned long *zholes_size)
{
  
  I thought __meminit is not effective for these static functions,
  because they are inlined function. So, it depends on caller's 
  defenition. Is it wrong? 
  
 Ah, that's possible, I hadn't considered that. It seems to be a bit more
 obvious what the intention is if it's annotated, especially as this is
 the convention that's used by the rest of mm/page_alloc.c. A bit more
 consistent, if nothing more.

I'm not sure which is intended. I found some functions that define both
__init and inline in the kernel tree, and probably some functions
don't. So, it seems there is no convention.

I'm okay if you prefer both defined. :-)


-- 
Yasunori Goto 




Re: [PATCH 5/7] Introduce a means of compacting memory within a zone

2007-06-19 Thread Yasunori Goto
Hi Mel-san.
This is very interesting feature.

Now, I'm testing your patches.

 +static int isolate_migratepages(struct zone *zone,
 + struct compact_control *cc)
 +{
 + unsigned long high_pfn, low_pfn, end_pfn, start_pfn;

(snip)

 + /* Time to isolate some pages for migration */
  + spin_lock_irq(&zone->lru_lock);
  + for (; low_pfn < end_pfn; low_pfn++) {
 + if (!pfn_valid_within(low_pfn))
 + continue;
 +
 + /* Get the page and skip if free */
 + page = pfn_to_page(low_pfn);

I hit a panic here on my tiger4.

I compiled with CONFIG_SPARSEMEM, so CONFIG_HOLES_IN_ZONE is not set.
pfn_valid_within() returns 1 every time in this configuration.
(This config is only for virtual memmap.)
But my tiger4 box has memory holes in the normal zone.

When it is changed to a normal pfn_valid(), no panic occurs.
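
For reference, pfn_valid_within() is roughly this (paraphrased from
include/linux/mmzone.h):

#ifdef CONFIG_HOLES_IN_ZONE
#define pfn_valid_within(pfn)	pfn_valid(pfn)
#else
#define pfn_valid_within(pfn)	(1)
#endif

So without CONFIG_HOLES_IN_ZONE, every pfn in the scanned range is
treated as valid, and pfn_to_page() on a hole blows up.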

Hmmm.

Bye.
-- 
Yasunori Goto 




Re: [PATCH] sparsemem: Shut up unused symbol compiler warnings.

2007-05-31 Thread Yasunori Goto

I think this issue is fixed by
move-three-functions-that-are-only-needed-for.patch in current -mm tree.
Is it not enough?

Thanks.

 __kmalloc_section_memmap()/__kfree_section_memmap() and friends are only
 used by the memory hotplug code. Move these in to the existing
 CONFIG_MEMORY_HOTPLUG block.
 
 Signed-off-by: Paul Mundt [EMAIL PROTECTED]
 
 --
 
  mm/sparse.c |   42 +-
  1 file changed, 21 insertions(+), 21 deletions(-)
 
 diff --git a/mm/sparse.c b/mm/sparse.c
 index 1302f83..35f739a 100644
 --- a/mm/sparse.c
 +++ b/mm/sparse.c
 @@ -229,6 +229,7 @@ static struct page __init 
 *sparse_early_mem_map_alloc(unsigned long pnum)
   return NULL;
  }
  
 +#ifdef CONFIG_MEMORY_HOTPLUG
  static struct page *__kmalloc_section_memmap(unsigned long nr_pages)
  {
   struct page *page, *ret;
 @@ -269,27 +270,6 @@ static void __kfree_section_memmap(struct page *memmap, 
 unsigned long nr_pages)
  }
  
  /*
 - * Allocate the accumulated non-linear sections, allocate a mem_map
 - * for each and record the physical to section mapping.
 - */
 -void __init sparse_init(void)
 -{
 - unsigned long pnum;
 - struct page *map;
 -
  - for (pnum = 0; pnum < NR_MEM_SECTIONS; pnum++) {
 - if (!valid_section_nr(pnum))
 - continue;
 -
 - map = sparse_early_mem_map_alloc(pnum);
 - if (!map)
 - continue;
 - sparse_init_one_section(__nr_to_section(pnum), pnum, map);
 - }
 -}
 -
 -#ifdef CONFIG_MEMORY_HOTPLUG
 -/*
   * returns the number of sections whose mem_maps were properly
   * set.  If this is <=0, then that means that the passed-in
   * map was not consumed and must be freed.
 @@ -329,3 +309,23 @@ out:
   return ret;
  }
  #endif
 +
 +/*
 + * Allocate the accumulated non-linear sections, allocate a mem_map
 + * for each and record the physical to section mapping.
 + */
 +void __init sparse_init(void)
 +{
 + unsigned long pnum;
 + struct page *map;
 +
  + for (pnum = 0; pnum < NR_MEM_SECTIONS; pnum++) {
 + if (!valid_section_nr(pnum))
 + continue;
 +
 + map = sparse_early_mem_map_alloc(pnum);
 + if (!map)
 + continue;
 + sparse_init_one_section(__nr_to_section(pnum), pnum, map);
 + }
 +}
 

-- 
Yasunori Goto 




Re: [PATCH] sparsemem: Shut up unused symbol compiler warnings.

2007-05-31 Thread Yasunori Goto
 On Fri, Jun 01, 2007 at 02:26:17PM +0900, Yasunori Goto wrote:
  I think this issue is fixed by
  move-three-functions-that-are-only-needed-for.patch in current -mm tree.
  Is it not enough?
  
 That's possible, I hadn't checked -mm. This was simply against current
 git. If there's already a fix in -mm, then this can simply be ignored.

Okay. Thanks for your report.

-- 
Yasunori Goto 




Re: x86_64 memory hotplug simulation support?

2007-07-05 Thread Yasunori Goto
Hello. Nigel-san.

 I'm wondering whether anyone has patches lying around that might be useful 
 for 
 simulating memory hotplug on x86_64. Goggling has revealed some old x86 
 patches, but that's all.

I'm not sure what simulation means here.
Could you tell me exactly how/what you expect of memory hotplug
simulation?

Memory hot-add code is included in the kernel, and remove (unplug) code
has been developed (and hopefully it will be merged to -mm after
some cleanups, I think).

I would like to make sure what is necessary.

Thanks.

-- 
Yasunori Goto 




Re: x86_64 memory hotplug simulation support?

2007-07-05 Thread Yasunori Goto
 Thanks for your reply. Please, just call me Nigel :).

Haha. Okay. Nigel.
(Though, -san is fine even in friendly/frank situations in Japanese :) )

 I saw a patch that Dave 
 Hansen had posted, back around the time of 2.6.11 iirc. It was for x86, and 
 (so far as I understand) allowed a person who doesn't really have 
 hotpluggable memory to make their computer pretend that it does.

 Just in case I'm not being clear enough, let me get more concrete. I have had 
 some code for a while that uses bitmaps to simulate page flags, without 
 needing to take up those precious bits in page-flags. I've begun to add 
 support for memory hotplugging, in the hope that I can make it general enough 
 that it will be useful for more than just suspend2. To do that, I'd like to 
 be able to test the memory hotplugging paths, without needing to actually 
 have hotpluggable memory. I do have an x86 desktop I could work on, but would 
 prefer to do it on my x86_64 laptop if I can.

Current memory hot-add code expects special hardware which allows
physical memory hot-unplug. Yes, there is no way to use it on a
normal PC without emulation. And I don't have emulation code for
x86-64. Usually, I'm using an ia64 box to test it.

These are 2 ideas for using memory hotplug on a normal x86-64 box.

- Make emulation code for x86-64.
  To add memory, some memory has to be ignored at boot time,
  and added back after boot up.
  This way may need fake BIOS information. And once memory is added,
  a reboot is necessary for the next hot-add test.

- Boot up normally. Unplug some memory first, then hot-add it
  later.  You can try the hot-plug code many times after bootup.

  The unplug code is not merged yet. The following are the newest ones.
  http://marc.info/?l=linux-mm&m=118180415304117&w=2
  But the 6th patch of them is only for ia64.
  http://marc.info/?l=linux-mm&m=118180483715610&w=2
  So, a patch with the same role for x86-64 is still necessary.

  In addition, some routes of the hot-add code can't be tested,
  because current hot-add has 2 phases.
1. Physical hot-add.
 - Accept notification from firmware.
 - Make sysfs files for new memory.
 - Register SPARSEMEM and allocate memmap/pgdat/zone.
2. Logical online.
 - Free each page of new memory so they can be used.
 - Rebuild zonelists.

  But the unplug code just does logical offline. Physical hot-unplug
  is necessary to test phase 1.


Hmm. I don't know what is necessary for suspend2.
But some work still looks necessary for either way.

Thanks.

-- 
Yasunori Goto 




[PATCH](memory hotplug) Fix unnecessary calling of init_currently_empty_zone()

2007-05-29 Thread Yasunori Goto
Hello.

This patch fixes unnecessary calling of init_currently_empty_zone().

zone->present_pages is updated in online_pages(). But
__add_zone() can be called twice or more before online_pages() is called.
So, init_currently_empty_zone() can be called an unnecessary number of times.
This is the cause of a memory leak of the zone's wait_table.
This patch is tested on my ia64 box with 2.6.22-rc2-mm1.

Signed-off-by: Yasunori Goto [EMAIL PROTECTED]

 mm/memory_hotplug.c |2 +-
 1 files changed, 1 insertion(+), 1 deletion(-)

Index: vmemmap/mm/memory_hotplug.c
===
--- vmemmap.orig/mm/memory_hotplug.c2007-05-29 15:30:28.0 +0900
+++ vmemmap/mm/memory_hotplug.c 2007-05-29 17:31:43.0 +0900
@@ -65,7 +65,7 @@ static int __add_zone(struct zone *zone,
int zone_type;
 
zone_type = zone - pgdat->node_zones;
-   if (!populated_zone(zone)) {
+   if (!zone->wait_table) {
int ret = 0;
ret = init_currently_empty_zone(zone, phys_start_pfn,
nr_pages, MEMMAP_HOTPLUG);

-- 
Yasunori Goto 




[Patch] Fix unnecessary meminit

2007-05-08 Thread Yasunori Goto
  It doesn't make a lot of sense to export an __init symbol to modules.  I
  guess it's OK in this case, but we get warnings:
 
 It seems wrong to me to first tell linker to discard the code after init and
 next to export the symbol to make it available for any module anytime.
 
 Both function are relatively small so better avoid playing games and
 drop the __meminit tag.

Ok. This is the patch.


---
This fixes unnecessary __meminit definitions.
These functions are exported to kernel modules.

I compiled on ia64/x86-64 with memory hotplug on/off.

Signed-off-by: Yasunori Goto [EMAIL PROTECTED]

 drivers/acpi/numa.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

Index: linux-2.6.21-mm1/drivers/acpi/numa.c
===
--- linux-2.6.21-mm1.orig/drivers/acpi/numa.c   2007-05-08 19:33:05.0 
+0900
+++ linux-2.6.21-mm1/drivers/acpi/numa.c2007-05-08 19:33:12.0 
+0900
@@ -228,7 +228,7 @@ int __init acpi_numa_init(void)
return 0;
 }
 
-int __meminit acpi_get_pxm(acpi_handle h)
+int acpi_get_pxm(acpi_handle h)
 {
unsigned long pxm;
acpi_status status;
@@ -246,7 +246,7 @@ int __meminit acpi_get_pxm(acpi_handle h
 }
 EXPORT_SYMBOL(acpi_get_pxm);
 
-int __meminit acpi_get_node(acpi_handle *handle)
+int acpi_get_node(acpi_handle *handle)
 {
int pxm, node = -1;
 

-- 
Yasunori Goto 




[RFC] memory hotremove patch take 2 [00/10]

2007-05-08 Thread Yasunori Goto

Hello.

I rebased and debugged Kame-san's memory hot-remove patches.
This work is not finished yet. (Some pages remain un-removable.)
But I would like to show its current progress, because it has
been a long time since the previous post, and some bugs are fixed.

If you are interested, please check this. Any comments are welcome.

Thanks.

---

These patches are for memory hot-remove.

How to use
  - kernelcore=xx[GMK] must be specified as a boot-time option to create
the ZONE_MOVABLE area.
  - After bootup, execute the following:
 # echo offline > /sys/devices/system/memory/memoryX/status



Change log from the previous version.
  - Rebase to 2.6.21-mm1.
  - Old original ZONE_REMOVABLE code is removed. Mel-san's ZONE_MOVABLE
for anti-fragmentation is used.
  - Fix wrong return code check of isolate_lru_page().
  - Pages are isolated ASAP when they are the source of page migration
for memory hot-remove. The old code just used put_page(),
and we expected the migration source page to be caught in
__free_one_page() as an isolated page. But it is spooled in
per_cpu_page and soon reused as the next destination page of migration.
This was the cause of an eternal loop in offline_pages().
  - There is a page which is not mapped but added to the swapcache in
the swap-in code. It was the cause of a panic in try_to_unmap(). Fixed it.
  - end_pfn is rounded up at memmap_init. If there is a small hole at
the end of a section, those pages were not initialized.

TODO:
  - There are some pages which are un-removable under memory stress
conditions. (These pages have PG_swapcache or PG_mappedtodisk set
without being connected to the lru.)
  - Should make i386/x86-64/powerpc interface code. But not yet
(really sorry :-( ).
  - If the bootmem parameter or efi's memory map is stored there, memory
can't be removed even if it is in the removable zone.
  - Node hotplug support. (This may need some amount of patches.)
  - Test under heavy workload and more careful race checks.
  - Fix where we should allocate the migration target page from.
  - And so on.

[1] counters patch -- per-zone counter for ZONE_MOVABLE

==page isolation==
[2] page isolation patch . basic definitions of page isolation.
[3] drain_all_zone_pages patch . drain all cpus' pcp pages.
[4] isolate freed page patch . isolate pages in free_area[]

==memory unplug==
offline a section of pages: isolate the specified section and migrate
the content of used pages out of the section. (Because free pages in the
section are isolated, they are never returned by alloc_pages().)
This patch doesn't care where we should allocate new pages for migration from.
[5] memory unplug core patch --- maybe need more work.
[6] interface patch  --- offline interface support 

==migration nocontext==
Fix a race condition of page migration without process context
(not taking mm->sem). This patch delays kmem_cache_free() of
anon_vma until migration ends.
[7] migration nocontext patch --- support page migration without
acquiring mm->sem. Needs careful debugging...

==other fixes==
[8] round up end_pfn at memmap_init
[9] page isolation ASAP when memory-hotremove case.
[10] fix swapping-in page panic.

-- 
Yasunori Goto 




[RFC] memory hotremove patch take 2 [02/10] (make page unused)

2007-05-08 Thread Yasunori Goto
This patch is for supporting making pages unused.

Isolate pages by capturing freed pages before they are inserted into
free_area[] of the buddy allocator.
If you have an idea for avoiding the spin_lock(), please advise me.

Isolating pages already in free_area[] is implemented in another patch.
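
(The page_under_isolation() test used in __free_one_page() below lives
in the new page_isolation.h header, which is not fully shown here. My
guess at its shape, for reviewers - a cheap inline check that only
takes the lock when some isolation is registered:)

static inline int
page_under_isolation(struct zone *zone, struct page *page, int order)
{
	if (list_empty(&zone->isolation_list))
		return 0;
	return __page_under_isolation(zone, page, order);
}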

Signed-Off-By: KAMEZAWA Hiroyuki [EMAIL PROTECTED]
Signed-off-by: Yasunori Goto [EMAIL PROTECTED]


 include/linux/mmzone.h |8 +
 include/linux/page_isolation.h |   52 +++
 mm/Kconfig |7 +
 mm/page_alloc.c|  187 +
 4 files changed, 254 insertions(+)

Index: current_test/include/linux/mmzone.h
===
--- current_test.orig/include/linux/mmzone.h2007-05-08 15:06:49.0 
+0900
+++ current_test/include/linux/mmzone.h 2007-05-08 15:08:03.0 +0900
@@ -314,6 +314,14 @@ struct zone {
/* zone_start_pfn == zone_start_paddr >> PAGE_SHIFT */
unsigned long   zone_start_pfn;
 
+#ifdef CONFIG_PAGE_ISOLATION
+   /*
+*  For pages which are not used but not free.
+*  See include/linux/page_isolation.h
+*/
+   spinlock_t  isolation_lock;
+   struct list_headisolation_list;
+#endif
/*
 * zone_start_pfn, spanned_pages and present_pages are all
 * protected by span_seqlock.  It is a seqlock because it has
Index: current_test/mm/page_alloc.c
===
--- current_test.orig/mm/page_alloc.c   2007-05-08 15:07:20.0 +0900
+++ current_test/mm/page_alloc.c2007-05-08 15:08:34.0 +0900
@@ -41,6 +41,7 @@
 #include linux/pfn.h
 #include linux/backing-dev.h
 #include linux/fault-inject.h
+#include linux/page_isolation.h
 
 #include asm/tlbflush.h
 #include asm/div64.h
@@ -448,6 +449,9 @@ static inline void __free_one_page(struc
if (unlikely(PageCompound(page)))
destroy_compound_page(page, order);
 
+   if (page_under_isolation(zone, page, order))
+   return;
+
page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1);
 
VM_BUG_ON(page_idx & (order_size - 1));
@@ -3259,6 +3263,10 @@ static void __meminit free_area_init_cor
zone->nr_scan_inactive = 0;
zap_zone_vm_stats(zone);
atomic_set(&zone->reclaim_in_progress, 0);
+#ifdef CONFIG_PAGE_ISOLATION
+   spin_lock_init(&zone->isolation_lock);
+   INIT_LIST_HEAD(&zone->isolation_list);
+#endif
if (!size)
continue;
 
@@ -4214,3 +4222,182 @@ void set_pageblock_flags_group(struct pa
else
__clear_bit(bitidx + start_bitidx, bitmap);
 }
+
+#ifdef CONFIG_PAGE_ISOLATION
+/*
+ * Page Isolation.
+ *
+ * If a page is removed from the usual free_list and will never be used,
+ * it is linked to a struct isolation_info with its Reserved and Private
+ * bits set. page->mapping points to the isolation_info in it,
+ * and page_count(page) is 0.
+ *
+ * This can be used for creating a chunk of contiguous *unused* memory.
+ *
+ * current user is Memory-Hot-Remove.
+ * maybe move to some other file is better.
+ */
+static void
+isolate_page_nolock(struct isolation_info *info, struct page *page, int order)
+{
+   int pagenum;
+   pagenum = 1  order;
+   while (pagenum  0) {
+   SetPageReserved(page);
+   SetPagePrivate(page);
+   page-private = (unsigned long)info;
+   list_add(page-lru, info-pages);
+   page++;
+   pagenum--;
+   }
+}
+
+/*
+ * This function is called from page_under_isolation()l
+ */
+
+int __page_under_isolation(struct zone *zone, struct page *page, int order)
+{
+   struct isolation_info *info;
+   unsigned long pfn = page_to_pfn(page);
+   unsigned long flags;
+   int found = 0;
+
+   spin_lock_irqsave(&zone->isolation_lock, flags);
+   list_for_each_entry(info, &zone->isolation_list, list) {
+   if (info->start_pfn <= pfn && pfn < info->end_pfn) {
+   found = 1;
+   break;
+   }
+   }
+   if (found) {
+   isolate_page_nolock(info, page, order);
+   }
+   spin_unlock_irqrestore(&zone->isolation_lock, flags);
+   return found;
+}
+
+/*
+ * start and end must be in the same zone.
+ *
+ */
+struct isolation_info  *
+register_isolation(unsigned long start, unsigned long end)
+{
+   struct zone *zone;
+   struct isolation_info *info = NULL, *tmp;
+   unsigned long flags;
+   unsigned long last_pfn = end - 1;
+
+   if (!pfn_valid(start) || !pfn_valid(last_pfn) || (start >= end))
+   return ERR_PTR(-EINVAL);
+   /* check start and end is in the same zone */
+   zone = page_zone(pfn_to_page(start));
+
+   if (zone != page_zone(pfn_to_page(last_pfn)))
+   return ERR_PTR

[RFC] memory hotremove patch take 2 [04/10] (isolate all free pages)

2007-05-08 Thread Yasunori Goto
Isolate all freed pages (meaning those in the buddy list) in the range.
See the page_is_buddy() and free_one_page() functions if unsure.

Signed-Off-By: KAMEZAWA Hiroyuki [EMAIL PROTECTED]
Signed-off-by: Yasunori Goto [EMAIL PROTECTED]

 include/linux/page_isolation.h |1 
 mm/page_alloc.c|   45 +
 2 files changed, 46 insertions(+)

Index: current_test/mm/page_alloc.c
===
--- current_test.orig/mm/page_alloc.c   2007-05-08 15:08:04.0 +0900
+++ current_test/mm/page_alloc.c2007-05-08 15:08:26.0 +0900
@@ -4411,6 +4411,51 @@ free_all_isolated_pages(struct isolation
}
 }
 
+/*
+ * Isolate already freed pages.
+ */
+int
+capture_isolate_freed_pages(struct isolation_info *info)
+{
+   struct zone *zone;
+   unsigned long pfn;
+   struct page *page;
+   int order, order_size;
+   int nr_pages = 0;
+   unsigned long last_pfn = info->end_pfn - 1;
+   pfn = info->start_pfn;
+   if (!pfn_valid(pfn))
+   return -EINVAL;
+   zone = info->zone;
+   if ((zone != page_zone(pfn_to_page(pfn))) ||
+   (zone != page_zone(pfn_to_page(last_pfn))))
+   return -EINVAL;
+   drain_all_pages();
+   spin_lock(&zone->lock);
+   while (pfn < info->end_pfn) {
+   if (!pfn_valid(pfn)) {
+   pfn++;
+   continue;
+   }
+   page = pfn_to_page(pfn);
+   /* See page_is_buddy()  */
+   if (page_count(page) == 0 && PageBuddy(page)) {
+   order = page_order(page);
+   order_size = 1 << order;
+   zone->free_area[order].nr_free--;
+   __mod_zone_page_state(zone, NR_FREE_PAGES, -order_size);
+   list_del(&page->lru);
+   rmv_page_order(page);
+   isolate_page_nolock(info, page, order);
+   nr_pages += order_size;
+   pfn += order_size;
+   } else {
+   pfn++;
+   }
+   }
+   spin_unlock(&zone->lock);
+   return nr_pages;
+}
 #endif /* CONFIG_PAGE_ISOLATION */
 
 
Index: current_test/include/linux/page_isolation.h
===
--- current_test.orig/include/linux/page_isolation.h2007-05-08 
15:08:04.0 +0900
+++ current_test/include/linux/page_isolation.h 2007-05-08 15:08:27.0 
+0900
@@ -40,6 +40,7 @@ extern void free_isolation_info(struct i
 extern void unuse_all_isolated_pages(struct isolation_info *info);
 extern void free_all_isolated_pages(struct isolation_info *info);
 extern void drain_all_pages(void);
+extern int capture_isolate_freed_pages(struct isolation_info *info);
 
 #else
 

-- 
Yasunori Goto 




[RFC] memory hotremove patch take 2 [01/10] (counter of removable page)

2007-05-08 Thread Yasunori Goto
Show #of Movable pages and vmstat.

Signed-Off-By: KAMEZAWA Hiroyuki [EMAIL PROTECTED]
Signed-off-by: Yasunori Goto [EMAIL PROTECTED]

 arch/ia64/mm/init.c|2 ++
 drivers/base/node.c|4 
 fs/proc/proc_misc.c|4 
 include/linux/kernel.h |2 ++
 include/linux/swap.h   |1 +
 mm/page_alloc.c|   22 ++
 6 files changed, 35 insertions(+)

Index: current_test/mm/page_alloc.c
===
--- current_test.orig/mm/page_alloc.c   2007-05-08 15:06:50.0 +0900
+++ current_test/mm/page_alloc.c2007-05-08 15:08:36.0 +0900
@@ -58,6 +58,7 @@ unsigned long totalram_pages __read_most
 unsigned long totalreserve_pages __read_mostly;
 long nr_swap_pages;
 int percpu_pagelist_fraction;
+unsigned long total_movable_pages __read_mostly;
 
 static void __free_pages_ok(struct page *page, unsigned int order);
 
@@ -1827,6 +1828,18 @@ static unsigned int nr_free_zone_pages(i
return sum;
 }
 
+unsigned int nr_free_movable_pages(void)
+{
+   unsigned long nr_pages = 0;
+   struct zone *zone;
+   int nid;
+
+   for_each_online_node(nid) {
+   zone = &(NODE_DATA(nid)->node_zones[ZONE_MOVABLE]);
+   nr_pages += zone_page_state(zone, NR_FREE_PAGES);
+   }
+   return nr_pages;
+}
 /*
  * Amount of free RAM allocatable within ZONE_DMA and ZONE_NORMAL
  */
@@ -1889,6 +1902,8 @@ void si_meminfo(struct sysinfo *val)
val->totalhigh = totalhigh_pages;
val->freehigh = nr_free_highpages();
val->mem_unit = PAGE_SIZE;
+   val->movable = total_movable_pages;
+   val->free_movable = nr_free_movable_pages();
 }
 
 EXPORT_SYMBOL(si_meminfo);
@@ -1908,6 +1923,11 @@ void si_meminfo_node(struct sysinfo *val
val->totalhigh = 0;
val->freehigh = 0;
 #endif
+
+   val->movable = pgdat->node_zones[ZONE_MOVABLE].present_pages;
+   val->free_movable = zone_page_state(&pgdat->node_zones[ZONE_MOVABLE],
+   NR_FREE_PAGES);
+
val->mem_unit = PAGE_SIZE;
 }
 #endif
@@ -3216,6 +3236,8 @@ static void __meminit free_area_init_cor
 
zone->spanned_pages = size;
zone->present_pages = realsize;
+   if (j == ZONE_MOVABLE)
+   total_movable_pages += realsize;
 #ifdef CONFIG_NUMA
zone->node = nid;
zone->min_unmapped_pages = (realsize*sysctl_min_unmapped_ratio)
Index: current_test/include/linux/kernel.h
===
--- current_test.orig/include/linux/kernel.h2007-05-08 15:06:49.0 
+0900
+++ current_test/include/linux/kernel.h 2007-05-08 15:07:20.0 +0900
@@ -352,6 +352,8 @@ struct sysinfo {
unsigned short pad; /* explicit padding for m68k */
unsigned long totalhigh;/* Total high memory size */
unsigned long freehigh; /* Available high memory size */
+   unsigned long movable;  /* pages used only for data */
+   unsigned long free_movable; /* Available pages in movable */
unsigned int mem_unit;  /* Memory unit size in bytes */
char _f[20-2*sizeof(long)-sizeof(int)]; /* Padding: libc5 uses this.. */
 };
Index: current_test/fs/proc/proc_misc.c
===
--- current_test.orig/fs/proc/proc_misc.c   2007-05-08 15:06:48.0 
+0900
+++ current_test/fs/proc/proc_misc.c2007-05-08 15:07:20.0 +0900
@@ -161,6 +161,8 @@ static int meminfo_read_proc(char *page,
"LowTotal: %8lu kB\n"
"LowFree:  %8lu kB\n"
 #endif
+   "MovableTotal: %8lu kB\n"
+   "MovableFree:  %8lu kB\n"
"SwapTotal:%8lu kB\n"
"SwapFree: %8lu kB\n"
"Dirty:%8lu kB\n"
@@ -191,6 +193,8 @@ static int meminfo_read_proc(char *page,
K(i.totalram-i.totalhigh),
K(i.freeram-i.freehigh),
 #endif
+   K(i.movable),
+   K(i.free_movable),
K(i.totalswap),
K(i.freeswap),
K(global_page_state(NR_FILE_DIRTY)),
Index: current_test/drivers/base/node.c
===
--- current_test.orig/drivers/base/node.c   2007-05-08 15:06:10.0 
+0900
+++ current_test/drivers/base/node.c2007-05-08 15:07:20.0 +0900
@@ -55,6 +55,8 @@ static ssize_t node_read_meminfo(struct 
  "Node %d LowTotal: %8lu kB\n"
  "Node %d LowFree:  %8lu kB\n"
 #endif
+  "Node %d MovableTotal: %8lu kB\n"
+  "Node %d MovableFree:  %8lu kB\n"
  "Node %d Dirty:%8lu kB\n"
  "Node %d Writeback:%8lu kB\n"
  "Node %d FilePages:%8lu kB\n"
@@ -77,6

[RFC] memory hotremove patch take 2 [03/10] (drain all pages)

2007-05-08 Thread Yasunori Goto
This patch adds the function drain_all_pages(void) to drain all
pages on the per-cpu freelists.
Page isolation will catch them in free_one_page().

Signed-Off-By: KAMEZAWA Hiroyuki [EMAIL PROTECTED]
Signed-off-by: Yasunori Goto [EMAIL PROTECTED]

 include/linux/page_isolation.h |1 +
 mm/page_alloc.c|   13 +
 2 files changed, 14 insertions(+)

Index: current_test/mm/page_alloc.c
===
--- current_test.orig/mm/page_alloc.c   2007-05-08 15:08:03.0 +0900
+++ current_test/mm/page_alloc.c2007-05-08 15:08:33.0 +0900
@@ -1070,6 +1070,19 @@ void drain_all_local_pages(void)
smp_call_function(smp_drain_local_pages, NULL, 0, 1);
 }
 
+#ifdef CONFIG_PAGE_ISOLATION
+static void drain_local_zone_pages(struct work_struct *work)
+{
+   drain_local_pages();
+}
+
+void drain_all_pages(void)
+{
+   schedule_on_each_cpu(drain_local_zone_pages);
+}
+
+#endif /* CONFIG_PAGE_ISOLATION */
+
 /*
  * Free a 0-order page
  */
Index: current_test/include/linux/page_isolation.h
===
--- current_test.orig/include/linux/page_isolation.h2007-05-08 
15:08:03.0 +0900
+++ current_test/include/linux/page_isolation.h 2007-05-08 15:08:33.0 
+0900
@@ -39,6 +39,7 @@ extern void detach_isolation_info_zone(s
 extern void free_isolation_info(struct isolation_info *info);
 extern void unuse_all_isolated_pages(struct isolation_info *info);
 extern void free_all_isolated_pages(struct isolation_info *info);
+extern void drain_all_pages(void);
 
 #else
 

-- 
Yasunori Goto 




[RFC] memory hotremove patch take 2 [05/10] (make basic remove code)

2007-05-08 Thread Yasunori Goto
Add the MEMORY_HOTREMOVE config and implement the basic algorithm.

This config selects ZONE_MOVABLE and PAGE_ISOLATION.

How it works:
1. Register an isolation area over the specified section.
2. Search the mem_map and migrate pages.
3. Detach the isolation and make the pages unused.

This works in my easy test, but I think I need more work on the loop
algorithm and policy.
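
To make the flow concrete, the core of it looks roughly like this (a
simplified sketch using the helper names from this series, not the
literal patch body; error handling and the timeout policy are elided):

static int offline_pages_sketch(unsigned long start_pfn,
		unsigned long end_pfn, unsigned long timeout)
{
	unsigned long expire = jiffies + timeout;
	struct isolation_info *info;
	int ret;

	/* 1. register isolation: pages freed in the range are captured */
	info = register_isolation(start_pfn, end_pfn);
	if (IS_ERR(info))
		return PTR_ERR(info);
	capture_isolate_freed_pages(info);

	/* 2. migrate used pages out of the range until none remain */
	do {
		ret = do_migrate_and_isolate_pages(info, start_pfn, end_pfn);
	} while (ret != 0 && time_before(jiffies, expire));

	/* 3. detach the isolation and make the pages unused */
	detach_isolation_info_zone(info);
	unuse_all_isolated_pages(info);
	free_isolation_info(info);
	return ret;
}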

Signed-Off-By: KAMEZAWA Hiroyuki [EMAIL PROTECTED]
Signed-off-by: Yasunori Goto [EMAIL PROTECTED]


 include/linux/memory_hotplug.h |1 
 mm/Kconfig |8 +
 mm/memory_hotplug.c|  221 +
 3 files changed, 229 insertions(+), 1 deletion(-)

Index: current_test/mm/Kconfig
===
--- current_test.orig/mm/Kconfig2007-05-08 15:08:03.0 +0900
+++ current_test/mm/Kconfig 2007-05-08 15:08:27.0 +0900
@@ -126,6 +126,12 @@ config MEMORY_HOTPLUG_SPARSE
def_bool y
depends on SPARSEMEM  MEMORY_HOTPLUG
 
+config MEMORY_HOTREMOVE
+   bool Allow for memory hot-remove
+   depends on MEMORY_HOTPLUG_SPARSE
+   select  MIGRATION
+   select  PAGE_ISOLATION
+
 # Heavily threaded applications may benefit from splitting the mm-wide
 # page_table_lock, so that faults on different parts of the user address
 # space can be handled with less contention: split it at this NR_CPUS.
@@ -145,7 +151,7 @@ config SPLIT_PTLOCK_CPUS
 config MIGRATION
bool Page migration
def_bool y
-   depends on NUMA
+   depends on NUMA || MEMORY_HOTREMOVE
help
  Allows the migration of the physical location of pages of processes
  while the virtual addresses are not changed. This is useful for
Index: current_test/mm/memory_hotplug.c
===
--- current_test.orig/mm/memory_hotplug.c   2007-05-08 15:02:48.0 
+0900
+++ current_test/mm/memory_hotplug.c2007-05-08 15:08:27.0 +0900
@@ -23,6 +23,9 @@
 #include linux/vmalloc.h
 #include linux/ioport.h
 #include linux/cpuset.h
+#include linux/page_isolation.h
+#include linux/delay.h
+#include linux/migrate.h
 
 #include asm/tlbflush.h
 
@@ -308,3 +311,221 @@ error:
return ret;
 }
 EXPORT_SYMBOL_GPL(add_memory);
+
+
+
+#ifdef CONFIG_MEMORY_HOTREMOVE
+
+/*
+ * Just an easy implementation.
+ */
+static struct page *
+hotremove_migrate_alloc(struct page *page,
+   unsigned long private,
+   int **x)
+{
+   return alloc_page(GFP_HIGH_MOVABLE);
+}
+
+/* # of pages scanned per iteration */
+#define HOTREMOVE_UNIT (1024)
+
+static int do_migrate_and_isolate_pages(struct isolation_info *info,
+   unsigned long start_pfn,
+   unsigned long end_pfn)
+{
+   int move_pages = HOTREMOVE_UNIT;
+   int ret, managed, not_managed;
+   unsigned long pfn;
+   struct page *page;
+   LIST_HEAD(source);
+
+   not_managed = 0;
+   for (pfn = start_pfn; pfn  end_pfn  move_pages  0; pfn++) {
+   if (!pfn_valid(pfn))  /* never happens in sparsemem */
+   continue;
+   page = pfn_to_page(pfn);
+   if (is_page_isolated(info,page))
+   continue;
+   ret = isolate_lru_page(page, &source);
+
+   if (ret == 0) {
+   move_pages--;
+   managed++;
+   } else {
+   if (page_count(page))
+   not_managed++; /* someone uses this */
+   }
+   }
+   ret = -EBUSY;
+   if (not_managed) {
+   if (!list_empty(&source))
+   putback_lru_pages(&source);
+   goto out;
+   }
+   ret = 0;
+   if (list_empty(&source))
+   goto out;
+   /* this function returns # of failed pages */
+   ret = migrate_pages(&source, hotremove_migrate_alloc,
+  (unsigned long)info);
+out:
+   return ret;
+}
+
+
+/*
+ * Check whether all pages registered as IORESOURCE_RAM are isolated.
+ */
+static int check_removal_success(struct isolation_info *info)
+{
+   struct resource res;
+   unsigned long section_end;
+   unsigned long start_pfn, i, nr_pages;
+   struct page *page;
+   int removed = 0;
+   res.start = info->start_pfn << PAGE_SHIFT;
+   res.end = (info->end_pfn - 1) << PAGE_SHIFT;
+   res.flags = IORESOURCE_MEM;
+   section_end = res.end;
+   while ((res.start < res.end) && (find_next_system_ram(&res) >= 0)) {
+   start_pfn = (res.start >> PAGE_SHIFT);
+   nr_pages = (res.end + 1UL - res.start) >> PAGE_SHIFT;
+   for (i = 0; i < nr_pages; i++) {
+   page = pfn_to_page(start_pfn + i);
+   if (!is_page_isolated(info, page))
+   return -EBUSY;

[RFC] memory hotremove patch take 2 [06/10] (ia64's remove_memory code)

2007-05-08 Thread Yasunori Goto
Call offline_pages() from remove_memory().
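
For instance, with ia64's default 16KB pages the pfn arithmetic in the hunk
below works out as in this worked example (hypothetical values, not part of
the patch):

/* Worked example of the pfn arithmetic in remove_memory(). */
#include <stdio.h>

#define PAGE_SHIFT 14	/* ia64 default: 16KB pages (assumed config) */

int main(void)
{
	unsigned long long start = 0x100000000ULL;	/* physical base */
	unsigned long long size  = 0x40000000ULL;	/* 1GB */
	unsigned long start_pfn = start >> PAGE_SHIFT;		   /* 0x40000 */
	unsigned long end_pfn = start_pfn + (size >> PAGE_SHIFT); /* 0x50000 */

	printf("start_pfn=%#lx end_pfn=%#lx\n", start_pfn, end_pfn);
	return 0;
}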
Signed-off-by: Yasunori Goto [EMAIL PROTECTED]

Signed-Off-By: KAMEZAWA Hiroyuki [EMAIL PROTECTED]
 arch/ia64/mm/init.c |   13 -
 1 files changed, 12 insertions(+), 1 deletion(-)

Index: current_test/arch/ia64/mm/init.c
===
--- current_test.orig/arch/ia64/mm/init.c	2007-05-08 15:07:20.0 +0900
+++ current_test/arch/ia64/mm/init.c	2007-05-08 15:08:07.0 +0900
@@ -726,7 +726,18 @@ int arch_add_memory(int nid, u64 start, 
 
 int remove_memory(u64 start, u64 size)
 {
-   return -EINVAL;
+   unsigned long start_pfn, end_pfn;
+   unsigned long timeout = 120 * HZ;
+   int ret;
+   start_pfn = start >> PAGE_SHIFT;
+   end_pfn = start_pfn + (size >> PAGE_SHIFT);
+   ret = offline_pages(start_pfn, end_pfn, timeout);
+   if (ret)
+   goto out;
+   /* we can free mem_map at this point */
+out:
+   return ret;
 }
+
 EXPORT_SYMBOL_GPL(remove_memory);
 #endif

-- 
Yasunori Goto 


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC] memory hotremove patch take 2 [10/10] (retry swap-in page)

2007-05-08 Thread Yasunori Goto

There is a race condition between swap-in and unmap_and_move().
When a swap-in occurs, page_mapped() may not be set yet, so
unmap_and_move() gives up at once and retries later.
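
In other words, the swap-in path has already added the page to the swap
cache and taken a reference, but has not installed the pte yet. A small
illustrative predicate for the condition the hunk below tests (the helper
name is hypothetical, not in the patch):

/* Hypothetical helper: a swap-in is in flight for this page. */
static inline int swapin_in_progress(struct page *page)
{
	return PageSwapCache(page) && !page_mapped(page);
}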



Signed-off-by: Yasunori Goto [EMAIL PROTECTED]


 mm/migrate.c |5 +
 1 files changed, 5 insertions(+)

Index: current_test/mm/migrate.c
===
--- current_test.orig/mm/migrate.c  2007-05-08 15:08:09.0 +0900
+++ current_test/mm/migrate.c   2007-05-08 15:08:09.0 +0900
@@ -670,6 +670,11 @@ static int unmap_and_move(new_page_t get
/* hold this anon_vma until remove_migration_ptes() finishes */
anon_vma_hold(page);
}
+
+   if (PageSwapCache(page) && !page_mapped(page))
+   /* being swapped in right now; try later */
+   goto unlock;
+
/*
 * Establish migration ptes or remove ptes
 */

-- 
Yasunori Goto 




[RFC] memory hotremove patch take 2 [09/10] (direct isolation for remove)

2007-05-08 Thread Yasunori Goto

This patch isolates the source page of a migration as soon as possible
in unmap_and_move() during memory hot-remove.

The old code just used put_page(), and we expected the migrated source
page to be caught in __free_one_page() as an isolated page. But it is
pooled in per_cpu_pages and soon reused as the destination page of the
next migration. This was the cause of an endless loop in
offline_pages().
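
Concretely, being "directly isolated" means the freed source page is
stamped as owned by the isolation area instead of being fed back through
put_page() into the per-cpu lists. A sketch of the marking, matching the
test in is_page_isolated_noinfo() below; the real work is done by
page_under_isolation(), which also takes the zone and updates its
accounting:

/* Sketch of the capture step; not the actual page_under_isolation(). */
static void capture_isolated_page(struct page *page,
				  struct isolation_info *info)
{
	SetPageReserved(page);		/* keep it away from the allocator */
	SetPagePrivate(page);
	set_page_private(page, (unsigned long)info);	/* record the owner */
}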

Signed-off-by: Yasunori Goto [EMAIL PROTECTED]

 include/linux/page_isolation.h |   14 
 mm/Kconfig |1 
 mm/migrate.c   |   46 +++--
 3 files changed, 59 insertions(+), 2 deletions(-)

Index: current_test/mm/migrate.c
===
--- current_test.orig/mm/migrate.c  2007-05-08 15:08:07.0 +0900
+++ current_test/mm/migrate.c   2007-05-08 15:08:21.0 +0900
@@ -249,6 +249,32 @@ static void remove_migration_ptes(struct
remove_file_migration_ptes(old, new);
 }
 
+
+static int
+is_page_isolated_noinfo(struct page *page)
+{
+   int ret = 0;
+   struct zone *zone;
+   unsigned long flags;
+   struct isolation_info *info;
+
+   if (unlikely(PageReserved(page) && PagePrivate(page) &&
+   page_count(page) == 1)) {
+   zone = page_zone(page);
+   spin_lock_irqsave(&zone->isolation_lock, flags);
+   list_for_each_entry(info, &zone->isolation_list, list) {
+   if (PageReserved(page) && PagePrivate(page) &&
+   page_count(page) == 1 &&
+   page->private == (unsigned long)info) {
+   ret = 1;
+   break;
+   }
+   }
+   spin_unlock_irqrestore(&zone->isolation_lock, flags);
+
+   }
+   return ret;
+}
 /*
  * Something used the pte of a page under migration. We need to
  * get to the page and wait until migration is finished.
@@ -278,7 +304,14 @@ void migration_entry_wait(struct mm_stru
get_page(page);
pte_unmap_unlock(ptep, ptl);
wait_on_page_locked(page);
-   put_page(page);
+
+   /*
+    * The page might have been migrated and directly isolated.
+    * If not, release the page.
+    */
+   if (!is_page_isolated_noinfo(page))
+   put_page(page);
+
return;
 out:
pte_unmap_unlock(ptep, ptl);
@@ -653,6 +686,15 @@ static int unmap_and_move(new_page_t get
anon_vma_release(page);
}
 
+   if (rc != -EAGAIN && is_migrate_isolation(flag)) {
+   /* page must be removed sooner. */
+   list_del(&page->lru);
+   page_under_isolation(page_zone(page), page, 0);
+   __put_page(page);
+   unlock_page(page);
+   goto move_newpage;
+   }
+
 unlock:
unlock_page(page);
 
@@ -758,7 +800,7 @@ int migrate_pages_and_remove(struct list
new_page_t get_new_page, unsigned long private)
 {
return __migrate_pages(from, get_new_page, private,
-   MIGRATE_NOCONTEXT);
+   MIGRATE_NOCONTEXT | MIGRATE_ISOLATION);
 }
 #endif
 
Index: current_test/include/linux/page_isolation.h
===
--- current_test.orig/include/linux/page_isolation.h	2007-05-08 15:08:07.0 +0900
+++ current_test/include/linux/page_isolation.h	2007-05-08 15:08:09.0 +0900
@@ -33,12 +33,20 @@ is_page_isolated(struct isolation_info *
 }
 
 #define MIGRATE_NOCONTEXT 0x1
+#define MIGRATE_ISOLATION 0x2
+
 static inline int
 is_migrate_nocontext(int flag)
 {
	return (flag & MIGRATE_NOCONTEXT) == MIGRATE_NOCONTEXT;
 }
 
+static inline int
+is_migrate_isolation(int flag)
+{
+   return (flag & MIGRATE_ISOLATION) == MIGRATE_ISOLATION;
+}
+
 extern struct isolation_info *
 register_isolation(unsigned long start, unsigned long end);
 
@@ -64,5 +72,11 @@ is_migrate_nocontext(int flag)
return 0;
 }
 
+static inline int
+is_migrate_isolation(int flag)
+{
+   return 0;
+}
+
 #endif
 #endif
Index: current_test/mm/Kconfig
===
--- current_test.orig/mm/Kconfig2007-05-08 15:08:07.0 +0900
+++ current_test/mm/Kconfig 2007-05-08 15:08:09.0 +0900
@@ -169,6 +169,7 @@ config MIGRATION_REMOVE
   migration target pages. This has a small race condition.
   If this config is selected, a workaround to fix it is enabled.
   This may have a slight performance impact.
+  In addition, pages must be isolated earlier for removal.
 
 config RESOURCES_64BIT
	bool "64 bit Memory and IO resources (EXPERIMENTAL)" if (!64BIT && EXPERIMENTAL)

-- 
Yasunori Goto 



Re: [RFC] memory hotremove patch take 2 [07/10] (delay freeing anon_vma)

2007-05-08 Thread Yasunori Goto
Delay freeing the anon_vma until migration finishes.

We cannot trust page->mapping (of an ANON page) when
page_mapcount(page) == 0, and page migration sets
page_mapcount(page) to 0. So we have to guarantee by some hook that
the anon_vma pointed to by page->mapping stays valid.

Normal page migration guarantees this via mmap_sem, but we can't take
it here. So just delay freeing the anon_vma.
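
Since the mm/rmap.c hunk is cut off in this archive, here is a sketch of
the hold/release pair the description implies, built on the atomic_t hold
field added below; the locking and exact call sites are assumptions:

/* Sketch only: simplified, locking omitted. */
void anon_vma_hold(struct page *page)
{
	struct anon_vma *anon_vma = (struct anon_vma *)
		((unsigned long)page->mapping - PAGE_MAPPING_ANON);

	atomic_inc(&anon_vma->hold);	/* pin it across the migration */
}

void anon_vma_release(struct page *page)
{
	struct anon_vma *anon_vma = (struct anon_vma *)
		((unsigned long)page->mapping - PAGE_MAPPING_ANON);

	if (atomic_dec_and_test(&anon_vma->hold) &&
	    list_empty(&anon_vma->head))
		anon_vma_free(anon_vma);	/* we were the last holder */
}

void anon_vma_free(struct anon_vma *anon_vma)
{
	/* defer the real free while a migration still holds it */
	if (atomic_read(&anon_vma->hold))
		return;
	kmem_cache_free(anon_vma_cachep, anon_vma);
}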

Signed-Off-By: KAMEZAWA Hiroyuki [EMAIL PROTECTED]
Signed-off-by: Yasunori Goto [EMAIL PROTECTED]

 include/linux/migrate.h|2 ++
 include/linux/page_isolation.h |   14 ++
 include/linux/rmap.h   |   22 ++
 mm/Kconfig |   12 
 mm/memory_hotplug.c|4 ++--
 mm/migrate.c   |   37 +++--
 mm/rmap.c  |   36 +++-
 7 files changed, 118 insertions(+), 9 deletions(-)

Index: current_test/mm/migrate.c
===
--- current_test.orig/mm/migrate.c  2007-05-08 15:06:50.0 +0900
+++ current_test/mm/migrate.c   2007-05-08 15:08:24.0 +0900
@@ -28,6 +28,7 @@
 #include <linux/mempolicy.h>
 #include <linux/vmalloc.h>
 #include <linux/security.h>
+#include <linux/page_isolation.h>
 
 #include "internal.h"
 
@@ -607,7 +608,7 @@ static int move_to_new_page(struct page 
  * to the newly allocated page in newpage.
  */
 static int unmap_and_move(new_page_t get_new_page, unsigned long private,
-   struct page *page, int force)
+   struct page *page, int force, int flag)
 {
int rc = 0;
int *result = NULL;
@@ -632,7 +633,10 @@ static int unmap_and_move(new_page_t get
goto unlock;
wait_on_page_writeback(page);
}
-
+   if (PageAnon(page) && is_migrate_nocontext(flag)) {
+   /* hold this anon_vma until remove_migration_ptes() finishes */
+   anon_vma_hold(page);
+   }
/*
 * Establish migration ptes or remove ptes
 */
@@ -640,8 +644,14 @@ static int unmap_and_move(new_page_t get
if (!page_mapped(page))
rc = move_to_new_page(newpage, page);
 
-   if (rc)
+   if (rc) {
remove_migration_ptes(page, page);
+   if (PageAnon(page) && is_migrate_nocontext(flag))
+   anon_vma_release(page);
+   } else {
+   if (PageAnon(newpage) && is_migrate_nocontext(flag))
+   anon_vma_release(page);
+   }
 
 unlock:
unlock_page(page);
@@ -686,8 +696,8 @@ move_newpage:
  *
  * Return: Number of pages not migrated or error code.
  */
-int migrate_pages(struct list_head *from,
-   new_page_t get_new_page, unsigned long private)
+static int __migrate_pages(struct list_head *from,
+   new_page_t get_new_page, unsigned long private, int flag)
 {
int retry = 1;
int nr_failed = 0;
@@ -707,7 +717,7 @@ int migrate_pages(struct list_head *from
cond_resched();
 
rc = unmap_and_move(get_new_page, private,
-   page, pass > 2);
+   page, pass > 2, flag);
 
switch(rc) {
case -ENOMEM:
@@ -737,6 +747,21 @@ out:
return nr_failed + retry;
 }
 
+int migrate_pages(struct list_head *from,
+   new_page_t get_new_page, unsigned long private)
+{
+   return __migrate_pages(from, get_new_page, private, 0);
+}
+
+#ifdef CONFIG_MIGRATION_REMOVE
+int migrate_pages_and_remove(struct list_head *from,
+   new_page_t get_new_page, unsigned long private)
+{
+   return __migrate_pages(from, get_new_page, private,
+   MIGRATE_NOCONTEXT);
+}
+#endif
+
 #ifdef CONFIG_NUMA
 /*
  * Move a list of individual pages
Index: current_test/include/linux/rmap.h
===
--- current_test.orig/include/linux/rmap.h	2007-05-08 15:06:49.0 +0900
+++ current_test/include/linux/rmap.h	2007-05-08 15:08:07.0 +0900
@@ -26,6 +26,9 @@
 struct anon_vma {
	spinlock_t lock;	/* Serialize access to vma list */
struct list_head head;  /* List of private related vmas */
+#ifdef CONFIG_MIGRATION_REMOVE
+	atomic_t	hold;	/* == 0 if we can free this immediately */
+#endif
 };
 
 #ifdef CONFIG_MMU
@@ -37,10 +40,14 @@ static inline struct anon_vma *anon_vma_
return kmem_cache_alloc(anon_vma_cachep, GFP_KERNEL);
 }
 
+#ifndef CONFIG_MIGRATION_REMOVE
 static inline void anon_vma_free(struct anon_vma *anon_vma)
 {
kmem_cache_free(anon_vma_cachep, anon_vma);
 }
+#else
+extern void anon_vma_free(struct anon_vma *anon_vma);
+#endif
 
 static inline void anon_vma_lock(struct vm_area_struct *vma)
 {
@@ -75,6 +82,21 @@ void page_add_file_rmap(struct page *);
 void
