Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-30 Thread Toshi Kani
On Thu, 2012-11-29 at 10:25 +0800, Jiang Liu wrote:
> On 2012-11-29 9:42, Jaegeuk Hanse wrote:
> > On Wed, Nov 28, 2012 at 04:47:42PM +0800, Jiang Liu wrote:
> >> Hi all,
> >>Seems it's a great chance to discuss about the memory hotplug feature
> >> within this thread. So I will try to give some high level thoughts about 
> >> memory
> >> hotplug feature on x86/IA64. Any comments are welcomed!
> >>First of all, I think usability really matters. Ideally, memory hotplug
> >> feature should just work out of box, and we shouldn't expect 
> >> administrators to 
> >> add several extra platform dependent parameters to enable memory hotplug. 
> >> But how to enable memory (or CPU/node) hotplug out of box? I think the key 
> >> point
> >> is to cooperate with BIOS/ACPI/firmware/device management teams. 
> >>I still position memory hotplug as an advanced feature for high end 
> >> servers and those systems may/should provide some management interfaces to 
> >> configure CPU/memory/node hotplug features. The configuration UI may be 
> >> provided
> >> by BIOS, BMC or centralized system management suite. Once administrator 
> >> enables
> >> hotplug feature through those management UI, OS should support system 
> >> device
> >> hotplug out of box. For example, HP SuperDome2 management suite provides 
> >> interface
> >> to configure a node as floating node(hot-removable). And OpenSolaris 
> >> supports
> >> CPU/memory hotplug out of box without any extra configurations. So we 
> >> should
> >> shape interfaces between firmware and OS to better support system device 
> >> hotplug.

Well described.  I agree with you.  I am also OK to have the boot option
for the time being, but we should be able to get the info from ACPI for
better TCE.

> >>    On the other hand, I think there are no commercially available
> >> x86/IA64 platforms with system device hotplug capabilities in the
> >> field yet, or at most only a limited quantity. So backward
> >> compatibility is not a big issue for us now.

HP SuperDome is IA64-based and supports node hotplug when running with
HP-UX.  It implements a vendor-unique ACPI interface to describe movable
memory ranges.

> >> So I think it's doable to rely on firmware to provide better support
> >> for system device hotplug.
> >>    Then what should be enhanced to better support system device
> >> hotplug?
> >>
> >> 1) The ACPI specification should be enhanced to provide a static
> >> table describing components with hotplug features, so the OS could
> >> reserve special resources for hotplug at early boot stages, for
> >> example to reserve enough CPU ids for CPU hot-add. Currently we guess
> >> the maximum number of CPUs supported by the platform by counting CPU
> >> entries in the APIC table, which is not reliable.

Right.  HP SuperDome implements a vendor-unique ACPI interface for this
as well.  For Linux, it would be nice to have a standard interface defined.

> >> 2) The BIOS should implement the SRAT, MPST and PMTT tables to
> >> better support memory hotplug. SRAT associates memory ranges with
> >> proximity domains, with an extra "hotpluggable" flag. PMTT provides
> >> memory device topology information, such as "socket->memory
> >> controller->DIMM". MPST is used for memory power management and
> >> provides a way to associate memory ranges with the memory devices in
> >> PMTT. With all the information from SRAT, MPST and PMTT, the OS could
> >> figure out hotpluggable memory ranges automatically, so no extra
> >> kernel parameters would be needed.

I agree that using SRAT is a good compromise.  The hotpluggable flag is
supposed to indicate the platform's capability, but it could be used for
this purpose until we have a better interface defined.

> >> 3) Enhance ACPICA to provide a method to scan static ACPI tables
> >> before the memory subsystem has been initialized, because the OS
> >> needs to access SRAT, MPST and PMTT when initializing the memory
> >> subsystem.

I do not think this is an ACPICA issue.  HP-UX also uses ACPICA, and it
can access ACPI tables and walk the ACPI namespace during early boot.
This is achieved by having the acpi_os layer use a special early
boot-time memory allocator.  As a result, the boot-time and hot-add
configuration code paths are very consistent in HP-UX.

> >> 4) The last and most important issue is how to minimize the
> >> performance drop caused by memory hotplug. As proposed by this
> >> patchset, once we configure all memory of a NUMA node as movable, it
> >> essentially disables NUMA optimization of kernel memory allocation
> >> from that node. In our experience that causes a huge performance
> >> drop: we have observed a 10-30% performance drop with memory hotplug
> >> enabled, and on another OS the average performance drop caused by
> >> memory hotplug is about 10%. If we can't resolve the performance
> >> drop, memory hotplug is just a feature for demos :( With help from
> >> hardware, we do have some chances to reduce the performance penalty
> >> caused by memory hotplug.
> >>    As we know, Linux can migrate movable pages, but can't migrate
> >> non-movable pages used by the kernel/DMA etc. And the hardest part is
> >> how to deal with those unmovable pages when hot-removing a memory
> >> device. Now hardware has

Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-30 Thread Jiang Liu
On 11/30/2012 11:15 AM, Yasuaki Ishimatsu wrote:
> Hi Jiang,
> 
>>
>> For the first issue, I think we could automatically convert pages
>> from movable zones into normal zones. Congyang from Fujitsu has
>> provided a patchset to manually convert pages from movable zones into
>> normal zones; I think we could extend that mechanism to convert
>> automatically when normal zones are under pressure, by hooking into
>> the slow page allocation path.
>>
>> We rely on hardware features to solve the second and third issues.
>> Some new platforms provide a new RAS feature called "hardware memory
>> migration", which transparently migrates memory from one memory device
>> to another. With hardware memory migration, we could configure one
>> memory device on a NUMA node to host the normal zone, and the other
>> memory devices to host the movable zone. With this configuration there
>> is no performance drop, because each NUMA node still has a local
>> normal zone. When trying to remove a memory device hosting the normal
>> zone, we just need to find another spare memory device and use
>> hardware memory migration to transparently migrate the memory contents
>> to the spare one. The drawback is a strong dependency on hardware
>> features, so it's not a common solution for all architectures.
> 
> I agree with you. If the BIOS and hardware support memory hotplug, the
> OS should use them. But if the OS cannot use them, we need to solve it
> in the OS. I think our proposal using ZONE_MOVABLE is the first step
> toward supporting memory hotplug.
Hi Yasuaki,
It's true, we should start with the first step and then improve it.
Regards!
Gerry

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-30 Thread Mel Gorman
On Fri, Nov 30, 2012 at 02:58:40AM +, Luck, Tony wrote:
> > If any significant percentage of memory is in ZONE_MOVABLE then the memory
> > hotplug people will have to deal with all the lowmem/highmem problems
> > that used to be faced by 32-bit x86 with PAE enabled. 
> 
> While these problems may still exist on large systems - I think it becomes
> harder to construct workloads that run into problems.  In those bad old days
> a significant fraction of lowmem was consumed by the kernel ... so it was
> pretty easy to find meta-data intensive workloads that would push it over
> a cliff.  Here we  are talking about systems with say 128GB per node divided
> into 64GB moveable and 64GB non-moveable (and I'd regard this as a rather
> low-end machine).  Unless the workload consists of zillions of tiny processes
> all mapping shared memory blocks, the percentage of memory allocated to
> the kernel is going to be tiny compared with the old 4GB days.
> 

Sure, if that's how the end-user decides to configure it. My concern is
what they'll do is configure node-0 to be ZONE_NORMAL and all other nodes
to be ZONE_MOVABLE -- 3 to 1 ratio "highmem" to "lowmem" effectively on
a 4-node machine or 7 to 1 on an 8-node. It'll be harder than it was in
the old days to trigger the problems but it'll still be possible and it
will generate bug reports down the road. Some will be obvious at least --
OOM killer triggered for GFP_KERNEL with plenty of free memory but all in
ZONE_MOVABLE. Others will be less obvious -- major stalls during IO tests
while ramping up with large amounts of reclaim activity visible even though
only 20-40% of memory is in use.

I'm not even getting into the impact this has on NUMA performance.

I'm not saying that ZONE_MOVABLE will not work. It will and it'll work
in the short-term but it's far from being a great long-term solution and
it is going to generate bug reports that will have to be supported by
distributions. Even if the interface to how it is configured gets ironed
out, there should still be a replacement plan in place. FWIW, I dislike
the command-line configuration option. If it were me, I would have gone
with starting a machine with memory mostly off-lined and used sysfs
files or different sysfs strings written to the "online" file to
determine whether a section was ZONE_MOVABLE or the next best
alternative.
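For reference, a sketch of what such a sysfs-driven flow could look like. The `memory32` section name and the `online_movable` token are illustrative; this assumes the standard /sys/devices/system/memory layout, a removable section, and root privileges:

```shell
# Take a memory section offline, then bring it back online into
# ZONE_MOVABLE explicitly, instead of deciding zone placement with a
# boot-time map. Section name and state tokens are illustrative.
echo offline        > /sys/devices/system/memory/memory32/state
echo online_movable > /sys/devices/system/memory/memory32/state
cat /sys/devices/system/memory/memory32/state
```

The advantage over a boot option is that the decision can be made per section at runtime, after the administrator knows the node layout.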

-- 
Mel Gorman
SUSE Labs


Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-30 Thread Glauber Costa
On 11/30/2012 06:58 AM, Luck, Tony wrote:
>> If any significant percentage of memory is in ZONE_MOVABLE then the memory
>> hotplug people will have to deal with all the lowmem/highmem problems
>> that used to be faced by 32-bit x86 with PAE enabled. 
> 
> While these problems may still exist on large systems - I think it becomes
> harder to construct workloads that run into problems.  In those bad old days
> a significant fraction of lowmem was consumed by the kernel ... so it was
> pretty easy to find meta-data intensive workloads that would push it over
> a cliff.  Here we  are talking about systems with say 128GB per node divided
> into 64GB moveable and 64GB non-moveable (and I'd regard this as a rather
> low-end machine).  Unless the workload consists of zillions of tiny processes
> all mapping shared memory blocks, the percentage of memory allocated to
> the kernel is going to be tiny compared with the old 4GB days.
> 

Which is a perfectly common workload for containers, where you can have
hundreds of machines (per node) sold to third parties, a lot of them
consuming every single bit of metadata they can.





Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-30 Thread Lai Jiangshan
On 11/28/2012 12:08 PM, Jiang Liu wrote:
> On 2012-11-28 11:24, Bob Liu wrote:
>> On Tue, Nov 27, 2012 at 8:49 PM, Tang Chen  wrote:
>>> On 11/27/2012 08:09 PM, Bob Liu wrote:

 On Tue, Nov 27, 2012 at 4:29 PM, Tang Chen
 wrote:
>
> Hi Liu,
>
>
> This feature is used in memory hotplug.
>
> In order to implement a whole node hotplug, we need to make sure the
> node contains no kernel memory, because memory used by kernel could
> not be migrated. (Since the kernel memory is directly mapped,
> VA = PA + __PAGE_OFFSET. So the physical address could not be changed.)
>
> User could specify all the memory on a node to be movable, so that the
> node could be hot-removed.
>

 Thank you for your explanation. It's reasonable.

 But I think it overlaps a bit with CMA. I'm not sure, but maybe we can
 combine it with CMA, which is already in mainline?

>>> Hi Liu,
>>>
>>> Thanks for your advice. :)
>>>
>>> CMA is the Contiguous Memory Allocator, right?  What I'm trying to do
>>> is control where the start of ZONE_MOVABLE is on each node. Could CMA
>>> do this job?
>>
>> CMA will not control the start of ZONE_MOVABLE on each node, but it
>> can declare a memory area that is always movable, and non-movable
>> allocation requests will not be served from that area.
>>
>> Currently CMA uses a boot parameter, "cma=", to declare a memory size
>> that is always movable. I think it might fulfill your requirement if
>> the boot parameter were extended with a start address.
>>
>> More info at http://lwn.net/Articles/468044/
>>>
>>> And also, after a short investigation, CMA seems to be based on
>>> memblock. But we need to keep memblock from allocating memory in
>>> ZONE_MOVABLE. As a result, we need to know the ranges before memblock
>>> can be used. I'm afraid we still need an approach to get the ranges,
>>> such as a boot option, or static ACPI tables such as SRAT/MPST.
>>>
>>
>> Yes, it's based on memblock and with boot option.
>> In setup_arch32()
>> dma_contiguous_reserve(0);   => will declare a cma area using
>> memblock_reserve()
>>
>> I don't know much about CMA for now. So if you have any better idea,
>> please share it with us, thanks. :)
>>
>> My idea is to reuse CMA, as in the patch below (not even compiled),
>> and boot with "cma=size@start_address".
>> I don't know whether it will work or whether it suits your
>> requirement; if not, forgive me for the noise.
>>
>> diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
>> index 612afcc..564962a 100644
>> --- a/drivers/base/dma-contiguous.c
>> +++ b/drivers/base/dma-contiguous.c
>> @@ -59,11 +59,17 @@ struct cma *dma_contiguous_default_area;
>>   */
>>  static const unsigned long size_bytes = CMA_SIZE_MBYTES * SZ_1M;
>>  static long size_cmdline = -1;
>> +static long cma_start_cmdline = -1;
>>
>>  static int __init early_cma(char *p)
>>  {
>>  	pr_debug("%s(%s)\n", __func__, p);
>>  	size_cmdline = memparse(p, &p);
>> +
>> +	if (*p == '@')
>> +		cma_start_cmdline = memparse(p + 1, &p);
>> +	pr_info("cma start: 0x%lx, size: 0x%lx\n",
>> +		cma_start_cmdline, size_cmdline);
>>  	return 0;
>>  }
>>  early_param("cma", early_cma);
>> @@ -127,8 +133,11 @@ void __init dma_contiguous_reserve(phys_addr_t limit)
>>  	if (selected_size) {
>>  		pr_debug("%s: reserving %ld MiB for global area\n", __func__,
>>  			 selected_size / SZ_1M);
>> -
>> -		dma_declare_contiguous(NULL, selected_size, 0, limit);
>> +		if (cma_start_cmdline != -1)
>> +			dma_declare_contiguous(NULL, selected_size,
>> +					       cma_start_cmdline, limit);
>> +		else
>> +			dma_declare_contiguous(NULL, selected_size, 0, limit);
>>  	}
>>  };
> Seems a good idea to reserve memory by reusing the CMA logic, though it
> needs more investigation. One of CMA's goals is to ensure that pages in
> CMA are really movable, and at first glance this patchset tries to
> achieve the same goal.
> 

The approach was already implemented: https://lkml.org/lkml/2012/7/4/145
(it adds a new MIGRATE_HOTREMOVE type instead of reusing MIGRATE_CMA).

MIGRATE_HOTREMOVE and MIGRATE_CMA both have this problem:
https://lkml.org/lkml/2012/7/5/83

R.I.P. for this idea.

zone->managed_pages (which you proposed, but which manages neither
MIGRATE_HOTREMOVE nor MIGRATE_CMA) plus a proxy zone (handling all of
MIGRATE_HOTREMOVE, MIGRATE_CMA and ZONE_MOVABLE for the node) may be a
good idea.

Thanks,
Lai


RE: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-29 Thread H. Peter Anvin
Disk I/O is still a big consumer of lowmem.

"Luck, Tony"  wrote:

>> If any significant percentage of memory is in ZONE_MOVABLE then the
>> memory hotplug people will have to deal with all the lowmem/highmem
>> problems that used to be faced by 32-bit x86 with PAE enabled.
>
> While these problems may still exist on large systems - I think it
> becomes harder to construct workloads that run into problems.  In those
> bad old days a significant fraction of lowmem was consumed by the
> kernel ... so it was pretty easy to find meta-data intensive workloads
> that would push it over a cliff.  Here we are talking about systems
> with say 128GB per node divided into 64GB moveable and 64GB
> non-moveable (and I'd regard this as a rather low-end machine).  Unless
> the workload consists of zillions of tiny processes all mapping shared
> memory blocks, the percentage of memory allocated to the kernel is
> going to be tiny compared with the old 4GB days.
>
> -Tony

-- 
Sent from my mobile phone. Please excuse brevity and lack of formatting.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-29 Thread Yasuaki Ishimatsu

Hi Jiang,

2012/11/30 11:56, Jiang Liu wrote:

> Hi Mel,
> Thanks for your great comments!
> 
> On 2012-11-29 19:00, Mel Gorman wrote:
>> On Wed, Nov 28, 2012 at 01:38:47PM -0800, H. Peter Anvin wrote:
>>> On 11/28/2012 01:34 PM, Luck, Tony wrote:
>>>>> 2. use boot option
>>>>>    This is our proposal. New boot option can specify memory range to use
>>>>>    as movable memory.
>>>>
>>>> Isn't this just moving the work to the user? To pick good values for the
>>>> movable areas, they need to know how the memory lines up across
>>>> node boundaries ... because they need to make sure to allow some
>>>> non-movable memory allocations on each node so that the kernel can
>>>> take advantage of node locality.
>>>>
>>>> So the user would have to read at least the SRAT table, and perhaps
>>>> more, to figure out what to provide as arguments.
>>>>
>>>> Since this is going to be used on a dynamic system where nodes might
>>>> be added and removed - the right values for these arguments might
>>>> change from one boot to the next. So even if the user gets them right
>>>> on day 1, a month later when a new node has been added, or a broken
>>>> node removed the values would be stale.
>>>
>>> I gave this feedback in person at LCE: I consider the kernel
>>> configuration option to be useless for anything other than debugging.
>>> Trying to promote it as an actual solution, to be used by end users in
>>> the field, is ridiculous at best.
>>
>> I've not been paying a whole pile of attention to this because it's not an
>> area I'm active in but I agree that configuring ZONE_MOVABLE like
>> this at boot-time is going to be problematic. As awkward as it is, it
>> would probably work out better to only boot with one node by default and
>> then hot-add the nodes at runtime using either an online sysfs file or
>> an online-reserved file that hot-adds the memory to ZONE_MOVABLE. Still
>> clumsy but better than specifying addresses on the command line.
>>
>> That said, I also find using ZONE_MOVABLE to be a problem in itself that
>> will cause problems down the road. Maybe this was discussed already but
>> just in case I'll describe the problems I see.
>>
>> If any significant percentage of memory is in ZONE_MOVABLE then the memory
>> hotplug people will have to deal with all the lowmem/highmem problems
>> that used to be faced by 32-bit x86 with PAE enabled. As a simple example,
>> metadata intensive workloads will not be able to use all of memory because
>> the kernel allocations will be confined to a subset of memory. A more
>> complex example is that page table page allocations are also restricted
>> meaning it's possible that a process will not even be able to mmap() a high
>> percentage of memory simply because it cannot allocate the page tables to
>> store the mappings. ZONE_MOVABLE works up to a *point*, but it's a hack. It
>> was a hack when it was introduced but at least then the expectation was
>> that ZONE_MOVABLE was going to be used for huge pages, and there was at
>> least an expectation that it would not be available for normal usage.
>>
>> Fundamentally the reason one would want to use ZONE_MOVABLE is because
>> we cannot migrate a lot of kernel memory -- slab pages, page table pages,
>> device-allocated buffers etc.  My understanding is that other OS's get around
>> this by requiring that subsystems and drivers have callbacks that allow the
>> core VM to force certain memory to be released but that may be impractical
>> for Linux. I don't know for sure though, this is just what I heard.
>
> As far as I know, one other OS limits immovable pages to the low end, and
> the limit increases on demand. But the drawback of this solution is a
> serious performance drop (about 10% on average) because it essentially
> disables NUMA optimization for kernel/DMA memory allocations.
>
>> For Linux, the hotplug people need to start thinking about how to get
>> around this migration problem. The first problem faced is the memory model
>> and how it maps virt->phys addresses. We have a 1:1 mapping because it's
>> fast but not because it's a fundamental requirement. Start considering
>> what happens if the memory model is changed to allow some sections to have
>> fast lookup for virt_to_phys and other sections to have slow lookups. On
>> hotplug, try and empty all the sections. If the section cannot be emptied
>> because of kernel pages then the section gets marked as "offline-migrated"
>> or something. Stop the whole machine (yes, I mean stop_machine), copy
>> those unmovable pages to another location, update the kernel virt->phys
>> mapping for the section being offlined so the virt addresses point to the
>> new physical addresses and resume.  Virt->phys lookups are going to be
>> a lot slower because a full section lookup will be necessary every time
>> effectively breaking SPARSE_VMEMMAP and there will be a performance penalty
>> but it should work. This will cover some slab pages where the data is only
>> accessed via the virtual address -- inode caches, dcache etc.
>>
>> It will not work where the physical address is used. The obvious example
>> is page table pages. For page tables, during stop machine you will have to
>> walk all processes' page tables looking for references to the page you're
>> trying to

Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-29 Thread Jiang Liu
Hi Mel,
Thanks for your great comments!

On 2012-11-29 19:00, Mel Gorman wrote:
> On Wed, Nov 28, 2012 at 01:38:47PM -0800, H. Peter Anvin wrote:
>> On 11/28/2012 01:34 PM, Luck, Tony wrote:

>>>> 2. use boot option
>>>>   This is our proposal. New boot option can specify memory range to use
>>>>   as movable memory.
>>>
>>> Isn't this just moving the work to the user? To pick good values for the
>>> movable areas, they need to know how the memory lines up across
>>> node boundaries ... because they need to make sure to allow some
>>> non-movable memory allocations on each node so that the kernel can
>>> take advantage of node locality.
>>>
>>> So the user would have to read at least the SRAT table, and perhaps
>>> more, to figure out what to provide as arguments.
>>>
>>> Since this is going to be used on a dynamic system where nodes might
>>> be added and removed - the right values for these arguments might
>>> change from one boot to the next. So even if the user gets them right
>>> on day 1, a month later when a new node has been added, or a broken
>>> node removed the values would be stale.
>>>
>>
>> I gave this feedback in person at LCE: I consider the kernel
>> configuration option to be useless for anything other than debugging.
>> Trying to promote it as an actual solution, to be used by end users in
>> the field, is ridiculous at best.
>>
> 
> I've not been paying a whole pile of attention to this because it's not an
> area I'm active in but I agree that configuring ZONE_MOVABLE like
> this at boot-time is going to be problematic. As awkward as it is, it
> would probably work out better to only boot with one node by default and
> then hot-add the nodes at runtime using either an online sysfs file or
> an online-reserved file that hot-adds the memory to ZONE_MOVABLE. Still
> clumsy but better than specifying addresses on the command line.
> 
> That said, I also find using ZONE_MOVABLE to be a problem in itself that
> will cause problems down the road. Maybe this was discussed already but
> just in case I'll describe the problems I see.
> 
> If any significant percentage of memory is in ZONE_MOVABLE then the memory
> hotplug people will have to deal with all the lowmem/highmem problems
> that used to be faced by 32-bit x86 with PAE enabled. As a simple example,
> metadata intensive workloads will not be able to use all of memory because
> the kernel allocations will be confined to a subset of memory. A more
> complex example is that page table page allocations are also restricted
> meaning it's possible that a process will not even be able to mmap() a high
> percentage of memory simply because it cannot allocate the page tables to
> store the mappings. ZONE_MOVABLE works up to a *point*, but it's a hack. It
> was a hack when it was introduced but at least then the expectation was
> that ZONE_MOVABLE was going to be used for huge pages, and there was at
> least an expectation that it would not be available for normal usage.
> 
> Fundamentally the reason one would want to use ZONE_MOVABLE is because
> we cannot migrate a lot of kernel memory -- slab pages, page table pages,
> device-allocated buffers etc.  My understanding is that other OS's get around
> this by requiring that subsystems and drivers have callbacks that allow the
> core VM to force certain memory to be released but that may be impractical
> for Linux. I don't know for sure though, this is just what I heard.
As far as I know, one other OS limits immovable pages to the low end, and
the limit increases on demand. But the drawback of this solution is a
serious performance drop (about 10% on average) because it essentially
disables NUMA optimization for kernel/DMA memory allocations.

> For Linux, the hotplug people need to start thinking about how to get
> around this migration problem. The first problem faced is the memory model
> and how it maps virt->phys addresses. We have a 1:1 mapping because it's
> fast but not because it's a fundamental requirement. Start considering
> what happens if the memory model is changed to allow some sections to have
> fast lookup for virt_to_phys and other sections to have slow lookups. On
> hotplug, try and empty all the sections. If the section cannot be emptied
> because of kernel pages then the section gets marked as "offline-migrated"
> or something. Stop the whole machine (yes, I mean stop_machine), copy
> those unmovable pages to another location, update the kernel virt->phys
> mapping for the section being offlined so the virt addresses point to the
> new physical addresses and resume.  Virt->phys lookups are going to be
> a lot slower because a full section lookup will be necessary every time
> effectively breaking SPARSE_VMEMMAP and there will be a performance penalty
> but it should work. This will cover some slab pages where the data is only
> accessed via the virtual address -- inode caches, dcache etc.
> 
> It will not work where the physical address is used. The obvious example
> is page table 

RE: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-29 Thread Luck, Tony
> If any significant percentage of memory is in ZONE_MOVABLE then the memory
> hotplug people will have to deal with all the lowmem/highmem problems
> that used to be faced by 32-bit x86 with PAE enabled. 

While these problems may still exist on large systems - I think it becomes
harder to construct workloads that run into problems.  In those bad old days
a significant fraction of lowmem was consumed by the kernel ... so it was
pretty easy to find meta-data intensive workloads that would push it over
a cliff.  Here we  are talking about systems with say 128GB per node divided
into 64GB moveable and 64GB non-moveable (and I'd regard this as a rather
low-end machine).  Unless the workload consists of zillions of tiny processes
all mapping shared memory blocks, the percentage of memory allocated to
the kernel is going to be tiny compared with the old 4GB days.

-Tony



Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-29 Thread H. Peter Anvin
On 11/29/2012 02:41 PM, Luck, Tony wrote:
>> The other bit is that if you really really want high reliability, memory
>> mirroring is the way to go; it is the only way you will be able to
>> hotremove memory without having to have a pre-event to migrate the
>> memory away from the affected node before the memory is offlined.
> 
> Some platforms don't support cross-node mirrors ... but we still want to
> be able to remove a node.
> 

Yes, well, those platforms don't support that degree of "really really
high reliability", since the unannounced failure of the node controller
will bring down the system.

-hpa




RE: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-29 Thread Luck, Tony
> The other bit is that if you really really want high reliability, memory
> mirroring is the way to go; it is the only way you will be able to
> hotremove memory without having to have a pre-event to migrate the
> memory away from the affected node before the memory is offlined.

Some platforms don't support cross-node mirrors ... but we still want to
be able to remove a node.

-Tony


Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-29 Thread H. Peter Anvin
On 11/29/2012 03:00 AM, Mel Gorman wrote:
> 
> I've not been paying a whole pile of attention to this because it's not an
> area I'm active in but I agree that configuring ZONE_MOVABLE like
> this at boot-time is going to be problematic. As awkward as it is, it
> would probably work out better to only boot with one node by default and
> then hot-add the nodes at runtime using either an online sysfs file or
> an online-reserved file that hot-adds the memory to ZONE_MOVABLE. Still
> clumsy but better than specifying addresses on the command line.
> 
> That said, I also find using ZONE_MOVABLE to be a problem in itself that
> will cause problems down the road. Maybe this was discussed already but
> just in case I'll describe the problems I see.
> 

Yes, and it does mean that we definitely don't want everything that can
be in ZONE_MOVABLE to be there without administrator control.  I suspect
that a lot of users of such platforms actually will not use the feature,
and don't want to take the substantial penalty.

The other bit is that if you really really want high reliability, memory
mirroring is the way to go; it is the only way you will be able to
hotremove memory without having to have a pre-event to migrate the
memory away from the affected node before the memory is offlined.

-hpa



Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-29 Thread Jiang Liu
Hi Yasuaki,
Forgot to mention that I have no objection to this patchset.
I think it's a good starting point, but we still need to improve the
usability of memory hotplug by passing platform-specific information from
the BIOS. And the mechanism provided by this patchset may be used to
improve usability too.

Regards!
Gerry

On 11/29/2012 06:38 PM, Yasuaki Ishimatsu wrote:
> Hi Tony,
> 
> 2012/11/29 6:34, Luck, Tony wrote:
>>> 1. use firmware information
>>>According to ACPI spec 5.0, SRAT table has memory affinity structure
>>>and the structure has Hot Pluggable Field. See "5.2.16.2 Memory
>>>Affinity Structure". If we use the information, we might be able to
>>>specify movable memory by firmware. For example, if Hot Pluggable
>>>Field is enabled, Linux sets the memory as movable memory.
>>>
>>> 2. use boot option
>>>This is our proposal. New boot option can specify memory range to use
>>>as movable memory.
>>
>> Isn't this just moving the work to the user? To pick good values for the
> 
> Yes.
> 
>> movable areas, they need to know how the memory lines up across
>> node boundaries ... because they need to make sure to allow some
>> non-movable memory allocations on each node so that the kernel can
>> take advantage of node locality.
> 
> There is no problem.
> Linux has already two boot options, kernelcore= and movablecore=.
> So if we use them, non-movable memory is divided into each node evenly.
> 
> But there is no way to specify a node used as movable currently. So
> we proposed the new boot option.
> 
>> So the user would have to read at least the SRAT table, and perhaps
>> more, to figure out what to provide as arguments.
>>
> 
>> Since this is going to be used on a dynamic system where nodes might
>> be added and removed - the right values for these arguments might
>> change from one boot to the next. So even if the user gets them right
>> on day 1, a month later when a new node has been added, or a broken
>> node removed the values would be stale.
> 
> I don't think so. Even if we hot add/remove node, the memory range of
> each memory device is not changed. So we don't need to change the boot
> option.
> 
> Thanks,
> Yasuaki Ishimatsu
> 
>>
>> -Tony
>>
> 
> 



Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-29 Thread Jiang Liu
On 11/29/2012 06:38 PM, Yasuaki Ishimatsu wrote:
> Hi Tony,
> 
> 2012/11/29 6:34, Luck, Tony wrote:
>>> 1. use firmware information
>>>According to ACPI spec 5.0, SRAT table has memory affinity structure
>>>and the structure has Hot Pluggable Field. See "5.2.16.2 Memory
>>>Affinity Structure". If we use the information, we might be able to
>>>specify movable memory by firmware. For example, if Hot Pluggable
>>>Field is enabled, Linux sets the memory as movable memory.
>>>
>>> 2. use boot option
>>>This is our proposal. New boot option can specify memory range to use
>>>as movable memory.
>>
>> Isn't this just moving the work to the user? To pick good values for the
> 
> Yes.
> 
>> movable areas, they need to know how the memory lines up across
>> node boundaries ... because they need to make sure to allow some
>> non-movable memory allocations on each node so that the kernel can
>> take advantage of node locality.
> 
> There is no problem.
> Linux has already two boot options, kernelcore= and movablecore=.
> So if we use them, non-movable memory is divided into each node evenly.
> 
> But there is no way to specify a node used as movable currently. So
> we proposed the new boot option.
> 
>> So the user would have to read at least the SRAT table, and perhaps
>> more, to figure out what to provide as arguments.
>>
> 
>> Since this is going to be used on a dynamic system where nodes might
>> be added and removed - the right values for these arguments might
>> change from one boot to the next. So even if the user gets them right
>> on day 1, a month later when a new node has been added, or a broken
>> node removed the values would be stale.
> 
> I don't think so. Even if we hot add/remove node, the memory range of
> each memory device is not changed. So we don't need to change the boot
> option.
Hi Yasuaki,
Addresses assigned to each memory device may change under different 
hardware configurations.
According to my experience with some hotplug-capable Xeon and Itanium
systems, a typical algorithm adopted by BIOS to support memory hotplug is:
1) For backward compatibility, BIOS assigns contiguous addresses to memory
devices present at boot time. In other words, there are no holes in the
memory addresses except the hole just below 4G reserved for MMIO and other
arch-specific usage.
2) To support memory hotplug, BIOS reserves enough memory address ranges
at the high end.

Let's take a typical 4-socket system as an example. Say we have four
sockets S0-S3, each socket supports two memory devices (M0-M1) at maximum,
and each memory device supports 128G memory at maximum. At boot, all
memory slots are fully populated with 4GB memory. Then the address
assignment looks like:
0-2G:   S0.M0
2-4G:   MMIO
4-8G:   S0.M1
8-12G:  S1.M0
12-16G: S1.M1
16-20G: S2.M0
20-24G: S2.M1
24-28G: S3.M0
28-32G: S3.M1
32-34G: S0.M0 (memory recovered from the MMIO hole)
1024-1152G: reserved for S0.M0
1152-1280G: reserved for S0.M1
1280-1408G: reserved for S1.M0
1408-1536G: reserved for S1.M1
1536-1664G: reserved for S2.M0
1664-1792G: reserved for S2.M1
1792-1920G: reserved for S3.M0
1920-2048G: reserved for S3.M1

If we hot-remove S2.M0 and add back a bigger memory device with 8G memory,
it will be assigned a new memory address range, 1536-1544G.
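The sensitivity to hardware changes follows directly from this algorithm:
a device's boot-time address depends on which other devices happen to be
present, while its hot-add address comes from a fixed per-slot window. A
minimal model of the reservation rule (a sketch of the algorithm described
above with a made-up helper name, not any real BIOS's code):

```c
#include <stdint.h>

#define GB (1ULL << 30)

/* Model of the described BIOS policy: each of the 8 socket/device slots
 * owns a fixed 128 GB window starting at 1024 GB, used when the device is
 * hot-added or replaced after boot. */
uint64_t reserved_base(int socket, int device)
{
    return 1024 * GB + (uint64_t)(socket * 2 + device) * 128 * GB;
}
```

With this model, reserved_base(2, 0) is 1536G, so the replaced 8G S2.M0
lands at 1536-1544G as above, while at boot the same device sat at 16-20G:
a movablecore_map range naming 16-20G silently stops covering it.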

Based on the above algorithm, say we configure 16-24G (S2.M0 and S2.M1) as
movable memory. Then:
1) Memory on S3 will be configured as movable if S2 isn't present at boot
time (the same effect as "movable_node" in the discussion at
https://lkml.org/lkml/2012/11/27/154).
2) S2.M0 will be configured as non-movable and S3.M0 will be configured as
movable if S1.M0 isn't present at boot.
3) And how about replacing S1.M0 with an 8GB memory device?

To summarize, a kernel parameter configuring movable memory for hotplug
will easily become invalid if the hardware configuration changes, and that
may confuse administrators. I still think the most reliable way is to
figure out the movable memory for hotplug by parsing hardware configuration
information from BIOS.
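Getting this from the BIOS essentially means walking the SRAT and honoring
the Hot Pluggable flag. A sketch of the relevant structure and check (field
layout per ACPI 5.0 "5.2.16.2 Memory Affinity Structure"; the names here
are illustrative and only loosely follow the kernel's ACPICA definitions):

```c
#include <stdint.h>

/* ACPI 5.0, "5.2.16.2 Memory Affinity Structure": type 1, 40 bytes. */
struct srat_mem_affinity {
    uint8_t  type;               /* 1 for memory affinity */
    uint8_t  length;             /* 40 */
    uint32_t proximity_domain;
    uint16_t reserved1;
    uint64_t base_address;       /* start of the memory range */
    uint64_t range_length;       /* length of the memory range */
    uint32_t reserved2;
    uint32_t flags;              /* bit 0: Enabled, bit 1: Hot Pluggable */
    uint64_t reserved3;
} __attribute__((packed));

#define SRAT_MEM_ENABLED        (1u << 0)
#define SRAT_MEM_HOT_PLUGGABLE  (1u << 1)

/* A range is a candidate for ZONE_MOVABLE only if firmware marks it both
 * enabled and hot-pluggable. */
int srat_range_is_hotpluggable(const struct srat_mem_affinity *m)
{
    return (m->flags & SRAT_MEM_ENABLED) &&
           (m->flags & SRAT_MEM_HOT_PLUGGABLE);
}
```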

Regards!
Gerry



Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-29 Thread Mel Gorman
On Thu, Nov 29, 2012 at 07:38:26PM +0900, Yasuaki Ishimatsu wrote:
> Hi Tony,
> 
> 2012/11/29 6:34, Luck, Tony wrote:
> >>1. use firmware information
> >>   According to ACPI spec 5.0, SRAT table has memory affinity structure
> >>   and the structure has Hot Pluggable Field. See "5.2.16.2 Memory
> >>   Affinity Structure". If we use the information, we might be able to
> >>   specify movable memory by firmware. For example, if Hot Pluggable
> >>   Field is enabled, Linux sets the memory as movable memory.
> >>
> >>2. use boot option
> >>   This is our proposal. New boot option can specify memory range to use
> >>   as movable memory.
> >
> >Isn't this just moving the work to the user? To pick good values for the
> 
> Yes.
> 
> >movable areas, they need to know how the memory lines up across
> >node boundaries ... because they need to make sure to allow some
> >non-movable memory allocations on each node so that the kernel can
> >take advantage of node locality.
> 
> There is no problem.
> Linux has already two boot options, kernelcore= and movablecore=.
> So if we use them, non-movable memory is divided into each node evenly.
> 

The motivation for those options was to reserve a percentage of memory
to be used for hugepage allocation. If hugepages were not being used at
a particular time then they could be used for other purposes. While the
system could in theory face lowmem/highmem style problems, in practice
it did not happen because the memory would be allocated as hugetlbfs
pages and unavailable anyway. The same does not really apply to a general
purpose system that you want to support memory hot-remove on so be wary of
lowmem/highmem style problems caused by relying too heavily on ZONE_MOVABLE.
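For reference, the two existing options mentioned above are used like this
(sizes purely illustrative; see Documentation/kernel-parameters.txt):

```
# Reserve 64G total for ZONE_MOVABLE, spread evenly across nodes:
linux ... movablecore=64G

# Or cap non-movable memory at 192G; the remainder becomes ZONE_MOVABLE:
linux ... kernelcore=192G
```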

-- 
Mel Gorman
SUSE Labs


Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-29 Thread Mel Gorman
On Wed, Nov 28, 2012 at 01:38:47PM -0800, H. Peter Anvin wrote:
> On 11/28/2012 01:34 PM, Luck, Tony wrote:
> >>
> >> 2. use boot option
> >>   This is our proposal. New boot option can specify memory range to use
> >>   as movable memory.
> > 
> > Isn't this just moving the work to the user? To pick good values for the
> > movable areas, they need to know how the memory lines up across
> > node boundaries ... because they need to make sure to allow some
> > non-movable memory allocations on each node so that the kernel can
> > take advantage of node locality.
> > 
> > So the user would have to read at least the SRAT table, and perhaps
> > more, to figure out what to provide as arguments.
> > 
> > Since this is going to be used on a dynamic system where nodes might
> > be added and removed - the right values for these arguments might
> > change from one boot to the next. So even if the user gets them right
> > on day 1, a month later when a new node has been added, or a broken
> > node removed the values would be stale.
> > 
> 
> I gave this feedback in person at LCE: I consider the kernel
> configuration option to be useless for anything other than debugging.
> Trying to promote it as an actual solution, to be used by end users in
> the field, is ridiculous at best.
> 

I've not been paying a whole pile of attention to this because it's not an
area I'm active in but I agree that configuring ZONE_MOVABLE like
this at boot-time is going to be problematic. As awkward as it is, it
would probably work out better to only boot with one node by default and
then hot-add the nodes at runtime using either an online sysfs file or
an online-reserved file that hot-adds the memory to ZONE_MOVABLE. Still
clumsy but better than specifying addresses on the command line.

That said, I also find using ZONE_MOVABLE to be a problem in itself that
will cause problems down the road. Maybe this was discussed already but
just in case I'll describe the problems I see.

If any significant percentage of memory is in ZONE_MOVABLE then the memory
hotplug people will have to deal with all the lowmem/highmem problems
that used to be faced by 32-bit x86 with PAE enabled. As a simple example,
metadata intensive workloads will not be able to use all of memory because
the kernel allocations will be confined to a subset of memory. A more
complex example is that page table page allocations are also restricted
meaning it's possible that a process will not even be able to mmap() a high
percentage of memory simply because it cannot allocate the page tables to
store the mappings. ZONE_MOVABLE works up to a *point*, but it's a hack. It
was a hack when it was introduced but at least then the expectation was
that ZONE_MOVABLE was going to be used for huge pages, and there was at
least an expectation that it would not be available for normal usage.

Fundamentally the reason one would want to use ZONE_MOVABLE is because
we cannot migrate a lot of kernel memory -- slab pages, page table pages,
device-allocated buffers etc.  My understanding is that other OS's get around
this by requiring that subsystems and drivers have callbacks that allow the
core VM to force certain memory to be released but that may be impractical
for Linux. I don't know for sure though, this is just what I heard.

For Linux, the hotplug people need to start thinking about how to get
around this migration problem. The first problem faced is the memory model
and how it maps virt->phys addresses. We have a 1:1 mapping because it's
fast but not because it's a fundamental requirement. Start considering
what happens if the memory model is changed to allow some sections to have
fast lookup for virt_to_phys and other sections to have slow lookups. On
hotplug, try and empty all the sections. If the section cannot be emptied
because of kernel pages then the section gets marked as "offline-migrated"
or something. Stop the whole machine (yes, I mean stop_machine), copy
those unmovable pages to another location, update the kernel virt->phys
mapping for the section being offlined so the virt addresses point to the
new physical addresses and resume.  Virt->phys lookups are going to be
a lot slower because a full section lookup will be necessary every time
effectively breaking SPARSE_VMEMMAP and there will be a performance penalty
but it should work. This will cover some slab pages where the data is only
accessed via the virtual address -- inode caches, dcache etc.
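The slow-lookup idea can be modeled as a per-section delta table consulted
on every translation: sections still in the identity map cost one extra
array read, while retargeted sections transparently point at the copied
pages. Purely illustrative (constants and names invented, not actual
kernel code):

```c
#include <stdint.h>

#define SECTION_SHIFT 27U                       /* 128 MB sections */
#define MAX_SECTIONS  1024
#define PAGE_OFFSET   0xffff880000000000ULL     /* direct-map base, x86-64 */

/* Per-section physical delta, set after stop_machine() has copied a
 * section's unmovable pages elsewhere; 0 means the section is untouched. */
static int64_t section_delta[MAX_SECTIONS];

uint64_t slow_virt_to_phys(uint64_t vaddr)
{
    uint64_t phys = vaddr - PAGE_OFFSET;        /* fast-path 1:1 answer */
    uint64_t sec  = phys >> SECTION_SHIFT;      /* extra lookup, every time */
    return phys + (uint64_t)section_delta[sec];
}

/* Called under stop_machine() once the section's pages have been copied. */
void retarget_section(uint64_t sec, int64_t delta)
{
    section_delta[sec] = delta;
}
```

This captures why the cost is paid on every virt->phys conversion, and why
it only helps pages always accessed through the mapping, not pages whose
physical address has been handed out (page tables, DMA buffers).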

It will not work where the physical address is used. The obvious example
is page table pages. For page tables, during stop machine you will have to
walk all processes' page tables looking for references to the page you're
trying to move and update them. It is possible to just plain migrate
page table pages but when it was last implemented years ago there was a
constant performance penalty for everybody and it was not popular.  Taking a
heavy-handed approach just during memory hot-remove might be more palatable.

For the remaining pages such 

Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-29 Thread Yasuaki Ishimatsu

Hi Tony,

2012/11/29 6:34, Luck, Tony wrote:

>> 1. use firmware information
>>    According to ACPI spec 5.0, SRAT table has memory affinity structure
>>    and the structure has Hot Pluggable Field. See "5.2.16.2 Memory
>>    Affinity Structure". If we use the information, we might be able to
>>    specify movable memory by firmware. For example, if Hot Pluggable
>>    Field is enabled, Linux sets the memory as movable memory.
>>
>> 2. use boot option
>>    This is our proposal. New boot option can specify memory range to use
>>    as movable memory.
>
> Isn't this just moving the work to the user? To pick good values for the

Yes.

> movable areas, they need to know how the memory lines up across
> node boundaries ... because they need to make sure to allow some
> non-movable memory allocations on each node so that the kernel can
> take advantage of node locality.

There is no problem.
Linux has already two boot options, kernelcore= and movablecore=.
So if we use them, non-movable memory is divided evenly among the nodes.

But there is currently no way to specify a node to be used as movable. So
we proposed the new boot option.

> So the user would have to read at least the SRAT table, and perhaps
> more, to figure out what to provide as arguments.
>
> Since this is going to be used on a dynamic system where nodes might
> be added and removed - the right values for these arguments might
> change from one boot to the next. So even if the user gets them right
> on day 1, a month later when a new node has been added, or a broken
> node removed the values would be stale.

I don't think so. Even if we hot add/remove a node, the memory range of
each memory device is not changed. So we don't need to change the boot
option.

Thanks,
Yasuaki Ishimatsu

> -Tony






Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-29 Thread Yasuaki Ishimatsu

Hi Tony,

2012/11/29 6:34, Luck, Tony wrote:

1. use firmware information
   According to ACPI spec 5.0, the SRAT table has a memory affinity structure,
   and the structure has a Hot Pluggable Field. See 5.2.16.2 "Memory
   Affinity Structure". If we use the information, we might be able to
   specify movable memory by firmware. For example, if the Hot Pluggable
   Field is enabled, Linux sets the memory as movable memory.

2. use boot option
   This is our proposal. A new boot option can specify a memory range to use
   as movable memory.
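For reference, option 1 can be sketched in C. The layout below is transcribed from the ACPI 5.0 spec for illustration only; the kernel's own (packed) definition lives in include/acpi/actbl1.h, and the helper name here is hypothetical:

```c
#include <stdint.h>

/* Memory Affinity Structure, ACPI 5.0 section 5.2.16.2 (40 bytes in the
 * spec; a real decoder must use a packed layout). */
struct srat_mem_affinity {
    uint8_t  type;              /* 1 = Memory Affinity Structure */
    uint8_t  length;            /* 40 */
    uint32_t proximity_domain;  /* NUMA node of this range */
    uint16_t reserved1;
    uint64_t base_address;      /* start of the physical range */
    uint64_t range_length;      /* size of the physical range */
    uint32_t reserved2;
    uint32_t flags;
    uint64_t reserved3;
};

#define SRAT_MEM_ENABLED       (1u << 0)  /* bit 0: Enabled */
#define SRAT_MEM_HOT_PLUGGABLE (1u << 1)  /* bit 1: Hot Pluggable */

/* A range is a candidate for movable memory when firmware marks it both
 * enabled and hot pluggable. */
int srat_mem_is_movable_candidate(const struct srat_mem_affinity *m)
{
    return (m->flags & SRAT_MEM_ENABLED) &&
           (m->flags & SRAT_MEM_HOT_PLUGGABLE);
}
```

With this, the firmware-driven policy reduces to: for each SRAT memory affinity entry, treat the range as movable iff the check above holds.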


Isn't this just moving the work to the user? To pick good values for the


Yes.


movable areas, they need to know how the memory lines up across
node boundaries ... because they need to make sure to allow some
non-movable memory allocations on each node so that the kernel can
take advantage of node locality.


There is no problem.
Linux already has two boot options, kernelcore= and movablecore=.
If we use them, non-movable memory is divided evenly among the nodes.

But there is currently no way to specify that a particular node should be
movable. So we proposed the new boot option.


So the user would have to read at least the SRAT table, and perhaps
more, to figure out what to provide as arguments.




Since this is going to be used on a dynamic system where nodes might
be added and removed - the right values for these arguments might
change from one boot to the next. So even if the user gets them right
on day 1, a month later when a new node has been added, or a broken
node removed, the values would be stale.


I don't think so. Even if we hot-add/remove a node, the memory range of
each memory device does not change. So we don't need to change the boot
option.

Thanks,
Yasuaki Ishimatsu



-Tony




--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-29 Thread Mel Gorman
On Wed, Nov 28, 2012 at 01:38:47PM -0800, H. Peter Anvin wrote:
 On 11/28/2012 01:34 PM, Luck, Tony wrote:
 
  2. use boot option
This is our proposal. New boot option can specify memory range to use
as movable memory.
  
  Isn't this just moving the work to the user? To pick good values for the
  movable areas, they need to know how the memory lines up across
  node boundaries ... because they need to make sure to allow some
  non-movable memory allocations on each node so that the kernel can
  take advantage of node locality.
  
  So the user would have to read at least the SRAT table, and perhaps
  more, to figure out what to provide as arguments.
  
  Since this is going to be used on a dynamic system where nodes might
  be added an removed - the right values for these arguments might
  change from one boot to the next. So even if the user gets them right
  on day 1, a month later when a new node has been added, or a broken
  node removed the values would be stale.
  
 
 I gave this feedback in person at LCE: I consider the kernel
 configuration option to be useless for anything other than debugging.
 Trying to promote it as an actual solution, to be used by end users in
 the field, is ridiculous at best.
 

I've not been paying a whole pile of attention to this because it's not an
area I'm active in but I agree that configuring ZONE_MOVABLE like
this at boot-time is going to be problematic. As awkward as it is, it
would probably work out better to only boot with one node by default and
then hot-add the nodes at runtime using either an online sysfs file or
an online-reserved file that hot-adds the memory to ZONE_MOVABLE. Still
clumsy but better than specifying addresses on the command line.

That said, I also find using ZONE_MOVABLE to be a problem in itself that
will cause problems down the road. Maybe this was discussed already but
just in case I'll describe the problems I see.

If any significant percentage of memory is in ZONE_MOVABLE then the memory
hotplug people will have to deal with all the lowmem/highmem problems
that used to be faced by 32-bit x86 with PAE enabled. As a simple example,
metadata intensive workloads will not be able to use all of memory because
the kernel allocations will be confined to a subset of memory. A more
complex example is that page table page allocations are also restricted
meaning it's possible that a process will not even be able to mmap() a high
percentage of memory simply because it cannot allocate the page tables to
store the mappings. ZONE_MOVABLE works up to a *point*, but it's a hack. It
was a hack when it was introduced, but at least then the expectation was
that ZONE_MOVABLE was going to be used for huge pages, and there was at
least an expectation that it would not be available for normal usage.

Fundamentally the reason one would want to use ZONE_MOVABLE is because
we cannot migrate a lot of kernel memory -- slab pages, page table pages,
device-allocated buffers etc.  My understanding is that other OSes get around
this by requiring that subsystems and drivers have callbacks that allow the
core VM to force certain memory to be released but that may be impractical
for Linux. I don't know for sure though, this is just what I heard.

For Linux, the hotplug people need to start thinking about how to get
around this migration problem. The first problem faced is the memory model
and how it maps virt-phys addresses. We have a 1:1 mapping because it's
fast but not because it's a fundamental requirement. Start considering
what happens if the memory model is changed to allow some sections to have
fast lookup for virt_to_phys and other sections to have slow lookups. On
hotplug, try and empty all the sections. If the section cannot be emptied
because of kernel pages then the section gets marked as offline-migrated
or something. Stop the whole machine (yes, I mean stop_machine), copy
those unmovable pages to another location, update the kernel virt-phys
mapping for the section being offlined so the virt addresses point to the
new physical addresses and resume.  Virt-phys lookups are going to be
a lot slower because a full section lookup will be necessary every time
effectively breaking SPARSE_VMEMMAP and there will be a performance penalty
but it should work. This will cover some slab pages where the data is only
accessed via the virtual address -- inode caches, dcache etc.
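As a thought experiment only (none of this exists in the kernel), the fast/slow lookup split described above could look like the following toy model, where an offlined section carries a relocation delta on top of the linear map; all names and constants are illustrative:

```c
#include <stdint.h>

#define SECTION_SHIFT 27                     /* 128 MB sections, as with SPARSEMEM */
#define NSECTIONS     64
#define PAGE_OFFSET   0xffff880000000000ULL  /* x86-64 direct-map base */

/* section_delta[i] is added on top of the linear rule for section i;
 * 0 means the fast 1:1 mapping still holds for that section. */
int64_t section_delta[NSECTIONS];

/* Called (conceptually) under stop_machine, after the section's unmovable
 * pages have been copied to a new physical home. */
void retarget_section(unsigned sec, int64_t delta)
{
    section_delta[sec] = delta;
}

/* Every lookup now pays for a section-table index -- this is the slow path
 * and the performance penalty described above. */
uint64_t model_virt_to_phys(uint64_t vaddr)
{
    uint64_t linear = vaddr - PAGE_OFFSET;
    return linear + (uint64_t)section_delta[linear >> SECTION_SHIFT];
}
```

Sections that were never offlined keep a zero delta, so their lookups still follow the plain linear rule.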

It will not work where the physical address is used. The obvious example
is page table pages. For page tables, during stop machine you will have to
walk all processes page tables looking for references to the page you're
trying to move and update them. It is possible to just plain migrate
page table pages but when it was last implemented years ago there was a
constant performance penalty for everybody and it was not popular.  Taking a
heavy-handed approach just during memory hot-remove might be more palatable.
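One way to picture that heavy-handed fix-up pass (a sketch under simplifying assumptions, not an implementation -- real page tables are multi-level and per-mm): with everything stopped, scan the page-table entries and rewrite any that reference the frame being moved:

```c
#include <stdint.h>
#include <stddef.h>

#define PTE_PFN_MASK 0x000ffffffffff000ULL   /* x86-64-style PFN bits */

/* Rewrite every entry in `ptes` that maps old_phys so it maps new_phys,
 * preserving the permission/status bits. Returns how many were fixed. */
size_t retarget_ptes(uint64_t *ptes, size_t n,
                     uint64_t old_phys, uint64_t new_phys)
{
    size_t fixed = 0;
    for (size_t i = 0; i < n; i++) {
        if ((ptes[i] & PTE_PFN_MASK) == (old_phys & PTE_PFN_MASK)) {
            ptes[i] = (ptes[i] & ~PTE_PFN_MASK) | (new_phys & PTE_PFN_MASK);
            fixed++;
        }
    }
    return fixed;
}
```

The cost argument above follows directly: this scan has to visit every page-table entry of every process, which is why it only seems tolerable as a rare hot-remove event rather than a constant overhead.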

For the remaining pages such as those that have been handed to devices
or are pinned 

Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-29 Thread Mel Gorman
On Thu, Nov 29, 2012 at 07:38:26PM +0900, Yasuaki Ishimatsu wrote:
 Hi Tony,
 
 2012/11/29 6:34, Luck, Tony wrote:
 1. use firmware information
According to ACPI spec 5.0, SRAT table has memory affinity structure
 and the structure has a Hot Pluggable Field. See 5.2.16.2 Memory
 Affinity Structure. If we use the information, we might be able to
 specify movable memory by firmware. For example, if the Hot Pluggable
 Field is enabled, Linux sets the memory as movable memory.
 
 2. use boot option
This is our proposal. New boot option can specify memory range to use
as movable memory.
 
 Isn't this just moving the work to the user? To pick good values for the
 
 Yes.
 
 movable areas, they need to know how the memory lines up across
 node boundaries ... because they need to make sure to allow some
 non-movable memory allocations on each node so that the kernel can
 take advantage of node locality.
 
 There is no problem.
 Linux has already two boot options, kernelcore= and movablecore=.
 So if we use them, non-movable memory is divided into each node evenly.
 

The motivation for those options was to reserve a percentage of memory
to be used for hugepage allocation. If hugepages were not being used at
a particular time then they could be used for other purposes. While the
system could in theory face lowmem/highmem style problems, in practice
it did not happen because the memory would be allocated as hugetlbfs
pages and unavailable anyway. The same does not really apply to a general
purpose system that you want to support memory hot-remove on so be wary of
lowmem/highmem style problems caused by relying too heavily on ZONE_MOVABLE.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-29 Thread Jiang Liu
On 11/29/2012 06:38 PM, Yasuaki Ishimatsu wrote:
 Hi Tony,
 
 2012/11/29 6:34, Luck, Tony wrote:
 1. use firmware information
According to ACPI spec 5.0, SRAT table has memory affinity structure
 and the structure has a Hot Pluggable Field. See 5.2.16.2 Memory
 Affinity Structure. If we use the information, we might be able to
 specify movable memory by firmware. For example, if the Hot Pluggable
 Field is enabled, Linux sets the memory as movable memory.

 2. use boot option
This is our proposal. New boot option can specify memory range to use
as movable memory.

 Isn't this just moving the work to the user? To pick good values for the
 
 Yes.
 
 movable areas, they need to know how the memory lines up across
 node boundaries ... because they need to make sure to allow some
 non-movable memory allocations on each node so that the kernel can
 take advantage of node locality.
 
 There is no problem.
 Linux has already two boot options, kernelcore= and movablecore=.
 So if we use them, non-movable memory is divided into each node evenly.
 
 But there is no way to specify a node used as movable currently. So
 we proposed the new boot option.
 
 So the user would have to read at least the SRAT table, and perhaps
 more, to figure out what to provide as arguments.

 
 Since this is going to be used on a dynamic system where nodes might
 be added and removed - the right values for these arguments might
 change from one boot to the next. So even if the user gets them right
 on day 1, a month later when a new node has been added, or a broken
 node removed, the values would be stale.
 
 I don't think so. Even if we hot add/remove node, the memory range of
 each memory device is not changed. So we don't need to change the boot
 option.
Hi Yasuaki,
Addresses assigned to each memory device may change under different 
hardware configurations.
According to my experience with some hotplug-capable Xeon and Itanium
systems, a typical algorithm adopted by BIOS to support memory hotplug is:
1) For backward compatibility, BIOS assigns contiguous addresses to memory
devices present at boot time. In other words, there are no holes in the memory
addresses except the hole just below 4G reserved for MMIO and other
arch-specific usage.
2) To support memory hotplug, BIOS reserves enough memory address ranges 
at the high end.
 
Let's take a typical 4 sockets system as an example. Say we have four
sockets S0-S3, and each socket supports two memory devices (M0-M1) at maximum. 
Each memory device supports 128G memory at maximum. And at boot, all memory
slots are fully populated with 4GB memory. Then the address assignment looks
like:
0-2G:   S0.M0
2-4G:   MMIO
4-8G:   S0.M1
8-12G:  S1.M0
12-16G: S1.M1
16-20G: S2.M0
20-24G: S2.M1
24-28G: S3.M0
28-32G: S3.M1
32-34G: S0.M0 (memory recovered from the MMIO hole)
1024-1152G: reserved for S0.M0
1152-1280G: reserved for S0.M1
1280-1408G: reserved for S1.M0
1408-1536G: reserved for S1.M1
1536-1664G: reserved for S2.M0
1664-1792G: reserved for S2.M1
1792-1920G: reserved for S3.M0
1920-2048G: reserved for S3.M1

If we hot-remove S2.M0 and add back a bigger memory device with 8G memory, it
will be assigned a new memory address range 1536-1544G.
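The reservation rule above can be captured in a toy model (constants chosen to match this hypothetical example layout, not any real BIOS):

```c
#include <stdint.h>

#define GB(x)        ((uint64_t)(x) << 30)
#define SLOT_MAX     GB(128)      /* per-device maximum size */
#define RESERVE_BASE GB(1024)     /* high-end reservations start at 1T */

struct range { uint64_t start, end; };

/* Reserved window for a slot; S0.M0 is slot 0, S0.M1 slot 1, ... S3.M1 slot 7. */
uint64_t slot_reserved_base(unsigned slot)
{
    return RESERVE_BASE + (uint64_t)slot * SLOT_MAX;
}

/* A device hot-added into `slot` lands at the start of its reserved window,
 * regardless of what addresses were handed out at boot. */
struct range hot_add_range(unsigned slot, uint64_t size)
{
    struct range r;
    r.start = slot_reserved_base(slot);
    r.end   = r.start + size;
    return r;
}
```

Replacing S2.M0 (slot 4) with an 8G device yields [1536G, 1544G) as in the example, which is the point: a movable range chosen against the boot-time layout says nothing about where hot-added memory will appear.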

Based on the above algorithm, suppose we configure 16-24G (S2.M0 and S2.M1) as
movable memory:
1) Memory on S3 will be configured as movable if S2 isn't present at boot time
   (the same effect as movable_node in the discussion at
   https://lkml.org/lkml/2012/11/27/154).
2) S2.M0 will be configured as non-movable and S3.M0 will be configured as
   movable if S1.M0 isn't present at boot.
3) And what about replacing S1.M0 with an 8GB memory device?
3) And how about replace S1.M0 with a 8GB memory device?

To summarize, a kernel parameter to configure movable memory for hotplug will
easily become invalid if the hardware configuration changes, and that may
confuse administrators. I still think the most reliable way is to figure out
movable memory for hotplug by parsing hardware configuration information from
the BIOS.

Regards!
Gerry



Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-29 Thread Jiang Liu
Hi Yasuaki,
Forgot to mention that I have no objection to this patchset.
I think it's a good starting point, but we still need to improve the usability
of memory hotplug by passing platform-specific information from the BIOS.
And the mechanism provided by this patchset may be used to improve
usability too.

Regards!
Gerry

On 11/29/2012 06:38 PM, Yasuaki Ishimatsu wrote:
 Hi Tony,
 
 2012/11/29 6:34, Luck, Tony wrote:
 1. use firmware information
According to ACPI spec 5.0, SRAT table has memory affinity structure
 and the structure has a Hot Pluggable Field. See 5.2.16.2 Memory
 Affinity Structure. If we use the information, we might be able to
 specify movable memory by firmware. For example, if the Hot Pluggable
 Field is enabled, Linux sets the memory as movable memory.

 2. use boot option
This is our proposal. New boot option can specify memory range to use
as movable memory.

 Isn't this just moving the work to the user? To pick good values for the
 
 Yes.
 
 movable areas, they need to know how the memory lines up across
 node boundaries ... because they need to make sure to allow some
 non-movable memory allocations on each node so that the kernel can
 take advantage of node locality.
 
 There is no problem.
 Linux has already two boot options, kernelcore= and movablecore=.
 So if we use them, non-movable memory is divided into each node evenly.
 
 But there is no way to specify a node used as movable currently. So
 we proposed the new boot option.
 
 So the user would have to read at least the SRAT table, and perhaps
 more, to figure out what to provide as arguments.

 
 Since this is going to be used on a dynamic system where nodes might
 be added and removed - the right values for these arguments might
 change from one boot to the next. So even if the user gets them right
 on day 1, a month later when a new node has been added, or a broken
 node removed, the values would be stale.
 
 I don't think so. Even if we hot add/remove node, the memory range of
 each memory device is not changed. So we don't need to change the boot
 option.
 
 Thanks,
 Yasuaki Ishimatsu
 

 -Tony

 
 



Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-29 Thread H. Peter Anvin
On 11/29/2012 03:00 AM, Mel Gorman wrote:
 
 I've not been paying a whole pile of attention to this because it's not an
 area I'm active in but I agree that configuring ZONE_MOVABLE like
 this at boot-time is going to be problematic. As awkward as it is, it
 would probably work out better to only boot with one node by default and
 then hot-add the nodes at runtime using either an online sysfs file or
 an online-reserved file that hot-adds the memory to ZONE_MOVABLE. Still
 clumsy but better than specifying addresses on the command line.
 
 That said, I also find using ZONE_MOVABLE to be a problem in itself that
 will cause problems down the road. Maybe this was discussed already but
 just in case I'll describe the problems I see.
 

Yes, and it does mean that we definitely don't want everything that can
be in ZONE_MOVABLE to be there without administrator control.  I suspect
that a lot of users of such platforms actually will not use the feature,
and don't want to take the substantial penalty.

The other bit is that if you really really want high reliability, memory
mirroring is the way to go; it is the only way you will be able to
hotremove memory without having to have a pre-event to migrate the
memory away from the affected node before the memory is offlined.

-hpa



RE: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-29 Thread Luck, Tony
 The other bit is that if you really really want high reliability, memory
 mirroring is the way to go; it is the only way you will be able to
 hotremove memory without having to have a pre-event to migrate the
 memory away from the affected node before the memory is offlined.

Some platforms don't support cross-node mirrors ... but we still want to
be able to remove a node.

-Tony


Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-29 Thread H. Peter Anvin
On 11/29/2012 02:41 PM, Luck, Tony wrote:
 The other bit is that if you really really want high reliability, memory
 mirroring is the way to go; it is the only way you will be able to
 hotremove memory without having to have a pre-event to migrate the
 memory away from the affected node before the memory is offlined.
 
 Some platforms don't support cross-node mirrors ... but we still want to
 be able to remove a node.
 

Yes, well, those platforms don't support that degree of really really
high reliability, since the unannounced failure of the node controller
will bring down the system.

-hpa




RE: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-29 Thread Luck, Tony
 If any significant percentage of memory is in ZONE_MOVABLE then the memory
 hotplug people will have to deal with all the lowmem/highmem problems
 that used to be faced by 32-bit x86 with PAE enabled. 

While these problems may still exist on large systems - I think it becomes
harder to construct workloads that run into problems.  In those bad old days
a significant fraction of lowmem was consumed by the kernel ... so it was
pretty easy to find meta-data intensive workloads that would push it over
a cliff.  Here we are talking about systems with say 128GB per node divided
into 64GB moveable and 64GB non-moveable (and I'd regard this as a rather
low-end machine).  Unless the workload consists of zillions of tiny processes
all mapping shared memory blocks, the percentage of memory allocated to
the kernel is going to be tiny compared with the old 4GB days.

-Tony



Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-29 Thread Jiang Liu
Hi Mel,
Thanks for your great comments!

On 2012-11-29 19:00, Mel Gorman wrote:
 On Wed, Nov 28, 2012 at 01:38:47PM -0800, H. Peter Anvin wrote:
 On 11/28/2012 01:34 PM, Luck, Tony wrote:

 2. use boot option
   This is our proposal. New boot option can specify memory range to use
   as movable memory.

 Isn't this just moving the work to the user? To pick good values for the
 movable areas, they need to know how the memory lines up across
 node boundaries ... because they need to make sure to allow some
 non-movable memory allocations on each node so that the kernel can
 take advantage of node locality.

 So the user would have to read at least the SRAT table, and perhaps
 more, to figure out what to provide as arguments.

 Since this is going to be used on a dynamic system where nodes might
 be added and removed - the right values for these arguments might
 change from one boot to the next. So even if the user gets them right
 on day 1, a month later when a new node has been added, or a broken
 node removed, the values would be stale.


 I gave this feedback in person at LCE: I consider the kernel
 configuration option to be useless for anything other than debugging.
 Trying to promote it as an actual solution, to be used by end users in
 the field, is ridiculous at best.

 
 I've not been paying a whole pile of attention to this because it's not an
 area I'm active in but I agree that configuring ZONE_MOVABLE like
 this at boot-time is going to be problematic. As awkward as it is, it
 would probably work out better to only boot with one node by default and
 then hot-add the nodes at runtime using either an online sysfs file or
 an online-reserved file that hot-adds the memory to ZONE_MOVABLE. Still
 clumsy but better than specifying addresses on the command line.
 
 That said, I also find using ZONE_MOVABLE to be a problem in itself that
 will cause problems down the road. Maybe this was discussed already but
 just in case I'll describe the problems I see.
 
 If any significant percentage of memory is in ZONE_MOVABLE then the memory
 hotplug people will have to deal with all the lowmem/highmem problems
 that used to be faced by 32-bit x86 with PAE enabled. As a simple example,
 metadata intensive workloads will not be able to use all of memory because
 the kernel allocations will be confined to a subset of memory. A more
 complex example is that page table page allocations are also restricted
 meaning it's possible that a process will not even be able to mmap() a high
 percentage of memory simply because it cannot allocate the page tables to
 store the mappings. ZONE_MOVABLE works up to a *point*, but it's a hack. It
 was a hack when it was introduced but at least then the expectation was
 that ZONE_MOVABLE was going to be used for huge pages and there at least
 an expectation that it would not be available for normal usage.
 
 Fundamentally the reason one would want to use ZONE_MOVABLE is because
 we cannot migrate a lot of kernel memory -- slab pages, page table pages,
 device-allocated buffers etc.  My understanding is that other OS's get around
 this by requiring that subsystems and drivers have callbacks that allow the
 core VM to force certain memory to be released but that may be impractical
 for Linux. I don't know for sure though, this is just what I heard.
As far as I know, one other OS limits immovable pages to the low end, and the
limit increases on demand. But the drawback of this solution is a serious
performance drop (about 10% on average) because it essentially disables NUMA
optimization for kernel/DMA memory allocations.

 For Linux, the hotplug people need to start thinking about how to get
 around this migration problem. The first problem faced is the memory model
 and how it maps virt-phys addresses. We have a 1:1 mapping because it's
 fast but not because it's a fundamental requirement. Start considering
 what happens if the memory model is changed to allow some sections to have
 fast lookup for virt_to_phys and other sections to have slow lookups. On
 hotplug, try and empty all the sections. If the section cannot be emptied
 because of kernel pages then the section gets marked as offline-migrated
 or something. Stop the whole machine (yes, I mean stop_machine), copy
 those unmovable pages to another location, update the kernel virt-phys
 mapping for the section being offlined so the virt addresses point to the
 new physical addresses and resume.  Virt-phys lookups are going to be
 a lot slower because a full section lookup will be necessary every time
 effectively breaking SPARSE_VMEMMAP and there will be a performance penalty
 but it should work. This will cover some slab pages where the data is only
 accessed via the virtual address -- inode caches, dcache etc.
 
 It will not work where the physical address is used. The obvious example
 is page table pages. For page tables, during stop machine you will have to
 walk all processes page tables looking for references to the page you're
 

Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-29 Thread Yasuaki Ishimatsu

Hi Jiang,


RE: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-29 Thread H. Peter Anvin
Disk I/O is still a big consumer of lowmem.

Luck, Tony tony.l...@intel.com wrote:

 If any significant percentage of memory is in ZONE_MOVABLE then the memory
 hotplug people will have to deal with all the lowmem/highmem problems
 that used to be faced by 32-bit x86 with PAE enabled.

While these problems may still exist on large systems - I think it becomes
harder to construct workloads that run into problems.  In those bad old days
a significant fraction of lowmem was consumed by the kernel ... so it was
pretty easy to find meta-data intensive workloads that would push it over
a cliff.  Here we are talking about systems with say 128GB per node divided
into 64GB moveable and 64GB non-moveable (and I'd regard this as a rather
low-end machine).  Unless the workload consists of zillions of tiny processes
all mapping shared memory blocks, the percentage of memory allocated to
the kernel is going to be tiny compared with the old 4GB days.

-Tony

-- 
Sent from my mobile phone. Please excuse brevity and lack of formatting.


Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-28 Thread Jiang Liu
On 2012-11-29 10:49, Wanpeng Li wrote:
> On Thu, Nov 29, 2012 at 10:25:40AM +0800, Jiang Liu wrote:
>> On 2012-11-29 9:42, Jaegeuk Hanse wrote:
>>> On Wed, Nov 28, 2012 at 04:47:42PM +0800, Jiang Liu wrote:
 Hi all,
Seems it's a great chance to discuss about the memory hotplug feature
 within this thread. So I will try to give some high level thoughts about 
 memory
 hotplug feature on x86/IA64. Any comments are welcomed!
First of all, I think usability really matters. Ideally, memory hotplug
 feature should just work out of box, and we shouldn't expect 
 administrators to 
 add several extra platform dependent parameters to enable memory hotplug. 
 But how to enable memory (or CPU/node) hotplug out of box? I think the key 
 point
 is to cooperate with BIOS/ACPI/firmware/device management teams. 
I still position memory hotplug as an advanced feature for high end 
 servers and those systems may/should provide some management interfaces to 
 configure CPU/memory/node hotplug features. The configuration UI may be 
 provided
 by BIOS, BMC or centralized system management suite. Once administrator 
 enables
 hotplug feature through those management UI, OS should support system 
 device
 hotplug out of box. For example, HP SuperDome2 management suite provides 
 interface
 to configure a node as floating node(hot-removable). And OpenSolaris 
 supports
 CPU/memory hotplug out of box without any extra configurations. So we 
 should
 shape interfaces between firmware and OS to better support system device 
 hotplug.
On the other hand, I think there are no commercial available x86/IA64
 platforms with system device hotplug capabilities in the field yet, at 
 least only
 limited quantity if any. So backward compatibility is not a big issue for 
 us now.
 So I think it's doable to rely on firmware to provide better support for 
 system
 device hotplug.
Then what should be enhanced to better support system device hotplug?

 1) ACPI specification should be enhanced to provide a static table to 
 describe
 components with hotplug features, so OS could reserve special resources for
 hotplug at early boot stages. For example, to reserve enough CPU ids for 
 CPU
 hot-add. Currently we guess maximum number of CPUs supported by the 
 platform
 by counting CPU entries in APIC table, that's not reliable.

 2) BIOS should implement SRAT, MPST and PMTT tables to better support 
 memory
 hotplug. SRAT associates memory ranges with proximity domains with an extra
 "hotpluggable" flag. PMTT provides memory device topology information, such
 as "socket->memory controller->DIMM". MPST is used for memory power 
 management
 and provides a way to associate memory ranges with memory devices in PMTT.
 With all information from SRAT, MPST and PMTT, OS could figure out hotplug
 memory ranges automatically, so no extra kernel parameters needed.

 3) Enhance ACPICA to provide a method to scan static ACPI tables before
 memory subsystem has been initialized because OS need to access SRAT,
 MPST and PMTT when initializing memory subsystem.

 4) The last and the most important issue is how to minimize performance
 drop caused by memory hotplug. As proposed by this patchset, once we
 configure all memory of a NUMA node as movable, it essentially disable
 NUMA optimization of kernel memory allocation from that node. According
 to experience, that will cause huge performance drop. We have observed
 10-30% performance drop with memory hotplug enabled. And on another
 OS the average performance drop caused by memory hotplug is about 10%.
 If we can't resolve the performance drop, memory hotplug is just a feature
 for demo:( With help from hardware, we do have some chances to reduce
 performance penalty caused by memory hotplug.
As we know, Linux could migrate movable page, but can't migrate
 non-movable pages used by kernel/DMA etc. And the most hard part is how
 to deal with those unmovable pages when hot-removing a memory device.
 Now hardware has given us a hand with a technology named memory migration,
 which could transparently migrate memory between memory devices. There's
 no OS visible changes except NUMA topology before and after hardware memory
 migration.
And if there are multiple memory devices within a NUMA node,
 we could configure some memory devices to host unmovable memory and the
 other to host movable memory. With this configuration, there won't be
 bigger performance drop because we have preserved all NUMA optimizations.
 We also could achieve memory hotplug remove by:
 1) Use existing page migration mechanism to reclaim movable pages.
 2) For memory devices hosting 

Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-28 Thread Jiang Liu
On 2012-11-29 9:42, Jaegeuk Hanse wrote:
> On Wed, Nov 28, 2012 at 04:47:42PM +0800, Jiang Liu wrote:
>> Hi all,
>>  Seems it's a great chance to discuss about the memory hotplug feature
>> within this thread. So I will try to give some high level thoughts about 
>> memory
>> hotplug feature on x86/IA64. Any comments are welcomed!
>>  First of all, I think usability really matters. Ideally, memory hotplug
>> feature should just work out of box, and we shouldn't expect administrators 
>> to 
>> add several extra platform dependent parameters to enable memory hotplug. 
>> But how to enable memory (or CPU/node) hotplug out of box? I think the key 
>> point
>> is to cooperate with BIOS/ACPI/firmware/device management teams. 
>>  I still position memory hotplug as an advanced feature for high end 
>> servers and those systems may/should provide some management interfaces to 
>> configure CPU/memory/node hotplug features. The configuration UI may be 
>> provided
>> by BIOS, BMC or centralized system management suite. Once administrator 
>> enables
>> hotplug feature through those management UI, OS should support system device
>> hotplug out of box. For example, HP SuperDome2 management suite provides 
>> interface
>> to configure a node as floating node(hot-removable). And OpenSolaris supports
>> CPU/memory hotplug out of box without any extra configurations. So we should
>> shape interfaces between firmware and OS to better support system device 
>> hotplug.
>>  On the other hand, I think there are no commercial available x86/IA64
>> platforms with system device hotplug capabilities in the field yet, at least 
>> only
>> limited quantity if any. So backward compatibility is not a big issue for us 
>> now.
>> So I think it's doable to rely on firmware to provide better support for 
>> system
>> device hotplug.
>>  Then what should be enhanced to better support system device hotplug?
>>
>> 1) ACPI specification should be enhanced to provide a static table to 
>> describe
>> components with hotplug features, so OS could reserve special resources for
>> hotplug at early boot stages. For example, to reserve enough CPU ids for CPU
>> hot-add. Currently we guess maximum number of CPUs supported by the platform
>> by counting CPU entries in APIC table, that's not reliable.
>>
>> 2) BIOS should implement SRAT, MPST and PMTT tables to better support memory
>> hotplug. SRAT associates memory ranges with proximity domains with an extra
>> "hotpluggable" flag. PMTT provides memory device topology information, such
>> as "socket->memory controller->DIMM". MPST is used for memory power 
>> management
>> and provides a way to associate memory ranges with memory devices in PMTT.
>> With all information from SRAT, MPST and PMTT, OS could figure out hotplug
>> memory ranges automatically, so no extra kernel parameters needed.
>>
>> 3) Enhance ACPICA to provide a method to scan static ACPI tables before
>> memory subsystem has been initialized because OS need to access SRAT,
>> MPST and PMTT when initializing memory subsystem.
>>
>> 4) The last and the most important issue is how to minimize performance
>> drop caused by memory hotplug. As proposed by this patchset, once we
>> configure all memory of a NUMA node as movable, it essentially disable
>> NUMA optimization of kernel memory allocation from that node. According
>> to experience, that will cause huge performance drop. We have observed
>> 10-30% performance drop with memory hotplug enabled. And on another
>> OS the average performance drop caused by memory hotplug is about 10%.
>> If we can't resolve the performance drop, memory hotplug is just a feature
>> for demo:( With help from hardware, we do have some chances to reduce
>> performance penalty caused by memory hotplug.
>>  As we know, Linux could migrate movable page, but can't migrate
>> non-movable pages used by kernel/DMA etc. And the most hard part is how
>> to deal with those unmovable pages when hot-removing a memory device.
>> Now hardware has given us a hand with a technology named memory migration,
>> which could transparently migrate memory between memory devices. There's
>> no OS visible changes except NUMA topology before and after hardware memory
>> migration.
>>  And if there are multiple memory devices within a NUMA node,
>> we could configure some memory devices to host unmovable memory and the
>> other to host movable memory. With this configuration, there won't be
>> bigger performance drop because we have preserved all NUMA optimizations.
>> We also could achieve memory hotplug remove by:
>> 1) Use existing page migration mechanism to reclaim movable pages.
>> 2) For memory devices hosting unmovable pages, we need:
>> 2.1) find a movable memory device on other nodes with enough capacity
>> and reclaim it.
>> 2.2) use hardware migration technology to migrate unmovable memory to
> 
> Hi Jiang,
> 
> Could you give an explanation of how hardware migration technology works?
Hi 

Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-28 Thread Jaegeuk Hanse
On Wed, Nov 28, 2012 at 04:47:42PM +0800, Jiang Liu wrote:
>Hi all,
>   Seems it's a great chance to discuss about the memory hotplug feature
>within this thread. So I will try to give some high level thoughts about memory
>hotplug feature on x86/IA64. Any comments are welcomed!
>   First of all, I think usability really matters. Ideally, memory hotplug
>feature should just work out of box, and we shouldn't expect administrators to 
>add several extra platform dependent parameters to enable memory hotplug. 
>But how to enable memory (or CPU/node) hotplug out of box? I think the key 
>point
>is to cooperate with BIOS/ACPI/firmware/device management teams. 
>   I still position memory hotplug as an advanced feature for high end 
>servers and those systems may/should provide some management interfaces to 
>configure CPU/memory/node hotplug features. The configuration UI may be 
>provided
>by BIOS, BMC or centralized system management suite. Once administrator enables
>hotplug feature through those management UI, OS should support system device
>hotplug out of box. For example, HP SuperDome2 management suite provides 
>interface
>to configure a node as floating node(hot-removable). And OpenSolaris supports
>CPU/memory hotplug out of box without any extra configurations. So we should
>shape interfaces between firmware and OS to better support system device 
>hotplug.
>   On the other hand, I think there are no commercial available x86/IA64
>platforms with system device hotplug capabilities in the field yet, at least 
>only
>limited quantity if any. So backward compatibility is not a big issue for us 
>now.
>So I think it's doable to rely on firmware to provide better support for system
>device hotplug.
>   Then what should be enhanced to better support system device hotplug?
>
>1) ACPI specification should be enhanced to provide a static table to describe
>components with hotplug features, so OS could reserve special resources for
>hotplug at early boot stages. For example, to reserve enough CPU ids for CPU
>hot-add. Currently we guess maximum number of CPUs supported by the platform
>by counting CPU entries in APIC table, that's not reliable.
>
>2) BIOS should implement SRAT, MPST and PMTT tables to better support memory
>hotplug. SRAT associates memory ranges with proximity domains with an extra
>"hotpluggable" flag. PMTT provides memory device topology information, such
>as "socket->memory controller->DIMM". MPST is used for memory power management
>and provides a way to associate memory ranges with memory devices in PMTT.
>With all information from SRAT, MPST and PMTT, OS could figure out hotplug
>memory ranges automatically, so no extra kernel parameters needed.
>
>3) Enhance ACPICA to provide a method to scan static ACPI tables before
>memory subsystem has been initialized because OS need to access SRAT,
>MPST and PMTT when initializing memory subsystem.
>
>4) The last and the most important issue is how to minimize performance
>drop caused by memory hotplug. As proposed by this patchset, once we
>configure all memory of a NUMA node as movable, it essentially disable
>NUMA optimization of kernel memory allocation from that node. According
>to experience, that will cause huge performance drop. We have observed
>10-30% performance drop with memory hotplug enabled. And on another
>OS the average performance drop caused by memory hotplug is about 10%.
>If we can't resolve the performance drop, memory hotplug is just a feature
>for demo:( With help from hardware, we do have some chances to reduce
>performance penalty caused by memory hotplug.
>   As we know, Linux could migrate movable page, but can't migrate
>non-movable pages used by kernel/DMA etc. And the most hard part is how
>to deal with those unmovable pages when hot-removing a memory device.
>Now hardware has given us a hand with a technology named memory migration,
>which could transparently migrate memory between memory devices. There's
>no OS visible changes except NUMA topology before and after hardware memory
>migration.
>   And if there are multiple memory devices within a NUMA node,
>we could configure some memory devices to host unmovable memory and the
>other to host movable memory. With this configuration, there won't be
>bigger performance drop because we have preserved all NUMA optimizations.
>We also could achieve memory hotplug remove by:
>1) Use existing page migration mechanism to reclaim movable pages.
>2) For memory devices hosting unmovable pages, we need:
>2.1) find a movable memory device on other nodes with enough capacity
>and reclaim it.
>2.2) use hardware migration technology to migrate unmovable memory to

Hi Jiang,

Could you give an explanation of how hardware migration technology works?

Regards,
Jaegeuk

>the just reclaimed memory device on other nodes.
>
>   I hope we could expect users to adopt memory hotplug technology
>with all these implemented.
>
>   Back to this patch, we could 

Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-28 Thread Tang Chen

On 11/29/2012 08:43 AM, Jaegeuk Hanse wrote:

Hi Tang,

I haven't read the patchset yet, but could you give a short description of
how you designed your implementation in this patchset?

Regards,
Jaegeuk



Hi Jaegeuk,

Thanks for your joining in. :)

This feature is used in memory hotplug.

In order to implement whole-node hotplug, we need to make sure the
node contains no kernel memory, because memory used by the kernel cannot
be migrated. (Kernel memory is directly mapped, VA = PA + __PAGE_OFFSET,
so the physical address cannot change.)

With this boot option, users can specify all the memory on a node to
be movable (i.e., in ZONE_MOVABLE), so that the node can be hot-removed.

Thanks.





Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-28 Thread Jaegeuk Hanse
On Wed, Nov 28, 2012 at 04:29:01PM +0800, Wen Congyang wrote:
>At 11/28/2012 12:08 PM, Jiang Liu Wrote:
>> On 2012-11-28 11:24, Bob Liu wrote:
>>> On Tue, Nov 27, 2012 at 8:49 PM, Tang Chen  wrote:
 On 11/27/2012 08:09 PM, Bob Liu wrote:
>
> On Tue, Nov 27, 2012 at 4:29 PM, Tang Chen
> wrote:
>>
>> Hi Liu,
>>
>>
>> This feature is used in memory hotplug.
>>
>> In order to implement a whole node hotplug, we need to make sure the
>> node contains no kernel memory, because memory used by kernel could
>> not be migrated. (Since the kernel memory is directly mapped,
>> VA = PA + __PAGE_OFFSET. So the physical address could not be changed.)
>>
>> User could specify all the memory on a node to be movable, so that the
>> node could be hot-removed.
>>
>
> Thank you for your explanation. It's reasonable.
>
> But i think it's a bit duplicated with CMA, i'm not sure but maybe we
> can combine it with CMA which already in mainline?
>
 Hi Liu,

 Thanks for your advice. :)

 CMA is Contiguous Memory Allocator, right?  What I'm trying to do is
 controlling where is the start of ZONE_MOVABLE of each node. Could
 CMA do this job ?
>>>
>>> cma will not control the start of ZONE_MOVABLE of each node, but it
>>> can declare a memory that always movable
>>> and all non movable allocate request will not happen on that area.
>>>
>>> Currently cma use a boot parameter "cma=" to declare a memory size
>>> that always movable.
>>> I think it might fulfill your requirement if extending the boot
>>> parameter with a start address.
>>>
>>> more info at http://lwn.net/Articles/468044/

 And also, after a short investigation, CMA seems need to base on
 memblock. But we need to limit memblock not to allocate memory on
 ZONE_MOVABLE. As a result, we need to know the ranges before memblock
 could be used. I'm afraid we still need an approach to get the ranges,
 such as a boot option, or from static ACPI tables such as SRAT/MPST.

>>>
>>> Yes, it's based on memblock and with boot option.
>>> In setup_arch32()
>>> dma_contiguous_reserve(0);   => will declare a cma area using
>>> memblock_reserve()
>>>
 I'm don't know much about CMA for now. So if you have any better idea,
 please share with us, thanks. :)
>>>
>>> My idea is reuse cma like below patch(even not compiled) and boot with
>>> "cma=size@start_address".
>>> I don't know whether it can work and whether suitable for your
>>> requirement, if not forgive me for this noises.
>>>
>>> diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
>>> index 612afcc..564962a 100644
>>> --- a/drivers/base/dma-contiguous.c
>>> +++ b/drivers/base/dma-contiguous.c
>>> @@ -59,11 +59,18 @@ struct cma *dma_contiguous_default_area;
>>>   */
>>>  static const unsigned long size_bytes = CMA_SIZE_MBYTES * SZ_1M;
>>>  static long size_cmdline = -1;
>>> +static long cma_start_cmdline = -1;
>>>
>>>  static int __init early_cma(char *p)
>>>  {
>>> +   char *oldp;
>>> pr_debug("%s(%s)\n", __func__, p);
>>> +   oldp = p;
>>> size_cmdline = memparse(p, &p);
>>> +
>>> +   if (*p == '@')
>>> +   cma_start_cmdline = memparse(p+1, &p);
>>> +   printk("cma start: 0x%lx, size: 0x%lx\n", cma_start_cmdline,
>>> size_cmdline);
>>> return 0;
>>>  }
>>>  early_param("cma", early_cma);
>>> @@ -127,8 +134,10 @@ void __init dma_contiguous_reserve(phys_addr_t limit)
>>> if (selected_size) {
>>> pr_debug("%s: reserving %ld MiB for global area\n", 
>>> __func__,
>>>  selected_size / SZ_1M);
>>> -
>>> -   dma_declare_contiguous(NULL, selected_size, 0, limit);
>>> +   if (cma_start_cmdline != -1)
>>> +   dma_declare_contiguous(NULL, selected_size,
>>> cma_start_cmdline, limit);
>>> +   else
>>> +   dma_declare_contiguous(NULL, selected_size, 0, 
>>> limit);
>>> }
>>>  };
>> Seems a good idea to reserve memory by reusing CMA logic, though need more
>> investigation here. One of CMA goal is to ensure pages in CMA are really
>> movable, and this patchset tries to achieve the same goal at a first glance.
>
>Hmm, I don't like to reuse CMA. Because CMA is used for DMA. If we reuse it
>for movable memory, I think movable zone is enough. And the start address is
>not acceptable, because we want to specify the start address for each node.
>
>I think we can implement movablecore_map like that:
>1. parse the parameter
>2. reserve the memory after efi_reserve_boot_services()
>3. release the memory in mem_init
>

Hi Tang,

I haven't read the patchset yet, but could you give a short description of
how you designed your implementation in this patchset?

Regards,
Jaegeuk

>What about this?
>
>Thanks
>Wen Congyang

Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-28 Thread H. Peter Anvin
On 11/28/2012 01:34 PM, Luck, Tony wrote:
>>
>> 2. use boot option
>>   This is our proposal. New boot option can specify memory range to use
>>   as movable memory.
> 
> Isn't this just moving the work to the user? To pick good values for the
> movable areas, they need to know how the memory lines up across
> node boundaries ... because they need to make sure to allow some
> non-movable memory allocations on each node so that the kernel can
> take advantage of node locality.
> 
> So the user would have to read at least the SRAT table, and perhaps
> more, to figure out what to provide as arguments.
> 
> Since this is going to be used on a dynamic system where nodes might
> be added and removed - the right values for these arguments might
> change from one boot to the next. So even if the user gets them right
> on day 1, a month later when a new node has been added, or a broken
> node removed the values would be stale.
> 

I gave this feedback in person at LCE: I consider the kernel
configuration option to be useless for anything other than debugging.
Trying to promote it as an actual solution, to be used by end users in
the field, is ridiculous at best.

-hpa




RE: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-28 Thread Luck, Tony
> 1. use firmware information
>   According to the ACPI 5.0 spec, the SRAT table has a memory affinity
>   structure, and the structure has a Hot Pluggable Field. See "5.2.16.2
>   Memory Affinity Structure". If we use that information, we might be
>   able to specify movable memory via firmware. For example, if the Hot
>   Pluggable Field is enabled, Linux sets the memory as movable memory.
> 
> 2. use boot option
>   This is our proposal. New boot option can specify memory range to use
>   as movable memory.

Isn't this just moving the work to the user? To pick good values for the
movable areas, they need to know how the memory lines up across
node boundaries ... because they need to make sure to allow some
non-movable memory allocations on each node so that the kernel can
take advantage of node locality.

So the user would have to read at least the SRAT table, and perhaps
more, to figure out what to provide as arguments.

Since this is going to be used on a dynamic system where nodes might
be added and removed - the right values for these arguments might
change from one boot to the next. So even if the user gets them right
on day 1, a month later when a new node has been added, or a broken
node removed, the values would be stale.

-Tony


Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-28 Thread Jiang Liu
Hi all,
Seems it's a great chance to discuss the memory hotplug feature
within this thread, so I will try to give some high-level thoughts about
memory hotplug on x86/IA64. Any comments are welcome!
First of all, I think usability really matters. Ideally, memory hotplug
should just work out of the box, and we shouldn't expect administrators to
add several extra platform-dependent parameters to enable it.
But how do we enable memory (or CPU/node) hotplug out of the box? I think the
key point is to cooperate with the BIOS/ACPI/firmware/device management teams.
I still position memory hotplug as an advanced feature for high-end
servers, and those systems may/should provide some management interfaces to
configure CPU/memory/node hotplug features. The configuration UI may be
provided by the BIOS, a BMC, or a centralized system management suite. Once
the administrator enables the hotplug feature through those management UIs,
the OS should support system device hotplug out of the box. For example, the
HP SuperDome2 management suite provides an interface to configure a node as a
floating (hot-removable) node, and OpenSolaris supports CPU/memory hotplug
out of the box without any extra configuration. So we should shape the
interfaces between firmware and OS to better support system device hotplug.
On the other hand, I think there are no commercially available x86/IA64
platforms with system device hotplug capabilities in the field yet, or at
most in limited quantity. So backward compatibility is not a big issue for
us now, and I think it's doable to rely on firmware to provide better
support for system device hotplug.
Then what should be enhanced to better support system device hotplug?

1) The ACPI specification should be enhanced to provide a static table that
describes components with hotplug features, so the OS could reserve special
resources for hotplug at early boot stages, for example reserving enough CPU
ids for CPU hot-add. Currently we guess the maximum number of CPUs supported
by the platform by counting CPU entries in the APIC table, which is not
reliable.

2) The BIOS should implement the SRAT, MPST and PMTT tables to better support
memory hotplug. SRAT associates memory ranges with proximity domains and
carries an extra "hotpluggable" flag. PMTT provides memory device topology
information, such as "socket->memory controller->DIMM". MPST is used for
memory power management and provides a way to associate memory ranges with
the memory devices in PMTT. With all the information from SRAT, MPST and
PMTT, the OS could figure out hotpluggable memory ranges automatically, so
no extra kernel parameters would be needed.

3) Enhance ACPICA to provide a method to scan static ACPI tables before the
memory subsystem has been initialized, because the OS needs to access SRAT,
MPST and PMTT when initializing the memory subsystem.

4) The last and most important issue is how to minimize the performance drop
caused by memory hotplug. As proposed by this patchset, once we configure all
memory of a NUMA node as movable, it essentially disables NUMA optimization
of kernel memory allocation from that node. In our experience, that causes a
huge performance drop: we have observed a 10-30% drop with memory hotplug
enabled, and on another OS the average drop caused by memory hotplug is about
10%. If we can't resolve the performance drop, memory hotplug is just a
feature for demos:( With help from hardware, we do have some chances to
reduce the performance penalty caused by memory hotplug.
As we know, Linux can migrate movable pages, but can't migrate the
non-movable pages used by the kernel, DMA, etc., and the hardest part is how
to deal with those unmovable pages when hot-removing a memory device.
Now hardware has given us a hand with a technology named memory migration,
which can transparently migrate memory between memory devices. There are
no OS-visible changes except NUMA topology before and after hardware memory
migration.
And if there are multiple memory devices within a NUMA node,
we could configure some memory devices to host unmovable memory and the
others to host movable memory. With this configuration, there won't be a
big performance drop because we have preserved all NUMA optimizations.
We could then achieve memory hot-remove by:
1) Using the existing page migration mechanism to reclaim movable pages.
2) For memory devices hosting unmovable pages:
2.1) find a movable memory device on another node with enough capacity
and reclaim it.
2.2) use hardware migration technology to migrate the unmovable memory to
the just-reclaimed memory device on the other node.

I hope we could expect users to adopt memory hotplug technology
once all of this is implemented.

Back to this patch: we could rely on the mechanism it provides
to automatically mark memory ranges as movable using information
from the ACPI SRAT/MPST/PMTT tables, so administrators wouldn't need to
manually configure kernel parameters to enable memory hotplug.


Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-28 Thread Wen Congyang
At 11/28/2012 04:28 PM, Jiang Liu Wrote:
> On 2012-11-28 16:29, Wen Congyang wrote:
>> At 11/28/2012 12:08 PM, Jiang Liu Wrote:
>>> On 2012-11-28 11:24, Bob Liu wrote:
 On Tue, Nov 27, 2012 at 8:49 PM, Tang Chen  wrote:
> On 11/27/2012 08:09 PM, Bob Liu wrote:
>>
>> On Tue, Nov 27, 2012 at 4:29 PM, Tang Chen
>> wrote:
>>>
>>> Hi Liu,
>>>
>>>
>>> This feature is used in memory hotplug.
>>>
>>> In order to implement a whole node hotplug, we need to make sure the
>>> node contains no kernel memory, because memory used by kernel could
>>> not be migrated. (Since the kernel memory is directly mapped,
>>> VA = PA + __PAGE_OFFSET. So the physical address could not be changed.)
>>>
>>> User could specify all the memory on a node to be movable, so that the
>>> node could be hot-removed.
>>>
>>
>> Thank you for your explanation. It's reasonable.
>>
>> But i think it's a bit duplicated with CMA, i'm not sure but maybe we
>> can combine it with CMA which already in mainline?
>>
> Hi Liu,
>
> Thanks for your advice. :)
>
> CMA is Contiguous Memory Allocator, right?  What I'm trying to do is
> controlling where is the start of ZONE_MOVABLE of each node. Could
> CMA do this job ?

 cma will not control the start of ZONE_MOVABLE of each node, but it
 can declare a memory that always movable
 and all non movable allocate request will not happen on that area.

 Currently cma use a boot parameter "cma=" to declare a memory size
 that always movable.
 I think it might fulfill your requirement if extending the boot
 parameter with a start address.

 more info at http://lwn.net/Articles/468044/
>
> And also, after a short investigation, CMA seems need to base on
> memblock. But we need to limit memblock not to allocate memory on
> ZONE_MOVABLE. As a result, we need to know the ranges before memblock
> could be used. I'm afraid we still need an approach to get the ranges,
> such as a boot option, or from static ACPI tables such as SRAT/MPST.
>

 Yes, it's based on memblock and with boot option.
 In setup_arch32()
 dma_contiguous_reserve(0);   => will declare a cma area using
 memblock_reserve()

> I'm don't know much about CMA for now. So if you have any better idea,
> please share with us, thanks. :)

 My idea is reuse cma like below patch(even not compiled) and boot with
 "cma=size@start_address".
 I don't know whether it can work and whether suitable for your
 requirement, if not forgive me for this noises.

 diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
 index 612afcc..564962a 100644
 --- a/drivers/base/dma-contiguous.c
 +++ b/drivers/base/dma-contiguous.c
 @@ -59,11 +59,18 @@ struct cma *dma_contiguous_default_area;
   */
  static const unsigned long size_bytes = CMA_SIZE_MBYTES * SZ_1M;
  static long size_cmdline = -1;
 +static long cma_start_cmdline = -1;

  static int __init early_cma(char *p)
  {
 +   char *oldp;
 pr_debug("%s(%s)\n", __func__, p);
 +   oldp = p;
 size_cmdline = memparse(p, &p);
 +
 +   if (*p == '@')
 +   cma_start_cmdline = memparse(p+1, &p);
 +   printk("cma start: 0x%lx, size: 0x%lx\n", cma_start_cmdline,
 size_cmdline);
 return 0;
  }
  early_param("cma", early_cma);
 @@ -127,8 +134,10 @@ void __init dma_contiguous_reserve(phys_addr_t limit)
 if (selected_size) {
 pr_debug("%s: reserving %ld MiB for global area\n", 
 __func__,
  selected_size / SZ_1M);
 -
 -   dma_declare_contiguous(NULL, selected_size, 0, limit);
 +   if (cma_start_cmdline != -1)
 +   dma_declare_contiguous(NULL, selected_size,
 cma_start_cmdline, limit);
 +   else
 +   dma_declare_contiguous(NULL, selected_size, 0, 
 limit);
 }
  };
>>> Seems a good idea to reserve memory by reusing CMA logic, though need more
>>> investigation here. One of CMA goal is to ensure pages in CMA are really
>>> movable, and this patchset tries to achieve the same goal at a first glance.
>>
>> Hmm, I don't like to reuse CMA. Because CMA is used for DMA. If we reuse it
>> for movable memory, I think movable zone is enough. And the start address is
>> not acceptable, because we want to specify the start address for each node.
>>
>> I think we can implement movablecore_map like that:
>> 1. parse the parameter
>> 2. reserve the memory after efi_reserve_boot_services()
> This sounds good, but the code to reserve memory for movable
> nodes will be similar to dma_declare_contiguous().

Yes, it may be very similar. I think we can move it into 

Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-28 Thread Jiang Liu
On 2012-11-28 16:29, Wen Congyang wrote:
> At 11/28/2012 12:08 PM, Jiang Liu Wrote:
>> On 2012-11-28 11:24, Bob Liu wrote:
>>> On Tue, Nov 27, 2012 at 8:49 PM, Tang Chen  wrote:
 On 11/27/2012 08:09 PM, Bob Liu wrote:
>
> On Tue, Nov 27, 2012 at 4:29 PM, Tang Chen
> wrote:
>>
>> Hi Liu,
>>
>>
>> This feature is used in memory hotplug.
>>
>> In order to implement a whole node hotplug, we need to make sure the
>> node contains no kernel memory, because memory used by kernel could
>> not be migrated. (Since the kernel memory is directly mapped,
>> VA = PA + __PAGE_OFFSET. So the physical address could not be changed.)
>>
>> User could specify all the memory on a node to be movable, so that the
>> node could be hot-removed.
>>
>
> Thank you for your explanation. It's reasonable.
>
> But i think it's a bit duplicated with CMA, i'm not sure but maybe we
> can combine it with CMA which already in mainline?
>
 Hi Liu,

 Thanks for your advice. :)

>>> CMA is the Contiguous Memory Allocator, right? What I'm trying to do is
>>> control where the start of ZONE_MOVABLE of each node is. Could
>>> CMA do this job?
>>>
>>> CMA will not control the start of ZONE_MOVABLE on each node, but it
>>> can declare a memory area that is always movable, and no non-movable
>>> allocation request will ever land in that area.
>>>
>>> Currently CMA uses the boot parameter "cma=" to declare a memory size
>>> that is always movable.
>>> I think it might fulfill your requirement if the boot parameter were
>>> extended with a start address.
>>>
>>> more info at http://lwn.net/Articles/468044/

 And also, after a short investigation, CMA seems to be based on
 memblock. But we need to keep memblock from allocating memory in
 ZONE_MOVABLE. As a result, we need to know the ranges before memblock
 can be used. I'm afraid we still need an approach to get the ranges,
 such as a boot option, or static ACPI tables such as SRAT/MPST.

>>>
>>> Yes, it's based on memblock, and it comes with a boot option.
>>> In setup_arch32()
>>> dma_contiguous_reserve(0);   => will declare a cma area using
>>> memblock_reserve()
>>>
 I don't know much about CMA for now. So if you have any better idea,
 please share it with us, thanks. :)
>>>
>>> My idea is to reuse CMA as in the patch below (not even compiled) and
>>> boot with "cma=size@start_address".
>>> I don't know whether it will work or whether it suits your
>>> requirement; if not, forgive me for the noise.
>>>
>>> diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
>>> index 612afcc..564962a 100644
>>> --- a/drivers/base/dma-contiguous.c
>>> +++ b/drivers/base/dma-contiguous.c
>>> @@ -59,11 +59,18 @@ struct cma *dma_contiguous_default_area;
>>>   */
>>>  static const unsigned long size_bytes = CMA_SIZE_MBYTES * SZ_1M;
>>>  static long size_cmdline = -1;
>>> +static long cma_start_cmdline = -1;
>>>
>>>  static int __init early_cma(char *p)
>>>  {
>>> +   char *oldp;
>>> pr_debug("%s(%s)\n", __func__, p);
>>> +   oldp = p;
>>> size_cmdline = memparse(p, &p);
>>> +
>>> +   if (*p == '@')
>>> +   cma_start_cmdline = memparse(p+1, &p);
>>> +   printk("cma start: 0x%lx, size: 0x%lx\n", cma_start_cmdline,
>>> size_cmdline);
>>> return 0;
>>>  }
>>>  early_param("cma", early_cma);
>>> @@ -127,8 +134,10 @@ void __init dma_contiguous_reserve(phys_addr_t limit)
>>> if (selected_size) {
>>> pr_debug("%s: reserving %ld MiB for global area\n", 
>>> __func__,
>>>  selected_size / SZ_1M);
>>> -
>>> -   dma_declare_contiguous(NULL, selected_size, 0, limit);
>>> +   if (cma_start_cmdline != -1)
>>> +   dma_declare_contiguous(NULL, selected_size,
>>> cma_start_cmdline, limit);
>>> +   else
>>> +   dma_declare_contiguous(NULL, selected_size, 0, 
>>> limit);
>>> }
>>>  };
>> Reserving memory by reusing the CMA logic seems like a good idea, though it
>> needs more investigation. One of CMA's goals is to ensure pages in CMA are
>> really movable, and this patchset tries to achieve the same goal at first
>> glance.
> 
> Hmm, I don't like reusing CMA, because CMA is meant for DMA. If we reuse it
> for movable memory, I think the movable zone is enough. And a single start
> address is not acceptable, because we want to specify the start address for
> each node.
> 
> I think we can implement movablecore_map like that:
> 1. parse the parameter
> 2. reserve the memory after efi_reserve_boot_services()
This sounds good, but the code to reserve memory for movable
nodes will be similar to dma_declare_contiguous().

> 3. release the memory in mem_init
> 
> What about this?
> 
> Thanks
> Wen Congyang
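For reference, the "cma=size@start_address" parsing proposed in the patch above can be sketched in user space. Here parse_size is a hypothetical stand-in for the kernel's memparse() (number plus optional K/M/G suffix); all names are illustrative, not the actual patch code:

```c
#include <stdlib.h>

/* Hypothetical user-space stand-in for the kernel's memparse():
 * parse a number with an optional K/M/G suffix, advancing *retp. */
static unsigned long long parse_size(const char *p, char **retp)
{
	unsigned long long v = strtoull(p, retp, 0);

	switch (**retp) {
	case 'G': case 'g': v <<= 10; /* fall through */
	case 'M': case 'm': v <<= 10; /* fall through */
	case 'K': case 'k': v <<= 10; (*retp)++; break;
	}
	return v;
}

/* Parse "size[@start]", the format the proposed "cma=" extension takes.
 * Returns 0 on success, -1 on trailing junk. */
static int parse_cma_arg(const char *arg,
			 unsigned long long *size, unsigned long long *start)
{
	char *p;

	*size = parse_size(arg, &p);
	*start = 0;
	if (*p == '@')
		*start = parse_size(p + 1, &p);
	return *p ? -1 : 0;	/* reject trailing junk */
}
```

For example, "64M@0x40000000" yields a 64 MiB size and a start address of 1 GiB.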

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to 

Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-28 Thread Wen Congyang
At 11/28/2012 12:08 PM, Jiang Liu Wrote:
> On 2012-11-28 11:24, Bob Liu wrote:
>> On Tue, Nov 27, 2012 at 8:49 PM, Tang Chen  wrote:
>>> On 11/27/2012 08:09 PM, Bob Liu wrote:

 On Tue, Nov 27, 2012 at 4:29 PM, Tang Chen
 wrote:
>
> Hi Liu,
>
>
> This feature is used in memory hotplug.
>
> In order to implement a whole node hotplug, we need to make sure the
> node contains no kernel memory, because memory used by kernel could
> not be migrated. (Since the kernel memory is directly mapped,
> VA = PA + __PAGE_OFFSET. So the physical address could not be changed.)
>
> User could specify all the memory on a node to be movable, so that the
> node could be hot-removed.
>

 Thank you for your explanation. It's reasonable.

 But i think it's a bit duplicated with CMA, i'm not sure but maybe we
 can combine it with CMA which already in mainline?

>>> Hi Liu,
>>>
>>> Thanks for your advice. :)
>>>
>>> CMA is the Contiguous Memory Allocator, right? What I'm trying to do is
>>> control where the start of ZONE_MOVABLE of each node is. Could
>>> CMA do this job?
>>
>> CMA will not control the start of ZONE_MOVABLE on each node, but it
>> can declare a memory area that is always movable, and no non-movable
>> allocation request will ever land in that area.
>>
>> Currently CMA uses the boot parameter "cma=" to declare a memory size
>> that is always movable.
>> I think it might fulfill your requirement if the boot parameter were
>> extended with a start address.
>>
>> more info at http://lwn.net/Articles/468044/
>>>
>>> And also, after a short investigation, CMA seems to be based on
>>> memblock. But we need to keep memblock from allocating memory in
>>> ZONE_MOVABLE. As a result, we need to know the ranges before memblock
>>> can be used. I'm afraid we still need an approach to get the ranges,
>>> such as a boot option, or static ACPI tables such as SRAT/MPST.
>>>
>>
>> Yes, it's based on memblock, and it comes with a boot option.
>> In setup_arch32()
>> dma_contiguous_reserve(0);   => will declare a cma area using
>> memblock_reserve()
>>
>>> I don't know much about CMA for now. So if you have any better idea,
>>> please share it with us, thanks. :)
>>
>> My idea is to reuse CMA as in the patch below (not even compiled) and boot
>> with "cma=size@start_address".
>> I don't know whether it will work or whether it suits your
>> requirement; if not, forgive me for the noise.
>>
>> diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
>> index 612afcc..564962a 100644
>> --- a/drivers/base/dma-contiguous.c
>> +++ b/drivers/base/dma-contiguous.c
>> @@ -59,11 +59,18 @@ struct cma *dma_contiguous_default_area;
>>   */
>>  static const unsigned long size_bytes = CMA_SIZE_MBYTES * SZ_1M;
>>  static long size_cmdline = -1;
>> +static long cma_start_cmdline = -1;
>>
>>  static int __init early_cma(char *p)
>>  {
>> +   char *oldp;
>> pr_debug("%s(%s)\n", __func__, p);
>> +   oldp = p;
>> size_cmdline = memparse(p, &p);
>> +
>> +   if (*p == '@')
>> +   cma_start_cmdline = memparse(p+1, &p);
>> +   printk("cma start: 0x%lx, size: 0x%lx\n", cma_start_cmdline,
>> size_cmdline);
>> return 0;
>>  }
>>  early_param("cma", early_cma);
>> @@ -127,8 +134,10 @@ void __init dma_contiguous_reserve(phys_addr_t limit)
>> if (selected_size) {
>> pr_debug("%s: reserving %ld MiB for global area\n", __func__,
>>  selected_size / SZ_1M);
>> -
>> -   dma_declare_contiguous(NULL, selected_size, 0, limit);
>> +   if (cma_start_cmdline != -1)
>> +   dma_declare_contiguous(NULL, selected_size,
>> cma_start_cmdline, limit);
>> +   else
>> +   dma_declare_contiguous(NULL, selected_size, 0, 
>> limit);
>> }
>>  };
> Reserving memory by reusing the CMA logic seems like a good idea, though it
> needs more investigation. One of CMA's goals is to ensure pages in CMA are
> really movable, and this patchset tries to achieve the same goal at first
> glance.

Hmm, I don't like reusing CMA, because CMA is meant for DMA. If we reuse it
for movable memory, I think the movable zone is enough. And a single start
address is not acceptable, because we want to specify the start address for
each node.

I think we can implement movablecore_map like that:
1. parse the parameter
2. reserve the memory after efi_reserve_boot_services()
3. release the memory in mem_init

What about this?

Thanks
Wen Congyang
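Step 1 of this proposal (parsing the parameter) could look roughly like the following user-space sketch. The comma-separated size@start format, the helper names, and the range table are assumptions based on this discussion, not the actual patchset code:

```c
#include <stdlib.h>

#define MAX_MOVABLE_RANGES 8

struct movable_range {
	unsigned long long start;
	unsigned long long end;
};

/* Hypothetical memparse()-like helper (cf. the kernel's lib/cmdline.c). */
static unsigned long long parse_size(const char *p, char **retp)
{
	unsigned long long v = strtoull(p, retp, 0);

	switch (**retp) {
	case 'G': v <<= 10; /* fall through */
	case 'M': v <<= 10; /* fall through */
	case 'K': v <<= 10; (*retp)++; break;
	}
	return v;
}

/* Parse "size@start[,size@start...]" into a table of [start, end) ranges
 * that a later boot step could hand to memblock so kernel allocations
 * stay out of them. Returns the number of ranges, or -1 on bad input. */
static int parse_movablecore_map(const char *arg, struct movable_range *map)
{
	char *p = (char *)arg;
	int n = 0;

	while (n < MAX_MOVABLE_RANGES) {
		unsigned long long size = parse_size(p, &p);

		if (size == 0 || *p != '@')
			return -1;
		map[n].start = parse_size(p + 1, &p);
		map[n].end = map[n].start + size;
		n++;
		if (*p != ',')
			break;
		p++;
	}
	return *p ? -1 : n;
}
```

With input "4G@8G,2G@16G" this produces two ranges, [8G, 12G) and [16G, 18G), one per node in the spirit of the per-node start addresses discussed above.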


Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-28 Thread Wen Congyang
At 11/28/2012 04:28 PM, Jiang Liu Wrote:
 On 2012-11-28 16:29, Wen Congyang wrote:

 Hmm, I don't like reusing CMA, because CMA is meant for DMA. If we reuse it
 for movable memory, I think the movable zone is enough. And a single start
 address is not acceptable, because we want to specify the start address for
 each node.

 I think we can implement movablecore_map like that:
 1. parse the parameter
 2. reserve the memory after efi_reserve_boot_services()
 This sounds good, but the code to reserve memory for movable
 nodes will be similar to dma_declare_contiguous().

Yes, it may be very similar. I think we can move it into mm/page_alloc.c, and
both cma and movablecore_map can use this function.

Thanks
Wen Congyang

 
 3. release the memory in mem_init

 What about this?

 Thanks
 Wen Congyang

  








Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-28 Thread Jiang Liu
Hi all,
This seems like a great chance to discuss the memory hotplug feature
within this thread, so I will try to give some high-level thoughts about memory
hotplug on x86/IA64. Any comments are welcome!
First of all, I think usability really matters. Ideally, the memory hotplug
feature should just work out of the box, and we shouldn't expect administrators
to add several extra platform-dependent parameters to enable memory hotplug.
But how do we enable memory (or CPU/node) hotplug out of the box? I think the
key point is to cooperate with the BIOS/ACPI/firmware/device management teams.
I still position memory hotplug as an advanced feature for high-end
servers, and those systems may/should provide management interfaces to
configure CPU/memory/node hotplug features. The configuration UI may be
provided by the BIOS, the BMC, or a centralized system management suite. Once
an administrator enables the hotplug feature through those management UIs, the
OS should support system device hotplug out of the box. For example, the HP
SuperDome2 management suite provides an interface to configure a node as a
floating (hot-removable) node. And OpenSolaris supports CPU/memory hotplug out
of the box without any extra configuration. So we should shape the interfaces
between firmware and OS to better support system device hotplug.
On the other hand, I think there are no commercially available x86/IA64
platforms with system device hotplug capabilities in the field yet, or at most
only a limited quantity. So backward compatibility is not a big issue for us
now, and I think it's doable to rely on firmware to provide better support for
system device hotplug.
Then what should be enhanced to better support system device hotplug?

1) The ACPI specification should be enhanced to provide a static table
describing components with hotplug capabilities, so the OS can reserve special
resources for hotplug at early boot stages; for example, reserving enough CPU
ids for CPU hot-add. Currently we guess the maximum number of CPUs supported
by the platform by counting CPU entries in the APIC table, which is not
reliable.

2) The BIOS should implement the SRAT, MPST and PMTT tables to better support
memory hotplug. SRAT associates memory ranges with proximity domains and
carries an extra hot-pluggable flag. PMTT provides memory device topology
information, such as socket / memory controller / DIMM. MPST is used for
memory power management and provides a way to associate memory ranges with the
memory devices in PMTT. With all the information from SRAT, MPST and PMTT, the
OS could figure out hotpluggable memory ranges automatically, so no extra
kernel parameters would be needed.
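For illustration, the hot-pluggable information in SRAT is a per-range flag. A user-space sketch of the check might look like this; the struct mirrors the Memory Affinity Structure layout in ACPI 5.0, sec. 5.2.16.2 (in Linux the corresponding type is struct acpi_srat_mem_affinity), and the names here are illustrative:

```c
#include <stdint.h>

/* Mirror of the SRAT Memory Affinity Structure (ACPI 5.0, sec. 5.2.16.2).
 * 40 bytes total; reserved fields are unused but kept for layout. */
struct srat_mem_affinity {
	uint8_t  type;			/* 1 = Memory Affinity */
	uint8_t  length;		/* 40 */
	uint32_t proximity_domain;
	uint16_t reserved1;
	uint64_t base_address;
	uint64_t mem_length;
	uint32_t reserved2;
	uint32_t flags;
	uint64_t reserved3;
} __attribute__((packed));

#define SRAT_MEM_ENABLED	(1u << 0)	/* flags bit 0: Enabled */
#define SRAT_MEM_HOT_PLUGGABLE	(1u << 1)	/* flags bit 1: Hot Pluggable */

/* A range is a candidate for ZONE_MOVABLE only when firmware marks it
 * both enabled and hot-pluggable. */
static int srat_range_is_hotpluggable(const struct srat_mem_affinity *ma)
{
	return (ma->flags & SRAT_MEM_ENABLED) &&
	       (ma->flags & SRAT_MEM_HOT_PLUGGABLE);
}
```

Walking all such entries at early boot would give exactly the ranges that this patchset currently takes from the command line.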

3) Enhance ACPICA to provide a method to scan static ACPI tables before the
memory subsystem has been initialized, because the OS needs to access SRAT,
MPST and PMTT when initializing the memory subsystem.

4) The last and most important issue is how to minimize the performance drop
caused by memory hotplug. As proposed by this patchset, once we configure all
memory of a NUMA node as movable, it essentially disables NUMA optimization of
kernel memory allocation from that node. In our experience, that causes a huge
performance drop: we have observed a 10-30% drop with memory hotplug enabled,
and on another OS the average drop caused by memory hotplug is about 10%.
If we can't resolve the performance drop, memory hotplug is just a feature
for demos :( With help from hardware, we do have some chance to reduce the
performance penalty caused by memory hotplug.
As we know, Linux can migrate movable pages, but can't migrate
non-movable pages used by the kernel, DMA, etc. And the hardest part is how
to deal with those unmovable pages when hot-removing a memory device.
Now hardware has given us a hand with a technology named memory migration,
which can transparently migrate memory between memory devices. There are
no OS-visible changes except NUMA topology before and after hardware memory
migration.
And if there are multiple memory devices within a NUMA node,
we could configure some memory devices to host unmovable memory and the
others to host movable memory. With this configuration, there won't be a
big performance drop because we have preserved all NUMA optimizations.
We could then achieve memory hot-remove by:
1) Using the existing page migration mechanism to reclaim movable pages.
2) For memory devices hosting unmovable pages, we need to:
2.1) find a movable memory device on another node with enough capacity
and reclaim it;
2.2) use the hardware migration technology to migrate the unmovable memory to
the just-reclaimed memory device on the other node.
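As a sketch only, the two-phase hot-remove flow described above might be structured like this. Every function and type here is a hypothetical stand-in for illustration; none of them are existing kernel APIs:

```c
/* Hypothetical stand-ins for the two-phase hot-remove flow. */

enum dev_kind { DEV_MOVABLE, DEV_UNMOVABLE };

struct mem_device {
	enum dev_kind kind;
	int node;
	int reclaimed;		/* 1 once its pages have been migrated away */
};

/* Phase 1: movable devices can be emptied with the existing page
 * migration mechanism. */
static int reclaim_movable(struct mem_device *dev)
{
	if (dev->kind != DEV_MOVABLE)
		return -1;
	dev->reclaimed = 1;	/* stands in for page migration/reclaim */
	return 0;
}

/* Phase 2: a device hosting unmovable pages needs a reclaimed movable
 * device on another node as the target of hardware memory migration. */
static int remove_unmovable(struct mem_device *dev, struct mem_device *spare)
{
	if (spare->node == dev->node || reclaim_movable(spare))
		return -1;	/* need free capacity on a different node */
	/* Hardware migration is OS-transparent apart from the NUMA
	 * topology change, so nothing else needs fixing up here. */
	dev->reclaimed = 1;
	return 0;
}
```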

With all of these implemented, I hope we could expect users to adopt memory
hotplug technology.

Back to this patchset: we could rely on the mechanism it provides to
automatically mark memory ranges as movable using information from the ACPI
SRAT/MPST/PMTT tables, so administrators wouldn't need to manually configure
kernel parameters to enable memory hotplug.


RE: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-28 Thread Luck, Tony
 1. use firmware information
   According to the ACPI 5.0 spec, the SRAT table has a memory affinity
   structure, and that structure has a Hot Pluggable Field (see 5.2.16.2,
   Memory Affinity Structure). If we use this information, we might be able
   to specify movable memory via firmware. For example, if the Hot Pluggable
   Field is set, Linux treats the memory as movable.
 
 2. use boot option
   This is our proposal. New boot option can specify memory range to use
   as movable memory.

Isn't this just moving the work to the user? To pick good values for the
movable areas, they need to know how the memory lines up across
node boundaries ... because they need to make sure to allow some
non-movable memory allocations on each node so that the kernel can
take advantage of node locality.

So the user would have to read at least the SRAT table, and perhaps
more, to figure out what to provide as arguments.

Since this is going to be used on a dynamic system where nodes might
be added and removed, the right values for these arguments might
change from one boot to the next. So even if the user gets them right
on day 1, a month later, when a new node has been added or a broken
node removed, the values would be stale.

-Tony


Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-28 Thread H. Peter Anvin
On 11/28/2012 01:34 PM, Luck, Tony wrote:

 2. use boot option
   This is our proposal. New boot option can specify memory range to use
   as movable memory.
 
 Isn't this just moving the work to the user? To pick good values for the
 movable areas, they need to know how the memory lines up across
 node boundaries ... because they need to make sure to allow some
 non-movable memory allocations on each node so that the kernel can
 take advantage of node locality.
 
 So the user would have to read at least the SRAT table, and perhaps
 more, to figure out what to provide as arguments.
 
 Since this is going to be used on a dynamic system where nodes might
 be added and removed, the right values for these arguments might
 change from one boot to the next. So even if the user gets them right
 on day 1, a month later when a new node has been added, or a broken
 node removed the values would be stale.
 

I gave this feedback in person at LCE: I consider the kernel
configuration option to be useless for anything other than debugging.
Trying to promote it as an actual solution, to be used by end users in
the field, is ridiculous at best.

-hpa




Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-28 Thread Jaegeuk Hanse
On Wed, Nov 28, 2012 at 04:29:01PM +0800, Wen Congyang wrote:
At 11/28/2012 12:08 PM, Jiang Liu Wrote:
 On 2012-11-28 11:24, Bob Liu wrote:
 On Tue, Nov 27, 2012 at 8:49 PM, Tang Chen tangc...@cn.fujitsu.com wrote:
 On 11/27/2012 08:09 PM, Bob Liu wrote:

 On Tue, Nov 27, 2012 at 4:29 PM, Tang Chentangc...@cn.fujitsu.com
 wrote:

 Hi Liu,


 This feature is used in memory hotplug.

 In order to implement a whole node hotplug, we need to make sure the
 node contains no kernel memory, because memory used by kernel could
 not be migrated. (Since the kernel memory is directly mapped,
 VA = PA + __PAGE_OFFSET. So the physical address could not be changed.)

 User could specify all the memory on a node to be movable, so that the
 node could be hot-removed.


 Thank you for your explanation. It's reasonable.

 But i think it's a bit duplicated with CMA, i'm not sure but maybe we
 can combine it with CMA which already in mainline?

 Hi Liu,

 Thanks for your advice. :)

 CMA is the Contiguous Memory Allocator, right?  What I'm trying to do is
 control where the start of ZONE_MOVABLE of each node is. Could
 CMA do this job ?

 cma will not control the start of ZONE_MOVABLE of each node, but it
 can declare a memory that always movable
 and all non movable allocate request will not happen on that area.

 Currently CMA uses a boot parameter "cma=" to declare a memory size
 that is always movable.
 I think it might fulfill your requirement if extending the boot
 parameter with a start address.

 more info at http://lwn.net/Articles/468044/

 And also, after a short investigation, CMA seems to need to be based on
 memblock. But we need to keep memblock from allocating memory in
 ZONE_MOVABLE. As a result, we need to know the ranges before memblock
 could be used. I'm afraid we still need an approach to get the ranges,
 such as a boot option, or from static ACPI tables such as SRAT/MPST.


 Yes, it's based on memblock and with boot option.
 In setup_arch32()
 dma_contiguous_reserve(0);   => will declare a cma area using
 memblock_reserve()

 I don't know much about CMA for now. So if you have any better idea,
 please share it with us, thanks. :)

 My idea is to reuse CMA like the below patch (not even compiled) and boot
 with "cma=size@start_address".
 I don't know whether it can work and whether suitable for your
 requirement, if not forgive me for this noises.

 diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
 index 612afcc..564962a 100644
 --- a/drivers/base/dma-contiguous.c
 +++ b/drivers/base/dma-contiguous.c
 @@ -59,11 +59,18 @@ struct cma *dma_contiguous_default_area;
   */
  static const unsigned long size_bytes = CMA_SIZE_MBYTES * SZ_1M;
  static long size_cmdline = -1;
 +static long cma_start_cmdline = -1;

  static int __init early_cma(char *p)
  {
 +   char *oldp;
 	pr_debug("%s(%s)\n", __func__, p);
 +   oldp = p;
 	size_cmdline = memparse(p, &p);
 +
 +   if (*p == '@')
 +   	cma_start_cmdline = memparse(p+1, &p);
 +   printk("cma start:0x%x, size: 0x%x\n", size_cmdline, cma_start_cmdline);
 	return 0;
  }
  early_param("cma", early_cma);
 @@ -127,8 +134,10 @@ void __init dma_contiguous_reserve(phys_addr_t limit)
 	if (selected_size) {
 		pr_debug("%s: reserving %ld MiB for global area\n", __func__,
 			 selected_size / SZ_1M);
 -
 -		dma_declare_contiguous(NULL, selected_size, 0, limit);
 +		if (cma_size_cmdline != -1)
 +			dma_declare_contiguous(NULL, selected_size, cma_start_cmdline, limit);
 +		else
 +			dma_declare_contiguous(NULL, selected_size, 0, limit);
 	}
  };
 Seems a good idea to reserve memory by reusing the CMA logic, though it needs
 more investigation. One of CMA's goals is to ensure pages in CMA are really
 movable, and this patchset tries to achieve the same goal at a first glance.

Hmm, I don't like reusing CMA, because CMA is used for DMA. If we reuse it
for movable memory, I think the movable zone is enough. And the start address is
not acceptable, because we want to specify the start address for each node.

I think we can implement movablecore_map like this:
1. parse the parameter
2. reserve the memory after efi_reserve_boot_services()
3. release the memory in mem_init()


Hi Tang,

I haven't read the patchset yet, but could you give a short describe how 
you design your implementation in this patchset?

Regards,
Jaegeuk

What about this?

Thanks
Wen Congyang
 
  
 
 
 


Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-28 Thread Tang Chen

On 11/29/2012 08:43 AM, Jaegeuk Hanse wrote:

Hi Tang,

I haven't read the patchset yet, but could you give a short describe how
you design your implementation in this patchset?

Regards,
Jaegeuk



Hi Jaegeuk,

Thanks for your joining in. :)

This feature is used in memory hotplug.

In order to implement a whole node hotplug, we need to make sure the
node contains no kernel memory, because memory used by kernel could
not be migrated. (Since the kernel memory is directly mapped,
VA = PA + __PAGE_OFFSET. So the physical address could not be changed.)

With this boot option, user could specify all the memory on a node to
be movable(which means they are in ZONE_MOVABLE), so that the node
could be hot-removed.

Thanks.





Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-28 Thread Jaegeuk Hanse
On Wed, Nov 28, 2012 at 04:47:42PM +0800, Jiang Liu wrote:
Hi all,
   This seems like a great chance to discuss the memory hotplug feature
within this thread, so I will try to give some high-level thoughts about memory
hotplug on x86/IA64. Any comments are welcome!
   First of all, I think usability really matters. Ideally, the memory hotplug
feature should just work out of the box, and we shouldn't expect administrators
to add several extra platform-dependent parameters to enable memory hotplug.
But how do we enable memory (or CPU/node) hotplug out of the box? I think the
key point is to cooperate with the BIOS/ACPI/firmware/device management teams.
   I still position memory hotplug as an advanced feature for high-end
servers, and those systems may/should provide some management interfaces to
configure CPU/memory/node hotplug features. The configuration UI may be provided
by the BIOS, a BMC, or a centralized system management suite. Once an
administrator enables the hotplug feature through those management UIs, the OS
should support system device hotplug out of the box. For example, the HP
SuperDome2 management suite provides an interface to configure a node as a
floating (hot-removable) node, and OpenSolaris supports CPU/memory hotplug out
of the box without any extra configuration. So we should shape the interfaces
between firmware and the OS to better support system device hotplug.
   On the other hand, I think there are no commercially available x86/IA64
platforms with system device hotplug capabilities in the field yet, or at most
in limited quantities. So backward compatibility is not a big issue for us now,
and I think it's doable to rely on firmware to provide better support for
system device hotplug.
   Then what should be enhanced to better support system device hotplug?

1) The ACPI specification should be enhanced to provide a static table that
describes components with hotplug features, so the OS could reserve special
resources for hotplug at early boot stages, for example reserving enough CPU
ids for CPU hot-add. Currently we guess the maximum number of CPUs supported
by the platform by counting CPU entries in the APIC table, which is not reliable.

2) The BIOS should implement the SRAT, MPST and PMTT tables to better support
memory hotplug. SRAT associates memory ranges with proximity domains, with an
extra hotpluggable flag. PMTT provides memory device topology information, such
as socket->memory controller->DIMM. MPST is used for memory power management
and provides a way to associate memory ranges with memory devices in PMTT.
With all the information from SRAT, MPST and PMTT, the OS could figure out
hotplug memory ranges automatically, so no extra kernel parameters are needed.

3) Enhance ACPICA to provide a method to scan static ACPI tables before
the memory subsystem has been initialized, because the OS needs to access
SRAT, MPST and PMTT when initializing the memory subsystem.

4) The last and most important issue is how to minimize the performance
drop caused by memory hotplug. As proposed by this patchset, once we
configure all memory of a NUMA node as movable, it essentially disables
NUMA optimization of kernel memory allocation from that node. According
to experience, that will cause a huge performance drop. We have observed
a 10-30% performance drop with memory hotplug enabled, and on another
OS the average performance drop caused by memory hotplug is about 10%.
If we can't resolve the performance drop, memory hotplug is just a feature
for demos :( With help from hardware, we do have some chances to reduce
the performance penalty caused by memory hotplug.
   As we know, Linux can migrate movable pages, but can't migrate
non-movable pages used by the kernel/DMA etc. And the hardest part is how
to deal with those unmovable pages when hot-removing a memory device.
Now hardware has given us a hand with a technology named memory migration,
which can transparently migrate memory between memory devices. There are
no OS-visible changes, except NUMA topology, before and after hardware
memory migration.
   And if there are multiple memory devices within a NUMA node,
we could configure some memory devices to host unmovable memory and the
others to host movable memory. With this configuration, there won't be a
big performance drop because we have preserved all NUMA optimizations.
We could also achieve memory hot-remove by:
1) Using the existing page migration mechanism to reclaim movable pages.
2) For memory devices hosting unmovable pages, we need to:
2.1) find a movable memory device on other nodes with enough capacity
and reclaim it.
2.2) use hardware migration technology to migrate unmovable memory to

Hi Jiang,

Could you give an explanation how hardware migration technology works?

Regards,
Jaegeuk

the just-reclaimed memory device on other nodes.

   I hope we can expect users to adopt memory hotplug technology
once all of this is implemented.

   Back to this patch, we could rely on the mechanism it provides
to automatically mark memory ranges as movable with 

Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-28 Thread Jiang Liu
On 2012-11-29 9:42, Jaegeuk Hanse wrote:
 On Wed, Nov 28, 2012 at 04:47:42PM +0800, Jiang Liu wrote:
 Hi all,
  Seems it's a great chance to discuss about the memory hotplug feature
 within this thread. So I will try to give some high level thoughts about 
 memory
 hotplug feature on x86/IA64. Any comments are welcomed!
  First of all, I think usability really matters. Ideally, memory hotplug
 feature should just work out of box, and we shouldn't expect administrators 
 to 
 add several extra platform dependent parameters to enable memory hotplug. 
 But how to enable memory (or CPU/node) hotplug out of box? I think the key 
 point
 is to cooperate with BIOS/ACPI/firmware/device management teams. 
  I still position memory hotplug as an advanced feature for high end 
 servers and those systems may/should provide some management interfaces to 
 configure CPU/memory/node hotplug features. The configuration UI may be 
 provided
 by BIOS, BMC or centralized system management suite. Once administrator 
 enables
 hotplug feature through those management UI, OS should support system device
 hotplug out of box. For example, HP SuperDome2 management suite provides 
 interface
 to configure a node as floating node(hot-removable). And OpenSolaris supports
 CPU/memory hotplug out of box without any extra configurations. So we should
 shape interfaces between firmware and OS to better support system device 
 hotplug.
  On the other hand, I think there are no commercial available x86/IA64
 platforms with system device hotplug capabilities in the field yet, at least 
 only
 limited quantity if any. So backward compatibility is not a big issue for us 
 now.
 So I think it's doable to rely on firmware to provide better support for 
 system
 device hotplug.
  Then what should be enhanced to better support system device hotplug?

 1) ACPI specification should be enhanced to provide a static table to 
 describe
 components with hotplug features, so OS could reserve special resources for
 hotplug at early boot stages. For example, to reserve enough CPU ids for CPU
 hot-add. Currently we guess maximum number of CPUs supported by the platform
 by counting CPU entries in APIC table, that's not reliable.

 2) BIOS should implement SRAT, MPST and PMTT tables to better support memory
 hotplug. SRAT associates memory ranges with proximity domains with an extra
 hotpluggable flag. PMTT provides memory device topology information, such
 as socket->memory controller->DIMM. MPST is used for memory power 
 management
 and provides a way to associate memory ranges with memory devices in PMTT.
 With all information from SRAT, MPST and PMTT, OS could figure out hotplug
 memory ranges automatically, so no extra kernel parameters needed.

 3) Enhance ACPICA to provide a method to scan static ACPI tables before
 memory subsystem has been initialized because OS need to access SRAT,
 MPST and PMTT when initializing memory subsystem.

 4) The last and the most important issue is how to minimize performance
 drop caused by memory hotplug. As proposed by this patchset, once we
 configure all memory of a NUMA node as movable, it essentially disable
 NUMA optimization of kernel memory allocation from that node. According
 to experience, that will cause huge performance drop. We have observed
 10-30% performance drop with memory hotplug enabled. And on another
 OS the average performance drop caused by memory hotplug is about 10%.
 If we can't resolve the performance drop, memory hotplug is just a feature
 for demo:( With help from hardware, we do have some chances to reduce
 performance penalty caused by memory hotplug.
  As we know, Linux could migrate movable page, but can't migrate
 non-movable pages used by kernel/DMA etc. And the most hard part is how
 to deal with those unmovable pages when hot-removing a memory device.
 Now hardware has given us a hand with a technology named memory migration,
 which could transparently migrate memory between memory devices. There's
 no OS visible changes except NUMA topology before and after hardware memory
 migration.
  And if there are multiple memory devices within a NUMA node,
 we could configure some memory devices to host unmovable memory and the
 other to host movable memory. With this configuration, there won't be
 bigger performance drop because we have preserved all NUMA optimizations.
 We also could achieve memory hotplug remove by:
 1) Use existing page migration mechanism to reclaim movable pages.
 2) For memory devices hosting unmovable pages, we need:
 2.1) find a movable memory device on other nodes with enough capacity
 and reclaim it.
 2.2) use hardware migration technology to migrate unmovable memory to
 
 Hi Jiang,
 
 Could you give an explanation how hardware migration technology works?
Hi Jaegeuk,
Now some servers support a hardware memory RAS feature called memory
mirroring, something like RAID1. The mirrored memory devices will be configured
with the same 

Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-28 Thread Jiang Liu
On 2012-11-29 10:49, Wanpeng Li wrote:
 On Thu, Nov 29, 2012 at 10:25:40AM +0800, Jiang Liu wrote:
 On 2012-11-29 9:42, Jaegeuk Hanse wrote:
 On Wed, Nov 28, 2012 at 04:47:42PM +0800, Jiang Liu wrote:
 Hi all,
Seems it's a great chance to discuss about the memory hotplug feature
 within this thread. So I will try to give some high level thoughts about 
 memory
 hotplug feature on x86/IA64. Any comments are welcomed!
First of all, I think usability really matters. Ideally, memory hotplug
 feature should just work out of box, and we shouldn't expect 
 administrators to 
 add several extra platform dependent parameters to enable memory hotplug. 
 But how to enable memory (or CPU/node) hotplug out of box? I think the key 
 point
 is to cooperate with BIOS/ACPI/firmware/device management teams. 
I still position memory hotplug as an advanced feature for high end 
 servers and those systems may/should provide some management interfaces to 
 configure CPU/memory/node hotplug features. The configuration UI may be 
 provided
 by BIOS, BMC or centralized system management suite. Once administrator 
 enables
 hotplug feature through those management UI, OS should support system 
 device
 hotplug out of box. For example, HP SuperDome2 management suite provides 
 interface
 to configure a node as floating node(hot-removable). And OpenSolaris 
 supports
 CPU/memory hotplug out of box without any extra configurations. So we 
 should
 shape interfaces between firmware and OS to better support system device 
 hotplug.
On the other hand, I think there are no commercial available x86/IA64
 platforms with system device hotplug capabilities in the field yet, at 
 least only
 limited quantity if any. So backward compatibility is not a big issue for 
 us now.
 So I think it's doable to rely on firmware to provide better support for 
 system
 device hotplug.
Then what should be enhanced to better support system device hotplug?

 1) ACPI specification should be enhanced to provide a static table to 
 describe
 components with hotplug features, so OS could reserve special resources for
 hotplug at early boot stages. For example, to reserve enough CPU ids for 
 CPU
 hot-add. Currently we guess maximum number of CPUs supported by the 
 platform
 by counting CPU entries in APIC table, that's not reliable.

 2) BIOS should implement SRAT, MPST and PMTT tables to better support 
 memory
 hotplug. SRAT associates memory ranges with proximity domains with an extra
 hotpluggable flag. PMTT provides memory device topology information, such
 as socket->memory controller->DIMM. MPST is used for memory power 
 management
 and provides a way to associate memory ranges with memory devices in PMTT.
 With all information from SRAT, MPST and PMTT, OS could figure out hotplug
 memory ranges automatically, so no extra kernel parameters needed.

 3) Enhance ACPICA to provide a method to scan static ACPI tables before
 memory subsystem has been initialized because OS need to access SRAT,
 MPST and PMTT when initializing memory subsystem.

 4) The last and the most important issue is how to minimize performance
 drop caused by memory hotplug. As proposed by this patchset, once we
 configure all memory of a NUMA node as movable, it essentially disable
 NUMA optimization of kernel memory allocation from that node. According
 to experience, that will cause huge performance drop. We have observed
 10-30% performance drop with memory hotplug enabled. And on another
 OS the average performance drop caused by memory hotplug is about 10%.
 If we can't resolve the performance drop, memory hotplug is just a feature
 for demo:( With help from hardware, we do have some chances to reduce
 performance penalty caused by memory hotplug.
As we know, Linux could migrate movable page, but can't migrate
 non-movable pages used by kernel/DMA etc. And the most hard part is how
 to deal with those unmovable pages when hot-removing a memory device.
 Now hardware has given us a hand with a technology named memory migration,
 which could transparently migrate memory between memory devices. There's
 no OS visible changes except NUMA topology before and after hardware memory
 migration.
And if there are multiple memory devices within a NUMA node,
 we could configure some memory devices to host unmovable memory and the
 other to host movable memory. With this configuration, there won't be
 bigger performance drop because we have preserved all NUMA optimizations.
 We also could achieve memory hotplug remove by:
 1) Use existing page migration mechanism to reclaim movable pages.
 2) For memory devices hosting unmovable pages, we need:
 2.1) find a movable memory device on other nodes with enough capacity
 and reclaim it.
 2.2) use hardware migration technology to migrate unmovable memory to

 Hi Jiang,

 Could you give an explanation how hardware migration technology works?
 Hi Jaegeuk,
  Now some servers support a hardware memory RAS feature called 

Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-27 Thread Jiang Liu
Hi Chen,

If a pageblock's migration type is movable, it may be converted to
reclaimable under memory pressure. CMA is introduced to guarantee
that pages of CMA won't be converted to other migratetypes.

And we are trying to avoid allocating kernel/DMA memory from specific
memory ranges, so we could easily reclaim pages when hot-removing
memory devices. 

I think the idea is not to directly reuse CMA for hotplug, but to
reuse the mechanism to reserve specific memory ranges from the bootmem
allocator, so CMA and hotplug could share the same code.
Basically, we may try to reuse dma_declare_contiguous(), so that
we don't need to add special logic to the bootmem allocator.

Regards!
Gerry

On 2012-11-28 14:16, Tang Chen wrote:
> Hi Bob, Liu Jiang,
> 
> About CMA, could you give me more info ?
> Thanks for your patience and nice advice. :)
> 
> 
> 1) I saw the following on http://lwn.net/Articles/447405/:
> 
> The "CMA" type is sticky; pages which are marked as being for CMA
> should never have their migration type changed by the kernel.
> 
> As Wen said, we now support a user interface to change movable memory
> into kernel memory. But seeing from above, the memory specified as
> CMA will not be able to be changed, right ?  If so, I don't think
> using CMA is a good idea.
> 
> 
> 2) Is CMA just implemented on ARM platform ?  I found the following in
> kernel-parameters.txt.
> 
> cma=nn[MG]  [ARM,KNL]
> Sets the size of kernel global memory area for contiguous
> memory allocations. For more information, see
> include/linux/dma-contiguous.h
> 
> We are developing on x86. Could we use it ?
> 
> 
> 3) Is CMA just used for DMA ? I am a little confused here. :)
> I found the main code of CMA is implemented in dma-contiguous.c.
> 
> 
> 4) The boot options cma=xxx and movablecore_map=xxx have different
> meanings for user. Reusing CMA could make user confused, I'm afraid.
> 
> And, even if we reuse "cma=" option, we still need to do the work
> in patch 3~5, right ?
> 
> 
> Thanks. :)
> 
> 
> 
> On 11/28/2012 12:08 PM, Jiang Liu wrote:
>> On 2012-11-28 11:24, Bob Liu wrote:
>>> On Tue, Nov 27, 2012 at 8:49 PM, Tang Chen  wrote:
 On 11/27/2012 08:09 PM, Bob Liu wrote:
>
> On Tue, Nov 27, 2012 at 4:29 PM, Tang Chen
> wrote:
>>
>> Hi Liu,
>>
>>
>> This feature is used in memory hotplug.
>>
>> In order to implement a whole node hotplug, we need to make sure the
>> node contains no kernel memory, because memory used by kernel could
>> not be migrated. (Since the kernel memory is directly mapped,
>> VA = PA + __PAGE_OFFSET. So the physical address could not be changed.)
>>
>> User could specify all the memory on a node to be movable, so that the
>> node could be hot-removed.
>>
>
> Thank you for your explanation. It's reasonable.
>
> But i think it's a bit duplicated with CMA, i'm not sure but maybe we
> can combine it with CMA which already in mainline?
>
 Hi Liu,

 Thanks for your advice. :)

 CMA is Contiguous Memory Allocator, right?  What I'm trying to do is
 controlling where is the start of ZONE_MOVABLE of each node. Could
 CMA do this job ?
>>>
>>> cma will not control the start of ZONE_MOVABLE of each node, but it
>>> can declare a memory that always movable
>>> and all non movable allocate request will not happen on that area.
>>>
>>> Currently cma use a boot parameter "cma=" to declare a memory size
>>> that always movable.
>>> I think it might fulfill your requirement if extending the boot
>>> parameter with a start address.
>>>
>>> more info at http://lwn.net/Articles/468044/

 And also, after a short investigation, CMA seems need to base on
 memblock. But we need to limit memblock not to allocate memory on
 ZONE_MOVABLE. As a result, we need to know the ranges before memblock
 could be used. I'm afraid we still need an approach to get the ranges,
 such as a boot option, or from static ACPI tables such as SRAT/MPST.

>>>
>>> Yes, it's based on memblock and with boot option.
>>> In setup_arch32()
>>>  dma_contiguous_reserve(0);   =>  will declare a cma area using
>>> memblock_reserve()
>>>
 I don't know much about CMA for now. So if you have any better idea,
 please share it with us, thanks. :)
>>>
>>> My idea is reuse cma like below patch(even not compiled) and boot with
>>> "cma=size@start_address".
>>> I don't know whether it can work and whether suitable for your
>>> requirement, if not forgive me for this noises.
>>>
>>> diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
>>> index 612afcc..564962a 100644
>>> --- a/drivers/base/dma-contiguous.c
>>> +++ b/drivers/base/dma-contiguous.c
>>> @@ -59,11 +59,18 @@ struct cma *dma_contiguous_default_area;
>>>*/
>>>   static const unsigned long size_bytes = CMA_SIZE_MBYTES * SZ_1M;
>>>   static long size_cmdline = -1;
>>> +static long cma_start_cmdline = -1;
>>>

Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-27 Thread Tang Chen

Hi Bob, Liu Jiang,

About CMA, could you give me more info ?
Thanks for your patience and nice advice. :)


1) I saw the following on http://lwn.net/Articles/447405/:

The "CMA" type is sticky; pages which are marked as being for CMA
should never have their migration type changed by the kernel.

As Wen said, we now support a user interface to change movable memory
into kernel memory. But seeing from above, the memory specified as
CMA will not be able to be changed, right ?  If so, I don't think
using CMA is a good idea.


2) Is CMA just implemented on ARM platform ?  I found the following in
kernel-parameters.txt.

cma=nn[MG]  [ARM,KNL]
Sets the size of kernel global memory area for contiguous
memory allocations. For more information, see
include/linux/dma-contiguous.h

We are developing on x86. Could we use it ?


3) Is CMA just used for DMA ? I am a little confused here. :)
I found the main code of CMA is implemented in dma-contiguous.c.


4) The boot options cma=xxx and movablecore_map=xxx have different
meanings for user. Reusing CMA could make user confused, I'm afraid.

And, even if we reuse "cma=" option, we still need to do the work
in patch 3~5, right ?


Thanks. :)



On 11/28/2012 12:08 PM, Jiang Liu wrote:

On 2012-11-28 11:24, Bob Liu wrote:

On Tue, Nov 27, 2012 at 8:49 PM, Tang Chen  wrote:

On 11/27/2012 08:09 PM, Bob Liu wrote:


On Tue, Nov 27, 2012 at 4:29 PM, Tang Chen
wrote:


Hi Liu,


This feature is used in memory hotplug.

In order to implement a whole node hotplug, we need to make sure the
node contains no kernel memory, because memory used by kernel could
not be migrated. (Since the kernel memory is directly mapped,
VA = PA + __PAGE_OFFSET. So the physical address could not be changed.)

User could specify all the memory on a node to be movable, so that the
node could be hot-removed.



Thank you for your explanation. It's reasonable.

But i think it's a bit duplicated with CMA, i'm not sure but maybe we
can combine it with CMA which already in mainline?


Hi Liu,

Thanks for your advice. :)

CMA is Contiguous Memory Allocator, right?  What I'm trying to do is
controlling where is the start of ZONE_MOVABLE of each node. Could
CMA do this job ?


cma will not control the start of ZONE_MOVABLE of each node, but it
can declare a memory that always movable
and all non movable allocate request will not happen on that area.

Currently cma use a boot parameter "cma=" to declare a memory size
that always movable.
I think it might fulfill your requirement if extending the boot
parameter with a start address.

more info at http://lwn.net/Articles/468044/


And also, after a short investigation, CMA seems need to base on
memblock. But we need to limit memblock not to allocate memory on
ZONE_MOVABLE. As a result, we need to know the ranges before memblock
could be used. I'm afraid we still need an approach to get the ranges,
such as a boot option, or from static ACPI tables such as SRAT/MPST.



Yes, it's based on memblock and with boot option.
In setup_arch32()
 dma_contiguous_reserve(0);   =>  will declare a cma area using
memblock_reserve()


I don't know much about CMA for now. So if you have any better idea,
please share it with us, thanks. :)


My idea is reuse cma like below patch(even not compiled) and boot with
"cma=size@start_address".
I don't know whether it can work and whether suitable for your
requirement, if not forgive me for this noises.

diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
index 612afcc..564962a 100644
--- a/drivers/base/dma-contiguous.c
+++ b/drivers/base/dma-contiguous.c
@@ -59,11 +59,18 @@ struct cma *dma_contiguous_default_area;
   */
  static const unsigned long size_bytes = CMA_SIZE_MBYTES * SZ_1M;
  static long size_cmdline = -1;
+static long cma_start_cmdline = -1;

  static int __init early_cma(char *p)
  {
+   char *oldp;
 pr_debug("%s(%s)\n", __func__, p);
+   oldp = p;
 size_cmdline = memparse(p, &p);
+
+   if (*p == '@')
+   cma_start_cmdline = memparse(p+1, &p);
+   printk("cma start:0x%x, size: 0x%x\n", size_cmdline, cma_start_cmdline);
 return 0;
  }
  early_param("cma", early_cma);
@@ -127,8 +134,10 @@ void __init dma_contiguous_reserve(phys_addr_t limit)
 if (selected_size) {
 pr_debug("%s: reserving %ld MiB for global area\n", __func__,
  selected_size / SZ_1M);
-
-   dma_declare_contiguous(NULL, selected_size, 0, limit);
+   if (cma_size_cmdline != -1)
+   dma_declare_contiguous(NULL, selected_size,
cma_start_cmdline, limit);
+   else
+   dma_declare_contiguous(NULL, selected_size, 0, limit);
 }
  };

Seems a good idea to reserve memory by reusing CMA logic, though need more
investigation here. One of CMA goal is to ensure pages in CMA are really
movable, and this patchset tries to achieve the same goal at a 

Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-27 Thread Jiang Liu
On 2012-11-28 13:21, Wen Congyang wrote:
> At 11/28/2012 12:01 PM, Jiang Liu Wrote:
>> On 2012-11-28 11:47, Tang Chen wrote:
>>> On 11/27/2012 11:10 AM, wujianguo wrote:

 Hi Tang,
 DMA addresses can't be set as movable; if someone boots the kernel with
 movablecore_map=4G@0xa0 or another memory region that contains DMA
 addresses, the system may fail to boot. Should this case be handled or
 mentioned in the changelog and kernel-parameters.txt?
>>>
>>> Hi Wu,
>>>
>>> I think we can use MAX_DMA_PFN and MAX_DMA32_PFN to prevent setting DMA
>>> addresses as movable: just ignore addresses lower than them, and set
>>> the rest as movable. What do you think?
>>>
>>> And, since we cannot figure out the minimum of memory kernel needs, I
>>> think for now, we can just add some warning into kernel-parameters.txt.
>>>
>>> Thanks. :)
>> On one other OS, there is a mechanism to dynamically convert pages from
>> movable zones into normal zones.
> 
> Does the OS do it automatically? Or does the user convert it?
> 
> We can convert pages from movable zones into normal zones by the following
> interface:
> echo online_kernel >/sys/devices/system/memory/memoryX/state
> 
> We have posted a patchset to implement it, and it is in mm tree now.
The OS converts it automatically; no manual operation is needed.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-27 Thread Wen Congyang
At 11/28/2012 12:01 PM, Jiang Liu Wrote:
> On 2012-11-28 11:47, Tang Chen wrote:
>> On 11/27/2012 11:10 AM, wujianguo wrote:
>>>
>>> Hi Tang,
>>> A DMA address can't be set as movable. If someone boots the kernel with
>>> movablecore_map=4G@0xa0 or another memory region that contains a DMA
>>> address, the system may fail to boot. Should this case be handled or
>>> mentioned in the change log and kernel-parameters.txt?
>>
>> Hi Wu,
>>
>> I think we can use MAX_DMA_PFN and MAX_DMA32_PFN to prevent setting a DMA
>> address as movable. Just ignore addresses lower than them, and set
>> the rest as movable. What do you think?
>>
>> And, since we cannot figure out the minimum amount of memory the kernel
>> needs, I think for now we can just add a warning to kernel-parameters.txt.
>>
>> Thanks. :)
> On another OS, there is a mechanism to dynamically convert pages from
> movable zones into normal zones.

Does the OS do it automatically, or does the user convert it?

We can convert pages from movable zones into normal zones by the following
interface:
echo online_kernel >/sys/devices/system/memory/memoryX/state

We have posted a patchset to implement it, and it is in mm tree now.

Thanks
Wen Congyang

> 
> Regards!
> Gerry
> 
>>
>>>
>>> Thanks,
>>> Jianguo Wu
>>>
>>
>> .
>>
> 
> 
> 



Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-27 Thread Jianguo Wu
On 2012/11/28 11:47, Tang Chen wrote:

> On 11/27/2012 11:10 AM, wujianguo wrote:
>>
>> Hi Tang,
>> A DMA address can't be set as movable. If someone boots the kernel with
>> movablecore_map=4G@0xa0 or another memory region that contains a DMA
>> address, the system may fail to boot. Should this case be handled or
>> mentioned in the change log and kernel-parameters.txt?
> 
> Hi Wu,
> 
> I think we can use MAX_DMA_PFN and MAX_DMA32_PFN to prevent setting a DMA
> address as movable. Just ignore addresses lower than them, and set
> the rest as movable. What do you think?
> 

I think it's OK for now.

> And, since we cannot figure out the minimum amount of memory the kernel
> needs, I think for now we can just add a warning to kernel-parameters.txt.
> 
> Thanks. :)
> 
>>
>> Thanks,
>> Jianguo Wu
>>
> 
> .
> 





Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-27 Thread Jiang Liu
On 2012-11-28 11:24, Bob Liu wrote:
> On Tue, Nov 27, 2012 at 8:49 PM, Tang Chen  wrote:
>> On 11/27/2012 08:09 PM, Bob Liu wrote:
>>>
>>> On Tue, Nov 27, 2012 at 4:29 PM, Tang Chen
>>> wrote:

 Hi Liu,


 This feature is used in memory hotplug.

 In order to implement a whole node hotplug, we need to make sure the
 node contains no kernel memory, because memory used by kernel could
 not be migrated. (Since the kernel memory is directly mapped,
 VA = PA + __PAGE_OFFSET. So the physical address could not be changed.)

 User could specify all the memory on a node to be movable, so that the
 node could be hot-removed.

>>>
>>> Thank you for your explanation. It's reasonable.
>>>
>>> But i think it's a bit duplicated with CMA, i'm not sure but maybe we
>>> can combine it with CMA which already in mainline?
>>>
>> Hi Liu,
>>
>> Thanks for your advice. :)
>>
>> CMA is the Contiguous Memory Allocator, right?  What I'm trying to do is
>> control where the start of ZONE_MOVABLE is on each node. Could
>> CMA do this job?
> 
> CMA will not control the start of ZONE_MOVABLE on each node, but it
> can declare a memory area that is always movable,
> and no non-movable allocation request will land in that area.
> 
> Currently CMA uses the boot parameter "cma=" to declare a memory size
> that is always movable.
> I think it might fulfill your requirement if the boot
> parameter were extended with a start address.
> 
> more info at http://lwn.net/Articles/468044/
>>
>> And also, after a short investigation, CMA seems to be based on
>> memblock. But we need to prevent memblock from allocating memory in
>> ZONE_MOVABLE. As a result, we need to know the ranges before memblock
>> can be used. I'm afraid we still need some way to get the ranges,
>> such as a boot option, or static ACPI tables such as SRAT/MPST.
>>
> 
> Yes, it's based on memblock and with boot option.
> In setup_arch32()
> dma_contiguous_reserve(0);   => will declare a cma area using
> memblock_reserve()
> 
>> I don't know much about CMA for now. So if you have any better ideas,
>> please share them with us, thanks. :)
> 
> My idea is to reuse CMA as in the patch below (not even compiled) and boot
> with "cma=size@start_address".
> I don't know whether it works or whether it suits your
> requirement; if not, forgive the noise.
> 
> diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
> index 612afcc..564962a 100644
> --- a/drivers/base/dma-contiguous.c
> +++ b/drivers/base/dma-contiguous.c
> @@ -59,11 +59,18 @@ struct cma *dma_contiguous_default_area;
>   */
>  static const unsigned long size_bytes = CMA_SIZE_MBYTES * SZ_1M;
>  static long size_cmdline = -1;
> +static long cma_start_cmdline = -1;
> 
>  static int __init early_cma(char *p)
>  {
> +   char *oldp;
> pr_debug("%s(%s)\n", __func__, p);
> +   oldp = p;
> size_cmdline = memparse(p, &p);
> +
> +   if (*p == '@')
> +   cma_start_cmdline = memparse(p+1, &p);
> +   printk("cma start: 0x%lx, size: 0x%lx\n", cma_start_cmdline,
> size_cmdline);
> return 0;
>  }
>  early_param("cma", early_cma);
> @@ -127,8 +134,10 @@ void __init dma_contiguous_reserve(phys_addr_t limit)
> if (selected_size) {
> pr_debug("%s: reserving %ld MiB for global area\n", __func__,
>  selected_size / SZ_1M);
> -
> -   dma_declare_contiguous(NULL, selected_size, 0, limit);
> +   if (cma_start_cmdline != -1)
> +   dma_declare_contiguous(NULL, selected_size,
> cma_start_cmdline, limit);
> +   else
> +   dma_declare_contiguous(NULL, selected_size, 0, limit);
> }
>  };
Seems a good idea to reserve memory by reusing the CMA logic, though it needs more
investigation. One of CMA's goals is to ensure that pages in CMA are really
movable, and this patchset tries to achieve the same goal at first glance.

 




Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-27 Thread Jiang Liu
On 2012-11-28 11:47, Tang Chen wrote:
> On 11/27/2012 11:10 AM, wujianguo wrote:
>>
>> Hi Tang,
>> A DMA address can't be set as movable. If someone boots the kernel with
>> movablecore_map=4G@0xa0 or another memory region that contains a DMA
>> address, the system may fail to boot. Should this case be handled or
>> mentioned in the change log and kernel-parameters.txt?
> 
> Hi Wu,
> 
> I think we can use MAX_DMA_PFN and MAX_DMA32_PFN to prevent setting a DMA
> address as movable. Just ignore addresses lower than them, and set
> the rest as movable. What do you think?
> 
> And, since we cannot figure out the minimum amount of memory the kernel
> needs, I think for now we can just add a warning to kernel-parameters.txt.
> 
> Thanks. :)
On another OS, there is a mechanism to dynamically convert pages from
movable zones into normal zones.

Regards!
Gerry

> 
>>
>> Thanks,
>> Jianguo Wu
>>
> 
> .
> 




Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-27 Thread Tang Chen

On 11/27/2012 11:10 AM, wujianguo wrote:


Hi Tang,
A DMA address can't be set as movable. If someone boots the kernel with
movablecore_map=4G@0xa0 or another memory region that contains a DMA address,
the system may fail to boot. Should this case be handled or mentioned
in the change log and kernel-parameters.txt?


Hi Wu,

I think we can use MAX_DMA_PFN and MAX_DMA32_PFN to prevent setting a DMA
address as movable. Just ignore addresses lower than them, and set
the rest as movable. What do you think?

And, since we cannot figure out the minimum amount of memory the kernel
needs, I think for now we can just add a warning to kernel-parameters.txt.

Thanks. :)
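The suggestion above — ignore everything below MAX_DMA_PFN / MAX_DMA32_PFN — amounts to clamping the requested movable range. A minimal userspace sketch, assuming the x86_64 DMA32 boundary of 4 GiB; the names `MAX_DMA32_PFN_SKETCH` and `clamp_movable_range` are hypothetical, not kernel API:

```c
#define PAGE_SHIFT_SKETCH 12
/* On x86_64, ZONE_DMA32 ends at 4 GiB; pfn = phys >> PAGE_SHIFT. */
#define MAX_DMA32_PFN_SKETCH (1ULL << (32 - PAGE_SHIFT_SKETCH))

/* Clamp a requested movable pfn range [*start_pfn, end_pfn) so it never
 * covers the DMA/DMA32 zones.  Returns 1 if a movable part remains,
 * 0 if the whole request fell below the DMA32 boundary. */
static int clamp_movable_range(unsigned long long *start_pfn,
			       unsigned long long end_pfn)
{
	if (*start_pfn < MAX_DMA32_PFN_SKETCH)
		*start_pfn = MAX_DMA32_PFN_SKETCH;
	return *start_pfn < end_pfn;
}
```

With this, a movablecore_map request that starts inside the DMA zones simply loses its low part, and a request entirely below 4 GiB is rejected, rather than failing the boot.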



Thanks,
Jianguo Wu




Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-27 Thread Bob Liu
On Tue, Nov 27, 2012 at 8:49 PM, Tang Chen  wrote:
> On 11/27/2012 08:09 PM, Bob Liu wrote:
>>
>> On Tue, Nov 27, 2012 at 4:29 PM, Tang Chen
>> wrote:
>>>
>>> Hi Liu,
>>>
>>>
>>> This feature is used in memory hotplug.
>>>
>>> In order to implement a whole node hotplug, we need to make sure the
>>> node contains no kernel memory, because memory used by kernel could
>>> not be migrated. (Since the kernel memory is directly mapped,
>>> VA = PA + __PAGE_OFFSET. So the physical address could not be changed.)
>>>
>>> User could specify all the memory on a node to be movable, so that the
>>> node could be hot-removed.
>>>
>>
>> Thank you for your explanation. It's reasonable.
>>
>> But i think it's a bit duplicated with CMA, i'm not sure but maybe we
>> can combine it with CMA which already in mainline?
>>
> Hi Liu,
>
> Thanks for your advice. :)
>
> CMA is the Contiguous Memory Allocator, right?  What I'm trying to do is
> control where the start of ZONE_MOVABLE is on each node. Could
> CMA do this job?

CMA will not control the start of ZONE_MOVABLE on each node, but it
can declare a memory area that is always movable,
and no non-movable allocation request will land in that area.

Currently CMA uses the boot parameter "cma=" to declare a memory size
that is always movable.
I think it might fulfill your requirement if the boot
parameter were extended with a start address.

more info at http://lwn.net/Articles/468044/
>
> And also, after a short investigation, CMA seems to be based on
> memblock. But we need to prevent memblock from allocating memory in
> ZONE_MOVABLE. As a result, we need to know the ranges before memblock
> can be used. I'm afraid we still need some way to get the ranges,
> such as a boot option, or static ACPI tables such as SRAT/MPST.
>

Yes, it's based on memblock and with boot option.
In setup_arch32()
dma_contiguous_reserve(0);   => will declare a cma area using
memblock_reserve()

> I don't know much about CMA for now. So if you have any better ideas,
> please share them with us, thanks. :)

My idea is to reuse CMA as in the patch below (not even compiled) and boot
with "cma=size@start_address".
I don't know whether it works or whether it suits your
requirement; if not, forgive the noise.

diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
index 612afcc..564962a 100644
--- a/drivers/base/dma-contiguous.c
+++ b/drivers/base/dma-contiguous.c
@@ -59,11 +59,18 @@ struct cma *dma_contiguous_default_area;
  */
 static const unsigned long size_bytes = CMA_SIZE_MBYTES * SZ_1M;
 static long size_cmdline = -1;
+static long cma_start_cmdline = -1;

 static int __init early_cma(char *p)
 {
+   char *oldp;
pr_debug("%s(%s)\n", __func__, p);
+   oldp = p;
size_cmdline = memparse(p, &p);
+
+   if (*p == '@')
+   cma_start_cmdline = memparse(p+1, &p);
+   printk("cma start: 0x%lx, size: 0x%lx\n", cma_start_cmdline, size_cmdline);
return 0;
 }
 early_param("cma", early_cma);
@@ -127,8 +134,10 @@ void __init dma_contiguous_reserve(phys_addr_t limit)
if (selected_size) {
pr_debug("%s: reserving %ld MiB for global area\n", __func__,
 selected_size / SZ_1M);
-
-   dma_declare_contiguous(NULL, selected_size, 0, limit);
+   if (cma_start_cmdline != -1)
+   dma_declare_contiguous(NULL, selected_size,
cma_start_cmdline, limit);
+   else
+   dma_declare_contiguous(NULL, selected_size, 0, limit);
}
 };
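The size@start parsing in the patch above leans on the kernel's memparse(). As a rough userspace illustration of the same "cma=size@start" grammar — `memparse_sketch` and `parse_cma_arg` are hypothetical names, not kernel API — it could be parsed like this:

```c
#include <stdlib.h>

/* Userspace stand-in for the kernel's memparse(): parse a number with an
 * optional K/M/G suffix and advance *retp past what was consumed. */
static unsigned long long memparse_sketch(const char *p, char **retp)
{
	unsigned long long v = strtoull(p, retp, 0);

	switch (**retp) {
	case 'G': case 'g': v <<= 30; (*retp)++; break;
	case 'M': case 'm': v <<= 20; (*retp)++; break;
	case 'K': case 'k': v <<= 10; (*retp)++; break;
	}
	return v;
}

/* Parse "size[@start]" as in the proposed cma= extension.
 * *start is left at -1 when no @start part is present.
 * Returns 0 on success, -1 on trailing junk. */
static int parse_cma_arg(const char *arg, long long *size, long long *start)
{
	char *p;

	*size = (long long)memparse_sketch(arg, &p);
	*start = -1;
	if (*p == '@')
		*start = (long long)memparse_sketch(p + 1, &p);
	return (*p == '\0') ? 0 : -1;
}
```

With this sketch, "cma=64M@0x100000000" yields a 64 MiB size and a 4 GiB start address, matching the intent of the patch.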

-- 
Regards,
--Bob


Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-27 Thread Tang Chen

On 11/27/2012 08:09 PM, Bob Liu wrote:

On Tue, Nov 27, 2012 at 4:29 PM, Tang Chen  wrote:

Hi Liu,

This feature is used in memory hotplug.

In order to implement a whole node hotplug, we need to make sure the
node contains no kernel memory, because memory used by kernel could
not be migrated. (Since the kernel memory is directly mapped,
VA = PA + __PAGE_OFFSET. So the physical address could not be changed.)

User could specify all the memory on a node to be movable, so that the
node could be hot-removed.



Thank you for your explanation. It's reasonable.

But i think it's a bit duplicated with CMA, i'm not sure but maybe we
can combine it with CMA which already in mainline?


Hi Liu,

Thanks for your advice. :)

CMA is the Contiguous Memory Allocator, right?  What I'm trying to do is
control where the start of ZONE_MOVABLE is on each node. Could
CMA do this job?

And also, after a short investigation, CMA seems to be based on
memblock. But we need to prevent memblock from allocating memory in
ZONE_MOVABLE. As a result, we need to know the ranges before memblock
can be used. I'm afraid we still need some way to get the ranges,
such as a boot option, or static ACPI tables such as SRAT/MPST.

I don't know much about CMA for now. So if you have any better ideas,
please share them with us, thanks. :)




Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-27 Thread Bob Liu
On Tue, Nov 27, 2012 at 4:29 PM, Tang Chen  wrote:
> On 11/27/2012 04:00 PM, Bob Liu wrote:
>>
>> Hi Tang,
>>
>> On Fri, Nov 23, 2012 at 6:44 PM, Tang Chen
>> wrote:
>>>
>>> [What we are doing]
>>> This patchset provides a boot option for users to specify the ZONE_MOVABLE
>>> memory map for each node in the system.
>>>
>>> movablecore_map=nn[KMG]@ss[KMG]
>>>
>>> This option makes sure the memory range from ss to ss+nn is movable memory.
>>>
>>>
>>> [Why we do this]
>>> If we hot remove memory, the memory cannot contain kernel memory,
>>> because Linux currently cannot migrate kernel memory. Therefore,
>>> we have to guarantee that the hot-removed memory contains only movable
>>> memory.
>>>
>>> Linux has two boot options, kernelcore= and movablecore=, for
>>> creating movable memory. These boot options can specify the amount
>>> of memory use as kernel or movable memory. Using them, we can
>>> create ZONE_MOVABLE which has only movable memory.
>>>
>>> But it does not fulfill a requirement of memory hot removal, because
>>> even if we specify these boot options, movable memory is distributed
>>> evenly across nodes. So when we want to hot remove memory whose
>>> memory range is 0x8000-0c000, we have no way to specify
>>> the memory as movable memory.
>>>
>>
>> Sorry, I still don't get your idea.
>> Why do you need a specific range that is movable?
>> Could you describe the requirement and the situation a bit more?
>> Thank you.
>
>
> Hi Liu,
>
> This feature is used in memory hotplug.
>
> In order to implement a whole node hotplug, we need to make sure the
> node contains no kernel memory, because memory used by kernel could
> not be migrated. (Since the kernel memory is directly mapped,
> VA = PA + __PAGE_OFFSET. So the physical address could not be changed.)
>
> User could specify all the memory on a node to be movable, so that the
> node could be hot-removed.
>

Thank you for your explanation. It's reasonable.

But i think it's a bit duplicated with CMA, i'm not sure but maybe we
can combine it with CMA which already in mainline?

> Another approach is like the following:
> movable_node = 1,3-5,8
> This could set all the memory on the nodes to be movable. And the rest
> of memory works as usual. But movablecore_map is more flexible.
>
> Thanks. :)
>
>
>>
>>> So we proposed a new feature which specifies memory range to use as
>>> movable memory.
>>>
>>>
>>> [Ways to do this]
>>> There may be 2 ways to specify movable memory.
>>>   1. use firmware information
>>>   2. use boot option
>>>
>>> 1. use firmware information
>>>According to ACPI spec 5.0, the SRAT table has a memory affinity structure
>>>and the structure has a Hot Pluggable Field. See "5.2.16.2 Memory
>>>Affinity Structure". If we use this information, we might be able to
>>>specify movable memory via firmware. For example, if the Hot Pluggable
>>>Field is enabled, Linux sets the memory as movable memory.
>>>
>>> 2. use boot option
>>>This is our proposal. New boot option can specify memory range to use
>>>as movable memory.
>>>
>>>
>>> [How we do this]
>>> We chose the second way because, with the first way, users cannot change
>>> the memory range used as movable memory easily. We think that if we create
>>> movable memory, a performance regression may occur due to NUMA. In that case,
>>> the user can turn off the feature easily if we provide the boot option.
>>> And with the boot option, the user can easily select which memory
>>> to use as movable memory.
>>>
>>>
>>> [How to use]
>>> Specify the following boot option:
>>> movablecore_map=nn[KMG]@ss[KMG]
>>>
>>> That means physical address range from ss to ss+nn will be allocated as
>>> ZONE_MOVABLE.
>>>
>>> And the following points should be considered.
>>>
>>> 1) If the range is contained within a single node, then from ss to the end of
>>> the node will be ZONE_MOVABLE.
>>> 2) If the range covers two or more nodes, then from ss to the end of
>>> the node will be ZONE_MOVABLE, and all the other nodes will only
>>> have ZONE_MOVABLE.
>>> 3) If no range is in the node, then the node will have no ZONE_MOVABLE
>>> unless kernelcore or movablecore is specified.
>>> 4) This option could be specified at most MAX_NUMNODES times.
>>> 5) If kernelcore or movablecore is also specified, movablecore_map will
>>> have
>>> higher priority to be satisfied.
>>> 6) This option has no conflict with memmap option.
>>>
>>>
>>>
>>> Tang Chen (4):
>>>page_alloc: add movable_memmap kernel parameter
>>>page_alloc: Introduce zone_movable_limit[] to keep movable limit for
>>>  nodes
>>>page_alloc: Make movablecore_map has higher priority
>>>page_alloc: Bootmem limit with movablecore_map
>>>
>>> Yasuaki Ishimatsu (1):
>>>x86: get pg_data_t's memory from other node
>>>
>>>   Documentation/kernel-parameters.txt |   17 +++
>>>   arch/x86/mm/numa.c  |   11 ++-
>>>   include/linux/memblock.h|1 +
>>>   include/linux/mm.h  |   11 ++
>>>   

Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-27 Thread Yasuaki Ishimatsu

Hi HPA and Tang,

2012/11/27 17:49, H. Peter Anvin wrote:

On 11/27/2012 12:29 AM, Tang Chen wrote:

Another approach is like the following:
movable_node = 1,3-5,8
This could set all the memory on the nodes to be movable. And the rest
of memory works as usual. But movablecore_map is more flexible.


... but *much* harder for users, so movable_node is better in most cases.


It seems that movable_node is easier to use than movablecore_map.
But I do not think movable_node is better, because the node number is
assigned by the OS and can change easily.


For exmaple:
If system has 4 nodes and we set moveble_node=2, we can hot remove node2.

   node0   node1   node2   node3
  +-+ +-+ +-+ +-+
  | | | | |/| | |
  | | | | |/| | |
  | | | | |/| | |
  | | | | |/| | |
  +-+ +-+ +-+ +-+
  movable
   node

But if we hot remove node2 and reboot the system, node3 is changed to node2
and set to movable node.

   node0   node1   node2
  +-+ +-+ +-+
  | | | | |/|
  | | | | |/|
  | | | | |/|
  | | | | |/|
  +-+ +-+ +-+
  movable
   node

Originally, node3 is not a movable node. Changing the node's attribute to
movable is not intended. So if the user uses movable_node, the user must
confirm at hotplug time that the boot option is set correctly.

But memory ranges are set by firmware and do not change. So if we set node2
as a movable node via movablecore_map, the issue does not occur.

Thanks,
Yasuaki Ishimatsu



-hpa






Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-27 Thread H. Peter Anvin

On 11/27/2012 01:47 AM, Wen Congyang wrote:

At 11/27/2012 04:49 PM, H. Peter Anvin Wrote:

On 11/27/2012 12:29 AM, Tang Chen wrote:

Another approach is like the following:
movable_node = 1,3-5,8
This could set all the memory on the nodes to be movable. And the rest
of memory works as usual. But movablecore_map is more flexible.


... but *much* harder for users, so movable_node is better in most cases.


But NUMA is initialized much later, and we need the information from the SRAT...

Thanks
Wen Congyang



I think you need to deal with it for usability reasons, though...


--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-27 Thread Wen Congyang
At 11/27/2012 04:49 PM, H. Peter Anvin Wrote:
> On 11/27/2012 12:29 AM, Tang Chen wrote:
>> Another approach is like the following:
>> movable_node = 1,3-5,8
>> This could set all the memory on the nodes to be movable. And the rest
>> of memory works as usual. But movablecore_map is more flexible.
> 
> ... but *much* harder for users, so movable_node is better in most cases.

But NUMA is initialized much later, and we need the information from the SRAT...

Thanks
Wen Congyang

> 
>   -hpa
> 
> 



Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-27 Thread H. Peter Anvin
On 11/27/2012 12:29 AM, Tang Chen wrote:
> Another approach is like the following:
> movable_node = 1,3-5,8
> This could set all the memory on the nodes to be movable. And the rest
> of memory works as usual. But movablecore_map is more flexible.

... but *much* harder for users, so movable_node is better in most cases.

-hpa



Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-27 Thread Tang Chen

On 11/27/2012 04:00 PM, Bob Liu wrote:

Hi Tang,

On Fri, Nov 23, 2012 at 6:44 PM, Tang Chen  wrote:

[What we are doing]
This patchset provides a boot option for users to specify the ZONE_MOVABLE memory
map for each node in the system.

movablecore_map=nn[KMG]@ss[KMG]

This option makes sure the memory range from ss to ss+nn is movable memory.


[Why we do this]
If we hot remove memory, the memory cannot contain kernel memory,
because Linux currently cannot migrate kernel memory. Therefore,
we have to guarantee that the hot-removed memory contains only movable
memory.

Linux has two boot options, kernelcore= and movablecore=, for
creating movable memory. These boot options can specify the amount
of memory use as kernel or movable memory. Using them, we can
create ZONE_MOVABLE which has only movable memory.

But it does not fulfill a requirement of memory hot removal, because
even if we specify these boot options, movable memory is distributed
evenly across nodes. So when we want to hot remove memory whose
memory range is 0x8000-0c000, we have no way to specify
the memory as movable memory.



Sorry, I still don't get your idea.
Why do you need a specific range that is movable?
Could you describe the requirement and the situation a bit more?
Thank you.


Hi Liu,

This feature is used in memory hotplug.

In order to implement a whole node hotplug, we need to make sure the
node contains no kernel memory, because memory used by kernel could
not be migrated. (Since the kernel memory is directly mapped,
VA = PA + __PAGE_OFFSET. So the physical address could not be changed.)

User could specify all the memory on a node to be movable, so that the
node could be hot-removed.
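The VA = PA + __PAGE_OFFSET identity above is the whole story: a direct-mapped page's virtual address is pure arithmetic on its physical address, so moving the page to a different physical address would invalidate every existing pointer into it. A tiny sketch — the offset value is merely illustrative of x86_64, and `va_of`/`pa_of` are hypothetical stand-ins for the kernel's __va()/__pa():

```c
/* Illustrative x86_64 direct-map base (the kernel's __PAGE_OFFSET). */
#define PAGE_OFFSET_SKETCH 0xffff880000000000ULL

/* The direct map is fixed arithmetic, which is why kernel
 * (direct-mapped) memory cannot be migrated: the physical address
 * is baked into the virtual address. */
static unsigned long long va_of(unsigned long long pa)
{
	return pa + PAGE_OFFSET_SKETCH;
}

static unsigned long long pa_of(unsigned long long va)
{
	return va - PAGE_OFFSET_SKETCH;
}
```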

Another approach is like the following:
movable_node = 1,3-5,8
This could set all the memory on the nodes to be movable. And the rest
of memory works as usual. But movablecore_map is more flexible.

Thanks. :)




So we proposed a new feature which specifies memory range to use as
movable memory.


[Ways to do this]
There may be 2 ways to specify movable memory.
  1. use firmware information
  2. use boot option

1. use firmware information
   According to ACPI spec 5.0, the SRAT table has a memory affinity structure
   and the structure has a Hot Pluggable Field. See "5.2.16.2 Memory
   Affinity Structure". If we use this information, we might be able to
   specify movable memory via firmware. For example, if the Hot Pluggable
   Field is enabled, Linux sets the memory as movable memory.

2. use boot option
   This is our proposal. New boot option can specify memory range to use
   as movable memory.


[How we do this]
We chose the second way because, with the first way, users cannot change
the memory range used as movable memory easily. We think that if we create
movable memory, a performance regression may occur due to NUMA. In that case,
the user can turn off the feature easily if we provide the boot option.
And with the boot option, the user can easily select which memory
to use as movable memory.


[How to use]
Specify the following boot option:
movablecore_map=nn[KMG]@ss[KMG]

That means physical address range from ss to ss+nn will be allocated as
ZONE_MOVABLE.

And the following points should be considered.

1) If the range is contained within a single node, then from ss to the end of
the node will be ZONE_MOVABLE.
2) If the range covers two or more nodes, then from ss to the end of
the node will be ZONE_MOVABLE, and all the other nodes will only
have ZONE_MOVABLE.
3) If no range is in the node, then the node will have no ZONE_MOVABLE
unless kernelcore or movablecore is specified.
4) This option could be specified at most MAX_NUMNODES times.
5) If kernelcore or movablecore is also specified, movablecore_map will have
higher priority to be satisfied.
6) This option has no conflict with memmap option.
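Rules 1)-3) above can be modeled in a few lines. The sketch below is hypothetical (the struct and function names are mine, not the patchset's): nodes are [start, end) physical ranges, the node containing ss becomes movable from ss upward, nodes whose start falls inside the requested range become entirely movable, and nodes outside it are untouched.

```c
struct node_range { unsigned long long start, end; }; /* [start, end) */

#define NO_MOVABLE 0ULL	/* 0 = this option gives the node no ZONE_MOVABLE */

/* For one movablecore_map=nn@ss entry covering [ss, ss + nn), compute the
 * physical address where ZONE_MOVABLE starts on each node. */
static void movable_limits(const struct node_range *nodes, int n,
			   unsigned long long ss, unsigned long long nn,
			   unsigned long long *limit)
{
	unsigned long long ee = ss + nn;
	int i;

	for (i = 0; i < n; i++) {
		if (ss >= nodes[i].start && ss < nodes[i].end)
			limit[i] = ss;	/* rule 1: movable from ss onward */
		else if (nodes[i].start >= ss && nodes[i].start < ee)
			limit[i] = nodes[i].start; /* rule 2: whole node movable */
		else
			limit[i] = NO_MOVABLE;	/* rule 3: untouched */
	}
}
```

For example, with three nodes at [0,100), [100,200), [200,300) and an entry of size 100 at ss=150, node0 is untouched, node1 is movable from 150, and node2 is entirely movable.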



Tang Chen (4):
   page_alloc: add movable_memmap kernel parameter
   page_alloc: Introduce zone_movable_limit[] to keep movable limit for
 nodes
   page_alloc: Make movablecore_map has higher priority
   page_alloc: Bootmem limit with movablecore_map

Yasuaki Ishimatsu (1):
   x86: get pg_data_t's memory from other node

  Documentation/kernel-parameters.txt |   17 +++
  arch/x86/mm/numa.c  |   11 ++-
  include/linux/memblock.h|1 +
  include/linux/mm.h  |   11 ++
  mm/memblock.c   |   15 +++-
  mm/page_alloc.c |  216 ++-
  6 files changed, 263 insertions(+), 8 deletions(-)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majord...@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: em...@kvack.org






Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-27 Thread Bob Liu
Hi Tang,

On Fri, Nov 23, 2012 at 6:44 PM, Tang Chen  wrote:
> [What we are doing]
> This patchset provides a boot option for users to specify the ZONE_MOVABLE memory
> map for each node in the system.
>
> movablecore_map=nn[KMG]@ss[KMG]
>
> This option makes sure the memory range from ss to ss+nn is movable memory.
>
>
> [Why we do this]
> If we hot remove memory, the memory cannot contain kernel memory,
> because Linux currently cannot migrate kernel memory. Therefore,
> we have to guarantee that the hot-removed memory contains only movable
> memory.
>
> Linux has two boot options, kernelcore= and movablecore=, for
> creating movable memory. These boot options can specify the amount
> of memory use as kernel or movable memory. Using them, we can
> create ZONE_MOVABLE which has only movable memory.
>
> But it does not fulfill a requirement of memory hot removal, because
> even if we specify these boot options, movable memory is distributed
> evenly across nodes. So when we want to hot remove memory whose
> memory range is 0x8000-0c000, we have no way to specify
> the memory as movable memory.
>

Sorry, I still don't get your idea.
Why do you need a specific range that is movable?
Could you describe the requirement and the situation a bit more?
Thank you.

> So we proposed a new feature which specifies memory range to use as
> movable memory.
>
>
> [Ways to do this]
> There may be 2 ways to specify movable memory.
>  1. use firmware information
>  2. use boot option
>
> 1. use firmware information
>   According to ACPI spec 5.0, the SRAT table has a memory affinity structure
>   and the structure has a Hot Pluggable Field. See "5.2.16.2 Memory
>   Affinity Structure". If we use this information, we might be able to
>   specify movable memory via firmware. For example, if the Hot Pluggable
>   Field is enabled, Linux sets the memory as movable memory.
>
> 2. use boot option
>   This is our proposal. New boot option can specify memory range to use
>   as movable memory.
>
>
> [How we do this]
> We chose the second way because, with the first way, users cannot change
> the memory range used as movable memory easily. We think that if we create
> movable memory, a performance regression may occur due to NUMA. In that case,
> the user can turn off the feature easily if we provide the boot option.
> And with the boot option, the user can easily select which memory
> to use as movable memory.
>
>
> [How to use]
> Specify the following boot option:
> movablecore_map=nn[KMG]@ss[KMG]
>
> That means physical address range from ss to ss+nn will be allocated as
> ZONE_MOVABLE.
>
> And the following points should be considered.
>
> 1) If the range is contained within a single node, then from ss to the end of
>the node will be ZONE_MOVABLE.
> 2) If the range covers two or more nodes, then from ss to the end of
>the node will be ZONE_MOVABLE, and all the other nodes will only
>have ZONE_MOVABLE.
> 3) If no range is in the node, then the node will have no ZONE_MOVABLE
>unless kernelcore or movablecore is specified.
> 4) This option could be specified at most MAX_NUMNODES times.
> 5) If kernelcore or movablecore is also specified, movablecore_map will have
>higher priority to be satisfied.
> 6) This option has no conflict with memmap option.
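[Editor's illustration] For concreteness, a sketch of how this option would be passed at boot. The node layout and sizes here are hypothetical, not taken from the patchset:

```shell
# Hypothetical layout: node1 spans physical 4G-8G and should be entirely
# hot-removable, so mark all 4 GiB of it as ZONE_MOVABLE.
# nn and ss are parsed kernel-side with memparse(), so K/M/G suffixes work.
GRUB_CMDLINE_LINUX="movablecore_map=4G@4G"
echo "$GRUB_CMDLINE_LINUX"
```

After a reboot with such a command line, the expectation would be that /proc/zoneinfo shows the Movable zone on that node covering the given range.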
>
>
>
> Tang Chen (4):
>   page_alloc: add movable_memmap kernel parameter
>   page_alloc: Introduce zone_movable_limit[] to keep movable limit for
> nodes
>   page_alloc: Make movablecore_map has higher priority
>   page_alloc: Bootmem limit with movablecore_map
>
> Yasuaki Ishimatsu (1):
>   x86: get pg_data_t's memory from other node
>
>  Documentation/kernel-parameters.txt |   17 +++
>  arch/x86/mm/numa.c  |   11 ++-
>  include/linux/memblock.h|1 +
>  include/linux/mm.h  |   11 ++
>  mm/memblock.c   |   15 +++-
>  mm/page_alloc.c |  216 
> ++-
>  6 files changed, 263 insertions(+), 8 deletions(-)
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majord...@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: em...@kvack.org

-- 
Regards,
-Bob
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-27 Thread Bob Liu
Hi Tang,

On Fri, Nov 23, 2012 at 6:44 PM, Tang Chen tangc...@cn.fujitsu.com wrote:
 [What we are doing]
 This patchset provides a boot option for users to specify the ZONE_MOVABLE
 memory map for each node in the system.

 movablecore_map=nn[KMG]@ss[KMG]

 This option makes sure the memory range from ss to ss+nn is movable memory.


 [Why we do this]
 If we hot remove memory, the removed memory cannot contain kernel memory,
 because Linux cannot currently migrate kernel memory. Therefore,
 we have to guarantee that the hot-removed memory contains only movable
 memory.

 Linux has two boot options, kernelcore= and movablecore=, for
 creating movable memory. These boot options can specify the amount
 of memory to use as kernel or movable memory. Using them, we can
 create ZONE_MOVABLE, which has only movable memory.

 But it does not fulfill the requirements of memory hot removal, because
 even if we specify the boot options, movable memory is distributed
 evenly across nodes. So when we want to hot remove the memory whose
 range is 0x8000-0c000, we have no way to specify
 that memory as movable memory.


Sorry, I still don't get your idea.
Why do you need to specify a range that is movable?
Could you describe the requirement and situation a bit more?
Thank you.

 So we proposed a new feature which specifies memory range to use as
 movable memory.


 [Ways to do this]
 There may be 2 ways to specify movable memory.
  1. use firmware information
  2. use boot option

 1. use firmware information
   According to ACPI spec 5.0, the SRAT table has a memory affinity structure
   and the structure has a Hot Pluggable Field. See "5.2.16.2 Memory
   Affinity Structure". If we use the information, we might be able to
   specify movable memory via firmware. For example, if the Hot Pluggable
   Field is enabled, Linux sets the memory as movable memory.

 2. use boot option
   This is our proposal. New boot option can specify memory range to use
   as movable memory.


 [How we do this]
 We chose the second way, because with the first way users cannot easily
 change which memory range is used as movable memory. We think that
 creating movable memory may cause a performance regression due to NUMA.
 In that case, the user can easily turn the feature off if we prepare the
 boot option. And if we prepare the boot option, the user can easily
 select which memory to use as movable memory.


 [How to use]
 Specify the following boot option:
 movablecore_map=nn[KMG]@ss[KMG]

 That means physical address range from ss to ss+nn will be allocated as
 ZONE_MOVABLE.

 And the following points should be considered.

 1) If the range falls within a single node, then from ss to the end of
    that node will be ZONE_MOVABLE.
 2) If the range covers two or more nodes, then from ss to the end of
    the first node will be ZONE_MOVABLE, and all the other covered nodes
    will have only ZONE_MOVABLE.
 3) If no range is in a node, then that node will have no ZONE_MOVABLE
    unless kernelcore or movablecore is specified.
 4) This option can be specified at most MAX_NUMNODES times.
 5) If kernelcore or movablecore is also specified, movablecore_map will
    have higher priority to be satisfied.
 6) This option has no conflict with the memmap option.



 Tang Chen (4):
   page_alloc: add movable_memmap kernel parameter
   page_alloc: Introduce zone_movable_limit[] to keep movable limit for
 nodes
   page_alloc: Make movablecore_map has higher priority
   page_alloc: Bootmem limit with movablecore_map

 Yasuaki Ishimatsu (1):
   x86: get pg_data_t's memory from other node

  Documentation/kernel-parameters.txt |   17 +++
  arch/x86/mm/numa.c  |   11 ++-
  include/linux/memblock.h|1 +
  include/linux/mm.h  |   11 ++
  mm/memblock.c   |   15 +++-
  mm/page_alloc.c |  216 
 ++-
  6 files changed, 263 insertions(+), 8 deletions(-)


-- 
Regards,
-Bob


Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-27 Thread Tang Chen

On 11/27/2012 04:00 PM, Bob Liu wrote:

Hi Tang,

On Fri, Nov 23, 2012 at 6:44 PM, Tang Chentangc...@cn.fujitsu.com  wrote:

[What we are doing]
This patchset provides a boot option for users to specify the ZONE_MOVABLE
memory map for each node in the system.

movablecore_map=nn[KMG]@ss[KMG]

This option makes sure the memory range from ss to ss+nn is movable memory.


[Why we do this]
If we hot remove memory, the removed memory cannot contain kernel memory,
because Linux cannot currently migrate kernel memory. Therefore,
we have to guarantee that the hot-removed memory contains only movable
memory.

Linux has two boot options, kernelcore= and movablecore=, for
creating movable memory. These boot options can specify the amount
of memory to use as kernel or movable memory. Using them, we can
create ZONE_MOVABLE, which has only movable memory.

But it does not fulfill the requirements of memory hot removal, because
even if we specify the boot options, movable memory is distributed
evenly across nodes. So when we want to hot remove the memory whose
range is 0x8000-0c000, we have no way to specify
that memory as movable memory.



Sorry, I still don't get your idea.
Why do you need to specify a range that is movable?
Could you describe the requirement and situation a bit more?
Thank you.


Hi Liu,

This feature is used in memory hotplug.

In order to implement a whole node hotplug, we need to make sure the
node contains no kernel memory, because memory used by kernel could
not be migrated. (Since the kernel memory is directly mapped,
VA = PA + __PAGE_OFFSET. So the physical address could not be changed.)

User could specify all the memory on a node to be movable, so that the
node could be hot-removed.

Another approach is like the following:
movable_node = 1,3-5,8
This could set all the memory on the nodes to be movable. And the rest
of memory works as usual. But movablecore_map is more flexible.

Thanks. :)




So we proposed a new feature which specifies memory range to use as
movable memory.


[Ways to do this]
There may be 2 ways to specify movable memory.
  1. use firmware information
  2. use boot option

1. use firmware information
   According to ACPI spec 5.0, the SRAT table has a memory affinity structure
   and the structure has a Hot Pluggable Field. See "5.2.16.2 Memory
   Affinity Structure". If we use the information, we might be able to
   specify movable memory via firmware. For example, if the Hot Pluggable
   Field is enabled, Linux sets the memory as movable memory.

2. use boot option
   This is our proposal. New boot option can specify memory range to use
   as movable memory.


[How we do this]
We chose the second way, because with the first way users cannot easily
change which memory range is used as movable memory. We think that
creating movable memory may cause a performance regression due to NUMA.
In that case, the user can easily turn the feature off if we prepare the
boot option. And if we prepare the boot option, the user can easily
select which memory to use as movable memory.


[How to use]
Specify the following boot option:
movablecore_map=nn[KMG]@ss[KMG]

That means physical address range from ss to ss+nn will be allocated as
ZONE_MOVABLE.

And the following points should be considered.

1) If the range falls within a single node, then from ss to the end of
   that node will be ZONE_MOVABLE.
2) If the range covers two or more nodes, then from ss to the end of
   the first node will be ZONE_MOVABLE, and all the other covered nodes
   will have only ZONE_MOVABLE.
3) If no range is in a node, then that node will have no ZONE_MOVABLE
   unless kernelcore or movablecore is specified.
4) This option can be specified at most MAX_NUMNODES times.
5) If kernelcore or movablecore is also specified, movablecore_map will
   have higher priority to be satisfied.
6) This option has no conflict with the memmap option.



Tang Chen (4):
   page_alloc: add movable_memmap kernel parameter
   page_alloc: Introduce zone_movable_limit[] to keep movable limit for
 nodes
   page_alloc: Make movablecore_map has higher priority
   page_alloc: Bootmem limit with movablecore_map

Yasuaki Ishimatsu (1):
   x86: get pg_data_t's memory from other node

  Documentation/kernel-parameters.txt |   17 +++
  arch/x86/mm/numa.c  |   11 ++-
  include/linux/memblock.h|1 +
  include/linux/mm.h  |   11 ++
  mm/memblock.c   |   15 +++-
  mm/page_alloc.c |  216 ++-
  6 files changed, 263 insertions(+), 8 deletions(-)






Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-27 Thread H. Peter Anvin
On 11/27/2012 12:29 AM, Tang Chen wrote:
 Another approach is like the following:
 movable_node = 1,3-5,8
 This could set all the memory on the nodes to be movable. And the rest
 of memory works as usual. But movablecore_map is more flexible.

... but *much* harder for users, so movable_node is better in most cases.

-hpa



Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-27 Thread Wen Congyang
At 11/27/2012 04:49 PM, H. Peter Anvin Wrote:
 On 11/27/2012 12:29 AM, Tang Chen wrote:
 Another approach is like the following:
 movable_node = 1,3-5,8
 This could set all the memory on the nodes to be movable. And the rest
 of memory works as usual. But movablecore_map is more flexible.
 
 ... but *much* harder for users, so movable_node is better in most cases.

But NUMA is initialized very late, and we need the information in the SRAT...

Thanks
Wen Congyang

 
   -hpa
 
 



Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-27 Thread H. Peter Anvin

On 11/27/2012 01:47 AM, Wen Congyang wrote:

At 11/27/2012 04:49 PM, H. Peter Anvin Wrote:

On 11/27/2012 12:29 AM, Tang Chen wrote:

Another approach is like the following:
movable_node = 1,3-5,8
This could set all the memory on the nodes to be movable. And the rest
of memory works as usual. But movablecore_map is more flexible.


... but *much* harder for users, so movable_node is better in most cases.


But NUMA is initialized very late, and we need the information in the SRAT...

Thanks
Wen Congyang



I think you need to deal with it for usability reasons, though...


--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-27 Thread Yasuaki Ishimatsu

Hi HPA and Tang,

2012/11/27 17:49, H. Peter Anvin wrote:

On 11/27/2012 12:29 AM, Tang Chen wrote:

Another approach is like the following:
movable_node = 1,3-5,8
This could set all the memory on the nodes to be movable. And the rest
of memory works as usual. But movablecore_map is more flexible.


... but *much* harder for users, so movable_node is better in most cases.


It seems that movable_node is easier to use than movablecore_map.
But I do not think movable_node is better, because the node number is
assigned by the OS and can change easily.


For example:
If the system has 4 nodes and we set movable_node=2, we can hot remove node2.

   node0   node1   node2   node3
  +-+ +-+ +-+ +-+
  | | | | |/| | |
  | | | | |/| | |
  | | | | |/| | |
  | | | | |/| | |
  +-+ +-+ +-+ +-+
  movable
   node

But if we hot remove node2 and reboot the system, node3 is renumbered to
node2 and becomes the movable node.

   node0   node1   node2
  +-+ +-+ +-+
  | | | | |/|
  | | | | |/|
  | | | | |/|
  | | | | |/|
  +-+ +-+ +-+
  movable
   node

Originally, node3 was not a movable node. Changing the node attribute to
a movable node is not intended. So if the user uses movable_node, the
user must confirm at hotplug time that the boot option is correctly set.

But memory ranges are set by firmware and do not change. So if we set node2
as a movable node with movablecore_map, the issue does not occur.

Thanks,
Yasuaki Ishimatsu



-hpa






Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-27 Thread Bob Liu
On Tue, Nov 27, 2012 at 4:29 PM, Tang Chen tangc...@cn.fujitsu.com wrote:
 On 11/27/2012 04:00 PM, Bob Liu wrote:

 Hi Tang,

 On Fri, Nov 23, 2012 at 6:44 PM, Tang Chentangc...@cn.fujitsu.com
 wrote:

 [What we are doing]
 This patchset provides a boot option for users to specify the ZONE_MOVABLE
 memory map for each node in the system.

 movablecore_map=nn[KMG]@ss[KMG]

 This option makes sure the memory range from ss to ss+nn is movable memory.


 [Why we do this]
 If we hot remove memory, the removed memory cannot contain kernel memory,
 because Linux cannot currently migrate kernel memory. Therefore,
 we have to guarantee that the hot-removed memory contains only movable
 memory.

 Linux has two boot options, kernelcore= and movablecore=, for
 creating movable memory. These boot options can specify the amount
 of memory to use as kernel or movable memory. Using them, we can
 create ZONE_MOVABLE, which has only movable memory.

 But it does not fulfill the requirements of memory hot removal, because
 even if we specify the boot options, movable memory is distributed
 evenly across nodes. So when we want to hot remove the memory whose
 range is 0x8000-0c000, we have no way to specify
 that memory as movable memory.


 Sorry, I still don't get your idea.
 Why do you need to specify a range that is movable?
 Could you describe the requirement and situation a bit more?
 Thank you.


 Hi Liu,

 This feature is used in memory hotplug.

 In order to implement a whole node hotplug, we need to make sure the
 node contains no kernel memory, because memory used by kernel could
 not be migrated. (Since the kernel memory is directly mapped,
 VA = PA + __PAGE_OFFSET. So the physical address could not be changed.)

 User could specify all the memory on a node to be movable, so that the
 node could be hot-removed.


Thank you for your explanation. It's reasonable.

But I think it's a bit duplicated with CMA. I'm not sure, but maybe we
can combine it with CMA, which is already in mainline?

 Another approach is like the following:
 movable_node = 1,3-5,8
 This could set all the memory on the nodes to be movable. And the rest
 of memory works as usual. But movablecore_map is more flexible.

 Thanks. :)



 So we proposed a new feature which specifies memory range to use as
 movable memory.


 [Ways to do this]
 There may be 2 ways to specify movable memory.
   1. use firmware information
   2. use boot option

 1. use firmware information
    According to ACPI spec 5.0, the SRAT table has a memory affinity structure
    and the structure has a Hot Pluggable Field. See "5.2.16.2 Memory
    Affinity Structure". If we use the information, we might be able to
    specify movable memory via firmware. For example, if the Hot Pluggable
    Field is enabled, Linux sets the memory as movable memory.

 2. use boot option
This is our proposal. New boot option can specify memory range to use
as movable memory.


 [How we do this]
 We chose the second way, because with the first way users cannot easily
 change which memory range is used as movable memory. We think that
 creating movable memory may cause a performance regression due to NUMA.
 In that case, the user can easily turn the feature off if we prepare the
 boot option. And if we prepare the boot option, the user can easily
 select which memory to use as movable memory.


 [How to use]
 Specify the following boot option:
 movablecore_map=nn[KMG]@ss[KMG]

 That means physical address range from ss to ss+nn will be allocated as
 ZONE_MOVABLE.

 And the following points should be considered.

 1) If the range falls within a single node, then from ss to the end of
    that node will be ZONE_MOVABLE.
 2) If the range covers two or more nodes, then from ss to the end of
    the first node will be ZONE_MOVABLE, and all the other covered nodes
    will have only ZONE_MOVABLE.
 3) If no range is in a node, then that node will have no ZONE_MOVABLE
    unless kernelcore or movablecore is specified.
 4) This option can be specified at most MAX_NUMNODES times.
 5) If kernelcore or movablecore is also specified, movablecore_map will
    have higher priority to be satisfied.
 6) This option has no conflict with the memmap option.



 Tang Chen (4):
page_alloc: add movable_memmap kernel parameter
page_alloc: Introduce zone_movable_limit[] to keep movable limit for
  nodes
page_alloc: Make movablecore_map has higher priority
page_alloc: Bootmem limit with movablecore_map

 Yasuaki Ishimatsu (1):
x86: get pg_data_t's memory from other node

   Documentation/kernel-parameters.txt |   17 +++
   arch/x86/mm/numa.c  |   11 ++-
   include/linux/memblock.h|1 +
   include/linux/mm.h  |   11 ++
   mm/memblock.c   |   15 +++-
   mm/page_alloc.c |  216
 ++-
   6 files changed, 263 insertions(+), 8 deletions(-)


Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-27 Thread Tang Chen

On 11/27/2012 08:09 PM, Bob Liu wrote:

On Tue, Nov 27, 2012 at 4:29 PM, Tang Chentangc...@cn.fujitsu.com  wrote:

Hi Liu,

This feature is used in memory hotplug.

In order to implement a whole node hotplug, we need to make sure the
node contains no kernel memory, because memory used by kernel could
not be migrated. (Since the kernel memory is directly mapped,
VA = PA + __PAGE_OFFSET. So the physical address could not be changed.)

User could specify all the memory on a node to be movable, so that the
node could be hot-removed.



Thank you for your explanation. It's reasonable.

But I think it's a bit duplicated with CMA. I'm not sure, but maybe we
can combine it with CMA, which is already in mainline?


Hi Liu,

Thanks for your advice. :)

CMA is the Contiguous Memory Allocator, right? What I'm trying to do is
control where the start of ZONE_MOVABLE is on each node. Could
CMA do this job?

And also, after a short investigation, CMA seems to be based on
memblock. But we need to limit memblock not to allocate memory in
ZONE_MOVABLE. As a result, we need to know the ranges before memblock
can be used. I'm afraid we still need an approach to get the ranges,
such as a boot option, or from static ACPI tables such as SRAT/MPST.

I don't know much about CMA for now. So if you have any better idea,
please share it with us, thanks. :)




Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-27 Thread Bob Liu
On Tue, Nov 27, 2012 at 8:49 PM, Tang Chen tangc...@cn.fujitsu.com wrote:
 On 11/27/2012 08:09 PM, Bob Liu wrote:

 On Tue, Nov 27, 2012 at 4:29 PM, Tang Chentangc...@cn.fujitsu.com
 wrote:

 Hi Liu,


 This feature is used in memory hotplug.

 In order to implement a whole node hotplug, we need to make sure the
 node contains no kernel memory, because memory used by kernel could
 not be migrated. (Since the kernel memory is directly mapped,
 VA = PA + __PAGE_OFFSET. So the physical address could not be changed.)

 User could specify all the memory on a node to be movable, so that the
 node could be hot-removed.


 Thank you for your explanation. It's reasonable.

 But I think it's a bit duplicated with CMA. I'm not sure, but maybe we
 can combine it with CMA, which is already in mainline?

 Hi Liu,

 Thanks for your advice. :)

 CMA is the Contiguous Memory Allocator, right? What I'm trying to do is
 control where the start of ZONE_MOVABLE is on each node. Could
 CMA do this job?

CMA will not control the start of ZONE_MOVABLE on each node, but it
can declare a memory area that is always movable,
and no non-movable allocation request will happen in that area.

Currently CMA uses a boot parameter, "cma=", to declare a memory size
that is always movable.
I think it might fulfill your requirement if the boot parameter were
extended with a start address.

more info at http://lwn.net/Articles/468044/

 And also, after a short investigation, CMA seems to be based on
 memblock. But we need to limit memblock not to allocate memory in
 ZONE_MOVABLE. As a result, we need to know the ranges before memblock
 can be used. I'm afraid we still need an approach to get the ranges,
 such as a boot option, or from static ACPI tables such as SRAT/MPST.


Yes, it's based on memblock and comes with a boot option.
In setup_arch32():
dma_contiguous_reserve(0);  => will declare a CMA area using
memblock_reserve()

 I don't know much about CMA for now. So if you have any better idea,
 please share it with us, thanks. :)

My idea is to reuse CMA like the patch below (not even compiled) and boot
with cma=size@start_address.
I don't know whether it will work or whether it suits your
requirement; if not, forgive me for the noise.

diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
index 612afcc..564962a 100644
--- a/drivers/base/dma-contiguous.c
+++ b/drivers/base/dma-contiguous.c
@@ -59,11 +59,18 @@ struct cma *dma_contiguous_default_area;
  */
 static const unsigned long size_bytes = CMA_SIZE_MBYTES * SZ_1M;
 static long size_cmdline = -1;
+static long cma_start_cmdline = -1;
 
 static int __init early_cma(char *p)
 {
+	char *oldp;
 	pr_debug("%s(%s)\n", __func__, p);
+	oldp = p;
 	size_cmdline = memparse(p, &p);
+
+	if (*p == '@')
+		cma_start_cmdline = memparse(p + 1, &p);
+	printk("cma start: 0x%lx, size: 0x%lx\n", cma_start_cmdline, size_cmdline);
 	return 0;
 }
 early_param("cma", early_cma);
@@ -127,8 +134,10 @@ void __init dma_contiguous_reserve(phys_addr_t limit)
 	if (selected_size) {
 		pr_debug("%s: reserving %ld MiB for global area\n", __func__,
 			 selected_size / SZ_1M);
-
-		dma_declare_contiguous(NULL, selected_size, 0, limit);
+		if (cma_start_cmdline != -1)
+			dma_declare_contiguous(NULL, selected_size,
+					       cma_start_cmdline, limit);
+		else
+			dma_declare_contiguous(NULL, selected_size, 0, limit);
 	}
 };

-- 
Regards,
--Bob


Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-27 Thread Tang Chen

On 11/27/2012 11:10 AM, wujianguo wrote:


Hi Tang,
DMA addresses can't be set as movable. If someone boots the kernel with
movablecore_map=4G@0xa0 or another memory region that contains DMA addresses,
the system may fail to boot. Should this case be handled or mentioned
in the change log and kernel-parameters.txt?


Hi Wu,

I think we can use MAX_DMA_PFN and MAX_DMA32_PFN to prevent setting DMA
addresses as movable. Just ignore the addresses lower than them, and set
the rest as movable. What do you think?

And, since we cannot figure out the minimum amount of memory the kernel
needs, I think for now we can just add some warnings to kernel-parameters.txt.

Thanks. :)



Thanks,
Jianguo Wu




Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-27 Thread Jiang Liu
On 2012-11-28 11:47, Tang Chen wrote:
 On 11/27/2012 11:10 AM, wujianguo wrote:

 Hi Tang,
 DMA addresses can't be set as movable. If someone boots the kernel with
 movablecore_map=4G@0xa0 or another memory region that contains DMA addresses,
 the system may fail to boot. Should this case be handled or mentioned
 in the change log and kernel-parameters.txt?
 
 Hi Wu,
 
 I think we can use MAX_DMA_PFN and MAX_DMA32_PFN to prevent setting DMA
 addresses as movable. Just ignore the addresses lower than them, and set
 the rest as movable. What do you think?
 
 And, since we cannot figure out the minimum amount of memory the kernel
 needs, I think for now we can just add some warnings to kernel-parameters.txt.
 
 Thanks. :)
On another OS, there is a mechanism to dynamically convert pages from
movable zones into normal zones.

Regards!
Gerry

 

 Thanks,
 Jianguo Wu

 
 .
 




Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-27 Thread Jiang Liu
On 2012-11-28 11:24, Bob Liu wrote:
 On Tue, Nov 27, 2012 at 8:49 PM, Tang Chen tangc...@cn.fujitsu.com wrote:
 On 11/27/2012 08:09 PM, Bob Liu wrote:

 On Tue, Nov 27, 2012 at 4:29 PM, Tang Chentangc...@cn.fujitsu.com
 wrote:

 Hi Liu,


 This feature is used in memory hotplug.

 In order to implement a whole node hotplug, we need to make sure the
 node contains no kernel memory, because memory used by kernel could
 not be migrated. (Since the kernel memory is directly mapped,
 VA = PA + __PAGE_OFFSET. So the physical address could not be changed.)

 User could specify all the memory on a node to be movable, so that the
 node could be hot-removed.


 Thank you for your explanation. It's reasonable.

 But I think it's a bit duplicated with CMA. I'm not sure, but maybe we
 can combine it with CMA, which is already in mainline?

 Hi Liu,

 Thanks for your advice. :)

 CMA is the Contiguous Memory Allocator, right? What I'm trying to do is
 control where the start of ZONE_MOVABLE is on each node. Could
 CMA do this job?
 
 CMA will not control the start of ZONE_MOVABLE on each node, but it
 can declare a memory area that is always movable,
 and no non-movable allocation request will happen in that area.
 
 Currently CMA uses a boot parameter, "cma=", to declare a memory size
 that is always movable.
 I think it might fulfill your requirement if the boot parameter were
 extended with a start address.
 
 more info at http://lwn.net/Articles/468044/

 And also, after a short investigation, CMA seems to be based on
 memblock. But we need to limit memblock not to allocate memory in
 ZONE_MOVABLE. As a result, we need to know the ranges before memblock
 can be used. I'm afraid we still need an approach to get the ranges,
 such as a boot option, or from static ACPI tables such as SRAT/MPST.

 
 Yes, it's based on memblock and comes with a boot option.
 In setup_arch32():
 dma_contiguous_reserve(0);  => will declare a CMA area using
 memblock_reserve()
 
 I don't know much about CMA for now. So if you have any better idea,
 please share it with us, thanks. :)
 
 My idea is to reuse CMA like the patch below (not even compiled) and boot
 with cma=size@start_address.
 I don't know whether it will work or whether it suits your
 requirement; if not, forgive me for the noise.
 
 diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
 index 612afcc..564962a 100644
 --- a/drivers/base/dma-contiguous.c
 +++ b/drivers/base/dma-contiguous.c
 @@ -59,11 +59,18 @@ struct cma *dma_contiguous_default_area;
   */
  static const unsigned long size_bytes = CMA_SIZE_MBYTES * SZ_1M;
  static long size_cmdline = -1;
 +static long cma_start_cmdline = -1;
 
  static int __init early_cma(char *p)
  {
 +	char *oldp;
  	pr_debug("%s(%s)\n", __func__, p);
 +	oldp = p;
  	size_cmdline = memparse(p, &p);
 +
 +	if (*p == '@')
 +		cma_start_cmdline = memparse(p + 1, &p);
 +	printk("cma start: 0x%lx, size: 0x%lx\n", cma_start_cmdline, size_cmdline);
  	return 0;
  }
  early_param("cma", early_cma);
 @@ -127,8 +134,10 @@ void __init dma_contiguous_reserve(phys_addr_t limit)
  	if (selected_size) {
  		pr_debug("%s: reserving %ld MiB for global area\n", __func__,
  			 selected_size / SZ_1M);
 -
 -		dma_declare_contiguous(NULL, selected_size, 0, limit);
 +		if (cma_start_cmdline != -1)
 +			dma_declare_contiguous(NULL, selected_size,
 +					       cma_start_cmdline, limit);
 +		else
 +			dma_declare_contiguous(NULL, selected_size, 0, limit);
  	}
  };
Reserving memory by reusing the CMA logic seems like a good idea, though it
needs more investigation. One of CMA's goals is to ensure that pages in CMA
are really movable, and at first glance this patchset tries to achieve the
same goal.

 




Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-27 Thread Jianguo Wu
On 2012/11/28 11:47, Tang Chen wrote:

 On 11/27/2012 11:10 AM, wujianguo wrote:

 Hi Tang,
 DMA addresses can't be set as movable. If someone boots the kernel with
 movablecore_map=4G@0xa0 or another memory region that contains DMA addresses,
 the system may fail to boot. Should this case be handled or mentioned
 in the change log and kernel-parameters.txt?
 
 Hi Wu,
 
 I think we can use MAX_DMA_PFN and MAX_DMA32_PFN to prevent setting DMA
 addresses as movable. Just ignore the addresses lower than them, and set
 the rest as movable. What do you think?
 

I think it's OK for now.

 And, since we cannot figure out the minimum amount of memory the kernel
 needs, I think for now we can just add some warnings to kernel-parameters.txt.
 
 Thanks. :)
 

 Thanks,
 Jianguo Wu

 
 .
 





Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-27 Thread Wen Congyang
At 11/28/2012 12:01 PM, Jiang Liu Wrote:
 On 2012-11-28 11:47, Tang Chen wrote:
 On 11/27/2012 11:10 AM, wujianguo wrote:

 Hi Tang,
 DMA addresses can't be set as movable. If someone boots the kernel with
 movablecore_map=4G@0xa0 or another memory region that contains DMA
 addresses, the system may fail to boot. Should this case be handled or
 mentioned in the change log and kernel-parameters.txt?

 Hi Wu,

 I think we can use MAX_DMA_PFN and MAX_DMA32_PFN to prevent setting DMA
 address as movable. Just ignore the address lower than them, and set
 the rest as movable. What do you think?

 And, since we cannot figure out the minimum of memory kernel needs, I
 think for now, we can just add some warning into kernel-parameters.txt.

 Thanks. :)
 On another OS, there is a mechanism to dynamically convert pages from
 movable zones into normal zones.

Does the OS do it automatically? Or does the user convert it?

We can convert pages from movable zones into normal zones by the following
interface:
echo online_kernel > /sys/devices/system/memory/memoryX/state

We have posted a patchset to implement it, and it is in mm tree now.
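For illustration, onlining a memory block as kernel memory through sysfs amounts to writing a state string to the block's state file. A minimal userspace sketch, assuming a hypothetical helper name (on a real system this needs root and memory hotplug support; the memory block number is an example):

```c
#include <stdio.h>

/* Write a state string (e.g. "online_kernel") to a sysfs state file,
 * returning 0 on success and -1 on failure. */
int write_state(const char *path, const char *state)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	if (fprintf(f, "%s\n", state) < 0) {
		fclose(f);
		return -1;
	}
	return fclose(f) == 0 ? 0 : -1;
}
```

e.g. `write_state("/sys/devices/system/memory/memory8/state", "online_kernel")` mirrors the echo above for an example block `memory8`.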

Thanks
Wen Congyang

 
 Regards!
 Gerry
 


 Thanks,
 Jianguo Wu


 .

 
 
 



Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-27 Thread Jiang Liu
On 2012-11-28 13:21, Wen Congyang wrote:
 At 11/28/2012 12:01 PM, Jiang Liu Wrote:
 On 2012-11-28 11:47, Tang Chen wrote:
 On 11/27/2012 11:10 AM, wujianguo wrote:

 Hi Tang,
 DMA addresses can't be set as movable. If someone boots the kernel with
 movablecore_map=4G@0xa0 or another memory region that contains DMA
 addresses, the system may fail to boot. Should this case be handled or
 mentioned in the change log and kernel-parameters.txt?

 Hi Wu,

 I think we can use MAX_DMA_PFN and MAX_DMA32_PFN to prevent setting DMA
 address as movable. Just ignore the address lower than them, and set
 the rest as movable. What do you think?

 And, since we cannot figure out the minimum of memory kernel needs, I
 think for now, we can just add some warning into kernel-parameters.txt.

 Thanks. :)
 On another OS, there is a mechanism to dynamically convert pages from
 movable zones into normal zones.
 
 Does the OS do it automatically? Or does the user convert it?
 
 We can convert pages from movable zones into normal zones by the following
 interface:
 echo online_kernel > /sys/devices/system/memory/memoryX/state
 
 We have posted a patchset to implement it, and it is in mm tree now.
The OS converts it automatically; no manual operations are needed.




Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-27 Thread Tang Chen

Hi Bob, Liu Jiang,

About CMA, could you give me more info ?
Thanks for your patience and nice advice. :)


1) I saw the following on http://lwn.net/Articles/447405/:

"The CMA type is sticky; pages which are marked as being for CMA
should never have their migration type changed by the kernel."

As Wen said, we now support a user interface to change movable memory
into kernel memory. But seeing from above, the memory specified as
CMA will not be able to be changed, right ?  If so, I don't think
using CMA is a good idea.


2) Is CMA implemented only on the ARM platform?  I found the following in
kernel-parameters.txt.

cma=nn[MG]  [ARM,KNL]
Sets the size of kernel global memory area for contiguous
memory allocations. For more information, see
include/linux/dma-contiguous.h

We are developing on x86. Could we use it ?


3) Is CMA just used for DMA ? I am a little confused here. :)
I found the main code of CMA is implemented in dma-contiguous.c.


4) The boot options cma=xxx and movablecore_map=xxx have different
meanings for user. Reusing CMA could make user confused, I'm afraid.

And, even if we reuse cma= option, we still need to do the work
in patch 3~5, right ?


Thanks. :)



On 11/28/2012 12:08 PM, Jiang Liu wrote:

On 2012-11-28 11:24, Bob Liu wrote:

On Tue, Nov 27, 2012 at 8:49 PM, Tang Chentangc...@cn.fujitsu.com  wrote:

On 11/27/2012 08:09 PM, Bob Liu wrote:


On Tue, Nov 27, 2012 at 4:29 PM, Tang Chentangc...@cn.fujitsu.com
wrote:


Hi Liu,


This feature is used in memory hotplug.

In order to implement a whole node hotplug, we need to make sure the
node contains no kernel memory, because memory used by kernel could
not be migrated. (Since the kernel memory is directly mapped,
VA = PA + __PAGE_OFFSET. So the physical address could not be changed.)

User could specify all the memory on a node to be movable, so that the
node could be hot-removed.
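The direct-mapping constraint above can be sketched as plain arithmetic: in the kernel's linear map, a virtual address differs from the physical address by a constant offset, so the physical page backing a direct-map pointer can never move. A toy model (the offset value and the `toy_` names are illustrative only, not a statement about any particular kernel config):

```c
#include <stdint.h>

/* Illustrative direct-map base address (arch- and config-specific). */
#define TOY_PAGE_OFFSET 0xffff880000000000ULL

/* In the direct (linear) mapping, VA = PA + offset. The translation is
 * pure arithmetic, so the physical address behind a direct-map virtual
 * address is fixed: migrating the page would invalidate every pointer
 * into the direct map. */
uint64_t toy_pa_to_va(uint64_t pa) { return pa + TOY_PAGE_OFFSET; }
uint64_t toy_va_to_pa(uint64_t va) { return va - TOY_PAGE_OFFSET; }
```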



Thank you for your explanation. It's reasonable.

But I think it's a bit duplicated with CMA. I'm not sure, but maybe we
can combine it with CMA, which is already in mainline?


Hi Liu,

Thanks for your advice. :)

CMA is Contiguous Memory Allocator, right?  What I'm trying to do is
controlling where is the start of ZONE_MOVABLE of each node. Could
CMA do this job ?


CMA will not control the start of ZONE_MOVABLE on each node, but it
can declare a memory area that is always movable,
and no non-movable allocation requests will be served from that area.

Currently CMA uses a boot parameter, cma=, to declare a memory size
that is always movable.
I think it might fulfill your requirement if the boot
parameter is extended with a start address.

more info at http://lwn.net/Articles/468044/


And also, after a short investigation, CMA seems need to base on
memblock. But we need to limit memblock not to allocate memory on
ZONE_MOVABLE. As a result, we need to know the ranges before memblock
could be used. I'm afraid we still need an approach to get the ranges,
such as a boot option, or from static ACPI tables such as SRAT/MPST.



Yes, it's based on memblock and with boot option.
In setup_arch32()
 dma_contiguous_reserve(0);  => will declare a CMA area using
memblock_reserve()


I don't know much about CMA for now. So if you have any better idea,
please share it with us, thanks. :)


My idea is to reuse CMA like the patch below (not even compiled) and boot with
cma=size@start_address.
I don't know whether it can work or whether it suits your
requirement; if not, forgive me for the noise.

diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
index 612afcc..564962a 100644
--- a/drivers/base/dma-contiguous.c
+++ b/drivers/base/dma-contiguous.c
@@ -59,11 +59,18 @@ struct cma *dma_contiguous_default_area;
   */
  static const unsigned long size_bytes = CMA_SIZE_MBYTES * SZ_1M;
  static long size_cmdline = -1;
+static long cma_start_cmdline = -1;

   static int __init early_cma(char *p)
   {
+	char *oldp;
	pr_debug("%s(%s)\n", __func__, p);
+	oldp = p;
	size_cmdline = memparse(p, &p);
+
+	if (*p == '@')
+		cma_start_cmdline = memparse(p+1, &p);
+	printk("cma start: 0x%lx, size: 0x%lx\n", size_cmdline, cma_start_cmdline);
	return 0;
   }
   early_param("cma", early_cma);
@@ -127,8 +134,10 @@ void __init dma_contiguous_reserve(phys_addr_t limit)
	if (selected_size) {
		pr_debug("%s: reserving %ld MiB for global area\n", __func__,
			 selected_size / SZ_1M);
-
-		dma_declare_contiguous(NULL, selected_size, 0, limit);
+		if (cma_start_cmdline != -1)
+			dma_declare_contiguous(NULL, selected_size,
+				cma_start_cmdline, limit);
+		else
+			dma_declare_contiguous(NULL, selected_size, 0, limit);
	}
  };
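For reference, a rough userspace sketch of the cma=size@start parsing done above, with a simplified stand-in for the kernel's memparse() that only handles the K/M/G suffixes (the `toy_` names are made up for the example; the real memparse also accepts larger suffixes):

```c
#include <stdint.h>
#include <stdlib.h>

/* Simplified memparse(): parse a number with an optional K/M/G suffix,
 * advancing *retptr past what was consumed. */
uint64_t toy_memparse(const char *p, char **retptr)
{
	char *end;
	uint64_t v = strtoull(p, &end, 0);

	switch (*end) {
	case 'G': case 'g': v <<= 30; end++; break;
	case 'M': case 'm': v <<= 20; end++; break;
	case 'K': case 'k': v <<= 10; end++; break;
	}
	if (retptr)
		*retptr = end;
	return v;
}

/* Parse "size[KMG]@start[KMG]", as the sketch patch does for cma=. */
void toy_parse_cma(const char *arg, uint64_t *size, uint64_t *start)
{
	char *p;

	*start = 0;
	*size = toy_memparse(arg, &p);
	if (*p == '@')
		*start = toy_memparse(p + 1, &p);
}
```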

Seems a good idea to reserve memory by reusing the CMA logic, though it needs
more investigation. One of CMA's goals is to ensure that pages in CMA are really
movable, and this patchset tries 

Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-27 Thread Jiang Liu
Hi Chen,

If a pageblock's migration type is movable, it may be converted to
reclaimable under memory pressure. CMA is introduced to guarantee
that pages of CMA won't be converted to other migratetypes.

And we are trying to avoid allocating kernel/DMA memory from specific
memory ranges, so we could easily reclaim pages when hot-removing
memory devices. 

I think the idea is not to directly reuse CMA for hotplug, but to 
reuse the mechanism to reserve specific memory ranges from bootmem
allocator. So CMA and hotplug could use the same code.
Basically we may try to reuse dma_declare_contiguous(), so that
we don't need to add special logic into bootmem allocator.

Regards!
Gerry

On 2012-11-28 14:16, Tang Chen wrote:
 Hi Bob, Liu Jiang,
 
 About CMA, could you give me more info ?
 Thanks for your patience and nice advice. :)
 
 
 1) I saw the following on http://lwn.net/Articles/447405/:
 
 The CMA type is sticky; pages which are marked as being for CMA
 should never have their migration type changed by the kernel.
 
 As Wen said, we now support a user interface to change movable memory
 into kernel memory. But seeing from above, the memory specified as
 CMA will not be able to be changed, right ?  If so, I don't think
 using CMA is a good idea.
 
 
 2) Is CMA just implemented on ARM platform ?  I found the following in
 kernel-parameters.txt.
 
 cma=nn[MG]  [ARM,KNL]
 Sets the size of kernel global memory area for contiguous
 memory allocations. For more information, see
 include/linux/dma-contiguous.h
 
 We are developing on x86. Could we use it ?
 
 
 3) Is CMA just used for DMA ? I am a little confused here. :)
 I found the main code of CMA is implemented in dma-contiguous.c.
 
 
 4) The boot options cma=xxx and movablecore_map=xxx have different
 meanings for user. Reusing CMA could make user confused, I'm afraid.
 
 And, even if we reuse cma= option, we still need to do the work
 in patch 3~5, right ?
 
 
 Thanks. :)
 
 
 
 On 11/28/2012 12:08 PM, Jiang Liu wrote:
 On 2012-11-28 11:24, Bob Liu wrote:
 On Tue, Nov 27, 2012 at 8:49 PM, Tang Chentangc...@cn.fujitsu.com  wrote:
 On 11/27/2012 08:09 PM, Bob Liu wrote:

 On Tue, Nov 27, 2012 at 4:29 PM, Tang Chentangc...@cn.fujitsu.com
 wrote:

 Hi Liu,


 This feature is used in memory hotplug.

 In order to implement a whole node hotplug, we need to make sure the
 node contains no kernel memory, because memory used by kernel could
 not be migrated. (Since the kernel memory is directly mapped,
 VA = PA + __PAGE_OFFSET. So the physical address could not be changed.)

 User could specify all the memory on a node to be movable, so that the
 node could be hot-removed.


 Thank you for your explanation. It's reasonable.

 But i think it's a bit duplicated with CMA, i'm not sure but maybe we
 can combine it with CMA which already in mainline?

 Hi Liu,

 Thanks for your advice. :)

 CMA is Contiguous Memory Allocator, right?  What I'm trying to do is
 controlling where is the start of ZONE_MOVABLE of each node. Could
 CMA do this job ?

 cma will not control the start of ZONE_MOVABLE of each node, but it
 can declare a memory that always movable
 and all non movable allocate request will not happen on that area.

 Currently cma use a boot parameter cma= to declare a memory size
 that always movable.
 I think it might fulfill your requirement if extending the boot
 parameter with a start address.

 more info at http://lwn.net/Articles/468044/

 And also, after a short investigation, CMA seems need to base on
 memblock. But we need to limit memblock not to allocate memory on
 ZONE_MOVABLE. As a result, we need to know the ranges before memblock
 could be used. I'm afraid we still need an approach to get the ranges,
 such as a boot option, or from static ACPI tables such as SRAT/MPST.


 Yes, it's based on memblock and with boot option.
 In setup_arch32()
  dma_contiguous_reserve(0);  => will declare a CMA area using
 memblock_reserve()

 I don't know much about CMA for now. So if you have any better idea,
 please share with us, thanks. :)

 My idea is reuse cma like below patch(even not compiled) and boot with
 cma=size@start_address.
 I don't know whether it can work and whether suitable for your
 requirement, if not forgive me for this noises.

 diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
 index 612afcc..564962a 100644
 --- a/drivers/base/dma-contiguous.c
 +++ b/drivers/base/dma-contiguous.c
 @@ -59,11 +59,18 @@ struct cma *dma_contiguous_default_area;
*/
   static const unsigned long size_bytes = CMA_SIZE_MBYTES * SZ_1M;
   static long size_cmdline = -1;
 +static long cma_start_cmdline = -1;

   static int __init early_cma(char *p)
   {
+	char *oldp;
	pr_debug("%s(%s)\n", __func__, p);
+	oldp = p;
	size_cmdline = memparse(p, &p);
+
+	if (*p == '@')
+		cma_start_cmdline = memparse(p+1, &p);
+	printk("cma start: 0x%lx, size: 0x%lx\n", 

Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-26 Thread Jianguo Wu
On 2012/11/27 13:43, Tang Chen wrote:

> On 11/27/2012 11:10 AM, wujianguo wrote:
>> On 2012-11-23 18:44, Tang Chen wrote:
>>> [What we are doing]
>>> This patchset provide a boot option for user to specify ZONE_MOVABLE memory
>>> map for each node in the system.
>>>
>>> movablecore_map=nn[KMG]@ss[KMG]
>>>
>>
>> Hi Tang,
>> DMA addresses can't be set as movable. If someone boots the kernel with
>> movablecore_map=4G@0xa0 or another memory region that contains DMA addresses,
>> the system may fail to boot. Should this case be handled or mentioned
>> in the change log and kernel-parameters.txt?
> 
> Hi Wu,
> 
> Right, DMA address can't be set as movable. And I should have mentioned
> it in the doc more clear. :)
> 
> Actually, the situation is not only about DMA addresses. Because we limited
> the memblock allocation, even if users did not specify a DMA
> address but set too much memory as movable, leaving too little
> memory for the kernel to use, the kernel will also fail to boot.
> 
> I added the following info into doc, but obviously it was not clear
> enough. :)
> +If kernelcore or movablecore is also specified,
> +movablecore_map will have higher priority to be
> +satisfied. So the administrator should be careful that
> +the amount of movablecore_map areas are not too large.
> +Otherwise kernel won't have enough memory to start.
> 
> 
> And about how to fix it, as you said, we can handle the situation if
> user specified DMA address as movable. But how to handle "too little
> memory for kernel to start" case ?  Is there any info about how much
> at least memory kernel needs ?
> 

As far as I know, bootmem is mostly used by page structs when CONFIG_SPARSEMEM=y,
but it is hard to calculate exactly how much bootmem is needed.

> 
> Thanks for the comments. :)
> 
>>
>> Thanks,
>> Jianguo Wu
>>
> 
> 
> 
> .
> 





Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-26 Thread H. Peter Anvin

On 11/26/2012 09:43 PM, Tang Chen wrote:


And about how to fix it, as you said, we can handle the situation if
user specified DMA address as movable. But how to handle "too little
memory for kernel to start" case ?  Is there any info about how much
at least memory kernel needs ?



Not really, and it depends on so many variables.

-hpa


--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-26 Thread Tang Chen

On 11/27/2012 11:10 AM, wujianguo wrote:

On 2012-11-23 18:44, Tang Chen wrote:

[What we are doing]
This patchset provide a boot option for user to specify ZONE_MOVABLE memory
map for each node in the system.

movablecore_map=nn[KMG]@ss[KMG]



Hi Tang,
DMA addresses can't be set as movable. If someone boots the kernel with
movablecore_map=4G@0xa0 or another memory region that contains DMA addresses,
the system may fail to boot. Should this case be handled or mentioned
in the change log and kernel-parameters.txt?


Hi Wu,

Right, DMA addresses can't be set as movable. And I should have mentioned
it more clearly in the doc. :)

Actually, the situation is not only about DMA addresses. Because we limited
the memblock allocation, even if users did not specify a DMA
address but set too much memory as movable, leaving too little
memory for the kernel to use, the kernel will also fail to boot.

I added the following info into doc, but obviously it was not clear
enough. :)
+   If kernelcore or movablecore is also specified,
+   movablecore_map will have higher priority to be
+   satisfied. So the administrator should be careful that
+   the amount of movablecore_map areas are not too large.
+   Otherwise kernel won't have enough memory to start.


And about how to fix it, as you said, we can handle the situation where
the user specifies a DMA address as movable. But how do we handle the
"too little memory for the kernel to start" case? Is there any info about
the minimum amount of memory the kernel needs?


Thanks for the comments. :)



Thanks,
Jianguo Wu






Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-26 Thread wujianguo
On 2012-11-23 18:44, Tang Chen wrote:
> [What we are doing]
> This patchset provides a boot option for the user to specify ZONE_MOVABLE memory
> map for each node in the system.
> 
> movablecore_map=nn[KMG]@ss[KMG]
> 

Hi Tang,
DMA addresses can't be set as movable. If someone boots the kernel with
movablecore_map=4G@0xa0 or another memory region that contains DMA addresses,
the system may fail to boot. Should this case be handled or mentioned
in the change log and kernel-parameters.txt?

Thanks,
Jianguo Wu

> This option makes sure the memory range from ss to ss+nn is movable memory.
> 
> 
> [Why we do this]
> If we hot-remove memory, that memory cannot contain kernel memory,
> because Linux cannot migrate kernel memory currently. Therefore,
> we have to guarantee that the hot-removed memory contains only movable
> memory.
> 
> Linux has two boot options, kernelcore= and movablecore=, for
> creating movable memory. These boot options can specify the amount
> of memory use as kernel or movable memory. Using them, we can
> create ZONE_MOVABLE which has only movable memory.
> 
> But it does not fulfill a requirement of memory hot remove, because
> even if we specify the boot options, movable memory is distributed
> in each node evenly. So when we want to hot remove memory which
> memory range is 0x8000-0c000, we have no way to specify
> the memory as movable memory.
> 
> So we proposed a new feature which specifies memory range to use as
> movable memory.
> 
> 
> [Ways to do this]
> There may be 2 ways to specify movable memory.
>  1. use firmware information
>  2. use boot option
> 
> 1. use firmware information
>   According to ACPI spec 5.0, SRAT table has memory affinity structure
>   and the structure has a Hot Pluggable Field. See "5.2.16.2 Memory
>   Affinity Structure". If we use the information, we might be able to
>   specify movable memory by firmware. For example, if the Hot Pluggable
>   Field is enabled, Linux sets the memory as movable memory.
> 
> 2. use boot option
>   This is our proposal. New boot option can specify memory range to use
>   as movable memory.
> 
> 
> [How we do this]
> We chose the second way, because with the first way users cannot easily
> change the memory range to use as movable memory. We think that if we create
> movable memory, a performance regression may occur due to NUMA. In this case,
> the user can turn off the feature easily if we provide the boot option.
> And if we provide the boot option, the user can select which memory
> to use as movable memory easily.
> 
> 
> [How to use]
> Specify the following boot option:
> movablecore_map=nn[KMG]@ss[KMG]
> 
> That means physical address range from ss to ss+nn will be allocated as
> ZONE_MOVABLE.
> 
> And the following points should be considered.
> 
> 1) If the range is involved in a single node, then from ss to the end of
>the node will be ZONE_MOVABLE.
> 2) If the range covers two or more nodes, then from ss to the end of
>the node will be ZONE_MOVABLE, and all the other nodes will only
>have ZONE_MOVABLE.
> 3) If no range is in the node, then the node will have no ZONE_MOVABLE
>unless kernelcore or movablecore is specified.
> 4) This option could be specified at most MAX_NUMNODES times.
> 5) If kernelcore or movablecore is also specified, movablecore_map will have
>higher priority to be satisfied.
> 6) This option has no conflict with memmap option.
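A rough sketch of how points 1)-3) above translate into a per-node ZONE_MOVABLE start, assuming each node covers a physical range [start, end). This is a simplified interpretation for illustration, not the patchset's actual code (which works through zone_movable_limit[]); the `toy_` names are made up:

```c
#include <stdint.h>

struct toy_node { uint64_t start, end; };  /* physical range [start, end) */

/* Given a requested movable range [ss, ss + nn), return where
 * ZONE_MOVABLE begins on this node.  Points 1)-2): if the range overlaps
 * the node, everything from max(ss, node start) to the node end is
 * movable, so later nodes covered by the range become entirely movable.
 * Point 3): with no overlap, the node gets no ZONE_MOVABLE (the start
 * equals the node end, i.e. an empty movable region). */
uint64_t toy_movable_start(struct toy_node n, uint64_t ss, uint64_t nn)
{
	uint64_t se = ss + nn;

	if (ss < n.end && se > n.start)
		return ss > n.start ? ss : n.start;
	return n.end;
}
```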
> 
> 
> 
> Tang Chen (4):
>   page_alloc: add movable_memmap kernel parameter
>   page_alloc: Introduce zone_movable_limit[] to keep movable limit for
> nodes
>   page_alloc: Make movablecore_map has higher priority
>   page_alloc: Bootmem limit with movablecore_map
> 
> Yasuaki Ishimatsu (1):
>   x86: get pg_data_t's memory from other node
> 
>  Documentation/kernel-parameters.txt |   17 +++
>  arch/x86/mm/numa.c  |   11 ++-
>  include/linux/memblock.h|1 +
>  include/linux/mm.h  |   11 ++
>  mm/memblock.c   |   15 +++-
>  mm/page_alloc.c |  216 
> ++-
>  6 files changed, 263 insertions(+), 8 deletions(-)
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majord...@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: em...@kvack.org
> 




[PATCH v2 0/5] Add movablecore_map boot option

2012-11-23 Thread Tang Chen
[What we are doing]
This patchset provides a boot option for the user to specify ZONE_MOVABLE memory
map for each node in the system.

movablecore_map=nn[KMG]@ss[KMG]

This option makes sure the memory range from ss to ss+nn is movable memory.


[Why we do this]
If we hot-remove memory, that memory cannot contain kernel memory,
because Linux cannot currently migrate kernel memory. Therefore,
we have to guarantee that the hot-removed memory contains only movable
memory.

Linux has two boot options, kernelcore= and movablecore=, for
creating movable memory. These boot options specify the amount
of memory to use as kernel or movable memory. Using them, we can
create ZONE_MOVABLE, which contains only movable memory.

But they do not fulfill a requirement of memory hot-remove, because
even if we specify these boot options, movable memory is distributed
evenly across the nodes. So when we want to hot-remove memory whose
range is 0x8000-0c000, we have no way to specify
that memory as movable.

So we propose a new feature that specifies a memory range to use as
movable memory.


[Ways to do this]
There may be 2 ways to specify movable memory.
 1. use firmware information
 2. use boot option

1. use firmware information
  According to the ACPI 5.0 spec, the SRAT table has a Memory Affinity
  Structure, and that structure has a Hot Pluggable Field. See "5.2.16.2
  Memory Affinity Structure". Using this information, we could let
  firmware specify movable memory. For example, if the Hot Pluggable
  Field is set, Linux marks the memory as movable.

2. use boot option
  This is our proposal. A new boot option can specify a memory range to
  use as movable memory.


[How we do this]
We chose the second way because, with the first way, users cannot easily
change the memory range used as movable memory. We also think that
creating movable memory may cause a performance regression due to NUMA.
In that case, users can easily turn the feature off if we provide a boot
option. And with a boot option, users can easily select which memory to
use as movable memory.


[How to use]
Specify the following boot option:
movablecore_map=nn[KMG]@ss[KMG]

This means the physical address range from ss to ss+nn will be allocated
as ZONE_MOVABLE.

And the following points should be considered.

1) If the range falls within a single node, then from ss to the end of
   that node will be ZONE_MOVABLE.
2) If the range covers two or more nodes, then from ss to the end of
   the node containing ss will be ZONE_MOVABLE, and all the other covered
   nodes will have only ZONE_MOVABLE.
3) If no range is in the node, then the node will have no ZONE_MOVABLE
   unless kernelcore or movablecore is specified.
4) This option can be specified at most MAX_NUMNODES times.
5) If kernelcore or movablecore is also specified, movablecore_map will have
   higher priority to be satisfied.
6) This option does not conflict with the memmap option.



Tang Chen (4):
  page_alloc: add movable_memmap kernel parameter
  page_alloc: Introduce zone_movable_limit[] to keep movable limit for
nodes
  page_alloc: Make movablecore_map has higher priority
  page_alloc: Bootmem limit with movablecore_map

Yasuaki Ishimatsu (1):
  x86: get pg_data_t's memory from other node

 Documentation/kernel-parameters.txt |   17 +++
 arch/x86/mm/numa.c  |   11 ++-
 include/linux/memblock.h|1 +
 include/linux/mm.h  |   11 ++
 mm/memblock.c   |   15 +++-
 mm/page_alloc.c |  216 ++-
 6 files changed, 263 insertions(+), 8 deletions(-)



