Re: [PATCH v7 0/5] Make cpuid <-> nodeid mapping persistent

2016-05-23 Thread Zhu Guihua

Hi,

On 05/19/2016 10:46 PM, Peter Zijlstra wrote:

On Thu, May 19, 2016 at 06:39:41PM +0800, Zhu Guihua wrote:

[Problem]

cpuid <-> nodeid mapping is firstly established at boot time. And workqueue 
caches
the mapping in wq_numa_possible_cpumask in wq_numa_init() at boot time.

When doing node online/offline, cpuid <-> nodeid mapping is 
established/destroyed,
which means, cpuid <-> nodeid mapping will change if node hotplug happens. But
workqueue does not update wq_numa_possible_cpumask.

So why are you not fixing up wq_numa_possible_cpumask instead? That
seems the far easier solution.


We tried to do that. You can see our patch at
http://www.gossamer-threads.com/lists/linux/kernel/2116748

But maintainer thought, we should establish persistent cpuid<->nodeid 
relationship,

there is no need to change the map.

Cc TJ,
Could we return to workqueue to fix this?

Thanks,
Zhu


Do all the other archs that support NUMA and HOTPLUG have the mapping
stable, or will you now go fix each and every one of them?


.







Re: [PATCH v7 0/5] Make cpuid <-> nodeid mapping persistent

2016-05-23 Thread Zhu Guihua

Hi,

On 05/19/2016 10:46 PM, Peter Zijlstra wrote:

On Thu, May 19, 2016 at 06:39:41PM +0800, Zhu Guihua wrote:

[Problem]

cpuid <-> nodeid mapping is firstly established at boot time. And workqueue 
caches
the mapping in wq_numa_possible_cpumask in wq_numa_init() at boot time.

When doing node online/offline, cpuid <-> nodeid mapping is 
established/destroyed,
which means, cpuid <-> nodeid mapping will change if node hotplug happens. But
workqueue does not update wq_numa_possible_cpumask.

So why are you not fixing up wq_numa_possible_cpumask instead? That
seems the far easier solution.


We tried to do that. You can see our patch at
http://www.gossamer-threads.com/lists/linux/kernel/2116748

But maintainer thought, we should establish persistent cpuid<->nodeid 
relationship,

there is no need to change the map.

Cc TJ,
Could we return to workqueue to fix this?

Thanks,
Zhu


Do all the other archs that support NUMA and HOTPLUG have the mapping
stable, or will you now go fix each and every one of them?


.







[PATCH v7 4/5] x86, acpi, cpu-hotplug: Enable MADT APIs to return disabled apicid.

2016-05-19 Thread Zhu Guihua
From: Gu Zheng <guz.f...@cn.fujitsu.com>

The whole patch-set aims at making cpuid <-> nodeid mapping persistent. So that,
when node online/offline happens, cache based on cpuid <-> nodeid mapping such 
as
wq_numa_possible_cpumask will not cause any problem.
It contains 4 steps:
1. Enable apic registeration flow to handle both enabled and disabled cpus.
2. Introduce a new array storing all possible cpuid <-> apicid mapping.
3. Enable _MAT and MADT relative apis to return non-presnet or disabled cpus' 
apicid.
4. Establish all possible cpuid <-> nodeid mapping.

This patch finishes step 3.

There are four mappings in the kernel:
1. nodeid (logical node id)   <->   pxm(persistent)
2. apicid (physical cpu id)   <->   nodeid (persistent)
3. cpuid (logical cpu id) <->   apicid (not persistent, now persistent 
by step 2)
4. cpuid (logical cpu id) <->   nodeid (not persistent)

So, in order to setup persistent cpuid <-> nodeid mapping for all possible CPUs,
we should:
1. Setup cpuid <-> apicid mapping for all possible CPUs, which has been done in 
step 1, 2.
2. Setup cpuid <-> nodeid mapping for all possible CPUs. But before that, we 
should
   obtain all apicids from MADT.

All processors' apicids can be obtained by _MAT method or from MADT in ACPI.
The current code ignores disabled processors and returns -ENODEV.

After this patch, a new parameter will be added to MADT APIs so that caller
is able to control if disabled processors are ignored.

Signed-off-by: Gu Zheng <guz.f...@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangc...@cn.fujitsu.com>
Signed-off-by: Zhu Guihua <zhugh.f...@cn.fujitsu.com>
---
 drivers/acpi/acpi_processor.c |  5 +++-
 drivers/acpi/processor_core.c | 57 +++
 2 files changed, 40 insertions(+), 22 deletions(-)

diff --git a/drivers/acpi/acpi_processor.c b/drivers/acpi/acpi_processor.c
index 0d92d0f..9aa2dd1 100644
--- a/drivers/acpi/acpi_processor.c
+++ b/drivers/acpi/acpi_processor.c
@@ -300,8 +300,11 @@ static int acpi_processor_get_info(struct acpi_device 
*device)
 *  Extra Processor objects may be enumerated on MP systems with
 *  less than the max # of CPUs. They should be ignored _iff
 *  they are physically not present.
+*
+*  NOTE: Even if the processor has a cpuid, it may not present because
+*  cpuid <-> apicid mapping is persistent now.
 */
-   if (invalid_logical_cpuid(pr->id)) {
+   if (invalid_logical_cpuid(pr->id) || !cpu_present(pr->id)) {
int ret = acpi_processor_hotadd_init(pr);
if (ret)
return ret;
diff --git a/drivers/acpi/processor_core.c b/drivers/acpi/processor_core.c
index 33a38d6..824b98b 100644
--- a/drivers/acpi/processor_core.c
+++ b/drivers/acpi/processor_core.c
@@ -32,12 +32,12 @@ static struct acpi_table_madt *get_madt_table(void)
 }
 
 static int map_lapic_id(struct acpi_subtable_header *entry,
-u32 acpi_id, phys_cpuid_t *apic_id)
+u32 acpi_id, phys_cpuid_t *apic_id, bool ignore_disabled)
 {
struct acpi_madt_local_apic *lapic =
container_of(entry, struct acpi_madt_local_apic, header);
 
-   if (!(lapic->lapic_flags & ACPI_MADT_ENABLED))
+   if (ignore_disabled && !(lapic->lapic_flags & ACPI_MADT_ENABLED))
return -ENODEV;
 
if (lapic->processor_id != acpi_id)
@@ -48,12 +48,13 @@ static int map_lapic_id(struct acpi_subtable_header *entry,
 }
 
 static int map_x2apic_id(struct acpi_subtable_header *entry,
-   int device_declaration, u32 acpi_id, phys_cpuid_t *apic_id)
+   int device_declaration, u32 acpi_id, phys_cpuid_t *apic_id,
+   bool ignore_disabled)
 {
struct acpi_madt_local_x2apic *apic =
container_of(entry, struct acpi_madt_local_x2apic, header);
 
-   if (!(apic->lapic_flags & ACPI_MADT_ENABLED))
+   if (ignore_disabled && !(apic->lapic_flags & ACPI_MADT_ENABLED))
return -ENODEV;
 
if (device_declaration && (apic->uid == acpi_id)) {
@@ -65,12 +66,13 @@ static int map_x2apic_id(struct acpi_subtable_header *entry,
 }
 
 static int map_lsapic_id(struct acpi_subtable_header *entry,
-   int device_declaration, u32 acpi_id, phys_cpuid_t *apic_id)
+   int device_declaration, u32 acpi_id, phys_cpuid_t *apic_id,
+   bool ignore_disabled)
 {
struct acpi_madt_local_sapic *lsapic =
container_of(entry, struct acpi_madt_local_sapic, header);
 
-   if (!(lsapic->lapic_flags & ACPI_MADT_ENABLED))
+   if (ignore_disabled && !(lsapic->lapic_flags & ACPI_MADT_ENABLED))
return -ENODEV;
 
if (device_declaration) {
@@ -87,12 +89,1

[PATCH v7 5/5] x86, acpi, cpu-hotplug: Set persistent cpuid <-> nodeid mapping when booting.

2016-05-19 Thread Zhu Guihua
From: Gu Zheng <guz.f...@cn.fujitsu.com>

The whole patch-set aims at making cpuid <-> nodeid mapping persistent. So that,
when node online/offline happens, cache based on cpuid <-> nodeid mapping such 
as
wq_numa_possible_cpumask will not cause any problem.
It contains 4 steps:
1. Enable apic registeration flow to handle both enabled and disabled cpus.
2. Introduce a new array storing all possible cpuid <-> apicid mapping.
3. Enable _MAT and MADT relative apis to return non-presnet or disabled cpus' 
apicid.
4. Establish all possible cpuid <-> nodeid mapping.

This patch finishes step 4.

This patch set the persistent cpuid <-> nodeid mapping for all enabled/disabled
processors at boot time via an additional acpi namespace walk for processors.

Signed-off-by: Gu Zheng <guz.f...@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangc...@cn.fujitsu.com>
Signed-off-by: Zhu Guihua <zhugh.f...@cn.fujitsu.com>
---
 arch/ia64/kernel/acpi.c   |  3 +-
 arch/x86/kernel/acpi/boot.c   |  4 ++-
 drivers/acpi/acpi_processor.c |  5 
 drivers/acpi/bus.c|  3 ++
 drivers/acpi/processor_core.c | 65 +++
 include/linux/acpi.h  |  2 ++
 6 files changed, 80 insertions(+), 2 deletions(-)

diff --git a/arch/ia64/kernel/acpi.c b/arch/ia64/kernel/acpi.c
index b1698bc..bb36515 100644
--- a/arch/ia64/kernel/acpi.c
+++ b/arch/ia64/kernel/acpi.c
@@ -796,7 +796,7 @@ int acpi_isa_irq_to_gsi(unsigned isa_irq, u32 *gsi)
  *  ACPI based hotplug CPU support
  */
 #ifdef CONFIG_ACPI_HOTPLUG_CPU
-static int acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
+int acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
 {
 #ifdef CONFIG_ACPI_NUMA
/*
@@ -811,6 +811,7 @@ static int acpi_map_cpu2node(acpi_handle handle, int cpu, 
int physid)
 #endif
return 0;
 }
+EXPORT_SYMBOL(acpi_map_cpu2node);
 
 int additional_cpus __initdata = -1;
 
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 37248c3..0900264f 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -695,7 +695,7 @@ static void __init acpi_set_irq_model_ioapic(void)
 #ifdef CONFIG_ACPI_HOTPLUG_CPU
 #include 
 
-static void acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
+int acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
 {
 #ifdef CONFIG_ACPI_NUMA
int nid;
@@ -706,7 +706,9 @@ static void acpi_map_cpu2node(acpi_handle handle, int cpu, 
int physid)
numa_set_node(cpu, nid);
}
 #endif
+   return 0;
 }
+EXPORT_SYMBOL(acpi_map_cpu2node);
 
 int acpi_map_cpu(acpi_handle handle, phys_cpuid_t physid, int *pcpu)
 {
diff --git a/drivers/acpi/acpi_processor.c b/drivers/acpi/acpi_processor.c
index 9aa2dd1..b20f417 100644
--- a/drivers/acpi/acpi_processor.c
+++ b/drivers/acpi/acpi_processor.c
@@ -182,6 +182,11 @@ int __weak arch_register_cpu(int cpu)
 
 void __weak arch_unregister_cpu(int cpu) {}
 
+int __weak acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
+{
+   return -ENODEV;
+}
+
 static int acpi_processor_hotadd_init(struct acpi_processor *pr)
 {
unsigned long long sta;
diff --git a/drivers/acpi/bus.c b/drivers/acpi/bus.c
index 31e8da6..02f5721 100644
--- a/drivers/acpi/bus.c
+++ b/drivers/acpi/bus.c
@@ -1124,6 +1124,9 @@ static int __init acpi_init(void)
acpi_sleep_proc_init();
acpi_wakeup_device_init();
acpi_debugger_init();
+#ifdef CONFIG_ACPI_HOTPLUG_CPU
+   acpi_set_processor_mapping();
+#endif
return 0;
 }
 
diff --git a/drivers/acpi/processor_core.c b/drivers/acpi/processor_core.c
index 824b98b..69fb027 100644
--- a/drivers/acpi/processor_core.c
+++ b/drivers/acpi/processor_core.c
@@ -261,6 +261,71 @@ int acpi_get_cpuid(acpi_handle handle, int type, u32 
acpi_id)
 }
 EXPORT_SYMBOL_GPL(acpi_get_cpuid);
 
+#ifdef CONFIG_ACPI_HOTPLUG_CPU
+static bool map_processor(acpi_handle handle, phys_cpuid_t *phys_id, int 
*cpuid)
+{
+   int type;
+   u32 acpi_id;
+   acpi_status status;
+   acpi_object_type acpi_type;
+   unsigned long long tmp;
+   union acpi_object object = { 0 };
+   struct acpi_buffer buffer = { sizeof(union acpi_object),  };
+
+   status = acpi_get_type(handle, _type);
+   if (ACPI_FAILURE(status))
+   return false;
+
+   switch (acpi_type) {
+   case ACPI_TYPE_PROCESSOR:
+   status = acpi_evaluate_object(handle, NULL, NULL, );
+   if (ACPI_FAILURE(status))
+   return false;
+   acpi_id = object.processor.proc_id;
+   break;
+   case ACPI_TYPE_DEVICE:
+   status = acpi_evaluate_integer(handle, "_UID", NULL, );
+   if (ACPI_FAILURE(status))
+   return false;
+   acpi_id = tmp;
+   break;
+   default:
+   return false;
+   }
+
+   type = (acpi_type == AC

[PATCH v7 3/5] x86, acpi, cpu-hotplug: Introduce cpuid_to_apicid[] array to store persistent cpuid <-> apicid mapping.

2016-05-19 Thread Zhu Guihua
From: Gu Zheng <guz.f...@cn.fujitsu.com>

The whole patch-set aims at making cpuid <-> nodeid mapping persistent. So that,
when node online/offline happens, cache based on cpuid <-> nodeid mapping such 
as
wq_numa_possible_cpumask will not cause any problem.
It contains 4 steps:
1. Enable apic registeration flow to handle both enabled and disabled cpus.
2. Introduce a new array storing all possible cpuid <-> apicid mapping.
3. Enable _MAT and MADT relative apis to return non-presnet or disabled cpus' 
apicid.
4. Establish all possible cpuid <-> nodeid mapping.

This patch finishes step 2.

In this patch, we introduce a new static array named cpuid_to_apicid[],
which is large enough to store info for all possible cpus.

And then, we modify the cpuid calculation. In generic_processor_info(),
it simply finds the next unused cpuid. And it is also why the cpuid <-> nodeid
mapping changes with node hotplug.

After this patch, we find the next unused cpuid, map it to an apicid,
and store the mapping in cpuid_to_apicid[], so that cpuid <-> apicid
mapping will be persistent.

And finally we will use this array to make cpuid <-> nodeid persistent.

cpuid <-> apicid mapping is established at local apic registeration time.
But non-present or disabled cpus are ignored.

In this patch, we establish all possible cpuid <-> apicid mapping when
registering local apic.

Signed-off-by: Gu Zheng <guz.f...@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangc...@cn.fujitsu.com>
Signed-off-by: Zhu Guihua <zhugh.f...@cn.fujitsu.com>
---
 arch/x86/include/asm/mpspec.h |  1 +
 arch/x86/kernel/acpi/boot.c   |  6 ++---
 arch/x86/kernel/apic/apic.c   | 61 ---
 3 files changed, 61 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/mpspec.h b/arch/x86/include/asm/mpspec.h
index b07233b..db902d8 100644
--- a/arch/x86/include/asm/mpspec.h
+++ b/arch/x86/include/asm/mpspec.h
@@ -86,6 +86,7 @@ static inline void early_reserve_e820_mpc_new(void) { }
 #endif
 
 int generic_processor_info(int apicid, int version);
+int __generic_processor_info(int apicid, int version, bool enabled);
 
 #define PHYSID_ARRAY_SIZE  BITS_TO_LONGS(MAX_LOCAL_APIC)
 
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 9414f84..37248c3 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -174,15 +174,13 @@ static int acpi_register_lapic(int id, u8 enabled)
return -EINVAL;
}
 
-   if (!enabled) {
+   if (!enabled)
++disabled_cpus;
-   return -EINVAL;
-   }
 
if (boot_cpu_physical_apicid != -1U)
ver = apic_version[boot_cpu_physical_apicid];
 
-   return generic_processor_info(id, ver);
+   return __generic_processor_info(id, ver, enabled);
 }
 
 static int __init
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index 8e3c377..366fbbc 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -1998,7 +1998,53 @@ void disconnect_bsp_APIC(int virt_wire_setup)
apic_write(APIC_LVT1, value);
 }
 
-static int __generic_processor_info(int apicid, int version, bool enabled)
+/*
+ * The number of allocated logical CPU IDs. Since logical CPU IDs are allocated
+ * contiguously, it equals to current allocated max logical CPU ID plus 1.
+ * All allocated CPU ID should be in [0, nr_logical_cpuidi), so the maximum of
+ * nr_logical_cpuids is nr_cpu_ids.
+ *
+ * NOTE: Reserve 0 for BSP.
+ */
+static int nr_logical_cpuids = 1;
+
+/*
+ * Used to store mapping between logical CPU IDs and APIC IDs.
+ */
+static int cpuid_to_apicid[] = {
+   [0 ... NR_CPUS - 1] = -1,
+};
+
+/*
+ * Should use this API to allocate logical CPU IDs to keep nr_logical_cpuids
+ * and cpuid_to_apicid[] synchronized.
+ */
+static int allocate_logical_cpuid(int apicid)
+{
+   int i;
+
+   /*
+* cpuid <-> apicid mapping is persistent, so when a cpu is up,
+* check if the kernel has allocated a cpuid for it.
+*/
+   for (i = 0; i < nr_logical_cpuids; i++) {
+   if (cpuid_to_apicid[i] == apicid)
+   return i;
+   }
+
+   /* Allocate a new cpuid. */
+   if (nr_logical_cpuids >= nr_cpu_ids) {
+   WARN_ONCE(1, "Only %d processors supported."
+"Processor %d/0x%x and the rest are ignored.\n",
+nr_cpu_ids - 1, nr_logical_cpuids, apicid);
+   return -1;
+   }
+
+   cpuid_to_apicid[nr_logical_cpuids] = apicid;
+   return nr_logical_cpuids++;
+}
+
+int __generic_processor_info(int apicid, int version, bool enabled)
 {
int cpu, max = nr_cpu_ids;
bool boot_cpu_detected = physid_isset(boot_cpu_physical_apicid,
@@ -2079,8 +2125,17 @@ static int __generic_processor_info(int apicid, int 

[PATCH v7 4/5] x86, acpi, cpu-hotplug: Enable MADT APIs to return disabled apicid.

2016-05-19 Thread Zhu Guihua
From: Gu Zheng 

The whole patch-set aims at making cpuid <-> nodeid mapping persistent. So that,
when node online/offline happens, cache based on cpuid <-> nodeid mapping such 
as
wq_numa_possible_cpumask will not cause any problem.
It contains 4 steps:
1. Enable apic registeration flow to handle both enabled and disabled cpus.
2. Introduce a new array storing all possible cpuid <-> apicid mapping.
3. Enable _MAT and MADT relative apis to return non-presnet or disabled cpus' 
apicid.
4. Establish all possible cpuid <-> nodeid mapping.

This patch finishes step 3.

There are four mappings in the kernel:
1. nodeid (logical node id)   <->   pxm(persistent)
2. apicid (physical cpu id)   <->   nodeid (persistent)
3. cpuid (logical cpu id) <->   apicid (not persistent, now persistent 
by step 2)
4. cpuid (logical cpu id) <->   nodeid (not persistent)

So, in order to setup persistent cpuid <-> nodeid mapping for all possible CPUs,
we should:
1. Setup cpuid <-> apicid mapping for all possible CPUs, which has been done in 
step 1, 2.
2. Setup cpuid <-> nodeid mapping for all possible CPUs. But before that, we 
should
   obtain all apicids from MADT.

All processors' apicids can be obtained by _MAT method or from MADT in ACPI.
The current code ignores disabled processors and returns -ENODEV.

After this patch, a new parameter will be added to MADT APIs so that caller
is able to control if disabled processors are ignored.

Signed-off-by: Gu Zheng 
Signed-off-by: Tang Chen 
Signed-off-by: Zhu Guihua 
---
 drivers/acpi/acpi_processor.c |  5 +++-
 drivers/acpi/processor_core.c | 57 +++
 2 files changed, 40 insertions(+), 22 deletions(-)

diff --git a/drivers/acpi/acpi_processor.c b/drivers/acpi/acpi_processor.c
index 0d92d0f..9aa2dd1 100644
--- a/drivers/acpi/acpi_processor.c
+++ b/drivers/acpi/acpi_processor.c
@@ -300,8 +300,11 @@ static int acpi_processor_get_info(struct acpi_device 
*device)
 *  Extra Processor objects may be enumerated on MP systems with
 *  less than the max # of CPUs. They should be ignored _iff
 *  they are physically not present.
+*
+*  NOTE: Even if the processor has a cpuid, it may not present because
+*  cpuid <-> apicid mapping is persistent now.
 */
-   if (invalid_logical_cpuid(pr->id)) {
+   if (invalid_logical_cpuid(pr->id) || !cpu_present(pr->id)) {
int ret = acpi_processor_hotadd_init(pr);
if (ret)
return ret;
diff --git a/drivers/acpi/processor_core.c b/drivers/acpi/processor_core.c
index 33a38d6..824b98b 100644
--- a/drivers/acpi/processor_core.c
+++ b/drivers/acpi/processor_core.c
@@ -32,12 +32,12 @@ static struct acpi_table_madt *get_madt_table(void)
 }
 
 static int map_lapic_id(struct acpi_subtable_header *entry,
-u32 acpi_id, phys_cpuid_t *apic_id)
+u32 acpi_id, phys_cpuid_t *apic_id, bool ignore_disabled)
 {
struct acpi_madt_local_apic *lapic =
container_of(entry, struct acpi_madt_local_apic, header);
 
-   if (!(lapic->lapic_flags & ACPI_MADT_ENABLED))
+   if (ignore_disabled && !(lapic->lapic_flags & ACPI_MADT_ENABLED))
return -ENODEV;
 
if (lapic->processor_id != acpi_id)
@@ -48,12 +48,13 @@ static int map_lapic_id(struct acpi_subtable_header *entry,
 }
 
 static int map_x2apic_id(struct acpi_subtable_header *entry,
-   int device_declaration, u32 acpi_id, phys_cpuid_t *apic_id)
+   int device_declaration, u32 acpi_id, phys_cpuid_t *apic_id,
+   bool ignore_disabled)
 {
struct acpi_madt_local_x2apic *apic =
container_of(entry, struct acpi_madt_local_x2apic, header);
 
-   if (!(apic->lapic_flags & ACPI_MADT_ENABLED))
+   if (ignore_disabled && !(apic->lapic_flags & ACPI_MADT_ENABLED))
return -ENODEV;
 
if (device_declaration && (apic->uid == acpi_id)) {
@@ -65,12 +66,13 @@ static int map_x2apic_id(struct acpi_subtable_header *entry,
 }
 
 static int map_lsapic_id(struct acpi_subtable_header *entry,
-   int device_declaration, u32 acpi_id, phys_cpuid_t *apic_id)
+   int device_declaration, u32 acpi_id, phys_cpuid_t *apic_id,
+   bool ignore_disabled)
 {
struct acpi_madt_local_sapic *lsapic =
container_of(entry, struct acpi_madt_local_sapic, header);
 
-   if (!(lsapic->lapic_flags & ACPI_MADT_ENABLED))
+   if (ignore_disabled && !(lsapic->lapic_flags & ACPI_MADT_ENABLED))
return -ENODEV;
 
if (device_declaration) {
@@ -87,12 +89,13 @@ static int map_lsapic_id(struct acpi_subtable_header *entry,
  * Retrieve the ARM CPU physical identifier (MPIDR)

[PATCH v7 5/5] x86, acpi, cpu-hotplug: Set persistent cpuid <-> nodeid mapping when booting.

2016-05-19 Thread Zhu Guihua
From: Gu Zheng 

The whole patch-set aims at making cpuid <-> nodeid mapping persistent. So that,
when node online/offline happens, cache based on cpuid <-> nodeid mapping such 
as
wq_numa_possible_cpumask will not cause any problem.
It contains 4 steps:
1. Enable apic registeration flow to handle both enabled and disabled cpus.
2. Introduce a new array storing all possible cpuid <-> apicid mapping.
3. Enable _MAT and MADT relative apis to return non-presnet or disabled cpus' 
apicid.
4. Establish all possible cpuid <-> nodeid mapping.

This patch finishes step 4.

This patch set the persistent cpuid <-> nodeid mapping for all enabled/disabled
processors at boot time via an additional acpi namespace walk for processors.

Signed-off-by: Gu Zheng 
Signed-off-by: Tang Chen 
Signed-off-by: Zhu Guihua 
---
 arch/ia64/kernel/acpi.c   |  3 +-
 arch/x86/kernel/acpi/boot.c   |  4 ++-
 drivers/acpi/acpi_processor.c |  5 
 drivers/acpi/bus.c|  3 ++
 drivers/acpi/processor_core.c | 65 +++
 include/linux/acpi.h  |  2 ++
 6 files changed, 80 insertions(+), 2 deletions(-)

diff --git a/arch/ia64/kernel/acpi.c b/arch/ia64/kernel/acpi.c
index b1698bc..bb36515 100644
--- a/arch/ia64/kernel/acpi.c
+++ b/arch/ia64/kernel/acpi.c
@@ -796,7 +796,7 @@ int acpi_isa_irq_to_gsi(unsigned isa_irq, u32 *gsi)
  *  ACPI based hotplug CPU support
  */
 #ifdef CONFIG_ACPI_HOTPLUG_CPU
-static int acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
+int acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
 {
 #ifdef CONFIG_ACPI_NUMA
/*
@@ -811,6 +811,7 @@ static int acpi_map_cpu2node(acpi_handle handle, int cpu, 
int physid)
 #endif
return 0;
 }
+EXPORT_SYMBOL(acpi_map_cpu2node);
 
 int additional_cpus __initdata = -1;
 
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 37248c3..0900264f 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -695,7 +695,7 @@ static void __init acpi_set_irq_model_ioapic(void)
 #ifdef CONFIG_ACPI_HOTPLUG_CPU
 #include 
 
-static void acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
+int acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
 {
 #ifdef CONFIG_ACPI_NUMA
int nid;
@@ -706,7 +706,9 @@ static void acpi_map_cpu2node(acpi_handle handle, int cpu, 
int physid)
numa_set_node(cpu, nid);
}
 #endif
+   return 0;
 }
+EXPORT_SYMBOL(acpi_map_cpu2node);
 
 int acpi_map_cpu(acpi_handle handle, phys_cpuid_t physid, int *pcpu)
 {
diff --git a/drivers/acpi/acpi_processor.c b/drivers/acpi/acpi_processor.c
index 9aa2dd1..b20f417 100644
--- a/drivers/acpi/acpi_processor.c
+++ b/drivers/acpi/acpi_processor.c
@@ -182,6 +182,11 @@ int __weak arch_register_cpu(int cpu)
 
 void __weak arch_unregister_cpu(int cpu) {}
 
+int __weak acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
+{
+   return -ENODEV;
+}
+
 static int acpi_processor_hotadd_init(struct acpi_processor *pr)
 {
unsigned long long sta;
diff --git a/drivers/acpi/bus.c b/drivers/acpi/bus.c
index 31e8da6..02f5721 100644
--- a/drivers/acpi/bus.c
+++ b/drivers/acpi/bus.c
@@ -1124,6 +1124,9 @@ static int __init acpi_init(void)
acpi_sleep_proc_init();
acpi_wakeup_device_init();
acpi_debugger_init();
+#ifdef CONFIG_ACPI_HOTPLUG_CPU
+   acpi_set_processor_mapping();
+#endif
return 0;
 }
 
diff --git a/drivers/acpi/processor_core.c b/drivers/acpi/processor_core.c
index 824b98b..69fb027 100644
--- a/drivers/acpi/processor_core.c
+++ b/drivers/acpi/processor_core.c
@@ -261,6 +261,71 @@ int acpi_get_cpuid(acpi_handle handle, int type, u32 
acpi_id)
 }
 EXPORT_SYMBOL_GPL(acpi_get_cpuid);
 
+#ifdef CONFIG_ACPI_HOTPLUG_CPU
+static bool map_processor(acpi_handle handle, phys_cpuid_t *phys_id, int 
*cpuid)
+{
+   int type;
+   u32 acpi_id;
+   acpi_status status;
+   acpi_object_type acpi_type;
+   unsigned long long tmp;
+   union acpi_object object = { 0 };
+   struct acpi_buffer buffer = { sizeof(union acpi_object),  };
+
+   status = acpi_get_type(handle, _type);
+   if (ACPI_FAILURE(status))
+   return false;
+
+   switch (acpi_type) {
+   case ACPI_TYPE_PROCESSOR:
+   status = acpi_evaluate_object(handle, NULL, NULL, );
+   if (ACPI_FAILURE(status))
+   return false;
+   acpi_id = object.processor.proc_id;
+   break;
+   case ACPI_TYPE_DEVICE:
+   status = acpi_evaluate_integer(handle, "_UID", NULL, );
+   if (ACPI_FAILURE(status))
+   return false;
+   acpi_id = tmp;
+   break;
+   default:
+   return false;
+   }
+
+   type = (acpi_type == ACPI_TYPE_DEVICE) ? 1 : 0;
+
+   *phys_id = __acpi_get_phys_id(handle, type, acpi_id, false);
+   *cpuid = acpi_map_cpui

[PATCH v7 3/5] x86, acpi, cpu-hotplug: Introduce cpuid_to_apicid[] array to store persistent cpuid <-> apicid mapping.

2016-05-19 Thread Zhu Guihua
From: Gu Zheng 

The whole patch-set aims at making cpuid <-> nodeid mapping persistent. So that,
when node online/offline happens, cache based on cpuid <-> nodeid mapping such 
as
wq_numa_possible_cpumask will not cause any problem.
It contains 4 steps:
1. Enable apic registeration flow to handle both enabled and disabled cpus.
2. Introduce a new array storing all possible cpuid <-> apicid mapping.
3. Enable _MAT and MADT relative apis to return non-presnet or disabled cpus' 
apicid.
4. Establish all possible cpuid <-> nodeid mapping.

This patch finishes step 2.

In this patch, we introduce a new static array named cpuid_to_apicid[],
which is large enough to store info for all possible cpus.

And then, we modify the cpuid calculation. In generic_processor_info(),
it simply finds the next unused cpuid. And it is also why the cpuid <-> nodeid
mapping changes with node hotplug.

After this patch, we find the next unused cpuid, map it to an apicid,
and store the mapping in cpuid_to_apicid[], so that cpuid <-> apicid
mapping will be persistent.

And finally we will use this array to make cpuid <-> nodeid persistent.

cpuid <-> apicid mapping is established at local apic registeration time.
But non-present or disabled cpus are ignored.

In this patch, we establish all possible cpuid <-> apicid mapping when
registering local apic.

Signed-off-by: Gu Zheng 
Signed-off-by: Tang Chen 
Signed-off-by: Zhu Guihua 
---
 arch/x86/include/asm/mpspec.h |  1 +
 arch/x86/kernel/acpi/boot.c   |  6 ++---
 arch/x86/kernel/apic/apic.c   | 61 ---
 3 files changed, 61 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/mpspec.h b/arch/x86/include/asm/mpspec.h
index b07233b..db902d8 100644
--- a/arch/x86/include/asm/mpspec.h
+++ b/arch/x86/include/asm/mpspec.h
@@ -86,6 +86,7 @@ static inline void early_reserve_e820_mpc_new(void) { }
 #endif
 
 int generic_processor_info(int apicid, int version);
+int __generic_processor_info(int apicid, int version, bool enabled);
 
 #define PHYSID_ARRAY_SIZE  BITS_TO_LONGS(MAX_LOCAL_APIC)
 
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 9414f84..37248c3 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -174,15 +174,13 @@ static int acpi_register_lapic(int id, u8 enabled)
return -EINVAL;
}
 
-   if (!enabled) {
+   if (!enabled)
++disabled_cpus;
-   return -EINVAL;
-   }
 
if (boot_cpu_physical_apicid != -1U)
ver = apic_version[boot_cpu_physical_apicid];
 
-   return generic_processor_info(id, ver);
+   return __generic_processor_info(id, ver, enabled);
 }
 
 static int __init
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index 8e3c377..366fbbc 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -1998,7 +1998,53 @@ void disconnect_bsp_APIC(int virt_wire_setup)
apic_write(APIC_LVT1, value);
 }
 
-static int __generic_processor_info(int apicid, int version, bool enabled)
+/*
+ * The number of allocated logical CPU IDs. Since logical CPU IDs are allocated
+ * contiguously, it equals to current allocated max logical CPU ID plus 1.
+ * All allocated CPU ID should be in [0, nr_logical_cpuidi), so the maximum of
+ * nr_logical_cpuids is nr_cpu_ids.
+ *
+ * NOTE: Reserve 0 for BSP.
+ */
+static int nr_logical_cpuids = 1;
+
+/*
+ * Used to store mapping between logical CPU IDs and APIC IDs.
+ */
+static int cpuid_to_apicid[] = {
+   [0 ... NR_CPUS - 1] = -1,
+};
+
+/*
+ * Should use this API to allocate logical CPU IDs to keep nr_logical_cpuids
+ * and cpuid_to_apicid[] synchronized.
+ */
+static int allocate_logical_cpuid(int apicid)
+{
+   int i;
+
+   /*
+* cpuid <-> apicid mapping is persistent, so when a cpu is up,
+* check if the kernel has allocated a cpuid for it.
+*/
+   for (i = 0; i < nr_logical_cpuids; i++) {
+   if (cpuid_to_apicid[i] == apicid)
+   return i;
+   }
+
+   /* Allocate a new cpuid. */
+   if (nr_logical_cpuids >= nr_cpu_ids) {
+   WARN_ONCE(1, "Only %d processors supported."
+"Processor %d/0x%x and the rest are ignored.\n",
+nr_cpu_ids - 1, nr_logical_cpuids, apicid);
+   return -1;
+   }
+
+   cpuid_to_apicid[nr_logical_cpuids] = apicid;
+   return nr_logical_cpuids++;
+}
+
+int __generic_processor_info(int apicid, int version, bool enabled)
 {
int cpu, max = nr_cpu_ids;
bool boot_cpu_detected = physid_isset(boot_cpu_physical_apicid,
@@ -2079,8 +2125,17 @@ static int __generic_processor_info(int apicid, int 
version, bool enabled)
 * for BSP.
 */
cpu = 0;
-   } else
-

Re: [PATCH v7 0/5] Make cpuid <-> nodeid mapping persistent

2016-05-19 Thread Zhu Guihua

Hi Rafael,

This patch set was reported a kernel panic from Intel LKP team.
I do some investigation for this. I found that this panic was caused
because of Intel test machine. On their machine, the acpi tables has
something wrong. The proc_id of processors which are not present
cannot be assigned correctly, they are assigned the same value.
The wrong value will be used by our patch, and lead to panic.

Thanks,
Zhu

On 05/19/2016 06:39 PM, Zhu Guihua wrote:

[Problem]

cpuid <-> nodeid mapping is firstly established at boot time. And workqueue 
caches
the mapping in wq_numa_possible_cpumask in wq_numa_init() at boot time.

When doing node online/offline, cpuid <-> nodeid mapping is 
established/destroyed,
which means, cpuid <-> nodeid mapping will change if node hotplug happens. But
workqueue does not update wq_numa_possible_cpumask.

So here is the problem:

Assume we have the following cpuid <-> nodeid in the beginning:

   Node | CPU

node 0 |  0-14, 60-74
node 1 | 15-29, 75-89
node 2 | 30-44, 90-104
node 3 | 45-59, 105-119

and we hot-remove node2 and node3, it becomes:

   Node | CPU

node 0 |  0-14, 60-74
node 1 | 15-29, 75-89

and we hot-add node4 and node5, it becomes:

   Node | CPU

node 0 |  0-14, 60-74
node 1 | 15-29, 75-89
node 4 | 30-59
node 5 | 90-119

But in wq_numa_possible_cpumask, cpu30 is still mapped to node2, and the like.

When a pool workqueue is initialized, if its cpumask belongs to a node, its
pool->node will be mapped to that node. And memory used by this workqueue will
also be allocated on that node.

static struct worker_pool *get_unbound_pool(const struct workqueue_attrs 
*attrs){
...
 /* if cpumask is contained inside a NUMA node, we belong to that node 
*/
 if (wq_numa_enabled) {
 for_each_node(node) {
 if (cpumask_subset(pool->attrs->cpumask,
wq_numa_possible_cpumask[node])) {
 pool->node = node;
 break;
 }
 }
 }

Since wq_numa_possible_cpumask is not updated, it could be mapped to an offline 
node,
which will lead to memory allocation failure:

  SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
   cache: kmalloc-192, object size: 192, buffer size: 192, default order: 1, 
min order: 0
   node 0: slabs: 6172, objs: 259224, free: 245741
   node 1: slabs: 3261, objs: 136962, free: 127656

It happens here:

create_worker(struct worker_pool *pool)
  |--> worker = alloc_worker(pool->node);

static struct worker *alloc_worker(int node)
{
 struct worker *worker;

 worker = kzalloc_node(sizeof(*worker), GFP_KERNEL, node); --> Here, 
useing the wrong node.

 ..

 return worker;
}


[Solution]

There are four mappings in the kernel:
1. nodeid (logical node id)   <->   pxm
2. apicid (physical cpu id)   <->   nodeid
3. cpuid (logical cpu id) <->   apicid
4. cpuid (logical cpu id) <->   nodeid

1. pxm (proximity domain) is provided by ACPI firmware in SRAT, and nodeid <-> 
pxm
mapping is setup at boot time. This mapping is persistent, won't change.

2. apicid <-> nodeid mapping is setup using info in 1. The mapping is setup at 
boot
time and CPU hotadd time, and cleared at CPU hotremove time. This mapping 
is also
persistent.

3. cpuid <-> apicid mapping is setup at boot time and CPU hotadd time. cpuid is
allocated, lower ids first, and released at CPU hotremove time, reused for 
other
hotadded CPUs. So this mapping is not persistent.

4. cpuid <-> nodeid mapping is also setup at boot time and CPU hotadd time, and
cleared at CPU hotremove time. As a result of 3, this mapping is not 
persistent.

To fix this problem, we establish cpuid <-> nodeid mapping for all the possible
cpus at boot time, and make it persistent. And according to init_cpu_to_node(),
cpuid <-> nodeid mapping is based on apicid <-> nodeid mapping and cpuid <-> 
apicid
mapping. So the key point is obtaining all cpus' apicid.

apicid can be obtained by _MAT (Multiple APIC Table Entry) method or found in
MADT (Multiple APIC Description Table). So we finish the job in the following 
steps:

1. Enable apic registeration flow to handle both enabled and disabled cpus.
This is done by introducing an extra parameter to generic_processor_info to 
let the
caller control if disabled cpus are ignored.

2. Introduce a new array storing all possible cpuid <-> apicid mapping. And 
also modify
the way cpuid is calculated. Establish all possible cpuid <-> apicid 
mapping when
registering local apic. Store the mapping in this array.

3. Enable _MAT and MADT relative apis to return non-presnet or disabled cpus' 
apicid.
This is also do

Re: [PATCH v7 0/5] Make cpuid <-> nodeid mapping persistent

2016-05-19 Thread Zhu Guihua

Hi Rafael,

This patch set was reported a kernel panic from Intel LKP team.
I do some investigation for this. I found that this panic was caused
because of Intel test machine. On their machine, the acpi tables has
something wrong. The proc_id of processors which are not present
cannot be assigned correctly, they are assigned the same value.
The wrong value will be used by our patch, and lead to panic.

Thanks,
Zhu

On 05/19/2016 06:39 PM, Zhu Guihua wrote:

[Problem]

cpuid <-> nodeid mapping is firstly established at boot time. And workqueue 
caches
the mapping in wq_numa_possible_cpumask in wq_numa_init() at boot time.

When doing node online/offline, cpuid <-> nodeid mapping is 
established/destroyed,
which means, cpuid <-> nodeid mapping will change if node hotplug happens. But
workqueue does not update wq_numa_possible_cpumask.

So here is the problem:

Assume we have the following cpuid <-> nodeid in the beginning:

   Node | CPU

node 0 |  0-14, 60-74
node 1 | 15-29, 75-89
node 2 | 30-44, 90-104
node 3 | 45-59, 105-119

and we hot-remove node2 and node3, it becomes:

   Node | CPU

node 0 |  0-14, 60-74
node 1 | 15-29, 75-89

and we hot-add node4 and node5, it becomes:

   Node | CPU

node 0 |  0-14, 60-74
node 1 | 15-29, 75-89
node 4 | 30-59
node 5 | 90-119

But in wq_numa_possible_cpumask, cpu30 is still mapped to node2, and the like.

When a pool workqueue is initialized, if its cpumask belongs to a node, its
pool->node will be mapped to that node. And memory used by this workqueue will
also be allocated on that node.

static struct worker_pool *get_unbound_pool(const struct workqueue_attrs 
*attrs){
...
 /* if cpumask is contained inside a NUMA node, we belong to that node 
*/
 if (wq_numa_enabled) {
 for_each_node(node) {
 if (cpumask_subset(pool->attrs->cpumask,
wq_numa_possible_cpumask[node])) {
 pool->node = node;
 break;
 }
 }
 }

Since wq_numa_possible_cpumask is not updated, it could be mapped to an offline 
node,
which will lead to memory allocation failure:

  SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
   cache: kmalloc-192, object size: 192, buffer size: 192, default order: 1, 
min order: 0
   node 0: slabs: 6172, objs: 259224, free: 245741
   node 1: slabs: 3261, objs: 136962, free: 127656

It happens here:

create_worker(struct worker_pool *pool)
  |--> worker = alloc_worker(pool->node);

static struct worker *alloc_worker(int node)
{
 struct worker *worker;

 worker = kzalloc_node(sizeof(*worker), GFP_KERNEL, node); --> Here, 
useing the wrong node.

 ..

 return worker;
}


[Solution]

There are four mappings in the kernel:
1. nodeid (logical node id)   <->   pxm
2. apicid (physical cpu id)   <->   nodeid
3. cpuid (logical cpu id) <->   apicid
4. cpuid (logical cpu id) <->   nodeid

1. pxm (proximity domain) is provided by ACPI firmware in SRAT, and nodeid <-> 
pxm
mapping is setup at boot time. This mapping is persistent, won't change.

2. apicid <-> nodeid mapping is setup using info in 1. The mapping is setup at 
boot
time and CPU hotadd time, and cleared at CPU hotremove time. This mapping 
is also
persistent.

3. cpuid <-> apicid mapping is setup at boot time and CPU hotadd time. cpuid is
allocated, lower ids first, and released at CPU hotremove time, reused for 
other
hotadded CPUs. So this mapping is not persistent.

4. cpuid <-> nodeid mapping is also setup at boot time and CPU hotadd time, and
cleared at CPU hotremove time. As a result of 3, this mapping is not 
persistent.

To fix this problem, we establish cpuid <-> nodeid mapping for all the possible
cpus at boot time, and make it persistent. And according to init_cpu_to_node(),
cpuid <-> nodeid mapping is based on apicid <-> nodeid mapping and cpuid <-> 
apicid
mapping. So the key point is obtaining all cpus' apicid.

apicid can be obtained by _MAT (Multiple APIC Table Entry) method or found in
MADT (Multiple APIC Description Table). So we finish the job in the following 
steps:

1. Enable apic registeration flow to handle both enabled and disabled cpus.
This is done by introducing an extra parameter to generic_processor_info to 
let the
caller control if disabled cpus are ignored.

2. Introduce a new array storing all possible cpuid <-> apicid mapping. And 
also modify
the way cpuid is calculated. Establish all possible cpuid <-> apicid 
mapping when
registering local apic. Store the mapping in this array.

3. Enable _MAT and MADT relative apis to return non-presnet or disabled cpus' 
apicid.
This is also do

[PATCH v7 1/5] x86, memhp, numa: Online memory-less nodes at boot time.

2016-05-19 Thread Zhu Guihua
From: Tang Chen <tangc...@cn.fujitsu.com>

For now, x86 does not support memory-less node. A node without memory
will not be onlined, and the cpus on it will be mapped to the other
online nodes with memory in init_cpu_to_node(). The reason of doing this
is to ensure each cpu has mapped to a node with memory, so that it will
be able to allocate local memory for that cpu.

But we don't have to do it in this way.

In this series of patches, we are going to construct cpu <-> node mapping
for all possible cpus at boot time, which is a 1-1 mapping. It means the
cpu will be mapped to the node it belongs to, and will never be changed.
If a node has only cpus but no memory, the cpus on it will be mapped to
a memory-less node. And the memory-less node should be onlined.

This patch allocate pgdats for all memory-less nodes and online them at
boot time. Then build zonelists for these nodes. As a result, when cpus
on these memory-less nodes try to allocate memory from local node, it
will automatically fall back to the proper zones in the zonelists.

Signed-off-by: Zhu Guihua <zhugh.f...@cn.fujitsu.com>
---
 arch/x86/mm/numa.c | 27 +--
 1 file changed, 13 insertions(+), 14 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index f70c1ff..257dd56 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -725,22 +725,19 @@ void __init x86_numa_init(void)
numa_init(dummy_numa_init);
 }
 
-static __init int find_near_online_node(int node)
+static void __init init_memory_less_node(int nid)
 {
-   int n, val;
-   int min_val = INT_MAX;
-   int best_node = -1;
+   unsigned long zones_size[MAX_NR_ZONES] = {0};
+   unsigned long zholes_size[MAX_NR_ZONES] = {0};
 
-   for_each_online_node(n) {
-   val = node_distance(node, n);
+   /* Allocate and initialize node data. Memory-less node is now online.*/
+   alloc_node_data(nid);
+   free_area_init_node(nid, zones_size, 0, zholes_size);
 
-   if (val < min_val) {
-   min_val = val;
-   best_node = n;
-   }
-   }
-
-   return best_node;
+   /*
+* All zonelists will be built later in start_kernel() after per cpu
+* areas are initialized.
+*/
 }
 
 /*
@@ -769,8 +766,10 @@ void __init init_cpu_to_node(void)
 
if (node == NUMA_NO_NODE)
continue;
+
if (!node_online(node))
-   node = find_near_online_node(node);
+   init_memory_less_node(node);
+
numa_set_node(cpu, node);
}
 }
-- 
1.9.3





[PATCH v7 1/5] x86, memhp, numa: Online memory-less nodes at boot time.

2016-05-19 Thread Zhu Guihua
From: Tang Chen 

For now, x86 does not support memory-less node. A node without memory
will not be onlined, and the cpus on it will be mapped to the other
online nodes with memory in init_cpu_to_node(). The reason of doing this
is to ensure each cpu has mapped to a node with memory, so that it will
be able to allocate local memory for that cpu.

But we don't have to do it in this way.

In this series of patches, we are going to construct cpu <-> node mapping
for all possible cpus at boot time, which is a 1-1 mapping. It means the
cpu will be mapped to the node it belongs to, and will never be changed.
If a node has only cpus but no memory, the cpus on it will be mapped to
a memory-less node. And the memory-less node should be onlined.

This patch allocate pgdats for all memory-less nodes and online them at
boot time. Then build zonelists for these nodes. As a result, when cpus
on these memory-less nodes try to allocate memory from local node, it
will automatically fall back to the proper zones in the zonelists.

Signed-off-by: Zhu Guihua 
---
 arch/x86/mm/numa.c | 27 +--
 1 file changed, 13 insertions(+), 14 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index f70c1ff..257dd56 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -725,22 +725,19 @@ void __init x86_numa_init(void)
numa_init(dummy_numa_init);
 }
 
-static __init int find_near_online_node(int node)
+static void __init init_memory_less_node(int nid)
 {
-   int n, val;
-   int min_val = INT_MAX;
-   int best_node = -1;
+   unsigned long zones_size[MAX_NR_ZONES] = {0};
+   unsigned long zholes_size[MAX_NR_ZONES] = {0};
 
-   for_each_online_node(n) {
-   val = node_distance(node, n);
+   /* Allocate and initialize node data. Memory-less node is now online.*/
+   alloc_node_data(nid);
+   free_area_init_node(nid, zones_size, 0, zholes_size);
 
-   if (val < min_val) {
-   min_val = val;
-   best_node = n;
-   }
-   }
-
-   return best_node;
+   /*
+* All zonelists will be built later in start_kernel() after per cpu
+* areas are initialized.
+*/
 }
 
 /*
@@ -769,8 +766,10 @@ void __init init_cpu_to_node(void)
 
if (node == NUMA_NO_NODE)
continue;
+
if (!node_online(node))
-   node = find_near_online_node(node);
+   init_memory_less_node(node);
+
numa_set_node(cpu, node);
}
 }
-- 
1.9.3





[PATCH v7 0/5] Make cpuid <-> nodeid mapping persistent

2016-05-19 Thread Zhu Guihua
[Problem]

cpuid <-> nodeid mapping is firstly established at boot time. And workqueue 
caches
the mapping in wq_numa_possible_cpumask in wq_numa_init() at boot time.

When doing node online/offline, cpuid <-> nodeid mapping is 
established/destroyed,
which means, cpuid <-> nodeid mapping will change if node hotplug happens. But
workqueue does not update wq_numa_possible_cpumask.

So here is the problem:

Assume we have the following cpuid <-> nodeid in the beginning:

  Node | CPU

node 0 |  0-14, 60-74
node 1 | 15-29, 75-89
node 2 | 30-44, 90-104
node 3 | 45-59, 105-119

and we hot-remove node2 and node3, it becomes:

  Node | CPU

node 0 |  0-14, 60-74
node 1 | 15-29, 75-89

and we hot-add node4 and node5, it becomes:

  Node | CPU

node 0 |  0-14, 60-74
node 1 | 15-29, 75-89
node 4 | 30-59
node 5 | 90-119

But in wq_numa_possible_cpumask, cpu30 is still mapped to node2, and the like.

When a pool workqueue is initialized, if its cpumask belongs to a node, its
pool->node will be mapped to that node. And memory used by this workqueue will
also be allocated on that node.

static struct worker_pool *get_unbound_pool(const struct workqueue_attrs 
*attrs){
...
/* if cpumask is contained inside a NUMA node, we belong to that node */
if (wq_numa_enabled) {
for_each_node(node) {
if (cpumask_subset(pool->attrs->cpumask,
   wq_numa_possible_cpumask[node])) {
pool->node = node;
break;
}
}
}

Since wq_numa_possible_cpumask is not updated, it could be mapped to an offline 
node,
which will lead to memory allocation failure:

 SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
  cache: kmalloc-192, object size: 192, buffer size: 192, default order: 1, min 
order: 0
  node 0: slabs: 6172, objs: 259224, free: 245741
  node 1: slabs: 3261, objs: 136962, free: 127656

It happens here:

create_worker(struct worker_pool *pool)
 |--> worker = alloc_worker(pool->node);

static struct worker *alloc_worker(int node)
{
struct worker *worker;

worker = kzalloc_node(sizeof(*worker), GFP_KERNEL, node); --> Here, 
useing the wrong node.

..

return worker;
}


[Solution]

There are four mappings in the kernel:
1. nodeid (logical node id)   <->   pxm
2. apicid (physical cpu id)   <->   nodeid
3. cpuid (logical cpu id) <->   apicid
4. cpuid (logical cpu id) <->   nodeid

1. pxm (proximity domain) is provided by ACPI firmware in SRAT, and nodeid <-> 
pxm
   mapping is setup at boot time. This mapping is persistent, won't change.

2. apicid <-> nodeid mapping is setup using info in 1. The mapping is setup at 
boot
   time and CPU hotadd time, and cleared at CPU hotremove time. This mapping is 
also
   persistent.

3. cpuid <-> apicid mapping is setup at boot time and CPU hotadd time. cpuid is
   allocated, lower ids first, and released at CPU hotremove time, reused for 
other
   hotadded CPUs. So this mapping is not persistent.

4. cpuid <-> nodeid mapping is also setup at boot time and CPU hotadd time, and
   cleared at CPU hotremove time. As a result of 3, this mapping is not 
persistent.

To fix this problem, we establish cpuid <-> nodeid mapping for all the possible
cpus at boot time, and make it persistent. And according to init_cpu_to_node(),
cpuid <-> nodeid mapping is based on apicid <-> nodeid mapping and cpuid <-> 
apicid
mapping. So the key point is obtaining all cpus' apicid.

apicid can be obtained by _MAT (Multiple APIC Table Entry) method or found in
MADT (Multiple APIC Description Table). So we finish the job in the following 
steps:

1. Enable apic registeration flow to handle both enabled and disabled cpus.
   This is done by introducing an extra parameter to generic_processor_info to 
let the
   caller control if disabled cpus are ignored.

2. Introduce a new array storing all possible cpuid <-> apicid mapping. And 
also modify
   the way cpuid is calculated. Establish all possible cpuid <-> apicid mapping 
when
   registering local apic. Store the mapping in this array.

3. Enable _MAT and MADT relative apis to return non-presnet or disabled cpus' 
apicid.
   This is also done by introducing an extra parameter to these apis to let the 
caller
   control if disabled cpus are ignored.

4. Establish all possible cpuid <-> nodeid mapping.
   This is done via an additional acpi namespace walk for processors.


For previous discussion, please refer to:
https://lkml.org/lkml/2015/2/27/145
https://lkml.org/lkml/2015/3/25/989
https://lkml.org/lkml/2015/5/14/244
https://lkml.org/lkml/2015/7/7/200
https://lkml.org/lkml/2015/9/27/209

Change log v6 -> v7:
1. Fix arm64 build failure.

Change log v5 -> v6:
1. Define func acpi_map_cpu2node() for x86 and ia64 respectively.

Change log v4 -> v5:
1. Remove 

[PATCH v7 2/5] x86, acpi, cpu-hotplug: Enable acpi to register all possible cpus at boot time.

2016-05-19 Thread Zhu Guihua
From: Gu Zheng <guz.f...@cn.fujitsu.com>

[Problem]

cpuid <-> nodeid mapping is firstly established at boot time. And workqueue 
caches
the mapping in wq_numa_possible_cpumask in wq_numa_init() at boot time.

When doing node online/offline, cpuid <-> nodeid mapping is 
established/destroyed,
which means, cpuid <-> nodeid mapping will change if node hotplug happens. But
workqueue does not update wq_numa_possible_cpumask.

So here is the problem:

Assume we have the following cpuid <-> nodeid in the beginning:

  Node | CPU

node 0 |  0-14, 60-74
node 1 | 15-29, 75-89
node 2 | 30-44, 90-104
node 3 | 45-59, 105-119

and we hot-remove node2 and node3, it becomes:

  Node | CPU

node 0 |  0-14, 60-74
node 1 | 15-29, 75-89

and we hot-add node4 and node5, it becomes:

  Node | CPU

node 0 |  0-14, 60-74
node 1 | 15-29, 75-89
node 4 | 30-59
node 5 | 90-119

But in wq_numa_possible_cpumask, cpu30 is still mapped to node2, and the like.

When a pool workqueue is initialized, if its cpumask belongs to a node, its
pool->node will be mapped to that node. And memory used by this workqueue will
also be allocated on that node.

static struct worker_pool *get_unbound_pool(const struct workqueue_attrs 
*attrs){
...
/* if cpumask is contained inside a NUMA node, we belong to that node */
if (wq_numa_enabled) {
for_each_node(node) {
if (cpumask_subset(pool->attrs->cpumask,
   wq_numa_possible_cpumask[node])) {
pool->node = node;
break;
}
}
}

Since wq_numa_possible_cpumask is not updated, it could be mapped to an offline 
node,
which will lead to memory allocation failure:

 SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
  cache: kmalloc-192, object size: 192, buffer size: 192, default order: 1, min 
order: 0
  node 0: slabs: 6172, objs: 259224, free: 245741
  node 1: slabs: 3261, objs: 136962, free: 127656

It happens here:

create_worker(struct worker_pool *pool)
 |--> worker = alloc_worker(pool->node);

static struct worker *alloc_worker(int node)
{
struct worker *worker;

worker = kzalloc_node(sizeof(*worker), GFP_KERNEL, node); --> Here, 
useing the wrong node.

..

return worker;
}

[Solution]

There are four mappings in the kernel:
1. nodeid (logical node id)   <->   pxm
2. apicid (physical cpu id)   <->   nodeid
3. cpuid (logical cpu id) <->   apicid
4. cpuid (logical cpu id) <->   nodeid

1. pxm (proximity domain) is provided by ACPI firmware in SRAT, and nodeid <-> 
pxm
   mapping is setup at boot time. This mapping is persistent, won't change.

2. apicid <-> nodeid mapping is setup using info in 1. The mapping is setup at 
boot
   time and CPU hotadd time, and cleared at CPU hotremove time. This mapping is 
also
   persistent.

3. cpuid <-> apicid mapping is setup at boot time and CPU hotadd time. cpuid is
   allocated, lower ids first, and released at CPU hotremove time, reused for 
other
   hotadded CPUs. So this mapping is not persistent.

4. cpuid <-> nodeid mapping is also setup at boot time and CPU hotadd time, and
   cleared at CPU hotremove time. As a result of 3, this mapping is not 
persistent.

To fix this problem, we establish cpuid <-> nodeid mapping for all the possible
cpus at boot time, and make it persistent. And according to init_cpu_to_node(),
cpuid <-> nodeid mapping is based on apicid <-> nodeid mapping and cpuid <-> 
apicid
mapping. So the key point is obtaining all cpus' apicid.

apicid can be obtained by _MAT (Multiple APIC Table Entry) method or found in
MADT (Multiple APIC Description Table). So we finish the job in the following 
steps:

1. Enable apic registeration flow to handle both enabled and disabled cpus.
   This is done by introducing an extra parameter to generic_processor_info to 
let the
   caller control if disabled cpus are ignored.

2. Introduce a new array storing all possible cpuid <-> apicid mapping. And 
also modify
   the way cpuid is calculated. Establish all possible cpuid <-> apicid mapping 
when
   registering local apic. Store the mapping in this array.

3. Enable _MAT and MADT relative apis to return non-presnet or disabled cpus' 
apicid.
   This is also done by introducing an extra parameter to these apis to let the 
caller
   control if disabled cpus are ignored.

4. Establish all possible cpuid <-> nodeid mapping.
   This is done via an additional acpi namespace walk for processors.

This patch finished step 1.

Signed-off-by: Gu Zheng <guz.f...@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangc...@cn.fujitsu.com>
Signed-off-by: Zhu Guihua <zhugh.f...@cn.fujitsu.com>
---
 arc

[PATCH v7 0/5] Make cpuid <-> nodeid mapping persistent

2016-05-19 Thread Zhu Guihua
[Problem]

cpuid <-> nodeid mapping is firstly established at boot time. And workqueue 
caches
the mapping in wq_numa_possible_cpumask in wq_numa_init() at boot time.

When doing node online/offline, cpuid <-> nodeid mapping is 
established/destroyed,
which means, cpuid <-> nodeid mapping will change if node hotplug happens. But
workqueue does not update wq_numa_possible_cpumask.

So here is the problem:

Assume we have the following cpuid <-> nodeid in the beginning:

  Node | CPU

node 0 |  0-14, 60-74
node 1 | 15-29, 75-89
node 2 | 30-44, 90-104
node 3 | 45-59, 105-119

and we hot-remove node2 and node3, it becomes:

  Node | CPU

node 0 |  0-14, 60-74
node 1 | 15-29, 75-89

and we hot-add node4 and node5, it becomes:

  Node | CPU

node 0 |  0-14, 60-74
node 1 | 15-29, 75-89
node 4 | 30-59
node 5 | 90-119

But in wq_numa_possible_cpumask, cpu30 is still mapped to node2, and the like.

When a pool workqueue is initialized, if its cpumask belongs to a node, its
pool->node will be mapped to that node. And memory used by this workqueue will
also be allocated on that node.

static struct worker_pool *get_unbound_pool(const struct workqueue_attrs 
*attrs){
...
/* if cpumask is contained inside a NUMA node, we belong to that node */
if (wq_numa_enabled) {
for_each_node(node) {
if (cpumask_subset(pool->attrs->cpumask,
   wq_numa_possible_cpumask[node])) {
pool->node = node;
break;
}
}
}

Since wq_numa_possible_cpumask is not updated, it could be mapped to an offline 
node,
which will lead to memory allocation failure:

 SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
  cache: kmalloc-192, object size: 192, buffer size: 192, default order: 1, min 
order: 0
  node 0: slabs: 6172, objs: 259224, free: 245741
  node 1: slabs: 3261, objs: 136962, free: 127656

It happens here:

create_worker(struct worker_pool *pool)
 |--> worker = alloc_worker(pool->node);

static struct worker *alloc_worker(int node)
{
struct worker *worker;

worker = kzalloc_node(sizeof(*worker), GFP_KERNEL, node); --> Here, 
useing the wrong node.

..

return worker;
}


[Solution]

There are four mappings in the kernel:
1. nodeid (logical node id)   <->   pxm
2. apicid (physical cpu id)   <->   nodeid
3. cpuid (logical cpu id) <->   apicid
4. cpuid (logical cpu id) <->   nodeid

1. pxm (proximity domain) is provided by ACPI firmware in SRAT, and nodeid <-> 
pxm
   mapping is setup at boot time. This mapping is persistent, won't change.

2. apicid <-> nodeid mapping is setup using info in 1. The mapping is setup at 
boot
   time and CPU hotadd time, and cleared at CPU hotremove time. This mapping is 
also
   persistent.

3. cpuid <-> apicid mapping is setup at boot time and CPU hotadd time. cpuid is
   allocated, lower ids first, and released at CPU hotremove time, reused for 
other
   hotadded CPUs. So this mapping is not persistent.

4. cpuid <-> nodeid mapping is also setup at boot time and CPU hotadd time, and
   cleared at CPU hotremove time. As a result of 3, this mapping is not 
persistent.

To fix this problem, we establish cpuid <-> nodeid mapping for all the possible
cpus at boot time, and make it persistent. And according to init_cpu_to_node(),
cpuid <-> nodeid mapping is based on apicid <-> nodeid mapping and cpuid <-> 
apicid
mapping. So the key point is obtaining all cpus' apicid.

apicid can be obtained by _MAT (Multiple APIC Table Entry) method or found in
MADT (Multiple APIC Description Table). So we finish the job in the following 
steps:

1. Enable apic registeration flow to handle both enabled and disabled cpus.
   This is done by introducing an extra parameter to generic_processor_info to 
let the
   caller control if disabled cpus are ignored.

2. Introduce a new array storing all possible cpuid <-> apicid mapping. And 
also modify
   the way cpuid is calculated. Establish all possible cpuid <-> apicid mapping 
when
   registering local apic. Store the mapping in this array.

3. Enable _MAT and MADT relative apis to return non-presnet or disabled cpus' 
apicid.
   This is also done by introducing an extra parameter to these apis to let the 
caller
   control if disabled cpus are ignored.

4. Establish all possible cpuid <-> nodeid mapping.
   This is done via an additional acpi namespace walk for processors.


For previous discussion, please refer to:
https://lkml.org/lkml/2015/2/27/145
https://lkml.org/lkml/2015/3/25/989
https://lkml.org/lkml/2015/5/14/244
https://lkml.org/lkml/2015/7/7/200
https://lkml.org/lkml/2015/9/27/209

Change log v6 -> v7:
1. Fix arm64 build failure.

Change log v5 -> v6:
1. Define func acpi_map_cpu2node() for x86 and ia64 respectively.

Change log v4 -> v5:
1. Remove 

[PATCH v7 2/5] x86, acpi, cpu-hotplug: Enable acpi to register all possible cpus at boot time.

2016-05-19 Thread Zhu Guihua
From: Gu Zheng 

[Problem]

cpuid <-> nodeid mapping is firstly established at boot time. And workqueue 
caches
the mapping in wq_numa_possible_cpumask in wq_numa_init() at boot time.

When doing node online/offline, cpuid <-> nodeid mapping is 
established/destroyed,
which means, cpuid <-> nodeid mapping will change if node hotplug happens. But
workqueue does not update wq_numa_possible_cpumask.

So here is the problem:

Assume we have the following cpuid <-> nodeid in the beginning:

  Node | CPU

node 0 |  0-14, 60-74
node 1 | 15-29, 75-89
node 2 | 30-44, 90-104
node 3 | 45-59, 105-119

and we hot-remove node2 and node3, it becomes:

  Node | CPU

node 0 |  0-14, 60-74
node 1 | 15-29, 75-89

and we hot-add node4 and node5, it becomes:

  Node | CPU

node 0 |  0-14, 60-74
node 1 | 15-29, 75-89
node 4 | 30-59
node 5 | 90-119

But in wq_numa_possible_cpumask, cpu30 is still mapped to node2, and the like.

When a pool workqueue is initialized, if its cpumask belongs to a node, its
pool->node will be mapped to that node. And memory used by this workqueue will
also be allocated on that node.

static struct worker_pool *get_unbound_pool(const struct workqueue_attrs 
*attrs){
...
/* if cpumask is contained inside a NUMA node, we belong to that node */
if (wq_numa_enabled) {
for_each_node(node) {
if (cpumask_subset(pool->attrs->cpumask,
   wq_numa_possible_cpumask[node])) {
pool->node = node;
break;
}
}
}

Since wq_numa_possible_cpumask is not updated, it could be mapped to an offline 
node,
which will lead to memory allocation failure:

 SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
  cache: kmalloc-192, object size: 192, buffer size: 192, default order: 1, min 
order: 0
  node 0: slabs: 6172, objs: 259224, free: 245741
  node 1: slabs: 3261, objs: 136962, free: 127656

It happens here:

create_worker(struct worker_pool *pool)
 |--> worker = alloc_worker(pool->node);

static struct worker *alloc_worker(int node)
{
struct worker *worker;

worker = kzalloc_node(sizeof(*worker), GFP_KERNEL, node); --> Here, 
useing the wrong node.

..

return worker;
}

[Solution]

There are four mappings in the kernel:
1. nodeid (logical node id)   <->   pxm
2. apicid (physical cpu id)   <->   nodeid
3. cpuid (logical cpu id) <->   apicid
4. cpuid (logical cpu id) <->   nodeid

1. pxm (proximity domain) is provided by ACPI firmware in SRAT, and nodeid <-> 
pxm
   mapping is setup at boot time. This mapping is persistent, won't change.

2. apicid <-> nodeid mapping is setup using info in 1. The mapping is setup at 
boot
   time and CPU hotadd time, and cleared at CPU hotremove time. This mapping is 
also
   persistent.

3. cpuid <-> apicid mapping is setup at boot time and CPU hotadd time. cpuid is
   allocated, lower ids first, and released at CPU hotremove time, reused for 
other
   hotadded CPUs. So this mapping is not persistent.

4. cpuid <-> nodeid mapping is also setup at boot time and CPU hotadd time, and
   cleared at CPU hotremove time. As a result of 3, this mapping is not 
persistent.

To fix this problem, we establish cpuid <-> nodeid mapping for all the possible
cpus at boot time, and make it persistent. And according to init_cpu_to_node(),
cpuid <-> nodeid mapping is based on apicid <-> nodeid mapping and cpuid <-> 
apicid
mapping. So the key point is obtaining all cpus' apicid.

apicid can be obtained by _MAT (Multiple APIC Table Entry) method or found in
MADT (Multiple APIC Description Table). So we finish the job in the following 
steps:

1. Enable apic registeration flow to handle both enabled and disabled cpus.
   This is done by introducing an extra parameter to generic_processor_info to 
let the
   caller control if disabled cpus are ignored.

2. Introduce a new array storing all possible cpuid <-> apicid mapping. And 
also modify
   the way cpuid is calculated. Establish all possible cpuid <-> apicid mapping 
when
   registering local apic. Store the mapping in this array.

3. Enable _MAT and MADT relative apis to return non-presnet or disabled cpus' 
apicid.
   This is also done by introducing an extra parameter to these apis to let the 
caller
   control if disabled cpus are ignored.

4. Establish all possible cpuid <-> nodeid mapping.
   This is done via an additional acpi namespace walk for processors.

This patch finished step 1.

Signed-off-by: Gu Zheng 
Signed-off-by: Tang Chen 
Signed-off-by: Zhu Guihua 
---
 arch/x86/kernel/apic/apic.c | 26 +++---
 1 file changed, 19 insertions(+), 7 deletions(-)

diff --git a/arch/x86

Re: [PATCH v6 5/5] x86, acpi, cpu-hotplug: Set persistent cpuid <-> nodeid mapping when booting.

2016-04-12 Thread Zhu Guihua

Hi Lorenzo,

Thanks for your test and detailed suggestions.
I will update those later.

Thanks,
Zhu

On 04/06/2016 10:29 PM, Lorenzo Pieralisi wrote:

[+Dennis since he reported ARM64 build breakage]

On Thu, Mar 17, 2016 at 09:32:40AM +0800, Zhu Guihua wrote:

From: Gu Zheng <guz.f...@cn.fujitsu.com>

The whole patch-set aims at making cpuid <-> nodeid mapping persistent. So that,
when node online/offline happens, cache based on cpuid <-> nodeid mapping such 
as
wq_numa_possible_cpumask will not cause any problem.
It contains 4 steps:
1. Enable apic registeration flow to handle both enabled and disabled cpus.
2. Introduce a new array storing all possible cpuid <-> apicid mapping.
3. Enable _MAT and MADT relative apis to return non-presnet or disabled cpus' 
apicid.
4. Establish all possible cpuid <-> nodeid mapping.

This patch finishes step 4.

And it breaks the build on ARM64.

drivers/acpi/processor_core.c: In function 'set_processor_node_mapping':
drivers/acpi/processor_core.c:316:2: error: implicit declaration of
function 'acpi_map_cpu2node' [-Werror=implicit-function-declaration]

[...]


diff --git a/arch/ia64/kernel/acpi.c b/arch/ia64/kernel/acpi.c
index b1698bc..7db5563 100644
--- a/arch/ia64/kernel/acpi.c
+++ b/arch/ia64/kernel/acpi.c
@@ -796,7 +796,7 @@ int acpi_isa_irq_to_gsi(unsigned isa_irq, u32 *gsi)
   *  ACPI based hotplug CPU support
   */
  #ifdef CONFIG_ACPI_HOTPLUG_CPU
-static int acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
+int acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)

Return value seems to be ignored on IA64 so you can get rid of it,
see below.


  {
  #ifdef CONFIG_ACPI_NUMA
/*
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 0ce06ee..7d45261 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -696,7 +696,7 @@ static void __init acpi_set_irq_model_ioapic(void)
  #ifdef CONFIG_ACPI_HOTPLUG_CPU
  #include 
  
-static void acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)

+void acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
  {
  #ifdef CONFIG_ACPI_NUMA
int nid;
diff --git a/drivers/acpi/bus.c b/drivers/acpi/bus.c
index 0e85678..215177a 100644
--- a/drivers/acpi/bus.c
+++ b/drivers/acpi/bus.c
@@ -1110,6 +1110,9 @@ static int __init acpi_init(void)
acpi_sleep_proc_init();
acpi_wakeup_device_init();
acpi_debugger_init();
+#ifdef CONFIG_ACPI_HOTPLUG_CPU
+   acpi_set_processor_mapping();
+#endif
return 0;
  }
  
diff --git a/drivers/acpi/processor_core.c b/drivers/acpi/processor_core.c

index 824b98b..45580ff 100644
--- a/drivers/acpi/processor_core.c
+++ b/drivers/acpi/processor_core.c
@@ -261,6 +261,71 @@ int acpi_get_cpuid(acpi_handle handle, int type, u32 
acpi_id)
  }
  EXPORT_SYMBOL_GPL(acpi_get_cpuid);
  
+#ifdef CONFIG_ACPI_HOTPLUG_CPU

+static bool map_processor(acpi_handle handle, int *phys_id, int *cpuid)

phys_id size is 64 bits (phys_cpuid_t) on ARM64, (phys_cpuid_t *phys_id) is
what you have to have here.


+{
+   int type;
+   u32 acpi_id;
+   acpi_status status;
+   acpi_object_type acpi_type;
+   unsigned long long tmp;
+   union acpi_object object = { 0 };
+   struct acpi_buffer buffer = { sizeof(union acpi_object),  };
+
+   status = acpi_get_type(handle, _type);
+   if (ACPI_FAILURE(status))
+   return false;
+
+   switch (acpi_type) {
+   case ACPI_TYPE_PROCESSOR:
+   status = acpi_evaluate_object(handle, NULL, NULL, );
+   if (ACPI_FAILURE(status))
+   return false;
+   acpi_id = object.processor.proc_id;
+   break;
+   case ACPI_TYPE_DEVICE:
+   status = acpi_evaluate_integer(handle, "_UID", NULL, );
+   if (ACPI_FAILURE(status))
+   return false;
+   acpi_id = tmp;
+   break;
+   default:
+   return false;
+   }
+
+   type = (acpi_type == ACPI_TYPE_DEVICE) ? 1 : 0;
+
+   *phys_id = __acpi_get_phys_id(handle, type, acpi_id, false);

Wrong on ARM64, see above.


+   *cpuid = acpi_map_cpuid(*phys_id, acpi_id);
+   if (*cpuid == -1)
+   return false;
+
+   return true;
+}
+
+static acpi_status __init
+set_processor_node_mapping(acpi_handle handle, u32 lvl, void *context,
+  void **rv)
+{
+   u32 apic_id;

- You can't use u32 here see above
- This is generic code and on ARM64 I have no idea what apic_id means,
   choose another variable name please (phys_id ?)


+   int cpu_id;
+
+   if (!map_processor(handle, _id, _id))
+   return AE_ERROR;
+
+   acpi_map_cpu2node(handle, cpu_id, apic_id);
+   return AE_OK;
+}
+
+void __init acpi_set_processor_mapping(void)
+{
+   /* Set persistent cpu <-> node mapping for all processors. */
+   acpi_walk_n

Re: [PATCH v6 5/5] x86, acpi, cpu-hotplug: Set persistent cpuid <-> nodeid mapping when booting.

2016-04-12 Thread Zhu Guihua

Hi Lorenzo,

Thanks for your test and detailed suggestions.
I will update those later.

Thanks,
Zhu

On 04/06/2016 10:29 PM, Lorenzo Pieralisi wrote:

[+Dennis since he reported ARM64 build breakage]

On Thu, Mar 17, 2016 at 09:32:40AM +0800, Zhu Guihua wrote:

From: Gu Zheng 

The whole patch-set aims at making cpuid <-> nodeid mapping persistent. So that,
when node online/offline happens, cache based on cpuid <-> nodeid mapping such 
as
wq_numa_possible_cpumask will not cause any problem.
It contains 4 steps:
1. Enable apic registeration flow to handle both enabled and disabled cpus.
2. Introduce a new array storing all possible cpuid <-> apicid mapping.
3. Enable _MAT and MADT relative apis to return non-presnet or disabled cpus' 
apicid.
4. Establish all possible cpuid <-> nodeid mapping.

This patch finishes step 4.

And it breaks the build on ARM64.

drivers/acpi/processor_core.c: In function 'set_processor_node_mapping':
drivers/acpi/processor_core.c:316:2: error: implicit declaration of
function 'acpi_map_cpu2node' [-Werror=implicit-function-declaration]

[...]


diff --git a/arch/ia64/kernel/acpi.c b/arch/ia64/kernel/acpi.c
index b1698bc..7db5563 100644
--- a/arch/ia64/kernel/acpi.c
+++ b/arch/ia64/kernel/acpi.c
@@ -796,7 +796,7 @@ int acpi_isa_irq_to_gsi(unsigned isa_irq, u32 *gsi)
   *  ACPI based hotplug CPU support
   */
  #ifdef CONFIG_ACPI_HOTPLUG_CPU
-static int acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
+int acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)

Return value seems to be ignored on IA64 so you can get rid of it,
see below.


  {
  #ifdef CONFIG_ACPI_NUMA
/*
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 0ce06ee..7d45261 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -696,7 +696,7 @@ static void __init acpi_set_irq_model_ioapic(void)
  #ifdef CONFIG_ACPI_HOTPLUG_CPU
  #include 
  
-static void acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)

+void acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
  {
  #ifdef CONFIG_ACPI_NUMA
int nid;
diff --git a/drivers/acpi/bus.c b/drivers/acpi/bus.c
index 0e85678..215177a 100644
--- a/drivers/acpi/bus.c
+++ b/drivers/acpi/bus.c
@@ -1110,6 +1110,9 @@ static int __init acpi_init(void)
acpi_sleep_proc_init();
acpi_wakeup_device_init();
acpi_debugger_init();
+#ifdef CONFIG_ACPI_HOTPLUG_CPU
+   acpi_set_processor_mapping();
+#endif
return 0;
  }
  
diff --git a/drivers/acpi/processor_core.c b/drivers/acpi/processor_core.c

index 824b98b..45580ff 100644
--- a/drivers/acpi/processor_core.c
+++ b/drivers/acpi/processor_core.c
@@ -261,6 +261,71 @@ int acpi_get_cpuid(acpi_handle handle, int type, u32 
acpi_id)
  }
  EXPORT_SYMBOL_GPL(acpi_get_cpuid);
  
+#ifdef CONFIG_ACPI_HOTPLUG_CPU

+static bool map_processor(acpi_handle handle, int *phys_id, int *cpuid)

phys_id size is 64 bits (phys_cpuid_t) on ARM64, (phys_cpuid_t *phys_id) is
what you have to have here.


+{
+   int type;
+   u32 acpi_id;
+   acpi_status status;
+   acpi_object_type acpi_type;
+   unsigned long long tmp;
+   union acpi_object object = { 0 };
+   struct acpi_buffer buffer = { sizeof(union acpi_object),  };
+
+   status = acpi_get_type(handle, _type);
+   if (ACPI_FAILURE(status))
+   return false;
+
+   switch (acpi_type) {
+   case ACPI_TYPE_PROCESSOR:
+   status = acpi_evaluate_object(handle, NULL, NULL, );
+   if (ACPI_FAILURE(status))
+   return false;
+   acpi_id = object.processor.proc_id;
+   break;
+   case ACPI_TYPE_DEVICE:
+   status = acpi_evaluate_integer(handle, "_UID", NULL, );
+   if (ACPI_FAILURE(status))
+   return false;
+   acpi_id = tmp;
+   break;
+   default:
+   return false;
+   }
+
+   type = (acpi_type == ACPI_TYPE_DEVICE) ? 1 : 0;
+
+   *phys_id = __acpi_get_phys_id(handle, type, acpi_id, false);

Wrong on ARM64, see above.


+   *cpuid = acpi_map_cpuid(*phys_id, acpi_id);
+   if (*cpuid == -1)
+   return false;
+
+   return true;
+}
+
+static acpi_status __init
+set_processor_node_mapping(acpi_handle handle, u32 lvl, void *context,
+  void **rv)
+{
+   u32 apic_id;

- You can't use u32 here see above
- This is generic code and on ARM64 I have no idea what apic_id means,
   choose another variable name please (phys_id ?)


+   int cpu_id;
+
+   if (!map_processor(handle, _id, _id))
+   return AE_ERROR;
+
+   acpi_map_cpu2node(handle, cpu_id, apic_id);
+   return AE_OK;
+}
+
+void __init acpi_set_processor_mapping(void)
+{
+   /* Set persistent cpu <-> node mapping for all processors. */
+   acpi_walk_namespa

Re: [lkp] [x86, ACPI, cpu] f962c29c2f: BUG: unable to handle kernel paging request at 0000000000001b00

2016-04-12 Thread Zhu Guihua

Hi Rafael,

On 04/06/2016 11:21 PM, Rafael J. Wysocki wrote:

On 4/6/2016 2:43 AM, kernel test robot wrote:

FYI, we noticed the below changes on

https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git 
linux-next
commit f962c29c2f5d80be28bd76e2c3fadf1ce97ccd76 ("x86, ACPI, 
cpu-hotplug: Set persistent cpuid <-> nodeid mapping when booting")




Thanks for the report.

I've dropped the patch series that commit is part of from the 
linux-next branch of linux-pm.git because of this problem.




I tried to reproduce this failure, but I failed.
I compiled linux-next branch source code of linux-pm.git with this 
config, then I

used qemu to boot a guest with my compiled kernel, it worked well.

QEMU command line is:
./x86_64-softmmu/qemu-system-x86_64 -hda /home/fedora21.img \
-kernel /home/linux-pm/arch/x86_64/boot/bzImage \
-m 512,slots=100,maxmem=100G -smp 2,maxcpus=100 -serial stdio -enable-kvm \
-initrd /home/initramfs-4.6.0-rc2-00250-g8767898.img -append 
"root=/dev/sda3 console=ttyS0


Anyone succeed?

Thanks,
Zhu




+--+++
|  | 40d4d2e6c5 | f962c29c2f |
+--+++
| boot_successes   | 12 | 4  |
| boot_failures| 0  | 8  |
| BUG:unable_to_handle_kernel  | 0  | 8  |
| Oops | 0  | 8  |
| RIP:__alloc_pages_nodemask   | 0  | 8  |
| Kernel_panic-not_syncing:Fatal_exception | 0  | 8  |
| backtrace:pcpu_balance_workfn| 0  | 8  |
+--+++



[   10.413471] RAPL PMU: hw unit of domain pp0-core 2^-16 Joules
[   10.419886] RAPL PMU: hw unit of domain package 2^-16 Joules
[   10.426206] RAPL PMU: hw unit of domain dram 2^-16 Joules
[   10.433905] BUG: unable to handle kernel paging request at 
1b00
[   10.441699] IP: [] 
__alloc_pages_nodemask+0x210/0xc10

[   10.449097] PGD 0
[   10.451350] Oops:  [#1] SMP
[   10.454972] Modules linked in:
[   10.458390] CPU: 12 PID: 407 Comm: kworker/12:1 Not tainted 
4.6.0-rc1-7-gf962c29 #1
[   10.467326] Hardware name: Intel Corporation S2600WP/S2600WP, BIOS 
SE5C600.86B.02.02.0002.122320131210 12/23/2013

[   10.478776] Workqueue: events pcpu_balance_workfn
[   10.484035] task: 88100d858000 ti: 88100d86 task.ti: 
88100d86
[   10.492381] RIP: 0010:[] [] 
__alloc_pages_nodemask+0x210/0xc10

[   10.502491] RSP: :88100d863c28  EFLAGS: 00010286
[   10.508419] RAX:  RBX:  RCX: 
0004
[   10.516383] RDX: 8000 RSI: 0d06 RDI: 
81c9d430
[   10.524348] RBP: 88100d863d40 R08:  R09: 

[   10.532312] R10: 81ca1e87 R11: c9076000 R12: 
0003
[   10.540278] R13: 1b00 R14:  R15: 
88100d43e480
[   10.548243] FS:  () GS:88101340() 
knlGS:

[   10.557275] CS:  0010 DS:  ES:  CR0: 80050033
[   10.563687] CR2: 1b00 CR3: 00103ee06000 CR4: 
001406e0

[   10.571652] Stack:
[   10.573894]  c9090fff c9091000 881013002008 
0001
[   10.582196]  8163  024082c2 
0040
[   10.590492]  001e 880ffe57c0c0 88100d863c88 
811b7766

[   10.598794] Call Trace:
[   10.601527]  [] ? map_vm_area+0x36/0x50
[   10.607553]  [] pcpu_populate_chunk+0xae/0x340
[   10.614257]  [] pcpu_balance_workfn+0x578/0x5b0
[   10.621060]  [] process_one_work+0x155/0x440
[   10.627562]  [] worker_thread+0x4e/0x4c0
[   10.633687]  [] ? __schedule+0x34b/0x8b0
[   10.639809]  [] ? rescuer_thread+0x350/0x350
[   10.646319]  [] ? rescuer_thread+0x350/0x350
[   10.652831]  [] kthread+0xd4/0xf0
[   10.658276]  [] ret_from_fork+0x22/0x40
[   10.664302]  [] ? kthread_park+0x60/0x60
[   10.670423] Code: d0 49 8b 04 24 48 85 c0 75 e0 65 ff 0d 22 f0 e8 
7e eb 8b 31 d2 be 06 0d 00 00 48 c7 c7 30 d4 c9 81 e8 35 2c f2 ff e8 
50 40 77 00 <49> 83 7d 00 00 0f 85 7c fe ff ff 31 c0 e9 6d ff ff ff 
0f 1f 44
[   10.692137] RIP  [] 
__alloc_pages_nodemask+0x210/0xc10

[   10.699627]  RSP 
[   10.703518] CR2: 1b00
[   10.707220] ---[ end trace bc83ef9ded88f808 ]---
[   10.712373] Kernel panic - not syncing: Fatal exception


Thanks,
Kernel Test Robot




.







Re: [lkp] [x86, ACPI, cpu] f962c29c2f: BUG: unable to handle kernel paging request at 0000000000001b00

2016-04-12 Thread Zhu Guihua

Hi Rafael,

On 04/06/2016 11:21 PM, Rafael J. Wysocki wrote:

On 4/6/2016 2:43 AM, kernel test robot wrote:

FYI, we noticed the below changes on

https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git 
linux-next
commit f962c29c2f5d80be28bd76e2c3fadf1ce97ccd76 ("x86, ACPI, 
cpu-hotplug: Set persistent cpuid <-> nodeid mapping when booting")




Thanks for the report.

I've dropped the patch series that commit is part of from the 
linux-next branch of linux-pm.git because of this problem.




I tried to reproduce this failure, but I failed.
I compiled linux-next branch source code of linux-pm.git with this 
config, then I

used qemu to boot a guest with my compiled kernel, it worked well.

QEMU command line is:
./x86_64-softmmu/qemu-system-x86_64 -hda /home/fedora21.img \
-kernel /home/linux-pm/arch/x86_64/boot/bzImage \
-m 512,slots=100,maxmem=100G -smp 2,maxcpus=100 -serial stdio -enable-kvm \
-initrd /home/initramfs-4.6.0-rc2-00250-g8767898.img -append 
"root=/dev/sda3 console=ttyS0


Anyone succeed?

Thanks,
Zhu




+--+++
|  | 40d4d2e6c5 | f962c29c2f |
+--+++
| boot_successes   | 12 | 4  |
| boot_failures| 0  | 8  |
| BUG:unable_to_handle_kernel  | 0  | 8  |
| Oops | 0  | 8  |
| RIP:__alloc_pages_nodemask   | 0  | 8  |
| Kernel_panic-not_syncing:Fatal_exception | 0  | 8  |
| backtrace:pcpu_balance_workfn| 0  | 8  |
+--+++



[   10.413471] RAPL PMU: hw unit of domain pp0-core 2^-16 Joules
[   10.419886] RAPL PMU: hw unit of domain package 2^-16 Joules
[   10.426206] RAPL PMU: hw unit of domain dram 2^-16 Joules
[   10.433905] BUG: unable to handle kernel paging request at 
1b00
[   10.441699] IP: [] 
__alloc_pages_nodemask+0x210/0xc10

[   10.449097] PGD 0
[   10.451350] Oops:  [#1] SMP
[   10.454972] Modules linked in:
[   10.458390] CPU: 12 PID: 407 Comm: kworker/12:1 Not tainted 
4.6.0-rc1-7-gf962c29 #1
[   10.467326] Hardware name: Intel Corporation S2600WP/S2600WP, BIOS 
SE5C600.86B.02.02.0002.122320131210 12/23/2013

[   10.478776] Workqueue: events pcpu_balance_workfn
[   10.484035] task: 88100d858000 ti: 88100d86 task.ti: 
88100d86
[   10.492381] RIP: 0010:[] [] 
__alloc_pages_nodemask+0x210/0xc10

[   10.502491] RSP: :88100d863c28  EFLAGS: 00010286
[   10.508419] RAX:  RBX:  RCX: 
0004
[   10.516383] RDX: 8000 RSI: 0d06 RDI: 
81c9d430
[   10.524348] RBP: 88100d863d40 R08:  R09: 

[   10.532312] R10: 81ca1e87 R11: c9076000 R12: 
0003
[   10.540278] R13: 1b00 R14:  R15: 
88100d43e480
[   10.548243] FS:  () GS:88101340() 
knlGS:

[   10.557275] CS:  0010 DS:  ES:  CR0: 80050033
[   10.563687] CR2: 1b00 CR3: 00103ee06000 CR4: 
001406e0

[   10.571652] Stack:
[   10.573894]  c9090fff c9091000 881013002008 
0001
[   10.582196]  8163  024082c2 
0040
[   10.590492]  001e 880ffe57c0c0 88100d863c88 
811b7766

[   10.598794] Call Trace:
[   10.601527]  [] ? map_vm_area+0x36/0x50
[   10.607553]  [] pcpu_populate_chunk+0xae/0x340
[   10.614257]  [] pcpu_balance_workfn+0x578/0x5b0
[   10.621060]  [] process_one_work+0x155/0x440
[   10.627562]  [] worker_thread+0x4e/0x4c0
[   10.633687]  [] ? __schedule+0x34b/0x8b0
[   10.639809]  [] ? rescuer_thread+0x350/0x350
[   10.646319]  [] ? rescuer_thread+0x350/0x350
[   10.652831]  [] kthread+0xd4/0xf0
[   10.658276]  [] ret_from_fork+0x22/0x40
[   10.664302]  [] ? kthread_park+0x60/0x60
[   10.670423] Code: d0 49 8b 04 24 48 85 c0 75 e0 65 ff 0d 22 f0 e8 
7e eb 8b 31 d2 be 06 0d 00 00 48 c7 c7 30 d4 c9 81 e8 35 2c f2 ff e8 
50 40 77 00 <49> 83 7d 00 00 0f 85 7c fe ff ff 31 c0 e9 6d ff ff ff 
0f 1f 44
[   10.692137] RIP  [] 
__alloc_pages_nodemask+0x210/0xc10

[   10.699627]  RSP 
[   10.703518] CR2: 1b00
[   10.707220] ---[ end trace bc83ef9ded88f808 ]---
[   10.712373] Kernel panic - not syncing: Fatal exception


Thanks,
Kernel Test Robot




.







[PATCH v6 1/5] x86, memhp, numa: Online memory-less nodes at boot time.

2016-03-19 Thread Zhu Guihua
From: Tang Chen <tangc...@cn.fujitsu.com>

For now, x86 does not support memory-less node. A node without memory
will not be onlined, and the cpus on it will be mapped to the other
online nodes with memory in init_cpu_to_node(). The reason of doing this
is to ensure each cpu has mapped to a node with memory, so that it will
be able to allocate local memory for that cpu.

But we don't have to do it in this way.

In this series of patches, we are going to construct cpu <-> node mapping
for all possible cpus at boot time, which is a 1-1 mapping. It means the
cpu will be mapped to the node it belongs to, and will never be changed.
If a node has only cpus but no memory, the cpus on it will be mapped to
a memory-less node. And the memory-less node should be onlined.

This patch allocate pgdats for all memory-less nodes and online them at
boot time. Then build zonelists for these nodes. As a result, when cpus
on these memory-less nodes try to allocate memory from local node, it
will automatically fall back to the proper zones in the zonelists.

Signed-off-by: Zhu Guihua <zhugh.f...@cn.fujitsu.com>
---
 arch/x86/mm/numa.c | 27 +--
 1 file changed, 13 insertions(+), 14 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index f70c1ff..257dd56 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -725,22 +725,19 @@ void __init x86_numa_init(void)
numa_init(dummy_numa_init);
 }
 
-static __init int find_near_online_node(int node)
+static void __init init_memory_less_node(int nid)
 {
-   int n, val;
-   int min_val = INT_MAX;
-   int best_node = -1;
+   unsigned long zones_size[MAX_NR_ZONES] = {0};
+   unsigned long zholes_size[MAX_NR_ZONES] = {0};
 
-   for_each_online_node(n) {
-   val = node_distance(node, n);
+   /* Allocate and initialize node data. Memory-less node is now online.*/
+   alloc_node_data(nid);
+   free_area_init_node(nid, zones_size, 0, zholes_size);
 
-   if (val < min_val) {
-   min_val = val;
-   best_node = n;
-   }
-   }
-
-   return best_node;
+   /*
+* All zonelists will be built later in start_kernel() after per cpu
+* areas are initialized.
+*/
 }
 
 /*
@@ -769,8 +766,10 @@ void __init init_cpu_to_node(void)
 
if (node == NUMA_NO_NODE)
continue;
+
if (!node_online(node))
-   node = find_near_online_node(node);
+   init_memory_less_node(node);
+
numa_set_node(cpu, node);
}
 }
-- 
1.9.3





[PATCH v6 1/5] x86, memhp, numa: Online memory-less nodes at boot time.

2016-03-19 Thread Zhu Guihua
From: Tang Chen 

For now, x86 does not support memory-less node. A node without memory
will not be onlined, and the cpus on it will be mapped to the other
online nodes with memory in init_cpu_to_node(). The reason of doing this
is to ensure each cpu has mapped to a node with memory, so that it will
be able to allocate local memory for that cpu.

But we don't have to do it in this way.

In this series of patches, we are going to construct cpu <-> node mapping
for all possible cpus at boot time, which is a 1-1 mapping. It means the
cpu will be mapped to the node it belongs to, and will never be changed.
If a node has only cpus but no memory, the cpus on it will be mapped to
a memory-less node. And the memory-less node should be onlined.

This patch allocate pgdats for all memory-less nodes and online them at
boot time. Then build zonelists for these nodes. As a result, when cpus
on these memory-less nodes try to allocate memory from local node, it
will automatically fall back to the proper zones in the zonelists.

Signed-off-by: Zhu Guihua 
---
 arch/x86/mm/numa.c | 27 +--
 1 file changed, 13 insertions(+), 14 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index f70c1ff..257dd56 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -725,22 +725,19 @@ void __init x86_numa_init(void)
numa_init(dummy_numa_init);
 }
 
-static __init int find_near_online_node(int node)
+static void __init init_memory_less_node(int nid)
 {
-   int n, val;
-   int min_val = INT_MAX;
-   int best_node = -1;
+   unsigned long zones_size[MAX_NR_ZONES] = {0};
+   unsigned long zholes_size[MAX_NR_ZONES] = {0};
 
-   for_each_online_node(n) {
-   val = node_distance(node, n);
+   /* Allocate and initialize node data. Memory-less node is now online.*/
+   alloc_node_data(nid);
+   free_area_init_node(nid, zones_size, 0, zholes_size);
 
-   if (val < min_val) {
-   min_val = val;
-   best_node = n;
-   }
-   }
-
-   return best_node;
+   /*
+* All zonelists will be built later in start_kernel() after per cpu
+* areas are initialized.
+*/
 }
 
 /*
@@ -769,8 +766,10 @@ void __init init_cpu_to_node(void)
 
if (node == NUMA_NO_NODE)
continue;
+
if (!node_online(node))
-   node = find_near_online_node(node);
+   init_memory_less_node(node);
+
numa_set_node(cpu, node);
}
 }
-- 
1.9.3





[PATCH v6 0/5] Make cpuid <-> nodeid mapping persistent

2016-03-19 Thread Zhu Guihua
[Problem]

cpuid <-> nodeid mapping is firstly established at boot time. And workqueue 
caches
the mapping in wq_numa_possible_cpumask in wq_numa_init() at boot time.

When doing node online/offline, cpuid <-> nodeid mapping is 
established/destroyed,
which means, cpuid <-> nodeid mapping will change if node hotplug happens. But
workqueue does not update wq_numa_possible_cpumask.

So here is the problem:

Assume we have the following cpuid <-> nodeid in the beginning:

  Node | CPU

node 0 |  0-14, 60-74
node 1 | 15-29, 75-89
node 2 | 30-44, 90-104
node 3 | 45-59, 105-119

and we hot-remove node2 and node3, it becomes:

  Node | CPU

node 0 |  0-14, 60-74
node 1 | 15-29, 75-89

and we hot-add node4 and node5, it becomes:

  Node | CPU

node 0 |  0-14, 60-74
node 1 | 15-29, 75-89
node 4 | 30-59
node 5 | 90-119

But in wq_numa_possible_cpumask, cpu30 is still mapped to node2, and the like.

When a pool workqueue is initialized, if its cpumask belongs to a node, its
pool->node will be mapped to that node. And memory used by this workqueue will
also be allocated on that node.

static struct worker_pool *get_unbound_pool(const struct workqueue_attrs 
*attrs){
...
/* if cpumask is contained inside a NUMA node, we belong to that node */
if (wq_numa_enabled) {
for_each_node(node) {
if (cpumask_subset(pool->attrs->cpumask,
   wq_numa_possible_cpumask[node])) {
pool->node = node;
break;
}
}
}

Since wq_numa_possible_cpumask is not updated, it could be mapped to an offline 
node,
which will lead to memory allocation failure:

 SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
  cache: kmalloc-192, object size: 192, buffer size: 192, default order: 1, min 
order: 0
  node 0: slabs: 6172, objs: 259224, free: 245741
  node 1: slabs: 3261, objs: 136962, free: 127656

It happens here:

create_worker(struct worker_pool *pool)
 |--> worker = alloc_worker(pool->node);

static struct worker *alloc_worker(int node)
{
struct worker *worker;

worker = kzalloc_node(sizeof(*worker), GFP_KERNEL, node); --> Here, 
useing the wrong node.

..

return worker;
}


[Solution]

There are four mappings in the kernel:
1. nodeid (logical node id)   <->   pxm
2. apicid (physical cpu id)   <->   nodeid
3. cpuid (logical cpu id) <->   apicid
4. cpuid (logical cpu id) <->   nodeid

1. pxm (proximity domain) is provided by ACPI firmware in SRAT, and nodeid <-> 
pxm
   mapping is setup at boot time. This mapping is persistent, won't change.

2. apicid <-> nodeid mapping is setup using info in 1. The mapping is setup at 
boot
   time and CPU hotadd time, and cleared at CPU hotremove time. This mapping is 
also
   persistent.

3. cpuid <-> apicid mapping is setup at boot time and CPU hotadd time. cpuid is
   allocated, lower ids first, and released at CPU hotremove time, reused for 
other
   hotadded CPUs. So this mapping is not persistent.

4. cpuid <-> nodeid mapping is also setup at boot time and CPU hotadd time, and
   cleared at CPU hotremove time. As a result of 3, this mapping is not 
persistent.

To fix this problem, we establish cpuid <-> nodeid mapping for all the possible
cpus at boot time, and make it persistent. And according to init_cpu_to_node(),
cpuid <-> nodeid mapping is based on apicid <-> nodeid mapping and cpuid <-> 
apicid
mapping. So the key point is obtaining all cpus' apicid.

apicid can be obtained by _MAT (Multiple APIC Table Entry) method or found in
MADT (Multiple APIC Description Table). So we finish the job in the following 
steps:

1. Enable apic registeration flow to handle both enabled and disabled cpus.
   This is done by introducing an extra parameter to generic_processor_info to 
let the
   caller control if disabled cpus are ignored.

2. Introduce a new array storing all possible cpuid <-> apicid mapping. And 
also modify
   the way cpuid is calculated. Establish all possible cpuid <-> apicid mapping 
when
   registering local apic. Store the mapping in this array.

3. Enable _MAT and MADT relative apis to return non-presnet or disabled cpus' 
apicid.
   This is also done by introducing an extra parameter to these apis to let the 
caller
   control if disabled cpus are ignored.

4. Establish all possible cpuid <-> nodeid mapping.
   This is done via an additional acpi namespace walk for processors.


For previous discussion, please refer to:
https://lkml.org/lkml/2015/2/27/145
https://lkml.org/lkml/2015/3/25/989
https://lkml.org/lkml/2015/5/14/244
https://lkml.org/lkml/2015/7/7/200
https://lkml.org/lkml/2015/9/27/209

Change log v5 -> v6:
1. Define func acpi_map_cpu2node() for x86 and ia64 respectively.

Change log v4 -> v5:
1. Remove useless code in patch 1.
2. Small improvement of 

[PATCH v6 0/5] Make cpuid <-> nodeid mapping persistent

2016-03-19 Thread Zhu Guihua
[Problem]

cpuid <-> nodeid mapping is firstly established at boot time. And workqueue 
caches
the mapping in wq_numa_possible_cpumask in wq_numa_init() at boot time.

When doing node online/offline, cpuid <-> nodeid mapping is 
established/destroyed,
which means, cpuid <-> nodeid mapping will change if node hotplug happens. But
workqueue does not update wq_numa_possible_cpumask.

So here is the problem:

Assume we have the following cpuid <-> nodeid in the beginning:

  Node | CPU

node 0 |  0-14, 60-74
node 1 | 15-29, 75-89
node 2 | 30-44, 90-104
node 3 | 45-59, 105-119

and we hot-remove node2 and node3, it becomes:

  Node | CPU

node 0 |  0-14, 60-74
node 1 | 15-29, 75-89

and we hot-add node4 and node5, it becomes:

  Node | CPU

node 0 |  0-14, 60-74
node 1 | 15-29, 75-89
node 4 | 30-59
node 5 | 90-119

But in wq_numa_possible_cpumask, cpu30 is still mapped to node2, and the like.

When a pool workqueue is initialized, if its cpumask belongs to a node, its
pool->node will be mapped to that node. And memory used by this workqueue will
also be allocated on that node.

static struct worker_pool *get_unbound_pool(const struct workqueue_attrs 
*attrs){
...
/* if cpumask is contained inside a NUMA node, we belong to that node */
if (wq_numa_enabled) {
for_each_node(node) {
if (cpumask_subset(pool->attrs->cpumask,
   wq_numa_possible_cpumask[node])) {
pool->node = node;
break;
}
}
}

Since wq_numa_possible_cpumask is not updated, it could be mapped to an offline 
node,
which will lead to memory allocation failure:

 SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
  cache: kmalloc-192, object size: 192, buffer size: 192, default order: 1, min 
order: 0
  node 0: slabs: 6172, objs: 259224, free: 245741
  node 1: slabs: 3261, objs: 136962, free: 127656

It happens here:

create_worker(struct worker_pool *pool)
 |--> worker = alloc_worker(pool->node);

static struct worker *alloc_worker(int node)
{
struct worker *worker;

worker = kzalloc_node(sizeof(*worker), GFP_KERNEL, node); --> Here, 
useing the wrong node.

..

return worker;
}


[Solution]

There are four mappings in the kernel:
1. nodeid (logical node id)   <->   pxm
2. apicid (physical cpu id)   <->   nodeid
3. cpuid (logical cpu id) <->   apicid
4. cpuid (logical cpu id) <->   nodeid

1. pxm (proximity domain) is provided by ACPI firmware in SRAT, and nodeid <-> 
pxm
   mapping is setup at boot time. This mapping is persistent, won't change.

2. apicid <-> nodeid mapping is setup using info in 1. The mapping is setup at 
boot
   time and CPU hotadd time, and cleared at CPU hotremove time. This mapping is 
also
   persistent.

3. cpuid <-> apicid mapping is setup at boot time and CPU hotadd time. cpuid is
   allocated, lower ids first, and released at CPU hotremove time, reused for 
other
   hotadded CPUs. So this mapping is not persistent.

4. cpuid <-> nodeid mapping is also setup at boot time and CPU hotadd time, and
   cleared at CPU hotremove time. As a result of 3, this mapping is not 
persistent.

To fix this problem, we establish cpuid <-> nodeid mapping for all the possible
cpus at boot time, and make it persistent. And according to init_cpu_to_node(),
cpuid <-> nodeid mapping is based on apicid <-> nodeid mapping and cpuid <-> 
apicid
mapping. So the key point is obtaining all cpus' apicid.

apicid can be obtained by _MAT (Multiple APIC Table Entry) method or found in
MADT (Multiple APIC Description Table). So we finish the job in the following 
steps:

1. Enable apic registeration flow to handle both enabled and disabled cpus.
   This is done by introducing an extra parameter to generic_processor_info to 
let the
   caller control if disabled cpus are ignored.

2. Introduce a new array storing all possible cpuid <-> apicid mapping. And 
also modify
   the way cpuid is calculated. Establish all possible cpuid <-> apicid mapping 
when
   registering local apic. Store the mapping in this array.

3. Enable _MAT and MADT relative apis to return non-presnet or disabled cpus' 
apicid.
   This is also done by introducing an extra parameter to these apis to let the 
caller
   control if disabled cpus are ignored.

4. Establish all possible cpuid <-> nodeid mapping.
   This is done via an additional acpi namespace walk for processors.


For previous discussion, please refer to:
https://lkml.org/lkml/2015/2/27/145
https://lkml.org/lkml/2015/3/25/989
https://lkml.org/lkml/2015/5/14/244
https://lkml.org/lkml/2015/7/7/200
https://lkml.org/lkml/2015/9/27/209

Change log v5 -> v6:
1. Define func acpi_map_cpu2node() for x86 and ia64 respectively.

Change log v4 -> v5:
1. Remove useless code in patch 1.
2. Small improvement of 

[PATCH v6 3/5] x86, acpi, cpu-hotplug: Introduce cpuid_to_apicid[] array to store persistent cpuid <-> apicid mapping.

2016-03-19 Thread Zhu Guihua
From: Gu Zheng <guz.f...@cn.fujitsu.com>

The whole patch-set aims at making cpuid <-> nodeid mapping persistent. So that,
when node online/offline happens, cache based on cpuid <-> nodeid mapping such 
as
wq_numa_possible_cpumask will not cause any problem.
It contains 4 steps:
1. Enable apic registeration flow to handle both enabled and disabled cpus.
2. Introduce a new array storing all possible cpuid <-> apicid mapping.
3. Enable _MAT and MADT relative apis to return non-presnet or disabled cpus' 
apicid.
4. Establish all possible cpuid <-> nodeid mapping.

This patch finishes step 2.

In this patch, we introduce a new static array named cpuid_to_apicid[],
which is large enough to store info for all possible cpus.

And then, we modify the cpuid calculation. In generic_processor_info(),
it simply finds the next unused cpuid. And it is also why the cpuid <-> nodeid
mapping changes with node hotplug.

After this patch, we find the next unused cpuid, map it to an apicid,
and store the mapping in cpuid_to_apicid[], so that cpuid <-> apicid
mapping will be persistent.

And finally we will use this array to make cpuid <-> nodeid persistent.

cpuid <-> apicid mapping is established at local apic registeration time.
But non-present or disabled cpus are ignored.

In this patch, we establish all possible cpuid <-> apicid mapping when
registering local apic.

Signed-off-by: Gu Zheng <guz.f...@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangc...@cn.fujitsu.com>
Signed-off-by: Zhu Guihua <zhugh.f...@cn.fujitsu.com>
---
 arch/x86/include/asm/mpspec.h |  1 +
 arch/x86/kernel/acpi/boot.c   |  6 ++---
 arch/x86/kernel/apic/apic.c   | 61 ---
 3 files changed, 61 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/mpspec.h b/arch/x86/include/asm/mpspec.h
index b07233b..db902d8 100644
--- a/arch/x86/include/asm/mpspec.h
+++ b/arch/x86/include/asm/mpspec.h
@@ -86,6 +86,7 @@ static inline void early_reserve_e820_mpc_new(void) { }
 #endif
 
 int generic_processor_info(int apicid, int version);
+int __generic_processor_info(int apicid, int version, bool enabled);
 
 #define PHYSID_ARRAY_SIZE  BITS_TO_LONGS(MAX_LOCAL_APIC)
 
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index e759076..0ce06ee 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -174,15 +174,13 @@ static int acpi_register_lapic(int id, u8 enabled)
return -EINVAL;
}
 
-   if (!enabled) {
+   if (!enabled)
++disabled_cpus;
-   return -EINVAL;
-   }
 
if (boot_cpu_physical_apicid != -1U)
ver = apic_version[boot_cpu_physical_apicid];
 
-   return generic_processor_info(id, ver);
+   return __generic_processor_info(id, ver, enabled);
 }
 
 static int __init
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index 55cb80e..bedb674 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -1998,7 +1998,53 @@ void disconnect_bsp_APIC(int virt_wire_setup)
apic_write(APIC_LVT1, value);
 }
 
-static int __generic_processor_info(int apicid, int version, bool enabled)
+/*
+ * The number of allocated logical CPU IDs. Since logical CPU IDs are allocated
+ * contiguously, it equals to current allocated max logical CPU ID plus 1.
+ * All allocated CPU ID should be in [0, nr_logical_cpuidi), so the maximum of
+ * nr_logical_cpuids is nr_cpu_ids.
+ *
+ * NOTE: Reserve 0 for BSP.
+ */
+static int nr_logical_cpuids = 1;
+
+/*
+ * Used to store mapping between logical CPU IDs and APIC IDs.
+ */
+static int cpuid_to_apicid[] = {
+   [0 ... NR_CPUS - 1] = -1,
+};
+
+/*
+ * Should use this API to allocate logical CPU IDs to keep nr_logical_cpuids
+ * and cpuid_to_apicid[] synchronized.
+ */
+static int allocate_logical_cpuid(int apicid)
+{
+   int i;
+
+   /*
+* cpuid <-> apicid mapping is persistent, so when a cpu is up,
+* check if the kernel has allocated a cpuid for it.
+*/
+   for (i = 0; i < nr_logical_cpuids; i++) {
+   if (cpuid_to_apicid[i] == apicid)
+   return i;
+   }
+
+   /* Allocate a new cpuid. */
+   if (nr_logical_cpuids >= nr_cpu_ids) {
+   WARN_ONCE(1, "Only %d processors supported."
+"Processor %d/0x%x and the rest are ignored.\n",
+nr_cpu_ids - 1, nr_logical_cpuids, apicid);
+   return -1;
+   }
+
+   cpuid_to_apicid[nr_logical_cpuids] = apicid;
+   return nr_logical_cpuids++;
+}
+
+int __generic_processor_info(int apicid, int version, bool enabled)
 {
int cpu, max = nr_cpu_ids;
bool boot_cpu_detected = physid_isset(boot_cpu_physical_apicid,
@@ -2079,8 +2125,17 @@ static int __generic_processor_info(int apicid, int 

[PATCH v6 3/5] x86, acpi, cpu-hotplug: Introduce cpuid_to_apicid[] array to store persistent cpuid <-> apicid mapping.

2016-03-19 Thread Zhu Guihua
From: Gu Zheng 

The whole patch-set aims at making cpuid <-> nodeid mapping persistent. So that,
when node online/offline happens, cache based on cpuid <-> nodeid mapping such 
as
wq_numa_possible_cpumask will not cause any problem.
It contains 4 steps:
1. Enable apic registeration flow to handle both enabled and disabled cpus.
2. Introduce a new array storing all possible cpuid <-> apicid mapping.
3. Enable _MAT and MADT relative apis to return non-presnet or disabled cpus' 
apicid.
4. Establish all possible cpuid <-> nodeid mapping.

This patch finishes step 2.

In this patch, we introduce a new static array named cpuid_to_apicid[],
which is large enough to store info for all possible cpus.

And then, we modify the cpuid calculation. In generic_processor_info(),
it simply finds the next unused cpuid. And it is also why the cpuid <-> nodeid
mapping changes with node hotplug.

After this patch, we find the next unused cpuid, map it to an apicid,
and store the mapping in cpuid_to_apicid[], so that cpuid <-> apicid
mapping will be persistent.

And finally we will use this array to make cpuid <-> nodeid persistent.

cpuid <-> apicid mapping is established at local apic registeration time.
But non-present or disabled cpus are ignored.

In this patch, we establish all possible cpuid <-> apicid mapping when
registering local apic.

Signed-off-by: Gu Zheng 
Signed-off-by: Tang Chen 
Signed-off-by: Zhu Guihua 
---
 arch/x86/include/asm/mpspec.h |  1 +
 arch/x86/kernel/acpi/boot.c   |  6 ++---
 arch/x86/kernel/apic/apic.c   | 61 ---
 3 files changed, 61 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/mpspec.h b/arch/x86/include/asm/mpspec.h
index b07233b..db902d8 100644
--- a/arch/x86/include/asm/mpspec.h
+++ b/arch/x86/include/asm/mpspec.h
@@ -86,6 +86,7 @@ static inline void early_reserve_e820_mpc_new(void) { }
 #endif
 
 int generic_processor_info(int apicid, int version);
+int __generic_processor_info(int apicid, int version, bool enabled);
 
 #define PHYSID_ARRAY_SIZE  BITS_TO_LONGS(MAX_LOCAL_APIC)
 
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index e759076..0ce06ee 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -174,15 +174,13 @@ static int acpi_register_lapic(int id, u8 enabled)
return -EINVAL;
}
 
-   if (!enabled) {
+   if (!enabled)
++disabled_cpus;
-   return -EINVAL;
-   }
 
if (boot_cpu_physical_apicid != -1U)
ver = apic_version[boot_cpu_physical_apicid];
 
-   return generic_processor_info(id, ver);
+   return __generic_processor_info(id, ver, enabled);
 }
 
 static int __init
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index 55cb80e..bedb674 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -1998,7 +1998,53 @@ void disconnect_bsp_APIC(int virt_wire_setup)
apic_write(APIC_LVT1, value);
 }
 
-static int __generic_processor_info(int apicid, int version, bool enabled)
+/*
+ * The number of allocated logical CPU IDs. Since logical CPU IDs are allocated
+ * contiguously, it equals to current allocated max logical CPU ID plus 1.
+ * All allocated CPU ID should be in [0, nr_logical_cpuidi), so the maximum of
+ * nr_logical_cpuids is nr_cpu_ids.
+ *
+ * NOTE: Reserve 0 for BSP.
+ */
+static int nr_logical_cpuids = 1;
+
+/*
+ * Used to store mapping between logical CPU IDs and APIC IDs.
+ */
+static int cpuid_to_apicid[] = {
+   [0 ... NR_CPUS - 1] = -1,
+};
+
+/*
+ * Should use this API to allocate logical CPU IDs to keep nr_logical_cpuids
+ * and cpuid_to_apicid[] synchronized.
+ */
+static int allocate_logical_cpuid(int apicid)
+{
+   int i;
+
+   /*
+* cpuid <-> apicid mapping is persistent, so when a cpu is up,
+* check if the kernel has allocated a cpuid for it.
+*/
+   for (i = 0; i < nr_logical_cpuids; i++) {
+   if (cpuid_to_apicid[i] == apicid)
+   return i;
+   }
+
+   /* Allocate a new cpuid. */
+   if (nr_logical_cpuids >= nr_cpu_ids) {
+   WARN_ONCE(1, "Only %d processors supported."
+"Processor %d/0x%x and the rest are ignored.\n",
+nr_cpu_ids - 1, nr_logical_cpuids, apicid);
+   return -1;
+   }
+
+   cpuid_to_apicid[nr_logical_cpuids] = apicid;
+   return nr_logical_cpuids++;
+}
+
+int __generic_processor_info(int apicid, int version, bool enabled)
 {
int cpu, max = nr_cpu_ids;
bool boot_cpu_detected = physid_isset(boot_cpu_physical_apicid,
@@ -2079,8 +2125,17 @@ static int __generic_processor_info(int apicid, int 
version, bool enabled)
 * for BSP.
 */
cpu = 0;
-   } else
-

[PATCH v6 2/5] x86, acpi, cpu-hotplug: Enable acpi to register all possible cpus at boot time.

2016-03-19 Thread Zhu Guihua
From: Gu Zheng <guz.f...@cn.fujitsu.com>

[Problem]

cpuid <-> nodeid mapping is firstly established at boot time. And workqueue 
caches
the mapping in wq_numa_possible_cpumask in wq_numa_init() at boot time.

When doing node online/offline, cpuid <-> nodeid mapping is 
established/destroyed,
which means, cpuid <-> nodeid mapping will change if node hotplug happens. But
workqueue does not update wq_numa_possible_cpumask.

So here is the problem:

Assume we have the following cpuid <-> nodeid in the beginning:

  Node | CPU

node 0 |  0-14, 60-74
node 1 | 15-29, 75-89
node 2 | 30-44, 90-104
node 3 | 45-59, 105-119

and we hot-remove node2 and node3, it becomes:

  Node | CPU

node 0 |  0-14, 60-74
node 1 | 15-29, 75-89

and we hot-add node4 and node5, it becomes:

  Node | CPU

node 0 |  0-14, 60-74
node 1 | 15-29, 75-89
node 4 | 30-59
node 5 | 90-119

But in wq_numa_possible_cpumask, cpu30 is still mapped to node2, and the like.

When a pool workqueue is initialized, if its cpumask belongs to a node, its
pool->node will be mapped to that node. And memory used by this workqueue will
also be allocated on that node.

static struct worker_pool *get_unbound_pool(const struct workqueue_attrs 
*attrs){
...
/* if cpumask is contained inside a NUMA node, we belong to that node */
if (wq_numa_enabled) {
for_each_node(node) {
if (cpumask_subset(pool->attrs->cpumask,
   wq_numa_possible_cpumask[node])) {
pool->node = node;
break;
}
}
}

Since wq_numa_possible_cpumask is not updated, it could be mapped to an offline 
node,
which will lead to memory allocation failure:

 SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
  cache: kmalloc-192, object size: 192, buffer size: 192, default order: 1, min 
order: 0
  node 0: slabs: 6172, objs: 259224, free: 245741
  node 1: slabs: 3261, objs: 136962, free: 127656

It happens here:

create_worker(struct worker_pool *pool)
 |--> worker = alloc_worker(pool->node);

static struct worker *alloc_worker(int node)
{
struct worker *worker;

worker = kzalloc_node(sizeof(*worker), GFP_KERNEL, node); --> Here, 
useing the wrong node.

..

return worker;
}

[Solution]

There are four mappings in the kernel:
1. nodeid (logical node id)   <->   pxm
2. apicid (physical cpu id)   <->   nodeid
3. cpuid (logical cpu id) <->   apicid
4. cpuid (logical cpu id) <->   nodeid

1. pxm (proximity domain) is provided by ACPI firmware in SRAT, and nodeid <-> 
pxm
   mapping is setup at boot time. This mapping is persistent, won't change.

2. apicid <-> nodeid mapping is setup using info in 1. The mapping is setup at 
boot
   time and CPU hotadd time, and cleared at CPU hotremove time. This mapping is 
also
   persistent.

3. cpuid <-> apicid mapping is setup at boot time and CPU hotadd time. cpuid is
   allocated, lower ids first, and released at CPU hotremove time, reused for 
other
   hotadded CPUs. So this mapping is not persistent.

4. cpuid <-> nodeid mapping is also setup at boot time and CPU hotadd time, and
   cleared at CPU hotremove time. As a result of 3, this mapping is not 
persistent.

To fix this problem, we establish cpuid <-> nodeid mapping for all the possible
cpus at boot time, and make it persistent. And according to init_cpu_to_node(),
cpuid <-> nodeid mapping is based on apicid <-> nodeid mapping and cpuid <-> 
apicid
mapping. So the key point is obtaining all cpus' apicid.

apicid can be obtained by _MAT (Multiple APIC Table Entry) method or found in
MADT (Multiple APIC Description Table). So we finish the job in the following 
steps:

1. Enable apic registeration flow to handle both enabled and disabled cpus.
   This is done by introducing an extra parameter to generic_processor_info to 
let the
   caller control if disabled cpus are ignored.

2. Introduce a new array storing all possible cpuid <-> apicid mapping. And 
also modify
   the way cpuid is calculated. Establish all possible cpuid <-> apicid mapping 
when
   registering local apic. Store the mapping in this array.

3. Enable _MAT and MADT relative apis to return non-presnet or disabled cpus' 
apicid.
   This is also done by introducing an extra parameter to these apis to let the 
caller
   control if disabled cpus are ignored.

4. Establish all possible cpuid <-> nodeid mapping.
   This is done via an additional acpi namespace walk for processors.

This patch finished step 1.

Signed-off-by: Gu Zheng <guz.f...@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangc...@cn.fujitsu.com>
Signed-off-by: Zhu Guihua <zhugh.f...@cn.fujitsu.com>
---
 arc

[PATCH v6 2/5] x86, acpi, cpu-hotplug: Enable acpi to register all possible cpus at boot time.

2016-03-19 Thread Zhu Guihua
From: Gu Zheng 

[Problem]

cpuid <-> nodeid mapping is firstly established at boot time. And workqueue 
caches
the mapping in wq_numa_possible_cpumask in wq_numa_init() at boot time.

When doing node online/offline, cpuid <-> nodeid mapping is 
established/destroyed,
which means, cpuid <-> nodeid mapping will change if node hotplug happens. But
workqueue does not update wq_numa_possible_cpumask.

So here is the problem:

Assume we have the following cpuid <-> nodeid in the beginning:

  Node | CPU

node 0 |  0-14, 60-74
node 1 | 15-29, 75-89
node 2 | 30-44, 90-104
node 3 | 45-59, 105-119

and we hot-remove node2 and node3, it becomes:

  Node | CPU

node 0 |  0-14, 60-74
node 1 | 15-29, 75-89

and we hot-add node4 and node5, it becomes:

  Node | CPU

node 0 |  0-14, 60-74
node 1 | 15-29, 75-89
node 4 | 30-59
node 5 | 90-119

But in wq_numa_possible_cpumask, cpu30 is still mapped to node2, and the like.

When a pool workqueue is initialized, if its cpumask belongs to a node, its
pool->node will be mapped to that node. And memory used by this workqueue will
also be allocated on that node.

static struct worker_pool *get_unbound_pool(const struct workqueue_attrs 
*attrs){
...
/* if cpumask is contained inside a NUMA node, we belong to that node */
if (wq_numa_enabled) {
for_each_node(node) {
if (cpumask_subset(pool->attrs->cpumask,
   wq_numa_possible_cpumask[node])) {
pool->node = node;
break;
}
}
}

Since wq_numa_possible_cpumask is not updated, it could be mapped to an offline 
node,
which will lead to memory allocation failure:

 SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
  cache: kmalloc-192, object size: 192, buffer size: 192, default order: 1, min 
order: 0
  node 0: slabs: 6172, objs: 259224, free: 245741
  node 1: slabs: 3261, objs: 136962, free: 127656

It happens here:

create_worker(struct worker_pool *pool)
 |--> worker = alloc_worker(pool->node);

static struct worker *alloc_worker(int node)
{
struct worker *worker;

worker = kzalloc_node(sizeof(*worker), GFP_KERNEL, node); --> Here, 
useing the wrong node.

..

return worker;
}

[Solution]

There are four mappings in the kernel:
1. nodeid (logical node id)   <->   pxm
2. apicid (physical cpu id)   <->   nodeid
3. cpuid (logical cpu id) <->   apicid
4. cpuid (logical cpu id) <->   nodeid

1. pxm (proximity domain) is provided by ACPI firmware in SRAT, and nodeid <-> 
pxm
   mapping is setup at boot time. This mapping is persistent, won't change.

2. apicid <-> nodeid mapping is setup using info in 1. The mapping is setup at 
boot
   time and CPU hotadd time, and cleared at CPU hotremove time. This mapping is 
also
   persistent.

3. cpuid <-> apicid mapping is setup at boot time and CPU hotadd time. cpuid is
   allocated, lower ids first, and released at CPU hotremove time, reused for 
other
   hotadded CPUs. So this mapping is not persistent.

4. cpuid <-> nodeid mapping is also setup at boot time and CPU hotadd time, and
   cleared at CPU hotremove time. As a result of 3, this mapping is not 
persistent.

To fix this problem, we establish cpuid <-> nodeid mapping for all the possible
cpus at boot time, and make it persistent. And according to init_cpu_to_node(),
cpuid <-> nodeid mapping is based on apicid <-> nodeid mapping and cpuid <-> 
apicid
mapping. So the key point is obtaining all cpus' apicid.

apicid can be obtained by _MAT (Multiple APIC Table Entry) method or found in
MADT (Multiple APIC Description Table). So we finish the job in the following 
steps:

1. Enable apic registeration flow to handle both enabled and disabled cpus.
   This is done by introducing an extra parameter to generic_processor_info to 
let the
   caller control if disabled cpus are ignored.

2. Introduce a new array storing all possible cpuid <-> apicid mapping. And 
also modify
   the way cpuid is calculated. Establish all possible cpuid <-> apicid mapping 
when
   registering local apic. Store the mapping in this array.

3. Enable _MAT and MADT relative apis to return non-presnet or disabled cpus' 
apicid.
   This is also done by introducing an extra parameter to these apis to let the 
caller
   control if disabled cpus are ignored.

4. Establish all possible cpuid <-> nodeid mapping.
   This is done via an additional acpi namespace walk for processors.

This patch finished step 1.

Signed-off-by: Gu Zheng 
Signed-off-by: Tang Chen 
Signed-off-by: Zhu Guihua 
---
 arch/x86/kernel/apic/apic.c | 26 +++---
 1 file changed, 19 insertions(+), 7 deletions(-)

diff --git a/arch/x86

[PATCH v6 4/5] x86, acpi, cpu-hotplug: Enable MADT APIs to return disabled apicid.

2016-03-19 Thread Zhu Guihua
From: Gu Zheng <guz.f...@cn.fujitsu.com>

The whole patch-set aims at making cpuid <-> nodeid mapping persistent. So that,
when node online/offline happens, cache based on cpuid <-> nodeid mapping such 
as
wq_numa_possible_cpumask will not cause any problem.
It contains 4 steps:
1. Enable apic registeration flow to handle both enabled and disabled cpus.
2. Introduce a new array storing all possible cpuid <-> apicid mapping.
3. Enable _MAT and MADT relative apis to return non-presnet or disabled cpus' 
apicid.
4. Establish all possible cpuid <-> nodeid mapping.

This patch finishes step 3.

There are four mappings in the kernel:
1. nodeid (logical node id)   <->   pxm(persistent)
2. apicid (physical cpu id)   <->   nodeid (persistent)
3. cpuid (logical cpu id) <->   apicid (not persistent, now persistent 
by step 2)
4. cpuid (logical cpu id) <->   nodeid (not persistent)

So, in order to setup persistent cpuid <-> nodeid mapping for all possible CPUs,
we should:
1. Setup cpuid <-> apicid mapping for all possible CPUs, which has been done in 
step 1, 2.
2. Setup cpuid <-> nodeid mapping for all possible CPUs. But before that, we 
should
   obtain all apicids from MADT.

All processors' apicids can be obtained by _MAT method or from MADT in ACPI.
The current code ignores disabled processors and returns -ENODEV.

After this patch, a new parameter will be added to MADT APIs so that caller
is able to control if disabled processors are ignored.

Signed-off-by: Gu Zheng <guz.f...@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangc...@cn.fujitsu.com>
Signed-off-by: Zhu Guihua <zhugh.f...@cn.fujitsu.com>
---
 drivers/acpi/acpi_processor.c |  5 +++-
 drivers/acpi/processor_core.c | 57 +++
 2 files changed, 40 insertions(+), 22 deletions(-)

diff --git a/drivers/acpi/acpi_processor.c b/drivers/acpi/acpi_processor.c
index b5e54f2..3ccdd8f 100644
--- a/drivers/acpi/acpi_processor.c
+++ b/drivers/acpi/acpi_processor.c
@@ -300,8 +300,11 @@ static int acpi_processor_get_info(struct acpi_device 
*device)
 *  Extra Processor objects may be enumerated on MP systems with
 *  less than the max # of CPUs. They should be ignored _iff
 *  they are physically not present.
+*
+*  NOTE: Even if the processor has a cpuid, it may not present because
+*  cpuid <-> apicid mapping is persistent now.
 */
-   if (invalid_logical_cpuid(pr->id)) {
+   if (invalid_logical_cpuid(pr->id) || !cpu_present(pr->id)) {
int ret = acpi_processor_hotadd_init(pr);
if (ret)
return ret;
diff --git a/drivers/acpi/processor_core.c b/drivers/acpi/processor_core.c
index 33a38d6..824b98b 100644
--- a/drivers/acpi/processor_core.c
+++ b/drivers/acpi/processor_core.c
@@ -32,12 +32,12 @@ static struct acpi_table_madt *get_madt_table(void)
 }
 
 static int map_lapic_id(struct acpi_subtable_header *entry,
-u32 acpi_id, phys_cpuid_t *apic_id)
+u32 acpi_id, phys_cpuid_t *apic_id, bool ignore_disabled)
 {
struct acpi_madt_local_apic *lapic =
container_of(entry, struct acpi_madt_local_apic, header);
 
-   if (!(lapic->lapic_flags & ACPI_MADT_ENABLED))
+   if (ignore_disabled && !(lapic->lapic_flags & ACPI_MADT_ENABLED))
return -ENODEV;
 
if (lapic->processor_id != acpi_id)
@@ -48,12 +48,13 @@ static int map_lapic_id(struct acpi_subtable_header *entry,
 }
 
 static int map_x2apic_id(struct acpi_subtable_header *entry,
-   int device_declaration, u32 acpi_id, phys_cpuid_t *apic_id)
+   int device_declaration, u32 acpi_id, phys_cpuid_t *apic_id,
+   bool ignore_disabled)
 {
struct acpi_madt_local_x2apic *apic =
container_of(entry, struct acpi_madt_local_x2apic, header);
 
-   if (!(apic->lapic_flags & ACPI_MADT_ENABLED))
+   if (ignore_disabled && !(apic->lapic_flags & ACPI_MADT_ENABLED))
return -ENODEV;
 
if (device_declaration && (apic->uid == acpi_id)) {
@@ -65,12 +66,13 @@ static int map_x2apic_id(struct acpi_subtable_header *entry,
 }
 
 static int map_lsapic_id(struct acpi_subtable_header *entry,
-   int device_declaration, u32 acpi_id, phys_cpuid_t *apic_id)
+   int device_declaration, u32 acpi_id, phys_cpuid_t *apic_id,
+   bool ignore_disabled)
 {
struct acpi_madt_local_sapic *lsapic =
container_of(entry, struct acpi_madt_local_sapic, header);
 
-   if (!(lsapic->lapic_flags & ACPI_MADT_ENABLED))
+   if (ignore_disabled && !(lsapic->lapic_flags & ACPI_MADT_ENABLED))
return -ENODEV;
 
if (device_declaration) {
@@ -87,12 +89,1

[PATCH v6 4/5] x86, acpi, cpu-hotplug: Enable MADT APIs to return disabled apicid.

2016-03-19 Thread Zhu Guihua
From: Gu Zheng 

The whole patch-set aims at making cpuid <-> nodeid mapping persistent. So that,
when node online/offline happens, cache based on cpuid <-> nodeid mapping such 
as
wq_numa_possible_cpumask will not cause any problem.
It contains 4 steps:
1. Enable apic registeration flow to handle both enabled and disabled cpus.
2. Introduce a new array storing all possible cpuid <-> apicid mapping.
3. Enable _MAT and MADT relative apis to return non-presnet or disabled cpus' 
apicid.
4. Establish all possible cpuid <-> nodeid mapping.

This patch finishes step 3.

There are four mappings in the kernel:
1. nodeid (logical node id)   <->   pxm(persistent)
2. apicid (physical cpu id)   <->   nodeid (persistent)
3. cpuid (logical cpu id) <->   apicid (not persistent, now persistent 
by step 2)
4. cpuid (logical cpu id) <->   nodeid (not persistent)

So, in order to setup persistent cpuid <-> nodeid mapping for all possible CPUs,
we should:
1. Setup cpuid <-> apicid mapping for all possible CPUs, which has been done in 
step 1, 2.
2. Setup cpuid <-> nodeid mapping for all possible CPUs. But before that, we 
should
   obtain all apicids from MADT.

All processors' apicids can be obtained by _MAT method or from MADT in ACPI.
The current code ignores disabled processors and returns -ENODEV.

After this patch, a new parameter will be added to MADT APIs so that caller
is able to control if disabled processors are ignored.

Signed-off-by: Gu Zheng 
Signed-off-by: Tang Chen 
Signed-off-by: Zhu Guihua 
---
 drivers/acpi/acpi_processor.c |  5 +++-
 drivers/acpi/processor_core.c | 57 +++
 2 files changed, 40 insertions(+), 22 deletions(-)

diff --git a/drivers/acpi/acpi_processor.c b/drivers/acpi/acpi_processor.c
index b5e54f2..3ccdd8f 100644
--- a/drivers/acpi/acpi_processor.c
+++ b/drivers/acpi/acpi_processor.c
@@ -300,8 +300,11 @@ static int acpi_processor_get_info(struct acpi_device 
*device)
 *  Extra Processor objects may be enumerated on MP systems with
 *  less than the max # of CPUs. They should be ignored _iff
 *  they are physically not present.
+*
+*  NOTE: Even if the processor has a cpuid, it may not present because
+*  cpuid <-> apicid mapping is persistent now.
 */
-   if (invalid_logical_cpuid(pr->id)) {
+   if (invalid_logical_cpuid(pr->id) || !cpu_present(pr->id)) {
int ret = acpi_processor_hotadd_init(pr);
if (ret)
return ret;
diff --git a/drivers/acpi/processor_core.c b/drivers/acpi/processor_core.c
index 33a38d6..824b98b 100644
--- a/drivers/acpi/processor_core.c
+++ b/drivers/acpi/processor_core.c
@@ -32,12 +32,12 @@ static struct acpi_table_madt *get_madt_table(void)
 }
 
 static int map_lapic_id(struct acpi_subtable_header *entry,
-u32 acpi_id, phys_cpuid_t *apic_id)
+u32 acpi_id, phys_cpuid_t *apic_id, bool ignore_disabled)
 {
struct acpi_madt_local_apic *lapic =
container_of(entry, struct acpi_madt_local_apic, header);
 
-   if (!(lapic->lapic_flags & ACPI_MADT_ENABLED))
+   if (ignore_disabled && !(lapic->lapic_flags & ACPI_MADT_ENABLED))
return -ENODEV;
 
if (lapic->processor_id != acpi_id)
@@ -48,12 +48,13 @@ static int map_lapic_id(struct acpi_subtable_header *entry,
 }
 
 static int map_x2apic_id(struct acpi_subtable_header *entry,
-   int device_declaration, u32 acpi_id, phys_cpuid_t *apic_id)
+   int device_declaration, u32 acpi_id, phys_cpuid_t *apic_id,
+   bool ignore_disabled)
 {
struct acpi_madt_local_x2apic *apic =
container_of(entry, struct acpi_madt_local_x2apic, header);
 
-   if (!(apic->lapic_flags & ACPI_MADT_ENABLED))
+   if (ignore_disabled && !(apic->lapic_flags & ACPI_MADT_ENABLED))
return -ENODEV;
 
if (device_declaration && (apic->uid == acpi_id)) {
@@ -65,12 +66,13 @@ static int map_x2apic_id(struct acpi_subtable_header *entry,
 }
 
 static int map_lsapic_id(struct acpi_subtable_header *entry,
-   int device_declaration, u32 acpi_id, phys_cpuid_t *apic_id)
+   int device_declaration, u32 acpi_id, phys_cpuid_t *apic_id,
+   bool ignore_disabled)
 {
struct acpi_madt_local_sapic *lsapic =
container_of(entry, struct acpi_madt_local_sapic, header);
 
-   if (!(lsapic->lapic_flags & ACPI_MADT_ENABLED))
+   if (ignore_disabled && !(lsapic->lapic_flags & ACPI_MADT_ENABLED))
return -ENODEV;
 
if (device_declaration) {
@@ -87,12 +89,13 @@ static int map_lsapic_id(struct acpi_subtable_header *entry,
  * Retrieve the ARM CPU physical identifier (MPIDR)

[PATCH v6 5/5] x86, acpi, cpu-hotplug: Set persistent cpuid <-> nodeid mapping when booting.

2016-03-18 Thread Zhu Guihua
From: Gu Zheng <guz.f...@cn.fujitsu.com>

The whole patch-set aims at making cpuid <-> nodeid mapping persistent. So that,
when node online/offline happens, cache based on cpuid <-> nodeid mapping such 
as
wq_numa_possible_cpumask will not cause any problem.
It contains 4 steps:
1. Enable apic registeration flow to handle both enabled and disabled cpus.
2. Introduce a new array storing all possible cpuid <-> apicid mapping.
3. Enable _MAT and MADT relative apis to return non-presnet or disabled cpus' 
apicid.
4. Establish all possible cpuid <-> nodeid mapping.

This patch finishes step 4.

This patch set the persistent cpuid <-> nodeid mapping for all enabled/disabled
processors at boot time via an additional acpi namespace walk for processors.

Signed-off-by: Gu Zheng <guz.f...@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangc...@cn.fujitsu.com>
Signed-off-by: Zhu Guihua <zhugh.f...@cn.fujitsu.com>
---
 arch/ia64/kernel/acpi.c   |  2 +-
 arch/x86/kernel/acpi/boot.c   |  2 +-
 drivers/acpi/bus.c|  3 ++
 drivers/acpi/processor_core.c | 65 +++
 include/linux/acpi.h  |  6 
 5 files changed, 76 insertions(+), 2 deletions(-)

diff --git a/arch/ia64/kernel/acpi.c b/arch/ia64/kernel/acpi.c
index b1698bc..7db5563 100644
--- a/arch/ia64/kernel/acpi.c
+++ b/arch/ia64/kernel/acpi.c
@@ -796,7 +796,7 @@ int acpi_isa_irq_to_gsi(unsigned isa_irq, u32 *gsi)
  *  ACPI based hotplug CPU support
  */
 #ifdef CONFIG_ACPI_HOTPLUG_CPU
-static int acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
+int acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
 {
 #ifdef CONFIG_ACPI_NUMA
/*
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 0ce06ee..7d45261 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -696,7 +696,7 @@ static void __init acpi_set_irq_model_ioapic(void)
 #ifdef CONFIG_ACPI_HOTPLUG_CPU
 #include 
 
-static void acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
+void acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
 {
 #ifdef CONFIG_ACPI_NUMA
int nid;
diff --git a/drivers/acpi/bus.c b/drivers/acpi/bus.c
index 0e85678..215177a 100644
--- a/drivers/acpi/bus.c
+++ b/drivers/acpi/bus.c
@@ -1110,6 +1110,9 @@ static int __init acpi_init(void)
acpi_sleep_proc_init();
acpi_wakeup_device_init();
acpi_debugger_init();
+#ifdef CONFIG_ACPI_HOTPLUG_CPU
+   acpi_set_processor_mapping();
+#endif
return 0;
 }
 
diff --git a/drivers/acpi/processor_core.c b/drivers/acpi/processor_core.c
index 824b98b..45580ff 100644
--- a/drivers/acpi/processor_core.c
+++ b/drivers/acpi/processor_core.c
@@ -261,6 +261,71 @@ int acpi_get_cpuid(acpi_handle handle, int type, u32 
acpi_id)
 }
 EXPORT_SYMBOL_GPL(acpi_get_cpuid);
 
+#ifdef CONFIG_ACPI_HOTPLUG_CPU
+static bool map_processor(acpi_handle handle, int *phys_id, int *cpuid)
+{
+   int type;
+   u32 acpi_id;
+   acpi_status status;
+   acpi_object_type acpi_type;
+   unsigned long long tmp;
+   union acpi_object object = { 0 };
+   struct acpi_buffer buffer = { sizeof(union acpi_object),  };
+
+   status = acpi_get_type(handle, _type);
+   if (ACPI_FAILURE(status))
+   return false;
+
+   switch (acpi_type) {
+   case ACPI_TYPE_PROCESSOR:
+   status = acpi_evaluate_object(handle, NULL, NULL, );
+   if (ACPI_FAILURE(status))
+   return false;
+   acpi_id = object.processor.proc_id;
+   break;
+   case ACPI_TYPE_DEVICE:
+   status = acpi_evaluate_integer(handle, "_UID", NULL, );
+   if (ACPI_FAILURE(status))
+   return false;
+   acpi_id = tmp;
+   break;
+   default:
+   return false;
+   }
+
+   type = (acpi_type == ACPI_TYPE_DEVICE) ? 1 : 0;
+
+   *phys_id = __acpi_get_phys_id(handle, type, acpi_id, false);
+   *cpuid = acpi_map_cpuid(*phys_id, acpi_id);
+   if (*cpuid == -1)
+   return false;
+
+   return true;
+}
+
+static acpi_status __init
+set_processor_node_mapping(acpi_handle handle, u32 lvl, void *context,
+  void **rv)
+{
+   u32 apic_id;
+   int cpu_id;
+
+   if (!map_processor(handle, _id, _id))
+   return AE_ERROR;
+
+   acpi_map_cpu2node(handle, cpu_id, apic_id);
+   return AE_OK;
+}
+
+void __init acpi_set_processor_mapping(void)
+{
+   /* Set persistent cpu <-> node mapping for all processors. */
+   acpi_walk_namespace(ACPI_TYPE_PROCESSOR, ACPI_ROOT_OBJECT,
+   ACPI_UINT32_MAX, set_processor_node_mapping,
+   NULL, NULL, NULL);
+}
+#endif
+
 #ifdef CONFIG_ACPI_HOTPLUG_IOAPIC
 static int get_ioapic_id(struct acpi_subtable_header *entry, u32 gsi_base,
   

[PATCH v6 5/5] x86, acpi, cpu-hotplug: Set persistent cpuid <-> nodeid mapping when booting.

2016-03-18 Thread Zhu Guihua
From: Gu Zheng 

The whole patch-set aims at making cpuid <-> nodeid mapping persistent. So that,
when node online/offline happens, cache based on cpuid <-> nodeid mapping such 
as
wq_numa_possible_cpumask will not cause any problem.
It contains 4 steps:
1. Enable apic registeration flow to handle both enabled and disabled cpus.
2. Introduce a new array storing all possible cpuid <-> apicid mapping.
3. Enable _MAT and MADT relative apis to return non-presnet or disabled cpus' 
apicid.
4. Establish all possible cpuid <-> nodeid mapping.

This patch finishes step 4.

This patch set the persistent cpuid <-> nodeid mapping for all enabled/disabled
processors at boot time via an additional acpi namespace walk for processors.

Signed-off-by: Gu Zheng 
Signed-off-by: Tang Chen 
Signed-off-by: Zhu Guihua 
---
 arch/ia64/kernel/acpi.c   |  2 +-
 arch/x86/kernel/acpi/boot.c   |  2 +-
 drivers/acpi/bus.c|  3 ++
 drivers/acpi/processor_core.c | 65 +++
 include/linux/acpi.h  |  6 
 5 files changed, 76 insertions(+), 2 deletions(-)

diff --git a/arch/ia64/kernel/acpi.c b/arch/ia64/kernel/acpi.c
index b1698bc..7db5563 100644
--- a/arch/ia64/kernel/acpi.c
+++ b/arch/ia64/kernel/acpi.c
@@ -796,7 +796,7 @@ int acpi_isa_irq_to_gsi(unsigned isa_irq, u32 *gsi)
  *  ACPI based hotplug CPU support
  */
 #ifdef CONFIG_ACPI_HOTPLUG_CPU
-static int acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
+int acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
 {
 #ifdef CONFIG_ACPI_NUMA
/*
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 0ce06ee..7d45261 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -696,7 +696,7 @@ static void __init acpi_set_irq_model_ioapic(void)
 #ifdef CONFIG_ACPI_HOTPLUG_CPU
 #include 
 
-static void acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
+void acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
 {
 #ifdef CONFIG_ACPI_NUMA
int nid;
diff --git a/drivers/acpi/bus.c b/drivers/acpi/bus.c
index 0e85678..215177a 100644
--- a/drivers/acpi/bus.c
+++ b/drivers/acpi/bus.c
@@ -1110,6 +1110,9 @@ static int __init acpi_init(void)
acpi_sleep_proc_init();
acpi_wakeup_device_init();
acpi_debugger_init();
+#ifdef CONFIG_ACPI_HOTPLUG_CPU
+   acpi_set_processor_mapping();
+#endif
return 0;
 }
 
diff --git a/drivers/acpi/processor_core.c b/drivers/acpi/processor_core.c
index 824b98b..45580ff 100644
--- a/drivers/acpi/processor_core.c
+++ b/drivers/acpi/processor_core.c
@@ -261,6 +261,71 @@ int acpi_get_cpuid(acpi_handle handle, int type, u32 
acpi_id)
 }
 EXPORT_SYMBOL_GPL(acpi_get_cpuid);
 
+#ifdef CONFIG_ACPI_HOTPLUG_CPU
+static bool map_processor(acpi_handle handle, int *phys_id, int *cpuid)
+{
+   int type;
+   u32 acpi_id;
+   acpi_status status;
+   acpi_object_type acpi_type;
+   unsigned long long tmp;
+   union acpi_object object = { 0 };
+   struct acpi_buffer buffer = { sizeof(union acpi_object),  };
+
+   status = acpi_get_type(handle, _type);
+   if (ACPI_FAILURE(status))
+   return false;
+
+   switch (acpi_type) {
+   case ACPI_TYPE_PROCESSOR:
+   status = acpi_evaluate_object(handle, NULL, NULL, );
+   if (ACPI_FAILURE(status))
+   return false;
+   acpi_id = object.processor.proc_id;
+   break;
+   case ACPI_TYPE_DEVICE:
+   status = acpi_evaluate_integer(handle, "_UID", NULL, );
+   if (ACPI_FAILURE(status))
+   return false;
+   acpi_id = tmp;
+   break;
+   default:
+   return false;
+   }
+
+   type = (acpi_type == ACPI_TYPE_DEVICE) ? 1 : 0;
+
+   *phys_id = __acpi_get_phys_id(handle, type, acpi_id, false);
+   *cpuid = acpi_map_cpuid(*phys_id, acpi_id);
+   if (*cpuid == -1)
+   return false;
+
+   return true;
+}
+
+static acpi_status __init
+set_processor_node_mapping(acpi_handle handle, u32 lvl, void *context,
+  void **rv)
+{
+   u32 apic_id;
+   int cpu_id;
+
+   if (!map_processor(handle, _id, _id))
+   return AE_ERROR;
+
+   acpi_map_cpu2node(handle, cpu_id, apic_id);
+   return AE_OK;
+}
+
+void __init acpi_set_processor_mapping(void)
+{
+   /* Set persistent cpu <-> node mapping for all processors. */
+   acpi_walk_namespace(ACPI_TYPE_PROCESSOR, ACPI_ROOT_OBJECT,
+   ACPI_UINT32_MAX, set_processor_node_mapping,
+   NULL, NULL, NULL);
+}
+#endif
+
 #ifdef CONFIG_ACPI_HOTPLUG_IOAPIC
 static int get_ioapic_id(struct acpi_subtable_header *entry, u32 gsi_base,
 u64 *phys_addr, int *ioapic_id)
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index 06ed7e5..a

Re: [RESEND PATCH v5 5/5] x86, acpi, cpu-hotplug: Set persistent cpuid <-> nodeid mapping when booting.

2016-03-02 Thread Zhu Guihua


On 03/03/2016 10:11 AM, kbuild test robot wrote:

Hi Gu,

[auto build test ERROR on tip/x86/core]
[also build test ERROR on v4.5-rc6 next-20160302]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improving the system]

url:
https://github.com/0day-ci/linux/commits/Zhu-Guihua/Make-cpuid-nodeid-mapping-persistent/20160303-094713
config: ia64-allyesconfig (attached as .config)
reproduce:
 wget 
https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross
 -O ~/bin/make.cross
 chmod +x ~/bin/make.cross
 # save the attached .config to linux build tree
 make.cross ARCH=ia64

All errors (new ones prefixed by >>):


arch/ia64/kernel/acpi.c:799:5: error: conflicting types for 'acpi_map_cpu2node'

 int acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
 ^
In file included from arch/ia64/kernel/acpi.c:43:0:
include/linux/acpi.h:268:6: note: previous declaration of 
'acpi_map_cpu2node' was here
 void acpi_map_cpu2node(acpi_handle handle, int cpu, int physid);
  ^

vim +/acpi_map_cpu2node +799 arch/ia64/kernel/acpi.c

793 }
794 
795 /*
796  *  ACPI based hotplug CPU support
797  */
798 #ifdef CONFIG_ACPI_HOTPLUG_CPU
  > 799  int acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
800 {
801 #ifdef CONFIG_ACPI_NUMA
802 /*

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


Thanks for your test. I will investigate this.

Thanks,
Zhu




Re: [RESEND PATCH v5 5/5] x86, acpi, cpu-hotplug: Set persistent cpuid <-> nodeid mapping when booting.

2016-03-02 Thread Zhu Guihua


On 03/03/2016 10:11 AM, kbuild test robot wrote:

Hi Gu,

[auto build test ERROR on tip/x86/core]
[also build test ERROR on v4.5-rc6 next-20160302]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improving the system]

url:
https://github.com/0day-ci/linux/commits/Zhu-Guihua/Make-cpuid-nodeid-mapping-persistent/20160303-094713
config: ia64-allyesconfig (attached as .config)
reproduce:
 wget 
https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross
 -O ~/bin/make.cross
 chmod +x ~/bin/make.cross
 # save the attached .config to linux build tree
 make.cross ARCH=ia64

All errors (new ones prefixed by >>):


arch/ia64/kernel/acpi.c:799:5: error: conflicting types for 'acpi_map_cpu2node'

 int acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
 ^
In file included from arch/ia64/kernel/acpi.c:43:0:
include/linux/acpi.h:268:6: note: previous declaration of 
'acpi_map_cpu2node' was here
 void acpi_map_cpu2node(acpi_handle handle, int cpu, int physid);
  ^

vim +/acpi_map_cpu2node +799 arch/ia64/kernel/acpi.c

793 }
794 
795 /*
796  *  ACPI based hotplug CPU support
797  */
798 #ifdef CONFIG_ACPI_HOTPLUG_CPU
  > 799  int acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
800 {
801 #ifdef CONFIG_ACPI_NUMA
802 /*

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


Thanks for your test. I will investigate this.

Thanks,
Zhu




Re: [RESEND PATCH v5 0/5] Make cpuid <-> nodeid mapping persistent

2016-03-02 Thread Zhu Guihua

Hi,

On 03/03/2016 10:08 AM, Rafael J. Wysocki wrote:

Hi,

On Thu, Mar 3, 2016 at 2:42 AM, Zhu Guihua <zhugh.f...@cn.fujitsu.com> wrote:

[Problem]

cpuid <-> nodeid mapping is firstly established at boot time. And workqueue 
caches
the mapping in wq_numa_possible_cpumask in wq_numa_init() at boot time.

When doing node online/offline, cpuid <-> nodeid mapping is 
established/destroyed,
which means, cpuid <-> nodeid mapping will change if node hotplug happens. But
workqueue does not update wq_numa_possible_cpumask.


Are there any changes in this version relative to the previous one?

No, there are no changes in this version.

Thanks,
Zhu



Thanks,
Rafael


.







Re: [RESEND PATCH v5 0/5] Make cpuid <-> nodeid mapping persistent

2016-03-02 Thread Zhu Guihua

Hi,

On 03/03/2016 10:08 AM, Rafael J. Wysocki wrote:

Hi,

On Thu, Mar 3, 2016 at 2:42 AM, Zhu Guihua  wrote:

[Problem]

cpuid <-> nodeid mapping is firstly established at boot time. And workqueue 
caches
the mapping in wq_numa_possible_cpumask in wq_numa_init() at boot time.

When doing node online/offline, cpuid <-> nodeid mapping is 
established/destroyed,
which means, cpuid <-> nodeid mapping will change if node hotplug happens. But
workqueue does not update wq_numa_possible_cpumask.


Are there any changes in this version relative to the previous one?

No, there are no changes in this version.

Thanks,
Zhu



Thanks,
Rafael


.







Re: [RESEND PATCH v5 0/5] Make cpuid <-> nodeid mapping persistent

2016-03-02 Thread Zhu Guihua

Hi Thomas, Ingo, hpa,

Would you please help to review the X86 part of this patch-set ?

Thanks,
Zhu

On 03/03/2016 09:42 AM, Zhu Guihua wrote:

[Problem]

cpuid <-> nodeid mapping is firstly established at boot time. And workqueue 
caches
the mapping in wq_numa_possible_cpumask in wq_numa_init() at boot time.

When doing node online/offline, cpuid <-> nodeid mapping is 
established/destroyed,
which means, cpuid <-> nodeid mapping will change if node hotplug happens. But
workqueue does not update wq_numa_possible_cpumask.

So here is the problem:

Assume we have the following cpuid <-> nodeid in the beginning:

   Node | CPU

node 0 |  0-14, 60-74
node 1 | 15-29, 75-89
node 2 | 30-44, 90-104
node 3 | 45-59, 105-119

and we hot-remove node2 and node3, it becomes:

   Node | CPU

node 0 |  0-14, 60-74
node 1 | 15-29, 75-89

and we hot-add node4 and node5, it becomes:

   Node | CPU

node 0 |  0-14, 60-74
node 1 | 15-29, 75-89
node 4 | 30-59
node 5 | 90-119

But in wq_numa_possible_cpumask, cpu30 is still mapped to node2, and the like.

When a pool workqueue is initialized, if its cpumask belongs to a node, its
pool->node will be mapped to that node. And memory used by this workqueue will
also be allocated on that node.

static struct worker_pool *get_unbound_pool(const struct workqueue_attrs 
*attrs){
...
 /* if cpumask is contained inside a NUMA node, we belong to that node 
*/
 if (wq_numa_enabled) {
 for_each_node(node) {
 if (cpumask_subset(pool->attrs->cpumask,
wq_numa_possible_cpumask[node])) {
 pool->node = node;
 break;
 }
 }
 }

Since wq_numa_possible_cpumask is not updated, it could be mapped to an offline 
node,
which will lead to memory allocation failure:

  SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
   cache: kmalloc-192, object size: 192, buffer size: 192, default order: 1, 
min order: 0
   node 0: slabs: 6172, objs: 259224, free: 245741
   node 1: slabs: 3261, objs: 136962, free: 127656

It happens here:

create_worker(struct worker_pool *pool)
  |--> worker = alloc_worker(pool->node);

static struct worker *alloc_worker(int node)
{
 struct worker *worker;

 worker = kzalloc_node(sizeof(*worker), GFP_KERNEL, node); --> Here, 
useing the wrong node.

 ..

 return worker;
}


[Solution]

There are four mappings in the kernel:
1. nodeid (logical node id)   <->   pxm
2. apicid (physical cpu id)   <->   nodeid
3. cpuid (logical cpu id) <->   apicid
4. cpuid (logical cpu id) <->   nodeid

1. pxm (proximity domain) is provided by ACPI firmware in SRAT, and nodeid <-> 
pxm
mapping is setup at boot time. This mapping is persistent, won't change.

2. apicid <-> nodeid mapping is setup using info in 1. The mapping is setup at 
boot
time and CPU hotadd time, and cleared at CPU hotremove time. This mapping 
is also
persistent.

3. cpuid <-> apicid mapping is setup at boot time and CPU hotadd time. cpuid is
allocated, lower ids first, and released at CPU hotremove time, reused for 
other
hotadded CPUs. So this mapping is not persistent.

4. cpuid <-> nodeid mapping is also setup at boot time and CPU hotadd time, and
cleared at CPU hotremove time. As a result of 3, this mapping is not 
persistent.

To fix this problem, we establish cpuid <-> nodeid mapping for all the possible
cpus at boot time, and make it persistent. And according to init_cpu_to_node(),
cpuid <-> nodeid mapping is based on apicid <-> nodeid mapping and cpuid <-> 
apicid
mapping. So the key point is obtaining all cpus' apicid.

apicid can be obtained by _MAT (Multiple APIC Table Entry) method or found in
MADT (Multiple APIC Description Table). So we finish the job in the following 
steps:

1. Enable apic registeration flow to handle both enabled and disabled cpus.
This is done by introducing an extra parameter to generic_processor_info to 
let the
caller control if disabled cpus are ignored.

2. Introduce a new array storing all possible cpuid <-> apicid mapping. And 
also modify
the way cpuid is calculated. Establish all possible cpuid <-> apicid 
mapping when
registering local apic. Store the mapping in this array.

3. Enable _MAT and MADT relative apis to return non-presnet or disabled cpus' 
apicid.
This is also done by introducing an extra parameter to these apis to let 
the caller
control if disabled cpus are ignored.

4. Establish all possible cpuid <-> nodeid mapping.
This is done via an additional acpi namespace walk for processors.


For previous discussion, please refer to:
https://lkml.org/lkml/2015/2/27/145

Re: [RESEND PATCH v5 0/5] Make cpuid <-> nodeid mapping persistent

2016-03-02 Thread Zhu Guihua

Hi Thomas, Ingo, hpa,

Would you please help to review the X86 part of this patch-set ?

Thanks,
Zhu

On 03/03/2016 09:42 AM, Zhu Guihua wrote:

[Problem]

cpuid <-> nodeid mapping is firstly established at boot time. And workqueue 
caches
the mapping in wq_numa_possible_cpumask in wq_numa_init() at boot time.

When doing node online/offline, cpuid <-> nodeid mapping is 
established/destroyed,
which means, cpuid <-> nodeid mapping will change if node hotplug happens. But
workqueue does not update wq_numa_possible_cpumask.

So here is the problem:

Assume we have the following cpuid <-> nodeid in the beginning:

   Node | CPU

node 0 |  0-14, 60-74
node 1 | 15-29, 75-89
node 2 | 30-44, 90-104
node 3 | 45-59, 105-119

and we hot-remove node2 and node3, it becomes:

   Node | CPU

node 0 |  0-14, 60-74
node 1 | 15-29, 75-89

and we hot-add node4 and node5, it becomes:

   Node | CPU

node 0 |  0-14, 60-74
node 1 | 15-29, 75-89
node 4 | 30-59
node 5 | 90-119

But in wq_numa_possible_cpumask, cpu30 is still mapped to node2, and the like.

When a pool workqueue is initialized, if its cpumask belongs to a node, its
pool->node will be mapped to that node. And memory used by this workqueue will
also be allocated on that node.

static struct worker_pool *get_unbound_pool(const struct workqueue_attrs 
*attrs){
...
 /* if cpumask is contained inside a NUMA node, we belong to that node 
*/
 if (wq_numa_enabled) {
 for_each_node(node) {
 if (cpumask_subset(pool->attrs->cpumask,
wq_numa_possible_cpumask[node])) {
 pool->node = node;
 break;
 }
 }
 }

Since wq_numa_possible_cpumask is not updated, it could be mapped to an offline 
node,
which will lead to memory allocation failure:

  SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
   cache: kmalloc-192, object size: 192, buffer size: 192, default order: 1, 
min order: 0
   node 0: slabs: 6172, objs: 259224, free: 245741
   node 1: slabs: 3261, objs: 136962, free: 127656

It happens here:

create_worker(struct worker_pool *pool)
  |--> worker = alloc_worker(pool->node);

static struct worker *alloc_worker(int node)
{
 struct worker *worker;

 worker = kzalloc_node(sizeof(*worker), GFP_KERNEL, node); --> Here, 
useing the wrong node.

 ..

 return worker;
}


[Solution]

There are four mappings in the kernel:
1. nodeid (logical node id)   <->   pxm
2. apicid (physical cpu id)   <->   nodeid
3. cpuid (logical cpu id) <->   apicid
4. cpuid (logical cpu id) <->   nodeid

1. pxm (proximity domain) is provided by ACPI firmware in SRAT, and nodeid <-> 
pxm
mapping is setup at boot time. This mapping is persistent, won't change.

2. apicid <-> nodeid mapping is setup using info in 1. The mapping is setup at 
boot
time and CPU hotadd time, and cleared at CPU hotremove time. This mapping 
is also
persistent.

3. cpuid <-> apicid mapping is setup at boot time and CPU hotadd time. cpuid is
allocated, lower ids first, and released at CPU hotremove time, reused for 
other
hotadded CPUs. So this mapping is not persistent.

4. cpuid <-> nodeid mapping is also setup at boot time and CPU hotadd time, and
cleared at CPU hotremove time. As a result of 3, this mapping is not 
persistent.

To fix this problem, we establish cpuid <-> nodeid mapping for all the possible
cpus at boot time, and make it persistent. And according to init_cpu_to_node(),
cpuid <-> nodeid mapping is based on apicid <-> nodeid mapping and cpuid <-> 
apicid
mapping. So the key point is obtaining all cpus' apicid.

apicid can be obtained by _MAT (Multiple APIC Table Entry) method or found in
MADT (Multiple APIC Description Table). So we finish the job in the following 
steps:

1. Enable apic registeration flow to handle both enabled and disabled cpus.
This is done by introducing an extra parameter to generic_processor_info to 
let the
caller control if disabled cpus are ignored.

2. Introduce a new array storing all possible cpuid <-> apicid mapping. And 
also modify
the way cpuid is calculated. Establish all possible cpuid <-> apicid 
mapping when
registering local apic. Store the mapping in this array.

3. Enable _MAT and MADT relative apis to return non-presnet or disabled cpus' 
apicid.
This is also done by introducing an extra parameter to these apis to let 
the caller
control if disabled cpus are ignored.

4. Establish all possible cpuid <-> nodeid mapping.
This is done via an additional acpi namespace walk for processors.


For previous discussion, please refer to:
https://lkml.org/lkml/2015/2/27/145

[RESEND PATCH v5 1/5] x86, memhp, numa: Online memory-less nodes at boot time.

2016-03-02 Thread Zhu Guihua
From: Tang Chen <tangc...@cn.fujitsu.com>

For now, x86 does not support memory-less node. A node without memory
will not be onlined, and the cpus on it will be mapped to the other
online nodes with memory in init_cpu_to_node(). The reason of doing this
is to ensure each cpu has mapped to a node with memory, so that it will
be able to allocate local memory for that cpu.

But we don't have to do it in this way.

In this series of patches, we are going to construct cpu <-> node mapping
for all possible cpus at boot time, which is a 1-1 mapping. It means the
cpu will be mapped to the node it belongs to, and will never be changed.
If a node has only cpus but no memory, the cpus on it will be mapped to
a memory-less node. And the memory-less node should be onlined.

This patch allocate pgdats for all memory-less nodes and online them at
boot time. Then build zonelists for these nodes. As a result, when cpus
on these memory-less nodes try to allocate memory from local node, it
will automatically fall back to the proper zones in the zonelists.

Signed-off-by: Zhu Guihua <zhugh.f...@cn.fujitsu.com>
---
 arch/x86/mm/numa.c | 27 +--
 1 file changed, 13 insertions(+), 14 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index d04f809..baf3d72 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -704,22 +704,19 @@ void __init x86_numa_init(void)
numa_init(dummy_numa_init);
 }
 
-static __init int find_near_online_node(int node)
+static void __init init_memory_less_node(int nid)
 {
-   int n, val;
-   int min_val = INT_MAX;
-   int best_node = -1;
+   unsigned long zones_size[MAX_NR_ZONES] = {0};
+   unsigned long zholes_size[MAX_NR_ZONES] = {0};
 
-   for_each_online_node(n) {
-   val = node_distance(node, n);
+   /* Allocate and initialize node data. Memory-less node is now online.*/
+   alloc_node_data(nid);
+   free_area_init_node(nid, zones_size, 0, zholes_size);
 
-   if (val < min_val) {
-   min_val = val;
-   best_node = n;
-   }
-   }
-
-   return best_node;
+   /*
+* All zonelists will be built later in start_kernel() after per cpu
+* areas are initialized.
+*/
 }
 
 /*
@@ -748,8 +745,10 @@ void __init init_cpu_to_node(void)
 
if (node == NUMA_NO_NODE)
continue;
+
if (!node_online(node))
-   node = find_near_online_node(node);
+   init_memory_less_node(node);
+
numa_set_node(cpu, node);
}
 }
-- 
1.9.3





[RESEND PATCH v5 0/5] Make cpuid <-> nodeid mapping persistent

2016-03-02 Thread Zhu Guihua
[Problem]

cpuid <-> nodeid mapping is firstly established at boot time. And workqueue 
caches
the mapping in wq_numa_possible_cpumask in wq_numa_init() at boot time.

When doing node online/offline, cpuid <-> nodeid mapping is 
established/destroyed,
which means, cpuid <-> nodeid mapping will change if node hotplug happens. But
workqueue does not update wq_numa_possible_cpumask.

So here is the problem:

Assume we have the following cpuid <-> nodeid in the beginning:

  Node | CPU

node 0 |  0-14, 60-74
node 1 | 15-29, 75-89
node 2 | 30-44, 90-104
node 3 | 45-59, 105-119

and we hot-remove node2 and node3, it becomes:

  Node | CPU

node 0 |  0-14, 60-74
node 1 | 15-29, 75-89

and we hot-add node4 and node5, it becomes:

  Node | CPU

node 0 |  0-14, 60-74
node 1 | 15-29, 75-89
node 4 | 30-59
node 5 | 90-119

But in wq_numa_possible_cpumask, cpu30 is still mapped to node2, and the like.

When a pool workqueue is initialized, if its cpumask belongs to a node, its
pool->node will be mapped to that node. And memory used by this workqueue will
also be allocated on that node.

static struct worker_pool *get_unbound_pool(const struct workqueue_attrs 
*attrs){
...
/* if cpumask is contained inside a NUMA node, we belong to that node */
if (wq_numa_enabled) {
for_each_node(node) {
if (cpumask_subset(pool->attrs->cpumask,
   wq_numa_possible_cpumask[node])) {
pool->node = node;
break;
}
}
}

Since wq_numa_possible_cpumask is not updated, it could be mapped to an offline 
node,
which will lead to memory allocation failure:

 SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
  cache: kmalloc-192, object size: 192, buffer size: 192, default order: 1, min 
order: 0
  node 0: slabs: 6172, objs: 259224, free: 245741
  node 1: slabs: 3261, objs: 136962, free: 127656

It happens here:

create_worker(struct worker_pool *pool)
 |--> worker = alloc_worker(pool->node);

static struct worker *alloc_worker(int node)
{
struct worker *worker;

worker = kzalloc_node(sizeof(*worker), GFP_KERNEL, node); --> Here, 
useing the wrong node.

..

return worker;
}


[Solution]

There are four mappings in the kernel:
1. nodeid (logical node id)   <->   pxm
2. apicid (physical cpu id)   <->   nodeid
3. cpuid (logical cpu id) <->   apicid
4. cpuid (logical cpu id) <->   nodeid

1. pxm (proximity domain) is provided by ACPI firmware in SRAT, and nodeid <-> 
pxm
   mapping is setup at boot time. This mapping is persistent, won't change.

2. apicid <-> nodeid mapping is setup using info in 1. The mapping is setup at 
boot
   time and CPU hotadd time, and cleared at CPU hotremove time. This mapping is 
also
   persistent.

3. cpuid <-> apicid mapping is setup at boot time and CPU hotadd time. cpuid is
   allocated, lower ids first, and released at CPU hotremove time, reused for 
other
   hotadded CPUs. So this mapping is not persistent.

4. cpuid <-> nodeid mapping is also setup at boot time and CPU hotadd time, and
   cleared at CPU hotremove time. As a result of 3, this mapping is not 
persistent.

To fix this problem, we establish cpuid <-> nodeid mapping for all the possible
cpus at boot time, and make it persistent. And according to init_cpu_to_node(),
cpuid <-> nodeid mapping is based on apicid <-> nodeid mapping and cpuid <-> 
apicid
mapping. So the key point is obtaining all cpus' apicid.

apicid can be obtained by _MAT (Multiple APIC Table Entry) method or found in
MADT (Multiple APIC Description Table). So we finish the job in the following 
steps:

1. Enable apic registeration flow to handle both enabled and disabled cpus.
   This is done by introducing an extra parameter to generic_processor_info to 
let the
   caller control if disabled cpus are ignored.

2. Introduce a new array storing all possible cpuid <-> apicid mapping. And 
also modify
   the way cpuid is calculated. Establish all possible cpuid <-> apicid mapping 
when
   registering local apic. Store the mapping in this array.

3. Enable _MAT and MADT relative apis to return non-presnet or disabled cpus' 
apicid.
   This is also done by introducing an extra parameter to these apis to let the 
caller
   control if disabled cpus are ignored.

4. Establish all possible cpuid <-> nodeid mapping.
   This is done via an additional acpi namespace walk for processors.


For previous discussion, please refer to:
https://lkml.org/lkml/2015/2/27/145
https://lkml.org/lkml/2015/3/25/989
https://lkml.org/lkml/2015/5/14/244
https://lkml.org/lkml/2015/7/7/200
https://lkml.org/lkml/2015/9/27/209

Change log v4 -> v5:
1. Remove useless code in patch 1.
2. Small improvement of commit message.

Change log v3 -> v4:
1. Fix the kernel panic at boot time. The cause is that 

[RESEND PATCH v5 1/5] x86, memhp, numa: Online memory-less nodes at boot time.

2016-03-02 Thread Zhu Guihua
From: Tang Chen 

For now, x86 does not support memory-less node. A node without memory
will not be onlined, and the cpus on it will be mapped to the other
online nodes with memory in init_cpu_to_node(). The reason of doing this
is to ensure each cpu has mapped to a node with memory, so that it will
be able to allocate local memory for that cpu.

But we don't have to do it in this way.

In this series of patches, we are going to construct cpu <-> node mapping
for all possible cpus at boot time, which is a 1-1 mapping. It means the
cpu will be mapped to the node it belongs to, and will never be changed.
If a node has only cpus but no memory, the cpus on it will be mapped to
a memory-less node. And the memory-less node should be onlined.

This patch allocate pgdats for all memory-less nodes and online them at
boot time. Then build zonelists for these nodes. As a result, when cpus
on these memory-less nodes try to allocate memory from local node, it
will automatically fall back to the proper zones in the zonelists.

Signed-off-by: Zhu Guihua 
---
 arch/x86/mm/numa.c | 27 +--
 1 file changed, 13 insertions(+), 14 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index d04f809..baf3d72 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -704,22 +704,19 @@ void __init x86_numa_init(void)
numa_init(dummy_numa_init);
 }
 
-static __init int find_near_online_node(int node)
+static void __init init_memory_less_node(int nid)
 {
-   int n, val;
-   int min_val = INT_MAX;
-   int best_node = -1;
+   unsigned long zones_size[MAX_NR_ZONES] = {0};
+   unsigned long zholes_size[MAX_NR_ZONES] = {0};
 
-   for_each_online_node(n) {
-   val = node_distance(node, n);
+   /* Allocate and initialize node data. Memory-less node is now online.*/
+   alloc_node_data(nid);
+   free_area_init_node(nid, zones_size, 0, zholes_size);
 
-   if (val < min_val) {
-   min_val = val;
-   best_node = n;
-   }
-   }
-
-   return best_node;
+   /*
+* All zonelists will be built later in start_kernel() after per cpu
+* areas are initialized.
+*/
 }
 
 /*
@@ -748,8 +745,10 @@ void __init init_cpu_to_node(void)
 
if (node == NUMA_NO_NODE)
continue;
+
if (!node_online(node))
-   node = find_near_online_node(node);
+   init_memory_less_node(node);
+
numa_set_node(cpu, node);
}
 }
-- 
1.9.3





[RESEND PATCH v5 0/5] Make cpuid <-> nodeid mapping persistent

2016-03-02 Thread Zhu Guihua
[Problem]

cpuid <-> nodeid mapping is firstly established at boot time. And workqueue 
caches
the mapping in wq_numa_possible_cpumask in wq_numa_init() at boot time.

When doing node online/offline, cpuid <-> nodeid mapping is 
established/destroyed,
which means, cpuid <-> nodeid mapping will change if node hotplug happens. But
workqueue does not update wq_numa_possible_cpumask.

So here is the problem:

Assume we have the following cpuid <-> nodeid in the beginning:

  Node | CPU

node 0 |  0-14, 60-74
node 1 | 15-29, 75-89
node 2 | 30-44, 90-104
node 3 | 45-59, 105-119

and we hot-remove node2 and node3, it becomes:

  Node | CPU

node 0 |  0-14, 60-74
node 1 | 15-29, 75-89

and we hot-add node4 and node5, it becomes:

  Node | CPU

node 0 |  0-14, 60-74
node 1 | 15-29, 75-89
node 4 | 30-59
node 5 | 90-119

But in wq_numa_possible_cpumask, cpu30 is still mapped to node2, and the like.

When a pool workqueue is initialized, if its cpumask belongs to a node, its
pool->node will be mapped to that node. And memory used by this workqueue will
also be allocated on that node.

static struct worker_pool *get_unbound_pool(const struct workqueue_attrs 
*attrs){
...
/* if cpumask is contained inside a NUMA node, we belong to that node */
if (wq_numa_enabled) {
for_each_node(node) {
if (cpumask_subset(pool->attrs->cpumask,
   wq_numa_possible_cpumask[node])) {
pool->node = node;
break;
}
}
}

Since wq_numa_possible_cpumask is not updated, it could be mapped to an offline 
node,
which will lead to memory allocation failure:

 SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
  cache: kmalloc-192, object size: 192, buffer size: 192, default order: 1, min 
order: 0
  node 0: slabs: 6172, objs: 259224, free: 245741
  node 1: slabs: 3261, objs: 136962, free: 127656

It happens here:

create_worker(struct worker_pool *pool)
 |--> worker = alloc_worker(pool->node);

static struct worker *alloc_worker(int node)
{
struct worker *worker;

worker = kzalloc_node(sizeof(*worker), GFP_KERNEL, node); --> Here, 
useing the wrong node.

..

return worker;
}


[Solution]

There are four mappings in the kernel:
1. nodeid (logical node id)   <->   pxm
2. apicid (physical cpu id)   <->   nodeid
3. cpuid (logical cpu id) <->   apicid
4. cpuid (logical cpu id) <->   nodeid

1. pxm (proximity domain) is provided by ACPI firmware in SRAT, and nodeid <-> 
pxm
   mapping is setup at boot time. This mapping is persistent, won't change.

2. apicid <-> nodeid mapping is setup using info in 1. The mapping is setup at 
boot
   time and CPU hotadd time, and cleared at CPU hotremove time. This mapping is 
also
   persistent.

3. cpuid <-> apicid mapping is setup at boot time and CPU hotadd time. cpuid is
   allocated, lower ids first, and released at CPU hotremove time, reused for 
other
   hotadded CPUs. So this mapping is not persistent.

4. cpuid <-> nodeid mapping is also setup at boot time and CPU hotadd time, and
   cleared at CPU hotremove time. As a result of 3, this mapping is not 
persistent.

To fix this problem, we establish cpuid <-> nodeid mapping for all the possible
cpus at boot time, and make it persistent. And according to init_cpu_to_node(),
cpuid <-> nodeid mapping is based on apicid <-> nodeid mapping and cpuid <-> 
apicid
mapping. So the key point is obtaining all cpus' apicid.

apicid can be obtained by _MAT (Multiple APIC Table Entry) method or found in
MADT (Multiple APIC Description Table). So we finish the job in the following 
steps:

1. Enable apic registeration flow to handle both enabled and disabled cpus.
   This is done by introducing an extra parameter to generic_processor_info to 
let the
   caller control if disabled cpus are ignored.

2. Introduce a new array storing all possible cpuid <-> apicid mapping. And 
also modify
   the way cpuid is calculated. Establish all possible cpuid <-> apicid mapping 
when
   registering local apic. Store the mapping in this array.

3. Enable _MAT and MADT relative apis to return non-presnet or disabled cpus' 
apicid.
   This is also done by introducing an extra parameter to these apis to let the 
caller
   control if disabled cpus are ignored.

4. Establish all possible cpuid <-> nodeid mapping.
   This is done via an additional acpi namespace walk for processors.


For previous discussion, please refer to:
https://lkml.org/lkml/2015/2/27/145
https://lkml.org/lkml/2015/3/25/989
https://lkml.org/lkml/2015/5/14/244
https://lkml.org/lkml/2015/7/7/200
https://lkml.org/lkml/2015/9/27/209

Change log v4 -> v5:
1. Remove useless code in patch 1.
2. Small improvement of commit message.

Change log v3 -> v4:
1. Fix the kernel panic at boot time. The cause is that 

[RESEND PATCH v5 2/5] x86, acpi, cpu-hotplug: Enable acpi to register all possible cpus at boot time.

2016-03-02 Thread Zhu Guihua
From: Gu Zheng <guz.f...@cn.fujitsu.com>

[Problem]

cpuid <-> nodeid mapping is firstly established at boot time. And workqueue 
caches
the mapping in wq_numa_possible_cpumask in wq_numa_init() at boot time.

When doing node online/offline, cpuid <-> nodeid mapping is 
established/destroyed,
which means, cpuid <-> nodeid mapping will change if node hotplug happens. But
workqueue does not update wq_numa_possible_cpumask.

So here is the problem:

Assume we have the following cpuid <-> nodeid in the beginning:

  Node | CPU

node 0 |  0-14, 60-74
node 1 | 15-29, 75-89
node 2 | 30-44, 90-104
node 3 | 45-59, 105-119

and we hot-remove node2 and node3, it becomes:

  Node | CPU

node 0 |  0-14, 60-74
node 1 | 15-29, 75-89

and we hot-add node4 and node5, it becomes:

  Node | CPU

node 0 |  0-14, 60-74
node 1 | 15-29, 75-89
node 4 | 30-59
node 5 | 90-119

But in wq_numa_possible_cpumask, cpu30 is still mapped to node2, and the like.

When a pool workqueue is initialized, if its cpumask belongs to a node, its
pool->node will be mapped to that node. And memory used by this workqueue will
also be allocated on that node.

static struct worker_pool *get_unbound_pool(const struct workqueue_attrs 
*attrs){
...
/* if cpumask is contained inside a NUMA node, we belong to that node */
if (wq_numa_enabled) {
for_each_node(node) {
if (cpumask_subset(pool->attrs->cpumask,
   wq_numa_possible_cpumask[node])) {
pool->node = node;
break;
}
}
}

Since wq_numa_possible_cpumask is not updated, it could be mapped to an offline 
node,
which will lead to memory allocation failure:

 SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
  cache: kmalloc-192, object size: 192, buffer size: 192, default order: 1, min 
order: 0
  node 0: slabs: 6172, objs: 259224, free: 245741
  node 1: slabs: 3261, objs: 136962, free: 127656

It happens here:

create_worker(struct worker_pool *pool)
 |--> worker = alloc_worker(pool->node);

static struct worker *alloc_worker(int node)
{
struct worker *worker;

worker = kzalloc_node(sizeof(*worker), GFP_KERNEL, node); --> Here, 
useing the wrong node.

..

return worker;
}

[Solution]

There are four mappings in the kernel:
1. nodeid (logical node id)   <->   pxm
2. apicid (physical cpu id)   <->   nodeid
3. cpuid (logical cpu id) <->   apicid
4. cpuid (logical cpu id) <->   nodeid

1. pxm (proximity domain) is provided by ACPI firmware in SRAT, and nodeid <-> 
pxm
   mapping is setup at boot time. This mapping is persistent, won't change.

2. apicid <-> nodeid mapping is setup using info in 1. The mapping is setup at 
boot
   time and CPU hotadd time, and cleared at CPU hotremove time. This mapping is 
also
   persistent.

3. cpuid <-> apicid mapping is setup at boot time and CPU hotadd time. cpuid is
   allocated, lower ids first, and released at CPU hotremove time, reused for 
other
   hotadded CPUs. So this mapping is not persistent.

4. cpuid <-> nodeid mapping is also setup at boot time and CPU hotadd time, and
   cleared at CPU hotremove time. As a result of 3, this mapping is not 
persistent.

To fix this problem, we establish cpuid <-> nodeid mapping for all the possible
cpus at boot time, and make it persistent. And according to init_cpu_to_node(),
cpuid <-> nodeid mapping is based on apicid <-> nodeid mapping and cpuid <-> 
apicid
mapping. So the key point is obtaining all cpus' apicid.

apicid can be obtained by _MAT (Multiple APIC Table Entry) method or found in
MADT (Multiple APIC Description Table). So we finish the job in the following 
steps:

1. Enable apic registeration flow to handle both enabled and disabled cpus.
   This is done by introducing an extra parameter to generic_processor_info to 
let the
   caller control if disabled cpus are ignored.

2. Introduce a new array storing all possible cpuid <-> apicid mapping. And 
also modify
   the way cpuid is calculated. Establish all possible cpuid <-> apicid mapping 
when
   registering local apic. Store the mapping in this array.

3. Enable _MAT and MADT relative apis to return non-presnet or disabled cpus' 
apicid.
   This is also done by introducing an extra parameter to these apis to let the 
caller
   control if disabled cpus are ignored.

4. Establish all possible cpuid <-> nodeid mapping.
   This is done via an additional acpi namespace walk for processors.

This patch finished step 1.

Signed-off-by: Gu Zheng <guz.f...@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangc...@cn.fujitsu.com>
Signed-off-by: Zhu Guihua <zhugh.f...@cn.fujitsu.com>
---
 arc

[RESEND PATCH v5 2/5] x86, acpi, cpu-hotplug: Enable acpi to register all possible cpus at boot time.

2016-03-02 Thread Zhu Guihua
From: Gu Zheng 

[Problem]

cpuid <-> nodeid mapping is firstly established at boot time. And workqueue 
caches
the mapping in wq_numa_possible_cpumask in wq_numa_init() at boot time.

When doing node online/offline, cpuid <-> nodeid mapping is 
established/destroyed,
which means, cpuid <-> nodeid mapping will change if node hotplug happens. But
workqueue does not update wq_numa_possible_cpumask.

So here is the problem:

Assume we have the following cpuid <-> nodeid in the beginning:

  Node | CPU

node 0 |  0-14, 60-74
node 1 | 15-29, 75-89
node 2 | 30-44, 90-104
node 3 | 45-59, 105-119

and we hot-remove node2 and node3, it becomes:

  Node | CPU

node 0 |  0-14, 60-74
node 1 | 15-29, 75-89

and we hot-add node4 and node5, it becomes:

  Node | CPU

node 0 |  0-14, 60-74
node 1 | 15-29, 75-89
node 4 | 30-59
node 5 | 90-119

But in wq_numa_possible_cpumask, cpu30 is still mapped to node2, and the like.

When a pool workqueue is initialized, if its cpumask belongs to a node, its
pool->node will be mapped to that node. And memory used by this workqueue will
also be allocated on that node.

static struct worker_pool *get_unbound_pool(const struct workqueue_attrs 
*attrs){
...
/* if cpumask is contained inside a NUMA node, we belong to that node */
if (wq_numa_enabled) {
for_each_node(node) {
if (cpumask_subset(pool->attrs->cpumask,
   wq_numa_possible_cpumask[node])) {
pool->node = node;
break;
}
}
}

Since wq_numa_possible_cpumask is not updated, it could be mapped to an offline 
node,
which will lead to memory allocation failure:

 SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
  cache: kmalloc-192, object size: 192, buffer size: 192, default order: 1, min 
order: 0
  node 0: slabs: 6172, objs: 259224, free: 245741
  node 1: slabs: 3261, objs: 136962, free: 127656

It happens here:

create_worker(struct worker_pool *pool)
 |--> worker = alloc_worker(pool->node);

static struct worker *alloc_worker(int node)
{
struct worker *worker;

worker = kzalloc_node(sizeof(*worker), GFP_KERNEL, node); --> Here, 
useing the wrong node.

..

return worker;
}

[Solution]

There are four mappings in the kernel:
1. nodeid (logical node id)   <->   pxm
2. apicid (physical cpu id)   <->   nodeid
3. cpuid (logical cpu id) <->   apicid
4. cpuid (logical cpu id) <->   nodeid

1. pxm (proximity domain) is provided by ACPI firmware in SRAT, and nodeid <-> 
pxm
   mapping is setup at boot time. This mapping is persistent, won't change.

2. apicid <-> nodeid mapping is setup using info in 1. The mapping is setup at 
boot
   time and CPU hotadd time, and cleared at CPU hotremove time. This mapping is 
also
   persistent.

3. cpuid <-> apicid mapping is setup at boot time and CPU hotadd time. cpuid is
   allocated, lower ids first, and released at CPU hotremove time, reused for 
other
   hotadded CPUs. So this mapping is not persistent.

4. cpuid <-> nodeid mapping is also setup at boot time and CPU hotadd time, and
   cleared at CPU hotremove time. As a result of 3, this mapping is not 
persistent.

To fix this problem, we establish cpuid <-> nodeid mapping for all the possible
cpus at boot time, and make it persistent. And according to init_cpu_to_node(),
cpuid <-> nodeid mapping is based on apicid <-> nodeid mapping and cpuid <-> 
apicid
mapping. So the key point is obtaining all cpus' apicid.

apicid can be obtained by _MAT (Multiple APIC Table Entry) method or found in
MADT (Multiple APIC Description Table). So we finish the job in the following 
steps:

1. Enable apic registeration flow to handle both enabled and disabled cpus.
   This is done by introducing an extra parameter to generic_processor_info to 
let the
   caller control if disabled cpus are ignored.

2. Introduce a new array storing all possible cpuid <-> apicid mapping. And 
also modify
   the way cpuid is calculated. Establish all possible cpuid <-> apicid mapping 
when
   registering local apic. Store the mapping in this array.

3. Enable _MAT and MADT relative apis to return non-presnet or disabled cpus' 
apicid.
   This is also done by introducing an extra parameter to these apis to let the 
caller
   control if disabled cpus are ignored.

4. Establish all possible cpuid <-> nodeid mapping.
   This is done via an additional acpi namespace walk for processors.

This patch finished step 1.

Signed-off-by: Gu Zheng 
Signed-off-by: Tang Chen 
Signed-off-by: Zhu Guihua 
---
 arch/x86/kernel/apic/apic.c | 26 +++---
 1 file changed, 19 insertions(+), 7 deletions(-)

diff --git a/arch/x86

[RESEND PATCH v5 4/5] x86, acpi, cpu-hotplug: Enable MADT APIs to return disabled apicid.

2016-03-02 Thread Zhu Guihua
From: Gu Zheng <guz.f...@cn.fujitsu.com>

The whole patch-set aims at making cpuid <-> nodeid mapping persistent. So that,
when node online/offline happens, cache based on cpuid <-> nodeid mapping such 
as
wq_numa_possible_cpumask will not cause any problem.
It contains 4 steps:
1. Enable apic registeration flow to handle both enabled and disabled cpus.
2. Introduce a new array storing all possible cpuid <-> apicid mapping.
3. Enable _MAT and MADT relative apis to return non-presnet or disabled cpus' 
apicid.
4. Establish all possible cpuid <-> nodeid mapping.

This patch finishes step 3.

There are four mappings in the kernel:
1. nodeid (logical node id)   <->   pxm(persistent)
2. apicid (physical cpu id)   <->   nodeid (persistent)
3. cpuid (logical cpu id) <->   apicid (not persistent, now persistent 
by step 2)
4. cpuid (logical cpu id) <->   nodeid (not persistent)

So, in order to setup persistent cpuid <-> nodeid mapping for all possible CPUs,
we should:
1. Setup cpuid <-> apicid mapping for all possible CPUs, which has been done in 
step 1, 2.
2. Setup cpuid <-> nodeid mapping for all possible CPUs. But before that, we 
should
   obtain all apicids from MADT.

All processors' apicids can be obtained by _MAT method or from MADT in ACPI.
The current code ignores disabled processors and returns -ENODEV.

After this patch, a new parameter will be added to MADT APIs so that caller
is able to control if disabled processors are ignored.

Signed-off-by: Gu Zheng <guz.f...@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangc...@cn.fujitsu.com>
Signed-off-by: Zhu Guihua <zhugh.f...@cn.fujitsu.com>
---
 drivers/acpi/acpi_processor.c |  5 +++-
 drivers/acpi/processor_core.c | 57 +++
 2 files changed, 40 insertions(+), 22 deletions(-)

diff --git a/drivers/acpi/acpi_processor.c b/drivers/acpi/acpi_processor.c
index 6979186..d30111a 100644
--- a/drivers/acpi/acpi_processor.c
+++ b/drivers/acpi/acpi_processor.c
@@ -300,8 +300,11 @@ static int acpi_processor_get_info(struct acpi_device 
*device)
 *  Extra Processor objects may be enumerated on MP systems with
 *  less than the max # of CPUs. They should be ignored _iff
 *  they are physically not present.
+*
+*  NOTE: Even if the processor has a cpuid, it may not present because
+*  cpuid <-> apicid mapping is persistent now.
 */
-   if (invalid_logical_cpuid(pr->id)) {
+   if (invalid_logical_cpuid(pr->id) || !cpu_present(pr->id)) {
int ret = acpi_processor_hotadd_init(pr);
if (ret)
return ret;
diff --git a/drivers/acpi/processor_core.c b/drivers/acpi/processor_core.c
index 33a38d6..824b98b 100644
--- a/drivers/acpi/processor_core.c
+++ b/drivers/acpi/processor_core.c
@@ -32,12 +32,12 @@ static struct acpi_table_madt *get_madt_table(void)
 }
 
 static int map_lapic_id(struct acpi_subtable_header *entry,
-u32 acpi_id, phys_cpuid_t *apic_id)
+u32 acpi_id, phys_cpuid_t *apic_id, bool ignore_disabled)
 {
struct acpi_madt_local_apic *lapic =
container_of(entry, struct acpi_madt_local_apic, header);
 
-   if (!(lapic->lapic_flags & ACPI_MADT_ENABLED))
+   if (ignore_disabled && !(lapic->lapic_flags & ACPI_MADT_ENABLED))
return -ENODEV;
 
if (lapic->processor_id != acpi_id)
@@ -48,12 +48,13 @@ static int map_lapic_id(struct acpi_subtable_header *entry,
 }
 
 static int map_x2apic_id(struct acpi_subtable_header *entry,
-   int device_declaration, u32 acpi_id, phys_cpuid_t *apic_id)
+   int device_declaration, u32 acpi_id, phys_cpuid_t *apic_id,
+   bool ignore_disabled)
 {
struct acpi_madt_local_x2apic *apic =
container_of(entry, struct acpi_madt_local_x2apic, header);
 
-   if (!(apic->lapic_flags & ACPI_MADT_ENABLED))
+   if (ignore_disabled && !(apic->lapic_flags & ACPI_MADT_ENABLED))
return -ENODEV;
 
if (device_declaration && (apic->uid == acpi_id)) {
@@ -65,12 +66,13 @@ static int map_x2apic_id(struct acpi_subtable_header *entry,
 }
 
 static int map_lsapic_id(struct acpi_subtable_header *entry,
-   int device_declaration, u32 acpi_id, phys_cpuid_t *apic_id)
+   int device_declaration, u32 acpi_id, phys_cpuid_t *apic_id,
+   bool ignore_disabled)
 {
struct acpi_madt_local_sapic *lsapic =
container_of(entry, struct acpi_madt_local_sapic, header);
 
-   if (!(lsapic->lapic_flags & ACPI_MADT_ENABLED))
+   if (ignore_disabled && !(lsapic->lapic_flags & ACPI_MADT_ENABLED))
return -ENODEV;
 
if (device_declaration) {
@@ -87,12 +89,1

[RESEND PATCH v5 4/5] x86, acpi, cpu-hotplug: Enable MADT APIs to return disabled apicid.

2016-03-02 Thread Zhu Guihua
From: Gu Zheng 

The whole patch-set aims at making cpuid <-> nodeid mapping persistent. So that,
when node online/offline happens, cache based on cpuid <-> nodeid mapping such 
as
wq_numa_possible_cpumask will not cause any problem.
It contains 4 steps:
1. Enable apic registeration flow to handle both enabled and disabled cpus.
2. Introduce a new array storing all possible cpuid <-> apicid mapping.
3. Enable _MAT and MADT relative apis to return non-presnet or disabled cpus' 
apicid.
4. Establish all possible cpuid <-> nodeid mapping.

This patch finishes step 3.

There are four mappings in the kernel:
1. nodeid (logical node id)   <->   pxm(persistent)
2. apicid (physical cpu id)   <->   nodeid (persistent)
3. cpuid (logical cpu id) <->   apicid (not persistent, now persistent 
by step 2)
4. cpuid (logical cpu id) <->   nodeid (not persistent)

So, in order to setup persistent cpuid <-> nodeid mapping for all possible CPUs,
we should:
1. Setup cpuid <-> apicid mapping for all possible CPUs, which has been done in 
step 1, 2.
2. Setup cpuid <-> nodeid mapping for all possible CPUs. But before that, we 
should
   obtain all apicids from MADT.

All processors' apicids can be obtained by _MAT method or from MADT in ACPI.
The current code ignores disabled processors and returns -ENODEV.

After this patch, a new parameter will be added to MADT APIs so that caller
is able to control if disabled processors are ignored.

Signed-off-by: Gu Zheng 
Signed-off-by: Tang Chen 
Signed-off-by: Zhu Guihua 
---
 drivers/acpi/acpi_processor.c |  5 +++-
 drivers/acpi/processor_core.c | 57 +++
 2 files changed, 40 insertions(+), 22 deletions(-)

diff --git a/drivers/acpi/acpi_processor.c b/drivers/acpi/acpi_processor.c
index 6979186..d30111a 100644
--- a/drivers/acpi/acpi_processor.c
+++ b/drivers/acpi/acpi_processor.c
@@ -300,8 +300,11 @@ static int acpi_processor_get_info(struct acpi_device 
*device)
 *  Extra Processor objects may be enumerated on MP systems with
 *  less than the max # of CPUs. They should be ignored _iff
 *  they are physically not present.
+*
+*  NOTE: Even if the processor has a cpuid, it may not present because
+*  cpuid <-> apicid mapping is persistent now.
 */
-   if (invalid_logical_cpuid(pr->id)) {
+   if (invalid_logical_cpuid(pr->id) || !cpu_present(pr->id)) {
int ret = acpi_processor_hotadd_init(pr);
if (ret)
return ret;
diff --git a/drivers/acpi/processor_core.c b/drivers/acpi/processor_core.c
index 33a38d6..824b98b 100644
--- a/drivers/acpi/processor_core.c
+++ b/drivers/acpi/processor_core.c
@@ -32,12 +32,12 @@ static struct acpi_table_madt *get_madt_table(void)
 }
 
 static int map_lapic_id(struct acpi_subtable_header *entry,
-u32 acpi_id, phys_cpuid_t *apic_id)
+u32 acpi_id, phys_cpuid_t *apic_id, bool ignore_disabled)
 {
struct acpi_madt_local_apic *lapic =
container_of(entry, struct acpi_madt_local_apic, header);
 
-   if (!(lapic->lapic_flags & ACPI_MADT_ENABLED))
+   if (ignore_disabled && !(lapic->lapic_flags & ACPI_MADT_ENABLED))
return -ENODEV;
 
if (lapic->processor_id != acpi_id)
@@ -48,12 +48,13 @@ static int map_lapic_id(struct acpi_subtable_header *entry,
 }
 
 static int map_x2apic_id(struct acpi_subtable_header *entry,
-   int device_declaration, u32 acpi_id, phys_cpuid_t *apic_id)
+   int device_declaration, u32 acpi_id, phys_cpuid_t *apic_id,
+   bool ignore_disabled)
 {
struct acpi_madt_local_x2apic *apic =
container_of(entry, struct acpi_madt_local_x2apic, header);
 
-   if (!(apic->lapic_flags & ACPI_MADT_ENABLED))
+   if (ignore_disabled && !(apic->lapic_flags & ACPI_MADT_ENABLED))
return -ENODEV;
 
if (device_declaration && (apic->uid == acpi_id)) {
@@ -65,12 +66,13 @@ static int map_x2apic_id(struct acpi_subtable_header *entry,
 }
 
 static int map_lsapic_id(struct acpi_subtable_header *entry,
-   int device_declaration, u32 acpi_id, phys_cpuid_t *apic_id)
+   int device_declaration, u32 acpi_id, phys_cpuid_t *apic_id,
+   bool ignore_disabled)
 {
struct acpi_madt_local_sapic *lsapic =
container_of(entry, struct acpi_madt_local_sapic, header);
 
-   if (!(lsapic->lapic_flags & ACPI_MADT_ENABLED))
+   if (ignore_disabled && !(lsapic->lapic_flags & ACPI_MADT_ENABLED))
return -ENODEV;
 
if (device_declaration) {
@@ -87,12 +89,13 @@ static int map_lsapic_id(struct acpi_subtable_header *entry,
  * Retrieve the ARM CPU physical identifier (MPIDR)

[RESEND PATCH v5 3/5] x86, acpi, cpu-hotplug: Introduce cpuid_to_apicid[] array to store persistent cpuid <-> apicid mapping.

2016-03-02 Thread Zhu Guihua
From: Gu Zheng <guz.f...@cn.fujitsu.com>

The whole patch-set aims at making cpuid <-> nodeid mapping persistent. So that,
when node online/offline happens, cache based on cpuid <-> nodeid mapping such 
as
wq_numa_possible_cpumask will not cause any problem.
It contains 4 steps:
1. Enable apic registeration flow to handle both enabled and disabled cpus.
2. Introduce a new array storing all possible cpuid <-> apicid mapping.
3. Enable _MAT and MADT relative apis to return non-presnet or disabled cpus' 
apicid.
4. Establish all possible cpuid <-> nodeid mapping.

This patch finishes step 2.

In this patch, we introduce a new static array named cpuid_to_apicid[],
which is large enough to store info for all possible cpus.

And then, we modify the cpuid calculation. In generic_processor_info(),
it simply finds the next unused cpuid. And it is also why the cpuid <-> nodeid
mapping changes with node hotplug.

After this patch, we find the next unused cpuid, map it to an apicid,
and store the mapping in cpuid_to_apicid[], so that cpuid <-> apicid
mapping will be persistent.

And finally we will use this array to make cpuid <-> nodeid persistent.

cpuid <-> apicid mapping is established at local apic registeration time.
But non-present or disabled cpus are ignored.

In this patch, we establish all possible cpuid <-> apicid mapping when
registering local apic.

Signed-off-by: Gu Zheng <guz.f...@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangc...@cn.fujitsu.com>
Signed-off-by: Zhu Guihua <zhugh.f...@cn.fujitsu.com>
---
 arch/x86/include/asm/mpspec.h |  1 +
 arch/x86/kernel/acpi/boot.c   |  6 ++---
 arch/x86/kernel/apic/apic.c   | 61 ---
 3 files changed, 61 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/mpspec.h b/arch/x86/include/asm/mpspec.h
index b07233b..db902d8 100644
--- a/arch/x86/include/asm/mpspec.h
+++ b/arch/x86/include/asm/mpspec.h
@@ -86,6 +86,7 @@ static inline void early_reserve_e820_mpc_new(void) { }
 #endif
 
 int generic_processor_info(int apicid, int version);
+int __generic_processor_info(int apicid, int version, bool enabled);
 
 #define PHYSID_ARRAY_SIZE  BITS_TO_LONGS(MAX_LOCAL_APIC)
 
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index e759076..0ce06ee 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -174,15 +174,13 @@ static int acpi_register_lapic(int id, u8 enabled)
return -EINVAL;
}
 
-   if (!enabled) {
+   if (!enabled)
++disabled_cpus;
-   return -EINVAL;
-   }
 
if (boot_cpu_physical_apicid != -1U)
ver = apic_version[boot_cpu_physical_apicid];
 
-   return generic_processor_info(id, ver);
+   return __generic_processor_info(id, ver, enabled);
 }
 
 static int __init
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index 1625778..4822cda 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -1998,7 +1998,53 @@ void disconnect_bsp_APIC(int virt_wire_setup)
apic_write(APIC_LVT1, value);
 }
 
-static int __generic_processor_info(int apicid, int version, bool enabled)
+/*
+ * The number of allocated logical CPU IDs. Since logical CPU IDs are allocated
+ * contiguously, it equals to current allocated max logical CPU ID plus 1.
+ * All allocated CPU ID should be in [0, nr_logical_cpuidi), so the maximum of
+ * nr_logical_cpuids is nr_cpu_ids.
+ *
+ * NOTE: Reserve 0 for BSP.
+ */
+static int nr_logical_cpuids = 1;
+
+/*
+ * Used to store mapping between logical CPU IDs and APIC IDs.
+ */
+static int cpuid_to_apicid[] = {
+   [0 ... NR_CPUS - 1] = -1,
+};
+
+/*
+ * Should use this API to allocate logical CPU IDs to keep nr_logical_cpuids
+ * and cpuid_to_apicid[] synchronized.
+ */
+static int allocate_logical_cpuid(int apicid)
+{
+   int i;
+
+   /*
+* cpuid <-> apicid mapping is persistent, so when a cpu is up,
+* check if the kernel has allocated a cpuid for it.
+*/
+   for (i = 0; i < nr_logical_cpuids; i++) {
+   if (cpuid_to_apicid[i] == apicid)
+   return i;
+   }
+
+   /* Allocate a new cpuid. */
+   if (nr_logical_cpuids >= nr_cpu_ids) {
+   WARN_ONCE(1, "Only %d processors supported."
+"Processor %d/0x%x and the rest are ignored.\n",
+nr_cpu_ids - 1, nr_logical_cpuids, apicid);
+   return -1;
+   }
+
+   cpuid_to_apicid[nr_logical_cpuids] = apicid;
+   return nr_logical_cpuids++;
+}
+
+int __generic_processor_info(int apicid, int version, bool enabled)
 {
int cpu, max = nr_cpu_ids;
bool boot_cpu_detected = physid_isset(boot_cpu_physical_apicid,
@@ -2079,8 +2125,17 @@ static int __generic_processor_info(int apicid, int 

[RESEND PATCH v5 3/5] x86, acpi, cpu-hotplug: Introduce cpuid_to_apicid[] array to store persistent cpuid <-> apicid mapping.

2016-03-02 Thread Zhu Guihua
From: Gu Zheng 

The whole patch-set aims at making cpuid <-> nodeid mapping persistent. So that,
when node online/offline happens, cache based on cpuid <-> nodeid mapping such 
as
wq_numa_possible_cpumask will not cause any problem.
It contains 4 steps:
1. Enable apic registeration flow to handle both enabled and disabled cpus.
2. Introduce a new array storing all possible cpuid <-> apicid mapping.
3. Enable _MAT and MADT relative apis to return non-presnet or disabled cpus' 
apicid.
4. Establish all possible cpuid <-> nodeid mapping.

This patch finishes step 2.

In this patch, we introduce a new static array named cpuid_to_apicid[],
which is large enough to store info for all possible cpus.

And then, we modify the cpuid calculation. In generic_processor_info(),
it simply finds the next unused cpuid. And it is also why the cpuid <-> nodeid
mapping changes with node hotplug.

After this patch, we find the next unused cpuid, map it to an apicid,
and store the mapping in cpuid_to_apicid[], so that cpuid <-> apicid
mapping will be persistent.

And finally we will use this array to make cpuid <-> nodeid persistent.

cpuid <-> apicid mapping is established at local apic registeration time.
But non-present or disabled cpus are ignored.

In this patch, we establish all possible cpuid <-> apicid mapping when
registering local apic.

Signed-off-by: Gu Zheng 
Signed-off-by: Tang Chen 
Signed-off-by: Zhu Guihua 
---
 arch/x86/include/asm/mpspec.h |  1 +
 arch/x86/kernel/acpi/boot.c   |  6 ++---
 arch/x86/kernel/apic/apic.c   | 61 ---
 3 files changed, 61 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/mpspec.h b/arch/x86/include/asm/mpspec.h
index b07233b..db902d8 100644
--- a/arch/x86/include/asm/mpspec.h
+++ b/arch/x86/include/asm/mpspec.h
@@ -86,6 +86,7 @@ static inline void early_reserve_e820_mpc_new(void) { }
 #endif
 
 int generic_processor_info(int apicid, int version);
+int __generic_processor_info(int apicid, int version, bool enabled);
 
 #define PHYSID_ARRAY_SIZE  BITS_TO_LONGS(MAX_LOCAL_APIC)
 
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index e759076..0ce06ee 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -174,15 +174,13 @@ static int acpi_register_lapic(int id, u8 enabled)
return -EINVAL;
}
 
-   if (!enabled) {
+   if (!enabled)
++disabled_cpus;
-   return -EINVAL;
-   }
 
if (boot_cpu_physical_apicid != -1U)
ver = apic_version[boot_cpu_physical_apicid];
 
-   return generic_processor_info(id, ver);
+   return __generic_processor_info(id, ver, enabled);
 }
 
 static int __init
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index 1625778..4822cda 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -1998,7 +1998,53 @@ void disconnect_bsp_APIC(int virt_wire_setup)
apic_write(APIC_LVT1, value);
 }
 
-static int __generic_processor_info(int apicid, int version, bool enabled)
+/*
+ * The number of allocated logical CPU IDs. Since logical CPU IDs are allocated
+ * contiguously, it equals to current allocated max logical CPU ID plus 1.
+ * All allocated CPU ID should be in [0, nr_logical_cpuidi), so the maximum of
+ * nr_logical_cpuids is nr_cpu_ids.
+ *
+ * NOTE: Reserve 0 for BSP.
+ */
+static int nr_logical_cpuids = 1;
+
+/*
+ * Used to store mapping between logical CPU IDs and APIC IDs.
+ */
+static int cpuid_to_apicid[] = {
+   [0 ... NR_CPUS - 1] = -1,
+};
+
+/*
+ * Should use this API to allocate logical CPU IDs to keep nr_logical_cpuids
+ * and cpuid_to_apicid[] synchronized.
+ */
+static int allocate_logical_cpuid(int apicid)
+{
+   int i;
+
+   /*
+* cpuid <-> apicid mapping is persistent, so when a cpu is up,
+* check if the kernel has allocated a cpuid for it.
+*/
+   for (i = 0; i < nr_logical_cpuids; i++) {
+   if (cpuid_to_apicid[i] == apicid)
+   return i;
+   }
+
+   /* Allocate a new cpuid. */
+   if (nr_logical_cpuids >= nr_cpu_ids) {
+   WARN_ONCE(1, "Only %d processors supported."
+"Processor %d/0x%x and the rest are ignored.\n",
+nr_cpu_ids - 1, nr_logical_cpuids, apicid);
+   return -1;
+   }
+
+   cpuid_to_apicid[nr_logical_cpuids] = apicid;
+   return nr_logical_cpuids++;
+}
+
+int __generic_processor_info(int apicid, int version, bool enabled)
 {
int cpu, max = nr_cpu_ids;
bool boot_cpu_detected = physid_isset(boot_cpu_physical_apicid,
@@ -2079,8 +2125,17 @@ static int __generic_processor_info(int apicid, int 
version, bool enabled)
 * for BSP.
 */
cpu = 0;
-   } else
-

[RESEND PATCH v5 5/5] x86, acpi, cpu-hotplug: Set persistent cpuid <-> nodeid mapping when booting.

2016-03-02 Thread Zhu Guihua
From: Gu Zheng <guz.f...@cn.fujitsu.com>

The whole patch-set aims at making cpuid <-> nodeid mapping persistent. So that,
when node online/offline happens, cache based on cpuid <-> nodeid mapping such 
as
wq_numa_possible_cpumask will not cause any problem.
It contains 4 steps:
1. Enable apic registeration flow to handle both enabled and disabled cpus.
2. Introduce a new array storing all possible cpuid <-> apicid mapping.
3. Enable _MAT and MADT relative apis to return non-presnet or disabled cpus' 
apicid.
4. Establish all possible cpuid <-> nodeid mapping.

This patch finishes step 4.

This patch set the persistent cpuid <-> nodeid mapping for all enabled/disabled
processors at boot time via an additional acpi namespace walk for processors.

Signed-off-by: Gu Zheng <guz.f...@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangc...@cn.fujitsu.com>
Signed-off-by: Zhu Guihua <zhugh.f...@cn.fujitsu.com>
---
 arch/ia64/kernel/acpi.c   |  2 +-
 arch/x86/kernel/acpi/boot.c   |  2 +-
 drivers/acpi/bus.c|  3 ++
 drivers/acpi/processor_core.c | 65 +++
 include/linux/acpi.h  |  2 ++
 5 files changed, 72 insertions(+), 2 deletions(-)

diff --git a/arch/ia64/kernel/acpi.c b/arch/ia64/kernel/acpi.c
index b1698bc..7db5563 100644
--- a/arch/ia64/kernel/acpi.c
+++ b/arch/ia64/kernel/acpi.c
@@ -796,7 +796,7 @@ int acpi_isa_irq_to_gsi(unsigned isa_irq, u32 *gsi)
  *  ACPI based hotplug CPU support
  */
 #ifdef CONFIG_ACPI_HOTPLUG_CPU
-static int acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
+int acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
 {
 #ifdef CONFIG_ACPI_NUMA
/*
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 0ce06ee..7d45261 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -696,7 +696,7 @@ static void __init acpi_set_irq_model_ioapic(void)
 #ifdef CONFIG_ACPI_HOTPLUG_CPU
 #include 
 
-static void acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
+void acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
 {
 #ifdef CONFIG_ACPI_NUMA
int nid;
diff --git a/drivers/acpi/bus.c b/drivers/acpi/bus.c
index 891c42d..d92f45f 100644
--- a/drivers/acpi/bus.c
+++ b/drivers/acpi/bus.c
@@ -1096,6 +1096,9 @@ static int __init acpi_init(void)
acpi_sleep_proc_init();
acpi_wakeup_device_init();
acpi_debugger_init();
+#ifdef CONFIG_ACPI_HOTPLUG_CPU
+   acpi_set_processor_mapping();
+#endif
return 0;
 }
 
diff --git a/drivers/acpi/processor_core.c b/drivers/acpi/processor_core.c
index 824b98b..45580ff 100644
--- a/drivers/acpi/processor_core.c
+++ b/drivers/acpi/processor_core.c
@@ -261,6 +261,71 @@ int acpi_get_cpuid(acpi_handle handle, int type, u32 
acpi_id)
 }
 EXPORT_SYMBOL_GPL(acpi_get_cpuid);
 
+#ifdef CONFIG_ACPI_HOTPLUG_CPU
+static bool map_processor(acpi_handle handle, int *phys_id, int *cpuid)
+{
+   int type;
+   u32 acpi_id;
+   acpi_status status;
+   acpi_object_type acpi_type;
+   unsigned long long tmp;
+   union acpi_object object = { 0 };
+   struct acpi_buffer buffer = { sizeof(union acpi_object),  };
+
+   status = acpi_get_type(handle, _type);
+   if (ACPI_FAILURE(status))
+   return false;
+
+   switch (acpi_type) {
+   case ACPI_TYPE_PROCESSOR:
+   status = acpi_evaluate_object(handle, NULL, NULL, );
+   if (ACPI_FAILURE(status))
+   return false;
+   acpi_id = object.processor.proc_id;
+   break;
+   case ACPI_TYPE_DEVICE:
+   status = acpi_evaluate_integer(handle, "_UID", NULL, );
+   if (ACPI_FAILURE(status))
+   return false;
+   acpi_id = tmp;
+   break;
+   default:
+   return false;
+   }
+
+   type = (acpi_type == ACPI_TYPE_DEVICE) ? 1 : 0;
+
+   *phys_id = __acpi_get_phys_id(handle, type, acpi_id, false);
+   *cpuid = acpi_map_cpuid(*phys_id, acpi_id);
+   if (*cpuid == -1)
+   return false;
+
+   return true;
+}
+
+static acpi_status __init
+set_processor_node_mapping(acpi_handle handle, u32 lvl, void *context,
+  void **rv)
+{
+   u32 apic_id;
+   int cpu_id;
+
+   if (!map_processor(handle, _id, _id))
+   return AE_ERROR;
+
+   acpi_map_cpu2node(handle, cpu_id, apic_id);
+   return AE_OK;
+}
+
+void __init acpi_set_processor_mapping(void)
+{
+   /* Set persistent cpu <-> node mapping for all processors. */
+   acpi_walk_namespace(ACPI_TYPE_PROCESSOR, ACPI_ROOT_OBJECT,
+   ACPI_UINT32_MAX, set_processor_node_mapping,
+   NULL, NULL, NULL);
+}
+#endif
+
 #ifdef CONFIG_ACPI_HOTPLUG_IOAPIC
 static int get_ioapic_id(struct acpi_subtable_header *entry, u32 gsi_base,
   

[RESEND PATCH v5 5/5] x86, acpi, cpu-hotplug: Set persistent cpuid <-> nodeid mapping when booting.

2016-03-02 Thread Zhu Guihua
From: Gu Zheng 

The whole patch-set aims at making cpuid <-> nodeid mapping persistent. So that,
when node online/offline happens, cache based on cpuid <-> nodeid mapping such 
as
wq_numa_possible_cpumask will not cause any problem.
It contains 4 steps:
1. Enable apic registeration flow to handle both enabled and disabled cpus.
2. Introduce a new array storing all possible cpuid <-> apicid mapping.
3. Enable _MAT and MADT relative apis to return non-presnet or disabled cpus' 
apicid.
4. Establish all possible cpuid <-> nodeid mapping.

This patch finishes step 4.

This patch set the persistent cpuid <-> nodeid mapping for all enabled/disabled
processors at boot time via an additional acpi namespace walk for processors.

Signed-off-by: Gu Zheng 
Signed-off-by: Tang Chen 
Signed-off-by: Zhu Guihua 
---
 arch/ia64/kernel/acpi.c   |  2 +-
 arch/x86/kernel/acpi/boot.c   |  2 +-
 drivers/acpi/bus.c|  3 ++
 drivers/acpi/processor_core.c | 65 +++
 include/linux/acpi.h  |  2 ++
 5 files changed, 72 insertions(+), 2 deletions(-)

diff --git a/arch/ia64/kernel/acpi.c b/arch/ia64/kernel/acpi.c
index b1698bc..7db5563 100644
--- a/arch/ia64/kernel/acpi.c
+++ b/arch/ia64/kernel/acpi.c
@@ -796,7 +796,7 @@ int acpi_isa_irq_to_gsi(unsigned isa_irq, u32 *gsi)
  *  ACPI based hotplug CPU support
  */
 #ifdef CONFIG_ACPI_HOTPLUG_CPU
-static int acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
+int acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
 {
 #ifdef CONFIG_ACPI_NUMA
/*
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 0ce06ee..7d45261 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -696,7 +696,7 @@ static void __init acpi_set_irq_model_ioapic(void)
 #ifdef CONFIG_ACPI_HOTPLUG_CPU
 #include 
 
-static void acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
+void acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
 {
 #ifdef CONFIG_ACPI_NUMA
int nid;
diff --git a/drivers/acpi/bus.c b/drivers/acpi/bus.c
index 891c42d..d92f45f 100644
--- a/drivers/acpi/bus.c
+++ b/drivers/acpi/bus.c
@@ -1096,6 +1096,9 @@ static int __init acpi_init(void)
acpi_sleep_proc_init();
acpi_wakeup_device_init();
acpi_debugger_init();
+#ifdef CONFIG_ACPI_HOTPLUG_CPU
+   acpi_set_processor_mapping();
+#endif
return 0;
 }
 
diff --git a/drivers/acpi/processor_core.c b/drivers/acpi/processor_core.c
index 824b98b..45580ff 100644
--- a/drivers/acpi/processor_core.c
+++ b/drivers/acpi/processor_core.c
@@ -261,6 +261,71 @@ int acpi_get_cpuid(acpi_handle handle, int type, u32 
acpi_id)
 }
 EXPORT_SYMBOL_GPL(acpi_get_cpuid);
 
+#ifdef CONFIG_ACPI_HOTPLUG_CPU
+static bool map_processor(acpi_handle handle, int *phys_id, int *cpuid)
+{
+   int type;
+   u32 acpi_id;
+   acpi_status status;
+   acpi_object_type acpi_type;
+   unsigned long long tmp;
+   union acpi_object object = { 0 };
+   struct acpi_buffer buffer = { sizeof(union acpi_object),  };
+
+   status = acpi_get_type(handle, _type);
+   if (ACPI_FAILURE(status))
+   return false;
+
+   switch (acpi_type) {
+   case ACPI_TYPE_PROCESSOR:
+   status = acpi_evaluate_object(handle, NULL, NULL, );
+   if (ACPI_FAILURE(status))
+   return false;
+   acpi_id = object.processor.proc_id;
+   break;
+   case ACPI_TYPE_DEVICE:
+   status = acpi_evaluate_integer(handle, "_UID", NULL, );
+   if (ACPI_FAILURE(status))
+   return false;
+   acpi_id = tmp;
+   break;
+   default:
+   return false;
+   }
+
+   type = (acpi_type == ACPI_TYPE_DEVICE) ? 1 : 0;
+
+   *phys_id = __acpi_get_phys_id(handle, type, acpi_id, false);
+   *cpuid = acpi_map_cpuid(*phys_id, acpi_id);
+   if (*cpuid == -1)
+   return false;
+
+   return true;
+}
+
+static acpi_status __init
+set_processor_node_mapping(acpi_handle handle, u32 lvl, void *context,
+  void **rv)
+{
+   u32 apic_id;
+   int cpu_id;
+
+   if (!map_processor(handle, _id, _id))
+   return AE_ERROR;
+
+   acpi_map_cpu2node(handle, cpu_id, apic_id);
+   return AE_OK;
+}
+
+void __init acpi_set_processor_mapping(void)
+{
+   /* Set persistent cpu <-> node mapping for all processors. */
+   acpi_walk_namespace(ACPI_TYPE_PROCESSOR, ACPI_ROOT_OBJECT,
+   ACPI_UINT32_MAX, set_processor_node_mapping,
+   NULL, NULL, NULL);
+}
+#endif
+
 #ifdef CONFIG_ACPI_HOTPLUG_IOAPIC
 static int get_ioapic_id(struct acpi_subtable_header *entry, u32 gsi_base,
 u64 *phys_addr, int *ioapic_id)
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index 06ed7e5..0

Re: [PATCH v5 RESEND 0/5] Make cpuid <-> nodeid mapping persistent

2016-02-17 Thread Zhu Guihua

Hi Rafael,

On 02/03/2016 08:02 PM, Rafael J. Wysocki wrote:

Hi,

On Wed, Feb 3, 2016 at 10:14 AM, Zhu Guihua <zhugh.f...@cn.fujitsu.com> wrote:

On 01/25/2016 02:12 PM, Tang Chen wrote:

Hi Rafael, Len,

Would you please help to review the ACPI part of this patch-set ?


Can anyone help to review this?

I'm planning to look into this more thoroughly in the next few days.


Were you reviewing this ?

Thanks.


Thanks,
Rafael


.







Re: [PATCH v5 RESEND 0/5] Make cpuid <-> nodeid mapping persistent

2016-02-17 Thread Zhu Guihua

Hi Rafael,

On 02/03/2016 08:02 PM, Rafael J. Wysocki wrote:

Hi,

On Wed, Feb 3, 2016 at 10:14 AM, Zhu Guihua  wrote:

On 01/25/2016 02:12 PM, Tang Chen wrote:

Hi Rafael, Len,

Would you please help to review the ACPI part of this patch-set ?


Can anyone help to review this?

I'm planning to look into this more thoroughly in the next few days.


Were you reviewing this ?

Thanks.


Thanks,
Rafael


.







Re: [PATCH v5 RESEND 0/5] Make cpuid <-> nodeid mapping persistent

2016-02-03 Thread Zhu Guihua


On 01/25/2016 02:12 PM, Tang Chen wrote:

Hi Rafael, Len,

Would you please help to review the ACPI part of this patch-set ?


Can anyone help to review this?

Cc tj:
Can you add acked-by into this patchset?

Thanks.


Thanks.

On 01/25/2016 02:08 PM, Tang Chen wrote:

[Problem]

cpuid <-> nodeid mapping is firstly established at boot time. And 
workqueue caches

the mapping in wq_numa_possible_cpumask in wq_numa_init() at boot time.

When doing node online/offline, cpuid <-> nodeid mapping is 
established/destroyed,
which means, cpuid <-> nodeid mapping will change if node hotplug 
happens. But

workqueue does not update wq_numa_possible_cpumask.

So here is the problem:

Assume we have the following cpuid <-> nodeid in the beginning:

   Node | CPU

node 0 |  0-14, 60-74
node 1 | 15-29, 75-89
node 2 | 30-44, 90-104
node 3 | 45-59, 105-119

and we hot-remove node2 and node3, it becomes:

   Node | CPU

node 0 |  0-14, 60-74
node 1 | 15-29, 75-89

and we hot-add node4 and node5, it becomes:

   Node | CPU

node 0 |  0-14, 60-74
node 1 | 15-29, 75-89
node 4 | 30-59
node 5 | 90-119

But in wq_numa_possible_cpumask, cpu30 is still mapped to node2, and 
the like.


When a pool workqueue is initialized, if its cpumask belongs to a 
node, its
pool->node will be mapped to that node. And memory used by this 
workqueue will

also be allocated on that node.

static struct worker_pool *get_unbound_pool(const struct 
workqueue_attrs *attrs){

...
 /* if cpumask is contained inside a NUMA node, we belong to 
that node */

 if (wq_numa_enabled) {
 for_each_node(node) {
 if (cpumask_subset(pool->attrs->cpumask,
wq_numa_possible_cpumask[node])) {
 pool->node = node;
 break;
 }
 }
 }

Since wq_numa_possible_cpumask is not updated, it could be mapped to 
an offline node,

which will lead to memory allocation failure:

  SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
   cache: kmalloc-192, object size: 192, buffer size: 192, default 
order: 1, min order: 0

   node 0: slabs: 6172, objs: 259224, free: 245741
   node 1: slabs: 3261, objs: 136962, free: 127656

It happens here:

create_worker(struct worker_pool *pool)
  |--> worker = alloc_worker(pool->node);

static struct worker *alloc_worker(int node)
{
 struct worker *worker;

 worker = kzalloc_node(sizeof(*worker), GFP_KERNEL, node); 
--> Here, useing the wrong node.


 ..

 return worker;
}


[Solution]

There are four mappings in the kernel:
1. nodeid (logical node id)   <->   pxm
2. apicid (physical cpu id)   <->   nodeid
3. cpuid (logical cpu id) <->   apicid
4. cpuid (logical cpu id) <->   nodeid

1. pxm (proximity domain) is provided by ACPI firmware in SRAT, and 
nodeid <-> pxm
mapping is setup at boot time. This mapping is persistent, won't 
change.


2. apicid <-> nodeid mapping is setup using info in 1. The mapping is 
setup at boot
time and CPU hotadd time, and cleared at CPU hotremove time. This 
mapping is also

persistent.

3. cpuid <-> apicid mapping is setup at boot time and CPU hotadd 
time. cpuid is
allocated, lower ids first, and released at CPU hotremove time, 
reused for other

hotadded CPUs. So this mapping is not persistent.

4. cpuid <-> nodeid mapping is also setup at boot time and CPU hotadd 
time, and
cleared at CPU hotremove time. As a result of 3, this mapping is 
not persistent.


To fix this problem, we establish cpuid <-> nodeid mapping for all 
the possible
cpus at boot time, and make it persistent. And according to 
init_cpu_to_node(),
cpuid <-> nodeid mapping is based on apicid <-> nodeid mapping and 
cpuid <-> apicid

mapping. So the key point is obtaining all cpus' apicid.

apicid can be obtained by _MAT (Multiple APIC Table Entry) method or 
found in
MADT (Multiple APIC Description Table). So we finish the job in the 
following steps:


1. Enable apic registeration flow to handle both enabled and disabled 
cpus.
This is done by introducing an extra parameter to 
generic_processor_info to let the

caller control if disabled cpus are ignored.

2. Introduce a new array storing all possible cpuid <-> apicid 
mapping. And also modify
the way cpuid is calculated. Establish all possible cpuid <-> 
apicid mapping when

registering local apic. Store the mapping in this array.

3. Enable _MAT and MADT relative apis to return non-presnet or 
disabled cpus' apicid.
This is also done by introducing an extra parameter to these apis 
to let the caller

control if disabled cpus are ignored.

4. Establish all possible cpuid <-> nodeid mapping.
This is done via an additional acpi namespace walk for processors.


For previous discussion, please refer to:
https://lkml.org/lkml/2015/2/27/145

Re: [PATCH v5 RESEND 0/5] Make cpuid <-> nodeid mapping persistent

2016-02-03 Thread Zhu Guihua


On 01/25/2016 02:12 PM, Tang Chen wrote:

Hi Rafael, Len,

Would you please help to review the ACPI part of this patch-set ?


Can anyone help to review this?

Cc tj:
Can you add acked-by into this patchset?

Thanks.


Thanks.

On 01/25/2016 02:08 PM, Tang Chen wrote:

[Problem]

cpuid <-> nodeid mapping is firstly established at boot time. And 
workqueue caches

the mapping in wq_numa_possible_cpumask in wq_numa_init() at boot time.

When doing node online/offline, cpuid <-> nodeid mapping is 
established/destroyed,
which means, cpuid <-> nodeid mapping will change if node hotplug 
happens. But

workqueue does not update wq_numa_possible_cpumask.

So here is the problem:

Assume we have the following cpuid <-> nodeid in the beginning:

   Node | CPU

node 0 |  0-14, 60-74
node 1 | 15-29, 75-89
node 2 | 30-44, 90-104
node 3 | 45-59, 105-119

and we hot-remove node2 and node3, it becomes:

   Node | CPU

node 0 |  0-14, 60-74
node 1 | 15-29, 75-89

and we hot-add node4 and node5, it becomes:

   Node | CPU

node 0 |  0-14, 60-74
node 1 | 15-29, 75-89
node 4 | 30-59
node 5 | 90-119

But in wq_numa_possible_cpumask, cpu30 is still mapped to node2, and 
the like.


When a pool workqueue is initialized, if its cpumask belongs to a 
node, its
pool->node will be mapped to that node. And memory used by this 
workqueue will

also be allocated on that node.

static struct worker_pool *get_unbound_pool(const struct 
workqueue_attrs *attrs){

...
 /* if cpumask is contained inside a NUMA node, we belong to 
that node */

 if (wq_numa_enabled) {
 for_each_node(node) {
 if (cpumask_subset(pool->attrs->cpumask,
wq_numa_possible_cpumask[node])) {
 pool->node = node;
 break;
 }
 }
 }

Since wq_numa_possible_cpumask is not updated, it could be mapped to 
an offline node,

which will lead to memory allocation failure:

  SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
   cache: kmalloc-192, object size: 192, buffer size: 192, default 
order: 1, min order: 0

   node 0: slabs: 6172, objs: 259224, free: 245741
   node 1: slabs: 3261, objs: 136962, free: 127656

It happens here:

create_worker(struct worker_pool *pool)
  |--> worker = alloc_worker(pool->node);

static struct worker *alloc_worker(int node)
{
 struct worker *worker;

 worker = kzalloc_node(sizeof(*worker), GFP_KERNEL, node); 
--> Here, useing the wrong node.


 ..

 return worker;
}


[Solution]

There are four mappings in the kernel:
1. nodeid (logical node id)   <->   pxm
2. apicid (physical cpu id)   <->   nodeid
3. cpuid (logical cpu id) <->   apicid
4. cpuid (logical cpu id) <->   nodeid

1. pxm (proximity domain) is provided by ACPI firmware in SRAT, and 
nodeid <-> pxm
mapping is setup at boot time. This mapping is persistent, won't 
change.


2. apicid <-> nodeid mapping is setup using info in 1. The mapping is 
setup at boot
time and CPU hotadd time, and cleared at CPU hotremove time. This 
mapping is also

persistent.

3. cpuid <-> apicid mapping is setup at boot time and CPU hotadd 
time. cpuid is
allocated, lower ids first, and released at CPU hotremove time, 
reused for other

hotadded CPUs. So this mapping is not persistent.

4. cpuid <-> nodeid mapping is also setup at boot time and CPU hotadd 
time, and
cleared at CPU hotremove time. As a result of 3, this mapping is 
not persistent.


To fix this problem, we establish cpuid <-> nodeid mapping for all 
the possible
cpus at boot time, and make it persistent. And according to 
init_cpu_to_node(),
cpuid <-> nodeid mapping is based on apicid <-> nodeid mapping and 
cpuid <-> apicid

mapping. So the key point is obtaining all cpus' apicid.

apicid can be obtained by _MAT (Multiple APIC Table Entry) method or 
found in
MADT (Multiple APIC Description Table). So we finish the job in the 
following steps:


1. Enable apic registeration flow to handle both enabled and disabled 
cpus.
This is done by introducing an extra parameter to 
generic_processor_info to let the

caller control if disabled cpus are ignored.

2. Introduce a new array storing all possible cpuid <-> apicid 
mapping. And also modify
the way cpuid is calculated. Establish all possible cpuid <-> 
apicid mapping when

registering local apic. Store the mapping in this array.

3. Enable _MAT and MADT relative apis to return non-presnet or 
disabled cpus' apicid.
This is also done by introducing an extra parameter to these apis 
to let the caller

control if disabled cpus are ignored.

4. Establish all possible cpuid <-> nodeid mapping.
This is done via an additional acpi namespace walk for processors.


For previous discussion, please refer to:
https://lkml.org/lkml/2015/2/27/145

Re: [RFC] theoretical race between memory hotplug and pfn iterator

2015-12-21 Thread Zhu Guihua


On 12/21/2015 03:17 PM, Joonsoo Kim wrote:

On Mon, Dec 21, 2015 at 03:00:08PM +0800, Zhu Guihua wrote:

On 12/21/2015 11:15 AM, Joonsoo Kim wrote:

Hello, memory-hotplug folks.

I found theoretical problems between memory hotplug and pfn iterator.
For example, pfn iterator works something like below.

for (pfn = zone_start_pfn; pfn < zone_end_pfn; pfn++) {
 if (!pfn_valid(pfn))
 continue;

 page = pfn_to_page(pfn);
 /* Do whatever we want */
}

Sequence of hotplug is something like below.

1) add memmap (after then, pfn_valid will return valid)
2) memmap_init_zone()

So, if pfn iterator runs between 1) and 2), it could access
uninitialized page information.

This problem could be solved by re-ordering initialization steps.

Hot-remove also has a problem. If memory is hot-removed after
pfn_valid() succeed in pfn iterator, access to page would cause NULL
deference because hot-remove frees corresponding memmap. There is no
guard against free in any pfn iterators.

This problem can be solved by inserting get_online_mems() in all pfn
iterators but this looks error-prone for future usage. Another idea is
that delaying free corresponding memmap until synchronization point such
as system suspend. It will guarantee that there is no running pfn
iterator. Do any have a better idea?

Btw, I tried to memory-hotremove with QEMU 2.5.5 but it didn't work. I
followed sequences in doc/memory-hotplug. Do you have any comment on this?

I tried memory hot remove with qemu 2.5.5 and RHEL 7, it works well.
Maybe you can provide more details, such as guest version, err log.

I'm testing with qemu 2.5.5 and linux-next-20151209 with reverting
following two patches.

"mm/memblock.c: use memblock_insert_region() for the empty array"
"mm-memblock-use-memblock_insert_region-for-the-empty-array-checkpatch-fixes"

When I type "device_del dimm1" in qemu monitor, there is no err log in
kernel and it looks like command has no effect. I inserted log to
acpi_memory_device_remove() but there is no message, too. Is there
another way to check that device_del event is actually transmitted to kernel?


You can use udev to monitor memory device remove event. (udevadm monitor)



I launch the qemu with following command.
./qemu-system-x86_64-recent -enable-kvm -smp 8 -m 4096,slots=16,maxmem=8G ...

Thanks.


.





--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] theoretical race between memory hotplug and pfn iterator

2015-12-21 Thread Zhu Guihua


On 12/21/2015 03:17 PM, Joonsoo Kim wrote:

On Mon, Dec 21, 2015 at 03:00:08PM +0800, Zhu Guihua wrote:

On 12/21/2015 11:15 AM, Joonsoo Kim wrote:

Hello, memory-hotplug folks.

I found theoretical problems between memory hotplug and pfn iterator.
For example, pfn iterator works something like below.

for (pfn = zone_start_pfn; pfn < zone_end_pfn; pfn++) {
 if (!pfn_valid(pfn))
 continue;

 page = pfn_to_page(pfn);
 /* Do whatever we want */
}

Sequence of hotplug is something like below.

1) add memmap (after then, pfn_valid will return valid)
2) memmap_init_zone()

So, if pfn iterator runs between 1) and 2), it could access
uninitialized page information.

This problem could be solved by re-ordering initialization steps.

Hot-remove also has a problem. If memory is hot-removed after
pfn_valid() succeed in pfn iterator, access to page would cause NULL
deference because hot-remove frees corresponding memmap. There is no
guard against free in any pfn iterators.

This problem can be solved by inserting get_online_mems() in all pfn
iterators but this looks error-prone for future usage. Another idea is
that delaying free corresponding memmap until synchronization point such
as system suspend. It will guarantee that there is no running pfn
iterator. Do any have a better idea?

Btw, I tried to memory-hotremove with QEMU 2.5.5 but it didn't work. I
followed sequences in doc/memory-hotplug. Do you have any comment on this?

I tried memory hot remove with qemu 2.5.5 and RHEL 7, it works well.
Maybe you can provide more details, such as guest version, err log.

I'm testing with qemu 2.5.5 and linux-next-20151209 with reverting
following two patches.

"mm/memblock.c: use memblock_insert_region() for the empty array"
"mm-memblock-use-memblock_insert_region-for-the-empty-array-checkpatch-fixes"

When I type "device_del dimm1" in qemu monitor, there is no err log in
kernel and it looks like command has no effect. I inserted log to
acpi_memory_device_remove() but there is no message, too. Is there
another way to check that device_del event is actually transmitted to kernel?


You can use udev to monitor memory device remove event. (udevadm monitor)



I launch the qemu with following command.
./qemu-system-x86_64-recent -enable-kvm -smp 8 -m 4096,slots=16,maxmem=8G ...

Thanks.


.





--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] theoretical race between memory hotplug and pfn iterator

2015-12-20 Thread Zhu Guihua


On 12/21/2015 11:15 AM, Joonsoo Kim wrote:

Hello, memory-hotplug folks.

I found theoretical problems between memory hotplug and pfn iterator.
For example, pfn iterator works something like below.

for (pfn = zone_start_pfn; pfn < zone_end_pfn; pfn++) {
 if (!pfn_valid(pfn))
 continue;

 page = pfn_to_page(pfn);
 /* Do whatever we want */
}

Sequence of hotplug is something like below.

1) add memmap (after then, pfn_valid will return valid)
2) memmap_init_zone()

So, if pfn iterator runs between 1) and 2), it could access
uninitialized page information.

This problem could be solved by re-ordering initialization steps.

Hot-remove also has a problem. If memory is hot-removed after
pfn_valid() succeed in pfn iterator, access to page would cause NULL
deference because hot-remove frees corresponding memmap. There is no
guard against free in any pfn iterators.

This problem can be solved by inserting get_online_mems() in all pfn
iterators but this looks error-prone for future usage. Another idea is
that delaying free corresponding memmap until synchronization point such
as system suspend. It will guarantee that there is no running pfn
iterator. Do any have a better idea?

Btw, I tried to memory-hotremove with QEMU 2.5.5 but it didn't work. I
followed sequences in doc/memory-hotplug. Do you have any comment on this?


I tried memory hot remove with qemu 2.5.5 and RHEL 7, it works well.
Maybe you can provide more details, such as guest version, err log.

Thanks,
Zhu



Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


.





--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] theoretical race between memory hotplug and pfn iterator

2015-12-20 Thread Zhu Guihua


On 12/21/2015 11:15 AM, Joonsoo Kim wrote:

Hello, memory-hotplug folks.

I found theoretical problems between memory hotplug and pfn iterator.
For example, pfn iterator works something like below.

for (pfn = zone_start_pfn; pfn < zone_end_pfn; pfn++) {
 if (!pfn_valid(pfn))
 continue;

 page = pfn_to_page(pfn);
 /* Do whatever we want */
}

Sequence of hotplug is something like below.

1) add memmap (after then, pfn_valid will return valid)
2) memmap_init_zone()

So, if pfn iterator runs between 1) and 2), it could access
uninitialized page information.

This problem could be solved by re-ordering initialization steps.

Hot-remove also has a problem. If memory is hot-removed after
pfn_valid() succeed in pfn iterator, access to page would cause NULL
deference because hot-remove frees corresponding memmap. There is no
guard against free in any pfn iterators.

This problem can be solved by inserting get_online_mems() in all pfn
iterators but this looks error-prone for future usage. Another idea is
that delaying free corresponding memmap until synchronization point such
as system suspend. It will guarantee that there is no running pfn
iterator. Do any have a better idea?

Btw, I tried to memory-hotremove with QEMU 2.5.5 but it didn't work. I
followed sequences in doc/memory-hotplug. Do you have any comment on this?


I tried memory hot remove with qemu 2.5.5 and RHEL 7, it works well.
Maybe you can provide more details, such as guest version, err log.

Thanks,
Zhu



Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


.





--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[tip:x86/urgent] x86/espfix: Init espfix on the boot CPU side

2015-07-06 Thread tip-bot for Zhu Guihua
Commit-ID:  20d5e4a9cd453991e2590a4c25230a99b42dee0c
Gitweb: http://git.kernel.org/tip/20d5e4a9cd453991e2590a4c25230a99b42dee0c
Author: Zhu Guihua 
AuthorDate: Fri, 3 Jul 2015 17:37:19 +0800
Committer:  Ingo Molnar 
CommitDate: Mon, 6 Jul 2015 15:00:34 +0200

x86/espfix: Init espfix on the boot CPU side

As we alloc pages with GFP_KERNEL in init_espfix_ap() which is
called before we enable local irqs, so the lockdep sub-system
would (correctly) trigger a warning about the potentially
blocking API.

So we allocate them on the boot CPU side when the secondary CPU is
brought up by the boot CPU, and hand them over to the secondary
CPU.

And we use alloc_pages_node() with the secondary CPU's node, to
make sure the espfix stack is NUMA-local to the CPU that is
going to use it.

Signed-off-by: Zhu Guihua 
Cc: 
Cc: 
Cc: 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: 
http://lkml.kernel.org/r/c97add2670e9abebb90095369f0cfc172373ac94.1435824469.git.zhugh.f...@cn.fujitsu.com
Signed-off-by: Ingo Molnar 
---
 arch/x86/kernel/espfix_64.c | 21 +
 arch/x86/kernel/smpboot.c   | 14 +++---
 2 files changed, 20 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c
index 1fb0e00..ce95676 100644
--- a/arch/x86/kernel/espfix_64.c
+++ b/arch/x86/kernel/espfix_64.c
@@ -141,12 +141,12 @@ void init_espfix_ap(int cpu)
pud_t pud, *pud_p;
pmd_t pmd, *pmd_p;
pte_t pte, *pte_p;
-   int n;
+   int n, node;
void *stack_page;
pteval_t ptemask;
 
/* We only have to do this once... */
-   if (likely(this_cpu_read(espfix_stack)))
+   if (likely(per_cpu(espfix_stack, cpu)))
return; /* Already initialized */
 
addr = espfix_base_addr(cpu);
@@ -164,12 +164,15 @@ void init_espfix_ap(int cpu)
if (stack_page)
goto unlock_done;
 
+   node = cpu_to_node(cpu);
ptemask = __supported_pte_mask;
 
pud_p = _pud_page[pud_index(addr)];
pud = *pud_p;
if (!pud_present(pud)) {
-   pmd_p = (pmd_t *)__get_free_page(PGALLOC_GFP);
+   struct page *page = alloc_pages_node(node, PGALLOC_GFP, 0);
+
+   pmd_p = (pmd_t *)page_address(page);
pud = __pud(__pa(pmd_p) | (PGTABLE_PROT & ptemask));
paravirt_alloc_pmd(_mm, __pa(pmd_p) >> PAGE_SHIFT);
for (n = 0; n < ESPFIX_PUD_CLONES; n++)
@@ -179,7 +182,9 @@ void init_espfix_ap(int cpu)
pmd_p = pmd_offset(, addr);
pmd = *pmd_p;
if (!pmd_present(pmd)) {
-   pte_p = (pte_t *)__get_free_page(PGALLOC_GFP);
+   struct page *page = alloc_pages_node(node, PGALLOC_GFP, 0);
+
+   pte_p = (pte_t *)page_address(page);
pmd = __pmd(__pa(pte_p) | (PGTABLE_PROT & ptemask));
paravirt_alloc_pte(_mm, __pa(pte_p) >> PAGE_SHIFT);
for (n = 0; n < ESPFIX_PMD_CLONES; n++)
@@ -187,7 +192,7 @@ void init_espfix_ap(int cpu)
}
 
pte_p = pte_offset_kernel(, addr);
-   stack_page = (void *)__get_free_page(GFP_KERNEL);
+   stack_page = page_address(alloc_pages_node(node, GFP_KERNEL, 0));
pte = __pte(__pa(stack_page) | (__PAGE_KERNEL_RO & ptemask));
for (n = 0; n < ESPFIX_PTE_CLONES; n++)
set_pte(_p[n*PTE_STRIDE], pte);
@@ -198,7 +203,7 @@ void init_espfix_ap(int cpu)
 unlock_done:
mutex_unlock(_init_mutex);
 done:
-   this_cpu_write(espfix_stack, addr);
-   this_cpu_write(espfix_waddr, (unsigned long)stack_page
-  + (addr & ~PAGE_MASK));
+   per_cpu(espfix_stack, cpu) = addr;
+   per_cpu(espfix_waddr, cpu) = (unsigned long)stack_page
+ + (addr & ~PAGE_MASK);
 }
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 6f5abac..0bd8c1d 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -239,13 +239,6 @@ static void notrace start_secondary(void *unused)
check_tsc_sync_target();
 
/*
-* Enable the espfix hack for this CPU
-*/
-#ifdef CONFIG_X86_ESPFIX64
-   init_espfix_ap(smp_processor_id());
-#endif
-
-   /*
 * We need to hold vector_lock so there the set of online cpus
 * does not change while we are assigning vectors to cpus.  Holding
 * this lock ensures we don't half assign or remove an irq from a cpu.
@@ -854,6 +847,13 @@ static int do_boot_cpu(int apicid, int cpu, struct 
task_struct *idle)
initial_code = (unsigned long)start_secondary;
stack_start  = idle->thread.sp;
 
+   /*
+* Enable the espfix hack for this CPU
+   */
+#ifdef CONFIG_X86_ESPFIX64
+   init_espfix_ap(cpu);
+#endif
+
/* So we see what's up */
announce_cpu(cpu, apicid);
 
--
To unsubscribe from this list: send the

[tip:x86/urgent] x86/espfix: Add 'cpu' parameter to init_espfix_ap()

2015-07-06 Thread tip-bot for Zhu Guihua
Commit-ID:  1db875631f8d5bbf497f67e47f254eece888d51d
Gitweb: http://git.kernel.org/tip/1db875631f8d5bbf497f67e47f254eece888d51d
Author: Zhu Guihua 
AuthorDate: Fri, 3 Jul 2015 17:37:18 +0800
Committer:  Ingo Molnar 
CommitDate: Mon, 6 Jul 2015 15:00:33 +0200

x86/espfix: Add 'cpu' parameter to init_espfix_ap()

Add a CPU index parameter to init_espfix_ap(), so that the
parameter could be propagated to the function for espfix
page allocation.

Signed-off-by: Zhu Guihua 
Cc: 
Cc: 
Cc: 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: 
http://lkml.kernel.org/r/cde3fcf1b3211f3f03feb1a995bce3fee850f0fc.1435824469.git.zhugh.f...@cn.fujitsu.com
Signed-off-by: Ingo Molnar 
---
 arch/x86/include/asm/espfix.h | 2 +-
 arch/x86/kernel/espfix_64.c   | 7 +++
 arch/x86/kernel/smpboot.c | 2 +-
 3 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/espfix.h b/arch/x86/include/asm/espfix.h
index 99efebb..ca3ce9a 100644
--- a/arch/x86/include/asm/espfix.h
+++ b/arch/x86/include/asm/espfix.h
@@ -9,7 +9,7 @@ DECLARE_PER_CPU_READ_MOSTLY(unsigned long, espfix_stack);
 DECLARE_PER_CPU_READ_MOSTLY(unsigned long, espfix_waddr);
 
 extern void init_espfix_bsp(void);
-extern void init_espfix_ap(void);
+extern void init_espfix_ap(int cpu);
 
 #endif /* CONFIG_X86_64 */
 
diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c
index f5d0730..1fb0e00 100644
--- a/arch/x86/kernel/espfix_64.c
+++ b/arch/x86/kernel/espfix_64.c
@@ -131,12 +131,12 @@ void __init init_espfix_bsp(void)
init_espfix_random();
 
/* The rest is the same as for any other processor */
-   init_espfix_ap();
+   init_espfix_ap(0);
 }
 
-void init_espfix_ap(void)
+void init_espfix_ap(int cpu)
 {
-   unsigned int cpu, page;
+   unsigned int page;
unsigned long addr;
pud_t pud, *pud_p;
pmd_t pmd, *pmd_p;
@@ -149,7 +149,6 @@ void init_espfix_ap(void)
if (likely(this_cpu_read(espfix_stack)))
return; /* Already initialized */
 
-   cpu = smp_processor_id();
addr = espfix_base_addr(cpu);
page = cpu/ESPFIX_STACKS_PER_PAGE;
 
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 8add66b..6f5abac 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -242,7 +242,7 @@ static void notrace start_secondary(void *unused)
 * Enable the espfix hack for this CPU
 */
 #ifdef CONFIG_X86_ESPFIX64
-   init_espfix_ap();
+   init_espfix_ap(smp_processor_id());
 #endif
 
/*
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[tip:x86/urgent] x86/espfix: Init espfix on the boot CPU side

2015-07-06 Thread tip-bot for Zhu Guihua
Commit-ID:  20d5e4a9cd453991e2590a4c25230a99b42dee0c
Gitweb: http://git.kernel.org/tip/20d5e4a9cd453991e2590a4c25230a99b42dee0c
Author: Zhu Guihua zhugh.f...@cn.fujitsu.com
AuthorDate: Fri, 3 Jul 2015 17:37:19 +0800
Committer:  Ingo Molnar mi...@kernel.org
CommitDate: Mon, 6 Jul 2015 15:00:34 +0200

x86/espfix: Init espfix on the boot CPU side

As we alloc pages with GFP_KERNEL in init_espfix_ap() which is
called before we enable local irqs, so the lockdep sub-system
would (correctly) trigger a warning about the potentially
blocking API.

So we allocate them on the boot CPU side when the secondary CPU is
brought up by the boot CPU, and hand them over to the secondary
CPU.

And we use alloc_pages_node() with the secondary CPU's node, to
make sure the espfix stack is NUMA-local to the CPU that is
going to use it.

Signed-off-by: Zhu Guihua zhugh.f...@cn.fujitsu.com
Cc: b...@alien8.de
Cc: l...@amacapital.net
Cc: l...@kernel.org
Cc: Linus Torvalds torva...@linux-foundation.org
Cc: Peter Zijlstra pet...@infradead.org
Cc: Thomas Gleixner t...@linutronix.de
Link: 
http://lkml.kernel.org/r/c97add2670e9abebb90095369f0cfc172373ac94.1435824469.git.zhugh.f...@cn.fujitsu.com
Signed-off-by: Ingo Molnar mi...@kernel.org
---
 arch/x86/kernel/espfix_64.c | 21 +
 arch/x86/kernel/smpboot.c   | 14 +++---
 2 files changed, 20 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c
index 1fb0e00..ce95676 100644
--- a/arch/x86/kernel/espfix_64.c
+++ b/arch/x86/kernel/espfix_64.c
@@ -141,12 +141,12 @@ void init_espfix_ap(int cpu)
pud_t pud, *pud_p;
pmd_t pmd, *pmd_p;
pte_t pte, *pte_p;
-   int n;
+   int n, node;
void *stack_page;
pteval_t ptemask;
 
/* We only have to do this once... */
-   if (likely(this_cpu_read(espfix_stack)))
+   if (likely(per_cpu(espfix_stack, cpu)))
return; /* Already initialized */
 
addr = espfix_base_addr(cpu);
@@ -164,12 +164,15 @@ void init_espfix_ap(int cpu)
if (stack_page)
goto unlock_done;
 
+   node = cpu_to_node(cpu);
ptemask = __supported_pte_mask;
 
pud_p = espfix_pud_page[pud_index(addr)];
pud = *pud_p;
if (!pud_present(pud)) {
-   pmd_p = (pmd_t *)__get_free_page(PGALLOC_GFP);
+   struct page *page = alloc_pages_node(node, PGALLOC_GFP, 0);
+
+   pmd_p = (pmd_t *)page_address(page);
pud = __pud(__pa(pmd_p) | (PGTABLE_PROT  ptemask));
paravirt_alloc_pmd(init_mm, __pa(pmd_p)  PAGE_SHIFT);
for (n = 0; n  ESPFIX_PUD_CLONES; n++)
@@ -179,7 +182,9 @@ void init_espfix_ap(int cpu)
pmd_p = pmd_offset(pud, addr);
pmd = *pmd_p;
if (!pmd_present(pmd)) {
-   pte_p = (pte_t *)__get_free_page(PGALLOC_GFP);
+   struct page *page = alloc_pages_node(node, PGALLOC_GFP, 0);
+
+   pte_p = (pte_t *)page_address(page);
pmd = __pmd(__pa(pte_p) | (PGTABLE_PROT  ptemask));
paravirt_alloc_pte(init_mm, __pa(pte_p)  PAGE_SHIFT);
for (n = 0; n  ESPFIX_PMD_CLONES; n++)
@@ -187,7 +192,7 @@ void init_espfix_ap(int cpu)
}
 
pte_p = pte_offset_kernel(pmd, addr);
-   stack_page = (void *)__get_free_page(GFP_KERNEL);
+   stack_page = page_address(alloc_pages_node(node, GFP_KERNEL, 0));
pte = __pte(__pa(stack_page) | (__PAGE_KERNEL_RO  ptemask));
for (n = 0; n  ESPFIX_PTE_CLONES; n++)
set_pte(pte_p[n*PTE_STRIDE], pte);
@@ -198,7 +203,7 @@ void init_espfix_ap(int cpu)
 unlock_done:
mutex_unlock(espfix_init_mutex);
 done:
-   this_cpu_write(espfix_stack, addr);
-   this_cpu_write(espfix_waddr, (unsigned long)stack_page
-  + (addr  ~PAGE_MASK));
+   per_cpu(espfix_stack, cpu) = addr;
+   per_cpu(espfix_waddr, cpu) = (unsigned long)stack_page
+ + (addr  ~PAGE_MASK);
 }
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 6f5abac..0bd8c1d 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -239,13 +239,6 @@ static void notrace start_secondary(void *unused)
check_tsc_sync_target();
 
/*
-* Enable the espfix hack for this CPU
-*/
-#ifdef CONFIG_X86_ESPFIX64
-   init_espfix_ap(smp_processor_id());
-#endif
-
-   /*
 * We need to hold vector_lock so there the set of online cpus
 * does not change while we are assigning vectors to cpus.  Holding
 * this lock ensures we don't half assign or remove an irq from a cpu.
@@ -854,6 +847,13 @@ static int do_boot_cpu(int apicid, int cpu, struct 
task_struct *idle)
initial_code = (unsigned long)start_secondary;
stack_start  = idle-thread.sp;
 
+   /*
+* Enable the espfix hack for this CPU
+   */
+#ifdef

[tip:x86/urgent] x86/espfix: Add 'cpu' parameter to init_espfix_ap()

2015-07-06 Thread tip-bot for Zhu Guihua
Commit-ID:  1db875631f8d5bbf497f67e47f254eece888d51d
Gitweb: http://git.kernel.org/tip/1db875631f8d5bbf497f67e47f254eece888d51d
Author: Zhu Guihua zhugh.f...@cn.fujitsu.com
AuthorDate: Fri, 3 Jul 2015 17:37:18 +0800
Committer:  Ingo Molnar mi...@kernel.org
CommitDate: Mon, 6 Jul 2015 15:00:33 +0200

x86/espfix: Add 'cpu' parameter to init_espfix_ap()

Add a CPU index parameter to init_espfix_ap(), so that the
parameter could be propagated to the function for espfix
page allocation.

Signed-off-by: Zhu Guihua zhugh.f...@cn.fujitsu.com
Cc: b...@alien8.de
Cc: l...@amacapital.net
Cc: l...@kernel.org
Cc: Linus Torvalds torva...@linux-foundation.org
Cc: Peter Zijlstra pet...@infradead.org
Cc: Thomas Gleixner t...@linutronix.de
Link: 
http://lkml.kernel.org/r/cde3fcf1b3211f3f03feb1a995bce3fee850f0fc.1435824469.git.zhugh.f...@cn.fujitsu.com
Signed-off-by: Ingo Molnar mi...@kernel.org
---
 arch/x86/include/asm/espfix.h | 2 +-
 arch/x86/kernel/espfix_64.c   | 7 +++
 arch/x86/kernel/smpboot.c | 2 +-
 3 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/espfix.h b/arch/x86/include/asm/espfix.h
index 99efebb..ca3ce9a 100644
--- a/arch/x86/include/asm/espfix.h
+++ b/arch/x86/include/asm/espfix.h
@@ -9,7 +9,7 @@ DECLARE_PER_CPU_READ_MOSTLY(unsigned long, espfix_stack);
 DECLARE_PER_CPU_READ_MOSTLY(unsigned long, espfix_waddr);
 
 extern void init_espfix_bsp(void);
-extern void init_espfix_ap(void);
+extern void init_espfix_ap(int cpu);
 
 #endif /* CONFIG_X86_64 */
 
diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c
index f5d0730..1fb0e00 100644
--- a/arch/x86/kernel/espfix_64.c
+++ b/arch/x86/kernel/espfix_64.c
@@ -131,12 +131,12 @@ void __init init_espfix_bsp(void)
init_espfix_random();
 
/* The rest is the same as for any other processor */
-   init_espfix_ap();
+   init_espfix_ap(0);
 }
 
-void init_espfix_ap(void)
+void init_espfix_ap(int cpu)
 {
-   unsigned int cpu, page;
+   unsigned int page;
unsigned long addr;
pud_t pud, *pud_p;
pmd_t pmd, *pmd_p;
@@ -149,7 +149,6 @@ void init_espfix_ap(void)
if (likely(this_cpu_read(espfix_stack)))
return; /* Already initialized */
 
-   cpu = smp_processor_id();
addr = espfix_base_addr(cpu);
page = cpu/ESPFIX_STACKS_PER_PAGE;
 
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 8add66b..6f5abac 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -242,7 +242,7 @@ static void notrace start_secondary(void *unused)
 * Enable the espfix hack for this CPU
 */
 #ifdef CONFIG_X86_ESPFIX64
-   init_espfix_ap();
+   init_espfix_ap(smp_processor_id());
 #endif
 
/*
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v3 0/2] x86, espfix: silence warning on espfix initialization for ap

2015-07-03 Thread Zhu Guihua
The following lockdep warning occurrs when running with latest kernel:
[3.178000] [ cut here ]
[3.183000] WARNING: CPU: 128 PID: 0 at kernel/locking/lockdep.c:2755 
lockdep_trace_alloc+0xdd/0xe0()
[3.193000] DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags))
[3.199000] Modules linked in:

[3.203000] CPU: 128 PID: 0 Comm: swapper/128 Not tainted 4.1.0-rc3 #70
[3.221000]   2d6601fb3e6d4e4c 88086fd5fc38 
81773f0a
[3.23]   88086fd5fc90 88086fd5fc78 
8108c85a
[3.238000]  88086fd6 0092 88086fd6 
00d0
[3.246000] Call Trace:
[3.249000]  [] dump_stack+0x4c/0x65
[3.255000]  [] warn_slowpath_common+0x8a/0xc0
[3.261000]  [] warn_slowpath_fmt+0x55/0x70
[3.268000]  [] lockdep_trace_alloc+0xdd/0xe0
[3.274000]  [] __alloc_pages_nodemask+0xad/0xca0
[3.281000]  [] ? __lock_acquire+0xf6d/0x1560
[3.288000]  [] alloc_page_interleave+0x3a/0x90
[3.295000]  [] alloc_pages_current+0x17d/0x1a0
[3.301000]  [] ? __get_free_pages+0xe/0x50
[3.308000]  [] __get_free_pages+0xe/0x50
[3.314000]  [] init_espfix_ap+0x17b/0x320
[3.32]  [] start_secondary+0xf1/0x1f0
[3.327000] ---[ end trace 1b3327d9d6a1d62c ]---

As we alloc pages with GFP_KERNEL in init_espfix_ap() which is called
before enabled local irq, and the lockdep sub-system considers this
behaviour as allocating memory with GFP_FS with local irq disabled,
then trigger the warning as mentioned about.

So we allocate them on the boot CPU side when the target CPU is bringing
up by the primary CPU, and hand them over to the secondary CPU.

v3:
 -split the original path into two parts

v2:
 -allocate espfix stack pages when the targert CPU is bringing up by the
  primary CPU
 -commit messages changed

v1:
 -Alloc the page on the node the target CPU is on.

Zhu Guihua (2):
  espfix: add cpu parameter to init_espfix_ap()
  x86, espfix: init espfix on the boot cpu side

 arch/x86/include/asm/espfix.h |  2 +-
 arch/x86/kernel/espfix_64.c   | 28 
 arch/x86/kernel/smpboot.c | 14 +++---
 3 files changed, 24 insertions(+), 20 deletions(-)

-- 
1.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v3 1/2] espfix: add cpu parameter to init_espfix_ap()

2015-07-03 Thread Zhu Guihua
Add cpu index parameter to init_espfix_ap(), so that the parameter
could be propagated to the function for pages allocation.

Signed-off-by: Zhu Guihua 
---
 arch/x86/include/asm/espfix.h | 2 +-
 arch/x86/kernel/espfix_64.c   | 7 +++
 arch/x86/kernel/smpboot.c | 2 +-
 3 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/espfix.h b/arch/x86/include/asm/espfix.h
index 99efebb..ca3ce9a 100644
--- a/arch/x86/include/asm/espfix.h
+++ b/arch/x86/include/asm/espfix.h
@@ -9,7 +9,7 @@ DECLARE_PER_CPU_READ_MOSTLY(unsigned long, espfix_stack);
 DECLARE_PER_CPU_READ_MOSTLY(unsigned long, espfix_waddr);
 
 extern void init_espfix_bsp(void);
-extern void init_espfix_ap(void);
+extern void init_espfix_ap(int cpu);
 
 #endif /* CONFIG_X86_64 */
 
diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c
index f5d0730..1fb0e00 100644
--- a/arch/x86/kernel/espfix_64.c
+++ b/arch/x86/kernel/espfix_64.c
@@ -131,12 +131,12 @@ void __init init_espfix_bsp(void)
init_espfix_random();
 
/* The rest is the same as for any other processor */
-   init_espfix_ap();
+   init_espfix_ap(0);
 }
 
-void init_espfix_ap(void)
+void init_espfix_ap(int cpu)
 {
-   unsigned int cpu, page;
+   unsigned int page;
unsigned long addr;
pud_t pud, *pud_p;
pmd_t pmd, *pmd_p;
@@ -149,7 +149,6 @@ void init_espfix_ap(void)
if (likely(this_cpu_read(espfix_stack)))
return; /* Already initialized */
 
-   cpu = smp_processor_id();
addr = espfix_base_addr(cpu);
page = cpu/ESPFIX_STACKS_PER_PAGE;
 
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 8add66b..6f5abac 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -242,7 +242,7 @@ static void notrace start_secondary(void *unused)
 * Enable the espfix hack for this CPU
 */
 #ifdef CONFIG_X86_ESPFIX64
-   init_espfix_ap();
+   init_espfix_ap(smp_processor_id());
 #endif
 
/*
-- 
1.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v3 2/2] x86, espfix: init espfix on the boot cpu side

2015-07-03 Thread Zhu Guihua
As we alloc pages with GFP_KERNEL in init_espfix_ap() which is called
before enabled local irq, the lockdep sub-system would trigger a warning.

So we allocate them on the boot CPU side when the target CPU is bringing
up by the primary CPU, and hand them over to the secondary CPU.

And we use alloc_pages_node() with the secondary CPU's node, to make sure
the espfix stack is NUMA-local to the CPU that is going to use it.

Signed-off-by: Zhu Guihua 
---
 arch/x86/kernel/espfix_64.c | 21 +
 arch/x86/kernel/smpboot.c   | 14 +++---
 2 files changed, 20 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c
index 1fb0e00..ce95676 100644
--- a/arch/x86/kernel/espfix_64.c
+++ b/arch/x86/kernel/espfix_64.c
@@ -141,12 +141,12 @@ void init_espfix_ap(int cpu)
pud_t pud, *pud_p;
pmd_t pmd, *pmd_p;
pte_t pte, *pte_p;
-   int n;
+   int n, node;
void *stack_page;
pteval_t ptemask;
 
/* We only have to do this once... */
-   if (likely(this_cpu_read(espfix_stack)))
+   if (likely(per_cpu(espfix_stack, cpu)))
return; /* Already initialized */
 
addr = espfix_base_addr(cpu);
@@ -164,12 +164,15 @@ void init_espfix_ap(int cpu)
if (stack_page)
goto unlock_done;
 
+   node = cpu_to_node(cpu);
ptemask = __supported_pte_mask;
 
pud_p = _pud_page[pud_index(addr)];
pud = *pud_p;
if (!pud_present(pud)) {
-   pmd_p = (pmd_t *)__get_free_page(PGALLOC_GFP);
+   struct page *page = alloc_pages_node(node, PGALLOC_GFP, 0);
+
+   pmd_p = (pmd_t *)page_address(page);
pud = __pud(__pa(pmd_p) | (PGTABLE_PROT & ptemask));
paravirt_alloc_pmd(_mm, __pa(pmd_p) >> PAGE_SHIFT);
for (n = 0; n < ESPFIX_PUD_CLONES; n++)
@@ -179,7 +182,9 @@ void init_espfix_ap(int cpu)
pmd_p = pmd_offset(, addr);
pmd = *pmd_p;
if (!pmd_present(pmd)) {
-   pte_p = (pte_t *)__get_free_page(PGALLOC_GFP);
+   struct page *page = alloc_pages_node(node, PGALLOC_GFP, 0);
+
+   pte_p = (pte_t *)page_address(page);
pmd = __pmd(__pa(pte_p) | (PGTABLE_PROT & ptemask));
paravirt_alloc_pte(_mm, __pa(pte_p) >> PAGE_SHIFT);
for (n = 0; n < ESPFIX_PMD_CLONES; n++)
@@ -187,7 +192,7 @@ void init_espfix_ap(int cpu)
}
 
pte_p = pte_offset_kernel(, addr);
-   stack_page = (void *)__get_free_page(GFP_KERNEL);
+   stack_page = page_address(alloc_pages_node(node, GFP_KERNEL, 0));
pte = __pte(__pa(stack_page) | (__PAGE_KERNEL_RO & ptemask));
for (n = 0; n < ESPFIX_PTE_CLONES; n++)
set_pte(_p[n*PTE_STRIDE], pte);
@@ -198,7 +203,7 @@ void init_espfix_ap(int cpu)
 unlock_done:
mutex_unlock(_init_mutex);
 done:
-   this_cpu_write(espfix_stack, addr);
-   this_cpu_write(espfix_waddr, (unsigned long)stack_page
-  + (addr & ~PAGE_MASK));
+   per_cpu(espfix_stack, cpu) = addr;
+   per_cpu(espfix_waddr, cpu) = (unsigned long)stack_page
+ + (addr & ~PAGE_MASK);
 }
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 6f5abac..0bd8c1d 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -239,13 +239,6 @@ static void notrace start_secondary(void *unused)
check_tsc_sync_target();
 
/*
-* Enable the espfix hack for this CPU
-*/
-#ifdef CONFIG_X86_ESPFIX64
-   init_espfix_ap(smp_processor_id());
-#endif
-
-   /*
 * We need to hold vector_lock so there the set of online cpus
 * does not change while we are assigning vectors to cpus.  Holding
 * this lock ensures we don't half assign or remove an irq from a cpu.
@@ -854,6 +847,13 @@ static int do_boot_cpu(int apicid, int cpu, struct 
task_struct *idle)
initial_code = (unsigned long)start_secondary;
stack_start  = idle->thread.sp;
 
+   /*
+* Enable the espfix hack for this CPU
+   */
+#ifdef CONFIG_X86_ESPFIX64
+   init_espfix_ap(cpu);
+#endif
+
/* So we see what's up */
announce_cpu(cpu, apicid);
 
-- 
1.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v3 0/2] x86, espfix: silence warning on espfix initialization for ap

2015-07-03 Thread Zhu Guihua
The following lockdep warning occurrs when running with latest kernel:
[3.178000] [ cut here ]
[3.183000] WARNING: CPU: 128 PID: 0 at kernel/locking/lockdep.c:2755 
lockdep_trace_alloc+0xdd/0xe0()
[3.193000] DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags))
[3.199000] Modules linked in:

[3.203000] CPU: 128 PID: 0 Comm: swapper/128 Not tainted 4.1.0-rc3 #70
[3.221000]   2d6601fb3e6d4e4c 88086fd5fc38 
81773f0a
[3.23]   88086fd5fc90 88086fd5fc78 
8108c85a
[3.238000]  88086fd6 0092 88086fd6 
00d0
[3.246000] Call Trace:
[3.249000]  [81773f0a] dump_stack+0x4c/0x65
[3.255000]  [8108c85a] warn_slowpath_common+0x8a/0xc0
[3.261000]  [8108c8e5] warn_slowpath_fmt+0x55/0x70
[3.268000]  [810ee24d] lockdep_trace_alloc+0xdd/0xe0
[3.274000]  [811cda0d] __alloc_pages_nodemask+0xad/0xca0
[3.281000]  [810ec7ad] ? __lock_acquire+0xf6d/0x1560
[3.288000]  [81219c8a] alloc_page_interleave+0x3a/0x90
[3.295000]  [8121b32d] alloc_pages_current+0x17d/0x1a0
[3.301000]  [811c869e] ? __get_free_pages+0xe/0x50
[3.308000]  [811c869e] __get_free_pages+0xe/0x50
[3.314000]  [8102640b] init_espfix_ap+0x17b/0x320
[3.32]  [8105c691] start_secondary+0xf1/0x1f0
[3.327000] ---[ end trace 1b3327d9d6a1d62c ]---

As we alloc pages with GFP_KERNEL in init_espfix_ap() which is called
before enabled local irq, and the lockdep sub-system considers this
behaviour as allocating memory with GFP_FS with local irq disabled,
then trigger the warning as mentioned about.

So we allocate them on the boot CPU side when the target CPU is bringing
up by the primary CPU, and hand them over to the secondary CPU.

v3:
 -split the original path into two parts

v2:
 -allocate espfix stack pages when the targert CPU is bringing up by the
  primary CPU
 -commit messages changed

v1:
 -Alloc the page on the node the target CPU is on.

Zhu Guihua (2):
  espfix: add cpu parameter to init_espfix_ap()
  x86, espfix: init espfix on the boot cpu side

 arch/x86/include/asm/espfix.h |  2 +-
 arch/x86/kernel/espfix_64.c   | 28 
 arch/x86/kernel/smpboot.c | 14 +++---
 3 files changed, 24 insertions(+), 20 deletions(-)

-- 
1.9.3

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v3 1/2] espfix: add cpu parameter to init_espfix_ap()

2015-07-03 Thread Zhu Guihua
Add cpu index parameter to init_espfix_ap(), so that the parameter
could be propagated to the function for pages allocation.

Signed-off-by: Zhu Guihua zhugh.f...@cn.fujitsu.com
---
 arch/x86/include/asm/espfix.h | 2 +-
 arch/x86/kernel/espfix_64.c   | 7 +++
 arch/x86/kernel/smpboot.c | 2 +-
 3 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/espfix.h b/arch/x86/include/asm/espfix.h
index 99efebb..ca3ce9a 100644
--- a/arch/x86/include/asm/espfix.h
+++ b/arch/x86/include/asm/espfix.h
@@ -9,7 +9,7 @@ DECLARE_PER_CPU_READ_MOSTLY(unsigned long, espfix_stack);
 DECLARE_PER_CPU_READ_MOSTLY(unsigned long, espfix_waddr);
 
 extern void init_espfix_bsp(void);
-extern void init_espfix_ap(void);
+extern void init_espfix_ap(int cpu);
 
 #endif /* CONFIG_X86_64 */
 
diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c
index f5d0730..1fb0e00 100644
--- a/arch/x86/kernel/espfix_64.c
+++ b/arch/x86/kernel/espfix_64.c
@@ -131,12 +131,12 @@ void __init init_espfix_bsp(void)
init_espfix_random();
 
/* The rest is the same as for any other processor */
-   init_espfix_ap();
+   init_espfix_ap(0);
 }
 
-void init_espfix_ap(void)
+void init_espfix_ap(int cpu)
 {
-   unsigned int cpu, page;
+   unsigned int page;
unsigned long addr;
pud_t pud, *pud_p;
pmd_t pmd, *pmd_p;
@@ -149,7 +149,6 @@ void init_espfix_ap(void)
if (likely(this_cpu_read(espfix_stack)))
return; /* Already initialized */
 
-   cpu = smp_processor_id();
addr = espfix_base_addr(cpu);
page = cpu/ESPFIX_STACKS_PER_PAGE;
 
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 8add66b..6f5abac 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -242,7 +242,7 @@ static void notrace start_secondary(void *unused)
 * Enable the espfix hack for this CPU
 */
 #ifdef CONFIG_X86_ESPFIX64
-   init_espfix_ap();
+   init_espfix_ap(smp_processor_id());
 #endif
 
/*
-- 
1.9.3

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v3 2/2] x86, espfix: init espfix on the boot cpu side

2015-07-03 Thread Zhu Guihua
As we alloc pages with GFP_KERNEL in init_espfix_ap() which is called
before enabled local irq, the lockdep sub-system would trigger a warning.

So we allocate them on the boot CPU side when the target CPU is bringing
up by the primary CPU, and hand them over to the secondary CPU.

And we use alloc_pages_node() with the secondary CPU's node, to make sure
the espfix stack is NUMA-local to the CPU that is going to use it.

Signed-off-by: Zhu Guihua zhugh.f...@cn.fujitsu.com
---
 arch/x86/kernel/espfix_64.c | 21 +
 arch/x86/kernel/smpboot.c   | 14 +++---
 2 files changed, 20 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c
index 1fb0e00..ce95676 100644
--- a/arch/x86/kernel/espfix_64.c
+++ b/arch/x86/kernel/espfix_64.c
@@ -141,12 +141,12 @@ void init_espfix_ap(int cpu)
pud_t pud, *pud_p;
pmd_t pmd, *pmd_p;
pte_t pte, *pte_p;
-   int n;
+   int n, node;
void *stack_page;
pteval_t ptemask;
 
/* We only have to do this once... */
-   if (likely(this_cpu_read(espfix_stack)))
+   if (likely(per_cpu(espfix_stack, cpu)))
return; /* Already initialized */
 
addr = espfix_base_addr(cpu);
@@ -164,12 +164,15 @@ void init_espfix_ap(int cpu)
if (stack_page)
goto unlock_done;
 
+   node = cpu_to_node(cpu);
ptemask = __supported_pte_mask;
 
pud_p = espfix_pud_page[pud_index(addr)];
pud = *pud_p;
if (!pud_present(pud)) {
-   pmd_p = (pmd_t *)__get_free_page(PGALLOC_GFP);
+   struct page *page = alloc_pages_node(node, PGALLOC_GFP, 0);
+
+   pmd_p = (pmd_t *)page_address(page);
pud = __pud(__pa(pmd_p) | (PGTABLE_PROT  ptemask));
paravirt_alloc_pmd(init_mm, __pa(pmd_p)  PAGE_SHIFT);
for (n = 0; n  ESPFIX_PUD_CLONES; n++)
@@ -179,7 +182,9 @@ void init_espfix_ap(int cpu)
pmd_p = pmd_offset(pud, addr);
pmd = *pmd_p;
if (!pmd_present(pmd)) {
-   pte_p = (pte_t *)__get_free_page(PGALLOC_GFP);
+   struct page *page = alloc_pages_node(node, PGALLOC_GFP, 0);
+
+   pte_p = (pte_t *)page_address(page);
pmd = __pmd(__pa(pte_p) | (PGTABLE_PROT  ptemask));
paravirt_alloc_pte(init_mm, __pa(pte_p)  PAGE_SHIFT);
for (n = 0; n  ESPFIX_PMD_CLONES; n++)
@@ -187,7 +192,7 @@ void init_espfix_ap(int cpu)
}
 
pte_p = pte_offset_kernel(pmd, addr);
-   stack_page = (void *)__get_free_page(GFP_KERNEL);
+   stack_page = page_address(alloc_pages_node(node, GFP_KERNEL, 0));
pte = __pte(__pa(stack_page) | (__PAGE_KERNEL_RO  ptemask));
for (n = 0; n  ESPFIX_PTE_CLONES; n++)
set_pte(pte_p[n*PTE_STRIDE], pte);
@@ -198,7 +203,7 @@ void init_espfix_ap(int cpu)
 unlock_done:
mutex_unlock(espfix_init_mutex);
 done:
-   this_cpu_write(espfix_stack, addr);
-   this_cpu_write(espfix_waddr, (unsigned long)stack_page
-  + (addr  ~PAGE_MASK));
+   per_cpu(espfix_stack, cpu) = addr;
+   per_cpu(espfix_waddr, cpu) = (unsigned long)stack_page
+ + (addr  ~PAGE_MASK);
 }
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 6f5abac..0bd8c1d 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -239,13 +239,6 @@ static void notrace start_secondary(void *unused)
check_tsc_sync_target();
 
/*
-* Enable the espfix hack for this CPU
-*/
-#ifdef CONFIG_X86_ESPFIX64
-   init_espfix_ap(smp_processor_id());
-#endif
-
-   /*
 * We need to hold vector_lock so there the set of online cpus
 * does not change while we are assigning vectors to cpus.  Holding
 * this lock ensures we don't half assign or remove an irq from a cpu.
@@ -854,6 +847,13 @@ static int do_boot_cpu(int apicid, int cpu, struct 
task_struct *idle)
initial_code = (unsigned long)start_secondary;
stack_start  = idle-thread.sp;
 
+   /*
+* Enable the espfix hack for this CPU
+   */
+#ifdef CONFIG_X86_ESPFIX64
+   init_espfix_ap(cpu);
+#endif
+
/* So we see what's up */
announce_cpu(cpu, apicid);
 
-- 
1.9.3

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v2] x86, espfix: init espfix on the boot cpu side

2015-06-26 Thread Zhu Guihua
The following lockdep warning occurrs when running with latest kernel:
[3.178000] [ cut here ]
[3.183000] WARNING: CPU: 128 PID: 0 at kernel/locking/lockdep.c:2755 
lockdep_trace_alloc+0xdd/0xe0()
[3.193000] DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags))
[3.199000] Modules linked in:

[3.203000] CPU: 128 PID: 0 Comm: swapper/128 Not tainted 4.1.0-rc3 #70
[3.221000]   2d6601fb3e6d4e4c 88086fd5fc38 
81773f0a
[3.23]   88086fd5fc90 88086fd5fc78 
8108c85a
[3.238000]  88086fd6 0092 88086fd6 
00d0
[3.246000] Call Trace:
[3.249000]  [] dump_stack+0x4c/0x65
[3.255000]  [] warn_slowpath_common+0x8a/0xc0
[3.261000]  [] warn_slowpath_fmt+0x55/0x70
[3.268000]  [] lockdep_trace_alloc+0xdd/0xe0
[3.274000]  [] __alloc_pages_nodemask+0xad/0xca0
[3.281000]  [] ? __lock_acquire+0xf6d/0x1560
[3.288000]  [] alloc_page_interleave+0x3a/0x90
[3.295000]  [] alloc_pages_current+0x17d/0x1a0
[3.301000]  [] ? __get_free_pages+0xe/0x50
[3.308000]  [] __get_free_pages+0xe/0x50
[3.314000]  [] init_espfix_ap+0x17b/0x320
[3.32]  [] start_secondary+0xf1/0x1f0
[3.327000] ---[ end trace 1b3327d9d6a1d62c ]---

As we alloc pages with GFP_KERNEL in init_espfix_ap() which is called
before enabled local irq, and the lockdep sub-system considers this
behaviour as allocating memory with GFP_FS with local irq disabled,
then trigger the warning as mentioned about.

So we allocate them on the boot CPU side when the target CPU is bringing
up by the primary CPU, and hand them over to the secondary CPU.

Signed-off-by: Zhu Guihua 
---
v2:
 -allocate espfix stack pages when the targert CPU is bringing up by the
  primary CPU
 -commit messages changed
v1:
 -Alloc the page on the node the target CPU is on.
RFC v2:
 -Let the boot up routine init the espfix stack for the target cpu after it
  booted.
---
 arch/x86/include/asm/espfix.h |  2 +-
 arch/x86/kernel/espfix_64.c   | 28 
 arch/x86/kernel/smpboot.c | 14 +++---
 3 files changed, 24 insertions(+), 20 deletions(-)

diff --git a/arch/x86/include/asm/espfix.h b/arch/x86/include/asm/espfix.h
index 99efebb..ca3ce9a 100644
--- a/arch/x86/include/asm/espfix.h
+++ b/arch/x86/include/asm/espfix.h
@@ -9,7 +9,7 @@ DECLARE_PER_CPU_READ_MOSTLY(unsigned long, espfix_stack);
 DECLARE_PER_CPU_READ_MOSTLY(unsigned long, espfix_waddr);
 
 extern void init_espfix_bsp(void);
-extern void init_espfix_ap(void);
+extern void init_espfix_ap(int cpu);
 
 #endif /* CONFIG_X86_64 */
 
diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c
index f5d0730..f6dd9df 100644
--- a/arch/x86/kernel/espfix_64.c
+++ b/arch/x86/kernel/espfix_64.c
@@ -131,25 +131,24 @@ void __init init_espfix_bsp(void)
init_espfix_random();
 
/* The rest is the same as for any other processor */
-   init_espfix_ap();
+   init_espfix_ap(0);
 }
 
-void init_espfix_ap(void)
+void init_espfix_ap(int cpu)
 {
-   unsigned int cpu, page;
+   unsigned int page;
unsigned long addr;
pud_t pud, *pud_p;
pmd_t pmd, *pmd_p;
pte_t pte, *pte_p;
-   int n;
+   int n, node;
void *stack_page;
pteval_t ptemask;
 
/* We only have to do this once... */
-   if (likely(this_cpu_read(espfix_stack)))
+   if (likely(per_cpu(espfix_stack, cpu)))
return; /* Already initialized */
 
-   cpu = smp_processor_id();
addr = espfix_base_addr(cpu);
page = cpu/ESPFIX_STACKS_PER_PAGE;
 
@@ -165,12 +164,15 @@ void init_espfix_ap(void)
if (stack_page)
goto unlock_done;
 
+   node = cpu_to_node(cpu);
ptemask = __supported_pte_mask;
 
pud_p = _pud_page[pud_index(addr)];
pud = *pud_p;
if (!pud_present(pud)) {
-   pmd_p = (pmd_t *)__get_free_page(PGALLOC_GFP);
+   struct page *page = alloc_pages_node(node, PGALLOC_GFP, 0);
+
+   pmd_p = (pmd_t *)page_address(page);
pud = __pud(__pa(pmd_p) | (PGTABLE_PROT & ptemask));
paravirt_alloc_pmd(_mm, __pa(pmd_p) >> PAGE_SHIFT);
for (n = 0; n < ESPFIX_PUD_CLONES; n++)
@@ -180,7 +182,9 @@ void init_espfix_ap(void)
pmd_p = pmd_offset(, addr);
pmd = *pmd_p;
if (!pmd_present(pmd)) {
-   pte_p = (pte_t *)__get_free_page(PGALLOC_GFP);
+   struct page *page = alloc_pages_node(node, PGALLOC_GFP, 0);
+
+   pte_p = (pte_t *)page_address(page);
pmd = __pmd(__pa(pte_p) | (PGTABLE_PROT & ptemask));
paravirt_alloc_pte(_mm, __pa(pte_p) >> PAGE_SHIFT);
for (n = 0; n < ESPFIX_PMD_CLONES; n++)
@@ -188,7 +192,7 @@ void init_espfix_ap(void)
}
 
pte_p = pte_offset_kernel(, addr);

[PATCH v2] x86, espfix: init espfix on the boot cpu side

2015-06-26 Thread Zhu Guihua
The following lockdep warning occurrs when running with latest kernel:
[3.178000] [ cut here ]
[3.183000] WARNING: CPU: 128 PID: 0 at kernel/locking/lockdep.c:2755 
lockdep_trace_alloc+0xdd/0xe0()
[3.193000] DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags))
[3.199000] Modules linked in:

[3.203000] CPU: 128 PID: 0 Comm: swapper/128 Not tainted 4.1.0-rc3 #70
[3.221000]   2d6601fb3e6d4e4c 88086fd5fc38 
81773f0a
[3.23]   88086fd5fc90 88086fd5fc78 
8108c85a
[3.238000]  88086fd6 0092 88086fd6 
00d0
[3.246000] Call Trace:
[3.249000]  [81773f0a] dump_stack+0x4c/0x65
[3.255000]  [8108c85a] warn_slowpath_common+0x8a/0xc0
[3.261000]  [8108c8e5] warn_slowpath_fmt+0x55/0x70
[3.268000]  [810ee24d] lockdep_trace_alloc+0xdd/0xe0
[3.274000]  [811cda0d] __alloc_pages_nodemask+0xad/0xca0
[3.281000]  [810ec7ad] ? __lock_acquire+0xf6d/0x1560
[3.288000]  [81219c8a] alloc_page_interleave+0x3a/0x90
[3.295000]  [8121b32d] alloc_pages_current+0x17d/0x1a0
[3.301000]  [811c869e] ? __get_free_pages+0xe/0x50
[3.308000]  [811c869e] __get_free_pages+0xe/0x50
[3.314000]  [8102640b] init_espfix_ap+0x17b/0x320
[3.32]  [8105c691] start_secondary+0xf1/0x1f0
[3.327000] ---[ end trace 1b3327d9d6a1d62c ]---

As we alloc pages with GFP_KERNEL in init_espfix_ap() which is called
before enabled local irq, and the lockdep sub-system considers this
behaviour as allocating memory with GFP_FS with local irq disabled,
then trigger the warning as mentioned about.

So we allocate them on the boot CPU side when the target CPU is bringing
up by the primary CPU, and hand them over to the secondary CPU.

Signed-off-by: Zhu Guihua zhugh.f...@cn.fujitsu.com
---
v2:
 -allocate espfix stack pages when the targert CPU is bringing up by the
  primary CPU
 -commit messages changed
v1:
 -Alloc the page on the node the target CPU is on.
RFC v2:
 -Let the boot up routine init the espfix stack for the target cpu after it
  booted.
---
 arch/x86/include/asm/espfix.h |  2 +-
 arch/x86/kernel/espfix_64.c   | 28 
 arch/x86/kernel/smpboot.c | 14 +++---
 3 files changed, 24 insertions(+), 20 deletions(-)

diff --git a/arch/x86/include/asm/espfix.h b/arch/x86/include/asm/espfix.h
index 99efebb..ca3ce9a 100644
--- a/arch/x86/include/asm/espfix.h
+++ b/arch/x86/include/asm/espfix.h
@@ -9,7 +9,7 @@ DECLARE_PER_CPU_READ_MOSTLY(unsigned long, espfix_stack);
 DECLARE_PER_CPU_READ_MOSTLY(unsigned long, espfix_waddr);
 
 extern void init_espfix_bsp(void);
-extern void init_espfix_ap(void);
+extern void init_espfix_ap(int cpu);
 
 #endif /* CONFIG_X86_64 */
 
diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c
index f5d0730..f6dd9df 100644
--- a/arch/x86/kernel/espfix_64.c
+++ b/arch/x86/kernel/espfix_64.c
@@ -131,25 +131,24 @@ void __init init_espfix_bsp(void)
init_espfix_random();
 
/* The rest is the same as for any other processor */
-   init_espfix_ap();
+   init_espfix_ap(0);
 }
 
-void init_espfix_ap(void)
+void init_espfix_ap(int cpu)
 {
-   unsigned int cpu, page;
+   unsigned int page;
unsigned long addr;
pud_t pud, *pud_p;
pmd_t pmd, *pmd_p;
pte_t pte, *pte_p;
-   int n;
+   int n, node;
void *stack_page;
pteval_t ptemask;
 
/* We only have to do this once... */
-   if (likely(this_cpu_read(espfix_stack)))
+   if (likely(per_cpu(espfix_stack, cpu)))
return; /* Already initialized */
 
-   cpu = smp_processor_id();
addr = espfix_base_addr(cpu);
page = cpu/ESPFIX_STACKS_PER_PAGE;
 
@@ -165,12 +164,15 @@ void init_espfix_ap(void)
if (stack_page)
goto unlock_done;
 
+   node = cpu_to_node(cpu);
ptemask = __supported_pte_mask;
 
pud_p = espfix_pud_page[pud_index(addr)];
pud = *pud_p;
if (!pud_present(pud)) {
-   pmd_p = (pmd_t *)__get_free_page(PGALLOC_GFP);
+   struct page *page = alloc_pages_node(node, PGALLOC_GFP, 0);
+
+   pmd_p = (pmd_t *)page_address(page);
pud = __pud(__pa(pmd_p) | (PGTABLE_PROT  ptemask));
paravirt_alloc_pmd(init_mm, __pa(pmd_p)  PAGE_SHIFT);
for (n = 0; n  ESPFIX_PUD_CLONES; n++)
@@ -180,7 +182,9 @@ void init_espfix_ap(void)
pmd_p = pmd_offset(pud, addr);
pmd = *pmd_p;
if (!pmd_present(pmd)) {
-   pte_p = (pte_t *)__get_free_page(PGALLOC_GFP);
+   struct page *page = alloc_pages_node(node, PGALLOC_GFP, 0);
+
+   pte_p = (pte_t *)page_address(page);
pmd = __pmd(__pa(pte_p) | (PGTABLE_PROT  ptemask

Re: [PATCH V1] x86, espfix: postpone the initialization of espfix stack for AP

2015-06-16 Thread Zhu Guihua

Any feedback about this?

On 06/04/2015 05:45 PM, Gu Zheng wrote:

The following lockdep warning occurrs when running with latest kernel:
[3.178000] [ cut here ]
[3.183000] WARNING: CPU: 128 PID: 0 at kernel/locking/lockdep.c:2755 
lockdep_trace_alloc+0xdd/0xe0()
[3.193000] DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags))
[3.199000] Modules linked in:

[3.203000] CPU: 128 PID: 0 Comm: swapper/128 Not tainted 4.1.0-rc3 #70
[3.221000]   2d6601fb3e6d4e4c 88086fd5fc38 
81773f0a
[3.23]   88086fd5fc90 88086fd5fc78 
8108c85a
[3.238000]  88086fd6 0092 88086fd6 
00d0
[3.246000] Call Trace:
[3.249000]  [] dump_stack+0x4c/0x65
[3.255000]  [] warn_slowpath_common+0x8a/0xc0
[3.261000]  [] warn_slowpath_fmt+0x55/0x70
[3.268000]  [] lockdep_trace_alloc+0xdd/0xe0
[3.274000]  [] __alloc_pages_nodemask+0xad/0xca0
[3.281000]  [] ? __lock_acquire+0xf6d/0x1560
[3.288000]  [] alloc_page_interleave+0x3a/0x90
[3.295000]  [] alloc_pages_current+0x17d/0x1a0
[3.301000]  [] ? __get_free_pages+0xe/0x50
[3.308000]  [] __get_free_pages+0xe/0x50
[3.314000]  [] init_espfix_ap+0x17b/0x320
[3.32]  [] start_secondary+0xf1/0x1f0
[3.327000] ---[ end trace 1b3327d9d6a1d62c ]---

As we alloc pages with GFP_KERNEL in init_espfix_ap() which is called
before enabled local irq, and the lockdep sub-system considers this
behaviour as allocating memory with GFP_FS with local irq disabled,
then trigger the warning as mentioned about.

Though we could allocate them on the boot CPU side and hand them over to
the secondary CPU, but it seemes a bit waste if some of cpus are offline.
As thers is no need to these pages(espfix stack) until we try to run user
code, so we postpone the initialization of espfix stack, and let the boot
up routine init the espfix stack for the target cpu after it booted to
avoid the noise.

Signed-off-by: Gu Zheng 
---
v1:
   Alloc the page on the node the target CPU is on.
RFC:
   Let the boot up routine init the espfix stack for the target cpu after it
   booted.
---
---
  arch/x86/include/asm/espfix.h |2 +-
  arch/x86/kernel/espfix_64.c   |   28 
  arch/x86/kernel/smpboot.c |   14 +++---
  3 files changed, 24 insertions(+), 20 deletions(-)

diff --git a/arch/x86/include/asm/espfix.h b/arch/x86/include/asm/espfix.h
index 99efebb..ca3ce9a 100644
--- a/arch/x86/include/asm/espfix.h
+++ b/arch/x86/include/asm/espfix.h
@@ -9,7 +9,7 @@ DECLARE_PER_CPU_READ_MOSTLY(unsigned long, espfix_stack);
  DECLARE_PER_CPU_READ_MOSTLY(unsigned long, espfix_waddr);
  
  extern void init_espfix_bsp(void);

-extern void init_espfix_ap(void);
+extern void init_espfix_ap(int cpu);
  
  #endif /* CONFIG_X86_64 */
  
diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c

index f5d0730..e397583 100644
--- a/arch/x86/kernel/espfix_64.c
+++ b/arch/x86/kernel/espfix_64.c
@@ -131,25 +131,24 @@ void __init init_espfix_bsp(void)
init_espfix_random();
  
  	/* The rest is the same as for any other processor */

-   init_espfix_ap();
+   init_espfix_ap(0);
  }
  
-void init_espfix_ap(void)

+void init_espfix_ap(int cpu)
  {
-   unsigned int cpu, page;
+   unsigned int page;
unsigned long addr;
pud_t pud, *pud_p;
pmd_t pmd, *pmd_p;
pte_t pte, *pte_p;
-   int n;
+   int n, node;
void *stack_page;
pteval_t ptemask;
  
  	/* We only have to do this once... */

-   if (likely(this_cpu_read(espfix_stack)))
+   if (likely(per_cpu(espfix_stack, cpu)))
return; /* Already initialized */
  
-	cpu = smp_processor_id();

addr = espfix_base_addr(cpu);
page = cpu/ESPFIX_STACKS_PER_PAGE;
  
@@ -165,12 +164,15 @@ void init_espfix_ap(void)

if (stack_page)
goto unlock_done;
  
+	node = cpu_to_node(cpu);

ptemask = __supported_pte_mask;
  
  	pud_p = _pud_page[pud_index(addr)];

pud = *pud_p;
if (!pud_present(pud)) {
-   pmd_p = (pmd_t *)__get_free_page(PGALLOC_GFP);
+   struct page *page = alloc_pages_node(node, PGALLOC_GFP, 0);
+
+   pmd_p = (pmd_t *)page_address(page);
pud = __pud(__pa(pmd_p) | (PGTABLE_PROT & ptemask));
paravirt_alloc_pmd(_mm, __pa(pmd_p) >> PAGE_SHIFT);
for (n = 0; n < ESPFIX_PUD_CLONES; n++)
@@ -180,7 +182,9 @@ void init_espfix_ap(void)
pmd_p = pmd_offset(, addr);
pmd = *pmd_p;
if (!pmd_present(pmd)) {
-   pte_p = (pte_t *)__get_free_page(PGALLOC_GFP);
+   struct page *page = alloc_pages_node(node, PGALLOC_GFP, 0);
+
+   pte_p = (pte_t *)page_address(page);
pmd = __pmd(__pa(pte_p) | (PGTABLE_PROT & ptemask));
paravirt_alloc_pte(_mm, 

Re: [PATCH V1] x86, espfix: postpone the initialization of espfix stack for AP

2015-06-16 Thread Zhu Guihua

Any feedback about this?

On 06/04/2015 05:45 PM, Gu Zheng wrote:

The following lockdep warning occurrs when running with latest kernel:
[3.178000] [ cut here ]
[3.183000] WARNING: CPU: 128 PID: 0 at kernel/locking/lockdep.c:2755 
lockdep_trace_alloc+0xdd/0xe0()
[3.193000] DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags))
[3.199000] Modules linked in:

[3.203000] CPU: 128 PID: 0 Comm: swapper/128 Not tainted 4.1.0-rc3 #70
[3.221000]   2d6601fb3e6d4e4c 88086fd5fc38 
81773f0a
[3.23]   88086fd5fc90 88086fd5fc78 
8108c85a
[3.238000]  88086fd6 0092 88086fd6 
00d0
[3.246000] Call Trace:
[3.249000]  [81773f0a] dump_stack+0x4c/0x65
[3.255000]  [8108c85a] warn_slowpath_common+0x8a/0xc0
[3.261000]  [8108c8e5] warn_slowpath_fmt+0x55/0x70
[3.268000]  [810ee24d] lockdep_trace_alloc+0xdd/0xe0
[3.274000]  [811cda0d] __alloc_pages_nodemask+0xad/0xca0
[3.281000]  [810ec7ad] ? __lock_acquire+0xf6d/0x1560
[3.288000]  [81219c8a] alloc_page_interleave+0x3a/0x90
[3.295000]  [8121b32d] alloc_pages_current+0x17d/0x1a0
[3.301000]  [811c869e] ? __get_free_pages+0xe/0x50
[3.308000]  [811c869e] __get_free_pages+0xe/0x50
[3.314000]  [8102640b] init_espfix_ap+0x17b/0x320
[3.32]  [8105c691] start_secondary+0xf1/0x1f0
[3.327000] ---[ end trace 1b3327d9d6a1d62c ]---

As we alloc pages with GFP_KERNEL in init_espfix_ap() which is called
before enabled local irq, and the lockdep sub-system considers this
behaviour as allocating memory with GFP_FS with local irq disabled,
then trigger the warning as mentioned about.

Though we could allocate them on the boot CPU side and hand them over to
the secondary CPU, but it seemes a bit waste if some of cpus are offline.
As thers is no need to these pages(espfix stack) until we try to run user
code, so we postpone the initialization of espfix stack, and let the boot
up routine init the espfix stack for the target cpu after it booted to
avoid the noise.

Signed-off-by: Gu Zheng guz.f...@cn.fujitsu.com
---
v1:
   Alloc the page on the node the target CPU is on.
RFC:
   Let the boot up routine init the espfix stack for the target cpu after it
   booted.
---
---
  arch/x86/include/asm/espfix.h |2 +-
  arch/x86/kernel/espfix_64.c   |   28 
  arch/x86/kernel/smpboot.c |   14 +++---
  3 files changed, 24 insertions(+), 20 deletions(-)

diff --git a/arch/x86/include/asm/espfix.h b/arch/x86/include/asm/espfix.h
index 99efebb..ca3ce9a 100644
--- a/arch/x86/include/asm/espfix.h
+++ b/arch/x86/include/asm/espfix.h
@@ -9,7 +9,7 @@ DECLARE_PER_CPU_READ_MOSTLY(unsigned long, espfix_stack);
  DECLARE_PER_CPU_READ_MOSTLY(unsigned long, espfix_waddr);
  
  extern void init_espfix_bsp(void);

-extern void init_espfix_ap(void);
+extern void init_espfix_ap(int cpu);
  
  #endif /* CONFIG_X86_64 */
  
diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c

index f5d0730..e397583 100644
--- a/arch/x86/kernel/espfix_64.c
+++ b/arch/x86/kernel/espfix_64.c
@@ -131,25 +131,24 @@ void __init init_espfix_bsp(void)
init_espfix_random();
  
  	/* The rest is the same as for any other processor */

-   init_espfix_ap();
+   init_espfix_ap(0);
  }
  
-void init_espfix_ap(void)

+void init_espfix_ap(int cpu)
  {
-   unsigned int cpu, page;
+   unsigned int page;
unsigned long addr;
pud_t pud, *pud_p;
pmd_t pmd, *pmd_p;
pte_t pte, *pte_p;
-   int n;
+   int n, node;
void *stack_page;
pteval_t ptemask;
  
  	/* We only have to do this once... */

-   if (likely(this_cpu_read(espfix_stack)))
+   if (likely(per_cpu(espfix_stack, cpu)))
return; /* Already initialized */
  
-	cpu = smp_processor_id();

addr = espfix_base_addr(cpu);
page = cpu/ESPFIX_STACKS_PER_PAGE;
  
@@ -165,12 +164,15 @@ void init_espfix_ap(void)

if (stack_page)
goto unlock_done;
  
+	node = cpu_to_node(cpu);

ptemask = __supported_pte_mask;
  
  	pud_p = espfix_pud_page[pud_index(addr)];

pud = *pud_p;
if (!pud_present(pud)) {
-   pmd_p = (pmd_t *)__get_free_page(PGALLOC_GFP);
+   struct page *page = alloc_pages_node(node, PGALLOC_GFP, 0);
+
+   pmd_p = (pmd_t *)page_address(page);
pud = __pud(__pa(pmd_p) | (PGTABLE_PROT  ptemask));
paravirt_alloc_pmd(init_mm, __pa(pmd_p)  PAGE_SHIFT);
for (n = 0; n  ESPFIX_PUD_CLONES; n++)
@@ -180,7 +182,9 @@ void init_espfix_ap(void)
pmd_p = pmd_offset(pud, addr);
pmd = *pmd_p;
if (!pmd_present(pmd)) {
-   pte_p = (pte_t *)__get_free_page(PGALLOC_GFP);
+   

Re: [PATCH] mm/memory hotplug: print the last vmemmap region at the end of hot add memory

2015-06-11 Thread Zhu Guihua


On 06/10/2015 04:29 AM, Andrew Morton wrote:

On Tue, 9 Jun 2015 11:41:28 +0800 Zhu Guihua  wrote:


--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -513,6 +513,7 @@ int __ref __add_pages(int nid, struct zone *zone, unsigned 
long phys_start_pfn,
break;
err = 0;
}
+   vmemmap_populate_print_last();
   
   	return err;

   }

vmemmap_populate_print_last() is only available on x86_64, when
CONFIG_SPARSEMEM_VMEMMAP=y.  Are you sure this won't break builds?

I tried this on i386 and on x86_64 when CONFIG_SPARSEMEM_VMEMMAP=n ,
it builds ok.

With powerpc:

akpm3:/usr/src/25> make allmodconfig
akpm3:/usr/src/25> make mm/memory_hotplug.o
akpm3:/usr/src/25> nm mm/memory_hotplug.o | grep vmemmap_populate_print_last
U .vmemmap_populate_print_last
akpm3:/usr/src/25> grep -r vmemmap_populate_print_last arch/powerpc
akpm3:/usr/src/25>

So I think that's going to break.

I expect ia64 will break also, but I didn't investigate.
.



There is
void __weak __meminit vmemmap_populate_print last(void)
in /mm/sparse.c, so I think this won't break builds.

And I found the function was invoked in void __init sparse_init(void) 
without

CONFIG_SPARSEMEM_VMEMMAP=y.

I also tried this on arm, it builds ok too.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm/memory hotplug: print the last vmemmap region at the end of hot add memory

2015-06-11 Thread Zhu Guihua


On 06/10/2015 04:29 AM, Andrew Morton wrote:

On Tue, 9 Jun 2015 11:41:28 +0800 Zhu Guihua zhugh.f...@cn.fujitsu.com wrote:


--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -513,6 +513,7 @@ int __ref __add_pages(int nid, struct zone *zone, unsigned 
long phys_start_pfn,
break;
err = 0;
}
+   vmemmap_populate_print_last();
   
   	return err;

   }

vmemmap_populate_print_last() is only available on x86_64, when
CONFIG_SPARSEMEM_VMEMMAP=y.  Are you sure this won't break builds?

I tried this on i386 and on x86_64 when CONFIG_SPARSEMEM_VMEMMAP=n ,
it builds ok.

With powerpc:

akpm3:/usr/src/25 make allmodconfig
akpm3:/usr/src/25 make mm/memory_hotplug.o
akpm3:/usr/src/25 nm mm/memory_hotplug.o | grep vmemmap_populate_print_last
U .vmemmap_populate_print_last
akpm3:/usr/src/25 grep -r vmemmap_populate_print_last arch/powerpc
akpm3:/usr/src/25

So I think that's going to break.

I expect ia64 will break also, but I didn't investigate.
.



There is
void __weak __meminit vmemmap_populate_print last(void)
in /mm/sparse.c, so I think this won't break builds.

And I found the function was invoked in void __init sparse_init(void) 
without

CONFIG_SPARSEMEM_VMEMMAP=y.

I also tried this on arm, it builds ok too.

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm/memory hotplug: print the last vmemmap region at the end of hot add memory

2015-06-08 Thread Zhu Guihua


On 06/09/2015 07:30 AM, Andrew Morton wrote:

On Mon, 8 Jun 2015 14:44:41 +0800 Zhu Guihua  wrote:


When hot add two nodes continuously, we found the vmemmap region info is a
bit messed. The last region of node 2 is printed when node 3 hot added,
like the following:
Initmem setup node 2 [mem 0x-0x]
  On node 2 totalpages: 0
  Built 2 zonelists in Node order, mobility grouping on.  Total pages: 16090539
  Policy zone: Normal
  init_memory_mapping: [mem 0x400-0x407]
   [mem 0x400-0x407] page 1G
   [ea10-ea10001f] PMD -> 
[8a077d80-8a077d9f] on node 2
   [ea100020-ea10003f] PMD -> 
[8a077de0-8a077dff] on node 2
...
   [ea101f60-ea101f9f] PMD -> 
[8a074ac0-8a074aff] on node 2
   [ea101fa0-ea101fdf] PMD -> 
[8a074a80-8a074abf] on node 2
Initmem setup node 3 [mem 0x-0x]
  On node 3 totalpages: 0
  Built 3 zonelists in Node order, mobility grouping on.  Total pages: 16090539
  Policy zone: Normal
  init_memory_mapping: [mem 0x600-0x607]
   [mem 0x600-0x607] page 1G
   [ea101fe0-ea101fff] PMD -> [8a074a40-8a074a5f] 
on node 2 <=== node 2 ???
   [ea18-ea18001f] PMD -> 
[8a074a60-8a074a7f] on node 3
   [ea180020-ea18005f] PMD -> 
[8a074a00-8a074a3f] on node 3
   [ea180060-ea18009f] PMD -> 
[8a0749c0-8a0749ff] on node 3
...

The cause is the last region was missed at the and of hot add memory, and
p_start, p_end, node_start were not reset, so when hot add memory to a new
node, it will consider they are not contiguous blocks and print the
previous one. So we print the last vmemmap region at the end of hot add
memory to avoid the confusion.

...

--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -513,6 +513,7 @@ int __ref __add_pages(int nid, struct zone *zone, unsigned 
long phys_start_pfn,
break;
err = 0;
}
+   vmemmap_populate_print_last();
  
  	return err;

  }

vmemmap_populate_print_last() is only available on x86_64, when
CONFIG_SPARSEMEM_VMEMMAP=y.  Are you sure this won't break builds?


I tried this on i386 and on x86_64 when CONFIG_SPARSEMEM_VMEMMAP=n ,
it builds ok.

Thanks,
Zhu



.



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] mm/memory hotplug: print the last vmemmap region at the end of hot add memory

2015-06-08 Thread Zhu Guihua
When hot add two nodes continuously, we found the vmemmap region info is a
bit messed. The last region of node 2 is printed when node 3 hot added,
like the following:
Initmem setup node 2 [mem 0x-0x]
 On node 2 totalpages: 0
 Built 2 zonelists in Node order, mobility grouping on.  Total pages: 16090539
 Policy zone: Normal
 init_memory_mapping: [mem 0x400-0x407]
  [mem 0x400-0x407] page 1G
  [ea10-ea10001f] PMD -> 
[8a077d80-8a077d9f] on node 2
  [ea100020-ea10003f] PMD -> 
[8a077de0-8a077dff] on node 2
...
  [ea101f60-ea101f9f] PMD -> 
[8a074ac0-8a074aff] on node 2
  [ea101fa0-ea101fdf] PMD -> 
[8a074a80-8a074abf] on node 2
Initmem setup node 3 [mem 0x-0x]
 On node 3 totalpages: 0
 Built 3 zonelists in Node order, mobility grouping on.  Total pages: 16090539
 Policy zone: Normal
 init_memory_mapping: [mem 0x600-0x607]
  [mem 0x600-0x607] page 1G
  [ea101fe0-ea101fff] PMD -> 
[8a074a40-8a074a5f] on node 2 <=== node 2 ???
  [ea18-ea18001f] PMD -> 
[8a074a60-8a074a7f] on node 3
  [ea180020-ea18005f] PMD -> 
[8a074a00-8a074a3f] on node 3
  [ea180060-ea18009f] PMD -> 
[8a0749c0-8a0749ff] on node 3
...

The cause is the last region was missed at the and of hot add memory, and
p_start, p_end, node_start were not reset, so when hot add memory to a new
node, it will consider they are not contiguous blocks and print the
previous one. So we print the last vmemmap region at the end of hot add
memory to avoid the confusion.

Signed-off-by: Zhu Guihua 
---
 mm/memory_hotplug.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 457bde5..58fb223 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -513,6 +513,7 @@ int __ref __add_pages(int nid, struct zone *zone, unsigned 
long phys_start_pfn,
break;
err = 0;
}
+   vmemmap_populate_print_last();
 
return err;
 }
-- 
1.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm/memory hotplug: print the last vmemmap region at the end of hot add memory

2015-06-08 Thread Zhu Guihua


On 06/09/2015 07:30 AM, Andrew Morton wrote:

On Mon, 8 Jun 2015 14:44:41 +0800 Zhu Guihua zhugh.f...@cn.fujitsu.com wrote:


When hot add two nodes continuously, we found the vmemmap region info is a
bit messed. The last region of node 2 is printed when node 3 hot added,
like the following:
Initmem setup node 2 [mem 0x-0x]
  On node 2 totalpages: 0
  Built 2 zonelists in Node order, mobility grouping on.  Total pages: 16090539
  Policy zone: Normal
  init_memory_mapping: [mem 0x400-0x407]
   [mem 0x400-0x407] page 1G
   [ea10-ea10001f] PMD - 
[8a077d80-8a077d9f] on node 2
   [ea100020-ea10003f] PMD - 
[8a077de0-8a077dff] on node 2
...
   [ea101f60-ea101f9f] PMD - 
[8a074ac0-8a074aff] on node 2
   [ea101fa0-ea101fdf] PMD - 
[8a074a80-8a074abf] on node 2
Initmem setup node 3 [mem 0x-0x]
  On node 3 totalpages: 0
  Built 3 zonelists in Node order, mobility grouping on.  Total pages: 16090539
  Policy zone: Normal
  init_memory_mapping: [mem 0x600-0x607]
   [mem 0x600-0x607] page 1G
   [ea101fe0-ea101fff] PMD - [8a074a40-8a074a5f] 
on node 2 === node 2 ???
   [ea18-ea18001f] PMD - 
[8a074a60-8a074a7f] on node 3
   [ea180020-ea18005f] PMD - 
[8a074a00-8a074a3f] on node 3
   [ea180060-ea18009f] PMD - 
[8a0749c0-8a0749ff] on node 3
...

The cause is the last region was missed at the and of hot add memory, and
p_start, p_end, node_start were not reset, so when hot add memory to a new
node, it will consider they are not contiguous blocks and print the
previous one. So we print the last vmemmap region at the end of hot add
memory to avoid the confusion.

...

--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -513,6 +513,7 @@ int __ref __add_pages(int nid, struct zone *zone, unsigned 
long phys_start_pfn,
break;
err = 0;
}
+   vmemmap_populate_print_last();
  
  	return err;

  }

vmemmap_populate_print_last() is only available on x86_64, when
CONFIG_SPARSEMEM_VMEMMAP=y.  Are you sure this won't break builds?


I tried this on i386 and on x86_64 when CONFIG_SPARSEMEM_VMEMMAP=n ,
it builds ok.

Thanks,
Zhu



.



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] mm/memory hotplug: print the last vmemmap region at the end of hot add memory

2015-06-08 Thread Zhu Guihua
When hot add two nodes continuously, we found the vmemmap region info is a
bit messed. The last region of node 2 is printed when node 3 hot added,
like the following:
Initmem setup node 2 [mem 0x-0x]
 On node 2 totalpages: 0
 Built 2 zonelists in Node order, mobility grouping on.  Total pages: 16090539
 Policy zone: Normal
 init_memory_mapping: [mem 0x400-0x407]
  [mem 0x400-0x407] page 1G
  [ea10-ea10001f] PMD - 
[8a077d80-8a077d9f] on node 2
  [ea100020-ea10003f] PMD - 
[8a077de0-8a077dff] on node 2
...
  [ea101f60-ea101f9f] PMD - 
[8a074ac0-8a074aff] on node 2
  [ea101fa0-ea101fdf] PMD - 
[8a074a80-8a074abf] on node 2
Initmem setup node 3 [mem 0x-0x]
 On node 3 totalpages: 0
 Built 3 zonelists in Node order, mobility grouping on.  Total pages: 16090539
 Policy zone: Normal
 init_memory_mapping: [mem 0x600-0x607]
  [mem 0x600-0x607] page 1G
  [ea101fe0-ea101fff] PMD - 
[8a074a40-8a074a5f] on node 2 === node 2 ???
  [ea18-ea18001f] PMD - 
[8a074a60-8a074a7f] on node 3
  [ea180020-ea18005f] PMD - 
[8a074a00-8a074a3f] on node 3
  [ea180060-ea18009f] PMD - 
[8a0749c0-8a0749ff] on node 3
...

The cause is the last region was missed at the and of hot add memory, and
p_start, p_end, node_start were not reset, so when hot add memory to a new
node, it will consider they are not contiguous blocks and print the
previous one. So we print the last vmemmap region at the end of hot add
memory to avoid the confusion.

Signed-off-by: Zhu Guihua zhugh.f...@cn.fujitsu.com
---
 mm/memory_hotplug.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 457bde5..58fb223 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -513,6 +513,7 @@ int __ref __add_pages(int nid, struct zone *zone, unsigned 
long phys_start_pfn,
break;
err = 0;
}
+   vmemmap_populate_print_last();
 
return err;
 }
-- 
1.9.3

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/