Re: [PATCH 0/6 qemu] acpi: NUMA nodes for CXL HB as GP + complex NUMA test.

2024-04-12 Thread Huang, Ying
Hi, Jonathan,

Jonathan Cameron  writes:

> ACPI 6.5 introduced Generic Port Affinity Structures to close a system
> description gap that was a problem for CXL memory systems.
> It defines a new SRAT Affinity structure (and hence allows creation of an
> ACPI Proximity Node, which can only be defined via an SRAT structure)
> for the boundary between a discoverable fabric and non-discoverable
> system interconnects etc.
>
> The HMAT data on latency and bandwidth is combined with discoverable
> information from the CXL bus (link speeds, lane counts) and CXL devices
> (switch port to port characteristics and USP to memory, via CDAT tables
> read from the device).  QEMU has supported the rest of the elements
> of this chain for a while but now the kernel has caught up and we need
> the missing element of Generic Ports (this code has been used extensively
> in testing and debugging that kernel support, some resulting fixes
> currently under review).
>
> Generic Port Affinity Structures are very similar to the recently
> added Generic Initiator Affinity Structures (GI), so this series
> factors out and reuses much of that infrastructure.
> There are subtle differences (beyond the obvious structure ID change).
>
> - The ACPI spec example (and Linux kernel support) associates a Generic
>   Port not with the CXL root port, but rather with the CXL host bridge.
>   As a result, an ACPI handle is used (rather than the PCI SBDF option
>   used for GIs). In QEMU the easiest way to get at this is to target the
>   root bridge PCI bus, and conveniently the root bridge bus number is
>   used for the UID, allowing us to construct an appropriate entry.
>
> A key addition of this series is a complex NUMA topology example that
> stretches the QEMU emulation code for GI, GP and nodes with just
> CPUs, just memory, just hot-pluggable memory, and a mixture of memory and CPUs.
>
> A similar test turned up a few NUMA related bugs, with fixes applied for
> 9.0 (note that one of these needs Linux booted to identify that it
> rejects the HMAT table; this test is a regression test for the
> table generation only).
>
> https://lore.kernel.org/qemu-devel/2eb6672cfdaea7dacd8e9bb0523887f13b9f85ce.1710282274.git@redhat.com/
> https://lore.kernel.org/qemu-devel/74e2845c5f95b0c139c79233ddb65bb17f2dd679.1710282274.git@redhat.com/
>

Thanks a lot for your work!

I need this to test some memory tiering kernel patches.  I found the
following git branch,

https://gitlab.com/jic23/qemu/-/commits/cxl-2024-03-05/?ref_type=heads

Can I use that branch directly?

And can you share an example QEMU command line to set up GenPort, CDAT,
and HMAT?
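
The rough shape I have in mind is something like the following sketch.  The
-numa/hmat-lb options are existing QEMU ones; the acpi-generic-port object
name and syntax are only my guess, modeled on the existing
acpi-generic-initiator object, and whether hmat-lb can target the
generic-port node like this is part of what I am asking:

  qemu-system-x86_64 -machine q35,cxl=on,hmat=on -smp 4 \
    -object memory-backend-ram,id=mem0,size=4G \
    -numa node,nodeid=0,cpus=0-3,memdev=mem0 \
    -numa node,nodeid=1 \
    -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
    -object acpi-generic-port,id=gp0,pci-bus=cxl.1,node=1 \
    -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=10 \
    -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=100G \
    -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=40 \
    -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=50G

Here node 1 is a CPU-less node intended to stand for the CXL host bridge
(generic port), and the hmat-lb lines would give it the latency/bandwidth
numbers the guest combines with CDAT data read from the type3 device.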

--
Best Regards,
Huang, Ying



Re: [PATCH v11 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info

2024-04-09 Thread Huang, Ying
"Ho-Ren (Jack) Chuang"  writes:

> On Fri, Apr 5, 2024 at 7:03 AM Jonathan Cameron
>  wrote:
>>
>> On Fri,  5 Apr 2024 00:07:06 +
>> "Ho-Ren (Jack) Chuang"  wrote:
>>
>> > The current implementation treats emulated memory devices, such as
>> > CXL1.1 type3 memory, as normal DRAM when they are emulated as normal memory
>> > (E820_TYPE_RAM). However, these emulated devices have different
>> > characteristics than traditional DRAM, making it important to
>> > distinguish them. Thus, we modify the tiered memory initialization process
>> > to introduce a delay specifically for CPUless NUMA nodes. This delay
>> > ensures that the memory tier initialization for these nodes is deferred
>> > until HMAT information is obtained during the boot process. Finally,
>> > demotion tables are recalculated at the end.
>> >
>> > * late_initcall(memory_tier_late_init);
>> > Some device drivers may have initialized memory tiers between
>> > `memory_tier_init()` and `memory_tier_late_init()`, potentially bringing
>> > online memory nodes and configuring memory tiers. They should be excluded
>> > in the late init.
>> >
>> > * Handle cases where there is no HMAT when creating memory tiers
>> > There is a scenario where a CPUless node does not provide HMAT information.
>> > If no HMAT is specified, it falls back to using the default DRAM tier.
>> >
>> > * Introduce another new lock `default_dram_perf_lock` for adist calculation
>> > In the current implementation, iterating through CPUlist nodes requires
>> > holding the `memory_tier_lock`. However, `mt_calc_adistance()` will end up
>> > trying to acquire the same lock, leading to a potential deadlock.
>> > Therefore, we propose introducing a standalone `default_dram_perf_lock` to
>> > protect `default_dram_perf_*`. This approach not only avoids deadlock
>> > but also prevents holding a large lock simultaneously.
>> >
>> > * Upgrade `set_node_memory_tier` to support additional cases, including
>> >   default DRAM, late CPUless, and hot-plugged initializations.
>> > To cover hot-plugged memory nodes, `mt_calc_adistance()` and
>> > `mt_find_alloc_memory_type()` are moved into `set_node_memory_tier()` to
>> > handle cases where memtype is not initialized and where HMAT information is
>> > available.
>> >
>> > * Introduce `default_memory_types` for those memory types that are not
>> >   initialized by device drivers.
>> > Because late initialized memory and default DRAM memory need to be managed,
>> > a default memory type is created for storing all memory types that are
>> > not initialized by device drivers and as a fallback.
>> >
>> > Signed-off-by: Ho-Ren (Jack) Chuang 
>> > Signed-off-by: Hao Xiang 
>> > Reviewed-by: "Huang, Ying" 
>>
>> Hi - one remaining question. Why can't we delay init for all nodes
>> to either drivers or your fallback late_initcall code.
>> It would be nice to reduce possible code paths.
>
> I try not to change too much of the existing code structure in
> this patchset.
>
> To me, postponing/moving all memory tier registrations to
> late_initcall() is another possible action item for the next patchset.
>
> After tier_mem(), hmat_init() is called, which requires registering
> `default_dram_type` info. This is when `default_dram_type` is needed.
> However, it is indeed possible to postpone the latter part,
> set_node_memory_tier(), to `late_init(). So, memory_tier_init() can
> indeed be split into two parts, and the latter part can be moved to
> late_initcall() to be processed together.

I don't think that it's good to move all memory_tier initialization in
drivers to late_initcall().  It's natural to keep them at the
device_initcall() level.

If so, we can allocate default_dram_type in memory_tier_init(), and call
set_node_memory_tier() only in memory_tier_lateinit().  We can call
memory_tier_lateinit() at the device_initcall() level too.
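
That is, roughly the following (just a sketch to illustrate the split;
initcall levels and helper names follow the current patch and can be
adjusted):

static int __init memory_tier_init(void)
{
	/* Only allocate default_dram_type here; don't assign per-node
	 * tiers yet. */
	default_dram_type = mt_find_alloc_memory_type(MEMTIER_ADISTANCE_DRAM,
						      &default_memory_types);
	if (IS_ERR(default_dram_type))
		panic("%s() failed to allocate default DRAM tier\n", __func__);

	return 0;
}

static int __init memory_tier_lateinit(void)
{
	int nid;

	mutex_lock(&memory_tier_lock);
	/* Assign tiers for every node, CPU and CPU-less alike, skipping
	 * any node a device driver has already set up. */
	for_each_node_state(nid, N_MEMORY)
		if (node_memory_types[nid].memtype == NULL)
			set_node_memory_tier(nid);
	establish_demotion_targets();
	mutex_unlock(&memory_tier_lock);

	return 0;
}
late_initcall(memory_tier_lateinit);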

--
Best Regards,
Huang, Ying

> Doing this, all memory-type drivers have to call late_initcall() to
> register a memory tier. I’m not sure how many there are?
>
> What do you guys think?
>
>>
>> Jonathan
>>
>>
>> > ---
>> >  mm/memory-tiers.c | 94 +++
>> >  1 file changed, 70 insertions(+), 24 deletions(-)
>> >
>> > diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
>> > index 516b144fd45a..6632102bd5c9 100644
>> > --- a/mm/memory-tiers.c
>> > +++ b/mm/memory-tiers.c

Re: [PATCH v8 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info

2024-03-28 Thread Huang, Ying
"Ho-Ren (Jack) Chuang"  writes:

> The current implementation treats emulated memory devices, such as
> CXL1.1 type3 memory, as normal DRAM when they are emulated as normal memory
> (E820_TYPE_RAM). However, these emulated devices have different
> characteristics than traditional DRAM, making it important to
> distinguish them. Thus, we modify the tiered memory initialization process
> to introduce a delay specifically for CPUless NUMA nodes. This delay
> ensures that the memory tier initialization for these nodes is deferred
> until HMAT information is obtained during the boot process. Finally,
> demotion tables are recalculated at the end.
>
> * late_initcall(memory_tier_late_init);
> Some device drivers may have initialized memory tiers between
> `memory_tier_init()` and `memory_tier_late_init()`, potentially bringing
> online memory nodes and configuring memory tiers. They should be excluded
> in the late init.
>
> * Handle cases where there is no HMAT when creating memory tiers
> There is a scenario where a CPUless node does not provide HMAT information.
> If no HMAT is specified, it falls back to using the default DRAM tier.
>
> * Introduce another new lock `default_dram_perf_lock` for adist calculation
> In the current implementation, iterating through CPUlist nodes requires
> holding the `memory_tier_lock`. However, `mt_calc_adistance()` will end up
> trying to acquire the same lock, leading to a potential deadlock.
> Therefore, we propose introducing a standalone `default_dram_perf_lock` to
> protect `default_dram_perf_*`. This approach not only avoids deadlock
> but also prevents holding a large lock simultaneously.
>
> * Upgrade `set_node_memory_tier` to support additional cases, including
>   default DRAM, late CPUless, and hot-plugged initializations.
> To cover hot-plugged memory nodes, `mt_calc_adistance()` and
> `mt_find_alloc_memory_type()` are moved into `set_node_memory_tier()` to
> handle cases where memtype is not initialized and where HMAT information is
> available.
>
> * Introduce `default_memory_types` for those memory types that are not
>   initialized by device drivers.
> Because late initialized memory and default DRAM memory need to be managed,
> a default memory type is created for storing all memory types that are
> not initialized by device drivers and as a fallback.
>
> Signed-off-by: Ho-Ren (Jack) Chuang 
> Signed-off-by: Hao Xiang 
> Reviewed-by: "Huang, Ying" 
> ---
>  mm/memory-tiers.c | 94 +++
>  1 file changed, 78 insertions(+), 16 deletions(-)
>
> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> index 974af10cfdd8..e24fc3bebae4 100644
> --- a/mm/memory-tiers.c
> +++ b/mm/memory-tiers.c
> @@ -36,6 +36,11 @@ struct node_memory_type_map {
>  
>  static DEFINE_MUTEX(memory_tier_lock);
>  static LIST_HEAD(memory_tiers);
> +/*
> + * The list is used to store all memory types that are not created
> + * by a device driver.
> + */
> +static LIST_HEAD(default_memory_types);
>  static struct node_memory_type_map node_memory_types[MAX_NUMNODES];
>  struct memory_dev_type *default_dram_type;
>  
> @@ -108,6 +113,8 @@ static struct demotion_nodes *node_demotion __read_mostly;
>  
>  static BLOCKING_NOTIFIER_HEAD(mt_adistance_algorithms);
>  
> +/* The lock is used to protect `default_dram_perf*` info and nid. */
> +static DEFINE_MUTEX(default_dram_perf_lock);
>  static bool default_dram_perf_error;
>  static struct access_coordinate default_dram_perf;
>  static int default_dram_perf_ref_nid = NUMA_NO_NODE;
> @@ -505,7 +512,8 @@ static inline void __init_node_memory_type(int node, 
> struct memory_dev_type *mem
>  static struct memory_tier *set_node_memory_tier(int node)
>  {
>   struct memory_tier *memtier;
> - struct memory_dev_type *memtype;
> + struct memory_dev_type *mtype = default_dram_type;
> + int adist = MEMTIER_ADISTANCE_DRAM;
>   pg_data_t *pgdat = NODE_DATA(node);
>  
>  
> @@ -514,11 +522,20 @@ static struct memory_tier *set_node_memory_tier(int 
> node)
>   if (!node_state(node, N_MEMORY))
>   return ERR_PTR(-EINVAL);
>  
> - __init_node_memory_type(node, default_dram_type);
> + mt_calc_adistance(node, &adist);
> + if (node_memory_types[node].memtype == NULL) {
> + mtype = mt_find_alloc_memory_type(adist, &default_memory_types);
> + if (IS_ERR(mtype)) {
> + mtype = default_dram_type;
> + pr_info("Failed to allocate a memory type. Fall back.\n");
> + }
> + }
> +
> + __init_node_memory_type(node, mtype);
>  
> - memtype = node_memory_types[node].memtype;
>

Re: [PATCH v6 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info

2024-03-27 Thread Huang, Ying
"Ho-Ren (Jack) Chuang"  writes:

[snip]

> @@ -655,6 +672,34 @@ void mt_put_memory_types(struct list_head *memory_types)
>  }
>  EXPORT_SYMBOL_GPL(mt_put_memory_types);
>  
> +/*
> + * This is invoked via `late_initcall()` to initialize memory tiers for
> + * CPU-less memory nodes after driver initialization, which is
> + * expected to provide `adistance` algorithms.
> + */
> +static int __init memory_tier_late_init(void)
> +{
> + int nid;
> +
> + mutex_lock(&memory_tier_lock);
> + for_each_node_state(nid, N_MEMORY)
> + if (!node_state(nid, N_CPU) &&
> + node_memory_types[nid].memtype == NULL)

Thinking about this again, it seems better to check
"node_memory_types[nid].memtype == NULL" only here.  Because for all
nodes with N_CPU in memory_tier_init(), "node_memory_types[nid].memtype"
will be !NULL.  And it's possible (in theory) that some nodes become
"node_state(nid, N_CPU) == true" between memory_tier_init() and
memory_tier_late_init().
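
That is, something like the following sketch:

	for_each_node_state(nid, N_MEMORY)
		if (node_memory_types[nid].memtype == NULL)
			/* Covers CPU-less nodes and any node that gained
			 * CPUs after memory_tier_init(). */
			set_node_memory_tier(nid);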

Otherwise, looks good to me.  Feel free to add

Reviewed-by: "Huang, Ying" 

in the future version.

> + /*
> +  * Some device drivers may have initialized memory tiers
> +  * between `memory_tier_init()` and 
> `memory_tier_late_init()`,
> +  * potentially bringing online memory nodes and
> +  * configuring memory tiers. Exclude them here.
> +  */
> + set_node_memory_tier(nid);
> +
> + establish_demotion_targets();
> + mutex_unlock(&memory_tier_lock);
> +
> + return 0;
> +}
> +late_initcall(memory_tier_late_init);
> +

[snip]

--
Best Regards,
Huang, Ying



Re: [PATCH v5 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info

2024-03-26 Thread Huang, Ying
   
> &default_memory_types);
>   if (IS_ERR(default_dram_type))
>   panic("%s() failed to allocate default DRAM tier\n", __func__);
>  
> @@ -868,6 +919,14 @@ static int __init memory_tier_init(void)
>* types assigned.
>*/
>   for_each_node_state(node, N_MEMORY) {
> + if (!node_state(node, N_CPU))
> + /*
> +  * Defer memory tier initialization on CPUless numa 
> nodes.
> +  * These will be initialized after firmware and devices 
> are
> +  * initialized.
> +  */
> + continue;
> +
>   memtier = set_node_memory_tier(node);
>   if (IS_ERR(memtier))
>   /*

--
Best Regards,
Huang, Ying



Re: [External] Re: [PATCH v4 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info

2024-03-25 Thread Huang, Ying
"Ho-Ren (Jack) Chuang"  writes:

> On Fri, Mar 22, 2024 at 1:41 AM Huang, Ying  wrote:
>>
>> "Ho-Ren (Jack) Chuang"  writes:
>>
>> > The current implementation treats emulated memory devices, such as
>> > CXL1.1 type3 memory, as normal DRAM when they are emulated as normal memory
>> > (E820_TYPE_RAM). However, these emulated devices have different
>> > characteristics than traditional DRAM, making it important to
>> > distinguish them. Thus, we modify the tiered memory initialization process
>> > to introduce a delay specifically for CPUless NUMA nodes. This delay
>> > ensures that the memory tier initialization for these nodes is deferred
>> > until HMAT information is obtained during the boot process. Finally,
>> > demotion tables are recalculated at the end.
>> >
>> > * late_initcall(memory_tier_late_init);
>> > Some device drivers may have initialized memory tiers between
>> > `memory_tier_init()` and `memory_tier_late_init()`, potentially bringing
>> > online memory nodes and configuring memory tiers. They should be excluded
>> > in the late init.
>> >
>> > * Handle cases where there is no HMAT when creating memory tiers
>> > There is a scenario where a CPUless node does not provide HMAT information.
>> > If no HMAT is specified, it falls back to using the default DRAM tier.
>> >
>> > * Introduce another new lock `default_dram_perf_lock` for adist calculation
>> > In the current implementation, iterating through CPUlist nodes requires
>> > holding the `memory_tier_lock`. However, `mt_calc_adistance()` will end up
>> > trying to acquire the same lock, leading to a potential deadlock.
>> > Therefore, we propose introducing a standalone `default_dram_perf_lock` to
>> > protect `default_dram_perf_*`. This approach not only avoids deadlock
>> > but also prevents holding a large lock simultaneously.
>> >
>> > * Upgrade `set_node_memory_tier` to support additional cases, including
>> >   default DRAM, late CPUless, and hot-plugged initializations.
>> > To cover hot-plugged memory nodes, `mt_calc_adistance()` and
>> > `mt_find_alloc_memory_type()` are moved into `set_node_memory_tier()` to
>> > handle cases where memtype is not initialized and where HMAT information is
>> > available.
>> >
>> > * Introduce `default_memory_types` for those memory types that are not
>> >   initialized by device drivers.
>> > Because late initialized memory and default DRAM memory need to be managed,
>> > a default memory type is created for storing all memory types that are
>> > not initialized by device drivers and as a fallback.
>> >
>> > Signed-off-by: Ho-Ren (Jack) Chuang 
>> > Signed-off-by: Hao Xiang 
>> > ---
>> >  mm/memory-tiers.c | 73 ---
>> >  1 file changed, 63 insertions(+), 10 deletions(-)
>> >
>> > diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
>> > index 974af10cfdd8..9396330fa162 100644
>> > --- a/mm/memory-tiers.c
>> > +++ b/mm/memory-tiers.c
>> > @@ -36,6 +36,11 @@ struct node_memory_type_map {
>> >
>> >  static DEFINE_MUTEX(memory_tier_lock);
>> >  static LIST_HEAD(memory_tiers);
>> > +/*
>> > + * The list is used to store all memory types that are not created
>> > + * by a device driver.
>> > + */
>> > +static LIST_HEAD(default_memory_types);
>> >  static struct node_memory_type_map node_memory_types[MAX_NUMNODES];
>> >  struct memory_dev_type *default_dram_type;
>> >
>> > @@ -108,6 +113,7 @@ static struct demotion_nodes *node_demotion 
>> > __read_mostly;
>> >
>> >  static BLOCKING_NOTIFIER_HEAD(mt_adistance_algorithms);
>> >
>> > +static DEFINE_MUTEX(default_dram_perf_lock);
>>
>> Better to add comments about what is protected by this lock.
>>
>
> Thank you. I will add a comment like this:
> + /* The lock is used to protect `default_dram_perf*` info and nid. */
> +static DEFINE_MUTEX(default_dram_perf_lock);
>
> I also found an error path was not handled and
> found the lock could be put closer to what it protects.
> I will have them fixed in V5.
>
>> >  static bool default_dram_perf_error;
>> >  static struct access_coordinate default_dram_perf;
>> >  static int default_dram_perf_ref_nid = NUMA_NO_NODE;
>> > @@ -505,7 +511,8 @@ static inline void __init_node_memory_type(int node, 
>> > struct memory_dev_type *mem

Re: [PATCH v4 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info

2024-03-22 Thread Huang, Ying
ate_memory_tier(memtype);
> + __init_node_memory_type(node, mtype);
> +
> + mtype = node_memory_types[node].memtype;
> + node_set(node, mtype->nodes);
> + memtier = find_create_memory_tier(mtype);
>   if (!IS_ERR(memtier))
>   rcu_assign_pointer(pgdat->memtier, memtier);
>   return memtier;
> @@ -655,6 +671,34 @@ void mt_put_memory_types(struct list_head *memory_types)
>  }
>  EXPORT_SYMBOL_GPL(mt_put_memory_types);
>  
> +/*
> + * This is invoked via `late_initcall()` to initialize memory tiers for
> + * CPU-less memory nodes after driver initialization, which is
> + * expected to provide `adistance` algorithms.
> + */
> +static int __init memory_tier_late_init(void)
> +{
> + int nid;
> +
> + mutex_lock(&memory_tier_lock);
> + for_each_node_state(nid, N_MEMORY)
> + if (!node_state(nid, N_CPU) &&
> + node_memory_types[nid].memtype == NULL)
> + /*
> +  * Some device drivers may have initialized memory tiers
> +  * between `memory_tier_init()` and 
> `memory_tier_late_init()`,
> +  * potentially bringing online memory nodes and
> +  * configuring memory tiers. Exclude them here.
> +  */
> + set_node_memory_tier(nid);
> +
> + establish_demotion_targets();
> + mutex_unlock(&memory_tier_lock);
> +
> + return 0;
> +}
> +late_initcall(memory_tier_late_init);
> +
>  static void dump_hmem_attrs(struct access_coordinate *coord, const char 
> *prefix)
>  {
>   pr_info(
> @@ -668,7 +712,7 @@ int mt_set_default_dram_perf(int nid, struct 
> access_coordinate *perf,
>  {
>   int rc = 0;
>  
> - mutex_lock(&memory_tier_lock);
> + mutex_lock(&default_dram_perf_lock);
>   if (default_dram_perf_error) {
>   rc = -EIO;
>   goto out;
> @@ -716,7 +760,7 @@ int mt_set_default_dram_perf(int nid, struct 
> access_coordinate *perf,
>   }
>  
>  out:
> - mutex_unlock(&memory_tier_lock);
> + mutex_unlock(&default_dram_perf_lock);
>   return rc;
>  }
>  
> @@ -732,7 +776,7 @@ int mt_perf_to_adistance(struct access_coordinate *perf, 
> int *adist)
>   perf->read_bandwidth + perf->write_bandwidth == 0)
>   return -EINVAL;
>  
> - mutex_lock(&memory_tier_lock);
> + mutex_lock(&default_dram_perf_lock);
>   /*
>* The abstract distance of a memory node is in direct proportion to
>* its memory latency (read + write) and inversely proportional to its
> @@ -745,7 +789,7 @@ int mt_perf_to_adistance(struct access_coordinate *perf, 
> int *adist)
>   (default_dram_perf.read_latency + 
> default_dram_perf.write_latency) *
>   (default_dram_perf.read_bandwidth + 
> default_dram_perf.write_bandwidth) /
>   (perf->read_bandwidth + perf->write_bandwidth);
> - mutex_unlock(&memory_tier_lock);
> + mutex_unlock(&default_dram_perf_lock);
>  
>   return 0;
>  }
> @@ -858,7 +902,8 @@ static int __init memory_tier_init(void)
>* For now we can have 4 faster memory tiers with smaller adistance
>* than default DRAM tier.
>*/
> - default_dram_type = alloc_memory_type(MEMTIER_ADISTANCE_DRAM);
> + default_dram_type = mt_find_alloc_memory_type(MEMTIER_ADISTANCE_DRAM,
> + &default_memory_types);
>   if (IS_ERR(default_dram_type))
>   panic("%s() failed to allocate default DRAM tier\n", __func__);
>  
> @@ -868,6 +913,14 @@ static int __init memory_tier_init(void)
>* types assigned.
>*/
>   for_each_node_state(node, N_MEMORY) {
> + if (!node_state(node, N_CPU))
> + /*
> +  * Defer memory tier initialization on CPUless numa 
> nodes.
> +  * These will be initialized after firmware and devices 
> are
> +  * initialized.
> +  */
> + continue;
> +
>   memtier = set_node_memory_tier(node);
>   if (IS_ERR(memtier))
>   /*

--
Best Regards,
Huang, Ying



Re: [PATCH v3 1/2] memory tier: dax/kmem: create CPUless memory tiers after obtaining HMAT info

2024-03-20 Thread Huang, Ying
 tiers
> +  * between `memory_tier_init()` and 
> `memory_tier_late_init()`,
> +  * potentially bringing online memory nodes and
> +  * configuring memory tiers. Exclude them here.
> +  */
> + set_node_memory_tier(nid);
> +
> + establish_demotion_targets();
> + mutex_unlock(&memory_tier_lock);
> +
> + return 0;
> +}
> +late_initcall(memory_tier_late_init);
> +
>  static void dump_hmem_attrs(struct access_coordinate *coord, const char 
> *prefix)
>  {
>   pr_info(
> @@ -631,12 +698,16 @@ static void dump_hmem_attrs(struct access_coordinate 
> *coord, const char *prefix)
>   coord->read_bandwidth, coord->write_bandwidth);
>  }
>  
> +/*
> + * The lock is used to protect the default_dram_perf.
> + */
> +static DEFINE_MUTEX(mt_perf_lock);

Miscommunication here too.  Should be moved to near the
"default_dram_perf" definition.  And it protects not only
default_dram_perf.

>  int mt_set_default_dram_perf(int nid, struct access_coordinate *perf,
>const char *source)
>  {
>   int rc = 0;
>  
> - mutex_lock(&memory_tier_lock);
> + mutex_lock(&mt_perf_lock);
>   if (default_dram_perf_error) {
>   rc = -EIO;
>   goto out;
> @@ -684,7 +755,7 @@ int mt_set_default_dram_perf(int nid, struct 
> access_coordinate *perf,
>   }
>  
>  out:
> - mutex_unlock(&memory_tier_lock);
> + mutex_unlock(&mt_perf_lock);
>   return rc;
>  }
>  
> @@ -700,7 +771,7 @@ int mt_perf_to_adistance(struct access_coordinate *perf, 
> int *adist)
>   perf->read_bandwidth + perf->write_bandwidth == 0)
>   return -EINVAL;
>  
> - mutex_lock(&memory_tier_lock);
> + mutex_lock(&mt_perf_lock);
>   /*
>* The abstract distance of a memory node is in direct proportion to
>* its memory latency (read + write) and inversely proportional to its
> @@ -713,7 +784,7 @@ int mt_perf_to_adistance(struct access_coordinate *perf, 
> int *adist)
>   (default_dram_perf.read_latency + 
> default_dram_perf.write_latency) *
>   (default_dram_perf.read_bandwidth + 
> default_dram_perf.write_bandwidth) /
>   (perf->read_bandwidth + perf->write_bandwidth);
> - mutex_unlock(&memory_tier_lock);
> + mutex_unlock(&mt_perf_lock);
>  
>   return 0;
>  }
> @@ -826,7 +897,8 @@ static int __init memory_tier_init(void)
>* For now we can have 4 faster memory tiers with smaller adistance
>* than default DRAM tier.
>*/
> - default_dram_type = alloc_memory_type(MEMTIER_ADISTANCE_DRAM);
> + default_dram_type = mt_find_alloc_memory_type(
> + MEMTIER_ADISTANCE_DRAM, &default_memory_types);
>   if (IS_ERR(default_dram_type))
>   panic("%s() failed to allocate default DRAM tier\n", __func__);
>  
> @@ -836,6 +908,14 @@ static int __init memory_tier_init(void)
>* types assigned.
>*/
>   for_each_node_state(node, N_MEMORY) {
> + if (!node_state(node, N_CPU))
> + /*
> +  * Defer memory tier initialization on CPUless numa 
> nodes.
> +  * These will be initialized after firmware and devices 
> are
> +  * initialized.
> +  */
> + continue;
> +
>   memtier = set_node_memory_tier(node);
>   if (IS_ERR(memtier))
>   /*

--
Best Regards,
Huang, Ying



Re: [External] Re: [PATCH v2 1/1] memory tier: acpi/hmat: create CPUless memory tiers after obtaining HMAT info

2024-03-14 Thread Huang, Ying
"Ho-Ren (Jack) Chuang"  writes:

> On Tue, Mar 12, 2024 at 2:21 AM Huang, Ying  wrote:
>>
>> "Ho-Ren (Jack) Chuang"  writes:
>>
>> > The current implementation treats emulated memory devices, such as
>> > CXL1.1 type3 memory, as normal DRAM when they are emulated as normal memory
>> > (E820_TYPE_RAM). However, these emulated devices have different
>> > characteristics than traditional DRAM, making it important to
>> > distinguish them. Thus, we modify the tiered memory initialization process
>> > to introduce a delay specifically for CPUless NUMA nodes. This delay
>> > ensures that the memory tier initialization for these nodes is deferred
>> > until HMAT information is obtained during the boot process. Finally,
>> > demotion tables are recalculated at the end.
>> >
>> > * Abstract common functions into `find_alloc_memory_type()`
>>
>> We should move kmem_put_memory_types() (renamed to
>> mt_put_memory_types()?) too.  This can be put in a separate patch.
>>
>
> Will do! Thanks,
>
>
>>
>> > Since different memory devices require finding or allocating a memory type,
>> > these common steps are abstracted into a single function,
>> > `find_alloc_memory_type()`, enhancing code scalability and conciseness.
>> >
>> > * Handle cases where there is no HMAT when creating memory tiers
>> > There is a scenario where a CPUless node does not provide HMAT information.
>> > If no HMAT is specified, it falls back to using the default DRAM tier.
>> >
>> > * Change adist calculation code to use another new lock, mt_perf_lock.
>> > In the current implementation, iterating through CPUlist nodes requires
>> > holding the `memory_tier_lock`. However, `mt_calc_adistance()` will end up
>> > trying to acquire the same lock, leading to a potential deadlock.
>> > Therefore, we propose introducing a standalone `mt_perf_lock` to protect
>> > `default_dram_perf`. This approach not only avoids deadlock but also
>> > prevents holding a large lock simultaneously.
>> >
>> > Signed-off-by: Ho-Ren (Jack) Chuang 
>> > Signed-off-by: Hao Xiang 
>> > ---
>> >  drivers/acpi/numa/hmat.c | 11 ++
>> >  drivers/dax/kmem.c   | 13 +--
>> >  include/linux/acpi.h |  6 
>> >  include/linux/memory-tiers.h |  8 +
>> >  mm/memory-tiers.c| 70 +---
>> >  5 files changed, 92 insertions(+), 16 deletions(-)
>> >
>> > diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
>> > index d6b85f0f6082..28812ec2c793 100644
>> > --- a/drivers/acpi/numa/hmat.c
>> > +++ b/drivers/acpi/numa/hmat.c
>> > @@ -38,6 +38,8 @@ static LIST_HEAD(targets);
>> >  static LIST_HEAD(initiators);
>> >  static LIST_HEAD(localities);
>> >
>> > +static LIST_HEAD(hmat_memory_types);
>> > +
>>
>> HMAT isn't a device driver for some memory devices.  So I don't think we
>> should manage memory types in HMAT.
>
> I can put it back in memory-tier.c. How about the list? Do we still
> need to keep a separate list for storing late_inited memory nodes?
> And how about the list name if we need to remove the prefix "hmat_"?

I don't think we need a separate list for memory-less nodes.  Just
iterate all memory-less nodes.
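
For example, just a sketch:

	for_each_node_state(nid, N_MEMORY)
		if (!node_state(nid, N_CPU))
			set_node_memory_tier(nid);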

>
>> Instead, if the memory_type of a
>> node isn't set by the driver, we should manage it in memory-tier.c as
>> fallback.
>>
>
> Do you mean some device drivers may init memory tiers between
> memory_tier_init() and late_initcall(memory_tier_late_init);?
> And this is the reason why you mention to exclude
> "node_memory_types[nid].memtype != NULL" in memory_tier_late_init().
> Is my understanding correct?

Yes.

>> >  static DEFINE_MUTEX(target_lock);
>> >
>> >  /*
>> > @@ -149,6 +151,12 @@ int acpi_get_genport_coordinates(u32 uid,
>> >  }
>> >  EXPORT_SYMBOL_NS_GPL(acpi_get_genport_coordinates, CXL);
>> >
>> > +struct memory_dev_type *hmat_find_alloc_memory_type(int adist)
>> > +{
>> > + return find_alloc_memory_type(adist, &hmat_memory_types);
>> > +}
>> > +EXPORT_SYMBOL_GPL(hmat_find_alloc_memory_type);
>> > +
>> >  static __init void alloc_memory_initiator(unsigned int cpu_pxm)
>> >  {
>> >   struct memory_initiator *initiator;
>> > @@ -1038,6 +1046,9 @@ static __init int hmat_init(void)
>> >   if (!hmat_set_default_dram_perf())

Re: [PATCH v2 1/1] memory tier: acpi/hmat: create CPUless memory tiers after obtaining HMAT info

2024-03-12 Thread Huang, Ying
>  {
>   pr_info(
> @@ -636,7 +690,7 @@ int mt_set_default_dram_perf(int nid, struct 
> access_coordinate *perf,
>  {
>   int rc = 0;
>  
> - mutex_lock(&memory_tier_lock);
> + mutex_lock(&mt_perf_lock);
>   if (default_dram_perf_error) {
>   rc = -EIO;
>   goto out;
> @@ -684,7 +738,7 @@ int mt_set_default_dram_perf(int nid, struct 
> access_coordinate *perf,
>   }
>  
>  out:
> - mutex_unlock(&memory_tier_lock);
> + mutex_unlock(&mt_perf_lock);
>   return rc;
>  }
>  
> @@ -700,7 +754,7 @@ int mt_perf_to_adistance(struct access_coordinate *perf, 
> int *adist)
>   perf->read_bandwidth + perf->write_bandwidth == 0)
>   return -EINVAL;
>  
> - mutex_lock(&memory_tier_lock);
> + mutex_lock(&mt_perf_lock);
>   /*
>* The abstract distance of a memory node is in direct proportion to
>* its memory latency (read + write) and inversely proportional to its
> @@ -713,7 +767,7 @@ int mt_perf_to_adistance(struct access_coordinate *perf, 
> int *adist)
>   (default_dram_perf.read_latency + 
> default_dram_perf.write_latency) *
>   (default_dram_perf.read_bandwidth + 
> default_dram_perf.write_bandwidth) /
>   (perf->read_bandwidth + perf->write_bandwidth);
> - mutex_unlock(&memory_tier_lock);
> + mutex_unlock(&mt_perf_lock);
>  
>   return 0;
>  }
> @@ -836,6 +890,14 @@ static int __init memory_tier_init(void)
>* types assigned.
>*/
>   for_each_node_state(node, N_MEMORY) {
> + if (!node_state(node, N_CPU))
> + /*
> +  * Defer memory tier initialization on CPUless numa 
> nodes.
> +  * These will be initialized when HMAT information is

HMAT is platform-specific; we should avoid mentioning it in general code
if possible.

> +  * available.
> +  */
> + continue;
> +
>   memtier = set_node_memory_tier(node);
>   if (IS_ERR(memtier))
>   /*

--
Best Regards,
Huang, Ying



Re: [External] Re: [PATCH v1 1/1] memory tier: acpi/hmat: create CPUless memory tiers after obtaining HMAT info

2024-03-05 Thread Huang, Ying
"Ho-Ren (Jack) Chuang"  writes:

> On Sun, Mar 3, 2024 at 6:42 PM Huang, Ying  wrote:
>>
>> Hi, Jack,
>>
>> "Ho-Ren (Jack) Chuang"  writes:
>>
>> > * Introduce `mt_init_with_hmat()`
>> > We defer memory tier initialization for those CPUless NUMA nodes
>> > until acquiring HMAT info. `mt_init_with_hmat()` is introduced to
>> > post-create CPUless memory tiers after obtaining HMAT info.
>> > It iterates through each CPUless memory node, creating memory tiers if
>> > necessary. Finally, it calculates demotion tables again at the end.
>> >
>> > * Introduce `hmat_find_alloc_memory_type()`
>> > Find or allocate a memory type in the `hmat_memory_types` list.
>> >
>> > * Make `set_node_memory_tier()` more generic
>> > This function can also be used for setting other memory types for a node.
>> > To do so, a new argument is added to specify a memory type.
>> >
>> > * Handle cases where there is no HMAT when creating memory tiers
>> > If no HMAT is specified, it falls back to using `default_dram_type`.
>> >
>> > * Change adist calculation code to use another new lock, mt_perf_lock.
>> > Iterating through CPUlist nodes requires holding the `memory_tier_lock`.
>> > However, `mt_calc_adistance()` will end up trying to acquire the same lock,
>> > leading to a potential deadlock. Therefore, we propose introducing a
>> > standalone `mt_perf_lock` to protect `default_dram_perf`. This approach not
>> > only avoids deadlock but also prevents holding a large lock simultaneously.
>>
>> The patch description is used to describe why we need the change and
>> how we do it, but not what we do.  People can tell what is done from
>> the code itself.
>>
>
> Got it. Thanks. Will rewrite it after the code is finalized.
>
>> > Signed-off-by: Ho-Ren (Jack) Chuang 
>> > Signed-off-by: Hao Xiang 
>> > ---
>> >  drivers/acpi/numa/hmat.c |  3 ++
>> >  include/linux/memory-tiers.h |  6 +++
>> >  mm/memory-tiers.c| 76 
>> >  3 files changed, 77 insertions(+), 8 deletions(-)
>> >
>> > diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
>> > index d6b85f0f6082..9f57338b3cb5 100644
>> > --- a/drivers/acpi/numa/hmat.c
>> > +++ b/drivers/acpi/numa/hmat.c
>> > @@ -1038,6 +1038,9 @@ static __init int hmat_init(void)
>> >   if (!hmat_set_default_dram_perf())
>> > + register_mt_adistance_algorithm(&hmat_adist_nb);
>> >
>> > + /* Post-create CPUless memory tiers after getting HMAT info */
>> > + mt_init_with_hmat();
>> > +
>> >   return 0;
>> >  out_put:
>> >   hmat_free_structures();
>> > diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
>> > index 69e781900082..2f845e90c033 100644
>> > --- a/include/linux/memory-tiers.h
>> > +++ b/include/linux/memory-tiers.h
>> > @@ -48,6 +48,7 @@ int mt_calc_adistance(int node, int *adist);
>> >  int mt_set_default_dram_perf(int nid, struct access_coordinate *perf,
>> >const char *source);
>> >  int mt_perf_to_adistance(struct access_coordinate *perf, int *adist);
>> > +void mt_init_with_hmat(void);
>>
>> HMAT isn't universally available.  It's a driver in fact.  So, don't put
>> driver specific code in general code.
>>
>
> Please see below regarding "move code to hmat.c"
>
>> >  #ifdef CONFIG_MIGRATION
>> >  int next_demotion_node(int node);
>> >  void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
>> > @@ -136,5 +137,10 @@ static inline int mt_perf_to_adistance(struct 
>> > access_coordinate *perf, int *adis
>> >  {
>> >   return -EIO;
>> >  }
>> > +
>> > +static inline void mt_init_with_hmat(void)
>> > +{
>> > +
>> > +}
>> >  #endif   /* CONFIG_NUMA */
>> >  #endif  /* _LINUX_MEMORY_TIERS_H */
>> > diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
>> > index 0537664620e5..7a0a579b3deb 100644
>> > --- a/mm/memory-tiers.c
>> > +++ b/mm/memory-tiers.c
>> > @@ -35,7 +35,9 @@ struct node_memory_type_map {
>> >  };
>> >
>> >  static DEFINE_MUTEX(memory_tier_lock);
>> > +static DEFINE_MUTEX(mt_perf_lock);
>> >  static LIST_HEAD(memory_tiers);
>> > +static LIST_HEAD(hmat_memory_types);

Re: [External] Re: [PATCH v1 0/1] Improved Memory Tier Creation for CPUless NUMA Nodes

2024-03-04 Thread Huang, Ying
"Ho-Ren (Jack) Chuang"  writes:

> On Mon, Mar 4, 2024 at 10:36 PM Huang, Ying  wrote:
>>
>> "Ho-Ren (Jack) Chuang"  writes:
>>
>> > On Sun, Mar 3, 2024 at 6:47 PM Huang, Ying  wrote:
>> >>
>> >> "Ho-Ren (Jack) Chuang"  writes:
>> >>
>> >> > The memory tiering component in the kernel is functionally useless for
>> >> > CPUless memory/non-DRAM devices like CXL1.1 type3 memory because the 
>> >> > nodes
>> >> > are lumped together in the DRAM tier.
>> >> > https://lore.kernel.org/linux-mm/ph0pr08mb7955e9f08ccb64f23963b5c3a8...@ph0pr08mb7955.namprd08.prod.outlook.com/T/
>> >>
>> >> I think that it's unfair to call it "useless".  Yes, it doesn't work if
>> >> the CXL memory devices are not enumerated via drivers/dax/kmem.c.  So,
>> >> please be specific about in which cases it doesn't work instead of the
>> >> too general "useless".
>> >>
>> >
>> > Thank you and I didn't mean anything specific. I simply reused phrases
>> > we discussed
>> > earlier in the previous patchset. I will change them to the following in 
>> > v2:
>> > "At boot time, current memory tiering assigns all detected memory nodes
>> > to the same DRAM tier. This results in CPUless memory/non-DRAM devices,
>> > such as CXL1.1 type3 memory, being unable to be assigned to the
>> > correct memory tier,
>> > leading to the inability to migrate pages between different types of 
>> > memory."
>> >
>> > Please see if this looks more specific.
>>
>> I don't think that the description above is accurate.  In fact, there
>> are 2 ways to enumerate the memory device,
>>
>> 1. Mark it as reserved memory (E820_TYPE_SOFT_RESERVED, etc.) in E820
>>table or something similar.
>>
>> 2. Mark it as normal memory (E820_TYPE_RAM) in E820 table or something
>>similar
>>
>> For 1, the memory device (including CXL memory) is onlined via
>> drivers/dax/kmem.c, so it will be put in the proper memory tier.  For 2,
>> the memory device is indistinguishable from normal DRAM with the current
>> implementation.  And this is what this patch is working on.
>>
>> Right?
>
> Good point! How about this?:
> "
> When a memory device, such as CXL1.1 type3 memory, is emulated as
> normal memory (E820_TYPE_RAM), the memory device is indistinguishable
> from normal DRAM in terms of memory tiering with the current implementation.
> The current memory tiering assigns all detected normal memory nodes
> to the same DRAM tier. This results in normal memory devices with
> different attributions being unable to be assigned to the correct memory tier,
> leading to the inability to migrate pages between different types of memory.
> "

Looks good to me!  Thanks!

--
Best Regards,
Huang, Ying



Re: [External] Re: [PATCH v1 0/1] Improved Memory Tier Creation for CPUless NUMA Nodes

2024-03-04 Thread Huang, Ying
"Ho-Ren (Jack) Chuang"  writes:

> On Sun, Mar 3, 2024 at 6:47 PM Huang, Ying  wrote:
>>
>> "Ho-Ren (Jack) Chuang"  writes:
>>
>> > The memory tiering component in the kernel is functionally useless for
>> > CPUless memory/non-DRAM devices like CXL1.1 type3 memory because the nodes
>> > are lumped together in the DRAM tier.
>> > https://lore.kernel.org/linux-mm/ph0pr08mb7955e9f08ccb64f23963b5c3a8...@ph0pr08mb7955.namprd08.prod.outlook.com/T/
>>
>> I think that it's unfair to call it "useless".  Yes, it doesn't work if
>> the CXL memory devices are not enumerated via drivers/dax/kmem.c.  So,
>> please be specific about in which cases it doesn't work instead of the
>> too general "useless".
>>
>
> Thank you and I didn't mean anything specific. I simply reused phrases
> we discussed
> earlier in the previous patchset. I will change them to the following in v2:
> "At boot time, current memory tiering assigns all detected memory nodes
> to the same DRAM tier. This results in CPUless memory/non-DRAM devices,
> such as CXL1.1 type3 memory, being unable to be assigned to the
> correct memory tier,
> leading to the inability to migrate pages between different types of memory."
>
> Please see if this looks more specific.

I don't think that the description above is accurate.  In fact, there
are 2 ways to enumerate the memory device,

1. Mark it as reserved memory (E820_TYPE_SOFT_RESERVED, etc.) in E820
   table or something similar.

2. Mark it as normal memory (E820_TYPE_RAM) in E820 table or something
   similar

For 1, the memory device (including CXL memory) is onlined via
drivers/dax/kmem.c, so it will be put in the proper memory tier.  For 2,
the memory device is indistinguishable from normal DRAM with the current
implementation.  And this is what this patch is working on.

Right?

--
Best Regards,
Huang, Ying

>> > This patchset automatically resolves the issues. It delays the 
>> > initialization
>> > of memory tiers for CPUless NUMA nodes until they obtain HMAT information
>> > at boot time, eliminating the need for user intervention.
>> > If no HMAT specified, it falls back to using `default_dram_type`.
>> >
>> > Example usecase:
>> > We have CXL memory on the host, and we create VMs with a new system memory
>> > device backed by host CXL memory. We inject CXL memory performance 
>> > attributes
>> > through QEMU, and the guest now sees memory nodes with performance 
>> > attributes
>> > in HMAT. With this change, we enable the guest kernel to construct
>> > the correct memory tiering for the memory nodes.
>> >
>> > Ho-Ren (Jack) Chuang (1):
>> >   memory tier: acpi/hmat: create CPUless memory tiers after obtaining
>> > HMAT info
>> >
>> >  drivers/acpi/numa/hmat.c |  3 ++
>> >  include/linux/memory-tiers.h |  6 +++
>> >  mm/memory-tiers.c| 76 
>> >  3 files changed, 77 insertions(+), 8 deletions(-)
>>
>> --
>> Best Regards,
>> Huang, Ying



Re: [PATCH v1 0/1] Improved Memory Tier Creation for CPUless NUMA Nodes

2024-03-03 Thread Huang, Ying
"Ho-Ren (Jack) Chuang"  writes:

> The memory tiering component in the kernel is functionally useless for
> CPUless memory/non-DRAM devices like CXL1.1 type3 memory because the nodes
> are lumped together in the DRAM tier.
> https://lore.kernel.org/linux-mm/ph0pr08mb7955e9f08ccb64f23963b5c3a8...@ph0pr08mb7955.namprd08.prod.outlook.com/T/

I think that it's unfair to call it "useless".  Yes, it doesn't work if
the CXL memory devices are not enumerated via drivers/dax/kmem.c.  So,
please be specific about in which cases it doesn't work instead of the
too general "useless".

> This patchset automatically resolves the issues. It delays the initialization
> of memory tiers for CPUless NUMA nodes until they obtain HMAT information
> at boot time, eliminating the need for user intervention.
> If no HMAT specified, it falls back to using `default_dram_type`.
>
> Example usecase:
> We have CXL memory on the host, and we create VMs with a new system memory
> device backed by host CXL memory. We inject CXL memory performance attributes
> through QEMU, and the guest now sees memory nodes with performance attributes
> in HMAT. With this change, we enable the guest kernel to construct
> the correct memory tiering for the memory nodes.
>
> Ho-Ren (Jack) Chuang (1):
>   memory tier: acpi/hmat: create CPUless memory tiers after obtaining
> HMAT info
>
>  drivers/acpi/numa/hmat.c |  3 ++
>  include/linux/memory-tiers.h |  6 +++
>  mm/memory-tiers.c| 76 ++++
>  3 files changed, 77 insertions(+), 8 deletions(-)

--
Best Regards,
Huang, Ying



Re: [PATCH v1 1/1] memory tier: acpi/hmat: create CPUless memory tiers after obtaining HMAT info

2024-03-03 Thread Huang, Ying
truct 
> memory_dev_type *memtype)
>  }
>  EXPORT_SYMBOL_GPL(clear_node_memory_type);
>  
> +static struct memory_dev_type *hmat_find_alloc_memory_type(int adist)

A similar function exists in drivers/dax/kmem.c.  Please abstract them
and move them here.

> +{
> + bool found = false;
> + struct memory_dev_type *mtype;
> +
> + list_for_each_entry(mtype, &hmat_memory_types, list) {
> + if (mtype->adistance == adist) {
> + found = true;
> + break;
> + }
> + }
> + if (!found) {
> + mtype = alloc_memory_type(adist);
> + if (!IS_ERR(mtype))
> + list_add(&mtype->list, &hmat_memory_types);
> + }
> + return mtype;
> +}
> +
> +static void mt_create_with_hmat(int node)
> +{
> + struct memory_dev_type *mtype = NULL;
> + int adist = MEMTIER_ADISTANCE_DRAM;
> +
> + mt_calc_adistance(node, &adist);
> + if (adist != MEMTIER_ADISTANCE_DRAM) {
> + mtype = hmat_find_alloc_memory_type(adist);
> + if (IS_ERR(mtype))
> + pr_err("%s() failed to allocate a tier\n", __func__);
> + } else {
> + mtype = default_dram_type;
> + }
> +
> + set_node_memory_tier(node, mtype);
> +}
> +
> +void mt_init_with_hmat(void)
> +{
> + int nid;
> +
> + mutex_lock(&memory_tier_lock);
> + for_each_node_state(nid, N_MEMORY)
> + if (!node_state(nid, N_CPU))
> + mt_create_with_hmat(nid);
> +
> + establish_demotion_targets();
> + mutex_unlock(&memory_tier_lock);
> +}
> +EXPORT_SYMBOL_GPL(mt_init_with_hmat);
> +

I guess that we can put most hmat related code above in hmat.c.

>  static void dump_hmem_attrs(struct access_coordinate *coord, const char 
> *prefix)
>  {
>   pr_info(
> @@ -636,7 +688,7 @@ int mt_set_default_dram_perf(int nid, struct 
> access_coordinate *perf,
>  {
>   int rc = 0;
>  
> - mutex_lock(&memory_tier_lock);
> + mutex_lock(&mt_perf_lock);
>   if (default_dram_perf_error) {
>   rc = -EIO;
>   goto out;
> @@ -684,7 +736,7 @@ int mt_set_default_dram_perf(int nid, struct 
> access_coordinate *perf,
>   }
>  
>  out:
> - mutex_unlock(&memory_tier_lock);
> + mutex_unlock(&mt_perf_lock);
>   return rc;
>  }
>  
> @@ -700,7 +752,7 @@ int mt_perf_to_adistance(struct access_coordinate *perf, 
> int *adist)
>   perf->read_bandwidth + perf->write_bandwidth == 0)
>   return -EINVAL;
>  
> - mutex_lock(&memory_tier_lock);
> + mutex_lock(&mt_perf_lock);
>   /*
>* The abstract distance of a memory node is in direct proportion to
>* its memory latency (read + write) and inversely proportional to its
> @@ -713,7 +765,7 @@ int mt_perf_to_adistance(struct access_coordinate *perf, 
> int *adist)
>   (default_dram_perf.read_latency + 
> default_dram_perf.write_latency) *
>   (default_dram_perf.read_bandwidth + 
> default_dram_perf.write_bandwidth) /
>   (perf->read_bandwidth + perf->write_bandwidth);
> - mutex_unlock(&memory_tier_lock);
> + mutex_unlock(&mt_perf_lock);
>  
>   return 0;
>  }
> @@ -797,7 +849,7 @@ static int __meminit memtier_hotplug_callback(struct 
> notifier_block *self,
>   break;
>   case MEM_ONLINE:
>   mutex_lock(&memory_tier_lock);
> - memtier = set_node_memory_tier(arg->status_change_nid);
> + memtier = set_node_memory_tier(arg->status_change_nid, 
> default_dram_type);
>   if (!IS_ERR(memtier))
>   establish_demotion_targets();
>   mutex_unlock(&memory_tier_lock);
> @@ -836,7 +888,15 @@ static int __init memory_tier_init(void)
>* types assigned.
>*/
>   for_each_node_state(node, N_MEMORY) {
> - memtier = set_node_memory_tier(node);
> + if (!node_state(node, N_CPU))
> + /*
> +  * Defer memory tier initialization on CPUless numa 
> nodes.
> +  * These will be initialized when HMAT information is
> +  * available.
> +  */
> + continue;
> +
> + memtier = set_node_memory_tier(node, default_dram_type);

On a system with HMAT, how do we fall back a CPU-less node to
default_dram_type?  I found your description, but I don't find it in the code.

>   if (IS_ERR(memtier))
>   /*
>* Continue with memtiers we are able to setup

--
Best Regards,
Huang, Ying



[Qemu-devel] [PATCH -v4] Monitor command: x-gpa2hva, translate guest physical address to host virtual address

2011-05-15 Thread Huang Ying
Add command x-gpa2hva to translate a guest physical address to a host
virtual address. Because the gpa-to-hva translation is not consistent,
this command is only used for debugging.

The x-gpa2hva command provides one step in a chain of translations from
guest virtual to guest physical to host virtual to host physical. Host
physical is then used to inject a machine check error. As a
consequence the HWPOISON code on the host and the MCE injection code
in qemu-kvm are exercised.
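
Example usage from the human monitor (the addresses below are illustrative
only):

  (qemu) x-gpa2hva 0x12345000
  0x7f63b2345000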

v4:

- Rebased on qemu.git

v3:

- Rename to x-gpa2hva
- Remove QMP support, because gpa2hva is not consistent

v2:

- Add QMP support

Signed-off-by: Max Asbock masb...@linux.vnet.ibm.com
Signed-off-by: Jiajia Zheng jiajia.zh...@intel.com
Signed-off-by: Huang Ying ying.hu...@intel.com
---
 hmp-commands.hx |   15 +++
 monitor.c   |   22 ++
 2 files changed, 37 insertions(+)

--- a/monitor.c
+++ b/monitor.c
@@ -2713,6 +2713,28 @@ static void do_inject_mce(Monitor *mon,
 }
 #endif
 
+static void do_gpa2hva_print(Monitor *mon, const QObject *data)
+{
+QInt *qint;
+
+qint = qobject_to_qint(data);
+monitor_printf(mon, "0x%lx\n", (unsigned long)qint->value);
+}
+
+static int do_gpa2hva(Monitor *mon, const QDict *qdict, QObject **ret_data)
+{
+target_phys_addr_t paddr;
+target_phys_addr_t size = TARGET_PAGE_SIZE;
+void *vaddr;
+
+paddr = qdict_get_int(qdict, "addr");
+vaddr = cpu_physical_memory_map(paddr, size, 0);
+cpu_physical_memory_unmap(vaddr, size, 0, 0);
+*ret_data = qobject_from_jsonf("%ld", (unsigned long)vaddr);
+
+return 0;
+}
+
 static int do_getfd(Monitor *mon, const QDict *qdict, QObject **ret_data)
 {
 const char *fdname = qdict_get_str(qdict, "fdname");
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -330,6 +330,21 @@ Start gdbserver session (default @var{po
 ETEXI
 
 {
+.name   = "x-gpa2hva",
+.args_type  = "fmt:/,addr:l",
+.params = "/fmt addr",
+.help   = "translate guest physical 'addr' to host virtual address, only for debugging",
+.user_print = do_gpa2hva_print,
+.mhandler.cmd_new = do_gpa2hva,
+},
+
+STEXI
+@item x-gpa2hva @var{addr}
+@findex x-gpa2hva
+Translate guest physical @var{addr} to host virtual address, only for 
debugging.
+ETEXI
+
+{
 .name   = "x",
 .args_type  = "fmt:/,addr:l",
 .params = "/fmt addr",



[Qemu-devel] Re: [PATCH uq/master -v2 2/2] KVM, MCE, unpoison memory address across reboot

2011-02-10 Thread Huang Ying
On Wed, 2011-02-09 at 16:00 +0800, Jan Kiszka wrote:
 On 2011-02-09 04:00, Huang Ying wrote:
  In Linux kernel HWPoison processing implementation, the virtual
  address in processes mapping the error physical memory page is marked
  as HWPoison.  So that, the further accessing to the virtual
  address will kill corresponding processes with SIGBUS.
  
  If the error physical memory page is used by a KVM guest, the SIGBUS
  will be sent to QEMU, and QEMU will simulate a MCE to report that
  memory error to the guest OS.  If the guest OS can not recover from
  the error (for example, the page is accessed by kernel code), guest OS
  will reboot the system.  But because the underlying host virtual
  address backing the guest physical memory is still poisoned, if the
  guest system accesses the corresponding guest physical memory even
  after rebooting, the SIGBUS will still be sent to QEMU and MCE will be
  simulated.  That is, guest system can not recover via rebooting.
 
 Yeah, saw this already during my test...
 
  
  In fact, across rebooting, the contents of guest physical memory page
  need not to be kept.  We can allocate a new host physical page to
  back the corresponding guest physical address.
 
 I just wondering what would be architecturally suboptimal if we simply
 remapped on SIGBUS directly. Would save us at least the bookkeeping.

Because we cannot change the contents of memory silently while the guest OS
is running, this may corrupt guest OS data structures and even ruin disk
contents.  But during rebooting, all guest OS state is discarded.

[snip]
  @@ -1882,6 +1919,7 @@ int kvm_arch_on_sigbus_vcpu(CPUState *en
   hardware_memory_error();
   }
   }
  +kvm_hwpoison_page_add(ram_addr);
   
   if (code == BUS_MCEERR_AR) {
   /* Fake an Intel architectural Data Load SRAR UCR */
  @@ -1926,6 +1964,7 @@ int kvm_arch_on_sigbus(int code, void *a
   QEMU itself instead of guest system!: %p\n, addr);
   return 0;
   }
  +kvm_hwpoison_page_add(ram_addr);
   kvm_mce_inj_srao_memscrub2(first_cpu, paddr);
   } else
   #endif
  
  
 
 Looks fine otherwise. Unless that simplification makes sense, I could
 offer to include this into my MCE rework (there is some minor conflict).
 If all goes well, that series should be posted during this week.

Thanks.

Best Regards,
Huang Ying




[Qemu-devel] Re: [PATCH uq/master -v2 2/2] KVM, MCE, unpoison memory address across reboot

2011-02-10 Thread Huang Ying
On Thu, 2011-02-10 at 16:52 +0800, Jan Kiszka wrote:
 On 2011-02-10 01:27, Huang Ying wrote:
  @@ -1882,6 +1919,7 @@ int kvm_arch_on_sigbus_vcpu(CPUState *en
   hardware_memory_error();
   }
   }
  +kvm_hwpoison_page_add(ram_addr);
   
   if (code == BUS_MCEERR_AR) {
   /* Fake an Intel architectural Data Load SRAR UCR */
  @@ -1926,6 +1964,7 @@ int kvm_arch_on_sigbus(int code, void *a
   QEMU itself instead of guest system!: %p\n, addr);
   return 0;
   }
  +kvm_hwpoison_page_add(ram_addr);
   kvm_mce_inj_srao_memscrub2(first_cpu, paddr);
   } else
   #endif
 
 
 
  Looks fine otherwise. Unless that simplification makes sense, I could
  offer to include this into my MCE rework (there is some minor conflict).
  If all goes well, that series should be posted during this week.
 
 Please have a look at
 
 git://git.kiszka.org/qemu-kvm.git queues/kvm-upstream
 
 and tell me if it works for you and your signed-off still applies.

Thanks!  Works as expected in my testing!

Best Regards,
Huang Ying





[Qemu-devel] [PATCH uq/master -v2 1/2] Add qemu_ram_remap

2011-02-08 Thread Huang Ying
qemu_ram_remap() unmaps the specified RAM pages, then maps these
pages again.  This is used by KVM HWPoison support to clear HWPoisoned
page tables across guest rebooting, so that a new page may be
allocated later to recover from the memory error.

Signed-off-by: Huang Ying ying.hu...@intel.com
---
 cpu-all.h|4 +++
 cpu-common.h |1 
 exec.c   |   61 ++-
 3 files changed, 65 insertions(+), 1 deletion(-)

--- a/cpu-all.h
+++ b/cpu-all.h
@@ -863,10 +863,14 @@ target_phys_addr_t cpu_get_phys_page_deb
 extern int phys_ram_fd;
 extern ram_addr_t ram_size;
 
+/* RAM is pre-allocated and passed into qemu_ram_alloc_from_ptr */
+#define RAM_PREALLOC_MASK  (1 << 0)
+
 typedef struct RAMBlock {
 uint8_t *host;
 ram_addr_t offset;
 ram_addr_t length;
+uint32_t flags;
 char idstr[256];
 QLIST_ENTRY(RAMBlock) next;
 #if defined(__linux__) && !defined(TARGET_S390X)
--- a/exec.c
+++ b/exec.c
@@ -2837,6 +2837,7 @@ ram_addr_t qemu_ram_alloc_from_ptr(Devic
 
 if (host) {
 new_block->host = host;
+new_block->flags |= RAM_PREALLOC_MASK;
 } else {
 if (mem_path) {
 #if defined (__linux__) && !defined(TARGET_S390X)
@@ -2890,7 +2891,9 @@ void qemu_ram_free(ram_addr_t addr)
 QLIST_FOREACH(block, ram_list.blocks, next) {
 if (addr == block->offset) {
 QLIST_REMOVE(block, next);
-if (mem_path) {
+if (block->flags & RAM_PREALLOC_MASK) {
+;
+} else if (mem_path) {
 #if defined (__linux__) && !defined(TARGET_S390X)
 if (block->fd) {
 munmap(block->host, block->length);
@@ -2913,6 +2916,62 @@ void qemu_ram_free(ram_addr_t addr)
 
 }
 
+void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
+{
+RAMBlock *block;
+ram_addr_t offset;
+int flags;
+void *area, *vaddr;
+
+QLIST_FOREACH(block, ram_list.blocks, next) {
+offset = addr - block->offset;
+if (offset < block->length) {
+vaddr = block->host + offset;
+if (block->flags & RAM_PREALLOC_MASK) {
+;
+} else {
+flags = MAP_FIXED;
+munmap(vaddr, length);
+if (mem_path) {
+#if defined (__linux__) && !defined(TARGET_S390X)
+if (block->fd) {
+#ifdef MAP_POPULATE
+flags |= mem_prealloc ? MAP_POPULATE | MAP_SHARED :
+MAP_PRIVATE;
+#else
+flags |= MAP_PRIVATE;
+#endif
+area = mmap(vaddr, length, PROT_READ | PROT_WRITE,
+flags, block->fd, offset);
+} else {
+flags |= MAP_PRIVATE | MAP_ANONYMOUS;
+area = mmap(vaddr, length, PROT_READ | PROT_WRITE,
+flags, -1, 0);
+}
+#endif
+} else {
+#if defined(TARGET_S390X) && defined(CONFIG_KVM)
+flags |= MAP_SHARED | MAP_ANONYMOUS;
+area = mmap(vaddr, length, PROT_EXEC|PROT_READ|PROT_WRITE,
+flags, -1, 0);
+#else
+flags |= MAP_PRIVATE | MAP_ANONYMOUS;
+area = mmap(vaddr, length, PROT_READ | PROT_WRITE,
+flags, -1, 0);
+#endif
+}
+if (area != vaddr) {
+fprintf(stderr, "Could not remap addr: %lx@%lx\n",
+length, addr);
+exit(1);
+}
+qemu_madvise(vaddr, length, QEMU_MADV_MERGEABLE);
+}
+return;
+}
+}
+}
+
 /* Return a host pointer to ram allocated with qemu_ram_alloc.
With the exception of the softmmu code in this file, this should
only be used for local memory (e.g. video ram) that the device owns,
--- a/cpu-common.h
+++ b/cpu-common.h
@@ -50,6 +50,7 @@ ram_addr_t qemu_ram_alloc_from_ptr(Devic
 ram_addr_t size, void *host);
 ram_addr_t qemu_ram_alloc(DeviceState *dev, const char *name, ram_addr_t size);
 void qemu_ram_free(ram_addr_t addr);
+void qemu_ram_remap(ram_addr_t addr, ram_addr_t length);
 /* This should only be used for ram local to a device.  */
 void *qemu_get_ram_ptr(ram_addr_t addr);
 /* Same but slower, to use for migration, where the order of





[Qemu-devel] [PATCH uq/master -v2 2/2] KVM, MCE, unpoison memory address across reboot

2011-02-08 Thread Huang Ying
In the Linux kernel HWPoison processing implementation, the virtual
address in processes mapping the erroneous physical memory page is marked
as HWPoison.  So further accesses to the virtual address will kill the
corresponding processes with SIGBUS.

If the erroneous physical memory page is used by a KVM guest, the SIGBUS
will be sent to QEMU, and QEMU will simulate an MCE to report that
memory error to the guest OS.  If the guest OS cannot recover from
the error (for example, the page is accessed by kernel code), the guest OS
will reboot the system.  But because the underlying host virtual
address backing the guest physical memory is still poisoned, if the
guest system accesses the corresponding guest physical memory even
after rebooting, the SIGBUS will still be sent to QEMU and an MCE will be
simulated.  That is, the guest system cannot recover via rebooting.

In fact, across rebooting, the contents of guest physical memory page
need not to be kept.  We can allocate a new host physical page to
back the corresponding guest physical address.

This patch fixes this issue in QEMU-KVM via calling qemu_ram_remap()
to clear the corresponding page table entry, so that make it possible
to allocate a new page to recover the issue.

Signed-off-by: Huang Ying ying.hu...@intel.com
---
 target-i386/kvm.c |   39 +++
 1 file changed, 39 insertions(+)

--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -508,6 +508,42 @@ static int kvm_get_supported_msrs(KVMSta
 return ret;
 }
 
+struct HWPoisonPage;
+typedef struct HWPoisonPage HWPoisonPage;
+struct HWPoisonPage
+{
+ram_addr_t ram_addr;
+QLIST_ENTRY(HWPoisonPage) list;
+};
+
+static QLIST_HEAD(hwpoison_page_list, HWPoisonPage) hwpoison_page_list =
+QLIST_HEAD_INITIALIZER(hwpoison_page_list);
+
+static void kvm_unpoison_all(void *param)
+{
+HWPoisonPage *page, *next_page;
+
+QLIST_FOREACH_SAFE(page, &hwpoison_page_list, list, next_page) {
+QLIST_REMOVE(page, list);
+qemu_ram_remap(page->ram_addr, TARGET_PAGE_SIZE);
+qemu_free(page);
+}
+}
+
+static void kvm_hwpoison_page_add(ram_addr_t ram_addr)
+{
+HWPoisonPage *page;
+
+QLIST_FOREACH(page, &hwpoison_page_list, list) {
+if (page->ram_addr == ram_addr)
+return;
+}
+
+page = qemu_malloc(sizeof(HWPoisonPage));
+page->ram_addr = ram_addr;
+QLIST_INSERT_HEAD(&hwpoison_page_list, page, list);
+}
+
 int kvm_arch_init(KVMState *s)
 {
 uint64_t identity_base = 0xfffbc000;
@@ -556,6 +592,7 @@ int kvm_arch_init(KVMState *s)
fprintf(stderr, "e820_add_entry() table is full\n");
 return ret;
 }
+qemu_register_reset(kvm_unpoison_all, NULL);
 
 return 0;
 }
@@ -1882,6 +1919,7 @@ int kvm_arch_on_sigbus_vcpu(CPUState *en
 hardware_memory_error();
 }
 }
+kvm_hwpoison_page_add(ram_addr);
 
 if (code == BUS_MCEERR_AR) {
 /* Fake an Intel architectural Data Load SRAR UCR */
@@ -1926,6 +1964,7 @@ int kvm_arch_on_sigbus(int code, void *a
"QEMU itself instead of guest system!: %p\n", addr);
 return 0;
 }
+kvm_hwpoison_page_add(ram_addr);
 kvm_mce_inj_srao_memscrub2(first_cpu, paddr);
 } else
 #endif





[Qemu-devel] Re: [PATCH uq/master 2/2] MCE, unpoison memory address across reboot

2011-01-16 Thread Huang Ying
On Fri, 2011-01-14 at 16:38 +0800, Jan Kiszka wrote:
 Am 14.01.2011 02:51, Huang Ying wrote:
  On Thu, 2011-01-13 at 17:01 +0800, Jan Kiszka wrote:
  Am 13.01.2011 09:34, Huang Ying wrote:
[snip]
  +
  +void kvm_unpoison_all(void *param)
 
  Minor nit: This can be static now.
  
  In uq/master, it can be make static.  But in kvm/master, kvm_arch_init
  is not compiled because of conditional compiling, so we will get warning
  and error for unused symbol.  Should we consider kvm/master in this
  patch?
 
 qemu-kvm is very close to switching to upstream kvm_*init. As long as it
 requires this service in its own modules, it will have to patch this
 detail. It does this for other functions already.

OK.  I will change this.

[snip]
  As indicated, I'm sitting on lots of fixes and refactorings of the MCE
  user space code. How do you test your patches? Any suggestions how to do
  this efficiently would be warmly welcome.
  
  We use a self-made test script to test.  Repository is at:
  
  git://git.kernel.org/pub/scm/utils/cpu/mce/mce-test.git
  
  The kvm test script is in kvm sub-directory.
  
  The qemu patch attached is need by the test script.
  
 
 Yeah, I already found this yesterday and started reading. I was just
 searching for p2v in qemu, but now it's clear where it comes from. Will
 have a look (if you want to preview my changes:
 git://git.kiszka.org/qemu-kvm.git queues/kvm-upstream).
 
 I was almost about to use MADV_HWPOISON instead of the injection module.
 Is there a way to recover the fake corruption afterward? I think that
 would allow to move some of the test logic into qemu and avoid p2v which
 - IIRC - was disliked upstream.

I don't know how to fully recover from MADV_HWPOISON.  You can recover
the virtual address space via qemu_ram_remap(), introduced in 1/2 of
this patchset, but you will lose one or several physical pages for each
test.  I think that is not a big issue for a testing machine.

Ccing Andi and Fengguang, they know more than me about MADV_HWPOISON.
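
For reference, here is a minimal, hypothetical sketch (not part of
mce-test) of injecting a software-simulated error with MADV_HWPOISON.
It assumes a Linux host built with CONFIG_MEMORY_FAILURE and needs
CAP_SYS_ADMIN; the poisoned physical page stays unusable afterwards:

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_HWPOISON
#define MADV_HWPOISON 100               /* value from the Linux UAPI headers */
#endif

int main(void)
{
    long page = sysconf(_SC_PAGESIZE);
    /* Map and populate one anonymous page. */
    char *p = mmap(NULL, page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    memset(p, 0xaa, page);
    /* Ask the kernel to treat the backing page as hardware-poisoned. */
    if (madvise(p, page, MADV_HWPOISON) < 0) {
        perror("madvise(MADV_HWPOISON)");
        return 1;
    }
    /* From here on, touching *p raises SIGBUS (BUS_MCEERR_AR). */
    return 0;
}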

 Also, is there a way to simulate corrected errors (BUS_MCEERR_AO)?

BUS_MCEERR_AO is a recoverable uncorrected error, not a corrected
error.

The test script is for BUS_MCEERR_AO and BUS_MCEERR_AR.  To see the
effect of pure BUS_MCEERR_AO, just remove the memory accessing loop
(memset) in tools/simple_process/simple_process.c.
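
As an illustration of the difference (a hypothetical sketch, not taken
from mce-test): a victim process can tell the two apart in its SIGBUS
handler through siginfo, assuming a kernel that delivers the
BUS_MCEERR_* codes:

#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#ifndef BUS_MCEERR_AR
#define BUS_MCEERR_AR 4                 /* values from the Linux UAPI headers */
#define BUS_MCEERR_AO 5
#endif

/* BUS_MCEERR_AO arrives asynchronously (action optional, the bad data has
 * not been consumed yet); BUS_MCEERR_AR arrives synchronously when the
 * poisoned data is actually consumed (action required). */
static void sigbus_handler(int sig, siginfo_t *si, void *ctx)
{
    (void)sig;
    (void)ctx;
    if (si->si_code == BUS_MCEERR_AO) {
        fprintf(stderr, "AO: page around %p reported bad\n", si->si_addr);
    } else if (si->si_code == BUS_MCEERR_AR) {
        fprintf(stderr, "AR: poisoned data consumed at %p\n", si->si_addr);
        _exit(1);                       /* must not touch that page again */
    }
}

int main(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = sigbus_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGBUS, &sa, NULL);
    pause();                            /* wait for an injected error */
    return 0;
}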

Best Regards,
Huang Ying





[Qemu-devel] [PATCH uq/master 2/2] MCE, unpoison memory address across reboot

2011-01-13 Thread Huang Ying
In the Linux kernel's HWPoison handling, the virtual address in any
process mapping the erroneous physical memory page is marked as
HWPoison, so that further accesses to that virtual address kill the
corresponding process with SIGBUS.

If the erroneous physical memory page is used by a KVM guest, the
SIGBUS is sent to QEMU, and QEMU simulates an MCE to report the memory
error to the guest OS.  If the guest OS cannot recover from the error
(for example, because the page is accessed by kernel code), the guest
OS reboots the system.  But because the underlying host virtual address
backing the guest physical memory is still poisoned, if the guest
accesses the corresponding guest physical memory even after rebooting,
the SIGBUS is sent to QEMU again and another MCE is simulated.  That
is, the guest cannot recover by rebooting.

In fact, across a reboot the contents of the guest physical memory page
need not be kept.  We can allocate a new host physical page to back the
corresponding guest physical address.

This patch fixes this issue in QEMU by calling qemu_ram_remap() to
clear the corresponding page table entry, making it possible to
allocate a new page to recover from the error.

Signed-off-by: Huang Ying ying.hu...@intel.com
---
 kvm.h |2 ++
 target-i386/kvm.c |   39 +++
 2 files changed, 41 insertions(+)

--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -580,6 +580,42 @@ static int kvm_get_supported_msrs(void)
 return ret;
 }
 
+struct HWPoisonPage;
+typedef struct HWPoisonPage HWPoisonPage;
+struct HWPoisonPage
+{
+ram_addr_t ram_addr;
+QLIST_ENTRY(HWPoisonPage) list;
+};
+
+static QLIST_HEAD(hwpoison_page_list, HWPoisonPage) hwpoison_page_list =
+QLIST_HEAD_INITIALIZER(hwpoison_page_list);
+
+void kvm_unpoison_all(void *param)
+{
+HWPoisonPage *page, *next_page;
+
+QLIST_FOREACH_SAFE(page, &hwpoison_page_list, list, next_page) {
+QLIST_REMOVE(page, list);
+qemu_ram_remap(page->ram_addr, TARGET_PAGE_SIZE);
+qemu_free(page);
+}
+}
+
+static void kvm_hwpoison_page_add(ram_addr_t ram_addr)
+{
+HWPoisonPage *page;
+
+QLIST_FOREACH(page, &hwpoison_page_list, list) {
+if (page->ram_addr == ram_addr)
+return;
+}
+
+page = qemu_malloc(sizeof(HWPoisonPage));
+page->ram_addr = ram_addr;
+QLIST_INSERT_HEAD(&hwpoison_page_list, page, list);
+}
+
 int kvm_arch_init(void)
 {
 uint64_t identity_base = 0xfffbc000;
@@ -632,6 +668,7 @@ int kvm_arch_init(void)
fprintf(stderr, "e820_add_entry() table is full\n");
 return ret;
 }
+qemu_register_reset(kvm_unpoison_all, NULL);
 
 return 0;
 }
@@ -1940,6 +1977,7 @@ int kvm_on_sigbus_vcpu(CPUState *env, in
 hardware_memory_error();
 }
 }
+kvm_hwpoison_page_add(ram_addr);
 
 if (code == BUS_MCEERR_AR) {
 /* Fake an Intel architectural Data Load SRAR UCR */
@@ -1984,6 +2022,7 @@ int kvm_on_sigbus(int code, void *addr)
"QEMU itself instead of guest system!: %p\n", addr);
 return 0;
 }
+kvm_hwpoison_page_add(ram_addr);
 kvm_mce_inj_srao_memscrub2(first_cpu, paddr);
 } else
 #endif
--- a/kvm.h
+++ b/kvm.h
@@ -188,6 +188,8 @@ int kvm_physical_memory_addr_from_ram(ra
   target_phys_addr_t *phys_addr);
 #endif
 
+void kvm_unpoison_all(void *param);
+
 #endif
 int kvm_set_ioeventfd_mmio_long(int fd, uint32_t adr, uint32_t val, bool 
assign);
 





[Qemu-devel] [PATCH uq/master 1/2] Add qemu_ram_remap

2011-01-13 Thread Huang Ying
qemu_ram_remap() unmaps the specified RAM pages, then re-maps these
pages again.  This is used by KVM HWPoison support to clear HWPoisoned
page tables across guest rebooting, so that a new page may be
allocated later to recover the memory error.

Signed-off-by: Huang Ying ying.hu...@intel.com
---
 cpu-all.h|4 +++
 cpu-common.h |1 
 exec.c   |   61 ++-
 3 files changed, 65 insertions(+), 1 deletion(-)

--- a/cpu-all.h
+++ b/cpu-all.h
@@ -863,10 +863,14 @@ target_phys_addr_t cpu_get_phys_page_deb
 extern int phys_ram_fd;
 extern ram_addr_t ram_size;
 
+/* RAM is pre-allocated and passed into qemu_ram_alloc_from_ptr */
+#define RAM_PREALLOC_MASK  (1 << 0)
+
 typedef struct RAMBlock {
 uint8_t *host;
 ram_addr_t offset;
 ram_addr_t length;
+uint32_t flags;
 char idstr[256];
 QLIST_ENTRY(RAMBlock) next;
#if defined(__linux__) && !defined(TARGET_S390X)
--- a/exec.c
+++ b/exec.c
@@ -2830,6 +2830,7 @@ ram_addr_t qemu_ram_alloc_from_ptr(Devic
 
 if (host) {
new_block->host = host;
+new_block->flags |= RAM_PREALLOC_MASK;
 } else {
 if (mem_path) {
#if defined (__linux__) && !defined(TARGET_S390X)
@@ -2883,7 +2884,9 @@ void qemu_ram_free(ram_addr_t addr)
QLIST_FOREACH(block, &ram_list.blocks, next) {
if (addr == block->offset) {
 QLIST_REMOVE(block, next);
-if (mem_path) {
+if (block->flags & RAM_PREALLOC_MASK)
+;
+else if (mem_path) {
#if defined (__linux__) && !defined(TARGET_S390X)
if (block->fd) {
munmap(block->host, block->length);
@@ -2906,6 +2909,62 @@ void qemu_ram_free(ram_addr_t addr)
 
 }
 
+void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
+{
+RAMBlock *block;
+ram_addr_t offset;
+int flags;
+void *area, *vaddr;
+
+QLIST_FOREACH(block, &ram_list.blocks, next) {
+offset = addr - block->offset;
+if (offset < block->length) {
+vaddr = block->host + offset;
+if (block->flags & RAM_PREALLOC_MASK) {
+;
+} else {
+flags = MAP_FIXED;
+munmap(vaddr, length);
+if (mem_path) {
+#if defined (__linux__) && !defined(TARGET_S390X)
+if (block->fd) {
+#ifdef MAP_POPULATE
+flags |= mem_prealloc ? MAP_POPULATE | MAP_SHARED :
+MAP_PRIVATE;
+#else
+flags |= MAP_PRIVATE;
+#endif
+area = mmap(vaddr, length, PROT_READ | PROT_WRITE,
+flags, block->fd, offset);
+} else {
+flags |= MAP_PRIVATE | MAP_ANONYMOUS;
+area = mmap(vaddr, length, PROT_READ | PROT_WRITE,
+flags, -1, 0);
+}
+#endif
+} else {
+#if defined(TARGET_S390X) && defined(CONFIG_KVM)
+flags |= MAP_SHARED | MAP_ANONYMOUS;
+area = mmap(vaddr, length, PROT_EXEC|PROT_READ|PROT_WRITE,
+flags, -1, 0);
+#else
+flags |= MAP_PRIVATE | MAP_ANONYMOUS;
+area = mmap(vaddr, length, PROT_READ | PROT_WRITE,
+flags, -1, 0);
+#endif
+}
+if (area != vaddr) {
+fprintf(stderr, "Could not remap addr: %lx@%lx\n",
+length, addr);
+exit(1);
+}
+qemu_madvise(vaddr, length, QEMU_MADV_MERGEABLE);
+}
+return;
+}
+}
+}
+
 /* Return a host pointer to ram allocated with qemu_ram_alloc.
With the exception of the softmmu code in this file, this should
only be used for local memory (e.g. video ram) that the device owns,
--- a/cpu-common.h
+++ b/cpu-common.h
@@ -50,6 +50,7 @@ ram_addr_t qemu_ram_alloc_from_ptr(Devic
 ram_addr_t size, void *host);
 ram_addr_t qemu_ram_alloc(DeviceState *dev, const char *name, ram_addr_t size);
 void qemu_ram_free(ram_addr_t addr);
+void qemu_ram_remap(ram_addr_t addr, ram_addr_t length);
 /* This should only be used for ram local to a device.  */
 void *qemu_get_ram_ptr(ram_addr_t addr);
 /* Same but slower, to use for migration, where the order of





Re: [Qemu-devel] [PATCH uq/master 1/2] Add qemu_ram_remap

2011-01-13 Thread Huang Ying
On Fri, 2011-01-14 at 05:14 +0800, Blue Swirl wrote:
 On Thu, Jan 13, 2011 at 8:34 AM, Huang Ying ying.hu...@intel.com wrote:
  qemu_ram_remap() unmaps the specified RAM pages, then re-maps these
  pages again.  This is used by KVM HWPoison support to clear HWPoisoned
  page tables across guest rebooting, so that a new page may be
  allocated later to recover the memory error.
 
  Signed-off-by: Huang Ying ying.hu...@intel.com
  ---
   cpu-all.h|4 +++
   cpu-common.h |1
   exec.c   |   61 
  ++-
   3 files changed, 65 insertions(+), 1 deletion(-)
 
  --- a/cpu-all.h
  +++ b/cpu-all.h
  @@ -863,10 +863,14 @@ target_phys_addr_t cpu_get_phys_page_deb
   extern int phys_ram_fd;
   extern ram_addr_t ram_size;
 
  +/* RAM is pre-allocated and passed into qemu_ram_alloc_from_ptr */
   +#define RAM_PREALLOC_MASK  (1 << 0)
  +
   typedef struct RAMBlock {
  uint8_t *host;
  ram_addr_t offset;
  ram_addr_t length;
  +uint32_t flags;
  char idstr[256];
  QLIST_ENTRY(RAMBlock) next;
   #if defined(__linux__) && !defined(TARGET_S390X)
  --- a/exec.c
  +++ b/exec.c
  @@ -2830,6 +2830,7 @@ ram_addr_t qemu_ram_alloc_from_ptr(Devic
 
  if (host) {
   new_block->host = host;
   +new_block->flags |= RAM_PREALLOC_MASK;
  } else {
  if (mem_path) {
   #if defined (__linux__) && !defined(TARGET_S390X)
  @@ -2883,7 +2884,9 @@ void qemu_ram_free(ram_addr_t addr)
   QLIST_FOREACH(block, &ram_list.blocks, next) {
   if (addr == block->offset) {
  QLIST_REMOVE(block, next);
  -if (mem_path) {
   +if (block->flags & RAM_PREALLOC_MASK)
 
 Missing braces.

Sorry, forgot this one, will fix it.

Best Regards,
Huang Ying





[Qemu-devel] Re: [PATCH uq/master 2/2] MCE, unpoison memory address across reboot

2011-01-13 Thread Huang Ying
On Thu, 2011-01-13 at 17:01 +0800, Jan Kiszka wrote:
 Am 13.01.2011 09:34, Huang Ying wrote:
  In Linux kernel HWPoison processing implementation, the virtual
  address in processes mapping the error physical memory page is marked
  as HWPoison.  So that, the further accessing to the virtual
  address will kill corresponding processes with SIGBUS.
  
  If the error physical memory page is used by a KVM guest, the SIGBUS
  will be sent to QEMU, and QEMU will simulate a MCE to report that
  memory error to the guest OS.  If the guest OS can not recover from
  the error (for example, the page is accessed by kernel code), guest OS
  will reboot the system.  But because the underlying host virtual
  address backing the guest physical memory is still poisoned, if the
  guest system accesses the corresponding guest physical memory even
  after rebooting, the SIGBUS will still be sent to QEMU and MCE will be
  simulated.  That is, guest system can not recover via rebooting.
  
  In fact, across rebooting, the contents of guest physical memory page
  need not to be kept.  We can allocate a new host physical page to
  back the corresponding guest physical address.
  
  This patch fixes this issue in QEMU via calling qemu_ram_remap() to
  clear the corresponding page table entry, so that make it possible to
  allocate a new page to recover the issue.
  
  Signed-off-by: Huang Ying ying.hu...@intel.com
  ---
   kvm.h |2 ++
   target-i386/kvm.c |   39 +++
   2 files changed, 41 insertions(+)
  
  --- a/target-i386/kvm.c
  +++ b/target-i386/kvm.c
  @@ -580,6 +580,42 @@ static int kvm_get_supported_msrs(void)
   return ret;
   }
   
  +struct HWPoisonPage;
  +typedef struct HWPoisonPage HWPoisonPage;
  +struct HWPoisonPage
  +{
  +ram_addr_t ram_addr;
  +QLIST_ENTRY(HWPoisonPage) list;
  +};
  +
  +static QLIST_HEAD(hwpoison_page_list, HWPoisonPage) hwpoison_page_list =
  +QLIST_HEAD_INITIALIZER(hwpoison_page_list);
  +
  +void kvm_unpoison_all(void *param)
 
 Minor nit: This can be static now.

In uq/master, it can be made static.  But in kvm/master, kvm_arch_init
is not compiled because of conditional compilation, so we would get a
warning and an error for the unused symbol.  Should we consider
kvm/master in this patch?

  +{
  +HWPoisonPage *page, *next_page;
  +
   +QLIST_FOREACH_SAFE(page, &hwpoison_page_list, list, next_page) {
   +QLIST_REMOVE(page, list);
   +qemu_ram_remap(page->ram_addr, TARGET_PAGE_SIZE);
  +qemu_free(page);
  +}
  +}
  +
  +static void kvm_hwpoison_page_add(ram_addr_t ram_addr)
  +{
  +HWPoisonPage *page;
  +
   +QLIST_FOREACH(page, &hwpoison_page_list, list) {
   +if (page->ram_addr == ram_addr)
  +return;
  +}
  +
  +page = qemu_malloc(sizeof(HWPoisonPage));
   +page->ram_addr = ram_addr;
   +QLIST_INSERT_HEAD(&hwpoison_page_list, page, list);
  +}
  +
   int kvm_arch_init(void)
   {
   uint64_t identity_base = 0xfffbc000;
  @@ -632,6 +668,7 @@ int kvm_arch_init(void)
   fprintf(stderr, "e820_add_entry() table is full\n");
   return ret;
   }
  +qemu_register_reset(kvm_unpoison_all, NULL);
   
   return 0;
   }
  @@ -1940,6 +1977,7 @@ int kvm_on_sigbus_vcpu(CPUState *env, in
   hardware_memory_error();
   }
   }
  +kvm_hwpoison_page_add(ram_addr);
   
   if (code == BUS_MCEERR_AR) {
   /* Fake an Intel architectural Data Load SRAR UCR */
  @@ -1984,6 +2022,7 @@ int kvm_on_sigbus(int code, void *addr)
   "QEMU itself instead of guest system!: %p\n", addr);
   return 0;
   }
  +kvm_hwpoison_page_add(ram_addr);
   kvm_mce_inj_srao_memscrub2(first_cpu, paddr);
   } else
   #endif
  --- a/kvm.h
  +++ b/kvm.h
  @@ -188,6 +188,8 @@ int kvm_physical_memory_addr_from_ram(ra
 target_phys_addr_t *phys_addr);
   #endif
   
  +void kvm_unpoison_all(void *param);
  +
 
 To be removed if kvm_unpoison_all is static.
 
   #endif
   int kvm_set_ioeventfd_mmio_long(int fd, uint32_t adr, uint32_t val, bool 
  assign);
   
  
 
 As indicated, I'm sitting on lots of fixes and refactorings of the MCE
 user space code. How do you test your patches? Any suggestions how to do
 this efficiently would be warmly welcome.

We use a self-made test script to test.  Repository is at:

git://git.kernel.org/pub/scm/utils/cpu/mce/mce-test.git

The kvm test script is in kvm sub-directory.

The qemu patch attached is needed by the test script.

Best Regards,
Huang Ying

Author: Max Asbock masb...@linux.vnet.ibm.com
Subject: [PATCH -v3] Monitor command: x-gpa2hva, translate guest physical address to host virtual address

Add command x-gpa2hva to translate guest physical address to host
virtual address. Because gpa to hva translation is not consistent, so
this command is only used for debugging.
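
Purely for illustration (this is not output captured from a real
session), a monitor invocation would look something like the following,
assuming the command takes a single guest physical address and prints
the backing host virtual address; both addresses below are made up:

(qemu) x-gpa2hva 0x6400000
0x7f1a2e400000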

The x-gpa2hva command

[Qemu-devel] Re: [PATCH v3 12/21] kvm: x86: Drop MCE MSRs write back restrictions

2011-01-05 Thread Huang Ying
On Wed, 2011-01-05 at 16:07 +0800, Jan Kiszka wrote:
 Am 05.01.2011 07:42, Huang Ying wrote:
  On Tue, 2011-01-04 at 16:32 +0800, Jan Kiszka wrote:
  From: Jan Kiszka jan.kis...@siemens.com
 
  There is no need to restrict writing back MCE MSRs to reset or full
  state updates as setting their values has no side effects.
  
  Sorry for late.
 
 Don't worry.
 
  
  The MCE MSRs contents is sticky for warm reset except MCG_STATUS, so
  their content should be kept.  And the following sequence may set
  uncorrected value in MCE registers.
  
   savevm -> loadvm -> (OS clear MCE registers) -> reset -> (MCE registers
  has new (uncorrected) value)
 
 Sorry, I can't follow. Unless I miss some subtle detail, the question is
 not when we transfer the mcg_* CPUState fields to the kernel, but when
 and how we manipulate them in user space, e.g. on reset. Where are those
 fields touched incorrectly between get and put msrs so that we cannot
 write them back?

If my understanding is correct, MSRs are not saved to user space
(env->mce_banks) during reset in the current code.  So if all MCE MSRs
are restored to the kernel, their user-space contents from a previous
loadvm may be put into the kernel after reset.

Best Regards,
Huang Ying






[Qemu-devel] Re: [PATCH v3 12/21] kvm: x86: Drop MCE MSRs write back restrictions

2011-01-04 Thread Huang Ying
On Tue, 2011-01-04 at 16:32 +0800, Jan Kiszka wrote:
 From: Jan Kiszka jan.kis...@siemens.com
 
 There is no need to restrict writing back MCE MSRs to reset or full
 state updates as setting their values has no side effects.

Sorry for late.

The MCE MSRs' contents are sticky across warm reset, except for
MCG_STATUS, so their contents should be kept.  And the following
sequence may leave incorrect (uncorrected) values in the MCE registers:

savevm -> loadvm -> (OS clears MCE registers) -> reset -> (MCE registers
have new (uncorrected) values)
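
To make the intended warm-reset semantics concrete, here is an
illustrative sketch (not a patch; the struct below is a stripped-down
stand-in, not QEMU's CPUState): only MCG_STATUS is cleared on a warm
reset, while the sticky per-bank registers keep whatever was last
written to them.

#include <stdint.h>

typedef struct {
    uint64_t mcg_status;
    uint64_t mce_banks[10 * 4];   /* MCE_BANKS_DEF banks x 4 registers each */
} X86MCEState;

/* Warm reset: clear MCG_STATUS, keep the sticky per-bank registers. */
static void mce_warm_reset(X86MCEState *s)
{
    s->mcg_status = 0;
    /* s->mce_banks[] deliberately left untouched */
}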

Best Regards,
Huang Ying

 Signed-off-by: Jan Kiszka jan.kis...@siemens.com
 CC: Huang Ying ying.hu...@intel.com
 ---
  target-i386/kvm.c |   12 
  1 files changed, 4 insertions(+), 8 deletions(-)
 
 diff --git a/target-i386/kvm.c b/target-i386/kvm.c
 index 8267655..1789bff 100644
 --- a/target-i386/kvm.c
 +++ b/target-i386/kvm.c
 @@ -863,14 +863,10 @@ static int kvm_put_msrs(CPUState *env, int level)
   if (env->mcg_cap) {
  int i;
  
 -if (level == KVM_PUT_RESET_STATE) {
  -kvm_msr_entry_set(msrs[n++], MSR_MCG_STATUS, env->mcg_status);
  -} else if (level == KVM_PUT_FULL_STATE) {
  -kvm_msr_entry_set(msrs[n++], MSR_MCG_STATUS, env->mcg_status);
  -kvm_msr_entry_set(msrs[n++], MSR_MCG_CTL, env->mcg_ctl);
  -for (i = 0; i < (env->mcg_cap & 0xff) * 4; i++) {
  -kvm_msr_entry_set(msrs[n++], MSR_MC0_CTL + i, 
  env->mce_banks[i]);
  -}
  +kvm_msr_entry_set(msrs[n++], MSR_MCG_STATUS, env->mcg_status);
  +kvm_msr_entry_set(msrs[n++], MSR_MCG_CTL, env->mcg_ctl);
  +for (i = 0; i < (env->mcg_cap & 0xff) * 4; i++) {
  +kvm_msr_entry_set(msrs[n++], MSR_MC0_CTL + i, 
  env->mce_banks[i]);
  }
  }
  #endif





[Qemu-devel] Re: [RFC 2/2] KVM, MCE, unpoison memory address across reboot

2011-01-04 Thread Huang Ying
On Fri, 2010-12-31 at 17:10 +0800, Jan Kiszka wrote:
 Am 31.12.2010 06:22, Huang Ying wrote:
  In Linux kernel HWPoison processing implementation, the virtual
  address in processes mapping the error physical memory page is marked
  as HWPoison.  So that, the further accessing to the virtual
  address will kill corresponding processes with SIGBUS.
  
  If the error physical memory page is used by a KVM guest, the SIGBUS
  will be sent to QEMU, and QEMU will simulate a MCE to report that
  memory error to the guest OS.  If the guest OS can not recover from
  the error (for example, the page is accessed by kernel code), guest OS
  will reboot the system.  But because the underlying host virtual
  address backing the guest physical memory is still poisoned, if the
  guest system accesses the corresponding guest physical memory even
  after rebooting, the SIGBUS will still be sent to QEMU and MCE will be
  simulated.  That is, guest system can not recover via rebooting.
  
  In fact, across rebooting, the contents of guest physical memory page
  need not to be kept.  We can allocate a new host physical page to
  back the corresponding guest physical address.
  
  This patch fixes this issue in QEMU-KVM via calling qemu_ram_remap()
  to clear the corresponding page table entry, so that make it possible
  to allocate a new page to recover the issue.
  
  Signed-off-by: Huang Ying ying.hu...@intel.com
  ---
   kvm.h |2 ++
   qemu-kvm.c|   37 +
 
 What's missing in upstream to make this a uq/master patch? We are still
 piling up features and fixes in qemu-kvm* that should better target
 upstream directly. That's work needlessly done twice.

OK. I will do that. Is basing it on uq/master sufficient to make it an
upstream patch?

 Is this infrastructure really arch-independent? Will there be other
 users besides x86? If not, better keep it in target-i386/kvm.c.

No.  It is used only in x86.  I will move it into target-i386/kvm.c.

Best Regards,
Huang Ying





[Qemu-devel] [RFC 1/2] Add qemu_ram_remap

2010-12-30 Thread Huang Ying
qemu_ram_remap() unmaps the specified RAM pages, then re-maps these
pages again.  This is used by KVM HWPoison support to clear HWPoisoned
page tables across guest rebooting, so that a new page may be
allocated later to recover the memory error.

Signed-off-by: Huang Ying ying.hu...@intel.com
---
 cpu-all.h|4 +++
 cpu-common.h |1 
 exec.c   |   61 ++-
 3 files changed, 65 insertions(+), 1 deletion(-)

--- a/cpu-all.h
+++ b/cpu-all.h
@@ -861,10 +861,14 @@ target_phys_addr_t cpu_get_phys_page_deb
 extern int phys_ram_fd;
 extern ram_addr_t ram_size;
 
+/* RAM is pre-allocated and passed into qemu_ram_alloc_from_ptr */
+#define RAM_PREALLOC_MASK  (1 << 0)
+
 typedef struct RAMBlock {
 uint8_t *host;
 ram_addr_t offset;
 ram_addr_t length;
+uint32_t flags;
 char idstr[256];
 QLIST_ENTRY(RAMBlock) next;
#if defined(__linux__) && !defined(TARGET_S390X)
--- a/exec.c
+++ b/exec.c
@@ -2845,6 +2845,7 @@ ram_addr_t qemu_ram_alloc_from_ptr(Devic
 
 if (host) {
new_block->host = host;
+new_block->flags |= RAM_PREALLOC_MASK;
 } else {
 if (mem_path) {
#if defined (__linux__) && !defined(TARGET_S390X)
@@ -2911,7 +2912,9 @@ void qemu_ram_free(ram_addr_t addr)
QLIST_FOREACH(block, &ram_list.blocks, next) {
if (addr == block->offset) {
 QLIST_REMOVE(block, next);
-if (mem_path) {
+if (block->flags & RAM_PREALLOC_MASK)
+;
+else if (mem_path) {
#if defined (__linux__) && !defined(TARGET_S390X)
if (block->fd) {
munmap(block->host, block->length);
@@ -2934,6 +2937,62 @@ void qemu_ram_free(ram_addr_t addr)
 
 }
 
+void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
+{
+RAMBlock *block;
+ram_addr_t offset;
+int flags;
+void *area, *vaddr;
+
+QLIST_FOREACH(block, &ram_list.blocks, next) {
+offset = addr - block->offset;
+if (offset < block->length) {
+vaddr = block->host + offset;
+if (block->flags & RAM_PREALLOC_MASK)
+;
+else {
+flags = MAP_FIXED;
+munmap(vaddr, length);
+if (mem_path) {
+#if defined (__linux__) && !defined(TARGET_S390X)
+if (block->fd) {
+#ifdef MAP_POPULATE
+flags |= mem_prealloc ? MAP_POPULATE | MAP_SHARED :
+MAP_PRIVATE;
+#else
+flags |= MAP_PRIVATE;
+#endif
+area = mmap(vaddr, length, PROT_READ | PROT_WRITE,
+flags, block->fd, offset);
+} else {
+flags |= MAP_PRIVATE | MAP_ANONYMOUS;
+area = mmap(vaddr, length, PROT_READ | PROT_WRITE,
+flags, -1, 0);
+}
+#endif
+} else {
+#if defined(TARGET_S390X) && defined(CONFIG_KVM)
+flags |= MAP_SHARED | MAP_ANONYMOUS;
+area = mmap(vaddr, length, PROT_EXEC|PROT_READ|PROT_WRITE,
+flags, -1, 0);
+#else
+flags |= MAP_PRIVATE | MAP_ANONYMOUS;
+area = mmap(vaddr, length, PROT_READ | PROT_WRITE,
+flags, -1, 0);
+#endif
+}
+if (area != vaddr) {
+fprintf(stderr, "Could not remap addr: %lx@%lx\n",
+length, addr);
+exit(1);
+}
+qemu_madvise(vaddr, length, QEMU_MADV_MERGEABLE);
+}
+return;
+}
+}
+}
+
 /* Return a host pointer to ram allocated with qemu_ram_alloc.
With the exception of the softmmu code in this file, this should
only be used for local memory (e.g. video ram) that the device owns,
--- a/cpu-common.h
+++ b/cpu-common.h
@@ -45,6 +45,7 @@ ram_addr_t qemu_ram_alloc_from_ptr(Devic
 ram_addr_t qemu_ram_alloc(DeviceState *dev, const char *name, ram_addr_t size);
 void qemu_ram_unmap(ram_addr_t addr);
 void qemu_ram_free(ram_addr_t addr);
+void qemu_ram_remap(ram_addr_t addr, ram_addr_t length);
 /* This should only be used for ram local to a device.  */
 void *qemu_get_ram_ptr(ram_addr_t addr);
 /* This should not be used by devices.  */





[Qemu-devel] Re: [PATCH v3 3/7] target-i386: Don't use SoftFloat uint64 type

2010-12-19 Thread Huang Ying
On Sun, 2010-12-19 at 00:25 +0800, Andreas Färber wrote:
 softfloat.h's uint64 type has least-width semantics,
 which seems unintended here since uint64_t is used in helpers.
 
 v3:
 * Split off.
 
 Cc: Huang Ying ying.hu...@intel.com
 Cc: Juan Quintela quint...@redhat.com
 Signed-off-by: Andreas Färber andreas.faer...@web.de

Acked-by: Huang Ying ying.hu...@intel.com





Re: [Qemu-devel] Re: [PATCH 11/11] kvm, x86: broadcast mce depending on the cpu version

2010-10-14 Thread Huang Ying
On Fri, 2010-10-15 at 09:52 +0800, Hidetoshi Seto wrote:
 (2010/10/15 10:06), Marcelo Tosatti wrote:
  On Thu, Oct 14, 2010 at 05:55:28PM +0900, Jin Dongming wrote:
  There is no reason why SRAO event received by the main thread
  is the only one that being broadcasted.
 
  According to the x86 ASDM vol.3A 15.10.4.1,
  MCE signal is broadcast on processor version 06H_EH or later.
 
  This change is required to handle SRAR in the guest.
 
  Signed-off-by: Hidetoshi Seto seto.hideto...@jp.fujitsu.com
  Tested-by: Jin Dongming jin.dongm...@np.css.fujitsu.com
  ---
   qemu-kvm.c |   63 
  +--
   1 files changed, 31 insertions(+), 32 deletions(-)
  
  Why is this necessary? _AO SIGBUS should be sent to all vcpu threads and
  main thread.
 
 Humm? If you are right, vcpu threads will receive same SRAO event twice,
 one is that received by itself and another is that received by main thread
 and forwarded by the broadcast.
 
 My understanding is (Jin, please correct me if something wrong):
  - _AO SIGBUS is sent to main thread only, and then SRAO event is
broadcasted to all vcpu threads.

Yes. It is.

  - _AR SIGBUS is sent to a vcpu thread that tried to touch the
unmapped poisoned page, and SRAR event is posted to the vcpu.
 
 One problem here is that SRAR is not broadcasted.
 The guest might observe the event differently, like some cpus
 don't enter machine check.

Yes. Broadcasting SRAR follows the spec better.

Best Regards,
Huang Ying





[Qemu-devel] Re: [patch uq/master 7/8] MCE: Relay UCR MCE to guest

2010-10-07 Thread Huang Ying
On Thu, 2010-10-07 at 00:05 +0800, Marcelo Tosatti wrote:
 On Wed, Oct 06, 2010 at 10:58:36AM +0900, Hidetoshi Seto wrote:
  I got some more question:
  
  (2010/10/05 3:54), Marcelo Tosatti wrote:
   Index: qemu/target-i386/cpu.h
   ===
   --- qemu.orig/target-i386/cpu.h
   +++ qemu/target-i386/cpu.h
   @@ -250,16 +250,32 @@
#define PG_ERROR_RSVD_MASK 0x08
#define PG_ERROR_I_D_MASK  0x10

    -#define MCG_CTL_P (1UL<<8)   /* MCG_CAP register available */
    +#define MCG_CTL_P (1ULL<<8)   /* MCG_CAP register available */
    +#define MCG_SER_P (1ULL<<24) /* MCA recovery/new status bits */

   -#define MCE_CAP_DEF  MCG_CTL_P
   +#define MCE_CAP_DEF  (MCG_CTL_P|MCG_SER_P)
#define MCE_BANKS_DEF 10

  
  It seems that current kvm doesn't support SER_P, so injecting SRAO
  to guest will mean that guest receives VAL|UC|!PCC and RIPV event
  from virtual processor that doesn't have SER_P.
 
 Dean also noted this. I don't think it was deliberate choice to not
 expose SER_P. Huang?

In fact, that should be a BUG. I will fix it as soon as possible.
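
For readers following along, a rough sketch of the architectural
MCi_STATUS bits involved (bit positions per the Intel SDM; the macro
names are illustrative and may not match QEMU's headers exactly):

#include <stdint.h>

#define MCI_STATUS_VAL   (1ULL << 63)   /* register contents are valid */
#define MCI_STATUS_UC    (1ULL << 61)   /* uncorrected error */
#define MCI_STATUS_EN    (1ULL << 60)   /* error reporting enabled */
#define MCI_STATUS_MISCV (1ULL << 59)   /* MCi_MISC register valid */
#define MCI_STATUS_ADDRV (1ULL << 58)   /* MCi_ADDR register valid */
#define MCI_STATUS_PCC   (1ULL << 57)   /* processor context corrupt */
#define MCI_STATUS_S     (1ULL << 56)   /* signaled; defined only with SER_P */
#define MCI_STATUS_AR    (1ULL << 55)   /* action required; only with SER_P */

/* An SRAO event is VAL|UC|EN with PCC=0, S=1, AR=0.  The S and AR bits only
 * mean anything to the guest if MCG_CAP.SER_P is advertised, hence the bug
 * acknowledged above. */
static const uint64_t srao_status =
    MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_EN |
    MCI_STATUS_MISCV | MCI_STATUS_ADDRV | MCI_STATUS_S;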

Best Regards,
Huang Ying





[Qemu-devel] Re: [patch uq/master 7/8] MCE: Relay UCR MCE to guest

2010-10-07 Thread Huang Ying
, and the issue remains.
 
 Definitely.
 I guess Huang has some plan or hint for rework this point.

Yes. This should be fixed. The SRAR SIGBUS should be sent directly
instead of being sent via touching poisoned virtual address.
 
  I would think that if the bad page can't be sidelined, such that
  the newly booting guest can't use it, then the new guest shouldn't be
  allowed to boot. But perhaps there is some merit in letting it try to
  boot and see if one gets 'lucky'.
 
 In case of booting a real machine in real world, hardware and firmware
 usually (or often) do self-test before passing control to OS.
 Some platform can boot OS with degraded configuration (for example,
 fewer memory) if it has trouble on its component.  Some BIOS may
 stop booting and show messages like please reseat [component] on the
 screen.  So we could implement/request qemu to have such mechanism.
 
 I can understand the merit you mentioned here, in some degree. But I
 think it is hard to say unlucky to customer in business...

Because the contents of poisoned pages are not relevant after a reboot,
QEMU can replace the poisoned pages with good pages when the guest
reboots.  Do you think that is good?

Best Regards,
Huang Ying





[Qemu-devel] [BUGFIX] MCE: Fix bug of IA32_MCG_STATUS after system reset

2010-01-05 Thread Huang Ying
Now, if we inject a fatal MCE into the guest OS, for example Linux,
Linux panics and then reboots.  But if we inject another MCE after the
reboot, the system resets directly instead of panicking first, because
MCG_STATUS.MCIP is set to 1 and is not cleared across the reboot.  This
does not follow the behavior of real hardware.

This patch fixes this by setting env->mcg_status to 0 during system reset.

Signed-off-by: Huang Ying ying.hu...@intel.com

---
 target-i386/helper.c |2 ++
 1 file changed, 2 insertions(+)

--- a/target-i386/helper.c
+++ b/target-i386/helper.c
@@ -617,6 +617,8 @@ void cpu_reset(CPUX86State *env)
 env-dr[7] = DR7_FIXED_1;
 cpu_breakpoint_remove_all(env, BP_CPU);
 cpu_watchpoint_remove_all(env, BP_CPU);
+
+env->mcg_status = 0;
 }
 
 void cpu_x86_close(CPUX86State *env)