Re: [PATCH v2 0/3] mm: use memmap_on_memory semantics for dax/kmem

2023-07-25 Thread David Hildenbrand

On 20.07.23 09:14, Vishal Verma wrote:

The dax/kmem driver can potentially hot-add large amounts of memory
originating from CXL memory expanders, or NVDIMMs, or other 'device
memories'. There is a chance there isn't enough regular system memory
available to fit the memmap for this new memory. It's therefore
desirable, if all other conditions are met, for the kmem managed memory
to place its memmap on the newly added memory itself.

The main hurdle for accomplishing this for kmem is that memmap_on_memory
can only be done if the memory being added is equal to the size of one
memblock. To overcome this, allow the hotplug code to split an add_memory()
request into memblock-sized chunks, and teach try_remove_memory() to
expect and handle such a scenario.
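
As an illustration only, a minimal sketch of the splitting idea (the
helper name add_memory_split() and its exact shape are assumptions for
this sketch, not the patch's actual interface):

#include <linux/memory_hotplug.h>
#include <linux/memory.h>

/*
 * Hypothetical helper: hot-add a large range one memblock at a time so
 * that each memblock-sized piece can carry its own memmap
 * (MHP_MEMMAP_ON_MEMORY).
 */
static int add_memory_split(int nid, u64 start, u64 size, mhp_t mhp_flags)
{
	u64 block = memory_block_size_bytes();
	u64 cur;
	int rc;

	if (!IS_ALIGNED(start, block) || !IS_ALIGNED(size, block))
		return -EINVAL;

	for (cur = start; cur < start + size; cur += block) {
		rc = add_memory(nid, cur, block, mhp_flags);
		if (rc)
			return rc;	/* caller unwinds chunks added so far */
	}
	return 0;
}

try_remove_memory() would then similarly have to accept a range that was
added as multiple memblock-sized chunks.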

Patch 1 exports mhp_supports_memmap_on_memory() so it can be used by the
kmem driver.

Patch 2 teaches the memory_hotplug code to allow for splitting
add_memory() and remove_memory() requests over memblock sized chunks.

Patch 3 adds a sysfs control for the kmem driver that would
allow an opt-out of using memmap_on_memory for the memory being added.



It might be reasonable to rebase this on Aneesh's work. For example, 
patch #1 might not be required anymore.


--
Cheers,

David / dhildenb




Re: [PATCH RESEND 1/4] memory tiering: add abstract distance calculation algorithms management

2023-07-25 Thread Alistair Popple


"Huang, Ying"  writes:

> Hi, Alistair,
>
> Thanks a lot for comments!
>
> Alistair Popple  writes:
>
>> Huang Ying  writes:
>>
>>> The abstract distance may be calculated by various drivers, such as
>>> ACPI HMAT, CXL CDAT, etc., while it may be used by various code that
>>> hot-adds memory nodes, such as dax/kmem.  To decouple the algorithm
>>> users from the providers, this patch implements a management mechanism
>>> for abstract distance calculation algorithms.  It provides an
>>> interface for the providers to register their implementations, and an
>>> interface for the users.
>>
>> I wonder if we need this level of decoupling though? It seems to me like
>> it would be simpler and better for drivers to calculate the abstract
>> distance directly themselves by calling the desired algorithm (e.g. ACPI
>> HMAT) and pass it when creating the nodes, rather than having a
>> notifier chain.
>
> Per my understanding, ACPI HMAT and memory device drivers (such as
> dax/kmem) may belong to different subsystems (ACPI vs. dax).  It's not
> good to call functions across subsystems directly.  So, I think it's
> better to use a general subsystem (memory-tier.c) to decouple them.  If
> it turns out that a notifier chain is unnecessary, we can use some
> function pointers instead.
>
>> At the moment it seems we've only identified two possible algorithms
>> (ACPI HMAT and CXL CDAT) and I don't think it would make sense for one
>> of those to fall back to the other based on priority, so why not just
>> have drivers call the correct algorithm directly?
>
> For example, we have a system with PMEM (persistent memory, Optane
> DCPMM, or AEP, or something else) in DIMM slots and CXL.mem connected
> via CXL link to a remote memory pool.  We will need ACPI HMAT for PMEM
> and CXL CDAT for CXL.mem.  One way is to make dax/kmem identify the
> type of the device and call the corresponding algorithm.

Yes, that is what I was thinking.

> The other way (suggested by this series) is to make dax/kmem call a
> notifier chain, then CXL CDAT or ACPI HMAT can identify the type of
> device and calculate the distance if the type is correct for them.  I
> don't think that it's good to make dax/kmem know every possible type
> of memory device.

Do we expect there to be lots of different types of memory devices
sharing a common dax/kmem driver though? I must admit I'm coming from a
GPU background where we'd expect each type of device to have its own
driver anyway, so I wasn't expecting different types of memory devices to
be handled by the same driver.

>>> Multiple algorithm implementations can cooperate by calculating the
>>> abstract distance for different memory nodes.  The preference among
>>> algorithm implementations can be specified via
>>> priority (notifier_block.priority).
>>
>> How/what decides the priority though? That seems like something better
>> decided by a device driver than the algorithm driver IMHO.
>
> Do we need a memory-device-driver-specific priority?  Or do we just
> share a common priority?  For example, is the priority of CXL CDAT
> always higher than that of ACPI HMAT?  Or is it architecture specific?

Ok, thanks. Having read the above I think the priority is
unimportant. Algorithms can either return a distance along with
NOTIFY_STOP_MASK if they can calculate a distance for a specific device,
or return NOTIFY_DONE if they can't.
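
To make that concrete, a minimal sketch of the provider/consumer pattern
under discussion (hmat_has_perf_data(), hmat_calc_adistance() and
register_mt_adistance_algorithm() are assumed names for this sketch;
only mt_calc_adistance() is visible in the posted patches quoted later
in this thread):

#include <linux/notifier.h>
#include <linux/memory-tiers.h>

/* Provider side (e.g. ACPI HMAT): answer only for nodes it knows about. */
static int hmat_adistance_cb(struct notifier_block *nb,
			     unsigned long node, void *data)
{
	int *adist = data;

	if (!hmat_has_perf_data(node))		/* hypothetical check */
		return NOTIFY_DONE;		/* let another provider try */

	*adist = hmat_calc_adistance(node);	/* hypothetical helper */
	return NOTIFY_STOP;			/* NOTIFY_OK | NOTIFY_STOP_MASK */
}

static struct notifier_block hmat_adist_nb = {
	.notifier_call	= hmat_adistance_cb,
	/* .priority is the knob whose usefulness is being questioned here */
};

The provider would register hmat_adist_nb once at init (via something
like register_mt_adistance_algorithm()), and a consumer such as dax/kmem
would simply keep its default adistance and call
mt_calc_adistance(numa_node, &adist), as in the kmem patch quoted later
in this thread.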

> And, I don't think that we are forced to use the general notifier
> chain interface in all memory device drivers.  If a memory device
> driver has a better understanding of the memory device, it can use
> another way to determine the abstract distance.  For example, a CXL
> memory device driver can identify the abstract distance by itself,
> while other memory device drivers use the general notifier chain
> interface at the same time.

Whilst I personally would find that flexibility useful, I am concerned
it means every driver will just end up divining its own distance rather
than ensuring the data in HMAT/CDAT/etc. is correct. That would kind of
defeat the purpose of it all.



Re: [PATCH RESEND 4/4] dax, kmem: calculate abstract distance with general interface

2023-07-25 Thread Huang, Ying
Alistair Popple  writes:

> Huang Ying  writes:
>
>> Previously, a fixed abstract distance, MEMTIER_DEFAULT_DAX_ADISTANCE,
>> was used for the slow memory type in the kmem driver.  This limits the
>> usage of the kmem driver; for example, it cannot be used for HBM (high
>> bandwidth memory).
>>
>> So, we use the general abstract distance calculation mechanism in the
>> kmem driver to get a more accurate abstract distance on systems with
>> proper support.  The original MEMTIER_DEFAULT_DAX_ADISTANCE is used as
>> a fallback only.
>>
>> Now, multiple memory types may be managed by kmem.  These memory types
>> are put into the "kmem_memory_types" list and protected by
>> kmem_memory_type_lock.
>
> See below but I wonder if kmem_memory_types could be a common helper
> rather than kdax specific?
>
>> Signed-off-by: "Huang, Ying" 
>> Cc: Aneesh Kumar K.V 
>> Cc: Wei Xu 
>> Cc: Alistair Popple 
>> Cc: Dan Williams 
>> Cc: Dave Hansen 
>> Cc: Davidlohr Bueso 
>> Cc: Johannes Weiner 
>> Cc: Jonathan Cameron 
>> Cc: Michal Hocko 
>> Cc: Yang Shi 
>> Cc: Rafael J Wysocki 
>> ---
>>  drivers/dax/kmem.c   | 54 +++-
>>  include/linux/memory-tiers.h |  2 ++
>>  mm/memory-tiers.c|  2 +-
>>  3 files changed, 44 insertions(+), 14 deletions(-)
>>
>> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
>> index 898ca9505754..837165037231 100644
>> --- a/drivers/dax/kmem.c
>> +++ b/drivers/dax/kmem.c
>> @@ -49,14 +49,40 @@ struct dax_kmem_data {
>>  	struct resource *res[];
>>  };
>>  
>> -static struct memory_dev_type *dax_slowmem_type;
>> +static DEFINE_MUTEX(kmem_memory_type_lock);
>> +static LIST_HEAD(kmem_memory_types);
>> +
>> +static struct memory_dev_type *kmem_find_alloc_memory_type(int adist)
>> +{
>> +	bool found = false;
>> +	struct memory_dev_type *mtype;
>> +
>> +	mutex_lock(&kmem_memory_type_lock);
>> +	list_for_each_entry(mtype, &kmem_memory_types, list) {
>> +		if (mtype->adistance == adist) {
>> +			found = true;
>> +			break;
>> +		}
>> +	}
>> +	if (!found) {
>> +		mtype = alloc_memory_type(adist);
>> +		if (!IS_ERR(mtype))
>> +			list_add(&mtype->list, &kmem_memory_types);
>> +	}
>> +	mutex_unlock(&kmem_memory_type_lock);
>> +
>> +	return mtype;
>> +}
>> +
>>  static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>  {
>>  	struct device *dev = &dev_dax->dev;
>>  	unsigned long total_len = 0;
>>  	struct dax_kmem_data *data;
>> +	struct memory_dev_type *mtype;
>>  	int i, rc, mapped = 0;
>>  	int numa_node;
>> +	int adist = MEMTIER_DEFAULT_DAX_ADISTANCE;
>>  
>>  	/*
>>  	 * Ensure good NUMA information for the persistent memory.
>> @@ -71,6 +97,11 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>  		return -EINVAL;
>>  	}
>>  
>> +	mt_calc_adistance(numa_node, &adist);
>> +	mtype = kmem_find_alloc_memory_type(adist);
>> +	if (IS_ERR(mtype))
>> +		return PTR_ERR(mtype);
>> +
>
> I wrote my own quick and dirty module to test this and wrote basically
> the same code sequence.
>
> I notice you're using a list of memory types here though. I think it
> would be nice to have a common helper that other users could call to do
> the mt_calc_adistance() / kmem_find_alloc_memory_type() /
> init_node_memory_type() sequence and cleanup, as my naive approach would
> result in a new memory_dev_type per device even though adist might be
> the same. A common helper would make it easy to de-dup those.

If it's useful, we can move kmem_find_alloc_memory_type() to
memory-tier.c after some revision.  But I would prefer to do that after
we have a second user.  What do you think about that?
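
As an illustration, a rough sketch of what such a shared helper in
memory-tier.c might look like (the name mt_find_alloc_memory_type() and
the caller-held-lock convention are assumptions for this sketch, not
part of the posted series):

/* Caller must hold the lock protecting @memory_types. */
static struct memory_dev_type *mt_find_alloc_memory_type(int adist,
					struct list_head *memory_types)
{
	struct memory_dev_type *mtype;

	list_for_each_entry(mtype, memory_types, list)
		if (mtype->adistance == adist)
			return mtype;	/* reuse: same adist, same type */

	mtype = alloc_memory_type(adist);
	if (!IS_ERR(mtype))
		list_add(&mtype->list, memory_types);

	return mtype;
}

Each driver would then keep only its own list and lock, and
de-duplication of memory_dev_type by adist would come for free.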

--
Best Regards,
Huang, Ying

>>  	for (i = 0; i < dev_dax->nr_range; i++) {
>>  		struct range range;
>>  
>> @@ -88,7 +119,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>  		return -EINVAL;
>>  	}
>>  
>> -	init_node_memory_type(numa_node, dax_slowmem_type);
>> +	init_node_memory_type(numa_node, mtype);
>>  
>>  	rc = -ENOMEM;
>>  	data = kzalloc(struct_size(data, res, dev_dax->nr_range), GFP_KERNEL);
>> @@ -167,7 +198,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>  err_res_name:
>>  	kfree(data);
>>  err_dax_kmem_data:
>> -	clear_node_memory_type(numa_node, dax_slowmem_type);
>> +	clear_node_memory_type(numa_node, mtype);
>>  	return rc;
>>  }
>>  
>> @@ -219,7 +250,7 @@ static void dev_dax_kmem_remove(struct dev_dax *dev_dax)
>>  		 * for that. This implies this reference will be around
>>  		 * till next reboot.
>>  		 */
>> -		clear_node_memory_type(node, dax_slowmem_type);
>> +		clear_node_memory_type(node, NULL);
>>  	}
>>  }
>>  #else
>> @@ -251,12 +282,6 @@ static int __init dax_kmem_init(void)
>>  	if (!kmem_name)
>>  		return -ENOMEM;
>>  
>> -dax_slowmem_type = 

Re: [PATCH RESEND 3/4] acpi, hmat: calculate abstract distance with HMAT

2023-07-25 Thread Huang, Ying
Alistair Popple  writes:

> Huang Ying  writes:
>
>> A memory tiering abstract distance calculation algorithm based on ACPI
>> HMAT is implemented.  The basic idea is as follows.
>>
>> The performance attributes of the system's default DRAM nodes are
>> recorded as the baseline, whose abstract distance is
>> MEMTIER_ADISTANCE_DRAM.  The abstract distance of a memory node
>> (target) is then scaled relative to MEMTIER_ADISTANCE_DRAM based on
>> the ratio of the node's performance attributes to those of the
>> default DRAM nodes.
>
> The problem I encountered here with the calculations is that HBM memory
> ended up in a lower tier, which isn't what I wanted (at least when that
> HBM is attached to a GPU, say).

I have tested the series on a server machine with HBM (pure HBM, not
attached to a GPU).  There, HBM is placed in a higher tier than DRAM.
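
To make the scaling described in the quoted text concrete, a simplified
sketch of the kind of computation involved (illustrative only; the exact
expression and the performance fields used in the patch may differ):

/*
 * Abstract distance grows as the target node's latency gets worse and
 * its bandwidth gets lower relative to default DRAM.  A smaller
 * adistance means a higher (faster) memory tier.
 */
static int perf_to_adistance_sketch(unsigned int node_latency,
				    unsigned int node_bandwidth,
				    unsigned int dram_latency,
				    unsigned int dram_bandwidth)
{
	return MEMTIER_ADISTANCE_DRAM * node_latency / dram_latency *
	       dram_bandwidth / node_bandwidth;
}

So a node with the same latency but half the bandwidth of DRAM would end
up with roughly twice MEMTIER_ADISTANCE_DRAM, i.e. in a lower tier.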

> I suspect this is because the calculations are based on the CPU
> point-of-view (access1) which still sees lower bandwidth to remote HBM
> than local DRAM, even though the remote GPU has higher bandwidth access
> to that memory. Perhaps we need to be considering access0 as well?
> Ie. HBM directly attached to a generic initiator should be in a higher
> tier regardless of CPU access characteristics?

What are your requirements for memory tiers on that machine?  I guess you
want to put GPU-attached HBM in a higher tier and put DRAM in a lower
tier, so that cold HBM pages can be demoted to DRAM when there is memory
pressure on HBM?  This sounds reasonable from the GPU point of view.

The above requirements may be satisfied by calculating the abstract
distance based on access0 (or combined with access1).  But I doubt that
this would be a general solution.  I guess that any memory device that
is used mainly by memory initiators other than CPUs will want to put
itself in a higher memory tier than DRAM, regardless of its access0.

One solution is for the GPU device driver to always put GPU HBM in the
highest memory tier (with the smallest abstract distance), regardless of
its HMAT performance attributes.  Is that possible?
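
A minimal sketch of that idea as a hypothetical GPU-driver snippet
(hbm_node, gpu_register_hbm_tier() and the chosen adistance value are
assumptions; alloc_memory_type() and init_node_memory_type() are the
existing memory-tier interfaces):

#include <linux/memory-tiers.h>

/*
 * Hypothetical: bind the GPU's HBM node to a memory_dev_type whose
 * abstract distance is well below DRAM's, so it lands in a higher tier
 * regardless of what HMAT says about CPU-side (access1) performance.
 */
static int gpu_register_hbm_tier(int hbm_node)
{
	struct memory_dev_type *hbm_type;

	hbm_type = alloc_memory_type(MEMTIER_ADISTANCE_DRAM / 2);
	if (IS_ERR(hbm_type))
		return PTR_ERR(hbm_type);

	init_node_memory_type(hbm_node, hbm_type);
	return 0;
}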

> That said, I'm not entirely convinced the HMAT tables I'm testing
> against are accurate/complete.

--
Best Regards,
Huang, Ying