Re: CXL numa error on arm64 qemu virt machine

2024-05-10 Thread Jonathan Cameron via
On Thu, 9 May 2024 16:35:34 +0800
Yuquan Wang  wrote:

> On Wed, May 08, 2024 at 01:02:52PM +0100, Jonathan Cameron wrote:
> >   
> > > [0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x4000-0xbfff]
> > > [0.00] ACPI: SRAT: Node 1 PXM 1 [mem 0xc000-0x13fff]
> > > [0.00] ACPI: Unknown target node for memory at 0x100, assuming node 0
> > > [0.00] NUMA: Warning: invalid memblk node 16 [mem 0x0400-0x07ff]
> > > [0.00] NUMA: Faking a node at [mem 0x0400-0x00013fff]
> > > [0.00] NUMA: NODE_DATA [mem 0x13f7f89c0-0x13f7fafff]
> > > 
> > > Previous discussion: https://lore.kernel.org/linux-cxl/20231011150620.2...@huawei.com/
> > > 
> > > root@debian-bullseye-arm64:~# cxl create-region -d decoder0.0 -t ram
> > > [   68.653873] cxl region0: Bypassing cpu_cache_invalidate_memregion() for testing!
> > > [   68.660568] Unknown target node for memory at 0x100, assuming node 0  
> > 
> > You need a load of kernel changes for NUMA nodes to work correctly with
> > CXL memory on arm64 platforms.  I have some working code but need to tidy
> > up a few corners that came up in an internal review earlier this week.
> > 
> > I have some travel coming up so may take a week or so to get those out.
> > 
> > Curiously that invalid memblk has nothing to do with the CXL fixed memory window.
> > Could you check if that is happening for you without the CXL patches?
> >   
> 
> Thanks.
> 
> I have checked it; the problem was caused by my BIOS firmware file.  I
> changed the BIOS file and the NUMA topology now works.
> 
> BTW, if it is convenient, could you post a link to the patches for CXL
> memory NUMA nodes on arm64 platforms?

https://git.kernel.org/pub/scm/linux/kernel/git/jic23/cxl-staging.git/log/?h=arm-numa-fixes

I've run out of time to sort out cover letters and things + just before the
merge window is never a good time to get anyone to pay attention to
potentially controversial patches.  So for now I've thrown up a branch on
kernel.org with Robert's series of fixes to related code (that's queued in
the ACPI tree for the merge window), Dan Williams' patches (from several
years ago), and my additions that 'work' (lightly tested) on qemu/arm64
with the generic port patches etc.

I'll send out an RFC in a couple of weeks.  In the meantime let me know if you
run into any problems or have suggestions to improve them.

Jonathan

p.s. Apparently my computer is in the future. (-28 minutes and counting!)
> 
> > 
> > Whilst it doesn't work yet (because of missing kernel support)
> > you'll need something that looks more like the generic ports test added in 
> > https://gitlab.com/jic23/qemu/-/commit/6589c527920ba22fe0923b60b58d33a8e9fd371e
> > 
> > Most importantly
> > -numa node,nodeid=2 -object acpi-generic-port,id=gp0,pci-bus=cxl.1,node=2
> > + the bits setting distances etc.  Note CXL memory does not provide SLIT-like
> > data at the moment, so the test above won't help you identify if it is
> > correctly set up.  That's a gap in general in the kernel support.  Whilst
> > we'd love it if everyone moved to HMAT-derived information we may need to
> > provide some fallback.
> > 
> > Jonathan
> >   
> 
> Many thanks
> Yuquan
> 




Re: [PATCH v3 2/2] cxl/core: add poison creation event handler

2024-05-08 Thread Jonathan Cameron via
On Fri, 3 May 2024 18:42:31 +0800
Shiyang Ruan  wrote:

> On 2024/4/24 1:57, Ira Weiny wrote:
> > Shiyang Ruan wrote:  
> >> Currently the driver only traces CXL events; poison creation (for both vmem
> >> and pmem types) on a CXL memdev is silent.  The OS needs to be notified so
> >> that it can handle poison pages in time.  Per the CXL spec, the device error
> >> event could be signaled through FW-First and OS-First methods.
> >>
> >> So, add poison creation event handler in OS-First method:
> >>- Qemu:
> >>  - CXL device reports POISON creation event to OS by MSI by sending
> >>GMER/DER after injecting a poison record;
> >>- CXL driver:
> >>  a. parse the POISON event from GMER/DER;
> >>  b. translate poisoned DPA to HPA (PFN);
> >>  c. enqueue poisoned PFN to memory_failure's work queue;  
> > 
> > I'm a bit confused by the need for this patch.  Perhaps a bit more detail
> > here?  
> 
> Yes, I should have written more details.
> 
> I want to check and make sure that HWPOISON on a CXL device (type3) is
> working properly.  For example, with an FSDAX filesystem created on a
> type3-pmem device, when it gets a POISON bit the OS should be able to
> handle this POISON event: find the relevant process
> 
> Currently I'm using Qemu with several simulated CXL devices, and using the
> poison injection API of Qemu to create POISON records, but the OS isn't
> notified.  Only when we actively list POISON records (cxl list -L) will the
> driver fetch them and log them into trace events; then we see the POISON
> records.  Memory failure wasn't triggered either.
> 

Indeed - QEMU emulation of this is not complete.  It should also be generating
the events. Ideally we'd even handle injecting silent poison (not yet detected
by the device) and have that generate the synchronous memory faults on an 
access.
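
For context, the injection path mentioned above is QEMU's QMP
cxl-inject-poison command; a typical invocation (the device path here is
illustrative) looks like:

{ "execute": "cxl-inject-poison",
  "arguments": { "path": "/machine/peripheral/cxl-mem1",
                 "start": 2048,
                 "length": 256 } }

Records injected this way currently only become visible when the host
polls the poison list, which is exactly the gap being discussed.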

> That's why I said "poison creation on cxl memdev is silent".  Per spec,
> POISON creation should be notified to the OS.  Since I'm not familiar with
> the firmware part, I'm trying to add this notification for OS-First.
> 
> > 
> > More comments below.
> >   
> >>
> >> Signed-off-by: Shiyang Ruan 
> >> ---
> >>   drivers/cxl/core/mbox.c   | 119 +-
> >>   drivers/cxl/cxlmem.h  |   8 +--
> >>   include/linux/cxl-event.h |  18 +-
> >>   3 files changed, 125 insertions(+), 20 deletions(-)
> >>
> >> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> >> index f0f54aeccc87..76af0d73859d 100644
> >> --- a/drivers/cxl/core/mbox.c
> >> +++ b/drivers/cxl/core/mbox.c
> >> @@ -837,25 +837,116 @@ int cxl_enumerate_cmds(struct cxl_memdev_state *mds)
> >>   }
> >>   EXPORT_SYMBOL_NS_GPL(cxl_enumerate_cmds, CXL);
> >>   
> >> -void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
> >> -  enum cxl_event_log_type type,
> >> -  enum cxl_event_type event_type,
> >> -  const uuid_t *uuid, union cxl_event *evt)
> >> +static void cxl_report_poison(struct cxl_memdev *cxlmd, struct cxl_region *cxlr,
> >> +  u64 dpa)
> >>   {
> >> -  if (event_type == CXL_CPER_EVENT_GEN_MEDIA)
> >> +  u64 hpa = cxl_trace_hpa(cxlr, cxlmd, dpa);
> >> +  unsigned long pfn = PHYS_PFN(hpa);
> >> +
> >> +  if (!IS_ENABLED(CONFIG_MEMORY_FAILURE))
> >> +  return;
> >> +
> >> +  memory_failure_queue(pfn, MF_ACTION_REQUIRED);  
> > 
> > I thought that ras daemon was supposed to take care of this when the trace
> > event occurred.  Alison is working on the HPA data for that path.  
> 
> It seems to save CXL trace events/memory-failures to a DB and report to
> others, but it cannot make the OS trigger memory failure handling.

Interesting question of whose problem this is.  For corrected
errors it's policy stuff that belongs in userspace, but for known
memory failure it might want to be in the kernel.

Shiju (+Cc) pointed me at the existing rasdaemon handling for corrected
errors (statistics get too bad, so memory is offlined):
https://github.com/mchehab/rasdaemon/commit/9ae6b70effb8adc9572debc800b8e16173f74bb8

Poison detection via scrub, though, is reported via a CPER Memory Error
Section with "Scrub uncorrected error" set.  That triggers APEI handling;
on x86 that looks to be apei_mce_report_mem_error().  It also triggers the
ghes_do_memory_failure() path and ultimately memory_failure().

Conveniently there was a patch fixing the sync path last year
that includes info on what happens in the async case.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a70297d2213253853e95f5b49651f924990c6d3b

"In addition, for aysnchronous errors, kernel will notify the
process who owns the poisoned page by sending SIGBUS with BUS_MCERRR_A0
in early kill mode."

So I think the kernel should probably do the same for CXL poison
error records when it gets them.

Jonathan
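
To make that concrete, a minimal sketch (an illustration built from the
helpers in the quoted patch, not merged kernel code) of queueing
memory_failure() work for a CXL-reported poisoned DPA:

/*
 * Illustrative only: mirror the GHES async path for a CXL poison
 * record.  cxl_trace_hpa() and memory_failure_queue() are the calls
 * used by the patch under review; the flag choice (MF_ACTION_REQUIRED
 * vs 0 for action-optional/async errors) is still an open question.
 */
static void cxl_queue_poison(struct cxl_region *cxlr,
                             struct cxl_memdev *cxlmd, u64 dpa)
{
        u64 hpa = cxl_trace_hpa(cxlr, cxlmd, dpa);      /* DPA -> HPA */

        if (IS_ENABLED(CONFIG_MEMORY_FAILURE))
                memory_failure_queue(PHYS_PFN(hpa), MF_ACTION_REQUIRED);
}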




Re: CXL numa error on arm64 qemu virt machine

2024-05-08 Thread Jonathan Cameron via
On Wed, 8 May 2024 16:00:51 +0800
Yuquan Wang  wrote:

> Hello, Jonathan
> 
> Recently I ran some cxl tests on qemu virt (branch: cxl-2024-04-22-draft)
> but met some problems.
> problems.
> 
> Problems: 
> 1) the virt machine could not set the right numa topology from user input;
> 
> My Qemu numa set:
> -object memory-backend-ram,size=2G,id=mem0 \
> -numa node,nodeid=0,cpus=0-1,memdev=mem0 \
> -object memory-backend-ram,size=2G,id=mem1 \
> -numa node,nodeid=1,cpus=2-3,memdev=mem1 \

That is setting up the main DRAM nodes, unrelated to
CXL memory. For CXL memory you need to use generic
port entries (in my gitlab qemu tree - with examples but not upstream yet)
However, if you get some breakage

> 
> However, the system shows:
> root@ubuntu-jammy-arm64:~# numactl -H
>   available: 1 nodes (0)
>   node 0 cpus: 0 1 2 3
>   node 0 size: 4166 MB
>   node 0 free: 3920 MB
>   node distances:
>   node   0 
>   0:  10 
> 
> Boot Kernel print:
> [0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x4000-0xbfff]
> [0.00] ACPI: SRAT: Node 1 PXM 1 [mem 0xc000-0x13fff]
> [0.00] ACPI: Unknown target node for memory at 0x100, assuming node 0
> [0.00] NUMA: Warning: invalid memblk node 16 [mem 0x0400-0x07ff]
> [0.00] NUMA: Faking a node at [mem 0x0400-0x00013fff]
> [0.00] NUMA: NODE_DATA [mem 0x13f7f89c0-0x13f7fafff]
> 
> 2) it seems like the problem of allocating a numa node in arm for cxl
> memory still exists;
> Previous discussion: 
> https://lore.kernel.org/linux-cxl/20231011150620.2...@huawei.com/
> 
> root@debian-bullseye-arm64:~# cxl create-region -d decoder0.0 -t ram
> [   68.653873] cxl region0: Bypassing cpu_cache_invalidate_memregion() for testing!
> [   68.660568] Unknown target node for memory at 0x100, assuming node 0

You need a load of kernel changes for NUMA nodes to work correctly with
CXL memory on arm64 platforms.  I have some working code but need to tidy
up a few corners that came up in an internal review earlier this week.

I have some travel coming up so may take a week or so to get those out.

Curiously that invalid memblk has nothing to do with the CXL fixed memory window.
Could you check if that is happening for you without the CXL patches?

> 
> If not, maybe I could try to do something to help fix this problem.
> 
> 
> My full qemu command line:
> qemu-system-aarch64 \
> -M virt,gic-version=3,cxl=on \
> -m 4G \
> -smp 4 \
> -object memory-backend-ram,size=2G,id=mem0 \
> -numa node,nodeid=0,cpus=0-1,memdev=mem0 \
> -object memory-backend-ram,size=2G,id=mem1 \
> -numa node,nodeid=1,cpus=2-3,memdev=mem1 \
> -cpu cortex-a57 \
> -bios QEMU_EFI.fd.bak \
> -device virtio-blk-pci,drive=hd,bus=pcie.0 \
> -drive if=none,id=hd,file=../disk/debos_arm64.ext \
> -nographic \
> -object memory-backend-file,id=mem2,mem-path=/tmp/mem2,size=256M,share=true \
> -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
> -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
> -device cxl-type3,bus=root_port13,volatile-memdev=mem2,id=cxl-mem1 \
> -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G \
> -qmp tcp:127.0.0.1:,server,nowait \
> 
> Qemu version: the latest commit of branch cxl-2024-04-22-draft in
> "https://gitlab.com/jic23/qemu"
> Kernel version: 6.6.0

Whilst it doesn't work yet (because of missing kernel support)
you'll need something that looks more like the generic ports test added in 
https://gitlab.com/jic23/qemu/-/commit/6589c527920ba22fe0923b60b58d33a8e9fd371e

Most importantly
-numa node,nodeid=2 -object acpi-generic-port,id=gp0,pci-bus=cxl.1,node=2
+ the bits setting distances etc.  Note CXL memory does not provide SLIT-like
data at the moment, so the test above won't help you identify if it is correctly
set up.  That's a gap in general in the kernel support.  Whilst we'd love
it if everyone moved to HMAT-derived information we may need to provide
some fallback.

Jonathan
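
To make that concrete: a minimal fragment along the lines of the generic
ports test linked above (values illustrative; acpi-generic-port and
generic-port HMAT targets are only in the gitlab tree at this point, and
-machine hmat=on plus matching dist/hmat-lb entries for the other
initiators and targets are also needed):

 -numa node,nodeid=2 \
 -object acpi-generic-port,id=gp0,pci-bus=cxl.1,node=2 \
 -numa dist,src=0,dst=2,val=12 \
 -numa hmat-lb,initiator=0,target=2,hierarchy=memory,data-type=access-latency,latency=100 \
 -numa hmat-lb,initiator=0,target=2,hierarchy=memory,data-type=access-bandwidth,bandwidth=16G \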




> 
> Many thanks
> Yuquan
> 




Re: [PATCH v2] mem/cxl_type3: support 3, 6, 12 and 16 interleave ways

2024-05-07 Thread Jonathan Cameron via
On Tue, 7 May 2024 00:22:00 +
"Xingtao Yao (Fujitsu)"  wrote:

> > -Original Message-
> > From: Jonathan Cameron 
> > Sent: Tuesday, April 30, 2024 10:43 PM
> > To: Yao, Xingtao/姚 幸涛 
> > Cc: fan...@samsung.com; qemu-devel@nongnu.org
> > Subject: Re: [PATCH v2] mem/cxl_type3: support 3, 6, 12 and 16 interleave 
> > ways
> > 
> > On Wed, 24 Apr 2024 01:36:56 +
> > "Xingtao Yao (Fujitsu)"  wrote:
> >   
> > > ping.
> > >  
> > > > -Original Message-
> > > > From: Yao Xingtao 
> > > > Sent: Sunday, April 7, 2024 11:07 AM
> > > > To: jonathan.came...@huawei.com; fan...@samsung.com
> > > > Cc: qemu-devel@nongnu.org; Yao, Xingtao/姚 幸涛  
> >   
> > > > Subject: [PATCH v2] mem/cxl_type3: support 3, 6, 12 and 16 interleave 
> > > > ways
> > > >
> > > > Since the kernel does not check the interleave capability, a
> > > > 3-way, 6-way, 12-way or 16-way region can be created normally.
> > > >
> > > > Applications can access the memory of a 16-way region normally because
> > > > qemu can convert hpa to dpa correctly for power of 2 interleave
> > > > ways; after the kernel implements the check, this kind of region will
> > > > not be created any more.
> > > >
> > > > For non power of 2 interleave ways, applications could not access the
> > > > memory normally and may encounter some unexpected behaviors, such as
> > > > segmentation fault.
> > > >
> > > > So implementing this feature is needed.
> > > >
> > > > Link:
> > > > https://lore.kernel.org/linux-cxl/3e84b919-7631-d1db-3e1d-33000f3f3868@fujitsu.com/
> > > > Signed-off-by: Yao Xingtao 
> > > > ---
> > > >  hw/mem/cxl_type3.c | 18 ++
> > > >  1 file changed, 14 insertions(+), 4 deletions(-)
> > > >
> > > > diff --git a/hw/mem/cxl_type3.c b/hw/mem/cxl_type3.c
> > > > index b0a7e9f11b..d6ef784e96 100644
> > > > --- a/hw/mem/cxl_type3.c
> > > > +++ b/hw/mem/cxl_type3.c
> > > > @@ -805,10 +805,17 @@ static bool cxl_type3_dpa(CXLType3Dev *ct3d, hwaddr host_addr, uint64_t *dpa)
> > > >  continue;
> > > >  }
> > > >
> > > > -*dpa = dpa_base +
> > > > -((MAKE_64BIT_MASK(0, 8 + ig) & hpa_offset) |
> > > > - ((MAKE_64BIT_MASK(8 + ig + iw, 64 - 8 - ig - iw) & hpa_offset)
> > > > -  >> iw));
> > > > +if (iw < 8) {
> > > > +*dpa = dpa_base +
> > > > +((MAKE_64BIT_MASK(0, 8 + ig) & hpa_offset) |
> > > > + ((MAKE_64BIT_MASK(8 + ig + iw, 64 - 8 - ig - iw) & hpa_offset)
> > > > +  >> iw));
> > > > +} else {
> > > > +*dpa = dpa_base +
> > > > +((MAKE_64BIT_MASK(0, 8 + ig) & hpa_offset) |
> > > > + ((((MAKE_64BIT_MASK(ig + iw, 64 - ig - iw) & hpa_offset)
> > > > +>> (ig + iw)) / 3) << (ig + 8)));
> > > > +}
> > > >
> > > >  return true;
> > > >  }
> > > > @@ -906,6 +913,9 @@ static void ct3d_reset(DeviceState *dev)
> > > >  uint32_t *write_msk = ct3d->cxl_cstate.crb.cache_mem_regs_write_mask;
> > > >
> > > >  cxl_component_register_init_common(reg_state, write_msk, CXL2_TYPE3_DEVICE);
> > > > +ARRAY_FIELD_DP32(reg_state, CXL_HDM_DECODER_CAPABILITY, 3_6_12_WAY, 1);
> > > > +ARRAY_FIELD_DP32(reg_state, CXL_HDM_DECODER_CAPABILITY, 16_WAY, 1);
> > > > +
> > 
> > Why here rather than in hdm_reg_init_common()?
> > It's constant data and is currently being set to 0 in there.  
> 
> according to the CXL specification (8.2.4.20.1 CXL HDM Decoder Capability
> Register (Offset 00h)), this feature is only applicable to CXL.mem;
> upstream switch ports and CXL host bridges shall hardwire these bits to 0.
> 
> so I think it would be more appropriate to set these bits here.
I don't follow. hdm_init_common() (sorry wrong function name above)
has some type specific stuff already to show how this can be done.
I'd prefer to minimize what we set directly in the ct3d_reset() call
because it loses the connection to the rest of the register setup.

Jonathan





> 
> >   
> > > >  cxl_device_register_init_t3(ct3d);
> > > >
> > > >  /*
> > > > --
> > > > 2.37.3  
> > >  
> 




Re: [PATCH v7 09/12] hw/cxl/events: Add qmp interfaces to add/release dynamic capacity extents

2024-05-01 Thread Jonathan Cameron via




> > >> > +# @hid: host id  
> > >> 
> > >> @host-id, unless "HID" is established terminology in CXL DCD land.  
> > >
> > > host-id works.  
> > >> 
> > >> What is a host ID?  
> > >
> > > It is an id identifying the host to which the capacity is being added.  
> > 
> > How are these IDs assigned?  
> 
> All the arguments passed to the command here are defined in CXL spec. I
> will add reference to the spec.
> 
> Based on the spec, for LD-FAM (Fabric attached memory represented as
> logical device), host id is the LD-ID of the host interface to which
> the capacity is being added. LD-ID is a unique number (16-bit) assigned
> to a host interface.

Key here is the host doesn't know it.  This ID exists purely for routing
to the appropriate host interface, either via choosing a port on a
multi-head Single Logical Device (MH-SLD) (so today it's always 0 as we only
have one head) or, if we ever implement a switch capable of handling MLDs,
then the switch will handle routing of host PCIe accesses so it lands
on the interface defined by this ID (and the event turns up in that event log).

Host A Host B - could in theory be a RP on host A ;)
  |  |  (Doesn't exist yet, but there are partial
 _|__|_  patches for this on list.)
| LD 0 LD 1|
|  |
|   Multi Head |
|   Single Logical |
|  Device (MH-SLD) |
|__|
Host view similar to the switch case, but just two direct
connected devices.

Or Switch and MLD case - we aren't emulating this yet at all

 Wiring / real topology Host View 
 
  Host A Host B  Host A   Host B
|  |   ||
 ___|__|___   _|_  _|_
|   \  SWITCH /| |SW0|| | |
|\   / | | | || | |
|LD0   LD1 | | | || | |
|  \   /   | | | || | |
|| | | | || | |
||_| |_|_||_|_|
 | ||
  Traffic tagged with LD   ||
 | ||
 | |___ |___
| Multilogical Device MLD |   ||   ||
|||   | Simple |   | Another|
|   / \   |   | CXL|   | CXL|
|  /   \  |   | Memory |   | Memory |
|Interfaces   |   | Device |   | Device |
|   LD0 LD1   |   ||   ||
|_|   ||   ||

Note the hosts just see separate devices and switches, with the fun exception
that the memory may actually be available to both at the same time.

The control plane for the switches and MLD sees what is actually going on.

At this stage the upshot is we could just default this to zero and add an optional
parameter to set it later.



...

> > >> > +# @extents: Extents to release
> > >> > +#
> > >> > +# Since : 9.1
> > >> > +##
> > >> > +{ 'command': 'cxl-release-dynamic-capacity',
> > >> > +  'data': { 'path': 'str',
> > >> > +'hid': 'uint16',
> > >> > +'flags': 'uint8',
> > >> > +'region-id': 'uint8',
> > >> > +'tag': 'str',
> > >> > +'extents': [ 'CXLDCExtentRecord' ]
> > >> > +   }
> > >> > +}  
> > >> 
> > >> During review of v5, you wrote:
> > >> 
> > >> For add command, the host will send a mailbox command to response to
> > >> the add request to the device to indicate whether it accepts the add
> > >> capacity offer or not.
> > >> 
> > >> For release command, the host send a mailbox command (not always a
> > >> response since the host can proactively release capacity if it does
> > >> not need it any more) to device to ask device release the capacity.
> > >> 
> > >> Can you briefly sketch the protocol?  Peers and messages involved.
> > >> Possibly as a state diagram.  
> > >
> > > Need to think about it. If we can polish the text nicely, maybe the
> > > sketch is not needed. My concern is that the sketch may
> > > introduce unwanted complexity as we expose too many details. The two
> > > commands provide ways to add/release dynamic capacity to/from a host,
> > > that is all. All the other information, like what the host will do, or
> > > how the device will react, are consequence of the command, not sure
> > > whether we want to include here.  
> > 
> > The protocol sketch is for me, not necessarily the doc comment.  I'd
> > like to understand at high level how this stuff works, because only then
> > can I meaningfully review the docs.  
> 
> 
> For add command, saying a user sends a request to FM to ask 

Re: [PATCH v7 09/12] hw/cxl/events: Add qmp interfaces to add/release dynamic capacity extents

2024-04-30 Thread Jonathan Cameron via
On Mon, 29 Apr 2024 09:58:42 +0200
Markus Armbruster  wrote:

> fan  writes:
> 
> > On Fri, Apr 26, 2024 at 11:12:50AM +0200, Markus Armbruster wrote:  
> >> nifan@gmail.com writes:  
> 
> [...]
> 
> >> > diff --git a/qapi/cxl.json b/qapi/cxl.json
> >> > index 4281726dec..2dcf03d973 100644
> >> > --- a/qapi/cxl.json
> >> > +++ b/qapi/cxl.json
> >> > @@ -361,3 +361,72 @@
> >> >  ##
> >> >  {'command': 'cxl-inject-correctable-error',
> >> >   'data': {'path': 'str', 'type': 'CxlCorErrorType'}}
> >> > +
> >> > +##
> >> > +# @CXLDCExtentRecord:  
> >> 
> >> Such traffic jams of capital letters are hard to read.  What about
> >> CxlDynamicCapacityExtent?
> >>   
> >> > +#
> >> > +# Record of a single extent to add/release  
> >> 
> >> Suggest "A dynamic capacity extent."
> >>   
> >> > +#
> >> > +# @offset: offset to the start of the region where the extent to be 
> >> > operated  
> >> 
> >> Blank line here, please.
> >> 
> >> 
> >>   
> >> > +# @len: length of the extent
> >> > +#
> >> > +# Since: 9.1
> >> > +##
> >> > +{ 'struct': 'CXLDCExtentRecord',
> >> > +  'data': {
> >> > +  'offset':'uint64',
> >> > +  'len': 'uint64'
> >> > +  }
> >> > +}
> >> > +
> >> > +##
> >> > +# @cxl-add-dynamic-capacity:
> >> > +#
> >> > +# Command to start add dynamic capacity extents flow. The device will
> >> > +# have to acknowledged the acceptance of the extents before they are 
> >> > usable.  
> >> 
> >> This text needs work.  More on that at the end of my review.  
> >
> > Yes. I will work on it for the next version once all the feedbacks
> > are collected and comments are resolved.
> >
> > See below.
> >  
> >> 
> >> docs/devel/qapi-code-gen.rst:
> >> 
> >> For legibility, wrap text paragraphs so every line is at most 70
> >> characters long.
> >> 
> >> Separate sentences with two spaces.
> >> 
> >> More elsewhere.
> >>   
> >> > +#
> >> > +# @path: CXL DCD canonical QOM path  
> >> 
> >> I'd prefer @qom-path, unless you can make a consistency argument for
> >> @path.
> >> 
> >> Sure the QOM path needs to be canonical?
> >> 
> >> If not, what about "path to the CXL dynamic capacity device in the QOM
> >> tree".  Intentionally close to existing descriptions of @qom-path
> >> elsewhere.  
> >
> > From the same file, I saw "path" was used for other commands, like
> > "cxl-inject-memory-module-event", so I followed it.
> > DCD is nothing different from a "type 3 device" except that it can
> > dynamically change capacity. 
> > Renaming it to "qom-path" is no problem for me, just want to make sure it
> > will not break the naming consistency.  
> 
> Both @path and @qom-path are used (sadly).  @path is used for all kinds
> of paths, whereas @qom-path is only used for QOM paths.  That's why I
> prefer it.
> 
> However, you're making a compelling local consistency argument: cxl.json
> uses only @path.  Sticking to that makes sense.
> 
> >> > +# @hid: host id  
> >> 
> >> @host-id, unless "HID" is established terminology in CXL DCD land.  
> >
> > host-id works.  
> >> 
> >> What is a host ID?  
> >
> > It is an id identifying the host to which the capacity is being added.  
> 
> How are these IDs assigned?

Right now there is only 1 option.  We can drop this for now and introduce
it when needed (Default of 0 will be fine).  Multi head device patches
that will need this are on list though I haven't read them yet :(

> 
> >> > +# @selection-policy: policy to use for selecting extents for adding 
> >> > capacity  
> >> 
> >> Where are selection policies defined?  
> >
> > It is defined in CXL specification: Specifies the policy to use for 
> > selecting
> > which extents comprise the added capacity  
> 
> Include a reference to the spec here?
> 
> >> > +# @region-id: id of the region where the extent to add  
> >> 
> >> Is "region ID" the established terminology in CXL DCD land?  Or is
> >> "region number" also used?  I'm asking because "ID" in this QEMU device
> >> context suggests a connection to a qdev ID.
> >> 
> >> If region number is fine, I'd rename to just @region, and rephrase the
> >> description to avoid "ID".  Perhaps "number of the region the extent is
> >> to be added to".  Not entirely happy with the phrasing, doesn't exactly
> >> roll off the tongue, but "where the extent to add" sounds worse to my
> >> ears.  Mind, I'm not a native speaker.  
> >
> > Yes. region number is fine. Will rename it as "region"
> >  
> >>   
> >> > +# @tag: Context field  
> >> 
> >> What is this about?  
> >
> > Based on the specification, it is "Context field utilized by implementations
> > that make use of the Dynamic Capacity feature.". Basically, it is a
> > string (label) attached to a dynamic capacity extent so we can achieve a
> > specific purpose, like identifying or grouping extents.  
> 
> Include a reference to the spec here?

Agreed - that is the best we can do.  It's a magic value.

> 
> >> > +# @extents: Extents to add  
> >> 
> >> Blank lines between argument descriptions, please.
> >>   
> >> > +#
> >> > 

Re: [PATCH v2] mem/cxl_type3: support 3, 6, 12 and 16 interleave ways

2024-04-30 Thread Jonathan Cameron via
On Wed, 24 Apr 2024 01:36:56 +
"Xingtao Yao (Fujitsu)"  wrote:

> ping.
> 
> > -Original Message-
> > From: Yao Xingtao 
> > Sent: Sunday, April 7, 2024 11:07 AM
> > To: jonathan.came...@huawei.com; fan...@samsung.com
> > Cc: qemu-devel@nongnu.org; Yao, Xingtao/姚 幸涛 
> > Subject: [PATCH v2] mem/cxl_type3: support 3, 6, 12 and 16 interleave ways
> > 
> > Since the kernel does not check the interleave capability, a
> > 3-way, 6-way, 12-way or 16-way region can be created normally.
> > 
> > Applications can access the memory of a 16-way region normally because
> > qemu can convert hpa to dpa correctly for power of 2 interleave
> > ways; after the kernel implements the check, this kind of region will
> > not be created any more.
> > 
> > For non power of 2 interleave ways, applications could not access the
> > memory normally and may encounter some unexpected behaviors, such as
> > segmentation fault.
> > 
> > So implementing this feature is needed.
> > 
> > Link:
> > https://lore.kernel.org/linux-cxl/3e84b919-7631-d1db-3e1d-33000f3f3868@fujitsu.com/
> > Signed-off-by: Yao Xingtao 
> > ---
> >  hw/mem/cxl_type3.c | 18 ++
> >  1 file changed, 14 insertions(+), 4 deletions(-)
> > 
> > diff --git a/hw/mem/cxl_type3.c b/hw/mem/cxl_type3.c
> > index b0a7e9f11b..d6ef784e96 100644
> > --- a/hw/mem/cxl_type3.c
> > +++ b/hw/mem/cxl_type3.c
> > @@ -805,10 +805,17 @@ static bool cxl_type3_dpa(CXLType3Dev *ct3d, hwaddr host_addr, uint64_t *dpa)
> >  continue;
> >  }
> > 
> > -*dpa = dpa_base +
> > -((MAKE_64BIT_MASK(0, 8 + ig) & hpa_offset) |
> > - ((MAKE_64BIT_MASK(8 + ig + iw, 64 - 8 - ig - iw) & hpa_offset)
> > -  >> iw));  
> > +if (iw < 8) {
> > +*dpa = dpa_base +
> > +((MAKE_64BIT_MASK(0, 8 + ig) & hpa_offset) |
> > + ((MAKE_64BIT_MASK(8 + ig + iw, 64 - 8 - ig - iw) & hpa_offset)
> > +  >> iw));
> > +} else {
> > +*dpa = dpa_base +
> > +((MAKE_64BIT_MASK(0, 8 + ig) & hpa_offset) |
> > + ((((MAKE_64BIT_MASK(ig + iw, 64 - ig - iw) & hpa_offset)
> > +>> (ig + iw)) / 3) << (ig + 8)));
> > +}
> > 
> >  return true;
> >  }
> > @@ -906,6 +913,9 @@ static void ct3d_reset(DeviceState *dev)
> >  uint32_t *write_msk = ct3d->cxl_cstate.crb.cache_mem_regs_write_mask;
> > 
> >  cxl_component_register_init_common(reg_state, write_msk, CXL2_TYPE3_DEVICE);
> > +ARRAY_FIELD_DP32(reg_state, CXL_HDM_DECODER_CAPABILITY, 3_6_12_WAY, 1);
> > +ARRAY_FIELD_DP32(reg_state, CXL_HDM_DECODER_CAPABILITY, 16_WAY, 1);
> > +

Why here rather than in hdm_reg_init_common()?
It's constant data and is currently being set to 0 in there.

> >  cxl_device_register_init_t3(ct3d);
> > 
> >  /*
> > --
> > 2.37.3  
> 
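
As background to the decode under review: a minimal standalone sketch
(illustrative, not the QEMU code) of the HPA-offset to DPA-offset mapping
for the 3-way case (encoded iw == 8), where bits below the 2^(8+ig)-byte
granularity pass through and the remaining bits are divided by 3:

#include <assert.h>
#include <stdint.h>

/* ig encodes the interleave granularity (2^(8+ig) bytes); low bits pass
 * through and the bits above the granule are divided by 3 to get the
 * device granule index. */
static uint64_t dpa_offset_3way(uint64_t hpa_offset, unsigned ig)
{
    uint64_t gran_mask = (1ULL << (8 + ig)) - 1;

    return (hpa_offset & gran_mask) |
           (((hpa_offset >> (8 + ig)) / 3) << (8 + ig));
}

int main(void)
{
    /* ig=0: 256B granules.  HPA granule 3 -> device granule 1,
     * i.e. the second granule back on the first target. */
    assert(dpa_offset_3way(3 * 256, 0) == 256);
    /* Offsets within a granule are preserved. */
    assert(dpa_offset_3way(3 * 256 + 17, 0) == 256 + 17);
    return 0;
}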




Re: [PATCH 3/6] hw/acpi: Generic Port Affinity Structure support

2024-04-30 Thread Jonathan Cameron via
On Tue, 30 Apr 2024 08:55:12 +0200
Markus Armbruster  wrote:

> Jonathan Cameron  writes:
> 
> > On Tue, 23 Apr 2024 12:56:21 +0200
> > Markus Armbruster  wrote:
> >  
> >> Jonathan Cameron  writes:
> >>   
> >> > These are very similar to the recently added Generic Initiators
> >> > but instead of representing an initiator of memory traffic they
> >> > represent an edge point beyond which may lie either targets or
> >> > initiators.  Here we add these ports such that they may
> >> > be targets of hmat_lb records to describe the latency and
> >> > bandwidth from host side initiators to the port.  A discoverable
> >> > mechanism such as UEFI CDAT read from CXL devices and switches
> >> > is used to discover the remainder of the path, and the OS can build
> >> > up full latency and bandwidth numbers as needed for workload and data
> >> > placement decisions.
> >> >
> >> > Signed-off-by: Jonathan Cameron   
> >
> > Hi Markus,
> >
> > I've again managed a bad job of defining an interface - thanks for
> > your help!  
> 
> Good interfaces are hard!
> 
> >> > ---
> >> >  qapi/qom.json|  18 +++
> >> >  include/hw/acpi/acpi_generic_initiator.h |  18 ++-
> >> >  include/hw/pci/pci_bridge.h  |   1 +
> >> >  hw/acpi/acpi_generic_initiator.c | 141 +--
> >> >  hw/pci-bridge/pci_expander_bridge.c  |   1 -
> >> >  5 files changed, 141 insertions(+), 38 deletions(-)
> >> >
> >> > diff --git a/qapi/qom.json b/qapi/qom.json
> >> > index 85e6b4f84a..5480d9ca24 100644
> >> > --- a/qapi/qom.json
> >> > +++ b/qapi/qom.json
> >> > @@ -826,6 +826,22 @@
> >> >'data': { 'pci-dev': 'str',
> >> >  'node': 'uint32' } }
> >> >  
> >> > +
> >> > +##
> >> > +# @AcpiGenericPortProperties:
> >> > +#
> >> > +# Properties for acpi-generic-port objects.
> >> > +#
> >> > +# @pci-bus: PCI bus of the hostbridge associated with this SRAT entry   
> >> >  
> >> 
> >> What's this exactly?  A QOM path?  A qdev ID?  Something else?  
> >
> > QOM path I believe, as we're going to call object_resolve_path_type() on it.  
> 
> QOM path then.
> 
> > Oddity is it's defined for the bus, not the host bridge that
> > we care about as the host bridge doesn't have a convenient id to let
> > us identify it.
> >
> > e.g. It is specified via --device pxb-cxl,id=
> > of TYPE_PXB_CXL_HOST in the command line but ends up on the
> > TYPE_PCI_BUS with parent set to the PXB_CXL_HOST.
> > Normally we just want this bus for hanging root ports of it.
> >
> > I can clarify it's the QOM path but I'm struggling a bit to explain
> > the relationship without resorting to an example.
> > This should also not mention SRAT as at some stage I'd expect DT
> > bindings to provide similar functionality.  
> 
> Let's start with an example.  Not to put it into the doc comment, only
> to help me understand what you need.  Hopefully I can then assist with
> improving the interface and/or its documentation.

Stripping out some relevant bits from a test setup and editing it down
- most of this is about creating relevant SLIT/HMAT tables.

# First a CXL root bridge, root port and direct attached device plus fixed
# memory window.  Linux currently builds a NUMA node per fixed memory window
# but that's a simplification that may change over time.
 -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1,hdm_for_passthrough=true \
 -device cxl-rp,port=0,bus=cxl.1,id=cxl_rp_port0,chassis=0,slot=2 \
 -device cxl-type3,bus=cxl_rp_port0,volatile-memdev=cxl-mem1,persistent-memdev=cxl-mem2,id=cxl-pmem1,lsa=cxl-lsa1,sn=3 \
 -machine cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G,cxl-fmw.0.interleave-granularity=1k \

# Next line is the port definition - see that the pci-bus refers to the one
# in the id parameter for the PXB CXL, but the ACPI table that is generated
# refers to the DSDT entry via an ACPI0016 entry.  So to get to that we use
# the PCI bus ID of the root bus that forms part of the root bridge (but is
# a child object in qemu).
 -numa node,nodeid=2 \
 -object acpi-generic-port,id=bob11,pci-bus=cxl.1,node=2 \

# The rest is the setup for the HMAT and SLIT tables.  I hid most of the
# config, but left this here as the key point is that we specify values to
# and from the generic port 'node' - but it's not really a node as such,
# just a point along the path to one.

 -numa dist,src=0,dst=0,val=10 -nu

Re: [PATCH 3/6] hw/acpi: Generic Port Affinity Structure support

2024-04-29 Thread Jonathan Cameron via
On Tue, 23 Apr 2024 12:56:21 +0200
Markus Armbruster  wrote:

> Jonathan Cameron  writes:
> 
> > These are very similar to the recently added Generic Initiators
> > but instead of representing an initiator of memory traffic they
> > represent an edge point beyond which may lie either targets or
> > initiators.  Here we add these ports such that they may
> > be targets of hmat_lb records to describe the latency and
> > bandwidth from host side initiators to the port.  A discoverable
> > mechanism such as UEFI CDAT read from CXL devices and switches
> > is used to discover the remainder of the path, and the OS can build
> > up full latency and bandwidth numbers as needed for workload and data
> > placement decisions.
> >
> > Signed-off-by: Jonathan Cameron 

Hi Markus,

I've again managed a bad job of defining an interface - thanks for
your help!

> > ---
> >  qapi/qom.json|  18 +++
> >  include/hw/acpi/acpi_generic_initiator.h |  18 ++-
> >  include/hw/pci/pci_bridge.h  |   1 +
> >  hw/acpi/acpi_generic_initiator.c | 141 +--
> >  hw/pci-bridge/pci_expander_bridge.c  |   1 -
> >  5 files changed, 141 insertions(+), 38 deletions(-)
> >
> > diff --git a/qapi/qom.json b/qapi/qom.json
> > index 85e6b4f84a..5480d9ca24 100644
> > --- a/qapi/qom.json
> > +++ b/qapi/qom.json
> > @@ -826,6 +826,22 @@
> >'data': { 'pci-dev': 'str',
> >  'node': 'uint32' } }
> >  
> > +
> > +##
> > +# @AcpiGenericPortProperties:
> > +#
> > +# Properties for acpi-generic-port objects.
> > +#
> > +# @pci-bus: PCI bus of the hostbridge associated with this SRAT entry  
> 
> What's this exactly?  A QOM path?  A qdev ID?  Something else?

QOM path I believe, as we're going to call object_resolve_path_type() on it.
Oddity is it's defined for the bus, not the host bridge that
we care about as the host bridge doesn't have a convenient id to let
us identify it.

e.g. It is specified via --device pxb-cxl,id=
of TYPE_PXB_CXL_HOST in the command line but ends up on the
TYPE_PCI_BUS with parent set to the PXB_CXL_HOST.
Normally we just want this bus for hanging root ports of it.

I can clarify it's the QOM path but I'm struggling a bit to explain
the relationship without resorting to an example.
This should also not mention SRAT as at some stage I'd expect DT
bindings to provide similar functionality.


> 
> > +#
> > +# @node: numa node associated with the PCI device  
> 
> NUMA
> 
> Is this a NUMA node ID?

Fair question with a non-obvious answer.  ACPI-wise it's a proximity domain.
In every other SRAT entry (all of which define proximity domains) this does
map to a NUMA node in an operating system, as they contain at least either
some form of memory access initiator (CPU, Generic Initiator etc) or a
target (memory).

A Generic Port is subtly different in that it defines a proximity domain
that in and of itself is not what we'd think of as a NUMA node but
rather an entity that exists to provide the info to the OS to stitch
together non discoverable and discoverable buses.

So I should have gone with something more specific. Could add this to
the parameter docs, or is it too much?

@node: Similar to a NUMA node ID, but instead of providing a reference
   point used for defining NUMA distances and access characteristics
   to memory or from an initiator (e.g. CPU), this node defines the
   boundary point between non discoverable system buses which must be
   discovered from firmware, and a discoverable bus.  NUMA distances
   and access characteristics are defined to and from that point,
   but for system software to establish full initiator to target
   characteristics this information must be combined with information
   retrieved from the discoverable part of the path.  An example would
   use CDAT information read from devices and switches in conjunction
   with link characteristics read from PCIe Configuration space.




> 
> > +#
> > +# Since: 9.1
> > +##
> > +{ 'struct': 'AcpiGenericPortProperties',
> > +  'data': { 'pci-bus': 'str',
> > +'node': 'uint32' } }
> > +
> >  ##
> >  # @RngProperties:
> >  #
> > @@ -944,6 +960,7 @@
> >  { 'enum': 'ObjectType',
> >'data': [
> >  'acpi-generic-initiator',
> > +'acpi-generic-port',
> >  'authz-list',
> >  'authz-listfile',
> >  'authz-pam',
> > @@ -1016,6 +1033,7 @@
> >'discriminator': 'qom-type',
> >'data': {
> >'acpi-generic-initiator': 'AcpiGenericInitiatorProperties',
> > +  'acpi-generic-port':  'AcpiGenericPortProperties',
> >'authz-list': 'AuthZListProperties',
> >'authz-listfile': 'AuthZListFileProperties',
> >'authz-pam':  'AuthZPAMProperties',  
> 
> [...]
> 
> 




Re: [PATCH v5 09/13] hw/cxl/events: Add qmp interfaces to add/release dynamic capacity extents

2024-04-26 Thread Jonathan Cameron via
On Thu, 25 Apr 2024 10:30:51 -0700
Ira Weiny  wrote:

> Markus Armbruster wrote:
> > fan  writes:
> >   
> > > On Wed, Apr 24, 2024 at 03:09:52PM +0200, Markus Armbruster wrote:  
> > >> nifan@gmail.com writes:
> > >>   
> > >> > From: Fan Ni 
> > >> >
> > >> > Since fabric manager emulation is not supported yet, the change 
> > >> > implements
> > >> > the functions to add/release dynamic capacity extents as QMP 
> > >> > interfaces.  
> > >> 
> > >> Will fabric manager emulation obsolete these commands?  
> > >
> > > If, in the future, fabric manager emulation supports commands for dynamic
> > > capacity extent add/release, it is possible we will not need the commands.
> > > But that does not seem likely to happen soon, and we need the qmp commands
> > > for the end-to-end test with kernel DCD support.  
> > 
> > I asked because if the commands are temporary testing aids, they should
> > probably be declared unstable.  Even if they are permanent testing aids,
> > unstable might be the right choice.  This is for the CXL maintainers to
> > decide.
> > 
> > What does "unstable" mean?  docs/devel/qapi-code-gen.rst: "Interfaces so
> > marked may be withdrawn or changed incompatibly in future releases."
> > 
> > Management applications need stable interfaces.  Libvirt developers
> > generally refuse to touch anything in QMP that's declared unstable.
> > 
> > Human users and their ad hoc scripts appreciate stability, but they
> > don't need it nearly as much as management applications do.
> > 
> > A stability promise increases the maintenance burden.  By how much is
> > unclear.  In other words, by promising stability, the maintainers take
> > on risk.  Are the CXL maintainers happy to accept the risk here?
> >   
> 
> Ah...  All great points.
> 
> Outside of CXL development I don't think there is a strong need for them
> to be stable.  I would like to see more than ad hoc scripts use them
> though.  So I don't think they are going to be changed without some
> thought.

These align closely with the data that comes from the fabric management
API in the CXL spec.  So I don't see a big maintenance burden problem
in having these as stable interfaces.  Whilst they aren't doing quite
the same job as the FM-API (which will be emulated such that it is
visible to the guest as that aids some other types of testing) that
interface defines the limits on what we can tell the device to do.

So yes, risk for these is minimal and I'm happy to accept that.
It'll be a while before we need libvirt to use them but I do
expect to see that happen. (subject to some guessing on a future
virtualization stack!)

Jonathan



> 
> Ira
> 
> [snip]




Re: [PATCH v5 09/13] hw/cxl/events: Add qmp interfaces to add/release dynamic capacity extents

2024-04-26 Thread Jonathan Cameron via
On Wed, 24 Apr 2024 10:33:33 -0700
Ira Weiny  wrote:

> Markus Armbruster wrote:
> > nifan@gmail.com writes:
> >   
> > > From: Fan Ni 
> > >
> > > Since fabric manager emulation is not supported yet, the change implements
> > > the functions to add/release dynamic capacity extents as QMP interfaces.  
> > 
> > Will fabric manager emulation obsolete these commands?  
> 
> I don't think so.  In the development of the kernel, I see these being
> valuable to do CI and regression testing without the complexity of an FM.

Fully agree - I also long term see these as the drivers for one
possible virtualization stack for DCD devices (whether it turns
out to be the way forwards for that is going to take a while to
resolve!)

It doesn't make much sense to add a fabric manager into that flow
or to expose an appropriate (maybe MCTP) interface from QEMU just
to poke the emulated device.

Jonathan


> 
> Ira
> 
> >   
> > > Note: we skip any FM-issued extent release request if the exact extent
> > > does not exist in the extent list of the device. We will loosen the
> > > restriction later once we have partial release support in the kernel.
> > >
> > > 1. Add dynamic capacity extents:
> > >
> > > For example, the command to add two contiguous extents (each 128MiB long)
> > > to region 0 (starting at DPA offset 0) looks like below:
> > >
> > > { "execute": "qmp_capabilities" }
> > >
> > > { "execute": "cxl-add-dynamic-capacity",
> > >   "arguments": {
> > >   "path": "/machine/peripheral/cxl-dcd0",
> > >   "region-id": 0,
> > >   "extents": [
> > >   {
> > >   "dpa": 0,
> > >   "len": 134217728
> > >   },
> > >   {
> > >   "dpa": 134217728,
> > >   "len": 134217728
> > >   }
> > >   ]
> > >   }
> > > }
> > >
> > > 2. Release dynamic capacity extents:
> > >
> > > For example, the command to release an extent of size 128MiB from region 0
> > > (DPA offset 128MiB) look like below:
> > >
> > > { "execute": "cxl-release-dynamic-capacity",
> > >   "arguments": {
> > >   "path": "/machine/peripheral/cxl-dcd0",
> > >   "region-id": 0,
> > >   "extents": [
> > >   {
> > >   "dpa": 134217728,
> > >   "len": 134217728
> > >   }
> > >   ]
> > >   }
> > > }
> > >
> > > Signed-off-by: Fan Ni   
> > 
> > [...]
> >   
> > > diff --git a/qapi/cxl.json b/qapi/cxl.json
> > > index 8cc4c72fa9..2645004666 100644
> > > --- a/qapi/cxl.json
> > > +++ b/qapi/cxl.json
> > > @@ -19,13 +19,16 @@
> > >  #
> > >  # @fatal: Fatal Event Log
> > >  #
> > > +# @dyncap: Dynamic Capacity Event Log
> > > +#
> > >  # Since: 8.1
> > >  ##
> > >  { 'enum': 'CxlEventLog',
> > >'data': ['informational',
> > > 'warning',
> > > 'failure',
> > > -   'fatal']
> > > +   'fatal',
> > > +   'dyncap']  
> > 
> > We tend to avoid abbreviations in QMP identifiers: dynamic-capacity.
> >   
> > >   }
> > >  
> > >  ##
> > > @@ -361,3 +364,59 @@
> > >  ##
> > >  {'command': 'cxl-inject-correctable-error',
> > >   'data': {'path': 'str', 'type': 'CxlCorErrorType'}}
> > > +
> > > +##
> > > +# @CXLDCExtentRecord:  
> > 
> > Such traffic jams of capital letters are hard to read.
> > 
> > What does DC mean?
> >   
> > > +#
> > > +# Record of a single extent to add/release
> > > +#
> > > +# @offset: offset to the start of the region where the extent to be 
> > > operated  
> > 
> > Blank line here, please
> >   
> > > +# @len: length of the extent
> > > +#
> > > +# Since: 9.0
> > > +##
> > > +{ 'struct': 'CXLDCExtentRecord',
> > > +  'data': {
> > > +  'offset':'uint64',
> > > +  'len': 'uint64'
> > > +  }
> > > +}
> > > +
> > > +##
> > > +# @cxl-add-dynamic-capacity:
> > > +#
> > > +# Command to start add dynamic capacity extents flow. The device will  
> > 
> > I think we're missing an article here.  Is it "a flow" or "the flow"?
> >   
> > > +# have to acknowledged the acceptance of the extents before they are 
> > > usable.  
> > 
> > to acknowledge
> > 
> > docs/devel/qapi-code-gen.rst:
> > 
> > For legibility, wrap text paragraphs so every line is at most 70
> > characters long.
> > 
> > Separate sentences with two spaces.
> >   
> > > +#
> > > +# @path: CXL DCD canonical QOM path  
> > 
> > What is a CXL DCD?  Is it a device?
> > 
> > I'd prefer @qom-path, unless you can make a consistency argument for
> > @path.
> >   
> > > +# @region-id: id of the region where the extent to add  
> > 
> > What's a region, and how do they get their IDs?
> >   
> > > +# @extents: Extents to add  
> > 
> > Blank lines between argument descriptions, please.
> >   
> > > +#
> > > +# Since : 9.0  
> > 
> > 9.1
> >   
> > > +##
> > > +{ 'command': 'cxl-add-dynamic-capacity',
> > > +  'data': { 'path': 'str',
> > > +'region-id': 'uint8',
> > > +'extents': [ 'CXLDCExtentRecord' ]
> > > +   }
> > > +}
> > > +
> > > +##
> > > +# @cxl-release-dynamic-capacity:
> > > +#
> > > +# Command to 

Re: [PATCH v7 00/12] Enabling DCD emulation support in Qemu

2024-04-22 Thread Jonathan Cameron via
On Mon, 22 Apr 2024 15:23:16 +0100
Jonathan Cameron  wrote:

> On Mon, 22 Apr 2024 13:04:48 +0100
> Jonathan Cameron  wrote:
> 
> > On Sat, 20 Apr 2024 16:35:46 -0400
> > Gregory Price  wrote:
> >   
> > > On Fri, Apr 19, 2024 at 11:43:14AM -0700, fan wrote:
> > > > On Fri, Apr 19, 2024 at 02:24:36PM -0400, Gregory Price wrote:  
> > > > > 
> > > > > added review to all patches, will hopefully be able to add a Tested-by
> > > > > tag early next week, along with a v1 RFC for MHD bit-tracking.
> > > > > 
> > > > > We've been testing v5/v6 for a bit, so I expect as soon as we get the
> > > > > MHD code ported over to v7 i'll ship a tested-by tag pretty quick.
> > > > > 
> > > > > The super-set release will complicate a few things but this doesn't
> > > > > look like a blocker on our end, just a change to how we track bits in 
> > > > > a
> > > > > shared bit/bytemap.
> > > > >   
> > > > 
> > > > Hi Gregory,
> > > > Thanks for reviewing the patches so quickly. 
> > > > 
> > > > No pressure, but look forward to your MHD work. :)
> > > > 
> > > > Fan  
> > > 
> > > Starting to get into versioning hell a bit, since the Niagara work was
> > > based off of jonathan's branch and the mhd-dcd work needs some of the
> > > extensions from that branch - while this branch is based on master.
> > > 
> > > Probably we'll need to wait for a new cxl dated branch to try and suss
> > > out the pain points before we push an RFC.  I would not want to have
> > > conflicting commits for something like this for example:
> > > 
> > > https://lore.kernel.org/qemu-devel/20230901012914.226527-2-gregory.pr...@memverge.com/
> > > 
> > > We get merge conflicts here because this is behind that patch. So
> > > pushing up an RFC in this state would be mostly useless to everyone
> > 
> > Subtle hint noted ;) 
> > 
> > I'll build a fresh tree - any remaining rebases until QEMU 9.0 should be
> > straightforward anyway.   My ideal is that the NUMA GP series lands early
> > in 9.1 cycle and this can go in parallel.  I'd really like to
> > get this in early if possible so we can start clearing some of the other
> > stuff that ended up built on top of it!  
> 
> I've pushed to gitlab.com/jic23/qemu cxl-2024-04-22-draft.
> It's extremely lightly tested so far.
> 
> To save time, I've temporarily dropped the fm-api DCD initiate
> dynamic capacity add patch as that needs non trivial updates.
> 
> I've not yet caught up with some other outstanding series, but
> I will almost certainly put them on top of DCD.

If anyone pulled in the meantime... I failed to push down a fix from
my working tree on top of this.
Goes to show I shouldn't ignore patches simply named "Push down" :(

Updated on same branch.

Jonathan
> 
> Jonathan
> 
> > 
> > Jonathan
> >   
> > > 
> > > ~Gregory
> > 
> >   
> 
> 




Re: [PATCH v7 00/12] Enabling DCD emulation support in Qemu

2024-04-22 Thread Jonathan Cameron via
On Mon, 22 Apr 2024 13:04:48 +0100
Jonathan Cameron  wrote:

> On Sat, 20 Apr 2024 16:35:46 -0400
> Gregory Price  wrote:
> 
> > On Fri, Apr 19, 2024 at 11:43:14AM -0700, fan wrote:  
> > > On Fri, Apr 19, 2024 at 02:24:36PM -0400, Gregory Price wrote:
> > > > 
> > > > added review to all patches, will hopefully be able to add a Tested-by
> > > > tag early next week, along with a v1 RFC for MHD bit-tracking.
> > > > 
> > > > We've been testing v5/v6 for a bit, so I expect as soon as we get the
> > > > MHD code ported over to v7 i'll ship a tested-by tag pretty quick.
> > > > 
> > > > The super-set release will complicate a few things but this doesn't
> > > > look like a blocker on our end, just a change to how we track bits in a
> > > > shared bit/bytemap.
> > > > 
> > > 
> > > Hi Gregory,
> > > Thanks for reviewing the patches so quickly. 
> > > 
> > > No pressure, but look forward to your MHD work. :)
> > > 
> > > Fan
> > 
> > Starting to get into versioning hell a bit, since the Niagara work was
> > based off of jonathan's branch and the mhd-dcd work needs some of the
> > extensions from that branch - while this branch is based on master.
> > 
> > Probably we'll need to wait for a new cxl dated branch to try and suss
> > out the pain points before we push an RFC.  I would not want to have
> > conflicting commits for something like this for example:
> > 
> > https://lore.kernel.org/qemu-devel/20230901012914.226527-2-gregory.pr...@memverge.com/
> > 
> > We get merge conflicts here because this is behind that patch. So
> > pushing up an RFC in this state would be mostly useless to everyone  
> 
> Subtle hint noted ;) 
> 
> I'll build a fresh tree - any remaining rebases until QEMU 9.0 should be
> straightforward anyway.   My ideal is that the NUMA GP series lands early
> in 9.1 cycle and this can go in parallel.  I'd really like to
> get this in early if possible so we can start clearing some of the other
> stuff that ended up built on top of it!

I've pushed to gitlab.com/jic23/qemu cxl-2024-04-22-draft.
It's extremely lightly tested so far.

To save time, I've temporarily dropped the fm-api DCD initiate
dynamic capacity add patch as that needs non trivial updates.

I've not yet caught up with some other outstanding series, but
I will almost certainly put them on top of DCD.

Jonathan

> 
> Jonathan
> 
> > 
> > ~Gregory  
> 
> 




Re: [PATCH v7 00/12] Enabling DCD emulation support in Qemu

2024-04-22 Thread Jonathan Cameron via
On Sat, 20 Apr 2024 16:35:46 -0400
Gregory Price  wrote:

> On Fri, Apr 19, 2024 at 11:43:14AM -0700, fan wrote:
> > On Fri, Apr 19, 2024 at 02:24:36PM -0400, Gregory Price wrote:  
> > > 
> > > added review to all patches, will hopefully be able to add a Tested-by
> > > tag early next week, along with a v1 RFC for MHD bit-tracking.
> > > 
> > > We've been testing v5/v6 for a bit, so I expect as soon as we get the
> > > MHD code ported over to v7 i'll ship a tested-by tag pretty quick.
> > > 
> > > The super-set release will complicate a few things but this doesn't
> > > look like a blocker on our end, just a change to how we track bits in a
> > > shared bit/bytemap.
> > >   
> > 
> > Hi Gregory,
> > Thanks for reviewing the patches so quickly. 
> > 
> > No pressure, but look forward to your MHD work. :)
> > 
> > Fan  
> 
> Starting to get into versioning hell a bit, since the Niagara work was
> based off of jonathan's branch and the mhd-dcd work needs some of the
> extensions from that branch - while this branch is based on master.
> 
> Probably we'll need to wait for a new cxl dated branch to try and suss
> out the pain points before we push an RFC.  I would not want to have
> conflicting commits for something like this for example:
> 
> https://lore.kernel.org/qemu-devel/20230901012914.226527-2-gregory.pr...@memverge.com/
> 
> We get merge conflicts here because this is behind that patch. So
> pushing up an RFC in this state would be mostly useless to everyone

Subtle hint noted ;) 

I'll build a fresh tree - any remaining rebases until QEMU 9.0 should be
straightforward anyway.   My ideal is that the NUMA GP series lands early
in 9.1 cycle and this can go in parallel.  I'd really like to
get this in early if possible so we can start clearing some of the other
stuff that ended up built on top of it!

Jonathan

> 
> ~Gregory




Re: [PATCH v7 09/12] hw/cxl/events: Add qmp interfaces to add/release dynamic capacity extents

2024-04-22 Thread Jonathan Cameron via
On Thu, 18 Apr 2024 16:11:00 -0700
nifan@gmail.com wrote:

> From: Fan Ni 
>

Hi Fan,

Please expand CC list to include QAPI maintainers.
+CC Markus and Michael.

Also, for future versions +CC Michael Tsirkin.

I'm fine rolling these up as a series with the precursors, but
if it is already something Michael has seen it may speed things up.

Jonathan

p.s. Today I'm just building a tree, but will circle back around
later in the week with a final review of the last few changes.

 
> To simulate FM functionalities for initiating Dynamic Capacity Add
> (Opcode 5604h) and Dynamic Capacity Release (Opcode 5605h) as in CXL spec
> r3.1 7.6.7.6.5 and 7.6.7.6.6, we implemented two QMP interfaces to issue
> add/release dynamic capacity extents requests.
> 
> With the change, we allow an extent to be released only when its DPA range
> is contained by a single accepted extent in the device. That is to say,
> extent superset release is not supported yet.
> 
> 1. Add dynamic capacity extents:
> 
> For example, the command to add two contiguous extents (each 128MiB long)
> to region 0 (starting at DPA offset 0) looks like below:
> 
> { "execute": "qmp_capabilities" }
> 
> { "execute": "cxl-add-dynamic-capacity",
>   "arguments": {
>   "path": "/machine/peripheral/cxl-dcd0",
>   "hid": 0,
>   "selection-policy": 2,
>   "region-id": 0,
>   "tag": "",
>   "extents": [
>   {
>   "offset": 0,
>   "len": 134217728
>   },
>   {
>   "offset": 134217728,
>   "len": 134217728
>   }
>   ]
>   }
> }
> 
> 2. Release dynamic capacity extents:
> 
> For example, the command to release an extent of size 128MiB from region 0
> (DPA offset 128MiB) looks like below:
> 
> { "execute": "cxl-release-dynamic-capacity",
>   "arguments": {
>   "path": "/machine/peripheral/cxl-dcd0",
>   "hid": 0,
>   "flags": 1,
>   "region-id": 0,
>   "tag": "",
>   "extents": [
>   {
>   "offset": 134217728,
>   "len": 134217728
>   }
>   ]
>   }
> }
> 
> Signed-off-by: Fan Ni 
> ---
>  hw/cxl/cxl-mailbox-utils.c  |  62 +--
>  hw/mem/cxl_type3.c  | 311 +++-
>  hw/mem/cxl_type3_stubs.c|  20 +++
>  include/hw/cxl/cxl_device.h |  22 +++
>  include/hw/cxl/cxl_events.h |  18 +++
>  qapi/cxl.json   |  69 
>  6 files changed, 489 insertions(+), 13 deletions(-)
> 
> diff --git a/hw/cxl/cxl-mailbox-utils.c b/hw/cxl/cxl-mailbox-utils.c
> index 9d54e10cd4..3569902e9e 100644
> --- a/hw/cxl/cxl-mailbox-utils.c
> +++ b/hw/cxl/cxl-mailbox-utils.c
> @@ -1405,7 +1405,7 @@ static CXLRetCode cmd_dcd_get_dyn_cap_ext_list(const struct cxl_cmd *cmd,
>   * Check whether any bit between addr[nr, nr+size) is set,
>   * return true if any bit is set, otherwise return false
>   */
> -static bool test_any_bits_set(const unsigned long *addr, unsigned long nr,
> +bool test_any_bits_set(const unsigned long *addr, unsigned long nr,
>unsigned long size)
>  {
>  unsigned long res = find_next_bit(addr, size + nr, nr);
> @@ -1444,7 +1444,7 @@ CXLDCRegion *cxl_find_dc_region(CXLType3Dev *ct3d, 
> uint64_t dpa, uint64_t len)
>  return NULL;
>  }
>  
> -static void cxl_insert_extent_to_extent_list(CXLDCExtentList *list,
> +void cxl_insert_extent_to_extent_list(CXLDCExtentList *list,
>   uint64_t dpa,
>   uint64_t len,
>   uint8_t *tag,
> @@ -1470,6 +1470,44 @@ void cxl_remove_extent_from_extent_list(CXLDCExtentList *list,
>  g_free(extent);
>  }
>  
> +/*
> + * Add a new extent to the extent "group" if group exists;
> + * otherwise, create a new group
> + * Return value: return the group where the extent is inserted.
> + */
> +CXLDCExtentGroup *cxl_insert_extent_to_extent_group(CXLDCExtentGroup *group,
> +uint64_t dpa,
> +uint64_t len,
> +uint8_t *tag,
> +uint16_t shared_seq)
> +{
> +if (!group) {
> +group = g_new0(CXLDCExtentGroup, 1);
> +QTAILQ_INIT(&group->list);
> +}
> +cxl_insert_extent_to_extent_list(&group->list, dpa, len,
> + tag, shared_seq);
> +return group;
> +}
> +
> +void cxl_extent_group_list_insert_tail(CXLDCExtentGroupList *list,
> +   CXLDCExtentGroup *group)
> +{
> +QTAILQ_INSERT_TAIL(list, group, node);
> +}
> +
> +void cxl_extent_group_list_delete_front(CXLDCExtentGroupList *list)
> +{
> +CXLDCExtent *ent, *ent_next;
> +CXLDCExtentGroup *group = QTAILQ_FIRST(list);
> +
> +QTAILQ_REMOVE(list, group, node);
> +QTAILQ_FOREACH_SAFE(ent, &group->list, node, ent_next) {
> +
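
A minimal usage sketch of the extent-group helpers from the diff above
(types and function names from the patch; the pairing and ordering shown
here are illustrative): extents offered to the host are staged as a group,
queued at the tail of the pending list, and dequeued front-first once the
host has responded to the oldest offer.

CXLDCExtentGroup *group = NULL;
CXLDCExtentGroupList pending;

QTAILQ_INIT(&pending);
group = cxl_insert_extent_to_extent_group(group, dpa, len, tag, 0);
cxl_extent_group_list_insert_tail(&pending, group);
/* ... host accepts or rejects the offered extents ... */
cxl_extent_group_list_delete_front(&pending);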

Re: [PATCH v7 06/12] hw/mem/cxl_type3: Add host backend and address space handling for DC regions

2024-04-22 Thread Jonathan Cameron via
On Fri, 19 Apr 2024 13:27:59 -0400
Gregory Price  wrote:

> On Thu, Apr 18, 2024 at 04:10:57PM -0700, nifan@gmail.com wrote:
> > From: Fan Ni 
> > 
> > Add (file/memory backed) host backend for DCD. All the dynamic capacity
> > regions will share a single, large enough host backend. Set up address
> > space for DC regions to support read/write operations to dynamic capacity
> > for DCD.
> > 
> > With the change, the following support is added:
> > 1. Add a new property to type3 device "volatile-dc-memdev" to point to host
> >memory backend for dynamic capacity. Currently, all DC regions share one
> >host backend;
> > 2. Add namespace for dynamic capacity for read/write support;
> > 3. Create cdat entries for each dynamic capacity region.
> > 
> > Signed-off-by: Fan Ni 
> > ---
> >  hw/cxl/cxl-mailbox-utils.c  |  16 ++--
> >  hw/mem/cxl_type3.c  | 172 +---
> >  include/hw/cxl/cxl_device.h |   8 ++
> >  3 files changed, 160 insertions(+), 36 deletions(-)
> >   
> 
> A couple of general comments inline for discussion, but the patch looks good
> otherwise. Notes are mostly on improvements we could make that should
> not block this patch.
> 
> Reviewed-by: Gregory Price 
> 
> >  
> > diff --git a/hw/mem/cxl_type3.c b/hw/mem/cxl_type3.c
> > index a1fe268560..ac87398089 100644
> > --- a/hw/mem/cxl_type3.c
> > +++ b/hw/mem/cxl_type3.c
> > @@ -45,7 +45,8 @@ enum {
> >  
> >  static void ct3_build_cdat_entries_for_mr(CDATSubHeader **cdat_table,
> >int dsmad_handle, uint64_t size,
> > -  bool is_pmem, uint64_t dpa_base)
> > +  bool is_pmem, bool is_dynamic,
> > +  uint64_t dpa_base)  
> 
> We should probably change the is_* fields into a flags field and do some
> error checking on the combination of flags.
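
Something like this, perhaps (untested sketch; the flag names are invented
here, not taken from the series):

/* Hypothetical flags replacing the growing list of is_* booleans */
#define CT3_MR_FLAG_PMEM     BIT(0)
#define CT3_MR_FLAG_DYNAMIC  BIT(1)

static bool ct3_mr_flags_valid(unsigned int flags)
{
    /* pmem + dynamic capacity is not a supported combination today */
    return (flags & (CT3_MR_FLAG_PMEM | CT3_MR_FLAG_DYNAMIC)) !=
           (CT3_MR_FLAG_PMEM | CT3_MR_FLAG_DYNAMIC);
}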
> 
> >  {
> >  CDATDsmas *dsmas;
> >  CDATDslbis *dslbis0;
> > @@ -61,7 +62,8 @@ static void ct3_build_cdat_entries_for_mr(CDATSubHeader 
> > **cdat_table,
> >  .length = sizeof(*dsmas),
> >  },
> >  .DSMADhandle = dsmad_handle,
> > -.flags = is_pmem ? CDAT_DSMAS_FLAG_NV : 0,
> > +.flags = (is_pmem ? CDAT_DSMAS_FLAG_NV : 0) |
> > + (is_dynamic ? CDAT_DSMAS_FLAG_DYNAMIC_CAP : 0),  
> 
> For example, as noted elsewhere in the code, is_pmem+is_dynamic is not
> presently supported, so this shouldn't even be allowed in this function.
> 
> > +if (dc_mr) {
> > +int i;
> > +uint64_t region_base = vmr_size + pmr_size;
> > +
> > +/*
> > + * TODO: we assume the dynamic capacity to be volatile for now.
> > + * Non-volatile dynamic capacity will be added if needed in the
> > + * future.
> > + */  
> 
> Probably don't need to mark this TODO, can just leave it as a note.
> 
> Non-volatile dynamic capacity will coincide with shared memory, so it'll
> end up handled.  So this isn't really a TODO for this current work, and
> should read more like:
> 
> "Dynamic Capacity is always volatile, until shared memory is
> implemented"

I can sort of see your logic, but there is a difference between
volatile memory that is shared and persistent memory (typically whether
we need to care about deep flushes in some architectures), so I'd expect
volatile shared capacity to still be a thing, even if the host OS treats
it in most ways as persistent.

Also, persistent + DCD could be a thing without sharing sometime in the
future.

> 
> > +} else if (ct3d->hostpmem) {
> >  range1_size_hi = ct3d->hostpmem->size >> 32;
> >  range1_size_lo = (2 << 5) | (2 << 2) | 0x3 |
> >   (ct3d->hostpmem->size & 0xF000);
> > +} else {
> > +/*
> > + * For DCD with no static memory, set memory active, memory class 
> > bits.
> > + * No range is set.
> > + */
> > +range1_size_lo = (2 << 5) | (2 << 2) | 0x3;  
> 
> We should probably add defs for these fields at some point. Can be
> tabled for later work though.
Agreed - worth tidying up but not on the critical path.
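
(Strawman for whenever someone does tidy it - the register field names here
are guessed from the CXL 2.0 DVSEC range register layout, so verify against
the spec before use:)

#define CXL_DVSEC_RANGE_MEM_INFO_VALID  BIT(0)
#define CXL_DVSEC_RANGE_MEM_ACTIVE      BIT(1)
#define CXL_DVSEC_RANGE_MEDIA_TYPE(x)   ((x) << 2)
#define CXL_DVSEC_RANGE_MEM_CLASS(x)    ((x) << 5)

/* Equivalent to the magic (2 << 5) | (2 << 2) | 0x3 in the patch */
range1_size_lo = CXL_DVSEC_RANGE_MEM_CLASS(2) |
                 CXL_DVSEC_RANGE_MEDIA_TYPE(2) |
                 CXL_DVSEC_RANGE_MEM_ACTIVE |
                 CXL_DVSEC_RANGE_MEM_INFO_VALID;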

> 
> > +/*
> > + * TODO: set dc as volatile for now, non-volatile support can be 
> > added
> > + * in the future if needed.
> > + */
> > +memory_region_set_nonvolatile(dc_mr, false);  
> 
> Again can probably drop the TODO and just leave a statement.
> 
> ~Gregory




Re: [PATCH v7 06/12] hw/mem/cxl_type3: Add host backend and address space handling for DC regions

2024-04-22 Thread Jonathan Cameron via
On Thu, 18 Apr 2024 16:10:57 -0700
nifan@gmail.com wrote:

> From: Fan Ni 
> 
> Add (file/memory backed) host backend for DCD. All the dynamic capacity
> regions will share a single, large enough host backend. Set up address
> space for DC regions to support read/write operations to dynamic capacity
> for DCD.
> 
> With the change, the following support is added:
> 1. Add a new property to type3 device "volatile-dc-memdev" to point to host
>memory backend for dynamic capacity. Currently, all DC regions share one
>host backend;
> 2. Add namespace for dynamic capacity for read/write support;
> 3. Create cdat entries for each dynamic capacity region.
> 
> Signed-off-by: Fan Ni 
One fixlet needed inline.

I've set range1_size_hi = 0 there for my tree.

> @@ -301,10 +337,16 @@ static void build_dvsecs(CXLType3Dev *ct3d)
>  range2_size_lo = (2 << 5) | (2 << 2) | 0x3 |
>   (ct3d->hostpmem->size & 0xF000);
>  }
> -} else {
> +} else if (ct3d->hostpmem) {
>  range1_size_hi = ct3d->hostpmem->size >> 32;
>  range1_size_lo = (2 << 5) | (2 << 2) | 0x3 |
>   (ct3d->hostpmem->size & 0xF000);
> +} else {
> +/*
> + * For DCD with no static memory, set memory active, memory class 
> bits.
> + * No range is set.
> + */

range1_size_hi is not initialized.

> +range1_size_lo = (2 << 5) | (2 << 2) | 0x3;
>  }
>  
>  dvsec = (uint8_t *)&(CXLDVSECDevice){




Re: [PATCH 0/3] hw/cxl/cxl-cdat: Make cxl_doe_cdat_init() return boolean

2024-04-22 Thread Jonathan Cameron via
On Fri, 19 Apr 2024 17:40:07 +0200
Philippe Mathieu-Daudé  wrote:

> On 18/4/24 12:04, Zhao Liu wrote:
> > From: Zhao Liu   
> 
> 
> > ---
> > Zhao Liu (3):
> >hw/cxl/cxl-cdat: Make ct3_load_cdat() return boolean
> >hw/cxl/cxl-cdat: Make ct3_build_cdat() return boolean
> >hw/cxl/cxl-cdat: Make cxl_doe_cdat_init() return boolean  
> 
> Since Jonathan Ack'ed the series, I'm queuing it via my hw-misc tree.
> 

Thanks,

J



Re: [edk2-devel] [PATCH v3 5/6] target/arm: Do memory type alignment check when translation disabled

2024-04-19 Thread Jonathan Cameron via
On Fri, 19 Apr 2024 13:52:07 +0200
Gerd Hoffmann  wrote:

>   Hi,
> 
> > Gerd, any ideas?  Maybe I need something subtly different in my
> > edk2 build?  I've not looked at this bit of the qemu infrastructure
> > before - is there a document on how that image is built?  
> 
> There is roms/Makefile for that.
> 
> make -C roms help
> make -C roms efi
> 
> So easiest would be to just update the edk2 submodule to what you
> need, then rebuild.
> 
> The build is handled by the roms/edk2-build.py script,
> with the build configuration being in roms/edk2-build.config.
> That is usable outside the qemu source tree too, i.e. like this:
> 
>   python3 /path/to/qemu.git/roms/edk2-build.py \
> --config /path/to/qemu.git/roms/edk2-build.config \
> --core /path/to/edk2.git \
> --match armvirt \
> --silent --no-logs
> 
> That'll try to place the images build in "../pc-bios", so maybe better
> work with a copy of the config file where you adjust this.
> 
> HTH,
>   Gerd
> 

Thanks Gerd!

So the builds are very similar via the two methods...
However - the QEMU build sets -D CAVIUM_ERRATUM_27456=TRUE

And that's the difference - with that set for my other builds the alignment
problems go away...

Any idea why we have that set in roms/edk2-build.config?
Superficially it seems rather unlikely anyone still cares about ThunderX1
bugs now (if they do, we need to get them some new hardware with fresh bugs),
and this config file was only added last year.


However, the last comment in Ard's commit message below seems
highly likely to be relevant!

Chasing through Ard's patch, it has the side effect of dropping
an override of a requirement for strict alignment.
So without the errata
DEFINE GCC_AARCH64_CC_XIPFLAGS = -mstrict-align -mgeneral-regs-only
is replaced with
 [BuildOptions]
+!if $(CAVIUM_ERRATUM_27456) == TRUE^M
+  GCC:*_*_AARCH64_PP_FLAGS = -DCAVIUM_ERRATUM_27456^M
+!else^M
   GCC:*_*_AARCH64_CC_XIPFLAGS ==
+!endif^M

The edk2 commit that added this was the following +CC Ard.

Given I wasn't sure of the syntax of that file I set it
manually to the original value and indeed it works.


commit ec54ce1f1ab41b92782b37ae59e752fff0ef9c41
Author: Ard Biesheuvel 
Date:   Wed Jan 4 16:51:35 2023 +0100

ArmVirtPkg/ArmVirtQemu: Avoid early ID map on ThunderX

The early ID map used by ArmVirtQemu uses ASID scoped non-global
mappings, as this allows us to switch to the permanent ID map seamlessly
without the need for explicit TLB maintenance.

However, this triggers a known erratum on ThunderX, which does not
tolerate non-global mappings that are executable at EL1, as this appears
to result in I-cache corruption. (Linux disables the KPTI based Meltdown
mitigation on ThunderX for the same reason)

So work around this, by detecting the CPU implementor and part number,
and proceeding without the early ID map if a ThunderX CPU is detected.

Note that this requires the C code to be built with strict alignment
again, as we may end up executing it with the MMU and caches off.

Signed-off-by: Ard Biesheuvel 
Acked-by: Laszlo Ersek 
Tested-by: dann frazier 

Test case is
qemu-system-aarch64 -M virt,virtualization=true, -m 4g -cpu cortex-a76 \
-bios QEMU_EFI.fd -d int

Which gets alignment faults since:
https://lore.kernel.org/all/20240301204110.656742-6-richard.hender...@linaro.org/

So my feeling here is EDK2 should either have yet another config for QEMU as a
host, or should always set the alignment without needing to pick the CAVIUM
27456 erratum, which I suspect will get dropped soonish anyway if anyone ever
cleans up old errata.
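
i.e. something like this in the dsc (untested; it just restores the
pre-erratum override unconditionally, using the flags quoted above):

[BuildOptions]
  GCC:*_*_AARCH64_CC_XIPFLAGS = -mstrict-align -mgeneral-regs-only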

Jonathan






Re: [PATCH v11 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info

2024-04-19 Thread Jonathan Cameron via
On Fri,  5 Apr 2024 00:07:06 +
"Ho-Ren (Jack) Chuang"  wrote:

> The current implementation treats emulated memory devices, such as
> CXL1.1 type3 memory, as normal DRAM when they are emulated as normal memory
> (E820_TYPE_RAM). However, these emulated devices have different
> characteristics than traditional DRAM, making it important to
> distinguish them. Thus, we modify the tiered memory initialization process
> to introduce a delay specifically for CPUless NUMA nodes. This delay
> ensures that the memory tier initialization for these nodes is deferred
> until HMAT information is obtained during the boot process. Finally,
> demotion tables are recalculated at the end.
> 
> * late_initcall(memory_tier_late_init);
> Some device drivers may have initialized memory tiers between
> `memory_tier_init()` and `memory_tier_late_init()`, potentially bringing
> online memory nodes and configuring memory tiers. They should be excluded
> in the late init.
> 
> * Handle cases where there is no HMAT when creating memory tiers
> There is a scenario where a CPUless node does not provide HMAT information.
> If no HMAT is specified, it falls back to using the default DRAM tier.
> 
> * Introduce another new lock `default_dram_perf_lock` for adist calculation
> In the current implementation, iterating through CPUlist nodes requires
> holding the `memory_tier_lock`. However, `mt_calc_adistance()` will end up
> trying to acquire the same lock, leading to a potential deadlock.
> Therefore, we propose introducing a standalone `default_dram_perf_lock` to
> protect `default_dram_perf_*`. This approach not only avoids deadlock
> but also prevents holding a large lock simultaneously.
> 
> * Upgrade `set_node_memory_tier` to support additional cases, including
>   default DRAM, late CPUless, and hot-plugged initializations.
> To cover hot-plugged memory nodes, `mt_calc_adistance()` and
> `mt_find_alloc_memory_type()` are moved into `set_node_memory_tier()` to
> handle cases where memtype is not initialized and where HMAT information is
> available.
> 
> * Introduce `default_memory_types` for those memory types that are not
>   initialized by device drivers.
> Because late initialized memory and default DRAM memory need to be managed,
> a default memory type is created for storing all memory types that are
> not initialized by device drivers and as a fallback.
> 
> Signed-off-by: Ho-Ren (Jack) Chuang 
> Signed-off-by: Hao Xiang 
> Reviewed-by: "Huang, Ying" 
Reviewed-by: Jonathan Cameron 



Re: [PATCH v3 5/6] target/arm: Do memory type alignment check when translation disabled

2024-04-18 Thread Jonathan Cameron via
On Thu, 18 Apr 2024 09:15:55 +0100
Jonathan Cameron via  wrote:

> On Wed, 17 Apr 2024 13:07:35 -0700
> Richard Henderson  wrote:
> 
> > On 4/16/24 08:11, Jonathan Cameron wrote:  
> > > On Fri,  1 Mar 2024 10:41:09 -1000
> > > Richard Henderson  wrote:
> > > 
> > >> If translation is disabled, the default memory type is Device, which
> > >> requires alignment checking.  This is more optimally done early via
> > >> the MemOp given to the TCG memory operation.
> > >>
> > >> Reviewed-by: Philippe Mathieu-Daudé 
> > >> Reported-by: Idan Horowitz 
> > >> Resolves: https://gitlab.com/qemu-project/qemu/-/issues/1204
> > >> Signed-off-by: Richard Henderson 
> > > 
> > > Hi Richard.
> > > 
> > > I noticed some tests I was running stopped booting with master.
> > > (it's a fun and complex stack of QEMU + kvm on QEMU for vCPU Hotplug 
> > > kernel work,
> > > but this is the host booting)
> > > 
> > > EDK2 build from upstream as of somepoint last week.
> > > 
> > > Bisects to this patch.
> > > 
> > >   qemu-system-aarch64 -M virt,gic-version=3,virtualization=true -m 
> > > 4g,maxmem=8G,slots=8 -cpu cortex-a76 -smp 
> > > cpus=4,threads=2,clusters=2,sockets=1 \
> > >   -kernel Image \
> > >   -drive if=none,file=full.qcow2,format=qcow2,id=hd \
> > >   -device ioh3420,id=root_port1 -device virtio-blk-pci,drive=hd \
> > >   -netdev user,id=mynet,hostfwd=tcp::-:22 -device 
> > > virtio-net-pci,netdev=mynet,id=bob \
> > >   -nographic -no-reboot -append 'earlycon root=/dev/vda2 fsck.mode=skip 
> > > tp_printk' \
> > >   -monitor telnet:127.0.0.1:1235,server,nowait -bios QEMU_EFI.fd \
> > >   -object memory-backend-ram,size=4G,id=mem0 \
> > >   -numa node,nodeid=0,cpus=0-3,memdev=mem0
> > > 
> > > Symptoms: Nothing on console from edk2 which is built in debug mode so is 
> > > normally very noisy.
> > >No sign of anything much happening at all :(
> > 
> > This isn't a fantastic bug report.
> > 
> > (1) If it doesn't boot efi, then none of the -kernel parameters are 
> > necessary.
> > (2) I'd be surprised if the full.qcow2 drive parameters are necessary 
> > either.
> >  But if they are, what contents?  Is a new empty drive sufficient, just
> >  enough to send the bios through the correct device initialization?
> > (3) edk2 build from ...
> >  Well, this is partly edk2's fault, as the build documentation is awful.
> >  I spent an entire afternoon trying to figure it out and gave up.
> > 
> > I will say that the edk2 shipped with qemu does work, so... are you 
> > absolutely
> > certain that it isn't a bug in edk2 since then?  Firmware bugs are exactly 
> > what
> > that patch is supposed to expose, as requested by issue #1204.
> > 
> > I'd say you should boot with "-d int" and see what kind of interrupts 
> > you're getting very 
> > early on.  I suspect that you'll see data aborts with ESR xx/yy where the 
> > last 6 bits of 
> > yy are 0x21 (alignment fault).  
> 
> Hi Richard,
> 
> Sorry for the lack of detail - I was aware it wasn't great and should have
> stated that I planned to come back with more details when I had time to
> debug. Snowed under, so for now I've just dropped back to 8.2 and will get
> back to this perhaps next week.

+CC EDK2 list and Gerd.

Still not a thorough report but some breadcrumbs.

It may be something about my local build setup, as the shipped EDK2 succeeds,
but the one I'm building via
uefi-tools/edk2-build.sh armvirtqemu64
(some aged instructions that still more or less work:
https://people.kernel.org/jic23/)
indeed starts out with some alignment faults.

Gerd, any ideas?  Maybe I need something subtly different in my
edk2 build?  I've not looked at this bit of the qemu infrastructure
before - is there a document on how that image is built?
As Richard observed, EDK2 isn't the simplest thing to build - I've
been using uefitools for this for a long time, so maybe I missed some
new requirement?

Build machine is x86_64 ubuntu, gcc 12.2.0.

I need to build it because of some necessary tweaks to debug a
PCI enumeration issue in Linux. (these tests were without those
tweaks)

As Richard observed, most of the command line isn't needed.

qemu-system-aarch64 -M virt,virtualization=true, -m 4g -cpu cortex-a76 \
-bios QEMU_EFI.fd -d int

Jonathan

 


> 
> Jonathan
> 
> > 
> > 
> > r~  
> 
> 




Re: [PATCH 0/3] hw/cxl/cxl-cdat: Make cxl_doe_cdat_init() return boolean

2024-04-18 Thread Jonathan Cameron via
On Thu, 18 Apr 2024 14:06:39 +0200
Philippe Mathieu-Daudé  wrote:

> On 18/4/24 12:04, Zhao Liu wrote:
> > From: Zhao Liu   
> 
> 
> > ---
> > Zhao Liu (3):
> >hw/cxl/cxl-cdat: Make ct3_load_cdat() return boolean
> >hw/cxl/cxl-cdat: Make ct3_build_cdat() return boolean
> >hw/cxl/cxl-cdat: Make cxl_doe_cdat_init() return boolean  
> 
> Series:
> Reviewed-by: Philippe Mathieu-Daudé 
> 

Acked-by: Jonathan Cameron 



Re: [PATCH v3 5/6] target/arm: Do memory type alignment check when translation disabled

2024-04-18 Thread Jonathan Cameron via
On Wed, 17 Apr 2024 13:07:35 -0700
Richard Henderson  wrote:

> On 4/16/24 08:11, Jonathan Cameron wrote:
> > On Fri,  1 Mar 2024 10:41:09 -1000
> > Richard Henderson  wrote:
> >   
> >> If translation is disabled, the default memory type is Device, which
> >> requires alignment checking.  This is more optimally done early via
> >> the MemOp given to the TCG memory operation.
> >>
> >> Reviewed-by: Philippe Mathieu-Daudé 
> >> Reported-by: Idan Horowitz 
> >> Resolves: https://gitlab.com/qemu-project/qemu/-/issues/1204
> >> Signed-off-by: Richard Henderson   
> > 
> > Hi Richard.
> > 
> > I noticed some tests I was running stopped booting with master.
> > (it's a fun and complex stack of QEMU + kvm on QEMU for vCPU Hotplug kernel 
> > work,
> > but this is the host booting)
> > 
> > EDK2 build from upstream as of somepoint last week.
> > 
> > Bisects to this patch.
> > 
> >   qemu-system-aarch64 -M virt,gic-version=3,virtualization=true -m 
> > 4g,maxmem=8G,slots=8 -cpu cortex-a76 -smp 
> > cpus=4,threads=2,clusters=2,sockets=1 \
> >   -kernel Image \
> >   -drive if=none,file=full.qcow2,format=qcow2,id=hd \
> >   -device ioh3420,id=root_port1 -device virtio-blk-pci,drive=hd \
> >   -netdev user,id=mynet,hostfwd=tcp::-:22 -device 
> > virtio-net-pci,netdev=mynet,id=bob \
> >   -nographic -no-reboot -append 'earlycon root=/dev/vda2 fsck.mode=skip 
> > tp_printk' \
> >   -monitor telnet:127.0.0.1:1235,server,nowait -bios QEMU_EFI.fd \
> >   -object memory-backend-ram,size=4G,id=mem0 \
> >   -numa node,nodeid=0,cpus=0-3,memdev=mem0
> > 
> > Symptoms: Nothing on console from edk2 which is built in debug mode so is 
> > normally very noisy.
> >No sign of anything much happening at all :(  
> 
> This isn't a fantastic bug report.
> 
> (1) If it doesn't boot efi, then none of the -kernel parameters are necessary.
> (2) I'd be surprised if the full.qcow2 drive parameters are necessary either.
>  But if they are, what contents?  Is a new empty drive sufficient, just
>  enough to send the bios through the correct device initialization?
> (3) edk2 build from ...
>  Well, this is partly edk2's fault, as the build documentation is awful.
>  I spent an entire afternoon trying to figure it out and gave up.
> 
> I will say that the edk2 shipped with qemu does work, so... are you absolutely
> certain that it isn't a bug in edk2 since then?  Firmware bugs are exactly 
> what
> that patch is supposed to expose, as requested by issue #1204.
> 
> I'd say you should boot with "-d int" and see what kind of interrupts you're 
> getting very 
> early on.  I suspect that you'll see data aborts with ESR xx/yy where the 
> last 6 bits of 
> yy are 0x21 (alignment fault).

Hi Richard,

Sorry for the lack of detail - I was aware it wasn't great and should have
stated that I planned to come back with more details when I had time to debug.
Snowed under, so for now I've just dropped back to 8.2 and will get back to
this perhaps next week.

Jonathan

> 
> 
> r~




Re: [PATCH v6 10/12] hw/mem/cxl_type3: Add dpa range validation for accesses to DC regions

2024-04-17 Thread Jonathan Cameron via
On Tue, 16 Apr 2024 09:37:09 -0700
fan  wrote:

> On Tue, Apr 16, 2024 at 04:00:56PM +0100, Jonathan Cameron wrote:
> > On Mon, 15 Apr 2024 10:37:00 -0700
> > fan  wrote:
> >   
> > > On Fri, Apr 12, 2024 at 06:54:42PM -0400, Gregory Price wrote:  
> > > > On Mon, Mar 25, 2024 at 12:02:28PM -0700, nifan@gmail.com wrote:
> > > > > From: Fan Ni 
> > > > > 
> > > > > All dpa ranges in the DC regions are invalid to access until an extent
> > > > > covering the range has been added. Add a bitmap for each region to
> > > > > record whether a DC block in the region has been backed by DC extent.
> > > > > For the bitmap, a bit in the bitmap represents a DC block. When a DC
> > > > > extent is added, all the bits of the blocks in the extent will be set,
> > > > > which will be cleared when the extent is released.
> > > > > 
> > > > > Reviewed-by: Jonathan Cameron 
> > > > > Signed-off-by: Fan Ni 
> > > > > ---
> > > > >  hw/cxl/cxl-mailbox-utils.c  |  6 +++
> > > > >  hw/mem/cxl_type3.c  | 76 
> > > > > +
> > > > >  include/hw/cxl/cxl_device.h |  7 
> > > > >  3 files changed, 89 insertions(+)
> > > > > 
> > > > > diff --git a/hw/cxl/cxl-mailbox-utils.c b/hw/cxl/cxl-mailbox-utils.c
> > > > > index 7094e007b9..a0d2239176 100644
> > > > > --- a/hw/cxl/cxl-mailbox-utils.c
> > > > > +++ b/hw/cxl/cxl-mailbox-utils.c
> > > > > @@ -1620,6 +1620,7 @@ static CXLRetCode cmd_dcd_add_dyn_cap_rsp(const 
> > > > > struct cxl_cmd *cmd,
> > > > >  
> > > > >  cxl_insert_extent_to_extent_list(extent_list, dpa, len, NULL, 0);
> > > > >  ct3d->dc.total_extent_count += 1;
> > > > > +ct3_set_region_block_backed(ct3d, dpa, len);
> > > > >  
> > > > >  ent = QTAILQ_FIRST(&ct3d->dc.extents_pending);
> > > > >  cxl_remove_extent_from_extent_list(&ct3d->dc.extents_pending, ent);
> > > > >  
> > > > 
> > > > while looking at the MHD code, we had decided to "reserve" the blocks in
> > > > the bitmap in the call to `qmp_cxl_process_dynamic_capacity` in order to
> > > > prevent a potential double-allocation (basically we need to sanity check
> > > > that two hosts aren't reserving the region PRIOR to the host being
> > > > notified).
> > > > 
> > > > I did not see any checks in the `qmp_cxl_process_dynamic_capacity` path
> > > > to prevent pending extents from being double-allocated.  Is this an
> > > > explicit choice?
> > > > 
> > > > I can see, for example, why you may want to allow the following in the
> > > > pending list: [Add X, Remove X, Add X].  I just want to know if this is
> > > > intentional or not. If not, you may consider adding a pending check
> > > > during the sanity check phase of `qmp_cxl_process_dynamic_capacity`
> > > > 
> > > > ~Gregory
> > > 
> > > First, for a remove request, the pending list is not involved. See CXL
> > > r3.1, 9.13.3.3. Pending basically means "pending to add".
> > > So for the above example, in the pending list you can see [Add X, Add X]
> > > if the event is not processed in time.
> > > Second, from the spec, I cannot find any text saying we cannot issue
> > > another add of extent X while it is still pending.
> > 
> > I think there is text saying that the capacity is not released for reuse
> > by the device until it receives a response from the host.   Whilst
> > it's not explicit on offers to the same host, I'm not sure that matters.
> > So I don't think it is supposed to queue multiple extents...
> 
> Are you suggesting we add a check here to reject the second add when the
> first one is still pending?

Yes.  The capacity has not been returned to the device to reissue.
On an MH-MLD/SLD we'd need to prevent it being added (not shared) to multiple
hosts; this is kind of the temporal equivalent of that.
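
Something like this in the QMP sanity-check path is what I have in mind
(untested sketch; helper and field names assumed from elsewhere in this
series, with new_range holding the incoming extent):

    CXLDCExtent *ent;
    Range pending;

    /* Reject an add whose range overlaps an extent still pending with the host */
    QTAILQ_FOREACH(ent, &ct3d->dc.extents_pending, node) {
        range_init_nofail(&pending, ent->start_dpa, ent->len);
        if (range_overlaps_range(&pending, &new_range)) {
            error_setg(errp, "extent overlaps a pending add request");
            return;
        }
    }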

> 
> Currently, we do not allow releasing an extent while it is still pending,
> which aligns with the "not released for reuse" case you mentioned above,
> I think.
> Can the second add mean a retry instead of reuse?
No - or at least the device should not be doing that.  The FM might try
again.

Re: [PATCH v6 09/12] hw/cxl/events: Add qmp interfaces to add/release dynamic capacity extents

2024-04-17 Thread Jonathan Cameron via


> > >  
> > >  ret = cxl_detect_malformed_extent_list(ct3d, in);
> > >  if (ret != CXL_MBOX_SUCCESS) {
> > > +cxl_extent_group_list_delete_front(&ct3d->dc.extents_pending);
> > 
> > If it's a bad message from the host, I don't think the device is supposed to
> > do anything with pending extents.  
> 
> This is not clear to me.
> 
> In the spec r3.1 8.2.9.9.9.3, Add Dynamic Capacity Response (Opcode 4802h),
> there is text like "After this command is received, the device is free to
> reclaim capacity that the host does not utilize.", that seems to imply
> as long as the response is received, we need to update the pending list
> so the capacity unused can be reclaimed. But of course, we can say if
> there is error, we cannot tell whether the host accepts the extents or
> not so not update the pending list.

I can try to get a clarification, as I agree "is received" is unclear,
but in general any command that gets an error response should have no
effect on device state. If it does, then what effect it has must be stated
in the specification.

> >   
> > > +
> > > +#define REMOVAL_POLICY_MASK 0xf
> > > +#define FORCED_REMOVAL_BIT BIT(4)
> > > +
> > > +void qmp_cxl_release_dynamic_capacity(const char *path, uint16_t hid,
> > > +  uint8_t flags, uint8_t region_id,
> > > +  const char *tag,
> > > +  CXLDCExtentRecordList  *records,
> > > +  Error **errp)
> > > +{
> > > +CXLDCEventType type = DC_EVENT_RELEASE_CAPACITY;
> > > +
> > > +if (flags & FORCED_REMOVAL_BIT) {
> > > +/* TODO: enable forced removal in the future */
> > > +type = DC_EVENT_FORCED_RELEASE_CAPACITY;
> > > +error_setg(errp, "Forced removal not supported yet");
> > > +return;
> > > +}
> > > +
> > > +switch (flags & REMOVAL_POLICY_MASK) {
> > > +case 1:  
> > Would probably benefit from a suitable define.
> >   
> > > +qmp_cxl_process_dynamic_capacity_prescriptive(path, hid, type,
> > > +  region_id, records, errp);
> > > +break;  
> > 
> > I'd not noticed before but might as well return from these case blocks.  
> 
> Sorry. I do not follow here. What do you mean by "return from these case
> blocks", are you referring the check above about the forced removal case?

No, what I meant was much simpler - just a code refactoring thing.
case 1:
qmp_cxl_process_dynamic_capacity_prescriptive(path, hid, type,
  region_id, records, errp);

//break;
return;
> 
> Fan
> 
> >   
> > > +default:
> > > +error_setg(errp, "Removal policy not supported");
> > > +break;
   return;
> > > +}
> > > +}  
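
(For reference, the suitable define mentioned above might just be something
like this - encodings per CXL r3.1 Table 7-71, worth double-checking:)

/* Removal policy, flags bits [3:0] of Release Dynamic Capacity */
enum CXLDCRemovalPolicy {
    CXL_DC_REMOVAL_POLICY_TAG_BASED    = 0,
    CXL_DC_REMOVAL_POLICY_PRESCRIPTIVE = 1,
};

so the switch can read "case CXL_DC_REMOVAL_POLICY_PRESCRIPTIVE:" rather
than a bare "case 1:".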




Re: [PATCH v3 5/6] target/arm: Do memory type alignment check when translation disabled

2024-04-16 Thread Jonathan Cameron via
On Fri,  1 Mar 2024 10:41:09 -1000
Richard Henderson  wrote:

> If translation is disabled, the default memory type is Device, which
> requires alignment checking.  This is more optimally done early via
> the MemOp given to the TCG memory operation.
> 
> Reviewed-by: Philippe Mathieu-Daudé 
> Reported-by: Idan Horowitz 
> Resolves: https://gitlab.com/qemu-project/qemu/-/issues/1204
> Signed-off-by: Richard Henderson 

Hi Richard.

I noticed some tests I was running stopped booting with master.
(it's a fun and complex stack of QEMU + kvm on QEMU for vCPU Hotplug kernel 
work,
but this is the host booting)

EDK2 build from upstream as of somepoint last week.

Bisects to this patch.

 qemu-system-aarch64 -M virt,gic-version=3,virtualization=true \
 -m 4g,maxmem=8G,slots=8 -cpu cortex-a76 -smp cpus=4,threads=2,clusters=2,sockets=1 \
 -kernel Image \
 -drive if=none,file=full.qcow2,format=qcow2,id=hd \
 -device ioh3420,id=root_port1 -device virtio-blk-pci,drive=hd \
 -netdev user,id=mynet,hostfwd=tcp::-:22 -device virtio-net-pci,netdev=mynet,id=bob \
 -nographic -no-reboot -append 'earlycon root=/dev/vda2 fsck.mode=skip tp_printk' \
 -monitor telnet:127.0.0.1:1235,server,nowait -bios QEMU_EFI.fd \
 -object memory-backend-ram,size=4G,id=mem0 \
 -numa node,nodeid=0,cpus=0-3,memdev=mem0

Symptoms: Nothing on the console from edk2, which is built in debug mode so is
normally very noisy.  No sign of anything much happening at all :(

Jonathan



> ---
>  target/arm/tcg/hflags.c | 34 --
>  1 file changed, 32 insertions(+), 2 deletions(-)
> 
> diff --git a/target/arm/tcg/hflags.c b/target/arm/tcg/hflags.c
> index 8e5d35d922..5da1b0fc1d 100644
> --- a/target/arm/tcg/hflags.c
> +++ b/target/arm/tcg/hflags.c
> @@ -26,6 +26,35 @@ static inline bool fgt_svc(CPUARMState *env, int el)
>  FIELD_EX64(env->cp15.fgt_exec[FGTREG_HFGITR], HFGITR_EL2, SVC_EL1);
>  }
>  
> +/* Return true if memory alignment should be enforced. */
> +static bool aprofile_require_alignment(CPUARMState *env, int el, uint64_t sctlr)
> +{
> +#ifdef CONFIG_USER_ONLY
> +return false;
> +#else
> +/* Check the alignment enable bit. */
> +if (sctlr & SCTLR_A) {
> +return true;
> +}
> +
> +/*
> + * If translation is disabled, then the default memory type is
> + * Device(-nGnRnE) instead of Normal, which requires that alignment
> + * be enforced.  Since this affects all ram, it is most efficient
> + * to handle this during translation.
> + */
> +if (sctlr & SCTLR_M) {
> +/* Translation enabled: memory type in PTE via MAIR_ELx. */
> +return false;
> +}
> +if (el < 2 && (arm_hcr_el2_eff(env) & (HCR_DC | HCR_VM))) {
> +/* Stage 2 translation enabled: memory type in PTE. */
> +return false;
> +}
> +return true;
> +#endif
> +}
> +
>  static CPUARMTBFlags rebuild_hflags_common(CPUARMState *env, int fp_el,
> ARMMMUIdx mmu_idx,
> CPUARMTBFlags flags)
> @@ -121,8 +150,9 @@ static CPUARMTBFlags rebuild_hflags_a32(CPUARMState *env, int fp_el,
>  {
>  CPUARMTBFlags flags = {};
>  int el = arm_current_el(env);
> +uint64_t sctlr = arm_sctlr(env, el);
>  
> -if (arm_sctlr(env, el) & SCTLR_A) {
> +if (aprofile_require_alignment(env, el, sctlr)) {
>  DP_TBFLAG_ANY(flags, ALIGN_MEM, 1);
>  }
>  
> @@ -223,7 +253,7 @@ static CPUARMTBFlags rebuild_hflags_a64(CPUARMState *env, int el, int fp_el,
>  
>  sctlr = regime_sctlr(env, stage1);
>  
> -if (sctlr & SCTLR_A) {
> +if (aprofile_require_alignment(env, el, sctlr)) {
>  DP_TBFLAG_ANY(flags, ALIGN_MEM, 1);
>  }
>  




Re: [PATCH v6 10/12] hw/mem/cxl_type3: Add dpa range validation for accesses to DC regions

2024-04-16 Thread Jonathan Cameron via
On Mon, 15 Apr 2024 10:37:00 -0700
fan  wrote:

> On Fri, Apr 12, 2024 at 06:54:42PM -0400, Gregory Price wrote:
> > On Mon, Mar 25, 2024 at 12:02:28PM -0700, nifan@gmail.com wrote:  
> > > From: Fan Ni 
> > > 
> > > All dpa ranges in the DC regions are invalid to access until an extent
> > > covering the range has been added. Add a bitmap for each region to
> > > record whether a DC block in the region has been backed by DC extent.
> > > For the bitmap, a bit in the bitmap represents a DC block. When a DC
> > > extent is added, all the bits of the blocks in the extent will be set,
> > > which will be cleared when the extent is released.
> > > 
> > > Reviewed-by: Jonathan Cameron 
> > > Signed-off-by: Fan Ni 
> > > ---
> > >  hw/cxl/cxl-mailbox-utils.c  |  6 +++
> > >  hw/mem/cxl_type3.c  | 76 +
> > >  include/hw/cxl/cxl_device.h |  7 
> > >  3 files changed, 89 insertions(+)
> > > 
> > > diff --git a/hw/cxl/cxl-mailbox-utils.c b/hw/cxl/cxl-mailbox-utils.c
> > > index 7094e007b9..a0d2239176 100644
> > > --- a/hw/cxl/cxl-mailbox-utils.c
> > > +++ b/hw/cxl/cxl-mailbox-utils.c
> > > @@ -1620,6 +1620,7 @@ static CXLRetCode cmd_dcd_add_dyn_cap_rsp(const 
> > > struct cxl_cmd *cmd,
> > >  
> > >  cxl_insert_extent_to_extent_list(extent_list, dpa, len, NULL, 0);
> > >  ct3d->dc.total_extent_count += 1;
> > > +ct3_set_region_block_backed(ct3d, dpa, len);
> > >  
> > >  ent = QTAILQ_FIRST(&ct3d->dc.extents_pending);
> > >  cxl_remove_extent_from_extent_list(&ct3d->dc.extents_pending, ent);
> > 
> > while looking at the MHD code, we had decided to "reserve" the blocks in
> > the bitmap in the call to `qmp_cxl_process_dynamic_capacity` in order to
> > prevent a potential double-allocation (basically we need to sanity check
> > that two hosts aren't reserving the region PRIOR to the host being
> > notified).
> > 
> > I did not see any checks in the `qmp_cxl_process_dynamic_capacity` path
> > to prevent pending extents from being double-allocated.  Is this an
> > explicit choice?
> > 
> > I can see, for example, why you may want to allow the following in the
> > pending list: [Add X, Remove X, Add X].  I just want to know if this is
> > intentional or not. If not, you may consider adding a pending check
> > during the sanity check phase of `qmp_cxl_process_dynamic_capacity`
> > 
> > ~Gregory  
> 
> First, for a remove request, the pending list is not involved. See CXL r3.1,
> 9.13.3.3. Pending basically means "pending to add".
> So for the above example, in the pending list you can see [Add X, Add X] if
> the event is not processed in time.
> Second, from the spec, I cannot find any text saying we cannot issue
> another add of extent X while it is still pending.

I think there is text saying that the capacity is not released for reuse
by the device until it receives a response from the host.   Whilst
it's not explicit on offers to the same host, I'm not sure that matters.
So I don't think it is supposed to queue multiple extents...


> From the kernel side, if the first one is accepted, the second one will
> get rejected, and there is no issue there.
> If the first is rejected for some reason, the second one can get
> accepted or rejected and do not need to worry about the first one.
> 
> 
> Fan
> 




Re: [PATCH v6 09/12] hw/cxl/events: Add qmp interfaces to add/release dynamic capacity extents

2024-04-16 Thread Jonathan Cameron via
On Mon, 15 Apr 2024 13:06:04 -0700
fan  wrote:

> From ce75be83e915fbc4dd6e489f976665b81174002b Mon Sep 17 00:00:00 2001
> From: Fan Ni 
> Date: Tue, 20 Feb 2024 09:48:31 -0800
> Subject: [PATCH 09/13] hw/cxl/events: Add qmp interfaces to add/release
>  dynamic capacity extents
> 
> To simulate FM functionalities for initiating Dynamic Capacity Add
> (Opcode 5604h) and Dynamic Capacity Release (Opcode 5605h) as in CXL spec
> r3.1 7.6.7.6.5 and 7.6.7.6.6, we implemented two QMP interfaces to issue
> add/release dynamic capacity extents requests.
> 
> With the change, we allow an extent to be released only when its DPA range
> is contained by a single accepted extent in the device. That is to say,
> extent superset release is not supported yet.
> 
> 1. Add dynamic capacity extents:
> 
> For example, the command to add two continuous extents (each 128MiB long)
> to region 0 (starting at DPA offset 0) looks like below:
> 
> { "execute": "qmp_capabilities" }
> 
> { "execute": "cxl-add-dynamic-capacity",
>   "arguments": {
>   "path": "/machine/peripheral/cxl-dcd0",
>   "hid": 0,
>   "selection-policy": 2,
>   "region-id": 0,
>   "tag": "",
>   "extents": [
>   {
>   "offset": 0,
>   "len": 134217728
>   },
>   {
>   "offset": 134217728,
>   "len": 134217728
>   }
>   ]
>   }
> }
> 
> 2. Release dynamic capacity extents:
> 
> For example, the command to release an extent of size 128MiB from region 0
> (DPA offset 128MiB) looks like below:
> 
> { "execute": "cxl-release-dynamic-capacity",
>   "arguments": {
>   "path": "/machine/peripheral/cxl-dcd0",
>   "hid": 0,
>   "flags": 1,
>   "region-id": 0,
>   "tag": "",
>   "extents": [
>   {
>   "offset": 134217728,
>   "len": 134217728
>   }
>   ]
>   }
> }
> 
> Signed-off-by: Fan Ni 

Nice!  A few small comments inline - particularly don't be nice to the
kernel by blocking things it doesn't understand yet ;)

Jonathan

> ---
>  hw/cxl/cxl-mailbox-utils.c  |  65 ++--
>  hw/mem/cxl_type3.c  | 310 +++-
>  hw/mem/cxl_type3_stubs.c|  20 +++
>  include/hw/cxl/cxl_device.h |  22 +++
>  include/hw/cxl/cxl_events.h |  18 +++
>  qapi/cxl.json   |  69 
>  6 files changed, 491 insertions(+), 13 deletions(-)
> 
> diff --git a/hw/cxl/cxl-mailbox-utils.c b/hw/cxl/cxl-mailbox-utils.c
> index cd9092b6bf..839ae836a1 100644
> --- a/hw/cxl/cxl-mailbox-utils.c
> +++ b/hw/cxl/cxl-mailbox-utils.c

>  /*
>   * CXL r3.1 Table 8-168: Add Dynamic Capacity Response Input Payload
>   * CXL r3.1 Table 8-170: Release Dynamic Capacity Input Payload
> @@ -1541,6 +1579,7 @@ static CXLRetCode cxl_dcd_add_dyn_cap_rsp_dry_run(CXLType3Dev *ct3d,
>  {
>  uint32_t i;
>  CXLDCExtent *ent;
> +CXLDCExtentGroup *ext_group;
>  uint64_t dpa, len;
>  Range range1, range2;
>  
> @@ -1551,9 +1590,13 @@ static CXLRetCode cxl_dcd_add_dyn_cap_rsp_dry_run(CXLType3Dev *ct3d,
>  range_init_nofail(&range1, dpa, len);
>  
>  /*
> - * TODO: once the pending extent list is added, check against
> - * the list will be added here.
> + * The host-accepted DPA range must be contained by the first extent
> + * group in the pending list
>   */
> +ext_group = QTAILQ_FIRST(&ct3d->dc.extents_pending);
> +if (!cxl_extents_contains_dpa_range(&ext_group->list, dpa, len)) {
> +return CXL_MBOX_INVALID_PA;
> +}
>  
>  /* to-be-added range should not overlap with range already accepted */
>  QTAILQ_FOREACH(ent, &ct3d->dc.extents, node) {
> @@ -1588,26 +1631,26 @@ static CXLRetCode cmd_dcd_add_dyn_cap_rsp(const struct cxl_cmd *cmd,
>  CXLRetCode ret;
>  
>  if (in->num_entries_updated == 0) {
> -/*
> - * TODO: once the pending list is introduced, extents in the 
> beginning
> - * will get wiped out.
> - */
> +cxl_extent_group_list_delete_front(&ct3d->dc.extents_pending);
>  return CXL_MBOX_SUCCESS;
>  }
>  
>  /* Adding these extents would exceed the device's extent tracking ability. */
>  if (in->num_entries_updated + ct3d->dc.total_extent_count >
>  CXL_NUM_EXTENTS_SUPPORTED) {
> +cxl_extent_group_list_delete_front(&ct3d->dc.extents_pending);
>  return CXL_MBOX_RESOURCES_EXHAUSTED;
>  }
>  
>  ret = cxl_detect_malformed_extent_list(ct3d, in);
>  if (ret != CXL_MBOX_SUCCESS) {
> +cxl_extent_group_list_delete_front(&ct3d->dc.extents_pending);

If it's a bad message from the host, I don't think the device is supposed to
do anything with pending extents.
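
i.e. roughly (sketch only, using the helpers from this patch):

    ret = cxl_detect_malformed_extent_list(ct3d, in);
    if (ret != CXL_MBOX_SUCCESS) {
        return ret;   /* pending group left untouched */
    }

    ret = cxl_dcd_add_dyn_cap_rsp_dry_run(ct3d, in);
    if (ret != CXL_MBOX_SUCCESS) {
        return ret;   /* likewise */
    }

    /* Only a valid response consumes the front pending group */
    cxl_extent_group_list_delete_front(&ct3d->dc.extents_pending);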

>  return ret;
>  }
>  
>  ret = cxl_dcd_add_dyn_cap_rsp_dry_run(ct3d, in);
>  if (ret != CXL_MBOX_SUCCESS) {
> +cxl_extent_group_list_delete_front(&ct3d->dc.extents_pending);
>  return ret;
>  }



> diff --git a/hw/mem/cxl_type3.c b/hw/mem/cxl_type3.c
> index 

Re: [PATCH v2] hw/mem/cxl_type3: reset dvsecs in ct3d_reset()

2024-04-11 Thread Jonathan Cameron via
On Tue,  9 Apr 2024 15:58:46 +0800
Li Zhijian  wrote:

> After the kernel commit
> 0cab68720598 ("cxl/pci: Fix disabling memory if DVSEC CXL Range does not 
> match a CFMWS window")
> CXL type3 devices cannot be enabled again after the reboot because the
> control register (see 8.1.3.2 in the CXL 2.0 specification for more details) was
> not reset.
> 
> These registers could be changed by the firmware or OS, let them have
> their initial value in reboot so that the OS can read their clean status.
> 
> Fixes: e1706ea83da0 ("hw/cxl/device: Add a memory device (8.2.8.5)")
> Signed-off-by: Li Zhijian 
Hi,

We need to have a close look at what this is actually doing before
considering applying it.  I don't have time to get to that this week, but
hopefully will find some time later this month.

I don't want a partial fix for one particular case that causes
us potential trouble in others.

Jonathan

> ---
> root_port, usp and dsp have the same issue; if this patch gets approved,
> I will send another patch to fix them later.
> 
> V2:
>Add fixes tag.
>Reset all dvsecs registers instead of CTRL only
> ---
>  hw/mem/cxl_type3.c | 11 +++
>  1 file changed, 7 insertions(+), 4 deletions(-)
> 
> diff --git a/hw/mem/cxl_type3.c b/hw/mem/cxl_type3.c
> index b0a7e9f11b64..4f09d0b8fedc 100644
> --- a/hw/mem/cxl_type3.c
> +++ b/hw/mem/cxl_type3.c
> @@ -30,6 +30,7 @@
>  #include "hw/pci/msix.h"
>  
>  #define DWORD_BYTE 4
> +#define CT3D_CAP_SN_OFFSET PCI_CONFIG_SPACE_SIZE
>  
>  /* Default CDAT entries for a memory region */
>  enum {
> @@ -284,6 +285,10 @@ static void build_dvsecs(CXLType3Dev *ct3d)
>   range2_size_hi = 0, range2_size_lo = 0,
>   range2_base_hi = 0, range2_base_lo = 0;
>  
> +cxl_cstate->dvsec_offset = CT3D_CAP_SN_OFFSET;
> +if (ct3d->sn != UI64_NULL) {
> +cxl_cstate->dvsec_offset += PCI_EXT_CAP_DSN_SIZEOF;
> +}
>  /*
>   * Volatile memory is mapped as (0x0)
>   * Persistent memory is mapped at (volatile->size)
> @@ -664,10 +669,7 @@ static void ct3_realize(PCIDevice *pci_dev, Error **errp)
>  
>  pcie_endpoint_cap_init(pci_dev, 0x80);
>  if (ct3d->sn != UI64_NULL) {
> -pcie_dev_ser_num_init(pci_dev, 0x100, ct3d->sn);
> -cxl_cstate->dvsec_offset = 0x100 + 0x0c;
> -} else {
> -cxl_cstate->dvsec_offset = 0x100;
> +pcie_dev_ser_num_init(pci_dev, CT3D_CAP_SN_OFFSET, ct3d->sn);
>  }
>  
>  ct3d->cxl_cstate.pdev = pci_dev;
> @@ -907,6 +909,7 @@ static void ct3d_reset(DeviceState *dev)
>  
>  cxl_component_register_init_common(reg_state, write_msk, 
> CXL2_TYPE3_DEVICE);
>  cxl_device_register_init_t3(ct3d);
> +build_dvsecs(ct3d);
>  
>  /*
>   * Bring up an endpoint to target with MCTP over VDM.




Re: [PATCH v6 09/12] hw/cxl/events: Add qmp interfaces to add/release dynamic capacity extents

2024-04-10 Thread Jonathan Cameron via
On Tue, 9 Apr 2024 14:26:51 -0700
fan  wrote:

> On Fri, Apr 05, 2024 at 01:18:56PM +0100, Jonathan Cameron wrote:
> > On Mon, 25 Mar 2024 12:02:27 -0700
> > nifan@gmail.com wrote:
> >   
> > > From: Fan Ni 
> > > 
> > > To simulate FM functionalities for initiating Dynamic Capacity Add
> > > (Opcode 5604h) and Dynamic Capacity Release (Opcode 5605h) as in CXL spec
> > > r3.1 7.6.7.6.5 and 7.6.7.6.6, we implemented two QMP interfaces to issue
> > > add/release dynamic capacity extents requests.
> > > 
> > > With the change, we allow to release an extent only when its DPA range
> > > is contained by a single accepted extent in the device. That is to say,
> > > extent superset release is not supported yet.
> > > 
> > > 1. Add dynamic capacity extents:
> > > 
> > > For example, the command to add two continuous extents (each 128MiB long)
> > > to region 0 (starting at DPA offset 0) looks like below:
> > > 
> > > { "execute": "qmp_capabilities" }
> > > 
> > > { "execute": "cxl-add-dynamic-capacity",
> > >   "arguments": {
> > >   "path": "/machine/peripheral/cxl-dcd0",
> > >   "region-id": 0,
> > >   "extents": [
> > >   {
> > >   "offset": 0,
> > >   "len": 134217728
> > >   },
> > >   {
> > >   "offset": 134217728,
> > >   "len": 134217728
> > >   }  
> > 
> > Hi Fan,
> > 
> > I talk more on this inline, but to me this interface takes multiple extents
> > so that we can treat them as a single 'offer' of capacity. That is they
> > should be linked in the event log with the more flag and the host should
> > have to handle them in one go (I know Ira and Navneet's code doesn't handle
> > this yet, but that doesn't mean QEMU shouldn't).
> > 
> > Alternative for now would be to only support a single entry.  Keep the
> > interface defined to take multiple entries but reject it at runtime.
> > 
> > I don't want to end up with a more complex interface in the end just
> > because we allowed this form to not set the MORE flag today.
> > We will need this to do tagged handling and ultimately sharing, so good
> > to get it right from the start.
> > 
> > For tagged handling I think the right option is to have the tag alongside
> > region-id not in the individual extents.  That way the interface is 
> > naturally
> > used to generate the right description to the host.
> >   
> > >   ]
> > >   }
> > > }  
> Hi Jonathan,
> Thanks for the detailed comments.
> 
> For the QMP interface, I have one question. 
> Do we want the interface to follow exactly as shown in
> Table 7-70 and Table 7-71 in cxl r3.1?

I don't mind if it doesn't as long as it lets us pass reasonable
things in to test the kernel code.  I'd have the interface designed
to allow us to generate the set of records associated with a given
'request'.  E.g. All same tag in the same QMP command.

If we want multiple sets of such records (and the extents to back
them) we can issue multiple calls.
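
So, illustratively (not a final schema - the tag just moves up alongside
region-id, rather than being per extent):

{ "execute": "cxl-add-dynamic-capacity",
  "arguments": {
      "path": "/machine/peripheral/cxl-dcd0",
      "region-id": 0,
      "tag": "tenant-A",
      "extents": [
          { "offset": 0, "len": 134217728 }
      ]
  }
}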

Jonathan



> 
> Fan
> 
> > > 
> > > 2. Release dynamic capacity extents:
> > > 
> > > For example, the command to release an extent of size 128MiB from region 0
> > > (DPA offset 128MiB) looks like below:
> > > 
> > > { "execute": "cxl-release-dynamic-capacity",
> > >   "arguments": {
> > >   "path": "/machine/peripheral/cxl-dcd0",
> > >   "region-id": 0,
> > >   "extents": [
> > >   {
> > >   "offset": 134217728,
> > >   "len": 134217728
> > >   }
> > >   ]
> > >   }
> > > }
> > > 
> > > Signed-off-by: Fan Ni   
> > 
> > 
> >   
> > >  /* to-be-added range should not overlap with range already 
> > > accepted */
> > >  QTAILQ_FOREACH(ent, >dc.extents, node) {
> > > @@ -1585,9 +1586,13 @@ static CXLRetCode cmd_dcd_add_dyn_cap_rsp(const 
> > > struct cxl_cmd *cmd,
> > >  CXLDCExtentList *extent_list = &ct3d->dc.extents;
> > >  uint32_t i;
> > >  uint64_t dpa, len;
> > > +CXLDCExtent *ent;
> > >  CXLRetCode ret;
> > >  
> > >  if (i

Re: [External] Re: [PATCH v11 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info

2024-04-10 Thread Jonathan Cameron via
On Tue, 9 Apr 2024 12:02:31 -0700
"Ho-Ren (Jack) Chuang"  wrote:

> Hi Jonathan,
> 
> On Tue, Apr 9, 2024 at 9:12 AM Jonathan Cameron
>  wrote:
> >
> > On Fri, 5 Apr 2024 15:43:47 -0700
> > "Ho-Ren (Jack) Chuang"  wrote:
> >  
> > > On Fri, Apr 5, 2024 at 7:03 AM Jonathan Cameron
> > >  wrote:  
> > > >
> > > > On Fri,  5 Apr 2024 00:07:06 +
> > > > "Ho-Ren (Jack) Chuang"  wrote:
> > > >  
> > > > > The current implementation treats emulated memory devices, such as
> > > > > CXL1.1 type3 memory, as normal DRAM when they are emulated as normal 
> > > > > memory
> > > > > (E820_TYPE_RAM). However, these emulated devices have different
> > > > > characteristics than traditional DRAM, making it important to
> > > > > distinguish them. Thus, we modify the tiered memory initialization 
> > > > > process
> > > > > to introduce a delay specifically for CPUless NUMA nodes. This delay
> > > > > ensures that the memory tier initialization for these nodes is 
> > > > > deferred
> > > > > until HMAT information is obtained during the boot process. Finally,
> > > > > demotion tables are recalculated at the end.
> > > > >
> > > > > * late_initcall(memory_tier_late_init);
> > > > > Some device drivers may have initialized memory tiers between
> > > > > `memory_tier_init()` and `memory_tier_late_init()`, potentially 
> > > > > bringing
> > > > > online memory nodes and configuring memory tiers. They should be 
> > > > > excluded
> > > > > in the late init.
> > > > >
> > > > > * Handle cases where there is no HMAT when creating memory tiers
> > > > > There is a scenario where a CPUless node does not provide HMAT 
> > > > > information.
> > > > > If no HMAT is specified, it falls back to using the default DRAM tier.
> > > > >
> > > > > * Introduce another new lock `default_dram_perf_lock` for adist 
> > > > > calculation
> > > > > In the current implementation, iterating through CPUlist nodes 
> > > > > requires
> > > > > holding the `memory_tier_lock`. However, `mt_calc_adistance()` will 
> > > > > end up
> > > > > trying to acquire the same lock, leading to a potential deadlock.
> > > > > Therefore, we propose introducing a standalone 
> > > > > `default_dram_perf_lock` to
> > > > > protect `default_dram_perf_*`. This approach not only avoids deadlock
> > > > > but also prevents holding a large lock simultaneously.
> > > > >
> > > > > * Upgrade `set_node_memory_tier` to support additional cases, 
> > > > > including
> > > > >   default DRAM, late CPUless, and hot-plugged initializations.
> > > > > To cover hot-plugged memory nodes, `mt_calc_adistance()` and
> > > > > `mt_find_alloc_memory_type()` are moved into `set_node_memory_tier()` 
> > > > > to
> > > > > handle cases where memtype is not initialized and where HMAT 
> > > > > information is
> > > > > available.
> > > > >
> > > > > * Introduce `default_memory_types` for those memory types that are not
> > > > >   initialized by device drivers.
> > > > > Because late initialized memory and default DRAM memory need to be 
> > > > > managed,
> > > > > a default memory type is created for storing all memory types that are
> > > > > not initialized by device drivers and as a fallback.
> > > > >
> > > > > Signed-off-by: Ho-Ren (Jack) Chuang 
> > > > > Signed-off-by: Hao Xiang 
> > > > > Reviewed-by: "Huang, Ying"   
> > > >
> > > > Hi - one remaining question. Why can't we delay init for all nodes
> > > > to either drivers or your fallback late_initcall code.
> > > > It would be nice to reduce possible code paths.  
> > >
> > > I try not to change too much of the existing code structure in
> > > this patchset.
> > >
> > > To me, postponing/moving all memory tier registrations to
> > > late_initcall() is another possible action item for the next patchset.
> > >
> > > After tier_mem(), hmat_init() is called, which requires registering
> > > `default_dram_type` info. This is when `default_dram_type` is needed.
> > > However, it is indeed possible to postpone the latter part,
> > > set_node_memory_tier(), to `late_init(). So, memory_tier_init() can
> > > indeed be split into two parts, and the latter part can be moved to
> > > late_initcall() to be processed together.
> > >
> > > Doing this, all memory-type drivers would have to call late_initcall() to
> > > register a memory tier. I'm not sure how many there are?
> > >
> > > What do you guys think?  
> >
> > Gut feeling - if you are going to move it for some cases, move it for
> > all of them.  Then we only have to test once ;)
> >
> > J  
> 
> Thank you for your reminder! I agree~ That's why I'm considering
> changing them in the next patchset because of the amount of changes.
> And also, this patchset already contains too many things.

Makes sense.  (Interestingly we are reaching the same conclusion
for the thread that motivated suggesting bringing them all together
in the first place!)

Get things working in a clean fashion, then consider moving everything to
happen at the same time to simplify testing etc.

Jonathan



Re: [PATCH v11 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info

2024-04-09 Thread Jonathan Cameron via
On Fri, 5 Apr 2024 15:43:47 -0700
"Ho-Ren (Jack) Chuang"  wrote:

> On Fri, Apr 5, 2024 at 7:03 AM Jonathan Cameron
>  wrote:
> >
> > On Fri,  5 Apr 2024 00:07:06 +
> > "Ho-Ren (Jack) Chuang"  wrote:
> >  
> > > The current implementation treats emulated memory devices, such as
> > > CXL1.1 type3 memory, as normal DRAM when they are emulated as normal 
> > > memory
> > > (E820_TYPE_RAM). However, these emulated devices have different
> > > characteristics than traditional DRAM, making it important to
> > > distinguish them. Thus, we modify the tiered memory initialization process
> > > to introduce a delay specifically for CPUless NUMA nodes. This delay
> > > ensures that the memory tier initialization for these nodes is deferred
> > > until HMAT information is obtained during the boot process. Finally,
> > > demotion tables are recalculated at the end.
> > >
> > > * late_initcall(memory_tier_late_init);
> > > Some device drivers may have initialized memory tiers between
> > > `memory_tier_init()` and `memory_tier_late_init()`, potentially bringing
> > > online memory nodes and configuring memory tiers. They should be excluded
> > > in the late init.
> > >
> > > * Handle cases where there is no HMAT when creating memory tiers
> > > There is a scenario where a CPUless node does not provide HMAT 
> > > information.
> > > If no HMAT is specified, it falls back to using the default DRAM tier.
> > >
> > > * Introduce another new lock `default_dram_perf_lock` for adist 
> > > calculation
> > > In the current implementation, iterating through CPUlist nodes requires
> > > holding the `memory_tier_lock`. However, `mt_calc_adistance()` will end up
> > > trying to acquire the same lock, leading to a potential deadlock.
> > > Therefore, we propose introducing a standalone `default_dram_perf_lock` to
> > > protect `default_dram_perf_*`. This approach not only avoids deadlock
> > > but also prevents holding a large lock simultaneously.
> > >
> > > * Upgrade `set_node_memory_tier` to support additional cases, including
> > >   default DRAM, late CPUless, and hot-plugged initializations.
> > > To cover hot-plugged memory nodes, `mt_calc_adistance()` and
> > > `mt_find_alloc_memory_type()` are moved into `set_node_memory_tier()` to
> > > handle cases where memtype is not initialized and where HMAT information 
> > > is
> > > available.
> > >
> > > * Introduce `default_memory_types` for those memory types that are not
> > >   initialized by device drivers.
> > > Because late initialized memory and default DRAM memory need to be 
> > > managed,
> > > a default memory type is created for storing all memory types that are
> > > not initialized by device drivers and as a fallback.
> > >
> > > Signed-off-by: Ho-Ren (Jack) Chuang 
> > > Signed-off-by: Hao Xiang 
> > > Reviewed-by: "Huang, Ying"   
> >
> > Hi - one remaining question. Why can't we delay init for all nodes
> > to either drivers or your fallback late_initcall code.
> > It would be nice to reduce possible code paths.  
> 
> I try not to change too much of the existing code structure in
> this patchset.
> 
> To me, postponing/moving all memory tier registrations to
> late_initcall() is another possible action item for the next patchset.
> 
> After tier_mem(), hmat_init() is called, which requires registering
> `default_dram_type` info. This is when `default_dram_type` is needed.
> However, it is indeed possible to postpone the latter part,
> set_node_memory_tier(), to `late_init(). So, memory_tier_init() can
> indeed be split into two parts, and the latter part can be moved to
> late_initcall() to be processed together.
> 
> Doing this, all memory-type drivers would have to call late_initcall() to
> register a memory tier. I'm not sure how many there are?
> 
> What do you guys think?

Gut feeling - if you are going to move it for some cases, move it for
all of them.  Then we only have to test once ;)

J
> 
> >
> > Jonathan
> >
> >  
> > > ---
> > >  mm/memory-tiers.c | 94 +++
> > >  1 file changed, 70 insertions(+), 24 deletions(-)
> > >
> > > diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> > > index 516b144fd45a..6632102bd5c9 100644
> > > --- a/mm/memory-tiers.c
> > > +++ b/mm/memory-tiers.c  
> >
> >

Re: [PATCH v6 09/12] hw/cxl/events: Add qmp interfaces to add/release dynamic capacity extents

2024-04-09 Thread Jonathan Cameron via
On Fri, 5 Apr 2024 14:09:23 -0400
Gregory Price  wrote:

> On Fri, Apr 05, 2024 at 06:44:52PM +0100, Jonathan Cameron wrote:
> > On Fri, 5 Apr 2024 12:07:45 -0400
> > Gregory Price  wrote:
> >   
> > > 3. (C) Upon Device receiving Release Dynamic Capacity Request
> > >a. check for a pending release request. If exists, error.  
> > 
> > Not sure that's necessary - can queue as long as the head
> > can track if the bits are in a pending release state.
> >   
> 
> Yeah probably it's fine to just queue the event and everything
> downstream just handles it.
> 
> > >b. check that the bits in the MHD bitmap are actually set  
> > Good.  
> > > 
> > >function: qmp_cxl_process_dynamic_capacity
> > > 
> > > 4. (D) Upon Device receiving Release Dynamic Capacity Response
> > >a. clear the bits in the mhd bitmap
> > >b. remove the pending request from the pending list
> > > 
> > >function: cmd_dcd_release_dyn_cap
> > > 
> > > Something to note: The MHD bitmap is essentially the same as the
> > > existing DCD extent bitmap - except that it is located in a shared
> > > region of memory (mmap file, shm, whatever - pick one).  
> > 
> > I think you will ideally also have a per head one to track head access
> > to the things offered by the mhd.
> >   
> 
> Generally I try not to duplicate state, reduces consistency problems.
> 
> You do still need a shared memory state and a per-head state to capture
> per-head data, but the allocation bitmap is really device-global state.

There is a separation between 'offered' to a head and 'accepted on that head'.
Sure, you could track all outstanding offers (if you let more than one be
outstanding) in the shared memory; it just seemed easier to do that in the
per-head element.


> 
> Either way you have a race condition when checking the bitmap during a
> memory access in the process of adding/releasing capacity - but that's
> more an indication of bad host behavior than it is of a bug in the
> implementation of the emulated device. Probably we don't need to
> read-lock the bitmap (for access validation), only write-lock.
> 
> My preference, for what it's worth, would be to have a single bitmap
> and have it be anonymous-memory for Single-head and file-backed
> for Multi-head.  I'll have to work out the locking mechanism.
I'll go with maybe until I see the code :)

J
> 
> ~Gregory
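As an aside on the anonymous vs file-backed choice discussed above, the mapping itself is cheap either way. A minimal user-space sketch (not QEMU code; names hypothetical) of obtaining such a shared bitmap:

#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map a shared allocation bitmap: file-backed when 'path' is given
 * (multi-head case), anonymous otherwise (single-head case).
 */
static unsigned long *dcd_bitmap_map(const char *path, size_t nbits)
{
    size_t bytes = (nbits + 7) / 8;
    int flags = MAP_SHARED;
    int fd = -1;
    void *p;

    if (path) {
        fd = open(path, O_RDWR | O_CREAT, 0600);
        if (fd < 0) {
            return NULL;
        }
        if (ftruncate(fd, bytes) < 0) {
            close(fd);
            return NULL;
        }
    } else {
        flags |= MAP_ANONYMOUS;
    }
    p = mmap(NULL, bytes, PROT_READ | PROT_WRITE, flags, fd, 0);
    if (fd >= 0) {
        close(fd); /* the mapping keeps the backing file alive */
    }
    return p == MAP_FAILED ? NULL : p;
}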




Re: How to use pxb-pcie in correct way?

2024-04-09 Thread Jonathan Cameron via
On Mon, 8 Apr 2024 13:58:00 +0200
Marcin Juszkiewicz  wrote:

> For quite a while I have been experimenting with the PCI Express setup on SBSA-Ref 
> system. And I finally decided to write.
> 
> We want to play with NUMA setup and "pxb-pcie" can be assigned to NUMA 
> node other than the cpu0 one. But adding it makes other cards disappear...
> 
> When I boot sbsa-ref I have plain PCIe setup:
> 
> (qemu) info pci
>Bus  0, device   0, function 0:
>  Host bridge: PCI device 1b36:0008
>PCI subsystem 1af4:1100
>id ""
>Bus  0, device   1, function 0:
>  Ethernet controller: PCI device 8086:10d3
>PCI subsystem 8086:
>IRQ 255, pin A
>BAR0: 32 bit memory at 0x [0x0001fffe].
>BAR1: 32 bit memory at 0x [0x0001fffe].
>BAR2: I/O at 0x [0x001e].
>BAR3: 32 bit memory at 0x [0x3ffe].
>BAR6: 32 bit memory at 0x [0x0003fffe].
>id ""
>Bus  0, device   2, function 0:
>  Display controller: PCI device 1234:
>PCI subsystem 1af4:1100
>BAR0: 32 bit prefetchable memory at 0x8000 [0x80ff].
>BAR2: 32 bit memory at 0x81084000 [0x81084fff].
>BAR6: 32 bit memory at 0x [0x7ffe].
>id ""
> 
> Adding extra PCIe card works fine - both just "igb" and "igb" with 
> "pcie-root-port".
> 
> But adding "pcie-root-port" + "igb" and then "pxb-pcie" makes "igb" 
> disappear:
> 
> ../code/qemu/build/qemu-system-aarch64
> -monitor telnet::45454,server,nowait
> -serial stdio
> -device pcie-root-port,id=ULyWl,slot=0,chassis=0
> -device igb,bus=ULyWl
> -device pxb-pcie,bus_nr=1

That's setting the base bus number to 1.  Very likely to clash with the bus
number for the bus below the root port.

Set it to bus_nr=128 or something like that.

There is no sanity checking for PXBs because the bus enumeration is
an EDK2 problem in general - short of enumerating the buses in QEMU
there isn't a way for it to tell.
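For example, something like the following should avoid the clash (untested; reuses the ids from the command line above):

-device pxb-pcie,bus_nr=128,id=pxb1
-device pcie-root-port,id=ULyWl,slot=0,chassis=0
-device igb,bus=ULyWl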

J



Re: [PATCH v6 09/12] hw/cxl/events: Add qmp interfaces to add/release dynamic capacity extents

2024-04-05 Thread Jonathan Cameron via
On Fri, 5 Apr 2024 12:07:45 -0400
Gregory Price  wrote:

> On Fri, Apr 05, 2024 at 01:27:19PM +0100, Jonathan Cameron wrote:
> > On Wed, 3 Apr 2024 14:16:25 -0400
> > Gregory Price  wrote:
> > 
> > A few follow up comments.
> >   
> > >   
> > > > +error_setg(errp, "no valid extents to send to process");
> > > > +return;
> > > > +}
> > > > +
> > > 
> > > I'm looking at adding the MHD extensions around this point, e.g.:
> > > 
> > > /* If MHD cannot allocate requested extents, the cmd fails */
> > > if (type == DC_EVENT_ADD_CAPACITY && dcd->mhd_dcd_extents_allocate &&
> > > num_extents != dcd->mhd_dcd_extents_allocate(...))
> > >   return;
> > > 
> > > where mhd_dcd_extents_allocate checks the MHD block bitmap and tags
> > > for correctness (shared // no double-allocations, etc). On success,
> > > it guarantees proper ownership.
> > > 
> > > the release path would then be done in the release response path from
> > > the host, as opposed to the release event injection.  
> > 
> > I think it would be polite to check whether the QMP command on release
> > is asking something plausible - makes for an easier-to-use
> > QMP interface.  I guess it's not strictly required though.
> > What races are there on release?  
> 
> The only real critical section, barring force-release being supported,
> is when you clear the bits in the device allowing new requests to swipe
> those blocks. The appropriate place appears to be after the host kernel
> has responded to the release extent request.

Agreed you can't release till then, but you can check if it's going to 
work.  I think that's worth doing for ease of use reasons.
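Such a check could reuse the helper this series adds later, e.g. (a sketch, names as in Fan's patch 12):

if (!ct3_test_region_block_backed(dcd, dpa, len)) {
    error_setg(errp, "cannot release extent with non-existing DPA range");
    return;
}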

> 
> Also need to handle the case of multiple add-requests contending for the
> same region, but that's just an "oops failed to get all the bits, roll
> back" scenario - easy to handle.
> 
> Could go coarse-grained to just lock access to the bitmap entirely while
> operating on it, or be fancy and use atomics to go lockless. The latter
> code already exists in the Niagara model for reference.

I'm fine either way, though I'd just use a lock in the initial version.

> 
> > We aren't supporting force release
> > for now, and for anything else, it's host specific (unlike add where
> > the extra rules kick in).   As such I 'think' a check at command
> > time will be valid as long as the host hasn't done an async
> > release of capacity between that and the event record.  That
> > is a race we always have and the host should at most log it and
> > not release capacity twice.
> >  
> 
> Borrowing from Ira's flow chart, here are the pieces I believe are
> needed to implement MHD support for DCD.
> 
> Orchestrator  FM Device   Host KernelHost User
> 
> | |   ||  |
> |-- Add ->|-- Add --->A--- Add --->|  |
> |  Capacity   |  Extent   |   Extent   |  |
> | |   ||  |
> | |<--Accept--B<--Accept --|  |
> | |   Extent  |   Extent   |  |
> | |   ||  |
> | | ... snip ...   |  |
> | |   ||  |
> |-- Remove -->|--Release->C---Release->|  |
> |  Capacity   |  Extent   |   Extent   |  |
> | |   ||  |
> | |<-Release--D<--Release--|  |
> | |  Extent   |   Extent   |  |
> | |   ||  |
> 
> 1. (A) Upon Device Receiving Add Capacity Request
>a. the device sanity checks the request against local mappings
>b. the mhd hook is called to sanity check against global mappings
>c. the mhd bitmap is updated, marking the capacity owned by that head
> 
>function: qmp_cxl_process_dynamic_capacity
> 
> 2. (B) Upon Device Receiving Add Dynamic Capacity Response
>a. accepted extents are compared to the original request
>b. not accepted extents are cleared from the bitmap (local and MHD)
>(Note: My understanding is that for now each request = 1 extent)

Yeah but that is a restriction I think we need to solve soon.

> 
>function: cmd_dcd_add_dyn_cap_rsp
> 
> 3. (C) Upon Device receiving Release Dynamic Capacity Request
>a. check for a p

Re: [RFC PATCH 5/5] cxl/core: add poison injection event handler

2024-04-05 Thread Jonathan Cameron via
On Fri, 15 Mar 2024 10:29:07 +0800
Shiyang Ruan  wrote:

> 在 2024/2/14 0:51, Jonathan Cameron 写道:
> >   
> >> +
> >> +void cxl_event_handle_record(struct cxl_memdev *cxlmd,
> >> +   enum cxl_event_log_type type,
> >> +   enum cxl_event_type event_type,
> >> +   const uuid_t *uuid, union cxl_event *evt)
> >> +{
> >> +  if (event_type == CXL_CPER_EVENT_GEN_MEDIA) {
> >>trace_cxl_general_media(cxlmd, type, &evt->gen_media);
> >> -  else if (event_type == CXL_CPER_EVENT_DRAM)
> >> +  /* handle poison event */
> >> +  if (type == CXL_EVENT_TYPE_FAIL)
> >> +  cxl_event_handle_poison(cxlmd, &evt->gen_media);  
> > 
> > I'm not 100% convinced this is necessarily poison causing.  Also
> > the text tells us we should see 'an appropriate event'.
> > DRAM one seems likely to be chosen by some vendors.  
> 
> I think it's right to use DRAM Event Record for volatile-memdev, but 
> should poison on a persistent-memdev also use DRAM Event Record too? 
> Though its 'Physical Address' field has the 'Volatile' bit too, which is 
> the same as the General Media Event Record.  I am a bit confused about this.

That is indeed 'novel' in a DRAM device, but maybe it could be battery
backed and have a path to say a flash device that isn't visible to CXL
and from which the DRAM is refilled on power restore?

Anyhow, doesn't make sense for persistent memory that doesn't correspond
to all the other stuff in the DRAM event.
> 
> > 
> > The fatal check maybe makes it a little more likely (maybe though
> > I'm not sure anything says a device must log it to the failure log)
> > but it might be Memory Event Type 1, which is the host tried to
> > access an invalid address.  Sure poison might be returned to that
> > error but what would the main kernel memory handling do with it?
> > Something is very wrong
> > but it's not corrupted device memory.  TE state violations are in there
> > as well. Sure poison is returned on reads (I think - haven't checked).
> > 
> > If the aim here is to say 'maybe there is poison, better check the
> > poison list', then that is reasonable but we should ensure things
> > like timer expiry are definitely ruled out and rename the function
> > to make it clear it might not find poison.  
> 
> I forgot to distinguish the 'Transaction Type' here. Host Inject Poison 
> is 0x04h. And other types should also have their specific handling method.
Yes. If you can use transaction type that solves this issue I think.
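Something along these lines, perhaps (a sketch only; the define is hypothetical rather than an existing kernel constant, with 04h being the Host Inject Poison transaction type mentioned above):

/* Hypothetical define: GMER Transaction Type 04h = Host Inject Poison */
#define CXL_GMER_TRANS_HOST_INJECT_POISON 0x04

if (evt->gen_media.transaction_type == CXL_GMER_TRANS_HOST_INJECT_POISON)
	cxl_event_handle_poison(cxlmd, &evt->gen_media);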
> 
> 
> --
> Thanks,
> Ruan.
> 
> > 
> > Jonathan  




Re: [PATCH] mem/cxl_type3: fix hpa to dpa logic

2024-04-05 Thread Jonathan Cameron via
On Mon, 1 Apr 2024 17:00:50 +0100
Jonathan Cameron via  wrote:

> On Thu, 28 Mar 2024 06:24:24 +
> "Xingtao Yao (Fujitsu)"  wrote:
> 
> > Jonathan
> > 
> > thanks for your reply!
> >   
> > > -----Original Message-
> > > From: Jonathan Cameron 
> > > Sent: Wednesday, March 27, 2024 9:28 PM
> > > To: Yao, Xingtao/姚 幸涛 
> > > Cc: fan...@samsung.com; qemu-devel@nongnu.org; Cao, Quanquan/曹 全全
> > > 
> > > Subject: Re: [PATCH] mem/cxl_type3: fix hpa to dpa logic
> > > 
> > > On Tue, 26 Mar 2024 21:46:53 -0400
> > > Yao Xingtao  wrote:
> > > 
> > > > In 3, 6, 12 interleave ways, we could not access cxl memory properly,
> > > > and when the process is running on it, a 'segmentation fault' error will
> > > > occur.
> > > >
> > > > According to the CXL specification '8.2.4.20.13 Decoder Protection',
> > > > there are two branches to convert HPA to DPA:
> > > > b1: Decoder[m].IW < 8 (for 1, 2, 4, 8, 16 interleave ways)
> > > > b2: Decoder[m].IW >= 8 (for 3, 6, 12 interleave ways)
> > > >
> > > > but only b1 has been implemented.
> > > >
> > > > To solve this issue, we should implement b2:
> > > >   DPAOffset[51:IG+8]=HPAOffset[51:IG+IW] / 3
> > > >   DPAOffset[IG+7:0]=HPAOffset[IG+7:0]
> > > >   DPA=DPAOffset + Decoder[n].DPABase
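For reference, the two branches map to code along these lines (a sketch, not the actual patch; IG/IW are the encoded decoder fields, with IW values 8/9/10 meaning 3/6/12 ways, and the helper name is made up):

static uint64_t dpa_offset_from_hpa_offset(uint64_t hpa_offset,
                                           unsigned int ig, unsigned int iw)
{
    uint64_t low = hpa_offset & ((1ULL << (ig + 8)) - 1);

    if (iw < 8) {
        /* b1: power-of-2 ways; iw encodes log2(ways) */
        return ((hpa_offset >> (ig + 8 + iw)) << (ig + 8)) | low;
    }
    /* b2: 3/6/12 ways; divide the upper bits by 3 */
    return (((hpa_offset >> (ig + iw)) / 3) << (ig + 8)) | low;
}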
> > > >
> > > > Links:
> > > > https://lore.kernel.org/linux-cxl/3e84b919-7631-d1db-3e1d-33000f3f3868@fujitsu.com/
> > > > Signed-off-by: Yao Xingtao 
> > > 
> > > Not implementing this was intentional (shouldn't seg fault obviously) but
> > > I thought we were not advertising EP support for 3, 6, 12?  The HDM 
> > > Decoder
> > > configuration checking is currently terrible so we don't prevent
> > > the bits being set (adding device side sanity checks for those decoders
> > > has been on the todo list for a long time).  There are a lot of ways of
> > > programming those that will blow up.
> > > 
> > > Can you confirm that the emulation reports they are supported.
> > > https://elixir.bootlin.com/qemu/v9.0.0-rc1/source/hw/cxl/cxl-component-utils.c
> > > #L246
> > > implies it shouldn't and so any software using them is broken.
> > yes, the feature is not supported by QEMU, but I can still create a 
> > 6-interleave-ways region at the kernel layer.
> > 
> > I checked the kernel source code and found that the kernel did not 
> > check this bit when committing the decoder.
> > We may add some checks on the kernel side.  
> 
> ouch.  We definitely want that check!  The decoder commit will fail
> anyway (which QEMU doesn't yet because we don't do all the sanity checks
> we should). However failing on commit is nasty as the reason should have
> been detected earlier.
> 
> >   
> > > 
> > > The non power of 2 decodes always made me nervous as the maths is more
> > > complex and any changes to that decode will need careful checking.
> > > For the power of 2 cases it was a bunch of writes to edge conditions etc
> > > and checking the right data landed in the backing stores.
> > after applying this modification, I tested some commands using this 
> > memory, like 'ls', 'top'...
> > They executed normally; maybe there are some other problems I 
> > haven't hit yet.  
> 
> I usually run a bunch of manual tests with devmem2 to ensure the edge cases 
> are handled
> correctly, but I've not really seen any errors that didn't also show up in 
> running
> stressors (e.g. stressng) or just memhog on the memory.


Hi Yao,

If you have time, please spin a v2 that also sets the relevant
flag to say the QEMU emulation supports this interleave.

Whilst we test the kernel fixes, we can just drop that patch, but longer term I'm
fine with having this support in general in the QEMU emulation - so I won't
queue it up as a fix, but instead as a feature.

Thanks,

Jonathan

> 
> Jonathan
> 
> >   
> > > 
> > > Jonathan
> > > 
> > > 
> > > > ---
> > > >  hw/mem/cxl_type3.c | 15 +++
> > > >  1 file changed, 11 insertions(+), 4 deletions(-)
> > > >
> > > > diff --git a/hw/mem/cxl_type3.c b/hw/mem/cxl_type3.c
> > > > index b0a7e9f11b..2c1218fb12 100644
> > > > --- a/hw/mem/cxl_type3.c
> > > > +++ b/hw/mem/cxl_type3.c
> > > > @@ -805,10 +805,17 @@

Re: [PATCH v11 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info

2024-04-05 Thread Jonathan Cameron via
On Fri,  5 Apr 2024 00:07:06 +
"Ho-Ren (Jack) Chuang"  wrote:

> The current implementation treats emulated memory devices, such as
> CXL1.1 type3 memory, as normal DRAM when they are emulated as normal memory
> (E820_TYPE_RAM). However, these emulated devices have different
> characteristics than traditional DRAM, making it important to
> distinguish them. Thus, we modify the tiered memory initialization process
> to introduce a delay specifically for CPUless NUMA nodes. This delay
> ensures that the memory tier initialization for these nodes is deferred
> until HMAT information is obtained during the boot process. Finally,
> demotion tables are recalculated at the end.
> 
> * late_initcall(memory_tier_late_init);
> Some device drivers may have initialized memory tiers between
> `memory_tier_init()` and `memory_tier_late_init()`, potentially bringing
> online memory nodes and configuring memory tiers. They should be excluded
> in the late init.
> 
> * Handle cases where there is no HMAT when creating memory tiers
> There is a scenario where a CPUless node does not provide HMAT information.
> If no HMAT is specified, it falls back to using the default DRAM tier.
> 
> * Introduce another new lock `default_dram_perf_lock` for adist calculation
> In the current implementation, iterating through CPUless nodes requires
> holding the `memory_tier_lock`. However, `mt_calc_adistance()` will end up
> trying to acquire the same lock, leading to a potential deadlock.
> Therefore, we propose introducing a standalone `default_dram_perf_lock` to
> protect `default_dram_perf_*`. This approach not only avoids deadlock
> but also prevents holding a large lock simultaneously.
> 
> * Upgrade `set_node_memory_tier` to support additional cases, including
>   default DRAM, late CPUless, and hot-plugged initializations.
> To cover hot-plugged memory nodes, `mt_calc_adistance()` and
> `mt_find_alloc_memory_type()` are moved into `set_node_memory_tier()` to
> handle cases where memtype is not initialized and where HMAT information is
> available.
> 
> * Introduce `default_memory_types` for those memory types that are not
>   initialized by device drivers.
> Because late initialized memory and default DRAM memory need to be managed,
> a default memory type is created for storing all memory types that are
> not initialized by device drivers and as a fallback.
> 
> Signed-off-by: Ho-Ren (Jack) Chuang 
> Signed-off-by: Hao Xiang 
> Reviewed-by: "Huang, Ying" 

Hi - one remaining question. Why can't we delay init for all nodes
to either drivers or your fallback late_initcall code.
It would be nice to reduce possible code paths.

Jonathan


> ---
>  mm/memory-tiers.c | 94 +++
>  1 file changed, 70 insertions(+), 24 deletions(-)
> 
> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> index 516b144fd45a..6632102bd5c9 100644
> --- a/mm/memory-tiers.c
> +++ b/mm/memory-tiers.c



> @@ -855,7 +892,8 @@ static int __init memory_tier_init(void)
>* For now we can have 4 faster memory tiers with smaller adistance
>* than default DRAM tier.
>*/
> - default_dram_type = alloc_memory_type(MEMTIER_ADISTANCE_DRAM);
> + default_dram_type = mt_find_alloc_memory_type(MEMTIER_ADISTANCE_DRAM,
> +                                               &default_memory_types);
>   if (IS_ERR(default_dram_type))
>   panic("%s() failed to allocate default DRAM tier\n", __func__);
>  
> @@ -865,6 +903,14 @@ static int __init memory_tier_init(void)
>* types assigned.
>*/
>   for_each_node_state(node, N_MEMORY) {
> + if (!node_state(node, N_CPU))
> + /*
> +  * Defer memory tier initialization on
> +  * CPUless numa nodes. These will be initialized
> +  * after firmware and devices are initialized.

Could the comment also say why we can't defer them all?

(In an odd coincidence we have a similar issue for some CPU hotplug
 related bring up where review feedback was move all cases later).

> +  */
> + continue;
> +
>   memtier = set_node_memory_tier(node);
>   if (IS_ERR(memtier))
>   /*




Re: [PATCH v11 1/2] memory tier: dax/kmem: introduce an abstract layer for finding, allocating, and putting memory types

2024-04-05 Thread Jonathan Cameron via
On Fri,  5 Apr 2024 00:07:05 +
"Ho-Ren (Jack) Chuang"  wrote:

> Since different memory devices require finding, allocating, and putting
> memory types, these common steps are abstracted in this patch,
> enhancing the scalability and conciseness of the code.
> 
> Signed-off-by: Ho-Ren (Jack) Chuang 
> Reviewed-by: "Huang, Ying" 
Reviewed-by: Jonathan Cameron 




Re: [PATCH v6 12/12] hw/mem/cxl_type3: Allow to release extent superset in QMP interface

2024-04-05 Thread Jonathan Cameron via
On Mon, 25 Mar 2024 12:02:30 -0700
nifan@gmail.com wrote:

> From: Fan Ni 
> 
> Before the change, the QMP interface used for add/release DC extents
> only allows to release an extent whose DPA range is contained by a single
> accepted extent in the device.
> 
> With the change, we relax the constraints.  As long as the DPA range of
> the extent is covered by accepted extents, we allow the release.
> 
> Signed-off-by: Fan Ni 

Nice.

Reviewed-by: Jonathan Cameron 

> ---
>  hw/mem/cxl_type3.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/hw/mem/cxl_type3.c b/hw/mem/cxl_type3.c
> index 2628a6f50f..62c2022477 100644
> --- a/hw/mem/cxl_type3.c
> +++ b/hw/mem/cxl_type3.c
> @@ -1935,8 +1935,7 @@ static void qmp_cxl_process_dynamic_capacity(const char 
> *path, CxlEventLog log,
> "cannot release extent with pending DPA range");
>  return;
>  }
> -if (!cxl_extents_contains_dpa_range(>dc.extents,
> -dpa, len)) {
> +if (!ct3_test_region_block_backed(dcd, dpa, len)) {
>  error_setg(errp,
> "cannot release extent with non-existing DPA 
> range");
>  return;




Re: [PATCH v6 11/12] hw/cxl/cxl-mailbox-utils: Add superset extent release mailbox support

2024-04-05 Thread Jonathan Cameron via
On Mon, 25 Mar 2024 12:02:29 -0700
nifan@gmail.com wrote:

> From: Fan Ni 
> 
> With the change, we extend the extent release mailbox command processing
> to allow more flexible release. As long as the DPA range of the extent to
> release is covered by accepted extent(s) in the device, the release can be
> performed.
> 
> Signed-off-by: Fan Ni 
Nothing to add from me.
Nice and simple which is great.
Jonathan



Re: [PATCH v6 10/12] hw/mem/cxl_type3: Add dpa range validation for accesses to DC regions

2024-04-05 Thread Jonathan Cameron via
On Mon, 25 Mar 2024 12:02:28 -0700
nifan@gmail.com wrote:

> From: Fan Ni 
> 
> All dpa ranges in the DC regions are invalid to access until an extent

Let's be more consistent in commit logs and use DPA, DC, HPA etc. in all
caps. It's a bit of a mixture in this series at the moment.

> covering the range has been added.
I'd expand that to 'has been successfully accepted by the host.'

> Add a bitmap for each region to
> record whether a DC block in the region has been backed by DC extent.
> For the bitmap, a bit in the bitmap represents a DC block. When a DC
> extent is added, all the bits of the blocks in the extent will be set,
> which will be cleared when the extent is released.
> 
> Reviewed-by: Jonathan Cameron 
> Signed-off-by: Fan Ni 
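For reference, the bookkeeping the commit message describes boils down to something like the following (sketch using QEMU's bitmap helpers; function and field names assumed):

/* Mark / clear the DC blocks backing [dpa, dpa + len) in a region's
 * block bitmap, one bit per block_size chunk.
 */
static void ct3_set_region_block_backed(CXLDCRegion *region,
                                        uint64_t dpa, uint64_t len)
{
    bitmap_set(region->blk_bitmap, (dpa - region->base) / region->block_size,
               len / region->block_size);
}

static void ct3_clear_region_block_backed(CXLDCRegion *region,
                                          uint64_t dpa, uint64_t len)
{
    bitmap_clear(region->blk_bitmap, (dpa - region->base) / region->block_size,
                 len / region->block_size);
}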




Re: [PATCH v6 09/12] hw/cxl/events: Add qmp interfaces to add/release dynamic capacity extents

2024-04-05 Thread Jonathan Cameron via
On Wed, 3 Apr 2024 14:16:25 -0400
Gregory Price  wrote:

A few follow up comments.

> On Mon, Mar 25, 2024 at 12:02:27PM -0700, nifan@gmail.com wrote:
> > From: Fan Ni 
> > 
> > To simulate FM functionalities for initiating Dynamic Capacity Add
> > (Opcode 5604h) and Dynamic Capacity Release (Opcode 5605h) as in CXL spec
> > r3.1 7.6.7.6.5 and 7.6.7.6.6, we implemented two QMP interfaces to issue
> > add/release dynamic capacity extents requests.
> >   
> ... snip 
> > +
> > +/*
> > + * The main function to process dynamic capacity event. Currently DC 
> > extents
> > + * add/release requests are processed.
> > + */
> > +static void qmp_cxl_process_dynamic_capacity(const char *path, CxlEventLog 
> > log,
> > + CXLDCEventType type, uint16_t 
> > hid,
> > + uint8_t rid,
> > + CXLDCExtentRecordList 
> > *records,
> > + Error **errp)
> > +{  
> ... snip 
> > +/* Sanity check and count the extents */
> > +list = records;
> > +while (list) {
> > +offset = list->value->offset;
> > +len = list->value->len;
> > +dpa = offset + dcd->dc.regions[rid].base;
> > +
> > +if (len == 0) {
> > +error_setg(errp, "extent with 0 length is not allowed");
> > +return;
> > +}
> > +
> > +if (offset % block_size || len % block_size) {
> > +error_setg(errp, "dpa or len is not aligned to region block 
> > size");
> > +return;
> > +}
> > +
> > +if (offset + len > dcd->dc.regions[rid].len) {
> > +error_setg(errp, "extent range is beyond the region end");
> > +return;
> > +}
> > +
> > +/* No duplicate or overlapped extents are allowed */
> > +if (test_any_bits_set(blk_bitmap, offset / block_size,
> > +  len / block_size)) {
> > +error_setg(errp, "duplicate or overlapped extents are 
> > detected");
> > +return;
> > +}
> > +bitmap_set(blk_bitmap, offset / block_size, len / block_size);
> > +
> > +num_extents++;  
> 
> I think num_extents is always equal to the length of the list, otherwise
> this code will return with error.
> 
> Nitpick:
> This can be moved to the bottom w/ `list = list->next` to express that a
> little more clearly.
> 
> > +if (type == DC_EVENT_RELEASE_CAPACITY) {
> > +if (cxl_extents_overlaps_dpa_range(&ct3d->dc.extents_pending,
> > +   dpa, len)) {
> > +error_setg(errp,
> > +   "cannot release extent with pending DPA range");
> > +return;
> > +}
> > +if (!cxl_extents_contains_dpa_range(&ct3d->dc.extents,
> > +dpa, len)) {
> > +error_setg(errp,
> > +   "cannot release extent with non-existing DPA 
> > range");
> > +return;
> > +}
> > +}
> > +list = list->next;
> > +}
> > +
> > +if (num_extents == 0) {  
> 
> Since num_extents is always the length of the list, this is equivalent to
> `if (!records)` prior to the while loop. Makes it a little more clear that:
> 
> 1. There must be at least 1 extent
> 2. All extents must be valid for the command to be serviced.

Agreed.

> 
> > +error_setg(errp, "no valid extents to send to process");
> > +return;
> > +}
> > +  
> 
> I'm looking at adding the MHD extensions around this point, e.g.:
> 
> /* If MHD cannot allocate requested extents, the cmd fails */
> if (type == DC_EVENT_ADD_CAPACITY && dcd->mhd_dcd_extents_allocate &&
> num_extents != dcd->mhd_dcd_extents_allocate(...))
>   return;
> 
> where mhd_dcd_extents_allocate checks the MHD block bitmap and tags
> for correctness (shared // no double-allocations, etc). On success,
> it guarantees proper ownership.
> 
> the release path would then be done in the release response path from
> the host, as opposed to the release event injection.

I think it would be polite to check whether the QMP command on release
is asking something plausible - makes for an easier-to-use
QMP interface.  I guess it's not strictly required though.
What races are there on release?  We aren't supporting force release
for now, and for anything else, it's host specific (unlike add where
the extra rules kick in).   As such I 'think' a check at command
time will be valid as long as the host hasn't done an async
release of capacity between that and the event record.  That
is a race we always have and the host should at most log it and
not release capacity twice.

> 
> Do you see any issues with that flow?
> 
> > +/* Create extent list for event being passed to host */
> > +i = 0;
> > +list = records;
> > +extents = 

Re: [PATCH v6 09/12] hw/cxl/events: Add qmp interfaces to add/release dynamic capacity extents

2024-04-05 Thread Jonathan Cameron via
On Mon, 25 Mar 2024 12:02:27 -0700
nifan@gmail.com wrote:

> From: Fan Ni 
> 
> To simulate FM functionalities for initiating Dynamic Capacity Add
> (Opcode 5604h) and Dynamic Capacity Release (Opcode 5605h) as in CXL spec
> r3.1 7.6.7.6.5 and 7.6.7.6.6, we implemented two QMP interfaces to issue
> add/release dynamic capacity extents requests.
> 
> With the change, we allow to release an extent only when its DPA range
> is contained by a single accepted extent in the device. That is to say,
> extent superset release is not supported yet.
> 
> 1. Add dynamic capacity extents:
> 
> For example, the command to add two contiguous extents (each 128MiB long)
> to region 0 (starting at DPA offset 0) looks like below:
> 
> { "execute": "qmp_capabilities" }
> 
> { "execute": "cxl-add-dynamic-capacity",
>   "arguments": {
>   "path": "/machine/peripheral/cxl-dcd0",
>   "region-id": 0,
>   "extents": [
>   {
>   "offset": 0,
>   "len": 134217728
>   },
>   {
>   "offset": 134217728,
>   "len": 134217728
>   }

Hi Fan,

I talk more on this inline, but to me this interface takes multiple extents
so that we can treat them as a single 'offer' of capacity. That is they
should be linked in the event log with the more flag and the host should
have to handle them in one go (I know Ira and Navneet's code doesn't handle
this yet, but that doesn't mean QEMU shouldn't).

Alternative for now would be to only support a single entry.  Keep the
interface defined to take multiple entries but reject it at runtime.

I don't want to end up with a more complex interface in the end just
because we allowed this form to not set the MORE flag today.
We will need this to do tagged handling and ultimately sharing, so good
to get it right from the start.
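Roughly, grouping would mean the QMP handler setting the 'More' flag on every record it injects for the group except the last. A sketch, with the flag define hypothetical:

for (i = 0; i < num_extents; i++) {
    dCap.flags = 0;
    if (i < num_extents - 1) {
        /* Hypothetical define for the DC event record 'More' flag bit */
        dCap.flags |= CXL_DC_EVENT_MORE;
    }
    /* fill in extent, host id, region id, then inject the record */
}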

For tagged handling I think the right option is to have the tag alongside
region-id not in the individual extents.  That way the interface is naturally
used to generate the right description to the host.

>   ]
>   }
> }
> 
> 2. Release dynamic capacity extents:
> 
> For example, the command to release an extent of size 128MiB from region 0
> (DPA offset 128MiB) looks like below:
> 
> { "execute": "cxl-release-dynamic-capacity",
>   "arguments": {
>   "path": "/machine/peripheral/cxl-dcd0",
>   "region-id": 0,
>   "extents": [
>   {
>   "offset": 134217728,
>   "len": 134217728
>   }
>   ]
>   }
> }
> 
> Signed-off-by: Fan Ni 



>  /* to-be-added range should not overlap with range already accepted 
> */
>  QTAILQ_FOREACH(ent, &ct3d->dc.extents, node) {
> @@ -1585,9 +1586,13 @@ static CXLRetCode cmd_dcd_add_dyn_cap_rsp(const struct 
> cxl_cmd *cmd,
>  CXLDCExtentList *extent_list = &ct3d->dc.extents;
>  uint32_t i;
>  uint64_t dpa, len;
> +CXLDCExtent *ent;
>  CXLRetCode ret;
>  
>  if (in->num_entries_updated == 0) {
> +/* Always remove the first pending extent when response received. */
> +ent = QTAILQ_FIRST(&ct3d->dc.extents_pending);
> +cxl_remove_extent_from_extent_list(&ct3d->dc.extents_pending, ent);
>  return CXL_MBOX_SUCCESS;
>  }
>  
> @@ -1604,6 +1609,8 @@ static CXLRetCode cmd_dcd_add_dyn_cap_rsp(const struct 
> cxl_cmd *cmd,
>  
>  ret = cxl_dcd_add_dyn_cap_rsp_dry_run(ct3d, in);
>  if (ret != CXL_MBOX_SUCCESS) {
> +ent = QTAILQ_FIRST(&ct3d->dc.extents_pending);
> +cxl_remove_extent_from_extent_list(&ct3d->dc.extents_pending, ent);

Ah this deals with the todo I suggest you add to the earlier patch.
I'd not mind so much if you hadn't been so thorough on other todo notes ;)
Add one in the earlier patch and get rid of it here like you do below.

However as I note below I think we need to handle these as groups of extents,
not single extents. That way we keep an 'offered' set, offered at the same time
by a single command (and exposed to the host using the more flag), together and
reject them en masse.


>  return ret;
>  }
>  
> @@ -1613,10 +1620,9 @@ static CXLRetCode cmd_dcd_add_dyn_cap_rsp(const struct 
> cxl_cmd *cmd,
>  
>  cxl_insert_extent_to_extent_list(extent_list, dpa, len, NULL, 0);
>  ct3d->dc.total_extent_count += 1;
> -/*
> - * TODO: we will add a pending extent list based on event log record
> - * and process the list according here.
> - */
> +
> +ent = QTAILQ_FIRST(&ct3d->dc.extents_pending);
> +cxl_remove_extent_from_extent_list(&ct3d->dc.extents_pending, ent);
>  }
>  
>  return CXL_MBOX_SUCCESS;
> diff --git a/hw/mem/cxl_type3.c b/hw/mem/cxl_type3.c
> index 951bd79a82..74cb64e843 100644
> --- a/hw/mem/cxl_type3.c
> +++ b/hw/mem/cxl_type3.c

>  
>  static bool cxl_setup_memory(CXLType3Dev *ct3d, Error **errp)
> @@ -1449,7 +1454,8 @@ static int ct3d_qmp_cxl_event_log_enc(CxlEventLog log)
>  return CXL_EVENT_TYPE_FAIL;
>  case CXL_EVENT_LOG_FATAL:
>  return CXL_EVENT_TYPE_FATAL;
> -/* DCD not yet supported */


Re: [PATCH v6 08/12] hw/cxl/cxl-mailbox-utils: Add mailbox commands to support add/release dynamic capacity response

2024-04-05 Thread Jonathan Cameron via
On Mon, 25 Mar 2024 12:02:26 -0700
nifan@gmail.com wrote:

> From: Fan Ni 
> 
> Per CXL spec 3.1, two mailbox commands are implemented:
> Add Dynamic Capacity Response (Opcode 4802h) 8.2.9.9.9.3, and
> Release Dynamic Capacity (Opcode 4803h) 8.2.9.9.9.4.
> 
> For the process of the above two commands, we use a two-pass approach.
> Pass 1: Check whether the input payload is valid or not; if not, skip
> Pass 2 and return mailbox process error.
> Pass 2: Do the real work--add or release extents, respectively.
> 
> Signed-off-by: Fan Ni 

A few additional comments from me.

Jonathan


> +/*
> + * For the extents in the extent list to operate, check whether they are 
> valid
> + * 1. The extent should be in the range of a valid DC region;
> + * 2. The extent should not cross multiple regions;
> + * 3. The start DPA and the length of the extent should align with the block
> + * size of the region;
> + * 4. The address range of multiple extents in the list should not overlap.
> + */
> +static CXLRetCode cxl_detect_malformed_extent_list(CXLType3Dev *ct3d,
> +const CXLUpdateDCExtentListInPl *in)
> +{
> +uint64_t min_block_size = UINT64_MAX;
> +CXLDCRegion *region = &ct3d->dc.regions[0];

This is immediately overwritten if num_regions != 0 (which I think is checked
before calling this function).  So no need to initialize it.

> +CXLDCRegion *lastregion = &ct3d->dc.regions[ct3d->dc.num_regions - 1];
> +g_autofree unsigned long *blk_bitmap = NULL;
> +uint64_t dpa, len;
> +uint32_t i;
> +
> +for (i = 0; i < ct3d->dc.num_regions; i++) {
> +region = &ct3d->dc.regions[i];
> +min_block_size = MIN(min_block_size, region->block_size);
> +}
> +
> +blk_bitmap = bitmap_new((lastregion->base + lastregion->len -
> + ct3d->dc.regions[0].base) / min_block_size);
> +
> +for (i = 0; i < in->num_entries_updated; i++) {
> +dpa = in->updated_entries[i].start_dpa;
> +len = in->updated_entries[i].len;
> +
> +region = cxl_find_dc_region(ct3d, dpa, len);
> +if (!region) {
> +return CXL_MBOX_INVALID_PA;
> +}
> +
> +dpa -= ct3d->dc.regions[0].base;
> +if (dpa % region->block_size || len % region->block_size) {
> +return CXL_MBOX_INVALID_EXTENT_LIST;
> +}
> +/* the dpa range already covered by some other extents in the list */
> +if (test_any_bits_set(blk_bitmap, dpa / min_block_size,
> +len / min_block_size)) {
> +return CXL_MBOX_INVALID_EXTENT_LIST;
> +}
> +bitmap_set(blk_bitmap, dpa / min_block_size, len / min_block_size);
> +   }
> +
> +return CXL_MBOX_SUCCESS;
> +}



> +/*
> + * CXL r3.1 section 8.2.9.9.9.3: Add Dynamic Capacity Response (Opcode 4802h)
> + * An extent is added to the extent list and becomes usable only after the
> + * response is processed successfully
> + */
> +static CXLRetCode cmd_dcd_add_dyn_cap_rsp(const struct cxl_cmd *cmd,
> +  uint8_t *payload_in,
> +  size_t len_in,
> +  uint8_t *payload_out,
> +  size_t *len_out,
> +  CXLCCI *cci)
> +{
> +CXLUpdateDCExtentListInPl *in = (void *)payload_in;
> +CXLType3Dev *ct3d = CXL_TYPE3(cci->d);
> +CXLDCExtentList *extent_list = &ct3d->dc.extents;
> +uint32_t i;
> +uint64_t dpa, len;
> +CXLRetCode ret;
> +
> +if (in->num_entries_updated == 0) {
> +return CXL_MBOX_SUCCESS;
> +}


A zero length response is a rejection of an offered set of extents.
Probably want a todo here to say this will wipe out part of the pending list
(similar to the one you have below).

> +
> +/* Adding extents causes exceeding device's extent tracking ability. */
> +if (in->num_entries_updated + ct3d->dc.total_extent_count >
> +CXL_NUM_EXTENTS_SUPPORTED) {
> +return CXL_MBOX_RESOURCES_EXHAUSTED;
> +}
> +
> +ret = cxl_detect_malformed_extent_list(ct3d, in);
> +if (ret != CXL_MBOX_SUCCESS) {
> +return ret;
> +}
> +
> +ret = cxl_dcd_add_dyn_cap_rsp_dry_run(ct3d, in);
> +if (ret != CXL_MBOX_SUCCESS) {
> +return ret;
> +}
> +
> +for (i = 0; i < in->num_entries_updated; i++) {
> +dpa = in->updated_entries[i].start_dpa;
> +len = in->updated_entries[i].len;
> +
> +cxl_insert_extent_to_extent_list(extent_list, dpa, len, NULL, 0);
> +ct3d->dc.total_extent_count += 1;
> +/*
> + * TODO: we will add a pending extent list based on event log record
> + * and process the list according here.
> + */
> +}
> +
> +return CXL_MBOX_SUCCESS;
> +}

> +static CXLRetCode cxl_dc_extent_release_dry_run(CXLType3Dev *ct3d,
> +const CXLUpdateDCExtentListInPl *in)
> +{
> +CXLDCExtent *ent, *ent_next;
> +uint64_t dpa, 

Re: [PATCH v6 08/12] hw/cxl/cxl-mailbox-utils: Add mailbox commands to support add/release dynamic capacity response

2024-04-05 Thread Jonathan Cameron via
On Thu, 4 Apr 2024 13:32:23 +
Jørgen Hansen  wrote:

Hi Jørgen,

> > +static CXLRetCode cmd_dcd_add_dyn_cap_rsp(const struct cxl_cmd *cmd,
> > +  uint8_t *payload_in,
> > +  size_t len_in,
> > +  uint8_t *payload_out,
> > +  size_t *len_out,
> > +  CXLCCI *cci)
> > +{
> > +CXLUpdateDCExtentListInPl *in = (void *)payload_in;
> > +CXLType3Dev *ct3d = CXL_TYPE3(cci->d);
> > +CXLDCExtentList *extent_list = &ct3d->dc.extents;
> > +uint32_t i;
> > +uint64_t dpa, len;
> > +CXLRetCode ret;
> > +
> > +if (in->num_entries_updated == 0) {
> > +return CXL_MBOX_SUCCESS;
> > +}  
> 
> The mailbox processing in patch 2 converts from le explicitly, whereas 
> the mailbox commands here don't. Looking at the existing mailbox 
> > commands, conversion doesn't seem to be rigorously applied, so maybe 
> that is OK?

The early CXL code didn't take this into account much at all. We've
sort of been fixing stuff up as we happen to be working on it. Hence
some stuff is big endian safe and some not :(

Patches welcome, but it would be good to not introduce more cases
that need fixing when we eventually clean them all up (and have
a big endian test platform to see if we got it right!)

Jonathan
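For anyone picking this up: QEMU's bswap helpers make the conversion mechanical. For example, a sketch of reading the count field above in an endian-safe way:

uint32_t num_entries = ldl_le_p(&in->num_entries_updated);

if (num_entries == 0) {
    return CXL_MBOX_SUCCESS;
}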



Re: [PATCH v6 07/12] hw/mem/cxl_type3: Add DC extent list representative and get DC extent list mailbox support

2024-04-05 Thread Jonathan Cameron via
On Mon, 25 Mar 2024 12:02:25 -0700
nifan@gmail.com wrote:

> From: Fan Ni 
> 
> Add dynamic capacity extent list representative to the definition of
> CXLType3Dev and implement get DC extent list mailbox command per
> CXL.spec.3.1:.8.2.9.9.9.2.
> 
> Signed-off-by: Fan Ni 

One really minor comment inline.
Reviewed-by: Jonathan Cameron 

>  
> +/*
> + * CXL r3.1 section 8.2.9.9.9.2:
> + * Get Dynamic Capacity Extent List (Opcode 4801h)
> + */
> +static CXLRetCode cmd_dcd_get_dyn_cap_ext_list(const struct cxl_cmd *cmd,
> +   uint8_t *payload_in,
> +   size_t len_in,
> +   uint8_t *payload_out,
> +   size_t *len_out,
> +   CXLCCI *cci)
> +{
> +CXLType3Dev *ct3d = CXL_TYPE3(cci->d);
> +struct {
> +uint32_t extent_cnt;
> +uint32_t start_extent_id;
> +} QEMU_PACKED *in = (void *)payload_in;
> +struct {
> +uint32_t count;
> +uint32_t total_extents;
> +uint32_t generation_num;
> +uint8_t rsvd[4];
> +CXLDCExtentRaw records[];
> +} QEMU_PACKED *out = (void *)payload_out;
> +uint32_t start_extent_id = in->start_extent_id;
> +CXLDCExtentList *extent_list = &ct3d->dc.extents;
> +uint16_t record_count = 0, i = 0, record_done = 0;
> +uint16_t out_pl_len, size;
> +CXLDCExtent *ent;
> +
> +if (start_extent_id > ct3d->dc.total_extent_count) {
> +return CXL_MBOX_INVALID_INPUT;
> +}
> +
> +record_count = MIN(in->extent_cnt,
> +   ct3d->dc.total_extent_count - start_extent_id);
> +size = CXL_MAILBOX_MAX_PAYLOAD_SIZE - sizeof(*out);
> +if (size / sizeof(out->records[0]) < record_count) {
> +record_count = size / sizeof(out->records[0]);
> +}

Could use another min for this I think?
record_count = MIN(record_count, size / sizeof(out->records[0]));

> +out_pl_len = sizeof(*out) + record_count * sizeof(out->records[0]);
> +
> +stl_le_p(&out->count, record_count);
> +stl_le_p(&out->total_extents, ct3d->dc.total_extent_count);
> +stl_le_p(&out->generation_num, ct3d->dc.ext_list_gen_seq);
> +
> +if (record_count > 0) {
> +CXLDCExtentRaw *out_rec = &out->records[record_done];
> +
> +QTAILQ_FOREACH(ent, extent_list, node) {
> +if (i++ < start_extent_id) {
> +continue;
> +}
> +stq_le_p(&out_rec->start_dpa, ent->start_dpa);
> +stq_le_p(&out_rec->len, ent->len);
> +memcpy(&out_rec->tag, ent->tag, 0x10);
> +stw_le_p(&out_rec->shared_seq, ent->shared_seq);
> +
> +record_done++;
> +if (record_done == record_count) {
> +break;
> +}
> +}
> +}
> +
> +*len_out = out_pl_len;
> +return CXL_MBOX_SUCCESS;
> +}




Re: [PATCH v6 06/12] hw/mem/cxl_type3: Add host backend and address space handling for DC regions

2024-04-05 Thread Jonathan Cameron via
On Mon, 25 Mar 2024 12:02:24 -0700
nifan@gmail.com wrote:

> From: Fan Ni 
> 
> Add (file/memory backed) host backend, all the dynamic capacity regions
> will share a single, large enough host backend. 

This doesn't parse.  I suggest splitting it into 2 sentences.

Add (file/memory backend) host backend for DCD.  All the dynamic capacity
regions will share a single, large enough host backend.

> Set up address space for
> DC regions to support read/write operations to dynamic capacity for DCD.
> 
> With the change, following supports are added:

Oddity of English wrt plurals.

With this change, the following support is added.

> 1. Add a new property to type3 device "volatile-dc-memdev" to point to host
>memory backend for dynamic capacity. Currently, all dc regions share one
>host backend.
> 2. Add namespace for dynamic capacity for read/write support;
> 3. Create cdat entries for each dynamic capacity region;
> 
> Signed-off-by: Fan Ni 
All comments trivial with the exception of the one about setting the size of range
registers. For now I think just set the flags and we will deal with whatever
output we get from the consortium in the long run.
With that tweaked.

Reviewed-by: Jonathan Cameron 
> ---
>  hw/cxl/cxl-mailbox-utils.c  |  16 ++-
>  hw/mem/cxl_type3.c  | 187 +---
>  include/hw/cxl/cxl_device.h |   8 ++
>  3 files changed, 172 insertions(+), 39 deletions(-)
> 
> diff --git a/hw/cxl/cxl-mailbox-utils.c b/hw/cxl/cxl-mailbox-utils.c
> index 0f2ad58a14..831cef0567 100644
> --- a/hw/cxl/cxl-mailbox-utils.c
> +++ b/hw/cxl/cxl-mailbox-utils.c
> @@ -622,7 +622,8 @@ static CXLRetCode cmd_firmware_update_get_info(const 
> struct cxl_cmd *cmd,

> diff --git a/hw/mem/cxl_type3.c b/hw/mem/cxl_type3.c
> index a9e8bdc436..75ea9b20e1 100644
> --- a/hw/mem/cxl_type3.c
> +++ b/hw/mem/cxl_type3.c
> @@ -45,7 +45,8 @@ enum {



> +if (dc_mr) {
> +int i;
> +uint64_t region_base = vmr_size + pmr_size;
> +
> +/*
> + * TODO: we assume the dynamic capacity to be volatile for now,
> + * non-volatile dynamic capacity will be added if needed in the
> + * future.

Trivial but I'd make that 2 sentences with a full stop after "now".


>  assert(len == cur_ent);
>  
>  *cdat_table = g_steal_pointer(&table);
> @@ -300,11 +336,24 @@ static void build_dvsecs(CXLType3Dev *ct3d)
>  range2_size_hi = ct3d->hostpmem->size >> 32;
>  range2_size_lo = (2 << 5) | (2 << 2) | 0x3 |
>   (ct3d->hostpmem->size & 0xF0000000);
> +} else if (ct3d->dc.host_dc) {
> +range2_size_hi = ct3d->dc.host_dc->size >> 32;
> +range2_size_lo = (2 << 5) | (2 << 2) | 0x3 |
> + (ct3d->dc.host_dc->size & 0xF0000000);
>  }
> -} else {
> +} else if (ct3d->hostpmem) {
>  range1_size_hi = ct3d->hostpmem->size >> 32;
>  range1_size_lo = (2 << 5) | (2 << 2) | 0x3 |
>   (ct3d->hostpmem->size & 0xF0000000);
> +if (ct3d->dc.host_dc) {
> +range2_size_hi = ct3d->dc.host_dc->size >> 32;
> +range2_size_lo = (2 << 5) | (2 << 2) | 0x3 |
> + (ct3d->dc.host_dc->size & 0xF0000000);
> +}
> +} else {
> +range1_size_hi = ct3d->dc.host_dc->size >> 32;
> +range1_size_lo = (2 << 5) | (2 << 2) | 0x3 |
> + (ct3d->dc.host_dc->size & 0xF0000000);
>  }

As per your cover letter this is a work around for an ambiguity in the
spec and what Linux is currently doing with it.  However as per the call
the other day, Linux only checks the flags.  So I'd set those only and
not the size field.  We may have to deal with spec errata later, but
I don't want to block this series on the corner case in the meantime.

Given the complexity of DC we'll be waiting forever if we have to get
all clarifications before we land anything!
(Quick though those nice folk in the CXL consortium working groups are :))
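Concretely, 'set those only' would mean something like the following in the branches above (a sketch, keeping the flag encoding already used in this code and leaving the size bits zero):

/* Report the valid/active flag bits, but no size */
range2_size_hi = 0;
range2_size_lo = (2 << 5) | (2 << 2) | 0x3;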


> @@ -679,9 +746,41 @@ static bool cxl_setup_memory(CXLType3Dev *ct3d, Error 
> **errp)
>  g_free(p_name);
>  }
>  
> -if (!cxl_create_dc_regions(ct3d, errp)) {
> -error_setg(errp, "setup DC regions failed");
> -return false;
> +ct3d->dc.total_capacity = 0;
> +if (ct3d->dc.num_regions) {

Trivial suggestion.

As dc.num_regions already existed from patch 4, maybe it's worth pushing this
if statement back there?  It will be harmless short 

Re: [RFC PATCH v2 3/6] cxl/core: add report option for cxl_mem_get_poison()

2024-04-04 Thread Jonathan Cameron via
On Wed, 3 Apr 2024 22:56:58 +0800
Shiyang Ruan  wrote:

> 在 2024/3/30 9:50, Dan Williams 写道:
> > Shiyang Ruan wrote:  
> >> The GMER only has "Physical Address" field, no such one indicates length.
> >> So, when a poison event is received, we could use GET_POISON_LIST command
> >> to get the poison list.  Now driver has cxl_mem_get_poison(), so
> >> reuse it and add a parameter 'bool report', report poison record to MCE
> >> if set true.  
> > 
> > I am not sure I agree with the rationale here because there is no
> > correlation between the event being signaled and the current state of
> > the poison list. It also establishes a race between multiple GMER events,
> > i.e. imagine the hardware sends 4 GMER events to communicate a 256B
> > poison discovery event. Does the driver need logic to support GMER event
> > 2, 3, and 4 if it already saw all 256B of poison after processing GMER
> > event 1?  
> 
> Yes, I didn't think about that.
> 
> > 
> > I think the best the driver can do is assume at least 64B of poison
> > per-event and depend on multiple notifications to handle larger poison
> > lengths.  
> 
> Agree.  This also makes things easier.
> 
> And for qemu, I'm thinking of making a patch to limit the length of a 
> poison record when injecting. The length should be between 64B and 4KiB per 
> GMER. And emit many GMERs if length > 4KiB.

I'm not keen on such a restriction in QEMU.
QEMU is injecting lengths allowed by the specification.  That facility is
useful for testing the kernel and the QEMU modeling should not be based
on what the kernel supports.

When you said this I wondered if we had a clever implementation that fused
entries in the list, but we don't (I thought about doing so a long time
ago but it seems I never bothered :)  So if you are using QEMU for testing
and you don't want to exceed the kernel supported poison lengths, don't
inject poison that big.

Jonathan

> 
> > 
> > Otherwise, the poison list is really only useful for pre-populating
> > pages to offline after a reboot, i.e. to catch the kernel up with the
> > state of poison pages after a reboot.  
> 
> Got it.
> 
> 
> --
> Thanks,
> Ruan.




Re: [PATCH v10 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info

2024-04-04 Thread Jonathan Cameron via


> > > @@ -858,7 +910,8 @@ static int __init memory_tier_init(void)
> > >* For now we can have 4 faster memory tiers with smaller adistance
> > >* than default DRAM tier.
> > >*/
> > > - default_dram_type = alloc_memory_type(MEMTIER_ADISTANCE_DRAM);
> > > + default_dram_type = mt_find_alloc_memory_type(MEMTIER_ADISTANCE_DRAM,
> > > +                                               &default_memory_types);
> >
> > Unusual indenting.  Align with just after (
> >  
> 
> Aligning with "(" will exceed 100 columns. Would that be acceptable?
I think we are talking at cross purposes.

default_dram_type = mt_find_alloc_memory_type(MEMTIER_ADISTANCE_DRAM,
                                              &default_memory_types);

Is what I was suggesting.

> 
> > >   if (IS_ERR(default_dram_type))
> > >   panic("%s() failed to allocate default DRAM tier\n", 
> > > __func__);
> > >
> > > @@ -868,6 +921,14 @@ static int __init memory_tier_init(void)
> > >* types assigned.
> > >*/
> > >   for_each_node_state(node, N_MEMORY) {
> > > + if (!node_state(node, N_CPU))
> > > + /*
> > > +  * Defer memory tier initialization on CPUless numa 
> > > nodes.
> > > +  * These will be initialized after firmware and 
> > > devices are  
> >
> > I think this wraps at just over 80 chars.  Seems silly to wrap so tightly
> > and not quite fit under 80. (This is about 83 chars.)
> >  
> 
> I can fix this.
> I have a question. From my patch, this is <80 chars. However,
> in an email, this is >80 chars. Does that mean we need to
> count the number of chars in an email, not in a patch? Or if I
> missed something? like vim configuration or?

3 tabs + 1 space + the text from * (58)
= 24 + 1 + 58 = 83

Advantage of using claws email for kernel stuff is it has a nice per character
ruler at the top of the window.

I wonder if you have a different tab indent size?  The kernel uses 8
characters.  It might explain the few other odd indents if perhaps
you have it at 4 in your editor?

https://www.kernel.org/doc/html/v4.10/process/coding-style.html

Jonathan

> 
> > > +  * initialized.
> > > +  */
> > > + continue;
> > > +
> > >   memtier = set_node_memory_tier(node);
> > >   if (IS_ERR(memtier))
> > >   /*  
> >  
> 
> 




Re: [PATCH v10 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info

2024-04-03 Thread Jonathan Cameron via
A few minor comments inline.

> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> index a44c03c2ba3a..16769552a338 100644
> --- a/include/linux/memory-tiers.h
> +++ b/include/linux/memory-tiers.h
> @@ -140,12 +140,13 @@ static inline int mt_perf_to_adistance(struct 
> access_coordinate *perf, int *adis
>   return -EIO;
>  }
>  
> -struct memory_dev_type *mt_find_alloc_memory_type(int adist, struct list_head *memory_types)
> +static inline struct memory_dev_type *mt_find_alloc_memory_type(int adist,
> + struct list_head *memory_types)
>  {
>   return NULL;
>  }
>  
> -void mt_put_memory_types(struct list_head *memory_types)
> +static inline void mt_put_memory_types(struct list_head *memory_types)
>  {
Why in this patch and not the previous one?
>  
>  }
> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> index 974af10cfdd8..44fa10980d37 100644
> --- a/mm/memory-tiers.c
> +++ b/mm/memory-tiers.c
> @@ -36,6 +36,11 @@ struct node_memory_type_map {
>  
>  static DEFINE_MUTEX(memory_tier_lock);
>  static LIST_HEAD(memory_tiers);
> +/*
> + * The list is used to store all memory types that are not created
> + * by a device driver.
> + */
> +static LIST_HEAD(default_memory_types);
>  static struct node_memory_type_map node_memory_types[MAX_NUMNODES];
>  struct memory_dev_type *default_dram_type;
>  
> @@ -108,6 +113,8 @@ static struct demotion_nodes *node_demotion __read_mostly;
>  
>  static BLOCKING_NOTIFIER_HEAD(mt_adistance_algorithms);
>  
> +/* The lock is used to protect `default_dram_perf*` info and nid. */
> +static DEFINE_MUTEX(default_dram_perf_lock);
>  static bool default_dram_perf_error;
>  static struct access_coordinate default_dram_perf;
>  static int default_dram_perf_ref_nid = NUMA_NO_NODE;
> @@ -505,7 +512,8 @@ static inline void __init_node_memory_type(int node, 
> struct memory_dev_type *mem
>  static struct memory_tier *set_node_memory_tier(int node)
>  {
>   struct memory_tier *memtier;
> - struct memory_dev_type *memtype;
> + struct memory_dev_type *mtype = default_dram_type;

Does the rename add anything major to the patch?
If not I'd leave it alone to reduce the churn and give
a more readable patch.  If it is worth doing perhaps
a precursor patch?

> + int adist = MEMTIER_ADISTANCE_DRAM;
>   pg_data_t *pgdat = NODE_DATA(node);
>  
>  
> @@ -514,11 +522,20 @@ static struct memory_tier *set_node_memory_tier(int 
> node)
>   if (!node_state(node, N_MEMORY))
>   return ERR_PTR(-EINVAL);
>  
> - __init_node_memory_type(node, default_dram_type);
> + mt_calc_adistance(node, &adist);
> + if (node_memory_types[node].memtype == NULL) {
> + mtype = mt_find_alloc_memory_type(adist, &default_memory_types);
> + if (IS_ERR(mtype)) {
> + mtype = default_dram_type;
> + pr_info("Failed to allocate a memory type. Fall back.\n");
> + }
> + }
> +
> + __init_node_memory_type(node, mtype);
>  
> - memtype = node_memory_types[node].memtype;
> - node_set(node, memtype->nodes);
> - memtier = find_create_memory_tier(memtype);
> + mtype = node_memory_types[node].memtype;
> + node_set(node, mtype->nodes);
> + memtier = find_create_memory_tier(mtype);
>   if (!IS_ERR(memtier))
>   rcu_assign_pointer(pgdat->memtier, memtier);
>   return memtier;
> @@ -655,6 +672,33 @@ void mt_put_memory_types(struct list_head *memory_types)
>  }
>  EXPORT_SYMBOL_GPL(mt_put_memory_types);
>  
> +/*
> + * This is invoked via `late_initcall()` to initialize memory tiers for
> + * CPU-less memory nodes after driver initialization, which is
> + * expected to provide `adistance` algorithms.
> + */
> +static int __init memory_tier_late_init(void)
> +{
> + int nid;
> +
> + mutex_lock(&memory_tier_lock);
> + for_each_node_state(nid, N_MEMORY)
> + if (node_memory_types[nid].memtype == NULL)
> + /*
> +  * Some device drivers may have initialized memory tiers
> +  * between `memory_tier_init()` and 
> `memory_tier_late_init()`,
> +  * potentially bringing online memory nodes and
> +  * configuring memory tiers. Exclude them here.
> +  */

Does the comment refer to this path, or to ones where memtype is set?

> + set_node_memory_tier(nid);

Given the large comment I would add {} to help with readability.
You could flip the logic to reduce indent
for_each_node_state(nid, N_MEMORY) {
if (node_memory_types[nid].memtype)
continue;
/*
 * Some device drivers may have initialized memory tiers
 * between `memory_tier_init()` and `memory_tier_late_init()`,
 * potentially bringing online memory nodes and
 * configuring memory tiers. Exclude them 

Re: [PATCH v10 1/2] memory tier: dax/kmem: introduce an abstract layer for finding, allocating, and putting memory types

2024-04-03 Thread Jonathan Cameron via
On Tue,  2 Apr 2024 00:17:37 +
"Ho-Ren (Jack) Chuang"  wrote:

> Since different memory devices require finding, allocating, and putting
> memory types, these common steps are abstracted in this patch,
> enhancing the scalability and conciseness of the code.
> 
> Signed-off-by: Ho-Ren (Jack) Chuang 
> Reviewed-by: "Huang, Ying" 

Hi,

I know this is a late entry to the discussion but a few comments inline.
(sorry I didn't look earlier!)

All opportunities to improve code complexity and readability as a result
of your factoring out.

Jonathan


> ---
>  drivers/dax/kmem.c   | 20 ++--
>  include/linux/memory-tiers.h | 13 +
>  mm/memory-tiers.c| 32 
>  3 files changed, 47 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> index 42ee360cf4e3..01399e5b53b2 100644
> --- a/drivers/dax/kmem.c
> +++ b/drivers/dax/kmem.c
> @@ -55,21 +55,10 @@ static LIST_HEAD(kmem_memory_types);
>  
>  static struct memory_dev_type *kmem_find_alloc_memory_type(int adist)
>  {
> - bool found = false;
>   struct memory_dev_type *mtype;
>  
>   mutex_lock(&kmem_memory_type_lock);
could use

guard(mutex)(&kmem_memory_type_lock);
return mt_find_alloc_memory_type(adist, &kmem_memory_types);

I'm fine if you ignore this comment though as may be other functions in
here that could take advantage of the cleanup.h stuff in a future patch.

> - list_for_each_entry(mtype, &kmem_memory_types, list) {
> - if (mtype->adistance == adist) {
> - found = true;
> - break;
> - }
> - }
> - if (!found) {
> - mtype = alloc_memory_type(adist);
> - if (!IS_ERR(mtype))
> - list_add(&mtype->list, &kmem_memory_types);
> - }
> + mtype = mt_find_alloc_memory_type(adist, &kmem_memory_types);
>   mutex_unlock(&kmem_memory_type_lock);
>  
>   return mtype;
 
> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> index 69e781900082..a44c03c2ba3a 100644
> --- a/include/linux/memory-tiers.h
> +++ b/include/linux/memory-tiers.h
> @@ -48,6 +48,9 @@ int mt_calc_adistance(int node, int *adist);
>  int mt_set_default_dram_perf(int nid, struct access_coordinate *perf,
>const char *source);
>  int mt_perf_to_adistance(struct access_coordinate *perf, int *adist);
> +struct memory_dev_type *mt_find_alloc_memory_type(int adist,
> + struct list_head *memory_types);

That indent looks unusual.  Align the start of struct with start of int.

> +void mt_put_memory_types(struct list_head *memory_types);
>  #ifdef CONFIG_MIGRATION
>  int next_demotion_node(int node);
>  void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
> @@ -136,5 +139,15 @@ static inline int mt_perf_to_adistance(struct 
> access_coordinate *perf, int *adis
>  {
>   return -EIO;
>  }
> +
> +struct memory_dev_type *mt_find_alloc_memory_type(int adist, struct list_head *memory_types)
> +{
> + return NULL;
> +}
> +
> +void mt_put_memory_types(struct list_head *memory_types)
> +{
> +
No blank line needed here. 
> +}
>  #endif   /* CONFIG_NUMA */
>  #endif  /* _LINUX_MEMORY_TIERS_H */
> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> index 0537664620e5..974af10cfdd8 100644
> --- a/mm/memory-tiers.c
> +++ b/mm/memory-tiers.c
> @@ -623,6 +623,38 @@ void clear_node_memory_type(int node, struct 
> memory_dev_type *memtype)
>  }
>  EXPORT_SYMBOL_GPL(clear_node_memory_type);
>  
> +struct memory_dev_type *mt_find_alloc_memory_type(int adist, struct list_head *memory_types)

Breaking this out as a separate function provides opportunity to improve it.
Maybe a follow up patch makes sense given it would no longer be a straightforward
code move.  However in my view it would be simple enough to be obvious
even within this patch.

> +{
> + bool found = false;
> + struct memory_dev_type *mtype;
> +
> + list_for_each_entry(mtype, memory_types, list) {
> + if (mtype->adistance == adist) {
> + found = true;

Why not return here?
return mtype;

> + break;
> + }
> + }
> + if (!found) {

If returning above, no need for found variable - just do this unconditionally.
Plus I suggest you flip the logic for simpler-to-follow code flow.
It's more code but I think a bit easier to read as error handling is
out of the main simple flow.

mtype = alloc_memory_type(adist);
if (IS_ERR(mtype))
return mtype;

list_add(&mtype->list, memory_types);

return mtype;

> + mtype = alloc_memory_type(adist);
> + if (!IS_ERR(mtype))
> + list_add(&mtype->list, memory_types);
> + }
> +
> + return mtype;
> +}
> +EXPORT_SYMBOL_GPL(mt_find_alloc_memory_type);
> +
> +void mt_put_memory_types(struct 

[PATCH 6/6] bios-tables-test: Add data for complex numa test (GI, GP etc)

2024-04-03 Thread Jonathan Cameron via
[0F2h 0242 002h]                       Reserved : 
[0F4h 0244 004h]                         Length : 0078
[0F8h 0248 001h]          Flags (decoded below) : 00
                               Memory Hierarchy : 0
                      Use Minimum Transfer Size : 0
                       Non-sequential Transfers : 0
[0F9h 0249 001h]   Data Type : 03
[0FAh 0250 001h]   Minimum Transfer Size : 00
[0FBh 0251 001h]   Reserved1 : 00
[0FCh 0252 004h] Initiator Proximity Domains # : 0004
[100h 0256 004h]  Target Proximity Domains # : 0006
[104h 0260 004h]   Reserved2 : 
[108h 0264 008h] Entry Base Unit : 0004
[110h 0272 004h] Initiator Proximity Domain List : 
[114h 0276 004h] Initiator Proximity Domain List : 0001
[118h 0280 004h] Initiator Proximity Domain List : 0003
[11Ch 0284 004h] Initiator Proximity Domain List : 0005
[120h 0288 004h] Target Proximity Domain List : 
[124h 0292 004h] Target Proximity Domain List : 0001
[128h 0296 004h] Target Proximity Domain List : 0002
[12Ch 0300 004h] Target Proximity Domain List : 0003
[130h 0304 004h] Target Proximity Domain List : 0004
[134h 0308 004h] Target Proximity Domain List : 0005
[138h 0312 002h]   Entry : 00C8
[13Ah 0314 002h]   Entry : 
[13Ch 0316 002h]   Entry : 0032
[13Eh 0318 002h]   Entry : 
[140h 0320 002h]   Entry : 0032
[142h 0322 002h]   Entry : 0064
[144h 0324 002h]   Entry : 0019
[146h 0326 002h]   Entry : 
[148h 0328 002h]   Entry : 0064
[14Ah 0330 002h]   Entry : 
[14Ch 0332 002h]   Entry : 00C8
[14Eh 0334 002h]   Entry : 0019
[150h 0336 002h]   Entry : 0064
[152h 0338 002h]   Entry : 
[154h 0340 002h]   Entry : 0032
[156h 0342 002h]   Entry : 
[158h 0344 002h]   Entry : 0032
[15Ah 0346 002h]   Entry : 0064
[15Ch 0348 002h]   Entry : 0064
[15Eh 0350 002h]   Entry : 
[160h 0352 002h]   Entry : 0032
[162h 0354 002h]   Entry : 
[164h 0356 002h]   Entry : 0032
[166h 0358 002h]   Entry : 00C8

Note the zeros represent entries where the target node has no
memory.  These could be suppressed but it isn't 'wrong' to provide
them and it is (probably) permissible under ACPI to hotplug memory
into these nodes later.
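
As a cross-check against the command line in patch 5/6: this structure is
Data Type 03 (access bandwidth) with Entry Base Unit 4, so the entries are
multiples of 4 MB/s - 0x00C8 = 200 * 4 = 800 MB/s for initiator 0 to
target 0, 0x0032 = 50 * 4 = 200 MB/s, 0x0064 = 100 * 4 = 400 MB/s and
0x0019 = 25 * 4 = 100 MB/s, matching the hmat-lb bandwidth values given
there.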

Signed-off-by: Jonathan Cameron 
---
 tests/qtest/bios-tables-test-allowed-diff.h |   5 -
 tests/data/acpi/q35/APIC.acpihmat-generic-x | Bin 0 -> 136 bytes
 tests/data/acpi/q35/CEDT.acpihmat-generic-x | Bin 0 -> 68 bytes
 tests/data/acpi/q35/DSDT.acpihmat-generic-x | Bin 0 -> 10400 bytes
 tests/data/acpi/q35/HMAT.acpihmat-generic-x | Bin 0 -> 360 bytes
 tests/data/acpi/q35/SRAT.acpihmat-generic-x | Bin 0 -> 520 bytes
 6 files changed, 5 deletions(-)

diff --git a/tests/qtest/bios-tables-test-allowed-diff.h 
b/tests/qtest/bios-tables-test-allowed-diff.h
index a5aa801c99..dfb8523c8b 100644
--- a/tests/qtest/bios-tables-test-allowed-diff.h
+++ b/tests/qtest/bios-tables-test-allowed-diff.h
@@ -1,6 +1 @@
 /* List of comma-separated changed AML files to ignore */
-"tests/data/acpi/q35/APIC.acpihmat-generic-x",
-"tests/data/acpi/q35/CEDT.acpihmat-generic-x",
-"tests/data/acpi/q35/DSDT.acpihmat-generic-x",
-"tests/data/acpi/q35/HMAT.acpihmat-generic-x",
-"tests/data/acpi/q35/SRAT.acpihmat-generic-x",
diff --git a/tests/data/ac

[PATCH 5/6] bios-tables-test: Add complex SRAT / HMAT test for GI GP

2024-04-03 Thread Jonathan Cameron via
Add a test with 6 nodes to exercise the most interesting corner cases
of SRAT and HMAT generation, including the new Generic Initiator
and Generic Port Affinity structures.  More details of the
setup are in the following patch adding the table data.

Signed-off-by: Jonathan Cameron 
---
 tests/qtest/bios-tables-test.c | 92 ++
 1 file changed, 92 insertions(+)

diff --git a/tests/qtest/bios-tables-test.c b/tests/qtest/bios-tables-test.c
index d1ff4db7a2..1651d06b7b 100644
--- a/tests/qtest/bios-tables-test.c
+++ b/tests/qtest/bios-tables-test.c
@@ -1862,6 +1862,96 @@ static void test_acpi_q35_tcg_acpi_hmat_noinitiator(void)
 free_test_data();
 }
 
+/* Test intended to hit corner cases of SRAT and HMAT */
+static void test_acpi_q35_tcg_acpi_hmat_generic_x(void)
+{
+test_data data = {};
+
+data.machine = MACHINE_Q35;
+data.variant = ".acpihmat-generic-x";
+test_acpi_one(" -machine hmat=on,cxl=on"
+  " -smp 3,sockets=3"
+  " -m 128M,maxmem=384M,slots=2"
+  " -device virtio-rng-pci,id=gidev"
+  " -device pxb-cxl,bus_nr=64,bus=pcie.0,id=cxl.1"
+  " -object memory-backend-ram,size=64M,id=ram0"
+  " -object memory-backend-ram,size=64M,id=ram1"
+  " -numa node,nodeid=0,cpus=0,memdev=ram0"
+  " -numa node,nodeid=1"
+  " -object acpi-generic-initiator,id=gi0,pci-dev=gidev,node=1"
+  " -numa node,nodeid=2"
+  " -object acpi-generic-port,id=gp0,pci-bus=cxl.1,node=2"
+  " -numa node,nodeid=3,cpus=1"
+  " -numa node,nodeid=4,memdev=ram1"
+  " -numa node,nodeid=5,cpus=2"
+  " -numa hmat-lb,initiator=0,target=0,hierarchy=memory,"
+  "data-type=access-latency,latency=10"
+  " -numa hmat-lb,initiator=0,target=0,hierarchy=memory,"
+  "data-type=access-bandwidth,bandwidth=800M"
+  " -numa hmat-lb,initiator=0,target=2,hierarchy=memory,"
+  "data-type=access-latency,latency=100"
+  " -numa hmat-lb,initiator=0,target=2,hierarchy=memory,"
+  "data-type=access-bandwidth,bandwidth=200M"
+  " -numa hmat-lb,initiator=0,target=4,hierarchy=memory,"
+  "data-type=access-latency,latency=100"
+  " -numa hmat-lb,initiator=0,target=4,hierarchy=memory,"
+  "data-type=access-bandwidth,bandwidth=200M"
+  " -numa hmat-lb,initiator=0,target=5,hierarchy=memory,"
+  "data-type=access-latency,latency=200"
+  " -numa hmat-lb,initiator=0,target=5,hierarchy=memory,"
+  "data-type=access-bandwidth,bandwidth=400M"
+  " -numa hmat-lb,initiator=1,target=0,hierarchy=memory,"
+  "data-type=access-latency,latency=500"
+  " -numa hmat-lb,initiator=1,target=0,hierarchy=memory,"
+  "data-type=access-bandwidth,bandwidth=100M"
+  " -numa hmat-lb,initiator=1,target=2,hierarchy=memory,"
+  "data-type=access-latency,latency=50"
+  " -numa hmat-lb,initiator=1,target=2,hierarchy=memory,"
+  "data-type=access-bandwidth,bandwidth=400M"
+  " -numa hmat-lb,initiator=1,target=4,hierarchy=memory,"
+  "data-type=access-latency,latency=50"
+  " -numa hmat-lb,initiator=1,target=4,hierarchy=memory,"
+  "data-type=access-bandwidth,bandwidth=800M"
+  " -numa hmat-lb,initiator=1,target=5,hierarchy=memory,"
+  "data-type=access-latency,latency=500"
+  " -numa hmat-lb,initiator=1,target=5,hierarchy=memory,"
+  "data-type=access-bandwidth,bandwidth=100M"
+  " -numa hmat-lb,initiator=3,target=0,hierarchy=memory,"
+  "data-type=access-latency,latency=20"
+  " -numa hmat-lb,initiator=3,target=0,hierarchy=memory,"
+  "data-type=access-bandwidth,bandwidth=400M"
+  " -numa hmat-lb,initiator=3,target=2,hierarchy=memory,"
+  "data-type=access-latency,latency=80"
+  " -numa hmat-lb,initiator=3,target=2,hierarchy=memory,"
+  "data-type=access-bandwidth

[PATCH 4/6] bios-tables-test: Allow for new acpihmat-generic-x test data.

2024-04-03 Thread Jonathan Cameron via
The test to be added exercises many corners of the SRAT and HMAT
table generation.

Signed-off-by: Jonathan Cameron 
---
 tests/qtest/bios-tables-test-allowed-diff.h | 5 +
 tests/data/acpi/q35/APIC.acpihmat-generic-x | 0
 tests/data/acpi/q35/CEDT.acpihmat-generic-x | 0
 tests/data/acpi/q35/DSDT.acpihmat-generic-x | 0
 tests/data/acpi/q35/HMAT.acpihmat-generic-x | 0
 tests/data/acpi/q35/SRAT.acpihmat-generic-x | 0
 6 files changed, 5 insertions(+)

diff --git a/tests/qtest/bios-tables-test-allowed-diff.h 
b/tests/qtest/bios-tables-test-allowed-diff.h
index dfb8523c8b..a5aa801c99 100644
--- a/tests/qtest/bios-tables-test-allowed-diff.h
+++ b/tests/qtest/bios-tables-test-allowed-diff.h
@@ -1 +1,6 @@
 /* List of comma-separated changed AML files to ignore */
+"tests/data/acpi/q35/APIC.acpihmat-generic-x",
+"tests/data/acpi/q35/CEDT.acpihmat-generic-x",
+"tests/data/acpi/q35/DSDT.acpihmat-generic-x",
+"tests/data/acpi/q35/HMAT.acpihmat-generic-x",
+"tests/data/acpi/q35/SRAT.acpihmat-generic-x",
diff --git a/tests/data/acpi/q35/APIC.acpihmat-generic-x 
b/tests/data/acpi/q35/APIC.acpihmat-generic-x
new file mode 100644
index 00..e69de29bb2
diff --git a/tests/data/acpi/q35/CEDT.acpihmat-generic-x 
b/tests/data/acpi/q35/CEDT.acpihmat-generic-x
new file mode 100644
index 00..e69de29bb2
diff --git a/tests/data/acpi/q35/DSDT.acpihmat-generic-x 
b/tests/data/acpi/q35/DSDT.acpihmat-generic-x
new file mode 100644
index 00..e69de29bb2
diff --git a/tests/data/acpi/q35/HMAT.acpihmat-generic-x 
b/tests/data/acpi/q35/HMAT.acpihmat-generic-x
new file mode 100644
index 00..e69de29bb2
diff --git a/tests/data/acpi/q35/SRAT.acpihmat-generic-x 
b/tests/data/acpi/q35/SRAT.acpihmat-generic-x
new file mode 100644
index 00..e69de29bb2
-- 
2.39.2




[PATCH 3/6] hw/acpi: Generic Port Affinity Structure support

2024-04-03 Thread Jonathan Cameron via
These are very similar to the recently added Generic Initiators
but instead of representing an initiator of memory traffic they
represent an edge point beyond which may lie either targets or
initiators.  Here we add these ports such that they may
be targets of hmat_lb records to describe the latency and
bandwidth from host side initiators to the port.  A discoverable
mechanism such as UEFI CDAT, read from CXL devices and switches,
is used to discover the remainder of the path, and the OS can build
up full latency and bandwidth numbers as needed for work and data
placement decisions.

Signed-off-by: Jonathan Cameron 
---
 qapi/qom.json|  18 +++
 include/hw/acpi/acpi_generic_initiator.h |  18 ++-
 include/hw/pci/pci_bridge.h  |   1 +
 hw/acpi/acpi_generic_initiator.c | 141 +--
 hw/pci-bridge/pci_expander_bridge.c  |   1 -
 5 files changed, 141 insertions(+), 38 deletions(-)

diff --git a/qapi/qom.json b/qapi/qom.json
index 85e6b4f84a..5480d9ca24 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -826,6 +826,22 @@
   'data': { 'pci-dev': 'str',
 'node': 'uint32' } }
 
+
+##
+# @AcpiGenericPortProperties:
+#
+# Properties for acpi-generic-port objects.
+#
+# @pci-bus: PCI bus of the hostbridge associated with this SRAT entry
+#
+# @node: numa node associated with the PCI device
+#
+# Since: 9.1
+##
+{ 'struct': 'AcpiGenericPortProperties',
+  'data': { 'pci-bus': 'str',
+'node': 'uint32' } }
+
 ##
 # @RngProperties:
 #
@@ -944,6 +960,7 @@
 { 'enum': 'ObjectType',
   'data': [
 'acpi-generic-initiator',
+'acpi-generic-port',
 'authz-list',
 'authz-listfile',
 'authz-pam',
@@ -1016,6 +1033,7 @@
   'discriminator': 'qom-type',
   'data': {
   'acpi-generic-initiator': 'AcpiGenericInitiatorProperties',
+  'acpi-generic-port':  'AcpiGenericPortProperties',
   'authz-list': 'AuthZListProperties',
   'authz-listfile': 'AuthZListFileProperties',
   'authz-pam':  'AuthZPAMProperties',
diff --git a/include/hw/acpi/acpi_generic_initiator.h 
b/include/hw/acpi/acpi_generic_initiator.h
index 26e2bd92d4..49ac448034 100644
--- a/include/hw/acpi/acpi_generic_initiator.h
+++ b/include/hw/acpi/acpi_generic_initiator.h
@@ -30,6 +30,12 @@ typedef struct AcpiGenericInitiator {
 AcpiGenericNode parent;
 } AcpiGenericInitiator;
 
+#define TYPE_ACPI_GENERIC_PORT "acpi-generic-port"
+
+typedef struct AcpiGenericPort {
+AcpiGenericInitiator parent;
+} AcpiGenericPort;
+
 /*
  * ACPI 6.3:
  * Table 5-81 Flags – Generic Initiator Affinity Structure
@@ -49,8 +55,16 @@ typedef enum {
  * Table 5-80 Device Handle - PCI
  */
 typedef struct PCIDeviceHandle {
-uint16_t segment;
-uint16_t bdf;
+union {
+struct {
+uint16_t segment;
+uint16_t bdf;
+};
+struct {
+uint64_t hid;
+uint32_t uid;
+};
+};
 } PCIDeviceHandle;
 
 void build_srat_generic_pci_initiator(GArray *table_data);
diff --git a/include/hw/pci/pci_bridge.h b/include/hw/pci/pci_bridge.h
index 5cd452115a..5456e24883 100644
--- a/include/hw/pci/pci_bridge.h
+++ b/include/hw/pci/pci_bridge.h
@@ -102,6 +102,7 @@ typedef struct PXBPCIEDev {
 PXBDev parent_obj;
 } PXBPCIEDev;
 
+#define TYPE_PXB_CXL_BUS "pxb-cxl-bus"
 #define TYPE_PXB_DEV "pxb"
 OBJECT_DECLARE_SIMPLE_TYPE(PXBDev, PXB_DEV)
 
diff --git a/hw/acpi/acpi_generic_initiator.c b/hw/acpi/acpi_generic_initiator.c
index c054e0e27d..85191e90ab 100644
--- a/hw/acpi/acpi_generic_initiator.c
+++ b/hw/acpi/acpi_generic_initiator.c
@@ -7,6 +7,7 @@
 #include "hw/acpi/acpi_generic_initiator.h"
 #include "hw/acpi/aml-build.h"
 #include "hw/boards.h"
+#include "hw/pci/pci_bridge.h"
 #include "hw/pci/pci_device.h"
 #include "qemu/error-report.h"
 
@@ -18,6 +19,10 @@ typedef struct AcpiGenericInitiatorClass {
  AcpiGenericNodeClass parent_class;
 } AcpiGenericInitiatorClass;
 
+typedef struct AcpiGenericPortClass {
+AcpiGenericInitiatorClass parent;
+} AcpiGenericPortClass;
+
 OBJECT_DEFINE_ABSTRACT_TYPE(AcpiGenericNode, acpi_generic_node,
 ACPI_GENERIC_NODE, OBJECT)
 
@@ -30,6 +35,13 @@ OBJECT_DEFINE_TYPE_WITH_INTERFACES(AcpiGenericInitiator, 
acpi_generic_initiator,
 
 OBJECT_DECLARE_SIMPLE_TYPE(AcpiGenericInitiator, ACPI_GENERIC_INITIATOR)
 
+OBJECT_DEFINE_TYPE_WITH_INTERFACES(AcpiGenericPort, acpi_generic_port,
+   ACPI_GENERIC_PORT, ACPI_GENERIC_NODE,
+   { TYPE_USER_CREATABLE },
+   { NULL })
+
+OBJECT_DECLARE_SIMPLE_TYPE(AcpiGenericPort, ACPI_GENERIC_PORT)
+
 static void acpi_generic_node_init(Object *obj)
 {
 AcpiGenericNode *gn = ACPI_GENERIC_NODE(obj);
@@ -53,6 +65,14 @@ static void acpi_generic_initiator_finalize(Object *obj)
 {
 }
 
+static void acpi_generic_port_init

[PATCH 2/6] hw/acpi: Insert an acpi-generic-node base under acpi-generic-initiator

2024-04-03 Thread Jonathan Cameron via
This will simplify reuse when adding acpi-generic-port.
Note that some error_printf() messages will now print acpi-generic-node
whereas others will move to type specific cases in the next patch, so those
are left alone for now.

Signed-off-by: Jonathan Cameron 
---
 include/hw/acpi/acpi_generic_initiator.h | 15 -
 hw/acpi/acpi_generic_initiator.c | 78 +++-
 2 files changed, 62 insertions(+), 31 deletions(-)

diff --git a/include/hw/acpi/acpi_generic_initiator.h 
b/include/hw/acpi/acpi_generic_initiator.h
index a304bad73e..26e2bd92d4 100644
--- a/include/hw/acpi/acpi_generic_initiator.h
+++ b/include/hw/acpi/acpi_generic_initiator.h
@@ -8,15 +8,26 @@
 
 #include "qom/object_interfaces.h"
 
-#define TYPE_ACPI_GENERIC_INITIATOR "acpi-generic-initiator"
+/*
+ * Abstract type to be used as base for
+ * - acpi-generic-initiator
+ * - acpi-generic-port
+ */
+#define TYPE_ACPI_GENERIC_NODE "acpi-generic-node"
 
-typedef struct AcpiGenericInitiator {
+typedef struct AcpiGenericNode {
 /* private */
 Object parent;
 
 /* public */
 char *pci_dev;
 uint16_t node;
+} AcpiGenericNode;
+
+#define TYPE_ACPI_GENERIC_INITIATOR "acpi-generic-initiator"
+
+typedef struct AcpiGenericInitiator {
+AcpiGenericNode parent;
 } AcpiGenericInitiator;
 
 /*
diff --git a/hw/acpi/acpi_generic_initiator.c b/hw/acpi/acpi_generic_initiator.c
index 18a939b0e5..c054e0e27d 100644
--- a/hw/acpi/acpi_generic_initiator.c
+++ b/hw/acpi/acpi_generic_initiator.c
@@ -10,45 +10,61 @@
 #include "hw/pci/pci_device.h"
 #include "qemu/error-report.h"
 
-typedef struct AcpiGenericInitiatorClass {
+typedef struct AcpiGenericNodeClass {
 ObjectClass parent_class;
+} AcpiGenericNodeClass;
+
+typedef struct AcpiGenericInitiatorClass {
+ AcpiGenericNodeClass parent_class;
 } AcpiGenericInitiatorClass;
 
+OBJECT_DEFINE_ABSTRACT_TYPE(AcpiGenericNode, acpi_generic_node,
+ACPI_GENERIC_NODE, OBJECT)
+
+OBJECT_DECLARE_SIMPLE_TYPE(AcpiGenericNode, ACPI_GENERIC_NODE)
+
 OBJECT_DEFINE_TYPE_WITH_INTERFACES(AcpiGenericInitiator, 
acpi_generic_initiator,
-   ACPI_GENERIC_INITIATOR, OBJECT,
+   ACPI_GENERIC_INITIATOR, ACPI_GENERIC_NODE,
{ TYPE_USER_CREATABLE },
{ NULL })
 
 OBJECT_DECLARE_SIMPLE_TYPE(AcpiGenericInitiator, ACPI_GENERIC_INITIATOR)
 
+static void acpi_generic_node_init(Object *obj)
+{
+AcpiGenericNode *gn = ACPI_GENERIC_NODE(obj);
+
+gn->node = MAX_NODES;
+gn->pci_dev = NULL;
+}
+
 static void acpi_generic_initiator_init(Object *obj)
 {
-AcpiGenericInitiator *gi = ACPI_GENERIC_INITIATOR(obj);
+}
+
+static void acpi_generic_node_finalize(Object *obj)
+{
+AcpiGenericNode *gn = ACPI_GENERIC_NODE(obj);
 
-gi->node = MAX_NODES;
-gi->pci_dev = NULL;
+g_free(gn->pci_dev);
 }
 
 static void acpi_generic_initiator_finalize(Object *obj)
 {
-AcpiGenericInitiator *gi = ACPI_GENERIC_INITIATOR(obj);
-
-g_free(gi->pci_dev);
 }
 
-static void acpi_generic_initiator_set_pci_device(Object *obj, const char *val,
-  Error **errp)
+static void acpi_generic_node_set_pci_device(Object *obj, const char *val,
+ Error **errp)
 {
-AcpiGenericInitiator *gi = ACPI_GENERIC_INITIATOR(obj);
+AcpiGenericNode *gn = ACPI_GENERIC_NODE(obj);
 
-gi->pci_dev = g_strdup(val);
+gn->pci_dev = g_strdup(val);
 }
-
-static void acpi_generic_initiator_set_node(Object *obj, Visitor *v,
-const char *name, void *opaque,
-Error **errp)
+static void acpi_generic_node_set_node(Object *obj, Visitor *v,
+   const char *name, void *opaque,
+   Error **errp)
 {
-AcpiGenericInitiator *gi = ACPI_GENERIC_INITIATOR(obj);
+AcpiGenericNode *gn = ACPI_GENERIC_NODE(obj);
 MachineState *ms = MACHINE(qdev_get_machine());
 uint32_t value;
 
@@ -58,20 +74,24 @@ static void acpi_generic_initiator_set_node(Object *obj, 
Visitor *v,
 
 if (value >= MAX_NODES) {
 error_printf("%s: Invalid NUMA node specified\n",
- TYPE_ACPI_GENERIC_INITIATOR);
+ TYPE_ACPI_GENERIC_NODE);
 exit(1);
 }
 
-gi->node = value;
-ms->numa_state->nodes[gi->node].has_gi = true;
+gn->node = value;
+ms->numa_state->nodes[gn->node].has_gi = true;
 }
 
-static void acpi_generic_initiator_class_init(ObjectClass *oc, void *data)
+static void acpi_generic_node_class_init(ObjectClass *oc, void *data)
 {
 object_class_property_add_str(oc, "pci-dev", NULL,
-acpi_generic_initiator_set_pci_device);
+acpi_generic_node_set_pci_device);
 object_class_property_

[PATCH 1/6] hw/acpi/GI: Fix trivial parameter alignment issue.

2024-04-03 Thread Jonathan Cameron via
Before making additional modifications, tidy up this misleading indentation.

Signed-off-by: Jonathan Cameron 
---
 hw/acpi/acpi_generic_initiator.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/acpi/acpi_generic_initiator.c b/hw/acpi/acpi_generic_initiator.c
index 17b9a052f5..18a939b0e5 100644
--- a/hw/acpi/acpi_generic_initiator.c
+++ b/hw/acpi/acpi_generic_initiator.c
@@ -132,7 +132,7 @@ static int build_all_acpi_generic_initiators(Object *obj, 
void *opaque)
 
 dev_handle.segment = 0;
 dev_handle.bdf = PCI_BUILD_BDF(pci_bus_num(pci_get_bus(pci_dev)),
-   pci_dev->devfn);
+   pci_dev->devfn);
 
 build_srat_generic_pci_initiator_affinity(table_data,
  gi->node, &dev_handle);
-- 
2.39.2




[PATCH 0/6 qemu] acpi: NUMA nodes for CXL HB as GP + complex NUMA test.

2024-04-03 Thread Jonathan Cameron via
ACPI 6.5 introduced Generic Port Affinity Structures to close a system
description gap that was a problem for CXL memory systems.
It defines a new SRAT Affinity structure (and hence allows creation of an
ACPI Proximity Node which can only be defined via an SRAT structure)
for the boundary between a discoverable fabric and non-discoverable
system interconnects etc.

The HMAT data on latency and bandwidth is combined with discoverable
information from the CXL bus (link speeds, lane counts) and CXL devices
(switch port to port characteristics and USP to memory, via CDAT tables
read from the device).  QEMU has supported the rest of the elements
of this chain for a while but now the kernel has caught up and we need
the missing element of Generic Ports (this code has been used extensively
in testing and debugging that kernel support, some resulting fixes
currently under review).

Generic Port Affinity Structures are very similar to the recently
added Generic Initiator Affinity Structures (GI), so this series
factors out and reuses much of that infrastructure.
There are subtle differences (beyond the obvious structure ID change).

- The ACPI spec example (and linux kernel support) has a Generic
  Port not as associated with the CXL root port, but rather with
  the CXL Host bridge. As a result, an ACPI handle is used (rather
  than the PCI SBDF option for GIs). In QEMU the easiest way
  to get to this is to target the root bridge PCI Bus, and
  conveniently the root bridge bus number is used for the UID allowing
  us to construct an appropriate entry.
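
  For illustration only (not a claim about the exact code in patch 3/6),
  the handle construction amounts to something like:

  PCIDeviceHandle dev_handle = {};

  /* "ACPI0016" is the CXL host bridge _HID; bus number doubles as _UID */
  memcpy(&dev_handle.hid, "ACPI0016", sizeof(dev_handle.hid));
  dev_handle.uid = pci_bus_num(bus);  /* bus: the pxb-cxl root bus */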

A key addition of this series is a complex NUMA topology example that
stretches the QEMU emulation code for GI, GP and nodes with just
CPUS, just memory, just hot pluggable memory, mixture of memory and CPUs.

A similar test showed up a few NUMA related bugs with fixes applied for
9.0 (note that one of these needs linux booted to identify that it
rejects the HMAT table and this test is a regression test for the
table generation only).

https://lore.kernel.org/qemu-devel/2eb6672cfdaea7dacd8e9bb0523887f13b9f85ce.1710282274.git@redhat.com/
https://lore.kernel.org/qemu-devel/74e2845c5f95b0c139c79233ddb65bb17f2dd679.1710282274.git@redhat.com/

Jonathan Cameron (6):
  hw/acpi/GI: Fix trivial parameter alignment issue.
  hw/acpi: Insert an acpi-generic-node base under acpi-generic-initiator
  hw/acpi: Generic Port Affinity Structure support
  bios-tables-test: Allow for new acpihmat-generic-x test data.
  bios-tables-test: Add complex SRAT / HMAT test for GI GP
  bios-tables-test: Add data for complex numa test (GI, GP etc)

 qapi/qom.json   |  18 ++
 include/hw/acpi/acpi_generic_initiator.h|  33 +++-
 include/hw/pci/pci_bridge.h |   1 +
 hw/acpi/acpi_generic_initiator.c| 199 ++--
 hw/pci-bridge/pci_expander_bridge.c |   1 -
 tests/qtest/bios-tables-test.c  |  92 +
 tests/data/acpi/q35/APIC.acpihmat-generic-x | Bin 0 -> 136 bytes
 tests/data/acpi/q35/CEDT.acpihmat-generic-x | Bin 0 -> 68 bytes
 tests/data/acpi/q35/DSDT.acpihmat-generic-x | Bin 0 -> 10400 bytes
 tests/data/acpi/q35/HMAT.acpihmat-generic-x | Bin 0 -> 360 bytes
 tests/data/acpi/q35/SRAT.acpihmat-generic-x | Bin 0 -> 520 bytes
 11 files changed, 285 insertions(+), 59 deletions(-)
 create mode 100644 tests/data/acpi/q35/APIC.acpihmat-generic-x
 create mode 100644 tests/data/acpi/q35/CEDT.acpihmat-generic-x
 create mode 100644 tests/data/acpi/q35/DSDT.acpihmat-generic-x
 create mode 100644 tests/data/acpi/q35/HMAT.acpihmat-generic-x
 create mode 100644 tests/data/acpi/q35/SRAT.acpihmat-generic-x

-- 
2.39.2




Re: [PATCH 2/2] CXL/cxl_type3: reset DVSEC CXL Control in ct3d_reset

2024-04-02 Thread Jonathan Cameron via
On Tue,  2 Apr 2024 09:46:47 +0800
Li Zhijian  wrote:

> After the kernel commit
> 0cab68720598 ("cxl/pci: Fix disabling memory if DVSEC CXL Range does not 
> match a CFMWS window")

Fixes tag seems appropriate.

> CXL type3 devices cannot be enabled again after the reboot because this
> flag was not reset.
> 
> This flag could be changed by the firmware or OS, so let it have a
> reset (default) value on reboot so that the OS can read its clean status.

Good find.  I think we should aim for a fix that is less fragile to future
code rearrangement etc.

> 
> Signed-off-by: Li Zhijian 
> ---
>  hw/mem/cxl_type3.c | 14 +-
>  1 file changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/mem/cxl_type3.c b/hw/mem/cxl_type3.c
> index ad2fe7d463fb..3fe136053390 100644
> --- a/hw/mem/cxl_type3.c
> +++ b/hw/mem/cxl_type3.c
> @@ -305,7 +305,8 @@ static void build_dvsecs(CXLType3Dev *ct3d)
>  
>  dvsec = (uint8_t *)&(CXLDVSECDevice){
>  .cap = 0x1e,
> -.ctrl = 0x2,
> +#define CT3D_DEVSEC_CXL_CTRL 0x2
> +.ctrl = CT3D_DEVSEC_CXL_CTRL,
Naming doesn't make it clear the define is a reset value / default value.
>  .status2 = 0x2,
>  .range1_size_hi = range1_size_hi,
>  .range1_size_lo = range1_size_lo,
> @@ -906,6 +907,16 @@ MemTxResult cxl_type3_write(PCIDevice *d, hwaddr 
> host_addr, uint64_t data,
> >  return address_space_write(as, dpa_offset, attrs, &data, size);
>  }
>  
> +/* Reset DVSEC CXL Control */
> +static void ct3d_dvsec_cxl_ctrl_reset(CXLType3Dev *ct3d)
> +{
> +uint16_t offset = first_dvsec_offset(ct3d);

This relies too much on the current memory layout.  We should be doing a search
of config space to find the right entry, or we should cache a pointer to
the relevant structure when we fill it in the first time.
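
Something like the following (untested sketch, walking the extended
capability list with the generic accessors rather than assuming the layout)
would be less fragile:

static uint16_t find_cxl_device_dvsec(PCIDevice *pdev)
{
    uint16_t offset = PCI_CONFIG_SPACE_SIZE;

    while (offset) {
        uint32_t header = pci_get_long(pdev->config + offset);

        if (PCI_EXT_CAP_ID(header) == PCI_EXT_CAP_ID_DVSEC) {
            uint32_t hdr1 = pci_get_long(pdev->config + offset +
                                         PCIE_DVSEC_HEADER1_OFFSET);
            uint16_t id = pci_get_word(pdev->config + offset +
                                       PCIE_DVSEC_ID_OFFSET);

            /* DVSEC header1 low 16 bits hold the vendor ID */
            if ((hdr1 & 0xffff) == CXL_VENDOR_ID &&
                id == PCIE_CXL_DEVICE_DVSEC) {
                return offset;
            }
        }
        offset = PCI_EXT_CAP_NEXT(header);
    }

    return 0;
}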

> +CXLDVSECDevice *dvsec;
> +
> +dvsec = (CXLDVSECDevice *)(ct3d->cxl_cstate.pdev->config + offset);
> +dvsec->ctrl = CT3D_DEVSEC_CXL_CTRL;
> +}
> +
>  static void ct3d_reset(DeviceState *dev)
>  {
>  CXLType3Dev *ct3d = CXL_TYPE3(dev);
> @@ -914,6 +925,7 @@ static void ct3d_reset(DeviceState *dev)
>  
>  cxl_component_register_init_common(reg_state, write_msk, 
> CXL2_TYPE3_DEVICE);
>  cxl_device_register_init_t3(ct3d);
> +ct3d_dvsec_cxl_ctrl_reset(ct3d);
>  
>  /*
>   * Bring up an endpoint to target with MCTP over VDM.




Re: [PATCH 1/2] CXL/cxl_type3: add first_dvsec_offset() helper

2024-04-02 Thread Jonathan Cameron via
On Tue,  2 Apr 2024 09:46:46 +0800
Li Zhijian  wrote:

> It helps to figure out where the first dvsec register is located. In
> addition, replace hardcoded offset and size values with existing macros.
> 
> Signed-off-by: Li Zhijian 

I agree we should be using the macros.

The offset calc is a bit specific to the chosen memory layout,
so I'm not sure it makes sense to break it out to a separate function.

I'll suggest alternative possible approaches in review of next patch.

Jonathan

> ---
>  hw/mem/cxl_type3.c | 19 +--
>  1 file changed, 13 insertions(+), 6 deletions(-)
> 
> diff --git a/hw/mem/cxl_type3.c b/hw/mem/cxl_type3.c
> index b0a7e9f11b64..ad2fe7d463fb 100644
> --- a/hw/mem/cxl_type3.c
> +++ b/hw/mem/cxl_type3.c
> @@ -643,6 +643,16 @@ static DOEProtocol doe_cdat_prot[] = {
>  { }
>  };
>  
> +static uint16_t first_dvsec_offset(CXLType3Dev *ct3d)
> +{
> +uint16_t offset = PCI_CONFIG_SPACE_SIZE;
> +
> +if (ct3d->sn != UI64_NULL)
> +offset += PCI_EXT_CAP_DSN_SIZEOF;
> +
> +return offset;
> +}
> +
>  static void ct3_realize(PCIDevice *pci_dev, Error **errp)
>  {
>  ERRP_GUARD();
> @@ -663,13 +673,10 @@ static void ct3_realize(PCIDevice *pci_dev, Error 
> **errp)
>  pci_config_set_prog_interface(pci_conf, 0x10);
>  
>  pcie_endpoint_cap_init(pci_dev, 0x80);
> -if (ct3d->sn != UI64_NULL) {
> -pcie_dev_ser_num_init(pci_dev, 0x100, ct3d->sn);
> -cxl_cstate->dvsec_offset = 0x100 + 0x0c;
> -} else {
> -cxl_cstate->dvsec_offset = 0x100;
> -}
> +if (ct3d->sn != UI64_NULL)
> +pcie_dev_ser_num_init(pci_dev, PCI_CONFIG_SPACE_SIZE, ct3d->sn);
>  
> +cxl_cstate->dvsec_offset = first_dvsec_offset(ct3d);
>  ct3d->cxl_cstate.pdev = pci_dev;
>  build_dvsecs(ct3d);
>  




Re: [PATCH v2 3/4] hw/cxl/mbox: replace sanitize_running() with cxl_dev_media_disabled()

2024-04-01 Thread Jonathan Cameron via
On Sun, 21 Jan 2024 21:50:00 -0500
Hyeonggon Yoo <42.hye...@gmail.com> wrote:

> On Tue, Jan 9, 2024 at 12:54 PM Jonathan Cameron
>  wrote:
> >
> > On Fri, 22 Dec 2023 18:00:50 +0900
> > Hyeonggon Yoo <42.hye...@gmail.com> wrote:
> >  
> > > The spec states that reads/writes should have no effect and a part of
> > > commands should be ignored when the media is disabled, not when the
> > > sanitize command is running.
> > >
> > > Introduce cxl_dev_media_disabled() to check if the media is disabled and
> > > replace sanitize_running() with it.
> > >
> > > Make sure that the media has been correctly disabled during sanitation
> > > by adding an assert to __toggle_media(). Now, enabling when already
> > > enabled or vice versa results in an assert() failure.
> > >
> > > Suggested-by: Davidlohr Bueso 
> > > Signed-off-by: Hyeonggon Yoo <42.hye...@gmail.com>  
> >
> > This applies to
> >
> > hw/cxl: Add get scan media capabilities cmd support.
> >
> > Should I just squash it with that patch in my tree?
> > For now I'm holding it immediately on top of that, but I'm not keen to
> > send messy code upstream unless there is a good reason to retain the
> > history.  
> 
> Oh, while the diff looks like the patch touches scan_media_running(), it's 
> not.
> 
> The proper Fixes: tag will be:
> Fixes: d77176724422 ("hw/cxl: Add support for device sanitation")
> 
> > If you are doing this sort of fix series in future, please call out
> > what they fix explicitly.  Can't use fixes tags as the commit ids
> > are unstable, but can mention the patch to make my life easier!  
> 
> Okay, next time I will either add the Fixes tag or add a comment on
> what it fixes.
> 
> By the way I guess your latest, public branch is still cxl-2023-11-02, right?
> https://gitlab.com/jic23/qemu/-/tree/cxl-2023-11-02
> 
> I assume you adjusted my v2 series, but please let me know if you prefer
> sending v3 against your latest tree.
> 
> Thanks,
> Hyeonggon

Side note, in its current form this breaks the switch-cci support in upstream
QEMU.  I've finally gotten back to getting ready to look at MMPT support and
ran into a crash as a result.  Needs protection with a checked
object_dynamic_cast() to make sure we have a type 3 device.  I'll update the
patch in my tree.
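
i.e. something along these lines ahead of the CXLType3Dev use (sketch only,
the surrounding variable names will differ):

    Object *obj = object_dynamic_cast(OBJECT(dev), TYPE_CXL_TYPE3);

    if (!obj) {
        return; /* e.g. a switch CCI rather than a type 3 device */
    }
    ct3d = CXL_TYPE3(obj);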

Thanks,

Jonathan

> 




Re: [RFC PATCH-for-9.1 08/29] hw/i386/pc: Move CXLState to PcPciMachineState

2024-04-01 Thread Jonathan Cameron via
On Thu, 28 Mar 2024 16:54:16 +0100
Philippe Mathieu-Daudé  wrote:

> CXL depends on PCIe, which isn't available on non-PCI
> machines such the ISA-only PC one.
> Move CXLState to PcPciMachineState, and move the CXL
> specific calls to pc_pci_machine_initfn() and
> pc_pci_machine_done().
> 
> Signed-off-by: Philippe Mathieu-Daudé 

LGTM as a change on its own - I've not reviewed the series
in general though, hence just an ack as an rb feels too strong.

Acked-by: Jonathan Cameron 




Re: [PATCH-for-9.0] hw/i386/pc: Restrict CXL to PCI-based machines

2024-04-01 Thread Jonathan Cameron via
On Wed, 27 Mar 2024 17:16:42 +0100
Philippe Mathieu-Daudé  wrote:

> CXL is based on PCIe. It is pointless to initialize
> its context on non-PCI machines.
> 
> Signed-off-by: Philippe Mathieu-Daudé 
Seems a reasonable restriction.

Acked-by: Jonathan Cameron 

Jonathan

> ---
>  hw/i386/pc.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> index e80f02bef4..5c21b0c4db 100644
> --- a/hw/i386/pc.c
> +++ b/hw/i386/pc.c
> @@ -1738,7 +1738,9 @@ static void pc_machine_initfn(Object *obj)
>  pcms->pcspk = isa_new(TYPE_PC_SPEAKER);
>  object_property_add_alias(OBJECT(pcms), "pcspk-audiodev",
>OBJECT(pcms->pcspk), "audiodev");
> -cxl_machine_init(obj, &pcms->cxl_devices_state);
> +if (pcmc->pci_enabled) {
> +cxl_machine_init(obj, >cxl_devices_state);
> +}
>  
>  pcms->machine_done.notify = pc_machine_done;
>  qemu_add_machine_init_done_notifier(>machine_done);




Re: [PATCH] mem/cxl_type3: fix hpa to dpa logic

2024-04-01 Thread Jonathan Cameron via
On Thu, 28 Mar 2024 06:24:24 +0000
"Xingtao Yao (Fujitsu)"  wrote:

> Jonathan
> 
> thanks for your reply!
> 
> > -Original Message-
> > From: Jonathan Cameron 
> > Sent: Wednesday, March 27, 2024 9:28 PM
> > To: Yao, Xingtao/姚 幸涛 
> > Cc: fan...@samsung.com; qemu-devel@nongnu.org; Cao, Quanquan/曹 全全
> > 
> > Subject: Re: [PATCH] mem/cxl_type3: fix hpa to dpa logic
> > 
> > On Tue, 26 Mar 2024 21:46:53 -0400
> > Yao Xingtao  wrote:
> >   
> > > In 3, 6, 12 interleave ways, we could not access cxl memory properly,
> > > and when the process is running on it, a 'segmentation fault' error will
> > > occur.
> > >
> > > According to the CXL specification '8.2.4.20.13 Decoder Protection',
> > > there are two branches to convert HPA to DPA:
> > > b1: Decoder[m].IW < 8 (for 1, 2, 4, 8, 16 interleave ways)
> > > b2: Decoder[m].IW >= 8 (for 3, 6, 12 interleave ways)
> > >
> > > but only b1 has been implemented.
> > >
> > > To solve this issue, we should implement b2:
> > >   DPAOffset[51:IG+8]=HPAOffset[51:IG+IW] / 3
> > >   DPAOffset[IG+7:0]=HPAOffset[IG+7:0]
> > >   DPA=DPAOffset + Decoder[n].DPABase
> > >
> > > Links:  
> > https://lore.kernel.org/linux-cxl/3e84b919-7631-d1db-3e1d-33000f3f3868@fujits
> > u.com/  
> > > Signed-off-by: Yao Xingtao   
> > 
> > Not implementing this was intentional (shouldn't seg fault obviously) but
> > I thought we were not advertising EP support for 3, 6, 12?  The HDM Decoder
> > configuration checking is currently terrible so we don't prevent
> > the bits being set (adding device side sanity checks for those decoders
> > has been on the todo list for a long time).  There are a lot of ways of
> > programming those that will blow up.
> > 
> > Can you confirm that the emulation reports they are supported.
> > https://elixir.bootlin.com/qemu/v9.0.0-rc1/source/hw/cxl/cxl-component-utils.c
> > #L246
> > implies it shouldn't and so any software using them is broken.  
> yes, the feature is not supported by QEMU, but I can still create a 
> 6-interleave-way region at the kernel layer.
> 
> I checked the source code of kernel, and found that the kernel did not check 
> this bit when committing decoder.
> we may add some check on kernel side.

ouch.  We definitely want that check!  The decoder commit will fail
anyway (though QEMU doesn't fail it yet because we don't do all the sanity
checks we should). However, failing on commit is nasty as the reason should
have been detected earlier.

> 
> > 
> > The non power of 2 decodes always made me nervous as the maths is more
> > complex and any changes to that decode will need careful checking.
> > For the power of 2 cases it was a bunch of writes to edge conditions etc
> > and checking the right data landed in the backing stores.  
> after applying this modification, I tested some command by using these 
> memory, like 'ls', 'top'..
> and they can be executed normally, maybe there are some other problems I 
> haven't met yet.

I usually run a bunch of manual tests with devmem2 to ensure the edge cases
are handled correctly, but I've not really seen any errors that didn't also
show up in running stressors (e.g. stress-ng) or just memhog on the memory.

Jonathan

> 
> > 
> > Joanthan
> > 
> >   
> > > ---
> > >  hw/mem/cxl_type3.c | 15 +++
> > >  1 file changed, 11 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/hw/mem/cxl_type3.c b/hw/mem/cxl_type3.c
> > > index b0a7e9f11b..2c1218fb12 100644
> > > --- a/hw/mem/cxl_type3.c
> > > +++ b/hw/mem/cxl_type3.c
> > > @@ -805,10 +805,17 @@ static bool cxl_type3_dpa(CXLType3Dev *ct3d, hwaddr 
> > >  
> > host_addr, uint64_t *dpa)  
> > >  continue;
> > >  }
> > >
> > > -*dpa = dpa_base +
> > > -((MAKE_64BIT_MASK(0, 8 + ig) & hpa_offset) |
> > > - ((MAKE_64BIT_MASK(8 + ig + iw, 64 - 8 - ig - iw) & 
> > > hpa_offset)  
> > > -  >> iw));  
> > > +if (iw < 8) {
> > > +*dpa = dpa_base +
> > > +((MAKE_64BIT_MASK(0, 8 + ig) & hpa_offset) |
> > > + ((MAKE_64BIT_MASK(8 + ig + iw, 64 - 8 - ig - iw) &  
> > hpa_offset)  
> > > +  >> iw));
> > > +} else {
> > > +*dpa = dpa_base +
> > > +((MAKE_64BIT_MASK(0, 8 + ig) & hpa_offset) |
> > > + ((((MAKE_64BIT_MASK(ig + iw, 64 - ig - iw) & hpa_offset)
> > > +  >> (ig + iw)) / 3) << (ig + 8)));
> > > +}
> > >
> > >  return true;
> > >  }  
> 




Re: [PATCH] mem/cxl_type3: fix hpa to dpa logic

2024-03-27 Thread Jonathan Cameron via
On Tue, 26 Mar 2024 21:46:53 -0400
Yao Xingtao  wrote:

> In 3, 6, 12 interleave ways, we could not access cxl memory properly,
> and when the process is running on it, a 'segmentation fault' error will
> occur.
> 
> According to the CXL specification '8.2.4.20.13 Decoder Protection',
> there are two branches to convert HPA to DPA:
> b1: Decoder[m].IW < 8 (for 1, 2, 4, 8, 16 interleave ways)
> b2: Decoder[m].IW >= 8 (for 3, 6, 12 interleave ways)
> 
> but only b1 has been implemented.
> 
> To solve this issue, we should implement b2:
>   DPAOffset[51:IG+8]=HPAOffset[51:IG+IW] / 3
>   DPAOffset[IG+7:0]=HPAOffset[IG+7:0]
>   DPA=DPAOffset + Decoder[n].DPABase
> 
> Links: 
> https://lore.kernel.org/linux-cxl/3e84b919-7631-d1db-3e1d-33000f3f3...@fujitsu.com/
> Signed-off-by: Yao Xingtao 

Not implementing this was intentional (shouldn't seg fault obviously) but
I thought we were not advertising EP support for 3, 6, 12?  The HDM Decoder
configuration checking is currently terrible so we don't prevent
the bits being set (adding device side sanity checks for those decoders
has been on the todo list for a long time).  There are a lot of ways of
programming those that will blow up.

Can you confirm that the emulation reports they are supported.
https://elixir.bootlin.com/qemu/v9.0.0-rc1/source/hw/cxl/cxl-component-utils.c#L246
implies it shouldn't and so any software using them is broken.

The non power of 2 decodes always made me nervous as the maths is more
complex and any changes to that decode will need careful checking.
For the power of 2 cases it was a bunch of writes to edge conditions etc
and checking the right data landed in the backing stores.
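
For anyone else checking the maths, a worked example of the 3-way branch
(values picked for illustration; iw is the encoded field value, 8 for 3-way):

    /* ig = 0 (256 byte granule), iw = 8 (3-way), hpa_offset = 0x300 */
    dpa_offset = (MAKE_64BIT_MASK(0, 8 + ig) & hpa_offset) |
                 ((((MAKE_64BIT_MASK(ig + iw, 64 - ig - iw) & hpa_offset)
                    >> (ig + iw)) / 3) << (ig + 8));
    /*
     * Granule 3 of the HPA window: 0x300 >> 8 = 3, then 3 / 3 = 1,
     * then 1 << 8 = 0x100.  Low bits pass through, so dpa_offset is
     * 0x100: granule 1 on the device selected by 3 % 3 = 0.
     */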

Jonathan


> ---
>  hw/mem/cxl_type3.c | 15 +++
>  1 file changed, 11 insertions(+), 4 deletions(-)
> 
> diff --git a/hw/mem/cxl_type3.c b/hw/mem/cxl_type3.c
> index b0a7e9f11b..2c1218fb12 100644
> --- a/hw/mem/cxl_type3.c
> +++ b/hw/mem/cxl_type3.c
> @@ -805,10 +805,17 @@ static bool cxl_type3_dpa(CXLType3Dev *ct3d, hwaddr 
> host_addr, uint64_t *dpa)
>  continue;
>  }
>  
> -*dpa = dpa_base +
> -((MAKE_64BIT_MASK(0, 8 + ig) & hpa_offset) |
> - ((MAKE_64BIT_MASK(8 + ig + iw, 64 - 8 - ig - iw) & hpa_offset)
> -  >> iw));  
> +if (iw < 8) {
> +*dpa = dpa_base +
> +((MAKE_64BIT_MASK(0, 8 + ig) & hpa_offset) |
> + ((MAKE_64BIT_MASK(8 + ig + iw, 64 - 8 - ig - iw) & 
> hpa_offset)
> +  >> iw));
> +} else {
> +*dpa = dpa_base +
> +((MAKE_64BIT_MASK(0, 8 + ig) & hpa_offset) |
> + MAKE_64BIT_MASK(ig + iw, 64 - ig - iw) & hpa_offset)
> +   >> (ig + iw)) / 3) << (ig + 8)));
> +}
>  
>  return true;
>  }




Re: [PATCH v2 1/1] cxl/mem: Fix for the index of Clear Event Record Handle

2024-03-18 Thread Jonathan Cameron via
On Mon, 18 Mar 2024 10:29:28 +0800
Yuquan Wang  wrote:

> The dev_dbg info for Clear Event Records mailbox command would report
> the handle of the next record to clear not the current one.
> 
> This was because the index 'i' had incremented before printing the
> current handle value.
> 
> Signed-off-by: Yuquan Wang 
> ---
>  drivers/cxl/core/mbox.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 9adda4795eb7..b810a6aa3010 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -915,7 +915,7 @@ static int cxl_clear_event_record(struct cxl_memdev_state 
> *mds,
>  
>   payload->handles[i++] = gen->hdr.handle;
>   dev_dbg(mds->cxlds.dev, "Event log '%d': Clearing %u\n", log,
> - le16_to_cpu(payload->handles[i]));
> + le16_to_cpu(payload->handles[i-1]));
Trivial but needs spaces around the -. e.g.  [i - 1] 

Maybe Dan can fix up whilst applying.
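
An alternative that avoids the index juggling entirely is to log before
the increment, e.g. (untested):

	dev_dbg(mds->cxlds.dev, "Event log '%d': Clearing %u\n", log,
		le16_to_cpu(gen->hdr.handle));
	payload->handles[i++] = gen->hdr.handle;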

Otherwise

Reviewed-by: Jonathan Cameron 

>  
>   if (i == max_handles) {
>   payload->nr_recs = i;




Re: [PATCH v2 2/2] hmat acpi: Fix out of bounds access due to missing use of indirection

2024-03-15 Thread Jonathan Cameron via
On Wed, 13 Mar 2024 21:24:06 +0300
Michael Tokarev  wrote:

> 07.03.2024 19:03, Jonathan Cameron via wrote:
> > With a numa set up such as
> > 
> > -numa nodeid=0,cpus=0 \
> > -numa nodeid=1,memdev=mem \
> > -numa nodeid=2,cpus=1
> > 
> > and appropriate hmat_lb entries the initiator list is correctly
> > computed and written to HMAT as 0,2 but then the LB data is accessed
> > using the node id (here 2), landing outside the entry_list array.
> > 
> > Stash the reverse lookup when writing the initiator list and use
> > it to get the correct array index.
> > 
> > Fixes: 4586a2cb83 ("hmat acpi: Build System Locality Latency and Bandwidth 
> > Information Structure(s)")
> > Signed-off-by: Jonathan Cameron   
> 
> This seems like -stable material, is it not?

Yes. Use case is obscure, but indeed seems suitable for stable.
Thanks.

Jonathan

> 
> Thanks,
> 
> /mjt
> 
> > ---
> >   hw/acpi/hmat.c | 6 +-
> >   1 file changed, 5 insertions(+), 1 deletion(-)
> > 
> > diff --git a/hw/acpi/hmat.c b/hw/acpi/hmat.c
> > index 723ae28d32..b933ae3c06 100644
> > --- a/hw/acpi/hmat.c
> > +++ b/hw/acpi/hmat.c
> > @@ -78,6 +78,7 @@ static void build_hmat_lb(GArray *table_data, 
> > HMAT_LB_Info *hmat_lb,
> > uint32_t *initiator_list)
> >   {
> >   int i, index;
> > +uint32_t initiator_to_index[MAX_NODES] = {};
> >   HMAT_LB_Data *lb_data;
> >   uint16_t *entry_list;
> >   uint32_t base;
> > @@ -121,6 +122,8 @@ static void build_hmat_lb(GArray *table_data, 
> > HMAT_LB_Info *hmat_lb,
> >   /* Initiator Proximity Domain List */
> >   for (i = 0; i < num_initiator; i++) {
> >   build_append_int_noprefix(table_data, initiator_list[i], 4);
> > +/* Reverse mapping for array positions */
> > +initiator_to_index[initiator_list[i]] = i;
> >   }
> >   
> >   /* Target Proximity Domain List */
> > @@ -132,7 +135,8 @@ static void build_hmat_lb(GArray *table_data, 
> > HMAT_LB_Info *hmat_lb,
> >   entry_list = g_new0(uint16_t, num_initiator * num_target);
> >   for (i = 0; i < hmat_lb->list->len; i++) {
> >   lb_data = &g_array_index(hmat_lb->list, HMAT_LB_Data, i);
> > -index = lb_data->initiator * num_target + lb_data->target;
> > +index = initiator_to_index[lb_data->initiator] * num_target +
> > +lb_data->target;
> >   
> >   entry_list[index] = (uint16_t)(lb_data->data / hmat_lb->base);
> >   }  
> 




Re: [PATCH v9 0/7] QEMU CXL Provide mock CXL events and irq support

2024-03-15 Thread Jonathan Cameron via
On Fri, 15 Mar 2024 09:52:28 +0800
Yuquan Wang  wrote:

> Hello, Jonathan
> 
> When during the test of qmps of CXL events like 
> "cxl-inject-general-media-event", 
> I am confuesd about the argument "flags". According to "qapi/cxl.json" in 
> qemu, 
> this argument represents "Event Record Flags" in Common Event Record Format.
> However, it seems like the specific 'Event Record Severity' in this field can 
> be
> different from the value of 'Event Status' in "Event Status Register". 
> 
> For instance (take an injection example in the coverlatter):
> 
> { "execute": "cxl-inject-general-media-event",
> "arguments": {
> "path": "/machine/peripheral/cxl-mem0",
> "log": "informational",
> "flags": 1,
> "dpa": 1000,
> "descriptor": 3,
> "type": 3,
> "transaction-type": 192,
> "channel": 3,
> "device": 5,
> "component-id": "iras mem"
> }}
> 
> In my understanding, the 'Event Status' is informational and the 
> 'Event Record Severity' is a Warning event, which means these two arguments are
> independent of each other. Is my understanding correct?

The event status register dictates the notification path (which log).
So I think that's "informational" here.

Whereas flags is about the specific error. One case where they might be
different is where the Related Event Record Handle is set.
An error might be reported as
1) Several things that were non fatal (each with their own record)
2) In combination they result in a fatal situation (also has its own record).

The QEMU injection shouldn't restrict these combinations more than the spec
does (which is not at all!).

This same disconnect in error severity is seen in UEFI CPER records, for example,
where there is a containing record with one severity field, but more specific
parts of the record can have lower (or in theory higher) severity.
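
(If I'm reading the Common Event Record definition correctly, the severity
sits in bits [1:0] of the first flags byte - 0 informational, 1 warning,
2 failure, 3 fatal - so "flags": 1 in your example marks the record itself
as a warning even though it is queued to the informational log.)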

Jonathan


> 
> Many thanks
> Yuquan
> 




Re: [PATCH v5 09/13] hw/cxl/events: Add qmp interfaces to add/release dynamic capacity extents

2024-03-12 Thread Jonathan Cameron via
On Fri, 8 Mar 2024 20:35:53 -0800
fan  wrote:

> On Thu, Mar 07, 2024 at 12:45:55PM +0000, Jonathan Cameron wrote:
> > ...
> >   
> > > > > +list = records;
> > > > > +extents = g_new0(CXLDCExtentRaw, num_extents);
> > > > > +while (list) {
> > > > > +CXLDCExtent *ent;
> > > > > +bool skip_extent = false;
> > > > > +
> > > > > +offset = list->value->offset;
> > > > > +len = list->value->len;
> > > > > +
> > > > > +extents[i].start_dpa = offset + dcd->dc.regions[rid].base;
> > > > > +extents[i].len = len;
> > > > > +memset(extents[i].tag, 0, 0x10);
> > > > > +extents[i].shared_seq = 0;
> > > > > +
> > > > > +if (type == DC_EVENT_RELEASE_CAPACITY ||
> > > > > +type == DC_EVENT_FORCED_RELEASE_CAPACITY) {
> > > > > +/*
> > > > > + *  if the extent is still pending to be added to the 
> > > > > host,
> > > > 
> > > > Odd spacing.
> > > > 
> > > > > + * remove it from the pending extent list, so later when 
> > > > > the add
> > > > > + * response for the extent arrives, the device can 
> > > > > reject the
> > > > > + * extent as it is not in the pending list.
> > > > > + */
> > > > > +ent = 
> > > > > cxl_dc_extent_exists(&dcd->dc.extents_pending_to_add,
> > > > > +&extents[i]);
> > > > > +if (ent) {
> > > > > +QTAILQ_REMOVE(&dcd->dc.extents_pending_to_add, ent, 
> > > > > node);
> > > > > +g_free(ent);
> > > > > +skip_extent = true;
> > > > > +} else if (!cxl_dc_extent_exists(&dcd->dc.extents, 
> > > > > &extents[i])) {
> > > > > +/* If the exact extent is not in the accepted list, 
> > > > > skip */
> > > > > +skip_extent = true;
> > > > > +}
> > > > I think we need to reject case of some extents skipped and others not.
> > > > That's not supported yet so we need to complain if we get it at least. 
> > > > Maybe we need
> > > > to do two passes so we know this has happened early (or perhaps this is 
> > > > a later
> > > > patch in which case a todo here would help).
> > > 
> > > Skip here does not mean the extent is invalid, it just means the extent
> > > is still pending to add, so remove them from pending list would be
> > > enough to reject the extent, no need to release further. That is based
> > > on your feedback on v4.  
> > 
> > Ah. I'd missunderstood.  
> 
> Hi Jonathan,
> 
> I think we should not allow to release extents that are still pending to
> add. 
> If we allow it, there is a case that will not work.
> Let's see the following case (time order):
> 1. Send request to add extent A to host; (A --> pending list)
> 2. Send request to release A from the host; (Delete A from pending list,
> hoping the following add response for A will fail as there is not a matched
> extent in the pending list).

Definitely do not allow the host to release something it hasn't accepted.
We should allow QMP to release such entries though (and the same for the FM-API
when we get there). Any such request from the host should be treated as whatever
the spec says to do when you release an extent that you don't have.
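
In code terms the device side rule might look something like this (sketch
only - the host_initiated flag and the error code are made up, check what
the spec mandates here):

    /* A host-initiated release may only target fully accepted extents */
    if (host_initiated &&
        !cxl_dc_extent_exists(&dcd->dc.extents, &extents[i])) {
        return CXL_MBOX_INVALID_PA; /* or the spec mandated error */
    }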

> 3. Host send response to the device for the add request, however, for
> some reason, it does not accept any of it, so updated list is empty,
> spec allows it. Based on the spec, we need to drop the extent at the
> head of the event log. Now we have problem. Since extent A is already
> dropped from the list, we either cannot drop as the list is empty, which
> is not the worst. If we have more extents in the list, we may drop the
> one following A, which is for another request. If this happens, all the
> following extents will be acked incorrectly as the order has been
> shifted.
>  
> Does the above reasoning make sense to you?
Absolutely.  I got confused here about who was doing the release.
Host definitely can't release stuff it hasn't successfully accepted.

Jonathan

> 
> Fan
> 
> >   
> > > 
> > > The loop here is only to collect the extents to sent to t

Re: [PATCH v9 3/3] hw/i386/acpi-build: Add support for SRAT Generic Initiator structures

2024-03-11 Thread Jonathan Cameron via
On Fri, 8 Mar 2024 14:55:25 +0000
 wrote:

> From: Ankit Agrawal 
> 
> The acpi-generic-initiator object is added to allow a host device
> to be linked with a NUMA node. Qemu use it to build the SRAT
> Generic Initiator Affinity structure [1]. Add support for i386.
> 
> [1] ACPI Spec 6.3, Section 5.2.16.6
> 
> Suggested-by: Jonathan Cameron 

Reviewed-by: Jonathan Cameron 

> Signed-off-by: Ankit Agrawal 
> ---
>  hw/i386/acpi-build.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
> index 1e178341de..b65202fc07 100644
> --- a/hw/i386/acpi-build.c
> +++ b/hw/i386/acpi-build.c
> @@ -68,6 +68,7 @@
>  #include "hw/acpi/utils.h"
>  #include "hw/acpi/pci.h"
>  #include "hw/acpi/cxl.h"
> +#include "hw/acpi/acpi_generic_initiator.h"
>  
>  #include "qom/qom-qobject.h"
>  #include "hw/i386/amd_iommu.h"
> @@ -2056,6 +2057,8 @@ build_srat(GArray *table_data, BIOSLinker *linker, 
> MachineState *machine)
>  build_srat_memory(table_data, 0, 0, 0, MEM_AFFINITY_NOFLAGS);
>  }
>  
> +build_srat_generic_pci_initiator(table_data);
> +
>  /*
>   * Entry is required for Windows to enable memory hotplug in OS
>   * and for Linux to enable SWIOTLB when booted with less than




Re: [PULL 53/60] hw/cxl: Standardize all references on CXL r3.1 and minor updates

2024-03-08 Thread Jonathan Cameron via
On Fri, 8 Mar 2024 14:38:55 +0000
Peter Maydell  wrote:

> On Fri, 8 Mar 2024 at 14:34, Jonathan Cameron
>  wrote:
> >
> > On Fri, 8 Mar 2024 13:47:47 +
> > Peter Maydell  wrote:  
> > > Is there a way we could write this that would catch this error?
> > > I'm thinking maybe something like
> > >
> > > #define CXL_CREATE_DVSEC(CXL, DEVTYPE, TYPE, DATA) do { \
> > >  assert(sizeof(*DATA) == TYPE##_LENGTH); \
> > >  cxl_component_create_dvsec(CXL, DEVTYPE, TYPE##_LENGTH, \
> > > TYPE, TYPE##_REVID, (uint8_t*)DATA); \
> > >  } while (0)  
> >
> > We should be able to use the length definitions in the original assert.
> > I'm not sure why that wasn't done before.  I think there were some cases
> > where we supported multiple versions and so the length can be shorter
> > than the structure defintion but that doesn't matter on this one.
> >
> > So I think minimal fix is u16 of padding and update the assert.
> > Can circle back to tidy up the multiple places the value is defined.
> > Any mismatch in which the wrong length define is used should be easy
> > enough to spot so not sure we need the macro you suggest.  
> 
> Well, I mean, you didn't in fact spot the mismatch between
> the struct type you were passing and the length value you
> were using. That's why I think it would be helpful to
> assert() that the size of the struct really does match
> the length value you're passing in. At the moment the
> code completely throws away the type information the compiler
> has by casting the pointer to the struct to a uint8_t*.

True, but the original assert at the structure definition would have
fired if I'd actually used the define rather than a number :(

There is definitely more to do here - but the fix wants to be on the light
side of all the options.

cxl_component_create_dvsec() is an odd function in general as it has
more code that varies depending on cxl_dev_type than is shared.

So it might just make sense to split it up and provide some more
trivial functions for the header writing. This is a case of code
that has evolved and ended up as a far from ideal solution.

We only carry the DVSECHeader in the structures so that the sizes can
be read against the spec. It makes the code more complex though
so maybe should consider dropping it and making the asserts next
to the structure definitions more complex.

The asserts in existing function can go (checking it fits etc is done
by pcie_add_capability()).

If not need something more like
//awkward naming is because the second cxl needs to be there to match spec.
void cxl_create_pcie_cxl_device_dvsec(CXLComponentState *cxl,
  CXLDVSECDevice *dvsec)
{
PCIDevice *pdev = cxl->pdev;
uint16_t offset = cxl->dvsec_offset;
uint16_t length = sizeof(*dvsec);
uint8_t *wmask = pdev->wmask;

///next block can probably be a helper or done in a simpler way.
/// A lot of what we have here is just to let us reuse this first call.
pcie_add_capability(pdev, PCI_EXT_CAP_ID_DVSEC, 1, offset, length);

///These could be done by writing into dvsec, and memcpy ing more
///but the offset will be even stranger if we do that.
pci_set_long(pdev->config + offset + PCIE_DVSEC_HEADER1_OFFSET,
 (length << 20) | (rev << 16) | CXL_VENDOR_ID);
pci_set_word(pdev->config + offset + PCIE_DVSEC_ID_OFFSET,
 PCIE_CXL_DEVICE_DVSEC);

memcpy(pdev->config + offset + sizeof(DVSECHeader),
   (uint8_t *)dvsec + sizeof(DVSECHeader),
   length - sizeof(DVSECHeader));

// all the wmask stuff for this structure.
}


So I'm aiming for more drastic surgery than you were suggesting but
not in the fix!

Jonathan

> 
> thanks
> -- PMM




[PATCH] hw/cxl: Fix missing reserved data in CXL Device DVSEC

2024-03-08 Thread Jonathan Cameron via
The r3.1 specification introduced a new 2 byte field, but
to maintain DWORD alignment, an additional 2 reserved bytes
were added. I forgot those when updating the structure definition
but did include them in the size define, leading to a buffer
overrun.

Also use the define so that we don't duplicate the value.

Fixes: Coverity ID 1534095 buffer overrun
Fixes: 8700ee15de ("hw/cxl: Standardize all references on CXL r3.1 and minor 
updates")
Reported-by: Peter Maydell 
Signed-off-by: Jonathan Cameron 
---
 include/hw/cxl/cxl_pci.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/hw/cxl/cxl_pci.h b/include/hw/cxl/cxl_pci.h
index 265db6c407..d0855ed78b 100644
--- a/include/hw/cxl/cxl_pci.h
+++ b/include/hw/cxl/cxl_pci.h
@@ -92,8 +92,9 @@ typedef struct CXLDVSECDevice {
 uint32_t range2_base_hi;
 uint32_t range2_base_lo;
 uint16_t cap3;
+uint16_t resv;
 } QEMU_PACKED CXLDVSECDevice;
-QEMU_BUILD_BUG_ON(sizeof(CXLDVSECDevice) != 0x3A);
+QEMU_BUILD_BUG_ON(sizeof(CXLDVSECDevice) != PCIE_CXL_DEVICE_DVSEC_LENGTH);
 
 /*
  * CXL r3.1 Section 8.1.5: CXL Extensions DVSEC for Ports
-- 
2.39.2




Re: [PULL 53/60] hw/cxl: Standardize all references on CXL r3.1 and minor updates

2024-03-08 Thread Jonathan Cameron via
On Fri, 8 Mar 2024 13:47:47 +0000
Peter Maydell  wrote:

> On Wed, 14 Feb 2024 at 11:16, Michael S. Tsirkin  wrote:
> >
> > From: Jonathan Cameron 
> >
> > Previously not all references mentioned any spec version at all.
> > Given r3.1 is the current specification available for evaluation at
> > www.computeexpresslink.org update references to refer to that.
> > Hopefully this won't become a never ending job.
> >
> > A few structure definitions have been updated to add new fields.
> > Defaults of 0 and read only are valid choices for these new DVSEC
> > registers so go with that for now.
> >
> > There are additional error codes and some of the 'questions' in
> > the comments are resolved now.
> >
> > Update documentation reference to point to the CXL r3.1 specification
> > with naming closer to what is on the cover.
> >
> > For cases where there are structure version numbers, add defines
> > so they can be found next to the register definitions.  
> 
> Hi; Coverity points out that this change has introduced a
> buffer overrun (CID 1534905). In hw/mem/cxl_type3.c:build_dvsecs()
> we create a local struct of type CXLDVSecDevice, and then we
> pass it to cxl_component_create_dvsec() as the body parameter,
> passing it a length argument PCIE_CXL_DEVICE_DVSEC_LENGTH.
> 
> Before this change, both sizeof(CXLDVSecDevice) and
> PCIE_CXL_DEVICE_DVSEC_LENGTH were 0x38, so this was fine.
> But now...
> 
> > diff --git a/include/hw/cxl/cxl_pci.h b/include/hw/cxl/cxl_pci.h
> > index ddf01a543b..265db6c407 100644
> > --- a/include/hw/cxl/cxl_pci.h
> > +++ b/include/hw/cxl/cxl_pci.h
> > @@ -16,9 +16,8 @@
> >  #define PCIE_DVSEC_HEADER1_OFFSET 0x4 /* Offset from start of extend cap */
> >  #define PCIE_DVSEC_ID_OFFSET 0x8
> >
> > -#define PCIE_CXL_DEVICE_DVSEC_LENGTH 0x38
> > -#define PCIE_CXL1_DEVICE_DVSEC_REVID 0
> > -#define PCIE_CXL2_DEVICE_DVSEC_REVID 1
> > +#define PCIE_CXL_DEVICE_DVSEC_LENGTH 0x3C
> > +#define PCIE_CXL31_DEVICE_DVSEC_REVID 3
> >
> >  #define EXTENSIONS_PORT_DVSEC_LENGTH 0x28
> >  #define EXTENSIONS_PORT_DVSEC_REVID 0  
> 
> ...PCIE_CXL_DEVICE_DVSEC_LENGTH is 0x3C...
Gah.  Evil spec change - they defined only one extra
u16 worth of data but added padding after it and I missed
that in the structure definition.

> 
> > -/* CXL 2.0 - 8.1.3 (ID 0001) */
> > +/*
> > + * CXL r3.1 Section 8.1.3: PCIe DVSEC for Devices
> > + * DVSEC ID: 0, Revision: 3
> > + */
> >  typedef struct CXLDVSECDevice {
> >  DVSECHeader hdr;
> >  uint16_t cap;
> > @@ -82,10 +91,14 @@ typedef struct CXLDVSECDevice {
> >  uint32_t range2_size_lo;
> >  uint32_t range2_base_hi;
> >  uint32_t range2_base_lo;
> > -} CXLDVSECDevice;
> > -QEMU_BUILD_BUG_ON(sizeof(CXLDVSECDevice) != 0x38);
> > +uint16_t cap3;
> > +} QEMU_PACKED CXLDVSECDevice;
> > +QEMU_BUILD_BUG_ON(sizeof(CXLDVSECDevice) != 0x3A);  
(this is the assert I mention below)
> 
> ...and CXLDVSECDevice is only size 0x3A, so we try to read off the
> end of the struct.
> 
> What was supposed to happen here?
needs an extra uint16_t resv; at the end.

> 
> > --- a/hw/mem/cxl_type3.c
> > +++ b/hw/mem/cxl_type3.c
> > @@ -319,7 +319,7 @@ static void build_dvsecs(CXLType3Dev *ct3d)
> >  cxl_component_create_dvsec(cxl_cstate, CXL2_TYPE3_DEVICE,
> > PCIE_CXL_DEVICE_DVSEC_LENGTH,
> > PCIE_CXL_DEVICE_DVSEC,
> > -   PCIE_CXL2_DEVICE_DVSEC_REVID, dvsec);
> > +   PCIE_CXL31_DEVICE_DVSEC_REVID, dvsec);
> >
> >  dvsec = (uint8_t *)&(CXLDVSECRegisterLocator){
> >  .rsvd = 0,  
> 
> Perhaps this call to cxl_component_create_dvsec() was
> supposed to have the length argument changed, as seems
> to have been done with this other call:
> 
> > @@ -346,9 +346,9 @@ static void build_dvsecs(CXLType3Dev *ct3d)
> >  .rcvd_mod_ts_data_phase1 = 0xef, /* WTF? */
> >  };
> >  cxl_component_create_dvsec(cxl_cstate, CXL2_TYPE3_DEVICE,
> > -   PCIE_FLEXBUS_PORT_DVSEC_LENGTH_2_0,
> > +   PCIE_CXL3_FLEXBUS_PORT_DVSEC_LENGTH,
> > PCIE_FLEXBUS_PORT_DVSEC,
> > -   PCIE_FLEXBUS_PORT_DVSEC_REVID_2_0, dvsec);
> > +   PCIE_CXL3_FLEXBUS_PORT_DVSEC_REVID, dvsec);
> >  }  
> 
>  static void hdm_decoder_commit(CXLType3Dev *ct3d, int which)
> 
> 
> and with similar other 

Re: Enabling internal errors for VH CXL devices: [was: Re: Questions about CXL RAS injection test in qemu]

2024-03-08 Thread Jonathan Cameron via
On Fri, 8 Mar 2024 10:01:34 +0800
Yuquan Wang  wrote:

> On 2024-03-07 20:10,  jonathan.cameron wrote:
> 
> > Hack is fine: find the relevant device with lspci -tv and then use
> > setpci -s 0d:00.0 0x208.l=0
> > to clear all the mask bits for uncorrectable errors.  
> 
> Thanks! The suggestions from you and Terry did work!
> 
> BTW, is my understanding below about CXL RAS correct?
> 
> >> 2) The error injected by "pcie_aer_inject_error" is "protocol & link 
> >> errors" of cxl.io?
> >>The error injected by "cxl-inject-uncorrectable-errors" or 
> >> "cxl-inject-correctable-error" is "protocol & link errors" of cxl.cachemem 
> >>  
> 
> Many thanks
> Yuquan
> 
Yes.  Note the two CXL errors are actually communicated via AER uncorrectable /
correctable internal error, combined with data that is available on the EP in
the CXL specific registers.
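
For reference, the cxl.cachemem flavour can be injected from QMP with
something like this (device path and header values illustrative):

    { "execute": "cxl-inject-uncorrectable-errors",
      "arguments": {
          "path": "/machine/peripheral/cxl-pmem0",
          "errors": [ { "type": "cache-data-parity",
                        "header": [ 3, 4 ] } ] } }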

Jonathan



Re: [PATCH v3 11/20] util/dsa: Implement DSA task asynchronous submission and wait for completion.

2024-03-08 Thread Jonathan Cameron via
On Thu,  4 Jan 2024 00:44:43 +
Hao Xiang  wrote:

> * Add a DSA task completion callback.
> * DSA completion thread will call the tasks's completion callback
> on every task/batch task completion.
> * DSA submission path to wait for completion.
> * Implement CPU fallback if DSA is not able to complete the task.
> 
> Signed-off-by: Hao Xiang 
> Signed-off-by: Bryan Zhang 

Hi,

One naming comment inline. You had me confused about how you were handling
async processing where this is used.  Answer is that I think you aren't!

>  
> +/**
> + * @brief Performs buffer zero comparison on a DSA batch task asynchronously.
The hardware may be doing it asynchronously but unless that
buffer_zero_dsa_wait() call doesn't do what its name suggests, this function
is wrapping the async hardware related stuff to make it synchronous.

So name it buffer_is_zero_dsa_batch_sync()!

Jonathan

> + *
> + * @param batch_task A pointer to the batch task.
> + * @param buf An array of memory buffers.
> + * @param count The number of buffers in the array.
> + * @param len The buffer length.
> + *
> + * @return Zero if successful, otherwise non-zero.
> + */
> +int
> +buffer_is_zero_dsa_batch_async(struct dsa_batch_task *batch_task,
> +   const void **buf, size_t count, size_t len)
> +{
> +if (count <= 0 || count > batch_task->batch_size) {
> +return -1;
> +}
> +
> +assert(batch_task != NULL);
> +assert(len != 0);
> +assert(buf != NULL);
> +
> +if (count == 1) {
> +/* DSA doesn't take batch operation with only 1 task. */
> +buffer_zero_dsa_async(batch_task, buf[0], len);
> +} else {
> +buffer_zero_dsa_batch_async(batch_task, buf, count, len);
> +}
> +
> +buffer_zero_dsa_wait(batch_task);
> +buffer_zero_cpu_fallback(batch_task);
> +
> +return 0;
> +}
> +
>  #endif
>  




[PATCH v2 2/2] hmat acpi: Fix out of bounds access due to missing use of indirection

2024-03-07 Thread Jonathan Cameron via
With a NUMA setup such as

-numa node,nodeid=0,cpus=0 \
-numa node,nodeid=1,memdev=mem \
-numa node,nodeid=2,cpus=1

and appropriate hmat-lb entries, the initiator list is correctly
computed and written to HMAT as 0,2, but then the LB data is accessed
using the node id (here 2), landing outside the entry_list array.
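
The hmat-lb entries in question are along these lines (values
illustrative only):

-machine hmat=on \
-numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=10 \
-numa hmat-lb,initiator=2,target=1,hierarchy=memory,data-type=access-latency,latency=20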

Stash the reverse lookup when writing the initiator list and use
it to get the correct array index.

Fixes: 4586a2cb83 ("hmat acpi: Build System Locality Latency and Bandwidth 
Information Structure(s)")
Signed-off-by: Jonathan Cameron 
---
 hw/acpi/hmat.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/hw/acpi/hmat.c b/hw/acpi/hmat.c
index 723ae28d32..b933ae3c06 100644
--- a/hw/acpi/hmat.c
+++ b/hw/acpi/hmat.c
@@ -78,6 +78,7 @@ static void build_hmat_lb(GArray *table_data, HMAT_LB_Info 
*hmat_lb,
   uint32_t *initiator_list)
 {
 int i, index;
+uint32_t initiator_to_index[MAX_NODES] = {};
 HMAT_LB_Data *lb_data;
 uint16_t *entry_list;
 uint32_t base;
@@ -121,6 +122,8 @@ static void build_hmat_lb(GArray *table_data, HMAT_LB_Info 
*hmat_lb,
 /* Initiator Proximity Domain List */
 for (i = 0; i < num_initiator; i++) {
 build_append_int_noprefix(table_data, initiator_list[i], 4);
+/* Reverse mapping for array positions */
+initiator_to_index[initiator_list[i]] = i;
 }
 
 /* Target Proximity Domain List */
@@ -132,7 +135,8 @@ static void build_hmat_lb(GArray *table_data, HMAT_LB_Info 
*hmat_lb,
 entry_list = g_new0(uint16_t, num_initiator * num_target);
 for (i = 0; i < hmat_lb->list->len; i++) {
lb_data = &g_array_index(hmat_lb->list, HMAT_LB_Data, i);
-index = lb_data->initiator * num_target + lb_data->target;
+index = initiator_to_index[lb_data->initiator] * num_target +
+lb_data->target;
 
 entry_list[index] = (uint16_t)(lb_data->data / hmat_lb->base);
 }
-- 
2.39.2




[PATCH v2 1/2] hmat acpi: Do not add Memory Proximity Domain Attributes Structure targeting non-existent memory.

2024-03-07 Thread Jonathan Cameron via
If qemu is started with a proximity node containing CPUs alone,
it will provide one of these structures to say memory in this
node is directly connected to itself.

This description is arguably pointless even if there is memory
in the node.  If there is no memory present, and hence no SRAT
entry, it breaks Linux HMAT parsing and the table is rejected.

https://elixir.bootlin.com/linux/v6.7/source/drivers/acpi/numa/hmat.c#L444
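
The check in question is roughly (paraphrased from the link above):

    /* drivers/acpi/numa/hmat.c: hmat_parse_proximity_domain() */
    target = find_mem_target(p->memory_PD);
    if (!target) {
        pr_debug("Memory Domain missing device\n");
        return -EINVAL;
    }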

Signed-off-by: Jonathan Cameron 


v2: Fix link in patch description to be stable.
---
 hw/acpi/hmat.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/hw/acpi/hmat.c b/hw/acpi/hmat.c
index 3042d223c8..723ae28d32 100644
--- a/hw/acpi/hmat.c
+++ b/hw/acpi/hmat.c
@@ -204,6 +204,13 @@ static void hmat_build_table_structs(GArray *table_data, 
NumaState *numa_state)
 build_append_int_noprefix(table_data, 0, 4); /* Reserved */
 
 for (i = 0; i < numa_state->num_nodes; i++) {
+/*
+ * Linux rejects whole HMAT table if a node with no memory
+ * has one of these structures listing it as a target.
+ */
+if (!numa_state->nodes[i].node_mem) {
+continue;
+}
 flags = 0;
 
 if (numa_state->nodes[i].initiator < MAX_NODES) {
-- 
2.39.2




[PATCH v2 0/2] hw/acpi/hmat: Misc fixes

2024-03-07 Thread Jonathan Cameron via
v2: Fixed a link in patch 1 description so it points somewhere stable.

Two unrelated fixes here:
1) Linux really doesn't like it when you claim non-existent memory
   is directly connected to an initiator (here a CPU).
   It is a nonsense entry, though I also plan to try and get
   a relaxation of the condition into the kernel.
   Maybe we need to care about migration, but I suspect no one
   cares about this corner case (hence no one noticed the
   problem!)

2) An access outside of the allocated array when building the
   the latency and bandwidth tables.  Given this crashes QEMU
   for me, I think we are fine with the potential table change.

Some notes on 1:
- This structure is almost entirely pointless in general - most
  of the fields were removed in HMAT v2.
  What remains, is meant to convey memory controller location
  when the memory is in a different Proximity Domain from the
  memory controller (e.g. a SoC with both HBM and DDR will present
  2 NUMA domains but memory controllers will be wherever we describe
  the CPUs as being - typically with the DDR)
  Currently QEMU creates these to indicate direct connection between
  a CPU domain and memory in the same domain. Using the Proximity
  domain in SRAT conveys the same. This adds no information but
  it is harmless and avoids migration problems.

Notes on 2:
- I debated a follow-up patch removing the entries in the table
  for initiators on nodes that don't have any initiators.
  QEMU won't let you use them as initiators in the LB entries
  anyway so there is no way to set those entries and they
  end up reported as 0. OK for bandwidth, as no one is going to use
  the zero bandwidth channel. A zero latency looks very attractive,
  but that's fine as no one will read the number as there are
  no initiators? (right?)

  There is a corner case in ACPI that bites us here.
  ACPI Proximity domains are only defined in SRAT, but nothing says
  they need to be fully defined.  Generic Initiators are optional
  after all (newish feature) so it was common to use _PXM in DSDT
  to define where various platform devices were (and PCI but that's
  still not read by Linux - a story of pain and broken systems for
  another day). That's fine if they are in a node with CPUs
  (initiators) but not so much if they happen to be in a memory
  only node. Today I think the only thing we can make hit this
  condition in QEMU is a PCI Expander Bridge which doesn't initiate
  transactions. But things behind it do and there are drivers out
  there that do buffer placement based on SLIT distances. I'd
  expect HMAT users to follow soon.

  It would be nice to think all such systems will use Generic Port
  Affinity Structures (and I have patches for those to follow shortly)
  but that's overly optimistic beyond CXL where the kernel will use
  them and which drove their introduction.

Jonathan Cameron (2):
  hmat acpi: Do not add Memory Proximity Domain Attributes Structure
targeting non-existent memory.
  hmat acpi: Fix out of bounds access due to missing use of indirection

 hw/acpi/hmat.c | 13 -
 1 file changed, 12 insertions(+), 1 deletion(-)

-- 
2.39.2




[PATCH v3 1/1] target/i386: Enable page walking from MMIO memory

2024-03-07 Thread Jonathan Cameron via
From: Gregory Price 

CXL emulation of interleave requires read and write hooks due to
requirement for subpage granularity. The Linux kernel stack now enables
using this memory as conventional memory in a separate NUMA node. If a
process is deliberately forced to run from that node
$ numactl --membind=1 ls
the page table walk on i386 fails.

Useful part of backtrace:

(cpu=cpu@entry=0x56fd9000, fmt=fmt@entry=0x55fe3378 
"cpu_io_recompile: could not find TB for pc=%p")
at ../../cpu-target.c:359
(retaddr=0, addr=19595792376, attrs=..., xlat=, 
cpu=0x56fd9000, out_offset=)
at ../../accel/tcg/cputlb.c:1339
(cpu=0x56fd9000, full=0x7fffee0d96e0, ret_be=ret_be@entry=0, 
addr=19595792376, size=size@entry=8, mmu_idx=4, type=MMU_DATA_LOAD, ra=0) at 
../../accel/tcg/cputlb.c:2030
(cpu=cpu@entry=0x56fd9000, p=p@entry=0x756fddc0, mmu_idx=, type=type@entry=MMU_DATA_LOAD, memop=, ra=ra@entry=0) at 
../../accel/tcg/cputlb.c:2356
(cpu=cpu@entry=0x56fd9000, addr=addr@entry=19595792376, oi=oi@entry=52, 
ra=ra@entry=0, access_type=access_type@entry=MMU_DATA_LOAD) at 
../../accel/tcg/cputlb.c:2439
at ../../accel/tcg/ldst_common.c.inc:301
at ../../target/i386/tcg/sysemu/excp_helper.c:173
(err=0x756fdf80, out=0x756fdf70, mmu_idx=0, 
access_type=MMU_INST_FETCH, addr=18446744072116178925, env=0x56fdb7c0)
at ../../target/i386/tcg/sysemu/excp_helper.c:578
(cs=0x56fd9000, addr=18446744072116178925, size=, 
access_type=MMU_INST_FETCH, mmu_idx=0, probe=, retaddr=0) at 
../../target/i386/tcg/sysemu/excp_helper.c:604

Avoid this by plumbing the return address all the way down from
x86_cpu_tlb_fill(), where it is available as retaddr, to the actual
accessors, which provide it to probe_access_full(); that function already
handles MMIO accesses.

Reviewed-by: Philippe Mathieu-Daudé 
Reviewed-by: Richard Henderson 
Suggested-by: Peter Maydell 
Signed-off-by: Gregory Price 
Signed-off-by: Jonathan Cameron 
---
v3: No change.

 target/i386/tcg/sysemu/excp_helper.c | 57 +++-
 1 file changed, 30 insertions(+), 27 deletions(-)

diff --git a/target/i386/tcg/sysemu/excp_helper.c 
b/target/i386/tcg/sysemu/excp_helper.c
index 8f7011d966..7a57b7dd10 100644
--- a/target/i386/tcg/sysemu/excp_helper.c
+++ b/target/i386/tcg/sysemu/excp_helper.c
@@ -59,14 +59,14 @@ typedef struct PTETranslate {
 hwaddr gaddr;
 } PTETranslate;
 
-static bool ptw_translate(PTETranslate *inout, hwaddr addr)
+static bool ptw_translate(PTETranslate *inout, hwaddr addr, uint64_t ra)
 {
 CPUTLBEntryFull *full;
 int flags;
 
 inout->gaddr = addr;
 flags = probe_access_full(inout->env, addr, 0, MMU_DATA_STORE,
-  inout->ptw_idx, true, &inout->haddr, &full, 0);
+  inout->ptw_idx, true, &inout->haddr, &full, ra);
 
 if (unlikely(flags & TLB_INVALID_MASK)) {
 TranslateFault *err = inout->err;
@@ -82,20 +82,20 @@ static bool ptw_translate(PTETranslate *inout, hwaddr addr)
 return true;
 }
 
-static inline uint32_t ptw_ldl(const PTETranslate *in)
+static inline uint32_t ptw_ldl(const PTETranslate *in, uint64_t ra)
 {
 if (likely(in->haddr)) {
 return ldl_p(in->haddr);
 }
-return cpu_ldl_mmuidx_ra(in->env, in->gaddr, in->ptw_idx, 0);
+return cpu_ldl_mmuidx_ra(in->env, in->gaddr, in->ptw_idx, ra);
 }
 
-static inline uint64_t ptw_ldq(const PTETranslate *in)
+static inline uint64_t ptw_ldq(const PTETranslate *in, uint64_t ra)
 {
 if (likely(in->haddr)) {
 return ldq_p(in->haddr);
 }
-return cpu_ldq_mmuidx_ra(in->env, in->gaddr, in->ptw_idx, 0);
+return cpu_ldq_mmuidx_ra(in->env, in->gaddr, in->ptw_idx, ra);
 }
 
 /*
@@ -132,7 +132,8 @@ static inline bool ptw_setl(const PTETranslate *in, 
uint32_t old, uint32_t set)
 }
 
 static bool mmu_translate(CPUX86State *env, const TranslateParams *in,
-  TranslateResult *out, TranslateFault *err)
+  TranslateResult *out, TranslateFault *err,
+  uint64_t ra)
 {
 const target_ulong addr = in->addr;
 const int pg_mode = in->pg_mode;
@@ -164,11 +165,11 @@ static bool mmu_translate(CPUX86State *env, const 
TranslateParams *in,
  * Page table level 5
  */
 pte_addr = (in->cr3 & ~0xfff) + (((addr >> 48) & 0x1ff) << 3);
-if (!ptw_translate(&pte_trans, pte_addr)) {
+if (!ptw_translate(&pte_trans, pte_addr, ra)) {
 return false;
 }
 restart_5:
-pte = ptw_ldq(&pte_trans);
+pte = ptw_ldq(&pte_trans, ra);
 if (!(pte & PG_PRESENT_MASK)) {
 goto do_fault;
 }
@@ -188,11 +189,11 @@ static bool mmu_translate(CPUX86State *env, const 
TranslateParams *in,
  

[PATCH v3 0/1] target/i386: Fix page walking from MMIO memory.

2024-03-07 Thread Jonathan Cameron via
Previously: tcg/i386: Page tables in MMIO memory fixes (CXL)
Richard Henderson picked up patches 1 and 3 which were architecture independent
leaving just this x86 specific patch.

No change to the patch. Resending because it's hard to spot individual
unapplied patches in a larger series.

Original cover letter (edited).

CXL memory is interleaved at granularities as fine as 64 bytes.
To emulate this each read and write access undergoes address translation
similar to that used in physical hardware. This is done using
cfmws_ops for a memory region per CXL Fixed Memory Window (the PA address
range in the host that is interleaved across host bridges and beyond).
The OS programs interleave decoders in the CXL Root Bridges, switch
upstream ports and the corresponding decoders in CXL type 3 devices, which
have to know the Host PA to Device PA mappings.

Unfortunately this CXL memory may be used as normal memory and anything
that can end up in RAM can be placed within it. As Linux has become
more capable of handling this memory we've started to get quite a few
bug reports for the QEMU support. However terrible the performance is
people seem to like running actual software stacks on it :(

This doesn't work for KVM - so for now CXL emulation remains TCG only.
(unless you are very careful on how it is used!)  I plan to add some
safety guards at a later date to make it slightly harder for people
to shoot themselves in the foot + a more limited set of CXL functionality
that is safe (no interleaving!)

Previously we had some issues with TCG reading instructions from CXL
memory but that is now all working. This time the issues are around
the Page Tables being in the CXL memory + DMA buffers being placed in it.

The test setup I've been using is simple 2 way interleave via 2 root
ports below a single CXL root complex.  After configuration in Linux
these are mapped to their own Numa Node and
numactl --membind=1 ls
followed by powering down the machine is sufficient to hit all the bugs
addressed in this series.
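
For reference, that topology is built along these lines (sketch only;
see docs/system/devices/cxl.rst for a complete recipe):

-object memory-backend-ram,id=vmem0,share=on,size=256M \
-object memory-backend-ram,id=vmem1,share=on,size=256M \
-device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
-device cxl-rp,port=0,bus=cxl.1,id=root_port0,chassis=0,slot=0 \
-device cxl-rp,port=1,bus=cxl.1,id=root_port1,chassis=0,slot=1 \
-device cxl-type3,bus=root_port0,volatile-memdev=vmem0,id=cxl-mem0 \
-device cxl-type3,bus=root_port1,volatile-memdev=vmem1,id=cxl-mem1 \
-M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G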

Thanks to Gregory, Peter and Alex for their help figuring this lot
out.

Whilst the thread started back at:
https://lore.kernel.org/all/CAAg4PaqsGZvkDk_=ph+oz-yeeuvcvsrumncagegrkuxe_yo...@mail.gmail.com/
The QEMU part is from:
https://lore.kernel.org/all/20240201130438.1...@huawei.com/

Gregory Price (1):
  target/i386: Enable page walking from MMIO memory

 target/i386/tcg/sysemu/excp_helper.c | 57 +++-
 1 file changed, 30 insertions(+), 27 deletions(-)

-- 
2.39.2




[PATCH v2 4/4] physmem: Fix wrong address in large address_space_read/write_cached_slow()

2024-03-07 Thread Jonathan Cameron via
If the access is bigger than the MemoryRegion supports,
flatview_read/write_continue() will attempt to update the MemoryRegion,
but the address passed to flatview_translate() is relative to the cache, not
to the FlatView.

On arm/virt with interleaved CXL memory emulation and virtio-blk-pci this
led to the first part of the descriptor being read from the CXL memory and
the second part from PA 0x8 which happens to be a blank region
of a flash chip, reading as all 0xffs on this particular configuration.
Note this test requires the out of tree ARM support for CXL, but
the problem is more general.

Avoid this by adding new address_space_read_continue_cached()
and address_space_write_continue_cached() which share all the logic
with the flatview versions except for the MemoryRegion lookup which
is unnecessary as the MemoryRegionCache only covers one MemoryRegion.

Signed-off-by: Jonathan Cameron 
---
v2: Review from Peter Xu
- Drop additional lookups of the MemoryRegion via
address_space_translate_cached() as it will always return the same
answer.
- Drop various parameters that are then unused.
- rename addr1 to mr_addr.
- Drop a fuzz_dma_read_cb(). Could put this back but it means
  carrying the address into the inner call and the only in tree
  fuzzer checks if it is normal RAM and if not does nothing anyway.
  We don't hit this path for normal RAM.
---
 system/physmem.c | 63 +++-
 1 file changed, 57 insertions(+), 6 deletions(-)

diff --git a/system/physmem.c b/system/physmem.c
index 1264eab24b..701bea27dd 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -3381,6 +3381,59 @@ static inline MemoryRegion 
*address_space_translate_cached(
 return section.mr;
 }
 
+/* Called within RCU critical section.  */
+static MemTxResult address_space_write_continue_cached(MemTxAttrs attrs,
+   const void *ptr,
+   hwaddr len,
+   hwaddr mr_addr,
+   hwaddr l,
+   MemoryRegion *mr)
+{
+MemTxResult result = MEMTX_OK;
+const uint8_t *buf = ptr;
+
+for (;;) {
+result |= flatview_write_continue_step(attrs, buf, len, mr_addr, &l,
+   mr);
+
+len -= l;
+buf += l;
+mr_addr += l;
+
+if (!len) {
+break;
+}
+
+l = len;
+}
+
+return result;
+}
+
+/* Called within RCU critical section.  */
+static MemTxResult address_space_read_continue_cached(MemTxAttrs attrs,
+  void *ptr, hwaddr len,
+  hwaddr mr_addr, hwaddr l,
+  MemoryRegion *mr)
+{
+MemTxResult result = MEMTX_OK;
+uint8_t *buf = ptr;
+
+for (;;) {
+result |= flatview_read_continue_step(attrs, buf, len, mr_addr, &l,
mr);
+len -= l;
+buf += l;
+mr_addr += l;
+
+if (!len) {
+break;
+}
+l = len;
+}
+
+return result;
+}
+
 /* Called from RCU critical section. address_space_read_cached uses this
  * out of line function when the target is an MMIO or IOMMU region.
  */
@@ -3394,9 +3447,8 @@ address_space_read_cached_slow(MemoryRegionCache *cache, 
hwaddr addr,
 l = len;
mr = address_space_translate_cached(cache, addr, &mr_addr, &l, false,
 MEMTXATTRS_UNSPECIFIED);
-return flatview_read_continue(cache->fv,
-  addr, MEMTXATTRS_UNSPECIFIED, buf, len,
-  mr_addr, l, mr);
+return address_space_read_continue_cached(MEMTXATTRS_UNSPECIFIED,
+  buf, len, mr_addr, l, mr);
 }
 
 /* Called from RCU critical section. address_space_write_cached uses this
@@ -3412,9 +3464,8 @@ address_space_write_cached_slow(MemoryRegionCache *cache, 
hwaddr addr,
 l = len;
mr = address_space_translate_cached(cache, addr, &mr_addr, &l, true,
 MEMTXATTRS_UNSPECIFIED);
-return flatview_write_continue(cache->fv,
-   addr, MEMTXATTRS_UNSPECIFIED, buf, len,
-   mr_addr, l, mr);
+return address_space_write_continue_cached(MEMTXATTRS_UNSPECIFIED,
+   buf, len, mr_addr, l, mr);
 }
 
 #define ARG1_DECLMemoryRegionCache *cache
-- 
2.39.2




[PATCH v2 3/4] physmem: Factor out body of flatview_read/write_continue() loop

2024-03-07 Thread Jonathan Cameron via
This code will be reused for the address_space_cached accessors
shortly.

Also reduce scope of result variable now we aren't directly
calling this in the loop.

Signed-off-by: Jonathan Cameron 
---
v2: Thanks to Peter Xu
- Fix alignment of code.
- Drop unused addr parameter.
- Carry through new mr_addr parameter name.
- RB not picked up as not sure what Peter will think wrt to
  resulting parameter ordering.
---
 system/physmem.c | 169 +++
 1 file changed, 99 insertions(+), 70 deletions(-)

diff --git a/system/physmem.c b/system/physmem.c
index a64a96a3e5..1264eab24b 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -2681,6 +2681,56 @@ static bool flatview_access_allowed(MemoryRegion *mr, 
MemTxAttrs attrs,
 return false;
 }
 
+static MemTxResult flatview_write_continue_step(MemTxAttrs attrs,
+const uint8_t *buf,
+hwaddr len, hwaddr mr_addr,
+hwaddr *l, MemoryRegion *mr)
+{
+if (!flatview_access_allowed(mr, attrs, mr_addr, *l)) {
+return MEMTX_ACCESS_ERROR;
+}
+
+if (!memory_access_is_direct(mr, true)) {
+uint64_t val;
+MemTxResult result;
+bool release_lock = prepare_mmio_access(mr);
+
+*l = memory_access_size(mr, *l, mr_addr);
+/*
+ * XXX: could force current_cpu to NULL to avoid
+ * potential bugs
+ */
+
+/*
+ * Assure Coverity (and ourselves) that we are not going to OVERRUN
+ * the buffer by following ldn_he_p().
+ */
+#ifdef QEMU_STATIC_ANALYSIS
+assert((*l == 1 && len >= 1) ||
+   (*l == 2 && len >= 2) ||
+   (*l == 4 && len >= 4) ||
+   (*l == 8 && len >= 8));
+#endif
+val = ldn_he_p(buf, *l);
+result = memory_region_dispatch_write(mr, mr_addr, val,
+  size_memop(*l), attrs);
+if (release_lock) {
+bql_unlock();
+}
+
+return result;
+} else {
+/* RAM case */
+uint8_t *ram_ptr = qemu_ram_ptr_length(mr->ram_block, mr_addr, l,
+   false);
+
+memmove(ram_ptr, buf, *l);
+invalidate_and_set_dirty(mr, mr_addr, *l);
+
+return MEMTX_OK;
+}
+}
+
 /* Called within RCU critical section.  */
 static MemTxResult flatview_write_continue(FlatView *fv, hwaddr addr,
MemTxAttrs attrs,
@@ -2692,44 +2742,8 @@ static MemTxResult flatview_write_continue(FlatView *fv, 
hwaddr addr,
 const uint8_t *buf = ptr;
 
 for (;;) {
-if (!flatview_access_allowed(mr, attrs, mr_addr, l)) {
-result |= MEMTX_ACCESS_ERROR;
-/* Keep going. */
-} else if (!memory_access_is_direct(mr, true)) {
-uint64_t val;
-bool release_lock = prepare_mmio_access(mr);
-
-l = memory_access_size(mr, l, mr_addr);
-/* XXX: could force current_cpu to NULL to avoid
-   potential bugs */
-
-/*
- * Assure Coverity (and ourselves) that we are not going to OVERRUN
- * the buffer by following ldn_he_p().
- */
-#ifdef QEMU_STATIC_ANALYSIS
-assert((l == 1 && len >= 1) ||
-   (l == 2 && len >= 2) ||
-   (l == 4 && len >= 4) ||
-   (l == 8 && len >= 8));
-#endif
-val = ldn_he_p(buf, l);
-result |= memory_region_dispatch_write(mr, mr_addr, val,
-   size_memop(l), attrs);
-if (release_lock) {
-bql_unlock();
-}
-
-
-} else {
-/* RAM case */
-
-uint8_t *ram_ptr = qemu_ram_ptr_length(mr->ram_block, mr_addr, &l,
-   false);
-
-memmove(ram_ptr, buf, l);
-invalidate_and_set_dirty(mr, mr_addr, l);
-}
+result |= flatview_write_continue_step(attrs, buf, len, mr_addr, &l,
+   mr);
 
 len -= l;
 buf += l;
@@ -2763,6 +2777,52 @@ static MemTxResult flatview_write(FlatView *fv, hwaddr 
addr, MemTxAttrs attrs,
mr_addr, l, mr);
 }
 
+static MemTxResult flatview_read_continue_step(MemTxAttrs attrs, uint8_t *buf,
+   hwaddr len, hwaddr mr_addr,
+   hwaddr *l,
+   MemoryRegion *mr)
+{
+if (!flatview_access_allowed(mr, attrs, mr_addr, *l)) {
+return MEMTX_ACCESS_ERROR;
+}
+
+if (!memory_access_is_direct(mr, false)) {
+   

[PATCH v2 2/4] physmem: Reduce local variable scope in flatview_read/write_continue()

2024-03-07 Thread Jonathan Cameron via
Precursor to factoring out the inner loops for reuse.

Reviewed-by: Peter Xu 
Signed-off-by: Jonathan Cameron 
---
v2: Picked up tag from Peter.
 system/physmem.c | 40 
 1 file changed, 20 insertions(+), 20 deletions(-)

diff --git a/system/physmem.c b/system/physmem.c
index 2704b780f6..a64a96a3e5 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -2688,10 +2688,7 @@ static MemTxResult flatview_write_continue(FlatView *fv, 
hwaddr addr,
hwaddr len, hwaddr mr_addr,
hwaddr l, MemoryRegion *mr)
 {
-uint8_t *ram_ptr;
-uint64_t val;
 MemTxResult result = MEMTX_OK;
-bool release_lock = false;
 const uint8_t *buf = ptr;
 
 for (;;) {
@@ -2699,7 +2696,9 @@ static MemTxResult flatview_write_continue(FlatView *fv, 
hwaddr addr,
 result |= MEMTX_ACCESS_ERROR;
 /* Keep going. */
 } else if (!memory_access_is_direct(mr, true)) {
-release_lock |= prepare_mmio_access(mr);
+uint64_t val;
+bool release_lock = prepare_mmio_access(mr);
+
 l = memory_access_size(mr, l, mr_addr);
 /* XXX: could force current_cpu to NULL to avoid
potential bugs */
@@ -2717,18 +2716,21 @@ static MemTxResult flatview_write_continue(FlatView 
*fv, hwaddr addr,
 val = ldn_he_p(buf, l);
 result |= memory_region_dispatch_write(mr, mr_addr, val,
size_memop(l), attrs);
+if (release_lock) {
+bql_unlock();
+}
+
+
 } else {
 /* RAM case */
-ram_ptr = qemu_ram_ptr_length(mr->ram_block, mr_addr, &l, false);
+
+uint8_t *ram_ptr = qemu_ram_ptr_length(mr->ram_block, mr_addr, &l,
+   false);
+
 memmove(ram_ptr, buf, l);
 invalidate_and_set_dirty(mr, mr_addr, l);
 }
 
-if (release_lock) {
-bql_unlock();
-release_lock = false;
-}
-
 len -= l;
 buf += l;
 addr += l;
@@ -2767,10 +2769,7 @@ MemTxResult flatview_read_continue(FlatView *fv, hwaddr 
addr,
hwaddr len, hwaddr mr_addr, hwaddr l,
MemoryRegion *mr)
 {
-uint8_t *ram_ptr;
-uint64_t val;
 MemTxResult result = MEMTX_OK;
-bool release_lock = false;
 uint8_t *buf = ptr;
 
 fuzz_dma_read_cb(addr, len, mr);
@@ -2780,7 +2779,9 @@ MemTxResult flatview_read_continue(FlatView *fv, hwaddr 
addr,
 /* Keep going. */
 } else if (!memory_access_is_direct(mr, false)) {
 /* I/O case */
-release_lock |= prepare_mmio_access(mr);
+uint64_t val;
+bool release_lock = prepare_mmio_access(mr);
+
 l = memory_access_size(mr, l, mr_addr);
-result |= memory_region_dispatch_read(mr, mr_addr, &val,
   size_memop(l), attrs);
@@ -2796,17 +2797,16 @@ MemTxResult flatview_read_continue(FlatView *fv, hwaddr 
addr,
(l == 8 && len >= 8));
 #endif
 stn_he_p(buf, l, val);
+if (release_lock) {
+bql_unlock();
+}
 } else {
 /* RAM case */
-ram_ptr = qemu_ram_ptr_length(mr->ram_block, mr_addr, &l, false);
+uint8_t *ram_ptr = qemu_ram_ptr_length(mr->ram_block, mr_addr, &l,
+   false);
 memcpy(buf, ram_ptr, l);
 }
 
-if (release_lock) {
-bql_unlock();
-release_lock = false;
-}
-
 len -= l;
 buf += l;
 addr += l;
-- 
2.39.2




[PATCH v2 1/4] physmem: Rename addr1 to more informative mr_addr in flatview_read/write() and similar

2024-03-07 Thread Jonathan Cameron via
The calls to flatview_read/write[_continue]() have parameters addr and
addr1 but the names give no indication of what they are addresses of.
Rename addr1 to mr_addr to reflect that it is the translated address
offset within the MemoryRegion returned by flatview_translate().
Similarly rename the parameter in address_space_read/write_cached_slow()

Suggested-by: Peter Xu 
Signed-off-by: Jonathan Cameron 

---
v2: New patch.
- I have kept the renames to only the code I'm touching later in this
  series, but they could be applied much more widely.
---
 system/physmem.c | 50 
 1 file changed, 25 insertions(+), 25 deletions(-)

diff --git a/system/physmem.c b/system/physmem.c
index 05997a7ca7..2704b780f6 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -2685,7 +2685,7 @@ static bool flatview_access_allowed(MemoryRegion *mr, 
MemTxAttrs attrs,
 static MemTxResult flatview_write_continue(FlatView *fv, hwaddr addr,
MemTxAttrs attrs,
const void *ptr,
-   hwaddr len, hwaddr addr1,
+   hwaddr len, hwaddr mr_addr,
hwaddr l, MemoryRegion *mr)
 {
 uint8_t *ram_ptr;
@@ -2695,12 +2695,12 @@ static MemTxResult flatview_write_continue(FlatView 
*fv, hwaddr addr,
 const uint8_t *buf = ptr;
 
 for (;;) {
-if (!flatview_access_allowed(mr, attrs, addr1, l)) {
+if (!flatview_access_allowed(mr, attrs, mr_addr, l)) {
 result |= MEMTX_ACCESS_ERROR;
 /* Keep going. */
 } else if (!memory_access_is_direct(mr, true)) {
 release_lock |= prepare_mmio_access(mr);
-l = memory_access_size(mr, l, addr1);
+l = memory_access_size(mr, l, mr_addr);
 /* XXX: could force current_cpu to NULL to avoid
potential bugs */
 
@@ -2715,13 +2715,13 @@ static MemTxResult flatview_write_continue(FlatView 
*fv, hwaddr addr,
(l == 8 && len >= 8));
 #endif
 val = ldn_he_p(buf, l);
-result |= memory_region_dispatch_write(mr, addr1, val,
+result |= memory_region_dispatch_write(mr, mr_addr, val,
size_memop(l), attrs);
 } else {
 /* RAM case */
-ram_ptr = qemu_ram_ptr_length(mr->ram_block, addr1, &l, false);
+ram_ptr = qemu_ram_ptr_length(mr->ram_block, mr_addr, &l, false);
 memmove(ram_ptr, buf, l);
-invalidate_and_set_dirty(mr, addr1, l);
+invalidate_and_set_dirty(mr, mr_addr, l);
 }
 
 if (release_lock) {
@@ -2738,7 +2738,7 @@ static MemTxResult flatview_write_continue(FlatView *fv, 
hwaddr addr,
 }
 
 l = len;
-mr = flatview_translate(fv, addr, &addr1, &l, true, attrs);
+mr = flatview_translate(fv, addr, &mr_addr, &l, true, attrs);
 }
 
 return result;
@@ -2749,22 +2749,22 @@ static MemTxResult flatview_write(FlatView *fv, hwaddr 
addr, MemTxAttrs attrs,
   const void *buf, hwaddr len)
 {
 hwaddr l;
-hwaddr addr1;
+hwaddr mr_addr;
 MemoryRegion *mr;
 
 l = len;
-mr = flatview_translate(fv, addr, &addr1, &l, true, attrs);
+mr = flatview_translate(fv, addr, &mr_addr, &l, true, attrs);
 if (!flatview_access_allowed(mr, attrs, addr, len)) {
 return MEMTX_ACCESS_ERROR;
 }
 return flatview_write_continue(fv, addr, attrs, buf, len,
-   addr1, l, mr);
+   mr_addr, l, mr);
 }
 
 /* Called within RCU critical section.  */
 MemTxResult flatview_read_continue(FlatView *fv, hwaddr addr,
MemTxAttrs attrs, void *ptr,
-   hwaddr len, hwaddr addr1, hwaddr l,
+   hwaddr len, hwaddr mr_addr, hwaddr l,
MemoryRegion *mr)
 {
 uint8_t *ram_ptr;
@@ -2775,14 +2775,14 @@ MemTxResult flatview_read_continue(FlatView *fv, hwaddr 
addr,
 
 fuzz_dma_read_cb(addr, len, mr);
 for (;;) {
-if (!flatview_access_allowed(mr, attrs, addr1, l)) {
+if (!flatview_access_allowed(mr, attrs, mr_addr, l)) {
 result |= MEMTX_ACCESS_ERROR;
 /* Keep going. */
 } else if (!memory_access_is_direct(mr, false)) {
 /* I/O case */
 release_lock |= prepare_mmio_access(mr);
-l = memory_access_size(mr, l, addr1);
-result |= memory_region_dispatch_read(mr, addr1, ,
+l = memory_access_size(mr, l, mr_addr);
+result |= memory_region_dispatch_read(mr, mr_addr, ,
   size_memop(l), attrs);
 
 /*
@@ -2798,7 +2798,7 @@ MemTxResult flatview_read_cont

[PATCH v2 0/4] physmem: Fix MemoryRegion for second access to cached MMIO Address Space

2024-03-07 Thread Jonathan Cameron via
v2: (Thanks to Peter Xu for reviewing!)
- New patch 1 to rename addr1 to mr_addr in the interests of meaningful naming.
- Take advantage of a cached address space only allowing for a single MR
  to simplify the new code.
- Various cleanups of indentation etc.
- Cover letter and some patch descriptions updated to reflect changes.
- Changes all called out in specific patches.

Issue seen testing virtio-blk-pci with CXL emulated interleave memory.
Tests were done on arm64, but the issue isn't architecture specific.
Note that some additional fixes are needed to TCG to be able to run far
enough to hit this on arm64 or x86. Most of these are now upstream
with the exception of:

target/i386: Enable page walking from MMIO memory
https://lore.kernel.org/qemu-devel/20240219173153.12114-3-jonathan.came...@huawei.com/

The address_space_read_cached_slow() and address_space_write_cached_slow()
functions query the MemoryRegion for the cached address space correctly
using address_space_translate_cached() but then call into
flatview_read_continue() / flatview_write_continue().
If the access is to a MMIO MemoryRegion and is bigger than the MemoryRegion
supports, the loop will query the MemoryRegion for the next access to use.
That query uses flatview_translate() but the address passed is suitable
for the cache, not the FlatView. On my test setup that meant the second
8 bytes onwards of the virtio descriptor were read from flash memory
at the beginning of the system address map, not the CXL emulated memory
where the descriptor was found.  The result happened to be all fs, so easy
to spot.
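
Boiled down, the failing pattern is the following (sketch; as and
desc_gpa stand in for the real virtio plumbing):

    MemoryRegionCache cache;
    uint8_t desc[16];

    address_space_cache_init(&cache, as, desc_gpa, sizeof(desc), false);
    /*
     * The first 8 bytes come from the cached (MMIO) MemoryRegion; before
     * this fix the remainder was re-translated via flatview_translate()
     * using the cache-relative address, so was read from an unrelated
     * region.
     */
    address_space_read_cached(&cache, 0, desc, sizeof(desc));
    address_space_cache_destroy(&cache);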

Change these calls to assume that the MemoryRegion does not change
as multiple accesses are performed through the MemoryRegionCache.
The first patch renames the addr1 parameter to the hopefully more
informative mr_addr.

To avoid duplicating most of the code, the next 2 patches factor out
the common parts of flatview_read_continue() and flatview_write_continue()
so they can be reused.

The write path has not been tested but it is so similar to the read path I've
included it here.

Jonathan Cameron (4):
  physmem: Rename addr1 to more informative mr_addr in
flatview_read/write() and similar
  physmem: Reduce local variable scope in flatview_read/write_continue()
  physmem: Factor out body of flatview_read/write_continue() loop
  physmem: Fix wrong address in large
address_space_read/write_cached_slow()

 system/physmem.c | 260 +++
 1 file changed, 170 insertions(+), 90 deletions(-)

-- 
2.39.2




Re: [PATCH 3/3] physmem: Fix wrong MR in large address_space_read/write_cached_slow()

2024-03-07 Thread Jonathan Cameron via
On Fri, 1 Mar 2024 13:44:01 +0800
Peter Xu  wrote:

> On Thu, Feb 15, 2024 at 02:28:17PM +0000, Jonathan Cameron wrote:
> 
> Can we rename the subject?
> 
>   physmem: Fix wrong MR in large address_space_read/write_cached_slow()
> 
> IMHO "wrong MR" is misleading, as the MR was wrong only because the address
> passed over is wrong at the first place.  Perhaps s/MR/addr/?
> 
> > If the access is bigger than the MemoryRegion supports,
> > flatview_read/write_continue() will attempt to update the Memory Region.
> > but the address passed to flatview_translate() is relative to the cache, not
> > to the FlatView.
> > 
> > On arm/virt with interleaved CXL memory emulation and virtio-blk-pci this
> > lead to the first part of descriptor being read from the CXL memory and the
> > second part from PA 0x8 which happens to be a blank region
> > of a flash chip and all ffs on this particular configuration.
> > Note this test requires the out of tree ARM support for CXL, but
> > the problem is more general.
> > 
> > Avoid this by adding new address_space_read_continue_cached()
> > and address_space_write_continue_cached() which share all the logic
> > with the flatview versions except for the MemoryRegion lookup.
> > 
> > Signed-off-by: Jonathan Cameron 
> > ---
> >  system/physmem.c | 78 
> >  1 file changed, 72 insertions(+), 6 deletions(-)
> >   
> 
> [...]
> 
> > +/* Called within RCU critical section.  */
> > +static MemTxResult address_space_read_continue_cached(MemoryRegionCache 
> > *cache,
> > +  hwaddr addr,
> > +  MemTxAttrs attrs,
> > +  void *ptr, hwaddr 
> > len,
> > +  hwaddr addr1, hwaddr 
> > l,
> > +  MemoryRegion *mr)  
> 
> It looks like "addr" (of flatview AS) is not needed for a cached RW (see
> below), because we should have a stable (cached) MR to operate anyway?
> 
> How about we also use "mr_addr" as the single addr of the MR, then drop
> addr1?

Agreed, but also need to drop the fuzz_dma_read_cb().
However, given that the first thing checked by the only in-tree fuzzing
code is whether we are dealing with RAM, I think that's fine.
> 
> > +{
> > +MemTxResult result = MEMTX_OK;
> > +uint8_t *buf = ptr;
> > +
> > +fuzz_dma_read_cb(addr, len, mr);
> > +for (;;) {
> > +  
> 
> Remove empty line?
> 
> > +result |= flatview_read_continue_step(addr, attrs, buf, len, addr1,
> > +  &l, mr);
> > +len -= l;
> > +buf += l;
> > +addr += l;
> > +
> > +if (!len) {
> > +break;
> > +}
> > +l = len;
> > +
> > +mr = address_space_translate_cached(cache, addr, &addr1, &l, false,
> > +attrs);  
> 
> Here IIUC the mr will always be the same as before?  If so, maybe "mr_addr
> += l" should be enough?
> 
I had the same thought originally but couldn't convince myself that there
was no route to end up with a different MR here. I don't yet
have a good enough grip on how this all fits together so I particularly
appreciate your help.

With hindsight I should have called this out as a question in this patch set.

Can drop passing in cache as well given it is no longer used within
this function.

Thanks,

Jonathan

> (similar comment applies to the writer side too)
> 
> > +}
> > +
> > +return result;
> > +}
> > +
> >  /* Called from RCU critical section. address_space_read_cached uses this
> >   * out of line function when the target is an MMIO or IOMMU region.
> >   */
> > @@ -3390,9 +3456,9 @@ address_space_read_cached_slow(MemoryRegionCache 
> > *cache, hwaddr addr,
> >  l = len;
> > -mr = address_space_translate_cached(cache, addr, &addr1, &l, false,
> >  MEMTXATTRS_UNSPECIFIED);
> > -return flatview_read_continue(cache->fv,
> > -  addr, MEMTXATTRS_UNSPECIFIED, buf, len,
> > -  addr1, l, mr);
> > +return address_space_read_continue_cached(cache, addr,
> > +  MEMTXATTRS_UNSPECIFIED, buf, 
> > len,
> > + 

Re: [PATCH 2/3] physmem: Factor out body of flatview_read/write_continue() loop

2024-03-07 Thread Jonathan Cameron via
On Fri, 1 Mar 2024 13:35:26 +0800
Peter Xu  wrote:

> On Fri, Mar 01, 2024 at 01:29:04PM +0800, Peter Xu wrote:
> > On Thu, Feb 15, 2024 at 02:28:16PM +, Jonathan Cameron wrote:  
> > > This code will be reused for the address_space_cached accessors
> > > shortly.
> > > 
> > > Also reduce scope of result variable now we aren't directly
> > > calling this in the loop.
> > > 
> > > Signed-off-by: Jonathan Cameron 
> > > ---
> > >  system/physmem.c | 165 ---
> > >  1 file changed, 98 insertions(+), 67 deletions(-)
> > > 
> > > diff --git a/system/physmem.c b/system/physmem.c
> > > index 39b5ac751e..74f92bb3b8 100644
> > > --- a/system/physmem.c
> > > +++ b/system/physmem.c
> > > @@ -2677,6 +2677,54 @@ static bool flatview_access_allowed(MemoryRegion 
> > > *mr, MemTxAttrs attrs,
> > >  return false;
> > >  }
> > >  
> > > +static MemTxResult flatview_write_continue_step(hwaddr addr,  
> 
> One more thing: this addr var is not used, afaict.  We could drop addr1
> below and use this to represent the MR offset.

I'm tempted to keep the addr1 where it is in the parameter list just so that
it matches up with the caller location but a rename makes a lot of sense.

> 
> I'm wondering whether we should start to use some better namings already
> for memory API functions to show obviously what AS it is describing.  From
> that POV, perhaps rename it to "mr_addr"?

I'll add a precursor patch renaming these for the functions this series touches.
We can tidy up other cases later.  I'll put a note in that patch below the cut
to observe that the rename makes sense more widely.

I've not picked up the RB given because of the parameter ordering question.

Thanks,

Jonathan

> 
> > > +MemTxAttrs attrs,
> > > +const uint8_t *buf,
> > > +hwaddr len, hwaddr addr1,
> > > +hwaddr *l, MemoryRegion 
> > > *mr)
> > > +{
> > > +if (!flatview_access_allowed(mr, attrs, addr1, *l)) {
> > > +return MEMTX_ACCESS_ERROR;
> > > +}
> > > +
> > > +if (!memory_access_is_direct(mr, true)) {
> > > +uint64_t val;
> > > +MemTxResult result;
> > > +bool release_lock = prepare_mmio_access(mr);
> > > +
> > > +*l = memory_access_size(mr, *l, addr1);
> > > +/* XXX: could force current_cpu to NULL to avoid
> > > +   potential bugs */
> > > +
> > > +/*
> > > + * Assure Coverity (and ourselves) that we are not going to 
> > > OVERRUN
> > > + * the buffer by following ldn_he_p().
> > > + */
> > > +#ifdef QEMU_STATIC_ANALYSIS
> > > +assert((*l == 1 && len >= 1) ||
> > > +   (*l == 2 && len >= 2) ||
> > > +   (*l == 4 && len >= 4) ||
> > > +   (*l == 8 && len >= 8));
> > > +#endif
> > > +val = ldn_he_p(buf, *l);
> > > +result = memory_region_dispatch_write(mr, addr1, val,
> > > +  size_memop(*l), attrs);
> > > +if (release_lock) {
> > > +bql_unlock();
> > > +}
> > > +
> > > +return result;
> > > +} else {
> > > +/* RAM case */
> > > +uint8_t *ram_ptr = qemu_ram_ptr_length(mr->ram_block, addr1, l, 
> > > false);
> > > +
> > > +memmove(ram_ptr, buf, *l);
> > > +invalidate_and_set_dirty(mr, addr1, *l);
> > > +
> > > +return MEMTX_OK;
> > > +}
> > > +}
> > > +
> > >  /* Called within RCU critical section.  */
> > >  static MemTxResult flatview_write_continue(FlatView *fv, hwaddr addr,
> > > MemTxAttrs attrs,
> > > @@ -2688,42 +2736,9 @@ static MemTxResult 
> > > flatview_write_continue(FlatView *fv, hwaddr addr,
> > >  const uint8_t *buf = ptr;
> > >  
> > >  for (;;) {
> > > -if (!flatview_access_allowed(mr, attrs, addr1, l)) {
> > > -result |= MEMTX_ACCESS_ERROR;
> > > -/* Keep going. */
> > > -} else if (!memory_access_is_direct

Re: [PATCH v5 09/13] hw/cxl/events: Add qmp interfaces to add/release dynamic capacity extents

2024-03-07 Thread Jonathan Cameron via


> >   
> > > + * remove it from the pending extent list, so later when the 
> > > add
> > > + * response for the extent arrives, the device can reject the
> > > + * extent as it is not in the pending list.
> > > + */
> > > +ent = cxl_dc_extent_exists(&dcd->dc.extents_pending_to_add,
> > > +&extents[i]);
> > > +if (ent) {
> > > +QTAILQ_REMOVE(&dcd->dc.extents_pending_to_add, ent, 
> > > node);
> > > +g_free(ent);
> > > +skip_extent = true;
> > > +} else if (!cxl_dc_extent_exists(&dcd->dc.extents, 
> > > &extents[i])) {
> > > +/* If the exact extent is not in the accepted list, skip 
> > > */
> > > +skip_extent = true;
> > > +}  
> > I think we need to reject case of some extents skipped and others not.
> > That's not supported yet so we need to complain if we get it at least. 
> > Maybe we need
> > to do two passes so we know this has happened early (or perhaps this is a 
> > later
> > patch in which case a todo here would help).  
> 
> If the second skip_extent case, I will reject earlier instead of
> skipping.
That was me misunderstanding the flow. I think this is fine as you have it 
already.

Jonathan
  




Re: [PATCH v5 09/13] hw/cxl/events: Add qmp interfaces to add/release dynamic capacity extents

2024-03-07 Thread Jonathan Cameron via
...

> > > +list = records;
> > > +extents = g_new0(CXLDCExtentRaw, num_extents);
> > > +while (list) {
> > > +CXLDCExtent *ent;
> > > +bool skip_extent = false;
> > > +
> > > +offset = list->value->offset;
> > > +len = list->value->len;
> > > +
> > > +extents[i].start_dpa = offset + dcd->dc.regions[rid].base;
> > > +extents[i].len = len;
> > > +memset(extents[i].tag, 0, 0x10);
> > > +extents[i].shared_seq = 0;
> > > +
> > > +if (type == DC_EVENT_RELEASE_CAPACITY ||
> > > +type == DC_EVENT_FORCED_RELEASE_CAPACITY) {
> > > +/*
> > > + *  if the extent is still pending to be added to the host,  
> > 
> > Odd spacing.
> >   
> > > + * remove it from the pending extent list, so later when the 
> > > add
> > > + * response for the extent arrives, the device can reject the
> > > + * extent as it is not in the pending list.
> > > + */
> > > +ent = cxl_dc_extent_exists(&dcd->dc.extents_pending_to_add,
> > > +&extents[i]);
> > > +if (ent) {
> > > +QTAILQ_REMOVE(&dcd->dc.extents_pending_to_add, ent, 
> > > node);
> > > +g_free(ent);
> > > +skip_extent = true;
> > > +} else if (!cxl_dc_extent_exists(&dcd->dc.extents, 
> > > &extents[i])) {
> > > +/* If the exact extent is not in the accepted list, skip 
> > > */
> > > +skip_extent = true;
> > > +}  
> > I think we need to reject case of some extents skipped and others not.
> > That's not supported yet so we need to complain if we get it at least. 
> > Maybe we need
> > to do two passes so we know this has happened early (or perhaps this is a 
> > later
> > patch in which case a todo here would help).  
> 
> Skip here does not mean the extent is invalid, it just means the extent
> is still pending to add, so removing it from the pending list is
> enough to reject the extent; no further release is needed. That is based
> on your feedback on v4.

Ah. I'd misunderstood.

> 
> The loop here is only to collect the extents to send to the event log.
> But as you said, we need one pass before updating the pending list.
> Actually, if we do not allow the above case where an extent to release is
> still in the pending-to-add list, we can just return here with an error; no
> extra dry run is needed.
> 
> What do you think?

I think we need a way to back out extents from the pending to add list
so we can create the race where they are offered to the OS and it takes
forever to accept and by the time it does we've removed them.
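
i.e. be able to drive something like this (sketch; command and argument
names as in this series from memory, so may not match exactly; extent
values illustrative):

    { "execute": "cxl-add-dynamic-capacity",
      "arguments": { "path": "/machine/peripheral/dcd0", "region-id": 0,
                     "extents": [ { "offset": 0, "len": 134217728 } ] } }
    ... guest never gets around to sending the Add Response ...
    { "execute": "cxl-release-dynamic-capacity",
      "arguments": { "path": "/machine/peripheral/dcd0", "region-id": 0,
                     "extents": [ { "offset": 0, "len": 134217728 } ] } }
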

> 
> >   
> > > +
> > > +
> > > +/* No duplicate or overlapped extents are allowed */
> > > +if (test_any_bits_set(blk_bitmap, offset / block_size,
> > > +  len / block_size)) {
> > > +error_setg(errp, "duplicate or overlapped extents are 
> > > detected");
> > > +return;
> > > +}
> > > +bitmap_set(blk_bitmap, offset / block_size, len / block_size);
> > > +
> > > +list = list->next;
> > > +if (!skip_extent) {
> > > +i++;  
> > Problem is if we skip one in the middle the records will be wrong below.  
> 
> Why? Only extents that pass the check will be stored in the extents
> variable and processed further, with i updated.
> For skipped ones, since i is not updated, they will be
> overwritten by following valid ones.
Ah. I'd missed the fact you store into the extent without a check on validity
but only move the index on if it was valid, then rely on not passing a
trailing entry at the end.
It would be more readable I think if local variables were used for the
parameters until we've decided not to skip, and then this ended with

if (!skip_extent) {
extents[i] = (CXLDCExtentRaw) {
.start_dpa = ...
...
};
i++
}
We have local len already so probably just need
uint64_t start_dpa = offset + dcd->dc.regions[rid].base;

Also maybe skip_extent_evlog or something like that to explain we are only
skipping that part. 
Helps people like me who read it completely wrong!

Jonathan

 




Re: [PATCH v5 08/13] hw/cxl/cxl-mailbox-utils: Add mailbox commands to support add/release dynamic capacity response

2024-03-07 Thread Jonathan Cameron via
> > > +static void cxl_destroy_dc_regions(CXLType3Dev *ct3d)
> > > +{
> > > +CXLDCExtent *ent;
> > > +
> > > +while (!QTAILQ_EMPTY(&ct3d->dc.extents)) {
> > > +ent = QTAILQ_FIRST(&ct3d->dc.extents);
> > > +cxl_remove_extent_from_extent_list(&ct3d->dc.extents, ent);
> > 
> > Isn't this the same as something like:
> > QTAILQ_FOREACH_SAFE(ent, &ct3d->dc.extents, node, next) {
> > cxl_remove_extent_from_extent_list(&ct3d->dc.extents, ent);
> > //This wrapper is small enough I'd be tempted to just have the
> > //code inline at the places it's called.
> >   
> We will have more to release after we introduce pending list as well as
> bitmap. Keep it?
ok. 
> 
> Fan
> 
> > }  
> > > +}
> > > +}  
> >   




Re: [PATCH v5 08/13] hw/cxl/cxl-mailbox-utils: Add mailbox commands to support add/release dynamic capacity response

2024-03-07 Thread Jonathan Cameron via
On Wed, 6 Mar 2024 13:39:50 -0800
fan  wrote:



> > > +}
> > > +if (len2) {
> > > +cxl_insert_extent_to_extent_list(extent_list, 
> > > dpa + len,
> > > + len2, NULL, 0);
> > > +ct3d->dc.total_extent_count += 1;
> > > +}
> > > +break;  
> > Maybe this makes sense after the support below is added, but at this
> > point in the series 
> > return CXL_MBOX_SUCCESS;
> > then found isn't relevant so can drop that. Looks like you drop it later in 
> > the
> > series anyway.  
> 
> We cannot return directly as we have more extents to release.

Ah good point. I'd missed the double loop.

> One thing I think I need to add is a dry run to test if any extent in
> the incoming list is not contained by an extent in the extent list, and
> return an error before starting to do the real release. The spec just says
> we need to return invalid PA but not specify whether we should update the list
> until we found a "bad" extent or reject the request directly. Current code
> leaves a situation where we may have updated the extent list until we found a
> "bad" extent to release.

Yes, I'm not sure on the correct answer to this either. My assumption is that in
error cases there are no side effects, but I don't see a clear statement of 
that.

So I think we are in the world of best practice, not spec compliance.
If we wanted to recover from such an error case we'd have to verify the current
extent list.  I'll fire off a question to relevant folk in appropriate forum.

Jonathan




Re: [PATCH v5 06/13] hw/mem/cxl_type3: Add host backend and address space handling for DC regions

2024-03-07 Thread Jonathan Cameron via
> > > @@ -868,16 +974,24 @@ static int cxl_type3_hpa_to_as_and_dpa(CXLType3Dev 
> > > *ct3d,
> > > AddressSpace **as,
> > > uint64_t *dpa_offset)
> > >  {
> > > -MemoryRegion *vmr = NULL, *pmr = NULL;
> > > +MemoryRegion *vmr = NULL, *pmr = NULL, *dc_mr = NULL;
> > > +uint64_t vmr_size = 0, pmr_size = 0, dc_size = 0;
> > >  
> > >  if (ct3d->hostvmem) {
> > >  vmr = host_memory_backend_get_memory(ct3d->hostvmem);
> > > +vmr_size = memory_region_size(vmr);
> > >  }
> > >  if (ct3d->hostpmem) {
> > >  pmr = host_memory_backend_get_memory(ct3d->hostpmem);
> > > +pmr_size = memory_region_size(pmr);
> > > +}
> > > +if (ct3d->dc.host_dc) {
> > > +dc_mr = host_memory_backend_get_memory(ct3d->dc.host_dc);
> > > +/* Do we want dc_size to be dc_mr->size or not?? */  
> > 
> > Maybe - definitely don't want to leave this comment here
> > unanswered and I think you enforce it above anyway.
> > 
> > So if we get here ct3d->dc.total_capacity == 
> > memory_region_size(ct3d->dc.host_dc);
> > As such we could just not stash total_capacity at all?  
> 
> I cannot identify a case where these two will be different. But
> total_capacity is referenced at quite some places, it may be nice to have
> it so we do not need to call the function to get the value every time?

I kind of like having it via one path so that there is no confusion
for the reader, but up to you on this one.  The function called is trivial
(other than some magic to handle very large memory regions) so
this is just a readability question, not a perf one.

Whatever, don't leave the question behind.  Fine to have something
that says they are always the same size if you don't get rid
of the total_capacity representation.
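
Even something as simple as (sketch):

    assert(ct3d->dc.total_capacity == memory_region_size(dc_mr));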


Jonathan



Re: [PATCH v5 0/3] Initial support for SPDM Responders

2024-03-07 Thread Jonathan Cameron via
On Thu,  7 Mar 2024 10:58:56 +1000
Alistair Francis  wrote:

> The Security Protocol and Data Model (SPDM) Specification defines
> messages, data objects, and sequences for performing message exchanges
> over a variety of transport and physical media.
>  - 
> https://www.dmtf.org/sites/default/files/standards/documents/DSP0274_1.3.0.pdf
> 
> SPDM currently supports PCIe DOE and MCTP transports, but it can be
> extended to support others in the future. This series adds
> support to QEMU to connect to an external SPDM instance.
> 
> SPDM support can be added to any QEMU device by exposing a
> TCP socket to a SPDM server. The server can then implement the SPDM
> decoding/encoding support, generally using libspdm [1].
> 
> This is similar to how the current TPM implementation works and means
> that the heavy lifting of setting up certificate chains, capabilities,
> measurements and complex crypto can be done outside QEMU by a well
> supported and tested library.
> 
> This series implements socket support and exposes SPDM for a NVMe device.

Thanks Alistair,

I'm really keen to see this land soon as I have the CXL infrastructure
for this backed up behind it.  Also will be needed for PCI (IDE) and CXL link
encryption emulation and most if not all of the confidential computing stacks
with QEMU emulating the host system + peripherals.

I believe it's just waiting for a PCI Maintainer Ack at this point? Klaus said 
he
was happy to take it through NVME but wanted a PCI Ack first.

Michael / Marcel, if you have time to look at it that would be great.

Thanks,

Jonathan


> 
> 1: https://github.com/DMTF/libspdm
> 
> v5:
>  - Update MAINTAINERS
> v4:
>  - Rebase
> v3:
>  - Spelling fixes
>  - Support for SPDM-Utils
> v2:
>  - Add cover letter
>  - A few code fixes based on comments
>  - Document SPDM-Utils
>  - A few tweaks and clarifications to the documentation
> 
> Alistair Francis (1):
>   hw/pci: Add all Data Object Types defined in PCIe r6.0
> 
> Huai-Cheng Kuo (1):
>   backends: Initial support for SPDM socket support
> 
> Wilfred Mallawa (1):
>   hw/nvme: Add SPDM over DOE support
> 
>  MAINTAINERS  |   6 +
>  docs/specs/index.rst |   1 +
>  docs/specs/spdm.rst  | 122 
>  include/hw/pci/pci_device.h  |   5 +
>  include/hw/pci/pcie_doe.h|   5 +
>  include/sysemu/spdm-socket.h |  44 +++
>  backends/spdm-socket.c   | 216 +++
>  hw/nvme/ctrl.c   |  53 +
>  backends/Kconfig |   4 +
>  backends/meson.build |   2 +
>  10 files changed, 458 insertions(+)
>  create mode 100644 docs/specs/spdm.rst
>  create mode 100644 include/sysemu/spdm-socket.h
>  create mode 100644 backends/spdm-socket.c
> 




Re: [PATCH v8 2/2] hw/acpi: Implement the SRAT GI affinity structure

2024-03-07 Thread Jonathan Cameron via
On Thu, 7 Mar 2024 03:03:02 +
Ankit Agrawal  wrote:

> >>
> >> [1] ACPI Spec 6.3, Section 5.2.16.6
> >> [2] ACPI Spec 6.3, Table 5.80
> >>
> >> Cc: Jonathan Cameron 
> >> Cc: Alex Williamson 
> >> Cc: Cedric Le Goater 
> >> Signed-off-by: Ankit Agrawal   
> >
> > I guess we gloss over the bisection breakage due to being able to add
> > these nodes and have them used in HMAT as initiators before we have
> > added SRAT support.  Linux will moan about it and not use such an HMAT
> > but meh, it will boot.
> >
> > You could drag the HMAT change after this but perhaps it's not worth 
> > bothering.  
> 
> Sorry this part isn't clear to me. Are you suggesting we keep the HMAT
> changes out from this patch?

No - don't drop them. Move them from patch 1 to either patch 2, or to a
patch 3 if that ends up looking clearer.  I think patch 2 is the
right choice though as that enables everything at once.

It's valid to have SRAT containing GI entries without the same in HMAT
(as HMAT doesn't have to be complete); it's not valid to have HMAT refer
to entries that aren't in SRAT.

Another thing we may need to add in the long run is the _OSC support.
That's needed for DSDT entries with _PXM associated with a GI only node
so that we can make them move node depending on whether or not the
Guest OS supports GIs and so will create the nodes.  Requires a bit of
magic AML to make that work.

It used to crash linux if you didn't do that, but that's been fixed
for a while I believe.

For now we aren't adding any such _PXM entries though so this is just
one for the TODO list :)


> 
> > Otherwise LGTM
> > Reviewed-by: Jonathan Cameron   
> 
> Thanks!
> 
> > Could add x86 support (posted in reply to v7 this morning)
> > and sounds like you have the test nearly ready which is great.  
> 
> Ok, will add the x86 part as well. I could reuse what you shared
> earlier.
> 
> https://gitlab.com/jic23/qemu/-/commit/ccfb4fe22167e035173390cf147d9c226951b9b6
Excellent - thanks!

Jonathan

> 
> 
> 




Re: [PATCH v5 12/13] hw/mem/cxl_type3: Allow to release partial extent and extent superset in QMP interface

2024-03-06 Thread Jonathan Cameron via
On Mon,  4 Mar 2024 11:34:07 -0800
nifan@gmail.com wrote:

> From: Fan Ni 
> 
> Before the change, the QMP interface used for add/release DC extents
> only allows to release extents that exist in either pending-to-add list
> or accepted list in the device, which means the DPA range of the extent must
> match exactly that of an extent in either list. Otherwise, the release
> request will be ignored.
> 
> With the change, we relax the constraints. As long as the DPA range of the
> extent to release is covered by extents in one of the two lists
> mentioned above, we allow the release.
> 
> Signed-off-by: Fan Ni 
I ran out of time today, so I just took a very quick look at this.

Seemed fine but similar comments on exit conditions and retry gotos as
earlier patches.

> +/*
> + * Remove all extents whose DPA range overlaps with the DPA range
> + * [dpa, dpa + len) from the list, and delete the overlapped portion.
> + * Note:
> + * 1. If the removed extent is fully within the DPA range, delete the 
> extent;
> + * 2. Otherwise, keep the portion that does not overlap, inserting new 
> extents into
> + * the list if needed for the non-overlapped part.
> + */
> +static void cxl_delist_extent_by_dpa_range(CXLDCExtentList *list,
> +   uint64_t dpa, uint64_t len)
> +{
> +CXLDCExtent *ent;
>  
> -return NULL;
> +process_leftover:

As before can we turn this into a while loop so the exit conditions are 
more obvious?  Based on len I think.
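
Something along these lines perhaps (untested sketch):

    while (len > 0) {
        CXLDCExtent *ent, *found = NULL;

        QTAILQ_FOREACH(ent, list, node) {
            if (ent->start_dpa <= dpa && dpa < ent->start_dpa + ent->len) {
                found = ent;
                break;
            }
        }
        if (!found) {
            break;
        }
        /*
         * Split/trim found as below: sets len to 0 when the extent fully
         * covers [dpa, dpa + len), otherwise advances dpa and shrinks len
         * to the leftover range.
         */
    }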


> +QTAILQ_FOREACH(ent, list, node) {
> +if (ent->start_dpa <= dpa && dpa < ent->start_dpa + ent->len) {
> +uint64_t ent_start_dpa = ent->start_dpa;
> +uint64_t ent_len = ent->len;
> +uint64_t len1 = dpa - ent_start_dpa;
> +
> +cxl_remove_extent_from_extent_list(list, ent);
> +if (len1) {
> +cxl_insert_extent_to_extent_list(list, ent_start_dpa,
> + len1, NULL, 0);
> +}
> +
> +if (dpa + len <= ent_start_dpa + ent_len) {
> +uint64_t len2 = ent_start_dpa + ent_len - dpa - len;
> +if (len2) {
> +cxl_insert_extent_to_extent_list(list, dpa + len,
> + len2, NULL, 0);
> +}
> +} else {
> +len = dpa + len - ent_start_dpa - ent_len;
> +dpa = ent_start_dpa + ent_len;
> +goto process_leftover;
> +}
> +}
> +}
>  }
>  
>  /*
> @@ -1915,8 +1966,8 @@ static void qmp_cxl_process_dynamic_capacity(const char 
> *path, CxlEventLog log,
>  list = records;
>  extents = g_new0(CXLDCExtentRaw, num_extents);
>  while (list) {
> -CXLDCExtent *ent;
>  bool skip_extent = false;
> +CXLDCExtentList *extent_list;
>  
>  offset = list->value->offset;
>  len = list->value->len;
> @@ -1933,15 +1984,32 @@ static void qmp_cxl_process_dynamic_capacity(const 
> char *path, CxlEventLog log,
>   * remove it from the pending extent list, so later when the add
>   * response for the extent arrives, the device can reject the
>   * extent as it is not in the pending list.
> + * Now, we can handle the case where the extent covers the DPA

No need for "Now". Anyone reading it is looking at the code here.

> + * range of multiple extents in the pending_to_add list.
> + * TODO: we do not allow an extent that covers ranges of extents in the
> + * pending_to_add list and the accepted list at the same time for now.
>   */
> -ent = cxl_dc_extent_exists(&dcd->dc.extents_pending_to_add,
> -   &extents[i]);
> -if (ent) {
> -QTAILQ_REMOVE(&dcd->dc.extents_pending_to_add, ent, node);
> -g_free(ent);
> +extent_list = &dcd->dc.extents_pending_to_add;
> +if (cxl_test_dpa_range_covered_by_extents(extent_list,
> +  extents[i].start_dpa,
> +  extents[i].len)) {
> +cxl_delist_extent_by_dpa_range(extent_list,
> +   extents[i].start_dpa,
> +   extents[i].len);
> +} else if (!ct3_test_region_block_backed(dcd, 
> extents[i].start_dpa,
> + extents[i].len)) {
> +/*
> + * If the DPA range of the extent is not covered by extents
> + * in the accepted list, skip
> + */
>  skip_extent = true;
> -} else if (!cxl_dc_extent_exists(&dcd->dc.extents,
> -  &extents[i])) {
> -/* If the exact extent is not in the accepted list, skip */
> +}
> +} else if (type == 

Re: [PATCH v5 11/13] hw/cxl/cxl-mailbox-utils: Add partial and superset extent release mailbox support

2024-03-06 Thread Jonathan Cameron via
On Mon,  4 Mar 2024 11:34:06 -0800
nifan@gmail.com wrote:

> From: Fan Ni 
> 
> With the change, we extend the extent release mailbox command processing
> to allow more flexible release. As long as the DPA range of the extent to
> release is covered by valid extent(s) in the device, the release can be
> performed.
> 
> Signed-off-by: Fan Ni 

Ouch this is more complex than I was thinking, but seems correct to me.

A few minor comments inline

Jonathan

> +/*
> + * Detect potential extent overflow caused by extent split during processing
> + * extent release requests, also allow releasing superset of extents where 
> the
> + * extent to release covers the range of multiple extents in the device.
> + * Note:
> + * 1.we will reject releasing an extent if some portion of its rang is

range

> + * not covered by valid extents.
> + * 2.This function is called after cxl_detect_malformed_extent_list so checks
> + * already performed there will be skipped.
> + */
> +static CXLRetCode cxl_detect_extent_overflow(const CXLType3Dev *ct3d,
> +const CXLUpdateDCExtentListInPl *in)

This code is basically dry running the actual removal.  Can we just
make the core code the same for both cases?  The bit where you update bitmaps
and extent lists at least.
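
To illustrate (a hypothetical refactor, not this patch's API): pull the
split/remove/insert block into one worker that takes whichever list, bitmap
and counter it should operate on, then call it once against the scratch
copies and once against the live state:

/* Hypothetical shared worker; name and signature made up for illustration */
static CXLRetCode cxl_release_dpa_range(CXLDCExtentList *list,
                                        unsigned long *blk_bitmap,
                                        const CXLDCRegion *region,
                                        uint64_t dpa, uint64_t len,
                                        long *extent_cnt_delta);

/* Validation pass runs against the scratch copies... */
ret = cxl_release_dpa_range(&tmp_list, bitmaps_copied[rid], region,
                            dpa, len, &extent_cnt_delta);
if (ret != CXL_MBOX_SUCCESS) {
    goto free_and_exit;
}
/* ...and the commit pass reuses exactly the same code on the live state. */
cxl_release_dpa_range(&ct3d->dc.extents, region->blk_bitmap, region,
                      dpa, len, &extent_cnt_delta);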

> +{
> +uint64_t nbits, offset;
> +const CXLDCRegion *region;
> +unsigned long **bitmaps_copied;
> +uint64_t dpa, len;
> +int i, rid;
> +CXLRetCode ret = CXL_MBOX_SUCCESS;
> +long extent_cnt_delta = 0;
> +CXLDCExtentList tmp_list;
> +CXLDCExtent *ent;
> +
> +QTAILQ_INIT(&tmp_list);
> +copy_extent_list(&tmp_list, &ct3d->dc.extents);
> +
> +bitmaps_copied = g_new0(unsigned long *, ct3d->dc.num_regions);
> +for (i = 0; i < ct3d->dc.num_regions; i++) {
> +region = &ct3d->dc.regions[i];
> +nbits = region->len / region->block_size;
> +bitmaps_copied[i] = bitmap_new(nbits);
> +bitmap_copy(bitmaps_copied[i], region->blk_bitmap, nbits);
> +}
> +
> +for (i = 0; i < in->num_entries_updated; i++) {
> +dpa = in->updated_entries[i].start_dpa;
> +len = in->updated_entries[i].len;
> +
> +rid = cxl_find_dc_region_id(ct3d, dpa, len);
> +region = &ct3d->dc.regions[rid];
> +offset = (dpa - region->base) / region->block_size;
> +nbits = len / region->block_size;
> +
> +/* Check whether range [dpa, dpa + len) is covered by valid range */
> +if (find_next_zero_bit(bitmaps_copied[rid], offset + nbits, offset) <
> +   offset + nbits) {
> +ret = CXL_MBOX_INVALID_PA;
> +goto free_and_exit;
> +}
> +
> +QTAILQ_FOREACH(ent, &tmp_list, node) {
> +/* Only split within an extent can cause extent count increase */
> +if (ent->start_dpa <= dpa &&
> +dpa + len <= ent->start_dpa + ent->len) {
> +uint64_t ent_start_dpa = ent->start_dpa;
> +uint64_t ent_len = ent->len;
> +uint64_t len1 = dpa - ent_start_dpa;
> +uint64_t len2 = ent_start_dpa + ent_len - dpa - len;
> +
> +extent_cnt_delta += len1 && len2 ? 2 : (len1 || len2 ? 1 : 
> 0);
I think this is the same as

if (len1)
    extent_cnt_delta++;
if (len2)
    extent_cnt_delta++;
extent_cnt_delta--;
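
(For example, releasing the middle of a single extent keeps a head and a
tail: two inserts minus one removal, so a net extent_cnt_delta of +1.)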



> +extent_cnt_delta -= 1;
> +if (ct3d->dc.total_extent_count + extent_cnt_delta >
> +CXL_NUM_EXTENTS_SUPPORTED) {

This early overflow detection seems valid to me because a device might run
out of resources mid-way through processing the list even if the result
would fit at the end.
Good.
> +ret = CXL_MBOX_RESOURCES_EXHAUSTED;
> +goto free_and_exit;
> +}
> +
> +offset = (ent->start_dpa - region->base) / 
> region->block_size;
> +nbits = ent->len / region->block_size;
> +bitmap_clear(bitmaps_copied[rid], offset, nbits);
> +cxl_remove_extent_from_extent_list(&tmp_list, ent);
> +
> + if (len1) {
> +offset = (dpa - region->base) / region->block_size;
> +nbits = len1 / region->block_size;
> +bitmap_set(bitmaps_copied[rid], offset, nbits);
> +cxl_insert_extent_to_extent_list(&tmp_list,
> + ent_start_dpa, len1,
> + NULL, 0);
> + }
> +
> + if (len2) {
> +offset = (dpa + len - region->base) / region->block_size;
> +nbits = len2 / region->block_size;
> +bitmap_set(bitmaps_copied[rid], offset, nbits);
> +cxl_insert_extent_to_extent_list(&tmp_list, dpa + len,
> + len2, NULL, 0);

Re: [PATCH v5 10/13] hw/mem/cxl_type3: Add dpa range validation for accesses to DC regions

2024-03-06 Thread Jonathan Cameron via
On Mon,  4 Mar 2024 11:34:05 -0800
nifan@gmail.com wrote:

> From: Fan Ni 
> 
> Not all dpa range in the DC regions is valid to access until an extent

All DPA ranges in the DC regions are invalid to access until an extent
covering the range has been added.

> covering the range has been added. Add a bitmap for each region to
> record whether a DC block in the region has been backed by DC extent.
> For the bitmap, a bit in the bitmap represents a DC block. When a DC
> extent is added, all the bits of the blocks in the extent will be set,
> which will be cleared when the extent is released.
> 
> Signed-off-by: Fan Ni 
Reviewed-by: Jonathan Cameron 
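
For reference, the bookkeeping the commit message describes boils down to
something like the sketch below, written against QEMU's bitmap API. The
CXLDCRegion fields follow this series; the helper names are illustrative
only, not the patch's actual functions:

#include "qemu/bitmap.h"

/* Mark the blocks of a newly accepted extent as backed. */
static void dc_region_set_backed(CXLDCRegion *region, uint64_t dpa,
                                 uint64_t len)
{
    bitmap_set(region->blk_bitmap,
               (dpa - region->base) / region->block_size,
               len / region->block_size);
}

/* Clear the blocks again when the extent is released. */
static void dc_region_clear_backed(CXLDCRegion *region, uint64_t dpa,
                                   uint64_t len)
{
    bitmap_clear(region->blk_bitmap,
                 (dpa - region->base) / region->block_size,
                 len / region->block_size);
}

/* Reject an access unless every block in [dpa, dpa + len) is backed. */
static bool dc_region_is_backed(CXLDCRegion *region, uint64_t dpa,
                                uint64_t len)
{
    uint64_t first = (dpa - region->base) / region->block_size;
    uint64_t nbits = len / region->block_size;

    return find_next_zero_bit(region->blk_bitmap, first + nbits, first) >=
           first + nbits;
}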





Re: [PATCH v5 09/13] hw/cxl/events: Add qmp interfaces to add/release dynamic capacity extents

2024-03-06 Thread Jonathan Cameron via
On Mon,  4 Mar 2024 11:34:04 -0800
nifan@gmail.com wrote:

> From: Fan Ni 
> 
> Since fabric manager emulation is not supported yet, the change implements
> the functions to add/release dynamic capacity extents as QMP interfaces.

We'll need them anyway, or to implement an fm interface via QMP which is
going to be ugly and complex.

> 
> Note: we skip any FM issued extent release request if the exact extent
> does not exist in the extent list of the device. We will loosen the
> restriction later once we have partial release support in the kernel.

Maybe the kernel will treat it as a request to release the extent it
is tracking that contains it.  So we may want to add a way to poke that.
Not today though!

> 
> 1. Add dynamic capacity extents:
> 
> For example, the command to add two contiguous extents (each 128MiB long)
> to region 0 (starting at DPA offset 0) looks like below:
> 
> { "execute": "qmp_capabilities" }
> 
> { "execute": "cxl-add-dynamic-capacity",
>   "arguments": {
>   "path": "/machine/peripheral/cxl-dcd0",
>   "region-id": 0,
>   "extents": [
>   {
>   "dpa": 0,
>   "len": 134217728
>   },
>   {
>   "dpa": 134217728,
>   "len": 134217728
>   }
>   ]
>   }
> }
> 
> 2. Release dynamic capacity extents:
> 
> For example, the command to release an extent of size 128MiB from region 0
> (DPA offset 128MiB) looks like below:
> 
> { "execute": "cxl-release-dynamic-capacity",
>   "arguments": {
>   "path": "/machine/peripheral/cxl-dcd0",
>   "region-id": 0,
>   "extents": [
>   {
>   "dpa": 134217728,
>   "len": 134217728
>   }
>   ]
>   }
> }
> 
> Signed-off-by: Fan Ni 

...
  
> diff --git a/hw/mem/cxl_type3.c b/hw/mem/cxl_type3.c
> index dccfaaad3a..e9c8994cdb 100644
> --- a/hw/mem/cxl_type3.c
> +++ b/hw/mem/cxl_type3.c
> @@ -674,6 +674,7 @@ static bool cxl_create_dc_regions(CXLType3Dev *ct3d, 
> Error **errp)
>  ct3d->dc.total_capacity += region->len;
>  }
>  QTAILQ_INIT(&ct3d->dc.extents);
> +QTAILQ_INIT(&ct3d->dc.extents_pending_to_add);
>  
>  return true;
>  }
> @@ -686,6 +687,12 @@ static void cxl_destroy_dc_regions(CXLType3Dev *ct3d)
>  ent = QTAILQ_FIRST(&ct3d->dc.extents);
>  cxl_remove_extent_from_extent_list(&ct3d->dc.extents, ent);
>  }
> +
> +while (!QTAILQ_EMPTY(&ct3d->dc.extents_pending_to_add)) {

QTAILQ_FOREACH_SAFE

> +ent = QTAILQ_FIRST(&ct3d->dc.extents_pending_to_add);
> +cxl_remove_extent_from_extent_list(&ct3d->dc.extents_pending_to_add,
> +   ent);
> +}
>  }
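
i.e. something like this sketch, where 'next' is just the scratch pointer
the macro needs:

CXLDCExtent *ent, *next;

QTAILQ_FOREACH_SAFE(ent, &ct3d->dc.extents_pending_to_add, node, next) {
    cxl_remove_extent_from_extent_list(&ct3d->dc.extents_pending_to_add,
                                       ent);
}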

> +/*
> + * The main function to process dynamic capacity events. Currently DC extent
> + * add/release requests are processed.
> + */
> +static void qmp_cxl_process_dynamic_capacity(const char *path, CxlEventLog 
> log,
> + CXLDCEventType type, uint16_t 
> hid,
> + uint8_t rid,
> + CXLDCExtentRecordList *records,
> + Error **errp)
> +{
> +Object *obj;
> +CXLEventDynamicCapacity dCap = {};
> +CXLEventRecordHdr *hdr = &dCap.hdr;
> +CXLType3Dev *dcd;
> +uint8_t flags = 1 << CXL_EVENT_TYPE_INFO;
> +uint32_t num_extents = 0;
> +CXLDCExtentRecordList *list;
> +g_autofree CXLDCExtentRaw *extents = NULL;
> +uint8_t enc_log;
> +uint64_t offset, len, block_size;
> +int i;
> +int rc;

Combine the two lines above.

> +g_autofree unsigned long *blk_bitmap = NULL;
> +
> +obj = object_resolve_path(path, NULL);
> +if (!obj) {
> +error_setg(errp, "Unable to resolve path");
> +return;
> +}

object_resolve_path_type() and skip a step (should do this in various places
in our existing code!)

> +if (!object_dynamic_cast(obj, TYPE_CXL_TYPE3)) {
> +error_setg(errp, "Path not point to a valid CXL type3 device");
> +return;
> +}
> +
> +dcd = CXL_TYPE3(obj);
> +if (!dcd->dc.num_regions) {
> +error_setg(errp, "No dynamic capacity support from the device");
> +return;
> +}
> +
> +rc = ct3d_qmp_cxl_event_log_enc(log);
> +if (rc < 0) {
> +error_setg(errp, "Unhandled error log type");
> +return;
> +}
> +enc_log = rc;
> +
> +if (rid >= dcd->dc.num_regions) {
> +error_setg(errp, "region id is too large");
> +return;
> +}
> +block_size = dcd->dc.regions[rid].block_size;
> +
> +/* Sanity check and count the extents */
> +list = records;
> +while (list) {
> +offset = list->value->offset;
> +len = list->value->len;
> +
> +if (len == 0) {
> +error_setg(errp, "extent with 0 length is not allowed");
> +return;
> +}
> +
> +if (offset % block_size || len % block_size) {
> +error_setg(errp, "dpa or len is not aligned to region block 
> size");
> +

Re: [PATCH v5 08/13] hw/cxl/cxl-mailbox-utils: Add mailbox commands to support add/release dynamic capacity response

2024-03-06 Thread Jonathan Cameron via
On Mon,  4 Mar 2024 11:34:03 -0800
nifan@gmail.com wrote:

> From: Fan Ni 
> 
> Per CXL spec 3.1, two mailbox commands are implemented:
> Add Dynamic Capacity Response (Opcode 4802h) 8.2.9.9.9.3, and
> Release Dynamic Capacity (Opcode 4803h) 8.2.9.9.9.4.
> 
> Signed-off-by: Fan Ni 

Hmm. So I had a thought which would work for what you
have here. See include/qemu/range.h.
I like the region merging stuff that is also in the list operators,
but we shouldn't use that because we have other reasons not to
fuse ranges (sequence numbering etc.).

We could make an extent a wrapper around a struct Range though
so that we can use the comparison stuff directly.
+ we can use the list manipulation in there as the basis for a future
extent merging infrastructure that is tag and sequence number (if
provided - so shared capacity or pmem) aware.
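
Roughly what I have in mind, as a sketch only. The type and field names are
made up, and only comparison helpers that range.h is known to provide are
used:

#include "qemu/queue.h"
#include "qemu/range.h"

typedef struct CXLDCExtentRange {
    Range range;       /* DPA span; range.h supplies the comparisons */
    uint64_t seq;      /* kept outside Range so fusing stays our decision */
    uint8_t tag[0x10];
    QTAILQ_ENTRY(CXLDCExtentRange) node;
} CXLDCExtentRange;

/* Does this extent contain the given DPA? */
static bool extent_covers_dpa(CXLDCExtentRange *ext, uint64_t dpa)
{
    return range_contains(&ext->range, dpa);
}

/* Does [dpa, dpa + len) overlap this extent at all? */
static bool extent_overlaps(CXLDCExtentRange *ext, uint64_t dpa, uint64_t len)
{
    uint64_t lob = range_lob(&ext->range);

    return ranges_overlap(lob, range_upb(&ext->range) - lob + 1, dpa, len);
}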

Jonathan


> ---
> +
> +/*
> + * CXL r3.1 Table 8-168: Add Dynamic Capacity Response Input Payload
> + * CXL r3.1 Table 8-170: Release Dynamic Capacity Input Payload
> + */
> +typedef struct CXLUpdateDCExtentListInPl {
> +uint32_t num_entries_updated;
> +uint8_t flags;
> +uint8_t rsvd[3];
> +/* CXL r3.1 Table 8-169: Updated Extent */
> +struct {
> +uint64_t start_dpa;
> +uint64_t len;
> +uint8_t rsvd[8];
> +} QEMU_PACKED updated_entries[];
> +} QEMU_PACKED CXLUpdateDCExtentListInPl;
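
As a sanity check on this layout: the header above packs to 8 bytes and each
updated entry to 24, so the expected input payload length works out to
(illustrative, not from the patch):

CXLUpdateDCExtentListInPl *in = (void *)payload_in;
size_t expected = sizeof(*in) +
                  in->num_entries_updated * sizeof(in->updated_entries[0]);
/* i.e. 8 + 24 * num_entries_updated bytes for the packed layout above */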
> +
> +/*
> + * For the extents in the extent list to be operated on, check whether they
> + * are valid:
> + * 1. The extent should be in the range of a valid DC region;
> + * 2. The extent should not cross multiple regions;
> + * 3. The start DPA and the length of the extent should align with the block
> + * size of the region;
> + * 4. The address range of multiple extents in the list should not overlap.

Hmm. Interesting.  I was thinking a given add/remove command, rather than
just the individual extents, couldn't cross a region.  However I can't find
text on that so I believe your interpretation is correct. It is only
specified for the event records, but that is good enough I think.  We might
want to propose tightening the spec on this to allow devices to say no to
such complex extent lists. Maybe a nice friendly memory vendor should query
this one if it's a potential problem for real devices.  Might not be!

> + */
> +static CXLRetCode cxl_detect_malformed_extent_list(CXLType3Dev *ct3d,
> +const CXLUpdateDCExtentListInPl *in)
> +{
> +uint64_t min_block_size = UINT64_MAX;
> +CXLDCRegion *region = &ct3d->dc.regions[0];
> +CXLDCRegion *lastregion = &ct3d->dc.regions[ct3d->dc.num_regions - 1];
> +g_autofree unsigned long *blk_bitmap = NULL;
> +uint64_t dpa, len;
> +uint32_t i;
> +
> +for (i = 0; i < ct3d->dc.num_regions; i++) {
> +region = &ct3d->dc.regions[i];
> +min_block_size = MIN(min_block_size, region->block_size);
> +}
> +
> +blk_bitmap = bitmap_new((lastregion->base + lastregion->len -
> + ct3d->dc.regions[0].base) / min_block_size);
> +
> +for (i = 0; i < in->num_entries_updated; i++) {
> +dpa = in->updated_entries[i].start_dpa;
> +len = in->updated_entries[i].len;
> +
> +region = cxl_find_dc_region(ct3d, dpa, len);
> +if (!region) {
> +return CXL_MBOX_INVALID_PA;
> +}
> +
> +dpa -= ct3d->dc.regions[0].base;
> +if (dpa % region->block_size || len % region->block_size) {
> +return CXL_MBOX_INVALID_EXTENT_LIST;
> +}
> +/* the dpa range is already covered by some other extents in the list */
> +if (test_any_bits_set(blk_bitmap, dpa / min_block_size,
> +len / min_block_size)) {
> +return CXL_MBOX_INVALID_EXTENT_LIST;
> +}
> +bitmap_set(blk_bitmap, dpa / min_block_size, len / min_block_size);
> +   }
> +
> +return CXL_MBOX_SUCCESS;
> +}
> +
> +/*
> + * CXL r3.1 section 8.2.9.9.9.3: Add Dynamic Capacity Response (Opcode 4802h)
> + * An extent is added to the extent list and becomes usable only after the
> + * response is processed successfully
> + */
> +static CXLRetCode cmd_dcd_add_dyn_cap_rsp(const struct cxl_cmd *cmd,
> +  uint8_t *payload_in,
> +  size_t len_in,
> +  uint8_t *payload_out,
> +  size_t *len_out,
> +  CXLCCI *cci)
> +{
> +CXLUpdateDCExtentListInPl *in = (void *)payload_in;
> +CXLType3Dev *ct3d = CXL_TYPE3(cci->d);
> +CXLDCExtentList *extent_list = &ct3d->dc.extents;
> +CXLDCExtent *ent;
> +uint32_t i;
> +uint64_t dpa, len;
> +CXLRetCode ret;
> +
> +if (in->num_entries_updated == 0) {
> +return CXL_MBOX_SUCCESS;
> +}
> +
> +/* Adding these extents would exceed the device's extent tracking ability. */
> +if (in->num_entries_updated + ct3d->dc.total_extent_count >
> +CXL_NUM_EXTENTS_SUPPORTED) {
> + 
