Re: CXL numa error on arm64 qemu virt machine
On Thu, 9 May 2024 16:35:34 +0800 Yuquan Wang wrote:
> On Wed, May 08, 2024 at 01:02:52PM +0100, Jonathan Cameron wrote:
> > > [0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x4000-0xbfff]
> > > [0.00] ACPI: SRAT: Node 1 PXM 1 [mem 0xc000-0x13fff]
> > > [0.00] ACPI: Unknown target node for memory at 0x100, assuming node 0
> > > [0.00] NUMA: Warning: invalid memblk node 16 [mem 0x0400-0x07ff]
> > > [0.00] NUMA: Faking a node at [mem 0x0400-0x00013fff]
> > > [0.00] NUMA: NODE_DATA [mem 0x13f7f89c0-0x13f7fafff]
> > >
> > > Previous discussion:
> > > https://lore.kernel.org/linux-cxl/20231011150620.2...@huawei.com/
> > >
> > > root@debian-bullseye-arm64:~# cxl create-region -d decoder0.0 -t ram
> > > [ 68.653873] cxl region0: Bypassing cpu_cache_invalidate_memregion() for testing!
> > > [ 68.660568] Unknown target node for memory at 0x100, assuming node 0
> >
> > You need a load of kernel changes for NUMA nodes to work correctly with
> > CXL memory on arm64 platforms. I have some working code but need to tidy
> > up a few corners that came up in an internal review earlier this week.
> >
> > I have some travel coming up so it may take a week or so to get those out.
> >
> > Curiously that invalid memblk has nothing to do with the CXL fixed memory window.
> > Could you check if that is happening for you without the CXL patches?
> >
> > Thanks.
>
> I have checked it; the problem was caused by my bios firmware file. I
> changed the bios file and the numa topology works.
>
> BTW, if it is convenient, could you post the link to the patches for CXL
> memory NUMA nodes on arm64 platforms?

https://git.kernel.org/pub/scm/linux/kernel/git/jic23/cxl-staging.git/log/?h=arm-numa-fixes

I've run out of time to sort out cover letters and things, plus just before
the merge window is never a good time to get anyone to pay attention to
potentially controversial patches.
So for now I've thrown up a branch on kernel.org with Robert's series of
fixes to related code (that's queued in the ACPI tree for the merge window),
Dan Williams' series (from several years ago), and my additions that 'work'
(lightly tested) on qemu/arm64 with the generic port patches etc.

I'll send out an RFC in a couple of weeks. In the meantime let me know if
you run into any problems or have suggestions to improve them.

Jonathan

p.s. Apparently my computer is in the future. (-28 minutes and counting!)

> > Whilst it doesn't work yet (because of missing kernel support)
> > you'll need something that looks more like the generic ports test added in
> > https://gitlab.com/jic23/qemu/-/commit/6589c527920ba22fe0923b60b58d33a8e9fd371e
> >
> > Most importantly
> > -numa node,nodeid=2 -object acpi-generic-port,id=gp0,pci-bus=cxl.1,node=2
> > + the bits setting distances etc. Note CXL memory does not provide SLIT-like
> > data at the moment, so the test above won't help you identify if it is correctly
> > set up. That's a gap in general in the kernel support. Whilst we'd love
> > it if everyone moved to hmat-derived information we may need to provide
> > some fallback.
> >
> > Jonathan
>
> Many thanks
> Yuquan
Re: [PATCH v3 2/2] cxl/core: add poison creation event handler
On Fri, 3 May 2024 18:42:31 +0800 Shiyang Ruan wrote:
> 在 2024/4/24 1:57, Ira Weiny 写道:
> > Shiyang Ruan wrote:
> >> Currently the driver only traces cxl events; poison creation (for both
> >> vmem and pmem type) on a cxl memdev is silent. The OS needs to be notified
> >> so that it can handle poison pages in time. Per the CXL spec, the device
> >> error event could be signaled through FW-First and OS-First methods.
> >>
> >> So, add a poison creation event handler for the OS-First method:
> >>  - Qemu:
> >>    - CXL device reports POISON creation event to OS by MSI by sending
> >>      GMER/DER after injecting a poison record;
> >>  - CXL driver:
> >>    a. parse the POISON event from GMER/DER;
> >>    b. translate poisoned DPA to HPA (PFN);
> >>    c. enqueue poisoned PFN to memory_failure's work queue;
> >
> > I'm a bit confused by the need for this patch. Perhaps a bit more detail
> > here?
>
> Yes, I should have written more details.
>
> I want to check and make sure the HWPOISON on a CXL device (type3) is
> working properly. For example, a FSDAX filesystem created on a
> type3-pmem device, then it gets a POISON bit, the OS should be able to
> handle this POISON event: find the relevant process
>
> Currently I'm using Qemu with several simulated CXL devices, and using the
> poison injection API of Qemu to create POISON records, but the OS isn't
> notified. Only when we actively list POISON records (cxl list -L)
> will the driver fetch them and log them into trace events; then we see the
> POISON records. Memory failure wasn't triggered either.

Indeed - QEMU emulation of this is not complete. It should also be generating
the events. Ideally we'd even handle injecting silent poison (not yet detected
by the device) and have that generate the synchronous memory faults on an access.

> That's why I said "poison creation on cxl memdev is silent". Per spec,
> POISON creation should be notified to the OS. Since I'm not familiar with
> the firmware part, I'm trying to add this notification for OS-First.
>
> > More comments below.
> >>
> >> Signed-off-by: Shiyang Ruan
> >> ---
> >> drivers/cxl/core/mbox.c | 119 +-
> >> drivers/cxl/cxlmem.h | 8 +--
> >> include/linux/cxl-event.h | 18 +-
> >> 3 files changed, 125 insertions(+), 20 deletions(-)
> >>
> >> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> >> index f0f54aeccc87..76af0d73859d 100644
> >> --- a/drivers/cxl/core/mbox.c
> >> +++ b/drivers/cxl/core/mbox.c
> >> @@ -837,25 +837,116 @@ int cxl_enumerate_cmds(struct cxl_memdev_state *mds)
> >>  }
> >>  EXPORT_SYMBOL_NS_GPL(cxl_enumerate_cmds, CXL);
> >>
> >> -void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
> >> -                            enum cxl_event_log_type type,
> >> -                            enum cxl_event_type event_type,
> >> -                            const uuid_t *uuid, union cxl_event *evt)
> >> +static void cxl_report_poison(struct cxl_memdev *cxlmd, struct cxl_region *cxlr,
> >> +                              u64 dpa)
> >>  {
> >> -   if (event_type == CXL_CPER_EVENT_GEN_MEDIA)
> >> +   u64 hpa = cxl_trace_hpa(cxlr, cxlmd, dpa);
> >> +   unsigned long pfn = PHYS_PFN(hpa);
> >> +
> >> +   if (!IS_ENABLED(CONFIG_MEMORY_FAILURE))
> >> +           return;
> >> +
> >> +   memory_failure_queue(pfn, MF_ACTION_REQUIRED);
> >
> > I thought that ras daemon was supposed to take care of this when the trace
> > event occurred. Alison is working on the HPA data for that path.
>
> It seems to save CXL trace events/memory-failures to a DB and report to
> others, but cannot make the OS call memory_failure().

Interesting question of whose problem this is. For corrected errors it's
policy stuff that belongs in userspace, but for known memory failure, it
might want to be in the kernel.

Shiju (+Cc) pointed me at the existing rasdaemon handling for corrected
errors (statistics get too bad, so memory offlined)
https://github.com/mchehab/rasdaemon/commit/9ae6b70effb8adc9572debc800b8e16173f74bb8

Poison detection via scrub though is reported via a CPER Memory Error Section
with "Scrub uncorrected error" set. That triggers apei handling. On x86 that
looks to be apei_mce_report_mem_error().
Also triggers the ghes_do_memory_failure() path and ultimately
memory_failure(). Conveniently there was a patch fixing the sync path last
year that includes info on what happens in the async case.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a70297d2213253853e95f5b49651f924990c6d3b

"In addition, for asynchronous errors, kernel will notify the process who
owns the poisoned page by sending SIGBUS with BUS_MCEERR_AO in early kill
mode."

So I think the kernel should probably do the same for CXL poison error
records when it gets them.

Jonathan
Re: CXL numa error on arm64 qemu virt machine
On Wed, 8 May 2024 16:00:51 +0800 Yuquan Wang wrote:
> Hello, Jonathan
>
> Recently I ran some cxl tests on qemu virt (branch: cxl-2024-04-22-draft)
> but met some problems.
>
> Problems:
> 1) the virt machine could not set the right numa topology from user input;
>
> My Qemu numa set:
> -object memory-backend-ram,size=2G,id=mem0 \
> -numa node,nodeid=0,cpus=0-1,memdev=mem0 \
> -object memory-backend-ram,size=2G,id=mem1 \
> -numa node,nodeid=1,cpus=2-3,memdev=mem1 \

That is setting up the main DRAM nodes, unrelated to CXL memory.
For CXL memory you need to use generic port entries (in my gitlab qemu
tree - with examples but not upstream yet). However, if you get some breakage

> However, the system shows:
> root@ubuntu-jammy-arm64:~# numactl -H
> available: 1 nodes (0)
> node 0 cpus: 0 1 2 3
> node 0 size: 4166 MB
> node 0 free: 3920 MB
> node distances:
> node 0
> 0: 10
>
> Boot Kernel print:
> [0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x4000-0xbfff]
> [0.00] ACPI: SRAT: Node 1 PXM 1 [mem 0xc000-0x13fff]
> [0.00] ACPI: Unknown target node for memory at 0x100, assuming node 0
> [0.00] NUMA: Warning: invalid memblk node 16 [mem 0x0400-0x07ff]
> [0.00] NUMA: Faking a node at [mem 0x0400-0x00013fff]
> [0.00] NUMA: NODE_DATA [mem 0x13f7f89c0-0x13f7fafff]
>
> 2) it seems like the problem of allocating a numa node in arm for cxl memory
> still exists;
> Previous discussion:
> https://lore.kernel.org/linux-cxl/20231011150620.2...@huawei.com/
>
> root@debian-bullseye-arm64:~# cxl create-region -d decoder0.0 -t ram
> [ 68.653873] cxl region0: Bypassing cpu_cache_invalidate_memregion() for testing!
> [ 68.660568] Unknown target node for memory at 0x100, assuming node 0

You need a load of kernel changes for NUMA nodes to work correctly with
CXL memory on arm64 platforms. I have some working code but need to tidy
up a few corners that came up in an internal review earlier this week.

I have some travel coming up so it may take a week or so to get those out.
Curiously that invalid memblk has nothing to do with the CXL fixed memory
window. Could you check if that is happening for you without the CXL patches?

> If not, maybe I could try to do something to help fix this problem.
>
> My full qemu command line:
> qemu-system-aarch64 \
> -M virt,gic-version=3,cxl=on \
> -m 4G \
> -smp 4 \
> -object memory-backend-ram,size=2G,id=mem0 \
> -numa node,nodeid=0,cpus=0-1,memdev=mem0 \
> -object memory-backend-ram,size=2G,id=mem1 \
> -numa node,nodeid=1,cpus=2-3,memdev=mem1 \
> -cpu cortex-a57 \
> -bios QEMU_EFI.fd.bak \
> -device virtio-blk-pci,drive=hd,bus=pcie.0 \
> -drive if=none,id=hd,file=../disk/debos_arm64.ext \
> -nographic \
> -object memory-backend-file,id=mem2,mem-path=/tmp/mem2,size=256M,share=true \
> -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
> -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
> -device cxl-type3,bus=root_port13,volatile-memdev=mem2,id=cxl-mem1 \
> -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G \
> -qmp tcp:127.0.0.1:,server,nowait \
>
> Qemu version: the latest commit of branch cxl-2024-04-22-draft in
> "https://gitlab.com/jic23/qemu"
> Kernel version: 6.6.0

Whilst it doesn't work yet (because of missing kernel support)
you'll need something that looks more like the generic ports test added in
https://gitlab.com/jic23/qemu/-/commit/6589c527920ba22fe0923b60b58d33a8e9fd371e

Most importantly
-numa node,nodeid=2 -object acpi-generic-port,id=gp0,pci-bus=cxl.1,node=2
+ the bits setting distances etc. Note CXL memory does not provide SLIT-like
data at the moment, so the test above won't help you identify if it is
correctly set up. That's a gap in general in the kernel support. Whilst we'd
love it if everyone moved to hmat-derived information we may need to provide
some fallback.

Jonathan

> Many thanks
> Yuquan
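[Editor's sketch] To make the suggestion above concrete, here is a hedged sketch of the extra arguments spliced into Yuquan's command line. The option names come from the gitlab test commit linked above; `gp0` and the distance values are made up for illustration, and since the acpi-generic-port object is not upstream QEMU the syntax may still change:

```shell
# Sketch only: generic-port wiring for the command line quoted above.
# The acpi-generic-port object is from the gitlab.com/jic23/qemu tree
# (not upstream); ids and distance values here are illustrative.
qemu-system-aarch64 \
    -M virt,gic-version=3,cxl=on -m 4G -smp 4 \
    ... \
    -numa node,nodeid=2 \
    -object acpi-generic-port,id=gp0,pci-bus=cxl.1,node=2 \
    -numa dist,src=0,dst=2,val=20 \
    -numa dist,src=1,dst=2,val=20 \
    -numa dist,src=2,dst=2,val=10
```

The `node=2` ties the generic port to a spare NUMA node that carries no CPUs or memory of its own; the `-numa dist` entries then describe the host-to-port part of the path.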
Re: [PATCH v2] mem/cxl_type3: support 3, 6, 12 and 16 interleave ways
On Tue, 7 May 2024 00:22:00 + "Xingtao Yao (Fujitsu)" wrote:
> > -Original Message-
> > From: Jonathan Cameron
> > Sent: Tuesday, April 30, 2024 10:43 PM
> > To: Yao, Xingtao/姚 幸涛
> > Cc: fan...@samsung.com; qemu-devel@nongnu.org
> > Subject: Re: [PATCH v2] mem/cxl_type3: support 3, 6, 12 and 16 interleave ways
> >
> > On Wed, 24 Apr 2024 01:36:56 + "Xingtao Yao (Fujitsu)" wrote:
> > > ping.
> > >
> > > > -Original Message-
> > > > From: Yao Xingtao
> > > > Sent: Sunday, April 7, 2024 11:07 AM
> > > > To: jonathan.came...@huawei.com; fan...@samsung.com
> > > > Cc: qemu-devel@nongnu.org; Yao, Xingtao/姚 幸涛
> > > > Subject: [PATCH v2] mem/cxl_type3: support 3, 6, 12 and 16 interleave ways
> > > >
> > > > Since the kernel does not check the interleave capability, a 3-way,
> > > > 6-way, 12-way or 16-way region can be created normally.
> > > >
> > > > Applications can access the memory of a 16-way region normally because
> > > > qemu can convert hpa to dpa correctly for power-of-2 interleave ways;
> > > > after the kernel implements the check, this kind of region will not be
> > > > created any more.
> > > >
> > > > For non-power-of-2 interleave ways, applications could not access the
> > > > memory normally and may hit some unexpected behaviors, such as
> > > > segmentation fault.
> > > >
> > > > So implementing this feature is needed.
> > > >
> > > > Link: https://lore.kernel.org/linux-cxl/3e84b919-7631-d1db-3e1d-33000f3f3868@fujitsu.com/
> > > > Signed-off-by: Yao Xingtao
> > > > ---
> > > > hw/mem/cxl_type3.c | 18 ++
> > > > 1 file changed, 14 insertions(+), 4 deletions(-)
> > > >
> > > > diff --git a/hw/mem/cxl_type3.c b/hw/mem/cxl_type3.c
> > > > index b0a7e9f11b..d6ef784e96 100644
> > > > --- a/hw/mem/cxl_type3.c
> > > > +++ b/hw/mem/cxl_type3.c
> > > > @@ -805,10 +805,17 @@ static bool cxl_type3_dpa(CXLType3Dev *ct3d, hwaddr host_addr, uint64_t *dpa)
> > > >             continue;
> > > >         }
> > > >
> > > > -        *dpa = dpa_base +
> > > > -            ((MAKE_64BIT_MASK(0, 8 + ig) & hpa_offset) |
> > > > -             ((MAKE_64BIT_MASK(8 + ig + iw, 64 - 8 - ig - iw) & hpa_offset)
> > > > -              >> iw));
> > > > +        if (iw < 8) {
> > > > +            *dpa = dpa_base +
> > > > +                ((MAKE_64BIT_MASK(0, 8 + ig) & hpa_offset) |
> > > > +                 ((MAKE_64BIT_MASK(8 + ig + iw, 64 - 8 - ig - iw) & hpa_offset)
> > > > +                  >> iw));
> > > > +        } else {
> > > > +            *dpa = dpa_base +
> > > > +                ((MAKE_64BIT_MASK(0, 8 + ig) & hpa_offset) |
> > > > +                 ((((MAKE_64BIT_MASK(ig + iw, 64 - ig - iw) & hpa_offset)
> > > > +                    >> (ig + iw)) / 3) << (ig + 8)));
> > > > +        }
> > > >
> > > >         return true;
> > > >     }
> > > > @@ -906,6 +913,9 @@ static void ct3d_reset(DeviceState *dev)
> > > >     uint32_t *write_msk = ct3d->cxl_cstate.crb.cache_mem_regs_write_mask;
> > > >
> > > >     cxl_component_register_init_common(reg_state, write_msk, CXL2_TYPE3_DEVICE);
> > > > +    ARRAY_FIELD_DP32(reg_state, CXL_HDM_DECODER_CAPABILITY, 3_6_12_WAY, 1);
> > > > +    ARRAY_FIELD_DP32(reg_state, CXL_HDM_DECODER_CAPABILITY, 16_WAY, 1);
> > > > +

Why here rather than in hdm_reg_init_common()? It's constant data and is
currently being set to 0 in there.
> > according to the CXL specification (8.2.4.20.1 CXL HDM Decoder Capability
> > Register (Offset 00h)), this feature is only applicable to cxl.mem;
> > upstream switch ports and CXL host bridges shall hard-wire these bits to 0.
>
> so I think it would be more appropriate to set these bits here.

I don't follow. hdm_init_common() (sorry, wrong function name above) has
some type-specific stuff already to show how this can be done. I'd prefer
to minimize what we set directly in the ct3d_reset() call because it loses
the connection to the rest of the register setup.

Jonathan

> > > > >     cxl_device_register_init_t3(ct3d);
> > > > >
> > > > >     /*
> > > > --
> > > > 2.37.3
> > > >
Re: [PATCH v7 09/12] hw/cxl/events: Add qmp interfaces to add/release dynamic capacity extents
> > >> > +# @hid: host id
> > >>
> > >> @host-id, unless "HID" is established terminology in CXL DCD land.
> > >
> > > host-id works.
> > >>
> > >> What is a host ID?
> > >
> > > It is an id identifying the host to which the capacity is being added.
> >
> > How are these IDs assigned?
>
> All the arguments passed to the command here are defined in the CXL spec.
> I will add a reference to the spec.
>
> Based on the spec, for LD-FAM (Fabric attached memory represented as a
> logical device), the host id is the LD-ID of the host interface to which
> the capacity is being added. LD-ID is a unique number (16-bit) assigned
> to a host interface.

Key here is the host doesn't know it. This ID exists purely for routing
to the appropriate host interface, either via choosing a port on a
multi-head Single Logical Device (SLD) (so today it's always 0 as we only
have one head), or, if we ever implement a switch capable of handling MLDs,
then the switch will handle routing of host PCIe accesses so they land on
the interface defined by this ID (and the event turns up in that event log).

[Diagram reflowed from the original; layout approximate.]

  Host A    Host B   - could in theory be a RP on host A ;)
    |         |        Doesn't exist (yet, but there are partial
   _|_________|_       patches for this on list.)
  | LD 0   LD 1 |
  |             |
  | Multi Head  |
  | Single      |
  | Logical     |
  | Device      |
  | (MH-SLD)    |
  |_____________|

The host view is similar to the switch case, but with just two directly
connected devices.

Or the Switch and MLD case - we aren't emulating this yet at all:

   Wiring / real topology            Host View

   Host A        Host B           Host A      Host B
     |             |                |           |
  ___|_____________|___            _|_         _|_
  |  \   SWITCH   /   |           |SW0|       |SW0|
  |   \          /    |           |___|       |___|
  |   LD0      LD1    |             |           |
  |     \      /      |             |           |
  |      \    /       |          ___|____   ____|___
  |       |  |        |         | Simple | | Another|
  |       |  | Traffic tagged   | CXL    | | CXL    |
  |       |  | with LD          | Memory | | Memory |
  |  _____|__|______  |         | Device | | Device |
  | /               \ |         |________| |________|
  | |  Interfaces   | |
  | |  LD0    LD1   | |
  | |_______________| |
  |  Multilogical Device (MLD)

Note the hosts just see separate devices and switches, with the fun
exception that the memory may actually be available to both at the same
time.
Control plane for the switches and MLD sees what is actually going on.

At this stage the upshot is we could just default this to zero and add an
optional parameter to set it later.

...

> > >> > +# @extents: Extents to release
> > >> > +#
> > >> > +# Since : 9.1
> > >> > +##
> > >> > +{ 'command': 'cxl-release-dynamic-capacity',
> > >> > +  'data': { 'path': 'str',
> > >> > +            'hid': 'uint16',
> > >> > +            'flags': 'uint8',
> > >> > +            'region-id': 'uint8',
> > >> > +            'tag': 'str',
> > >> > +            'extents': [ 'CXLDCExtentRecord' ]
> > >> > +  }
> > >> > +}
> > >>
> > >> During review of v5, you wrote:
> > >>
> > >>   For add command, the host will send a mailbox command in response to
> > >>   the add request to the device to indicate whether it accepts the add
> > >>   capacity offer or not.
> > >>
> > >>   For release command, the host sends a mailbox command (not always a
> > >>   response, since the host can proactively release capacity if it does
> > >>   not need it any more) to the device to ask the device to release the
> > >>   capacity.
> > >>
> > >> Can you briefly sketch the protocol? Peers and messages involved.
> > >> Possibly as a state diagram.
> > >
> > > Need to think about it. If we can polish the text nicely, maybe the
> > > sketch is not needed. My concern is that the sketch may introduce
> > > unwanted complexity as we expose too much detail. The two commands
> > > provide ways to add/release dynamic capacity to/from a host, that is
> > > all. All the other information, like what the host will do, or how the
> > > device will react, is a consequence of the command; not sure whether we
> > > want to include it here.
> >
> > The protocol sketch is for me, not necessarily the doc comment. I'd
> > like to understand at a high level how this stuff works, because only
> > then can I meaningfully review the docs.
>
> For add command, saying a user sends a request to FM to ask
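[Editor's sketch] The add/release handshake described in the review exchange above can be summarised as a tiny state machine: the FM asks the device to offer capacity, the device raises an event, and the host must accept the extent via a mailbox response before it is usable. The class and state names below are invented for illustration, not taken from the QEMU or kernel code:

```python
from enum import Enum, auto

class ExtentState(Enum):
    OFFERED = auto()    # device advertised the extent in an event record
    ACCEPTED = auto()   # host sent an Add Dynamic Capacity Response
    RELEASED = auto()   # host asked the device to release it again

class DcdSketch:
    """Toy model of the DCD add/release flow sketched in the thread."""
    def __init__(self):
        self.extents = {}

    def fm_add(self, offset, length):
        # FM -> device: offer capacity; device raises an event to the host.
        self.extents[(offset, length)] = ExtentState.OFFERED

    def host_accept(self, offset, length):
        # Host -> device mailbox: only an offered extent can be accepted.
        assert self.extents[(offset, length)] is ExtentState.OFFERED
        self.extents[(offset, length)] = ExtentState.ACCEPTED

    def host_release(self, offset, length):
        # Host may proactively release capacity it no longer needs.
        assert self.extents[(offset, length)] is ExtentState.ACCEPTED
        self.extents[(offset, length)] = ExtentState.RELEASED

dcd = DcdSketch()
dcd.fm_add(0, 128 << 20)
dcd.host_accept(0, 128 << 20)
assert dcd.extents[(0, 128 << 20)] is ExtentState.ACCEPTED
```

The key property the model captures is the ordering constraint: an extent is never usable until the host has explicitly acknowledged the offer.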
Re: [PATCH v7 09/12] hw/cxl/events: Add qmp interfaces to add/release dynamic capacity extents
On Mon, 29 Apr 2024 09:58:42 +0200 Markus Armbruster wrote: > fan writes: > > > On Fri, Apr 26, 2024 at 11:12:50AM +0200, Markus Armbruster wrote: > >> nifan@gmail.com writes: > > [...] > > >> > diff --git a/qapi/cxl.json b/qapi/cxl.json > >> > index 4281726dec..2dcf03d973 100644 > >> > --- a/qapi/cxl.json > >> > +++ b/qapi/cxl.json > >> > @@ -361,3 +361,72 @@ > >> > ## > >> > {'command': 'cxl-inject-correctable-error', > >> > 'data': {'path': 'str', 'type': 'CxlCorErrorType'}} > >> > + > >> > +## > >> > +# @CXLDCExtentRecord: > >> > >> Such traffic jams of capital letters are hard to read. What about > >> CxlDynamicCapacityExtent? > >> > >> > +# > >> > +# Record of a single extent to add/release > >> > >> Suggest "A dynamic capacity extent." > >> > >> > +# > >> > +# @offset: offset to the start of the region where the extent to be > >> > operated > >> > >> Blank line here, please. > >> > >> > >> > >> > +# @len: length of the extent > >> > +# > >> > +# Since: 9.1 > >> > +## > >> > +{ 'struct': 'CXLDCExtentRecord', > >> > + 'data': { > >> > + 'offset':'uint64', > >> > + 'len': 'uint64' > >> > + } > >> > +} > >> > + > >> > +## > >> > +# @cxl-add-dynamic-capacity: > >> > +# > >> > +# Command to start add dynamic capacity extents flow. The device will > >> > +# have to acknowledged the acceptance of the extents before they are > >> > usable. > >> > >> This text needs work. More on that at the end of my review. > > > > Yes. I will work on it for the next version once all the feedbacks > > are collected and comments are resolved. > > > > See below. > > > >> > >> docs/devel/qapi-code-gen.rst: > >> > >> For legibility, wrap text paragraphs so every line is at most 70 > >> characters long. > >> > >> Separate sentences with two spaces. > >> > >> More elsewhere. > >> > >> > +# > >> > +# @path: CXL DCD canonical QOM path > >> > >> I'd prefer @qom-path, unless you can make a consistency argument for > >> @path. > >> > >> Sure the QOM path needs to be canonical? 
> >>
> >> If not, what about "path to the CXL dynamic capacity device in the QOM
> >> tree". Intentionally close to existing descriptions of @qom-path
> >> elsewhere.
> >
> > From the same file, I saw "path" was used for other commands, like
> > "cxl-inject-memory-module-event", so I followed it.
> > DCD is nothing different from "type 3 device" except that it can
> > dynamically change capacity.
> > Renaming it to "qom-path" is no problem for me; I just want to make sure
> > it will not break the naming consistency.
>
> Both @path and @qom-path are used (sadly). @path is used for all kinds
> of paths, whereas @qom-path is only used for QOM paths. That's why I
> prefer it.
>
> However, you're making a compelling local consistency argument: cxl.json
> uses only @path. Sticking to that makes sense.
>
> >> > +# @hid: host id
> >>
> >> @host-id, unless "HID" is established terminology in CXL DCD land.
> >
> > host-id works.
> >>
> >> What is a host ID?
> >
> > It is an id identifying the host to which the capacity is being added.
>
> How are these IDs assigned?

Right now there is only 1 option. We can drop this for now and introduce it
when needed (a default of 0 will be fine). Multi-head device patches that
will need this are on list, though I haven't read them yet :(
> >> Perhaps "number of the region the extent is to be added to". Not
> >> entirely happy with the phrasing, doesn't exactly roll off the tongue,
> >> but "where the extent to add" sounds worse to my ears. Mind, I'm not a
> >> native speaker.
> >
> > Yes. region number is fine. Will rename it as "region"
> >
> >> > +# @tag: Context field
> >>
> >> What is this about?
> >
> > Based on the specification, it is "Context field utilized by
> > implementations that make use of the Dynamic Capacity feature.".
> > Basically, it is a string (label) attached to a dynamic capacity extent
> > so we can achieve a specific purpose, like identifying or grouping
> > extents.
>
> Include a reference to the spec here?

Agreed - that is the best we can do. It's a magic value.

> >> > +# @extents: Extents to add
> >>
> >> Blank lines between argument descriptions, please.
> >>
> >> > +#
> >>
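[Editor's sketch] For orientation, an invocation of the command under review might look like the following over QMP. This is a sketch against the v7 schema quoted in this thread: the QOM path, policy value, and extent are invented for illustration, and the field names may yet change as a result of this very review:

```json
{ "execute": "cxl-add-dynamic-capacity",
  "arguments": {
    "path": "/machine/peripheral/cxl-dcd0",
    "hid": 0,
    "selection-policy": 2,
    "region-id": 0,
    "tag": "",
    "extents": [ { "offset": 0, "len": 134217728 } ]
  }
}
```

After this, the guest kernel would see the extent offered in a Dynamic Capacity event record and accept it via a mailbox response before the 128 MiB becomes usable.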
Re: [PATCH v2] mem/cxl_type3: support 3, 6, 12 and 16 interleave ways
On Wed, 24 Apr 2024 01:36:56 + "Xingtao Yao (Fujitsu)" wrote:
> ping.
>
> > -Original Message-
> > From: Yao Xingtao
> > Sent: Sunday, April 7, 2024 11:07 AM
> > To: jonathan.came...@huawei.com; fan...@samsung.com
> > Cc: qemu-devel@nongnu.org; Yao, Xingtao/姚 幸涛
> > Subject: [PATCH v2] mem/cxl_type3: support 3, 6, 12 and 16 interleave ways
> >
> > Since the kernel does not check the interleave capability, a 3-way,
> > 6-way, 12-way or 16-way region can be created normally.
> >
> > Applications can access the memory of a 16-way region normally because
> > qemu can convert hpa to dpa correctly for power-of-2 interleave ways;
> > after the kernel implements the check, this kind of region will not be
> > created any more.
> >
> > For non-power-of-2 interleave ways, applications could not access the
> > memory normally and may hit some unexpected behaviors, such as
> > segmentation fault.
> >
> > So implementing this feature is needed.
> >
> > Link: https://lore.kernel.org/linux-cxl/3e84b919-7631-d1db-3e1d-33000f3f3868@fujitsu.com/
> > Signed-off-by: Yao Xingtao
> > ---
> > hw/mem/cxl_type3.c | 18 ++
> > 1 file changed, 14 insertions(+), 4 deletions(-)
> >
> > diff --git a/hw/mem/cxl_type3.c b/hw/mem/cxl_type3.c
> > index b0a7e9f11b..d6ef784e96 100644
> > --- a/hw/mem/cxl_type3.c
> > +++ b/hw/mem/cxl_type3.c
> > @@ -805,10 +805,17 @@ static bool cxl_type3_dpa(CXLType3Dev *ct3d, hwaddr host_addr, uint64_t *dpa)
> >             continue;
> >         }
> >
> > -        *dpa = dpa_base +
> > -            ((MAKE_64BIT_MASK(0, 8 + ig) & hpa_offset) |
> > -             ((MAKE_64BIT_MASK(8 + ig + iw, 64 - 8 - ig - iw) & hpa_offset)
> > -              >> iw));
> > +        if (iw < 8) {
> > +            *dpa = dpa_base +
> > +                ((MAKE_64BIT_MASK(0, 8 + ig) & hpa_offset) |
> > +                 ((MAKE_64BIT_MASK(8 + ig + iw, 64 - 8 - ig - iw) & hpa_offset)
> > +                  >> iw));
> > +        } else {
> > +            *dpa = dpa_base +
> > +                ((MAKE_64BIT_MASK(0, 8 + ig) & hpa_offset) |
> > +                 ((((MAKE_64BIT_MASK(ig + iw, 64 - ig - iw) & hpa_offset)
> > +                    >> (ig + iw)) / 3) << (ig + 8)));
> > +        }
> > > > return true; > > } > > @@ -906,6 +913,9 @@ static void ct3d_reset(DeviceState *dev) > > uint32_t *write_msk = ct3d->cxl_cstate.crb.cache_mem_regs_write_mask; > > > > cxl_component_register_init_common(reg_state, write_msk, > > CXL2_TYPE3_DEVICE); > > +ARRAY_FIELD_DP32(reg_state, CXL_HDM_DECODER_CAPABILITY, > > 3_6_12_WAY, 1); > > +ARRAY_FIELD_DP32(reg_state, CXL_HDM_DECODER_CAPABILITY, > > 16_WAY, 1); > > + Why here rather than in hdm_reg_init_common()? It's constant data and is currently being set to 0 in there. > > cxl_device_register_init_t3(ct3d); > > > > /* > > -- > > 2.37.3 >
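[Editor's sketch] The offset math in the patch can be sanity-checked outside QEMU. Below is a hedged Python re-implementation of the two branches (not the exact QEMU source), using the register encodings discussed in the thread: `iw < 8` means 2^iw interleave ways, `iw` = 8, 9, 10 means 3, 6, 12 ways, and the granularity is 256 << ig bytes:

```python
def hpa_offset_to_dpa_offset(hpa_offset: int, ig: int, iw: int) -> int:
    """Sketch of cxl_type3_dpa()'s offset math.

    The low 8 + ig bits (the offset within one interleave chunk) pass
    through unchanged; the way-select bits above them are either
    stripped (power-of-2 ways) or divided out (3/6/12 ways).
    """
    low = hpa_offset & ((1 << (8 + ig)) - 1)
    if iw < 8:                       # 1, 2, 4, ... 256 ways
        high = (hpa_offset >> (8 + ig + iw)) << (8 + ig)
    else:                            # 3, 6, 12 ways (iw = 8, 9, 10)
        high = ((hpa_offset >> (ig + iw)) // 3) << (8 + ig)
    return low | high

# 2-way, 256-byte granularity: HPA offset 512 is HPA chunk 2, which is
# this device's second chunk -> DPA offset 256.
assert hpa_offset_to_dpa_offset(512, 0, 1) == 256
# 3-way, 256-byte granularity: HPA chunks 0, 3, 6, ... land on this
# device, so HPA offset 768 (chunk 3) maps to DPA offset 256.
assert hpa_offset_to_dpa_offset(768, 0, 8) == 256
```

This also shows why the non-power-of-2 case needs the division by 3: the way-select portion of the address can no longer be removed with a plain shift.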
Re: [PATCH 3/6] hw/acpi: Generic Port Affinity Structure support
On Tue, 30 Apr 2024 08:55:12 +0200 Markus Armbruster wrote: > Jonathan Cameron writes: > > > On Tue, 23 Apr 2024 12:56:21 +0200 > > Markus Armbruster wrote: > > > >> Jonathan Cameron writes: > >> > >> > These are very similar to the recently added Generic Initiators > >> > but instead of representing an initiator of memory traffic they > >> > represent an edge point beyond which may lie either targets or > >> > initiators. Here we add these ports such that they may > >> > be targets of hmat_lb records to describe the latency and > >> > bandwidth from host side initiators to the port. A descoverable > >> > mechanism such as UEFI CDAT read from CXL devices and switches > >> > is used to discover the remainder fo the path and the OS can build > >> > up full latency and bandwidth numbers as need for work and data > >> > placement decisions. > >> > > >> > Signed-off-by: Jonathan Cameron > > > > Hi Markus, > > > > I've again managed a bad job of defining an interface - thanks for > > your help! > > Good interfaces are hard! > > >> > --- > >> > qapi/qom.json| 18 +++ > >> > include/hw/acpi/acpi_generic_initiator.h | 18 ++- > >> > include/hw/pci/pci_bridge.h | 1 + > >> > hw/acpi/acpi_generic_initiator.c | 141 +-- > >> > hw/pci-bridge/pci_expander_bridge.c | 1 - > >> > 5 files changed, 141 insertions(+), 38 deletions(-) > >> > > >> > diff --git a/qapi/qom.json b/qapi/qom.json > >> > index 85e6b4f84a..5480d9ca24 100644 > >> > --- a/qapi/qom.json > >> > +++ b/qapi/qom.json > >> > @@ -826,6 +826,22 @@ > >> >'data': { 'pci-dev': 'str', > >> > 'node': 'uint32' } } > >> > > >> > + > >> > +## > >> > +# @AcpiGenericPortProperties: > >> > +# > >> > +# Properties for acpi-generic-port objects. > >> > +# > >> > +# @pci-bus: PCI bus of the hostbridge associated with this SRAT entry > >> > > >> > >> What's this exactly? A QOM path? A qdev ID? Something else? > > > > QOM path I believe as going to call object_resolve_path_type() on it. > > QOM path then. 
> > Oddity is it's defined for the bus, not the host bridge that we care
> > about, as the host bridge doesn't have a convenient id to let us
> > identify it.
> >
> > e.g. It is specified via --device pxb-cxl,id= of TYPE_PXB_CXL_HOST in
> > the command line but ends up on the TYPE_PCI_BUS with parent set to the
> > PXB_CXL_HOST. Normally we just want this bus for hanging root ports off
> > it.
> >
> > I can clarify it's the QOM path but I'm struggling a bit to explain the
> > relationship without resorting to an example. This should also not
> > mention SRAT as at some stage I'd expect DT bindings to provide similar
> > functionality.
>
> Let's start with an example. Not to put it into the doc comment, only
> to help me understand what you need. Hopefully I can then assist with
> improving the interface and/or its documentation.

Stripping out some relevant bits from a test setup and editing it down -
most of this is about creating the relevant SLIT/HMAT tables.

# First a CXL root bridge, root port and direct attached device plus fixed
# memory window. Linux currently builds a NUMA node per fixed memory window
# but that's a simplification that may change over time.
-device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1,hdm_for_passthrough=true \
-device cxl-rp,port=0,bus=cxl.1,id=cxl_rp_port0,chassis=0,slot=2 \
-device cxl-type3,bus=cxl_rp_port0,volatile-memdev=cxl-mem1,persistent-memdev=cxl-mem2,id=cxl-pmem1,lsa=cxl-lsa1,sn=3 \
-machine cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G,cxl-fmw.0.interleave-granularity=1k \

# Next line is the port definition - see that the pci-bus refers to the one
# in the id parameter for the PXB CXL, but the ACPI table that is generated
# refers to the DSDT entry via an ACPI0016 entry. So to get to that we use
# the PCI bus ID of the root bus that forms part of the root bridge (but is
# a child object in qemu).
-numa node,nodeid=2 \
-object acpi-generic-port,id=bob11,pci-bus=cxl.1,node=2 \

# The rest is the setup for the hmat and slit tables.
I hid most of the config, but
# left this here as the key point is that we specify values to and from the
# generic port 'node', but it's not really a node as such, just a point
# along the path to one.
-numa dist,src=0,dst=0,val=10 -nu
Re: [PATCH 3/6] hw/acpi: Generic Port Affinity Structure support
On Tue, 23 Apr 2024 12:56:21 +0200 Markus Armbruster wrote:
> Jonathan Cameron writes:
>
> > These are very similar to the recently added Generic Initiators but
> > instead of representing an initiator of memory traffic they represent
> > an edge point beyond which may lie either targets or initiators. Here
> > we add these ports such that they may be targets of hmat_lb records to
> > describe the latency and bandwidth from host side initiators to the
> > port. A discoverable mechanism such as UEFI CDAT read from CXL devices
> > and switches is used to discover the remainder of the path, and the OS
> > can build up full latency and bandwidth numbers as needed for work and
> > data placement decisions.
> >
> > Signed-off-by: Jonathan Cameron

Hi Markus,

I've again managed a bad job of defining an interface - thanks for
your help!

> > ---
> > qapi/qom.json | 18 +++
> > include/hw/acpi/acpi_generic_initiator.h | 18 ++-
> > include/hw/pci/pci_bridge.h | 1 +
> > hw/acpi/acpi_generic_initiator.c | 141 +--
> > hw/pci-bridge/pci_expander_bridge.c | 1 -
> > 5 files changed, 141 insertions(+), 38 deletions(-)
> >
> > diff --git a/qapi/qom.json b/qapi/qom.json
> > index 85e6b4f84a..5480d9ca24 100644
> > --- a/qapi/qom.json
> > +++ b/qapi/qom.json
> > @@ -826,6 +826,22 @@
> >    'data': { 'pci-dev': 'str',
> >              'node': 'uint32' } }
> >
> > +
> > +##
> > +# @AcpiGenericPortProperties:
> > +#
> > +# Properties for acpi-generic-port objects.
> > +#
> > +# @pci-bus: PCI bus of the hostbridge associated with this SRAT entry
>
> What's this exactly? A QOM path? A qdev ID? Something else?

QOM path, I believe, as we're going to call object_resolve_path_type() on
it. The oddity is it's defined for the bus, not the host bridge that we
care about, as the host bridge doesn't have a convenient id to let us
identify it.

e.g. It is specified via --device pxb-cxl,id= of TYPE_PXB_CXL_HOST in the
command line but ends up on the TYPE_PCI_BUS with parent set to the
PXB_CXL_HOST.
Normally we just want this bus for hanging root ports off it. I can clarify it's the QOM path but I'm struggling a bit to explain the relationship without resorting to an example. This should also not mention SRAT as at some stage I'd expect DT bindings to provide similar functionality. > > > +# > > +# @node: numa node associated with the PCI device > NUMA > > Is this a NUMA node ID? Fair question with a non-obvious answer. ACPI-wise it's a proximity domain. In every other SRAT entry (which defines a proximity domain) this does map to a NUMA node in an operating system as they contain at least either some form of memory access initiator (CPU, Generic Initiator etc) or a target (memory). A Generic Port is subtly different in that it defines a proximity domain that in and of itself is not what we'd think of as a NUMA node but rather an entity that exists to provide the info to the OS to stitch together non-discoverable and discoverable buses. So I should have gone with something more specific. Could add this to the parameter docs, or is it too much? @node: Similar to a NUMA node ID, but instead of providing a reference point used for defining NUMA distances and access characteristics to memory or from an initiator (e.g. CPU), this node defines the boundary point between non-discoverable system buses which must be discovered from firmware, and a discoverable bus. NUMA distances and access characteristics are defined to and from that point, but for system software to establish full initiator-to-target characteristics this information must be combined with information retrieved from the discoverable part of the path. An example would use CDAT information read from devices and switches in conjunction with link characteristics read from PCIe Configuration space.
> > > +# > > +# Since: 9.1 > > +## > > +{ 'struct': 'AcpiGenericPortProperties', > > + 'data': { 'pci-bus': 'str', > > +'node': 'uint32' } } > > + > > ## > > # @RngProperties: > > # > > @@ -944,6 +960,7 @@ > > { 'enum': 'ObjectType', > >'data': [ > > 'acpi-generic-initiator', > > +'acpi-generic-port', > > 'authz-list', > > 'authz-listfile', > > 'authz-pam', > > @@ -1016,6 +1033,7 @@ > >'discriminator': 'qom-type', > >'data': { > >'acpi-generic-initiator': 'AcpiGenericInitiatorProperties', > > + 'acpi-generic-port': 'AcpiGenericPortProperties', > >'authz-list': 'AuthZListProperties', > >'authz-listfile': 'AuthZListFileProperties', > >'authz-pam': 'AuthZPAMProperties', > > [...] > >
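[Editorial note: the generic-ports test referenced earlier in this digest shows how these objects are wired up on the command line. A fragment of that shape, hedged as a sketch rather than a definitive invocation (the `gp0` and `cxl.1` names come from that test; the machine type, CXL fixed memory windows and hmat-lb distance options are elided):]

```
# Fragment only - the rest of the QEMU invocation (machine, CXL FMWs,
# hmat-lb latency/bandwidth entries, etc.) is elided.
qemu-system-aarch64 ... \
    -numa node,nodeid=2 \
    -object acpi-generic-port,id=gp0,pci-bus=cxl.1,node=2
```

Here nodeid=2 is the CPU-less proximity domain acting as the generic-port boundary point, and pci-bus names the bus created by the pxb-cxl host bridge as discussed above.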
Re: [PATCH v5 09/13] hw/cxl/events: Add qmp interfaces to add/release dynamic capacity extents
On Thu, 25 Apr 2024 10:30:51 -0700 Ira Weiny wrote: > Markus Armbruster wrote: > > fan writes: > > > > > On Wed, Apr 24, 2024 at 03:09:52PM +0200, Markus Armbruster wrote: > > >> nifan@gmail.com writes: > > >> > > >> > From: Fan Ni > > >> > > > >> > Since fabric manager emulation is not supported yet, the change > > >> > implements > > >> > the functions to add/release dynamic capacity extents as QMP > > >> > interfaces. > > >> > > >> Will fabric manager emulation obsolete these commands? > > > > > > If in the future, fabric manager emulation supports commands for dynamic > > > capacity > > > extent add/release, it is possible we do not need the commands. > > > But it seems not to happen soon, we need the qmp commands for the > > > end-to-end test with kernel DCD support. > > > > I asked because if the commands are temporary testing aids, they should > > probably be declared unstable. Even if they are permanent testing aids, > > unstable might be the right choice. This is for the CXL maintainers to > > decide. > > > > What does "unstable" mean? docs/devel/qapi-code-gen.rst: "Interfaces so > > marked may be withdrawn or changed incompatibly in future releases." > > > > Management applications need stable interfaces. Libvirt developers > > generally refuse to touch anything in QMP that's declared unstable. > > > > Human users and their ad hoc scripts appreciate stability, but they > > don't need it nearly as much as management applications do. > > > > A stability promise increases the maintenance burden. By how much is > > unclear. In other words, by promising stability, the maintainers take > > on risk. Are the CXL maintainers happy to accept the risk here? > > > > Ah... All great points. > > Outside of CXL development I don't think there is a strong need for them > to be stable. I would like to see more than ad hoc scripts use them > though. So I don't think they are going to be changed without some > thought though. 
These align closely with the data that comes from the fabric management API in the CXL spec. So I don't see a big maintenance burden problem in having these as stable interfaces. Whilst they aren't doing quite the same job as the FM-API (which will be emulated such that it is visible to the guest as that aids some other types of testing) that interface defines the limits on what we can tell the device to do. So yes, risk for these is minimal and I'm happy to accept that. It'll be a while before we need libvirt to use them but I do expect to see that happen. (subject to some guessing on a future virtualization stack!) Jonathan > > Ira > > [snip]
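[Editorial note: for the "ad hoc scripts" case discussed above, a minimal sketch of what the wire form of one of these commands looks like when assembled from a script. The field names and QOM path are taken from Fan's v5 examples later in this digest; this only builds the JSON payload, it does not talk to a real QMP socket:]

```python
import json

# Sketch only: assembles the cxl-add-dynamic-capacity QMP payload as
# shown in Fan's v5 examples.  "path" is the QOM path of the emulated
# DCD; two contiguous 128 MiB extents are added to region 0.
MIB = 1024 * 1024

cmd = {
    "execute": "cxl-add-dynamic-capacity",
    "arguments": {
        "path": "/machine/peripheral/cxl-dcd0",
        "region-id": 0,
        "extents": [
            {"dpa": 0 * MIB, "len": 128 * MIB},
            {"dpa": 128 * MIB, "len": 128 * MIB},
        ],
    },
}

wire = json.dumps(cmd)
print(wire)
```

In a real test the resulting string would be sent over the QMP socket after the usual `qmp_capabilities` handshake.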
Re: [PATCH v5 09/13] hw/cxl/events: Add qmp interfaces to add/release dynamic capacity extents
On Wed, 24 Apr 2024 10:33:33 -0700 Ira Weiny wrote: > Markus Armbruster wrote: > > nifan@gmail.com writes: > > > > > From: Fan Ni > > > > > > Since fabric manager emulation is not supported yet, the change implements > > > the functions to add/release dynamic capacity extents as QMP interfaces. > > > > Will fabric manager emulation obsolete these commands? > > I don't think so. In the development of the kernel, I see these being > valuable to do CI and regression testing without the complexity of an FM. Fully agree - I also long term see these as the drivers for one possible virtualization stack for DCD devices (whether it turns out to be the way forwards for that is going to take a while to resolve!) It doesn't make much sense to add a fabric manager into that flow or to expose an appropriate (maybe MCTP) interface from QEMU just to poke the emulated device. Jonathan > > Ira > > > > > > Note: we skip any FM-issued extent release request if the exact extent > > > does not exist in the extent list of the device. We will loosen the > > > restriction later once we have partial release support in the kernel. > > > > > > 1. Add dynamic capacity extents: > > > > > > For example, the command to add two contiguous extents (each 128MiB long) > > > to region 0 (starting at DPA offset 0) looks like below: > > > > > > { "execute": "qmp_capabilities" } > > > > > > { "execute": "cxl-add-dynamic-capacity", > > > "arguments": { > > > "path": "/machine/peripheral/cxl-dcd0", > > > "region-id": 0, > > > "extents": [ > > > { > > > "dpa": 0, > > > "len": 134217728 > > > }, > > > { > > > "dpa": 134217728, > > > "len": 134217728 > > > } > > > ] > > > } > > > } > > > > > > 2.
Release dynamic capacity extents: > > > > > > For example, the command to release an extent of size 128MiB from region 0 > > > (DPA offset 128MiB) look like below: > > > > > > { "execute": "cxl-release-dynamic-capacity", > > > "arguments": { > > > "path": "/machine/peripheral/cxl-dcd0", > > > "region-id": 0, > > > "extents": [ > > > { > > > "dpa": 134217728, > > > "len": 134217728 > > > } > > > ] > > > } > > > } > > > > > > Signed-off-by: Fan Ni > > > > [...] > > > > > diff --git a/qapi/cxl.json b/qapi/cxl.json > > > index 8cc4c72fa9..2645004666 100644 > > > --- a/qapi/cxl.json > > > +++ b/qapi/cxl.json > > > @@ -19,13 +19,16 @@ > > > # > > > # @fatal: Fatal Event Log > > > # > > > +# @dyncap: Dynamic Capacity Event Log > > > +# > > > # Since: 8.1 > > > ## > > > { 'enum': 'CxlEventLog', > > >'data': ['informational', > > > 'warning', > > > 'failure', > > > - 'fatal'] > > > + 'fatal', > > > + 'dyncap'] > > > > We tend to avoid abbreviations in QMP identifiers: dynamic-capacity. > > > > > } > > > > > > ## > > > @@ -361,3 +364,59 @@ > > > ## > > > {'command': 'cxl-inject-correctable-error', > > > 'data': {'path': 'str', 'type': 'CxlCorErrorType'}} > > > + > > > +## > > > +# @CXLDCExtentRecord: > > > > Such traffic jams of capital letters are hard to read. > > > > What does DC mean? > > > > > +# > > > +# Record of a single extent to add/release > > > +# > > > +# @offset: offset to the start of the region where the extent to be > > > operated > > > > Blank line here, please > > > > > +# @len: length of the extent > > > +# > > > +# Since: 9.0 > > > +## > > > +{ 'struct': 'CXLDCExtentRecord', > > > + 'data': { > > > + 'offset':'uint64', > > > + 'len': 'uint64' > > > + } > > > +} > > > + > > > +## > > > +# @cxl-add-dynamic-capacity: > > > +# > > > +# Command to start add dynamic capacity extents flow. The device will > > > > I think we're missing an article here. Is it "a flow" or "the flow"? 
> > > > > +# have to acknowledged the acceptance of the extents before they are > > > usable. > > > > to acknowledge > > > > docs/devel/qapi-code-gen.rst: > > > > For legibility, wrap text paragraphs so every line is at most 70 > > characters long. > > > > Separate sentences with two spaces. > > > > > +# > > > +# @path: CXL DCD canonical QOM path > > > > What is a CXL DCD? Is it a device? > > > > I'd prefer @qom-path, unless you can make a consistency argument for > > @path. > > > > > +# @region-id: id of the region where the extent to add > > > > What's a region, and how do they get their IDs? > > > > > +# @extents: Extents to add > > > > Blank lines between argument descriptions, please. > > > > > +# > > > +# Since : 9.0 > > > > 9.1 > > > > > +## > > > +{ 'command': 'cxl-add-dynamic-capacity', > > > + 'data': { 'path': 'str', > > > +'region-id': 'uint8', > > > +'extents': [ 'CXLDCExtentRecord' ] > > > + } > > > +} > > > + > > > +## > > > +# @cxl-release-dynamic-capacity: > > > +# > > > +# Command to
Re: [PATCH v7 00/12] Enabling DCD emulation support in Qemu
On Mon, 22 Apr 2024 15:23:16 +0100 Jonathan Cameron wrote: > On Mon, 22 Apr 2024 13:04:48 +0100 > Jonathan Cameron wrote: > > > On Sat, 20 Apr 2024 16:35:46 -0400 > > Gregory Price wrote: > > > > > On Fri, Apr 19, 2024 at 11:43:14AM -0700, fan wrote: > > > > On Fri, Apr 19, 2024 at 02:24:36PM -0400, Gregory Price wrote: > > > > > > > > > > added review to all patches, will hopefully be able to add a Tested-by > > > > > tag early next week, along with a v1 RFC for MHD bit-tracking. > > > > > > > > > > We've been testing v5/v6 for a bit, so I expect as soon as we get the > > > > > MHD code ported over to v7 i'll ship a tested-by tag pretty quick. > > > > > > > > > > The super-set release will complicate a few things but this doesn't > > > > > look like a blocker on our end, just a change to how we track bits in > > > > > a > > > > > shared bit/bytemap. > > > > > > > > > > > > > Hi Gregory, > > > > Thanks for reviewing the patches so quickly. > > > > > > > > No pressure, but look forward to your MHD work. :) > > > > > > > > Fan > > > > > > Starting to get into versioniong hell a bit, since the Niagara work was > > > based off of jonathan's branch and the mhd-dcd work needs some of the > > > extentions from that branch - while this branch is based on master. > > > > > > Probably we'll need to wait for a new cxl dated branch to try and sus > > > out the pain points before we push an RFC. I would not want to have > > > conflicting commits for something like this for example: > > > > > > https://lore.kernel.org/qemu-devel/20230901012914.226527-2-gregory.pr...@memverge.com/ > > > > > > We get merge conflicts here because this is behind that patch. So > > > pushing up an RFC in this state would be mostly useless to everyone > > > > Subtle hint noted ;) > > > > I'll build a fresh tree - any remaining rebases until QEMU 9.0 should be > > straight forward anyway. My ideal is that the NUMA GP series lands early > > in 9.1 cycle and this can go in parallel. 
I'd really like to > > get this in early if possible so we can start clearing some of the other > > stuff that ended up built on top of it! > > I've pushed to gitlab.com/jic23/qemu cxl-2024-04-22-draft > It's extremely lightly tested so far. > > To save time, I've temporarily dropped the fm-api DCD initiate > dynamic capacity add patch as that needs non-trivial updates. > > I've not yet caught up with some other outstanding series, but > I will almost certainly put them on top of DCD. If anyone pulled in the meantime... I failed to push down a fix from my working tree on top of this. Goes to show I shouldn't ignore patches simply named "Push down" :( Updated on same branch. Jonathan > > Jonathan > > > > > Jonathan > > > > > > > > ~Gregory > > > > > >
Re: [PATCH v7 00/12] Enabling DCD emulation support in Qemu
On Mon, 22 Apr 2024 13:04:48 +0100 Jonathan Cameron wrote: > On Sat, 20 Apr 2024 16:35:46 -0400 > Gregory Price wrote: > > > On Fri, Apr 19, 2024 at 11:43:14AM -0700, fan wrote: > > > On Fri, Apr 19, 2024 at 02:24:36PM -0400, Gregory Price wrote: > > > > > > > > added review to all patches, will hopefully be able to add a Tested-by > > > > tag early next week, along with a v1 RFC for MHD bit-tracking. > > > > > > > > We've been testing v5/v6 for a bit, so I expect as soon as we get the > > > > MHD code ported over to v7 i'll ship a tested-by tag pretty quick. > > > > > > > > The super-set release will complicate a few things but this doesn't > > > > look like a blocker on our end, just a change to how we track bits in a > > > > shared bit/bytemap. > > > > > > > > > > Hi Gregory, > > > Thanks for reviewing the patches so quickly. > > > > > > No pressure, but look forward to your MHD work. :) > > > > > > Fan > > > > Starting to get into versioniong hell a bit, since the Niagara work was > > based off of jonathan's branch and the mhd-dcd work needs some of the > > extentions from that branch - while this branch is based on master. > > > > Probably we'll need to wait for a new cxl dated branch to try and sus > > out the pain points before we push an RFC. I would not want to have > > conflicting commits for something like this for example: > > > > https://lore.kernel.org/qemu-devel/20230901012914.226527-2-gregory.pr...@memverge.com/ > > > > We get merge conflicts here because this is behind that patch. So > > pushing up an RFC in this state would be mostly useless to everyone > > Subtle hint noted ;) > > I'll build a fresh tree - any remaining rebases until QEMU 9.0 should be > straight forward anyway. My ideal is that the NUMA GP series lands early > in 9.1 cycle and this can go in parallel. I'd really like to > get this in early if possible so we can start clearing some of the other > stuff that ended up built on top of it! 
I've pushed to gitlab.com/jic23/qemu cxl-2024-04-22-draft It's extremely lightly tested so far. To save time, I've temporarily dropped the fm-api DCD initiate dynamic capacity add patch as that needs non-trivial updates. I've not yet caught up with some other outstanding series, but I will almost certainly put them on top of DCD. Jonathan > > Jonathan > > > > > ~Gregory > >
Re: [PATCH v7 00/12] Enabling DCD emulation support in Qemu
On Sat, 20 Apr 2024 16:35:46 -0400 Gregory Price wrote: > On Fri, Apr 19, 2024 at 11:43:14AM -0700, fan wrote: > > On Fri, Apr 19, 2024 at 02:24:36PM -0400, Gregory Price wrote: > > > > > > added review to all patches, will hopefully be able to add a Tested-by > > > tag early next week, along with a v1 RFC for MHD bit-tracking. > > > > > > We've been testing v5/v6 for a bit, so I expect as soon as we get the > > > MHD code ported over to v7 i'll ship a tested-by tag pretty quick. > > > > > > The super-set release will complicate a few things but this doesn't > > > look like a blocker on our end, just a change to how we track bits in a > > > shared bit/bytemap. > > > > > > > Hi Gregory, > > Thanks for reviewing the patches so quickly. > > > > No pressure, but look forward to your MHD work. :) > > > > Fan > > Starting to get into versioning hell a bit, since the Niagara work was > based off of jonathan's branch and the mhd-dcd work needs some of the > extensions from that branch - while this branch is based on master. > > Probably we'll need to wait for a new cxl dated branch to try and suss > out the pain points before we push an RFC. I would not want to have > conflicting commits for something like this for example: > > https://lore.kernel.org/qemu-devel/20230901012914.226527-2-gregory.pr...@memverge.com/ > > We get merge conflicts here because this is behind that patch. So > pushing up an RFC in this state would be mostly useless to everyone Subtle hint noted ;) I'll build a fresh tree - any remaining rebases until QEMU 9.0 should be straightforward anyway. My ideal is that the NUMA GP series lands early in the 9.1 cycle and this can go in parallel. I'd really like to get this in early if possible so we can start clearing some of the other stuff that ended up built on top of it! Jonathan > > ~Gregory
Re: [PATCH v7 09/12] hw/cxl/events: Add qmp interfaces to add/release dynamic capacity extents
On Thu, 18 Apr 2024 16:11:00 -0700 nifan@gmail.com wrote: > From: Fan Ni > Hi Fan, Please expand CC list to include QAPI maintainers. +CC Markus and Michael. Also, for future versions +CC Michael Tsirkin. I'm fine rolling these up as a series with the precursors but if it is already something Michael has seen it may speed things up. Jonathan p.s. Today I'm just building a tree, but will circle back around later in the week with a final review of the last few changes. > To simulate FM functionalities for initiating Dynamic Capacity Add > (Opcode 5604h) and Dynamic Capacity Release (Opcode 5605h) as in CXL spec > r3.1 7.6.7.6.5 and 7.6.7.6.6, we implemented two QMP interfaces to issue > add/release dynamic capacity extents requests. > > With the change, we allow releasing an extent only when its DPA range > is contained by a single accepted extent in the device. That is to say, > extent superset release is not supported yet. > > 1. Add dynamic capacity extents: > > For example, the command to add two contiguous extents (each 128MiB long) > to region 0 (starting at DPA offset 0) looks like below: > > { "execute": "qmp_capabilities" } > > { "execute": "cxl-add-dynamic-capacity", > "arguments": { > "path": "/machine/peripheral/cxl-dcd0", > "hid": 0, > "selection-policy": 2, > "region-id": 0, > "tag": "", > "extents": [ > { > "offset": 0, > "len": 134217728 > }, > { > "offset": 134217728, > "len": 134217728 > } > ] > } > } > > 2.
Release dynamic capacity extents: > > For example, the command to release an extent of size 128MiB from region 0 > (DPA offset 128MiB) looks like below: > > { "execute": "cxl-release-dynamic-capacity", > "arguments": { > "path": "/machine/peripheral/cxl-dcd0", > "hid": 0, > "flags": 1, > "region-id": 0, > "tag": "", > "extents": [ > { > "offset": 134217728, > "len": 134217728 > } > ] > } > } > > Signed-off-by: Fan Ni > --- > hw/cxl/cxl-mailbox-utils.c | 62 +-- > hw/mem/cxl_type3.c | 311 +++- > hw/mem/cxl_type3_stubs.c| 20 +++ > include/hw/cxl/cxl_device.h | 22 +++ > include/hw/cxl/cxl_events.h | 18 +++ > qapi/cxl.json | 69 > 6 files changed, 489 insertions(+), 13 deletions(-) > > diff --git a/hw/cxl/cxl-mailbox-utils.c b/hw/cxl/cxl-mailbox-utils.c > index 9d54e10cd4..3569902e9e 100644 > --- a/hw/cxl/cxl-mailbox-utils.c > +++ b/hw/cxl/cxl-mailbox-utils.c > @@ -1405,7 +1405,7 @@ static CXLRetCode cmd_dcd_get_dyn_cap_ext_list(const > struct cxl_cmd *cmd, > * Check whether any bit between addr[nr, nr+size) is set, > * return true if any bit is set, otherwise return false > */ > -static bool test_any_bits_set(const unsigned long *addr, unsigned long nr, > +bool test_any_bits_set(const unsigned long *addr, unsigned long nr, >unsigned long size) > { > unsigned long res = find_next_bit(addr, size + nr, nr); > @@ -1444,7 +1444,7 @@ CXLDCRegion *cxl_find_dc_region(CXLType3Dev *ct3d, > uint64_t dpa, uint64_t len) > return NULL; > } > > -static void cxl_insert_extent_to_extent_list(CXLDCExtentList *list, > +void cxl_insert_extent_to_extent_list(CXLDCExtentList *list, > uint64_t dpa, > uint64_t len, > uint8_t *tag, > @@ -1470,6 +1470,44 @@ void > cxl_remove_extent_from_extent_list(CXLDCExtentList *list, > g_free(extent); > } > > +/* > + * Add a new extent to the extent "group" if group exists; > + * otherwise, create a new group > + * Return value: return the group where the extent is inserted. 
> + */ > +CXLDCExtentGroup *cxl_insert_extent_to_extent_group(CXLDCExtentGroup *group, > +uint64_t dpa, > +uint64_t len, > +uint8_t *tag, > +uint16_t shared_seq) > +{ > +if (!group) { > +group = g_new0(CXLDCExtentGroup, 1); > +QTAILQ_INIT(&group->list); > +} > +cxl_insert_extent_to_extent_list(&group->list, dpa, len, > + tag, shared_seq); > +return group; > +} > + > +void cxl_extent_group_list_insert_tail(CXLDCExtentGroupList *list, > + CXLDCExtentGroup *group) > +{ > +QTAILQ_INSERT_TAIL(list, group, node); > +} > + > +void cxl_extent_group_list_delete_front(CXLDCExtentGroupList *list) > +{ > +CXLDCExtent *ent, *ent_next; > +CXLDCExtentGroup *group = QTAILQ_FIRST(list); > + > +QTAILQ_REMOVE(list, group, node); > +QTAILQ_FOREACH_SAFE(ent, &group->list, node, ent_next) { > +
Re: [PATCH v7 06/12] hw/mem/cxl_type3: Add host backend and address space handling for DC regions
On Fri, 19 Apr 2024 13:27:59 -0400 Gregory Price wrote: > On Thu, Apr 18, 2024 at 04:10:57PM -0700, nifan@gmail.com wrote: > > From: Fan Ni > > > > Add (file/memory backed) host backend for DCD. All the dynamic capacity > > regions will share a single, large enough host backend. Set up address > > space for DC regions to support read/write operations to dynamic capacity > > for DCD. > > > > With the change, the following support is added: > > 1. Add a new property to type3 device "volatile-dc-memdev" to point to host > >memory backend for dynamic capacity. Currently, all DC regions share one > >host backend; > > 2. Add namespace for dynamic capacity for read/write support; > > 3. Create cdat entries for each dynamic capacity region. > > > > Signed-off-by: Fan Ni > > --- > > hw/cxl/cxl-mailbox-utils.c | 16 ++-- > > hw/mem/cxl_type3.c | 172 +--- > > include/hw/cxl/cxl_device.h | 8 ++ > > 3 files changed, 160 insertions(+), 36 deletions(-) > > > > A couple general comments in line for discussion, but patch looks good > otherwise. Notes are mostly on improvements we could make that should > not block this patch. > > Reviewed-by: Gregory Price > > > > > diff --git a/hw/mem/cxl_type3.c b/hw/mem/cxl_type3.c > > index a1fe268560..ac87398089 100644 > > --- a/hw/mem/cxl_type3.c > > +++ b/hw/mem/cxl_type3.c > > @@ -45,7 +45,8 @@ enum { > > > > static void ct3_build_cdat_entries_for_mr(CDATSubHeader **cdat_table, > >int dsmad_handle, uint64_t size, > > - bool is_pmem, uint64_t dpa_base) > > + bool is_pmem, bool is_dynamic, > > + uint64_t dpa_base) > > We should probably change the is_* fields into a flags field and do some > error checking on the combination of flags. > > > { > > CDATDsmas *dsmas; > > CDATDslbis *dslbis0; > > @@ -61,7 +62,8 @@ static void ct3_build_cdat_entries_for_mr(CDATSubHeader > > **cdat_table, > > .length = sizeof(*dsmas), > > }, > > .DSMADhandle = dsmad_handle, > > -.flags = is_pmem ? CDAT_DSMAS_FLAG_NV : 0, > > +.flags = (is_pmem ? 
CDAT_DSMAS_FLAG_NV : 0) | > > + (is_dynamic ? CDAT_DSMAS_FLAG_DYNAMIC_CAP : 0), > > For example, as noted elsewhere in the code, is_pmem+is_dynamic is not > presently supported, so this shouldn't even be allowed in this function. > > > +if (dc_mr) { > > +int i; > > +uint64_t region_base = vmr_size + pmr_size; > > + > > +/* > > + * TODO: we assume the dynamic capacity to be volatile for now. > > + * Non-volatile dynamic capacity will be added if needed in the > > + * future. > > + */ > > Probably don't need to mark this TODO, can just leave it as a note. > > Non-volatile dynamic capacity will coincide with shared memory, so it'll > end up handled. So this isn't really a TODO for this current work, and > should read more like: > > "Dynamic Capacity is always volatile, until shared memory is > implemented" I can sort of see your logic, but there is a difference between volatile memory that is shared and persistent memory (typically whether we need to care about deep flushes in some architectures) so I'd expected volatile shared capacity to still be a thing, even if the host OS treats it in most ways as persistent. Also, persistent + DCD could be a thing without sharing sometime in the future. > > > +} else if (ct3d->hostpmem) { > > range1_size_hi = ct3d->hostpmem->size >> 32; > > range1_size_lo = (2 << 5) | (2 << 2) | 0x3 | > > (ct3d->hostpmem->size & 0xF000); > > +} else { > > +/* > > + * For DCD with no static memory, set memory active, memory class > > bits. > > + * No range is set. > > + */ > > +range1_size_lo = (2 << 5) | (2 << 2) | 0x3; > > We should probably add defs for these fields at some point. Can be > tabled for later work though. Agreed - worth tidying up but not on critical path. > > > +/* > > + * TODO: set dc as volatile for now, non-volatile support can be > > added > > + * in the future if needed. > > + */ > > +memory_region_set_nonvolatile(dc_mr, false); > > Again can probably drop the TODO and just leave a statement. > > ~Gregory
Re: [PATCH v7 06/12] hw/mem/cxl_type3: Add host backend and address space handling for DC regions
On Thu, 18 Apr 2024 16:10:57 -0700 nifan@gmail.com wrote: > From: Fan Ni > > Add (file/memory backed) host backend for DCD. All the dynamic capacity > regions will share a single, large enough host backend. Set up address > space for DC regions to support read/write operations to dynamic capacity > for DCD. > > With the change, the following support is added: > 1. Add a new property to type3 device "volatile-dc-memdev" to point to host >memory backend for dynamic capacity. Currently, all DC regions share one >host backend; > 2. Add namespace for dynamic capacity for read/write support; > 3. Create cdat entries for each dynamic capacity region. > > Signed-off-by: Fan Ni One fixlet needed inline. I've set range1_size_hi = 0 there for my tree. > @@ -301,10 +337,16 @@ static void build_dvsecs(CXLType3Dev *ct3d) > range2_size_lo = (2 << 5) | (2 << 2) | 0x3 | > (ct3d->hostpmem->size & 0xF000); > } > -} else { > +} else if (ct3d->hostpmem) { > range1_size_hi = ct3d->hostpmem->size >> 32; > range1_size_lo = (2 << 5) | (2 << 2) | 0x3 | > (ct3d->hostpmem->size & 0xF000); > +} else { > +/* > + * For DCD with no static memory, set memory active, memory class > bits. > + * No range is set. > + */ range1_size_hi is not initialized. > +range1_size_lo = (2 << 5) | (2 << 2) | 0x3; > } > > dvsec = (uint8_t *)&(CXLDVSECDevice){
Re: [PATCH 0/3] hw/cxl/cxl-cdat: Make cxl_doe_cdat_init() return boolean
On Fri, 19 Apr 2024 17:40:07 +0200 Philippe Mathieu-Daudé wrote: > On 18/4/24 12:04, Zhao Liu wrote: > > From: Zhao Liu > > > > --- > > Zhao Liu (3): > >hw/cxl/cxl-cdat: Make ct3_load_cdat() return boolean > >hw/cxl/cxl-cdat: Make ct3_build_cdat() return boolean > >hw/cxl/cxl-cdat: Make cxl_doe_cdat_init() return boolean > > Since Jonathan Ack'ed the series, I'm queuing it via my hw-misc tree. > Thanks, J
Re: [edk2-devel] [PATCH v3 5/6] target/arm: Do memory type alignment check when translation disabled
On Fri, 19 Apr 2024 13:52:07 +0200 Gerd Hoffmann wrote: > Hi, > > > Gerd, any ideas? Maybe I need something subtly different in my > > edk2 build? I've not looked at this bit of the qemu infrastructure > > before - is there a document on how that image is built? > > There is roms/Makefile for that. > > make -C roms help > make -C roms efi > > So easiest would be to just update the edk2 submodule to what you > need, then rebuild. > > The build is handled by the roms/edk2-build.py script, > with the build configuration being in roms/edk2-build.config. > That is usable outside the qemu source tree too, i.e. like this: > > python3 /path/to/qemu.git/roms/edk2-build.py \ > --config /path/to/qemu.git/roms/edk2-build.config \ > --core /path/to/edk2.git \ > --match armvirt \ > --silent --no-logs > > That'll try to place the images built in "../pc-bios", so maybe better > work with a copy of the config file where you adjust this. > > HTH, > Gerd > Thanks Gerd! So the builds are very similar via the two methods... However - the QEMU build sets -D CAVIUM_ERRATUM_27456=TRUE And that's the difference - with that set for my other builds the alignment problems go away... Any idea why we have that set in roms/edk2-build.config? Superficially it seems rather unlikely anyone cares about thunderx1 bugs now (if they do, we need to get them some new hardware with fresh bugs) and this config file was only added last year. However, the last comment in Ard's commit message below seems highly likely to be relevant! Chasing through Ard's patch, it has the side effect of dropping an override of a requirement for strict alignment. So without the errata, DEFINE GCC_AARCH64_CC_XIPFLAGS = -mstrict-align -mgeneral-regs-only is replaced with [BuildOptions] +!if $(CAVIUM_ERRATUM_27456) == TRUE + GCC:*_*_AARCH64_PP_FLAGS = -DCAVIUM_ERRATUM_27456 +!else + GCC:*_*_AARCH64_CC_XIPFLAGS == +!endif The edk2 commit that added this was the following +CC Ard.
Given I wasn't sure of the syntax of that file I set it manually to the original value and indeed it works. commit ec54ce1f1ab41b92782b37ae59e752fff0ef9c41 Author: Ard Biesheuvel Date: Wed Jan 4 16:51:35 2023 +0100 ArmVirtPkg/ArmVirtQemu: Avoid early ID map on ThunderX The early ID map used by ArmVirtQemu uses ASID scoped non-global mappings, as this allows us to switch to the permanent ID map seamlessly without the need for explicit TLB maintenance. However, this triggers a known erratum on ThunderX, which does not tolerate non-global mappings that are executable at EL1, as this appears to result in I-cache corruption. (Linux disables the KPTI based Meltdown mitigation on ThunderX for the same reason) So work around this, by detecting the CPU implementor and part number, and proceeding without the early ID map if a ThunderX CPU is detected. Note that this requires the C code to be built with strict alignment again, as we may end up executing it with the MMU and caches off. Signed-off-by: Ard Biesheuvel Acked-by: Laszlo Ersek Tested-by: dann frazier Test case is qemu-system-aarch64 -M virt,virtualization=true, -m 4g -cpu cortex-a76 \ -bios QEMU_EFI.fd -d int Which gets alignment faults since: https://lore.kernel.org/all/20240301204110.656742-6-richard.hender...@linaro.org/ So my feeling here is EDK2 should either have yet another config for QEMU as a host or should always set the alignment without needing to pick the CAVIUM 27456 errata which I suspect will get dropped soonish anyway if anyone ever cleans up old errata. Jonathan
Re: [PATCH v11 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info
On Fri, 5 Apr 2024 00:07:06 + "Ho-Ren (Jack) Chuang" wrote: > The current implementation treats emulated memory devices, such as > CXL1.1 type3 memory, as normal DRAM when they are emulated as normal memory > (E820_TYPE_RAM). However, these emulated devices have different > characteristics than traditional DRAM, making it important to > distinguish them. Thus, we modify the tiered memory initialization process > to introduce a delay specifically for CPUless NUMA nodes. This delay > ensures that the memory tier initialization for these nodes is deferred > until HMAT information is obtained during the boot process. Finally, > demotion tables are recalculated at the end. > > * late_initcall(memory_tier_late_init); > Some device drivers may have initialized memory tiers between > `memory_tier_init()` and `memory_tier_late_init()`, potentially bringing > online memory nodes and configuring memory tiers. They should be excluded > in the late init. > > * Handle cases where there is no HMAT when creating memory tiers > There is a scenario where a CPUless node does not provide HMAT information. > If no HMAT is specified, it falls back to using the default DRAM tier. > > * Introduce another new lock `default_dram_perf_lock` for adist calculation > In the current implementation, iterating through CPUlist nodes requires > holding the `memory_tier_lock`. However, `mt_calc_adistance()` will end up > trying to acquire the same lock, leading to a potential deadlock. > Therefore, we propose introducing a standalone `default_dram_perf_lock` to > protect `default_dram_perf_*`. This approach not only avoids deadlock > but also prevents holding a large lock simultaneously. > > * Upgrade `set_node_memory_tier` to support additional cases, including > default DRAM, late CPUless, and hot-plugged initializations. 
> To cover hot-plugged memory nodes, `mt_calc_adistance()` and > `mt_find_alloc_memory_type()` are moved into `set_node_memory_tier()` to > handle cases where memtype is not initialized and where HMAT information is > available. > > * Introduce `default_memory_types` for those memory types that are not > initialized by device drivers. > Because late initialized memory and default DRAM memory need to be managed, > a default memory type is created for storing all memory types that are > not initialized by device drivers and as a fallback. > > Signed-off-by: Ho-Ren (Jack) Chuang > Signed-off-by: Hao Xiang > Reviewed-by: "Huang, Ying" Reviewed-by: Jonathan Cameron
Re: [PATCH v3 5/6] target/arm: Do memory type alignment check when translation disabled
On Thu, 18 Apr 2024 09:15:55 +0100 Jonathan Cameron via wrote: > On Wed, 17 Apr 2024 13:07:35 -0700 > Richard Henderson wrote: > > > On 4/16/24 08:11, Jonathan Cameron wrote: > > > On Fri, 1 Mar 2024 10:41:09 -1000 > > > Richard Henderson wrote: > > > > > >> If translation is disabled, the default memory type is Device, which > > >> requires alignment checking. This is more optimally done early via > > >> the MemOp given to the TCG memory operation. > > >> > > >> Reviewed-by: Philippe Mathieu-Daudé > > >> Reported-by: Idan Horowitz > > >> Resolves: https://gitlab.com/qemu-project/qemu/-/issues/1204 > > >> Signed-off-by: Richard Henderson > > > > > > Hi Richard. > > > > > > I noticed some tests I was running stopped booting with master. > > > (it's a fun and complex stack of QEMU + kvm on QEMU for vCPU Hotplug > > > kernel work, > > > but this is the host booting) > > > > > > EDK2 build from upstream as of somepoint last week. > > > > > > Bisects to this patch. > > > > > > qemu-system-aarch64 -M virt,gic-version=3,virtualization=true -m > > > 4g,maxmem=8G,slots=8 -cpu cortex-a76 -smp > > > cpus=4,threads=2,clusters=2,sockets=1 \ > > > -kernel Image \ > > > -drive if=none,file=full.qcow2,format=qcow2,id=hd \ > > > -device ioh3420,id=root_port1 -device virtio-blk-pci,drive=hd \ > > > -netdev user,id=mynet,hostfwd=tcp::-:22 -device > > > virtio-net-pci,netdev=mynet,id=bob \ > > > -nographic -no-reboot -append 'earlycon root=/dev/vda2 fsck.mode=skip > > > tp_printk' \ > > > -monitor telnet:127.0.0.1:1235,server,nowait -bios QEMU_EFI.fd \ > > > -object memory-backend-ram,size=4G,id=mem0 \ > > > -numa node,nodeid=0,cpus=0-3,memdev=mem0 > > > > > > Symptoms: Nothing on console from edk2 which is built in debug mode so is > > > normally very noisy. > > >No sign of anything much happening at all :( > > > > This isn't a fantastic bug report. > > > > (1) If it doesn't boot efi, then none of the -kernel parameters are > > necessary. 
> > (2) I'd be surprised if the full.qcow2 drive parameters are necessary > > either. > > But if they are, what contents? Is a new empty drive sufficient, just > > enough to send the bios through the correct device initialization? > > (3) edk2 build from ... > > Well, this is partly edk2's fault, as the build documentation is awful. > > I spent an entire afternoon trying to figure it out and gave up. > > > > I will say that the edk2 shipped with qemu does work, so... are you > > absolutely > > certain that it isn't a bug in edk2 since then? Firmware bugs are exactly > > what > > that patch is supposed to expose, as requested by issue #1204. > > > > I'd say you should boot with "-d int" and see what kind of interrupts > > you're getting very > > early on. I suspect that you'll see data aborts with ESR xx/yy where the > > last 6 bits of > > yy are 0x21 (alignment fault). > > Hi Richard, > > Sorry for lack of details, I was aware it wasn't great and should have stated > I planned > to come back with more details when I had time to debug. Snowed under so for > now I've > just dropped back to 8.2 and will get back to this perhaps next week. +CC EDK2 list and Gerd. Still not a thorough report but some breadcrumbs. It may be something about my local build setup as the shipped EDK2 succeeds, but the one I'm building via uefi-tools/edk2-build.sh armvirtqemu64 (some aged instructions here that are more or less working still) https://people.kernel.org/jic23/ Indeed starts out with some alignment faults. Gerd, any ideas? Maybe I need something subtly different in my edk2 build? I've not looked at this bit of the qemu infrastructure before - is there a document on how that image is built? As Richard observed, EDK2 isn't the simplest thing to build - I've been using uefitools for this for a long time, so maybe I missed some new requirement? Build machine is x86_64 ubuntu, gcc 12.2.0. I need to build it because of some necessary tweaks to debug a PCI enumeration issue in Linux.
(these tests were without those tweaks) As Richard observed, most of the command line isn't needed:

qemu-system-aarch64 -M virt,virtualization=true -m 4g -cpu cortex-a76 \
    -bios QEMU_EFI.fd -d int

Jonathan > > Jonathan > > > > > > r~ > >
Re: [PATCH 0/3] hw/cxl/cxl-cdat: Make cxl_doe_cdat_init() return boolean
On Thu, 18 Apr 2024 14:06:39 +0200 Philippe Mathieu-Daudé wrote: > On 18/4/24 12:04, Zhao Liu wrote: > > From: Zhao Liu > > > > --- > > Zhao Liu (3): > >hw/cxl/cxl-cdat: Make ct3_load_cdat() return boolean > >hw/cxl/cxl-cdat: Make ct3_build_cdat() return boolean > >hw/cxl/cxl-cdat: Make cxl_doe_cdat_init() return boolean > > Series: > Reviewed-by: Philippe Mathieu-Daudé > Acked-by: Jonathan Cameron
Re: [PATCH v3 5/6] target/arm: Do memory type alignment check when translation disabled
On Wed, 17 Apr 2024 13:07:35 -0700 Richard Henderson wrote: > On 4/16/24 08:11, Jonathan Cameron wrote: > > On Fri, 1 Mar 2024 10:41:09 -1000 > > Richard Henderson wrote: > > > >> If translation is disabled, the default memory type is Device, which > >> requires alignment checking. This is more optimally done early via > >> the MemOp given to the TCG memory operation. > >> > >> Reviewed-by: Philippe Mathieu-Daudé > >> Reported-by: Idan Horowitz > >> Resolves: https://gitlab.com/qemu-project/qemu/-/issues/1204 > >> Signed-off-by: Richard Henderson > > > > Hi Richard. > > > > I noticed some tests I was running stopped booting with master. > > (it's a fun and complex stack of QEMU + kvm on QEMU for vCPU Hotplug kernel > > work, > > but this is the host booting) > > > > EDK2 build from upstream as of somepoint last week. > > > > Bisects to this patch. > > > > qemu-system-aarch64 -M virt,gic-version=3,virtualization=true -m > > 4g,maxmem=8G,slots=8 -cpu cortex-a76 -smp > > cpus=4,threads=2,clusters=2,sockets=1 \ > > -kernel Image \ > > -drive if=none,file=full.qcow2,format=qcow2,id=hd \ > > -device ioh3420,id=root_port1 -device virtio-blk-pci,drive=hd \ > > -netdev user,id=mynet,hostfwd=tcp::-:22 -device > > virtio-net-pci,netdev=mynet,id=bob \ > > -nographic -no-reboot -append 'earlycon root=/dev/vda2 fsck.mode=skip > > tp_printk' \ > > -monitor telnet:127.0.0.1:1235,server,nowait -bios QEMU_EFI.fd \ > > -object memory-backend-ram,size=4G,id=mem0 \ > > -numa node,nodeid=0,cpus=0-3,memdev=mem0 > > > > Symptoms: Nothing on console from edk2 which is built in debug mode so is > > normally very noisy. > >No sign of anything much happening at all :( > > This isn't a fantastic bug report. > > (1) If it doesn't boot efi, then none of the -kernel parameters are necessary. > (2) I'd be surprised if the full.qcow2 drive parameters are necessary either. > But if they are, what contents? 
Is a new empty drive sufficient, just > enough to send the bios through the correct device initialization? > (3) edk2 build from ... > Well, this is partly edk2's fault, as the build documentation is awful. > I spent an entire afternoon trying to figure it out and gave up. > > I will say that the edk2 shipped with qemu does work, so... are you absolutely > certain that it isn't a bug in edk2 since then? Firmware bugs are exactly > what > that patch is supposed to expose, as requested by issue #1204. > > I'd say you should boot with "-d int" and see what kind of interrupts you're > getting very > early on. I suspect that you'll see data aborts with ESR xx/yy where the > last 6 bits of > yy are 0x21 (alignment fault). Hi Richard, Sorry for lack of details, I was aware it wasn't great and should have stated I planned to come back with more details when I had time to debug. Snowed under so for now I've just dropped back to 8.2 and will get back to this perhaps next week. Jonathan > > > r~
Re: [PATCH v6 10/12] hw/mem/cxl_type3: Add dpa range validation for accesses to DC regions
On Tue, 16 Apr 2024 09:37:09 -0700 fan wrote: > On Tue, Apr 16, 2024 at 04:00:56PM +0100, Jonathan Cameron wrote: > > On Mon, 15 Apr 2024 10:37:00 -0700 > > fan wrote: > > > > > On Fri, Apr 12, 2024 at 06:54:42PM -0400, Gregory Price wrote: > > > > On Mon, Mar 25, 2024 at 12:02:28PM -0700, nifan@gmail.com wrote: > > > > > From: Fan Ni > > > > > > > > > > All dpa ranges in the DC regions are invalid to access until an extent > > > > > covering the range has been added. Add a bitmap for each region to > > > > > record whether a DC block in the region has been backed by DC extent. > > > > > For the bitmap, a bit in the bitmap represents a DC block. When a DC > > > > > extent is added, all the bits of the blocks in the extent will be set, > > > > > which will be cleared when the extent is released. > > > > > > > > > > Reviewed-by: Jonathan Cameron > > > > > Signed-off-by: Fan Ni > > > > > --- > > > > > hw/cxl/cxl-mailbox-utils.c | 6 +++ > > > > > hw/mem/cxl_type3.c | 76 > > > > > + > > > > > include/hw/cxl/cxl_device.h | 7 > > > > > 3 files changed, 89 insertions(+) > > > > > > > > > > diff --git a/hw/cxl/cxl-mailbox-utils.c b/hw/cxl/cxl-mailbox-utils.c > > > > > index 7094e007b9..a0d2239176 100644 > > > > > --- a/hw/cxl/cxl-mailbox-utils.c > > > > > +++ b/hw/cxl/cxl-mailbox-utils.c > > > > > @@ -1620,6 +1620,7 @@ static CXLRetCode cmd_dcd_add_dyn_cap_rsp(const > > > > > struct cxl_cmd *cmd, > > > > > > > > > > cxl_insert_extent_to_extent_list(extent_list, dpa, len, > > > > > NULL, 0); > > > > > ct3d->dc.total_extent_count += 1; > > > > > +ct3_set_region_block_backed(ct3d, dpa, len); > > > > > > > > > > ent = QTAILQ_FIRST(&ct3d->dc.extents_pending); > > > > > > > > > > cxl_remove_extent_from_extent_list(&ct3d->dc.extents_pending, ent); > > > > > > > > > > > > > while looking at the MHD code, we had decided to "reserve" the blocks in > > > > the bitmap in the call to `qmp_cxl_process_dynamic_capacity` in order to > > > > prevent a potential double-allocation (basically we need to
sanity check > > > > that two hosts aren't reserving the region PRIOR to the host being > > > > notified). > > > > > > > > I did not see any checks in the `qmp_cxl_process_dynamic_capacity` path > > > > to prevent pending extents from being double-allocated. Is this an > > > > explicit choice? > > > > > > > > I can see, for example, why you may want to allow the following in the > > > > pending list: [Add X, Remove X, Add X]. I just want to know if this is > > > > intentional or not. If not, you may consider adding a pending check > > > > during the sanity check phase of `qmp_cxl_process_dynamic_capacity` > > > > > > > > ~Gregory > > > > > > First, for remove request, pending list is not involved. See cxl r3.1, > > > 9.13.3.3. Pending basically means "pending to add". > > > So for the above example, in the pending list, you can see [Add x, add x] > > > if the > > > event is not processed in time. > > > Second, from the spec, I cannot find any text saying we cannot issue > > > another add extent X if it is still pending. > > > > I think there is text saying that the capacity is not released for reuse > > by the device until it receives a response from the host. Whilst > > it's not explicit on offers to the same host, I'm not sure that matters. > > So I don't think it is supposed to queue multiple extents... > > Are you suggesting we add a check here to reject the second add when the > first one is still pending? Yes. The capacity is not back with the device to reissue. On an MH-MLD/SLD we'd need to prevent it being added (not shared) to multiple hosts, this is kind of the temporal equivalent of that. > > Currently, we do not allow releasing an extent when it is still pending, > which aligns with the case you mentioned above "not release for reuse", I > think. > Can the second add mean a retry instead of reuse? No - or at least the device should not be doing that. The FM might try again.
Re: [PATCH v6 09/12] hw/cxl/events: Add qmp interfaces to add/release dynamic capacity extents
> > > > > > ret = cxl_detect_malformed_extent_list(ct3d, in); > > > if (ret != CXL_MBOX_SUCCESS) { > > > +cxl_extent_group_list_delete_front(&ct3d->dc.extents_pending); > > > > If it's a bad message from the host, I don't think the device is supposed to > > do anything with pending extents. > > It is not clear to me here. > > In the spec r3.1 8.2.9.9.9.3, Add Dynamic Capacity Response (Opcode 4802h), > there is text like "After this command is received, the device is free to > reclaim capacity that the host does not utilize.", that seems to imply > as long as the response is received, we need to update the pending list > so the capacity unused can be reclaimed. But of course, we can say if > there is error, we cannot tell whether the host accepts the extents or > not so not update the pending list. Can try and get a clarification as I agree 'is received' is unclear, but in general any command that gets an error response should have no effect on device state. If it does, then what effect it has must be stated in the specification. > > > > > + > > > +#define REMOVAL_POLICY_MASK 0xf > > > +#define FORCED_REMOVAL_BIT BIT(4) > > > + > > > +void qmp_cxl_release_dynamic_capacity(const char *path, uint16_t hid, > > > + uint8_t flags, uint8_t region_id, > > > + const char *tag, > > > + CXLDCExtentRecordList *records, > > > + Error **errp) > > > +{ > > > +CXLDCEventType type = DC_EVENT_RELEASE_CAPACITY; > > > + > > > +if (flags & FORCED_REMOVAL_BIT) { > > > +/* TODO: enable forced removal in the future */ > > > +type = DC_EVENT_FORCED_RELEASE_CAPACITY; > > > +error_setg(errp, "Forced removal not supported yet"); > > > +return; > > > +} > > > + > > > +switch (flags & REMOVAL_POLICY_MASK) { > > > +case 1: > > Probably benefit from a suitable define. > > > > > +qmp_cxl_process_dynamic_capacity_prescriptive(path, hid, type, > > > + region_id, > > > records, errp); > > > +break; > > > > I'd not noticed before but might as well return from these case blocks. > > Sorry.
I do not follow here. What do you mean by "return from these case > blocks", are you referring the check above about the forced removal case? No, what I meant was much simpler - just a code refactoring thing. case 1: qmp_cxl_process_dynamic_capacity_prescriptive(path, hid, type, region_id, records, errp); //break; return; > > Fan > > > > > > +default: > > > +error_setg(errp, "Removal policy not supported"); > > > +break; return; > > > +} > > > +}
Re: [PATCH v3 5/6] target/arm: Do memory type alignment check when translation disabled
On Fri, 1 Mar 2024 10:41:09 -1000 Richard Henderson wrote: > If translation is disabled, the default memory type is Device, which > requires alignment checking. This is more optimally done early via > the MemOp given to the TCG memory operation. > > Reviewed-by: Philippe Mathieu-Daudé > Reported-by: Idan Horowitz > Resolves: https://gitlab.com/qemu-project/qemu/-/issues/1204 > Signed-off-by: Richard Henderson Hi Richard. I noticed some tests I was running stopped booting with master. (it's a fun and complex stack of QEMU + kvm on QEMU for vCPU Hotplug kernel work, but this is the host booting) EDK2 build from upstream as of somepoint last week. Bisects to this patch. qemu-system-aarch64 -M virt,gic-version=3,virtualization=true -m 4g,maxmem=8G,slots=8 -cpu cortex-a76 -smp cpus=4,threads=2,clusters=2,sockets=1 \ -kernel Image \ -drive if=none,file=full.qcow2,format=qcow2,id=hd \ -device ioh3420,id=root_port1 -device virtio-blk-pci,drive=hd \ -netdev user,id=mynet,hostfwd=tcp::-:22 -device virtio-net-pci,netdev=mynet,id=bob \ -nographic -no-reboot -append 'earlycon root=/dev/vda2 fsck.mode=skip tp_printk' \ -monitor telnet:127.0.0.1:1235,server,nowait -bios QEMU_EFI.fd \ -object memory-backend-ram,size=4G,id=mem0 \ -numa node,nodeid=0,cpus=0-3,memdev=mem0 Symptoms: Nothing on console from edk2 which is built in debug mode so is normally very noisy. No sign of anything much happening at all :( Jonathan > --- > target/arm/tcg/hflags.c | 34 -- > 1 file changed, 32 insertions(+), 2 deletions(-) > > diff --git a/target/arm/tcg/hflags.c b/target/arm/tcg/hflags.c > index 8e5d35d922..5da1b0fc1d 100644 > --- a/target/arm/tcg/hflags.c > +++ b/target/arm/tcg/hflags.c > @@ -26,6 +26,35 @@ static inline bool fgt_svc(CPUARMState *env, int el) > FIELD_EX64(env->cp15.fgt_exec[FGTREG_HFGITR], HFGITR_EL2, SVC_EL1); > } > > +/* Return true if memory alignment should be enforced. 
*/ > +static bool aprofile_require_alignment(CPUARMState *env, int el, uint64_t > sctlr) > +{ > +#ifdef CONFIG_USER_ONLY > +return false; > +#else > +/* Check the alignment enable bit. */ > +if (sctlr & SCTLR_A) { > +return true; > +} > + > +/* > + * If translation is disabled, then the default memory type is > + * Device(-nGnRnE) instead of Normal, which requires that alignment > + * be enforced. Since this affects all ram, it is most efficient > + * to handle this during translation. > + */ > +if (sctlr & SCTLR_M) { > +/* Translation enabled: memory type in PTE via MAIR_ELx. */ > +return false; > +} > +if (el < 2 && (arm_hcr_el2_eff(env) & (HCR_DC | HCR_VM))) { > +/* Stage 2 translation enabled: memory type in PTE. */ > +return false; > +} > +return true; > +#endif > +} > + > static CPUARMTBFlags rebuild_hflags_common(CPUARMState *env, int fp_el, > ARMMMUIdx mmu_idx, > CPUARMTBFlags flags) > @@ -121,8 +150,9 @@ static CPUARMTBFlags rebuild_hflags_a32(CPUARMState *env, > int fp_el, > { > CPUARMTBFlags flags = {}; > int el = arm_current_el(env); > +uint64_t sctlr = arm_sctlr(env, el); > > -if (arm_sctlr(env, el) & SCTLR_A) { > +if (aprofile_require_alignment(env, el, sctlr)) { > DP_TBFLAG_ANY(flags, ALIGN_MEM, 1); > } > > @@ -223,7 +253,7 @@ static CPUARMTBFlags rebuild_hflags_a64(CPUARMState *env, > int el, int fp_el, > > sctlr = regime_sctlr(env, stage1); > > -if (sctlr & SCTLR_A) { > +if (aprofile_require_alignment(env, el, sctlr)) { > DP_TBFLAG_ANY(flags, ALIGN_MEM, 1); > } >
Re: [PATCH v6 10/12] hw/mem/cxl_type3: Add dpa range validation for accesses to DC regions
On Mon, 15 Apr 2024 10:37:00 -0700 fan wrote: > On Fri, Apr 12, 2024 at 06:54:42PM -0400, Gregory Price wrote: > > On Mon, Mar 25, 2024 at 12:02:28PM -0700, nifan@gmail.com wrote: > > > From: Fan Ni > > > > > > All dpa ranges in the DC regions are invalid to access until an extent > > > covering the range has been added. Add a bitmap for each region to > > > record whether a DC block in the region has been backed by DC extent. > > > For the bitmap, a bit in the bitmap represents a DC block. When a DC > > > extent is added, all the bits of the blocks in the extent will be set, > > > which will be cleared when the extent is released. > > > > > > Reviewed-by: Jonathan Cameron > > > Signed-off-by: Fan Ni > > > --- > > > hw/cxl/cxl-mailbox-utils.c | 6 +++ > > > hw/mem/cxl_type3.c | 76 + > > > include/hw/cxl/cxl_device.h | 7 > > > 3 files changed, 89 insertions(+) > > > > > > diff --git a/hw/cxl/cxl-mailbox-utils.c b/hw/cxl/cxl-mailbox-utils.c > > > index 7094e007b9..a0d2239176 100644 > > > --- a/hw/cxl/cxl-mailbox-utils.c > > > +++ b/hw/cxl/cxl-mailbox-utils.c > > > @@ -1620,6 +1620,7 @@ static CXLRetCode cmd_dcd_add_dyn_cap_rsp(const > > > struct cxl_cmd *cmd, > > > > > > cxl_insert_extent_to_extent_list(extent_list, dpa, len, NULL, 0); > > > ct3d->dc.total_extent_count += 1; > > > +ct3_set_region_block_backed(ct3d, dpa, len); > > > > > > ent = QTAILQ_FIRST(&ct3d->dc.extents_pending); > > > cxl_remove_extent_from_extent_list(&ct3d->dc.extents_pending, > > > ent); > > > > while looking at the MHD code, we had decided to "reserve" the blocks in > > the bitmap in the call to `qmp_cxl_process_dynamic_capacity` in order to > > prevent a potential double-allocation (basically we need to sanity check > > that two hosts aren't reserving the region PRIOR to the host being > > notified). > > > > I did not see any checks in the `qmp_cxl_process_dynamic_capacity` path > > to prevent pending extents from being double-allocated. Is this an > > explicit choice?
> > > > I can see, for example, why you may want to allow the following in the > > pending list: [Add X, Remove X, Add X]. I just want to know if this is > > intentional or not. If not, you may consider adding a pending check > > during the sanity check phase of `qmp_cxl_process_dynamic_capacity` > > > > ~Gregory > > First, for remove request, pending list is not involved. See cxl r3.1, > 9.13.3.3. Pending basically means "pending to add". > So for the above example, in the pending list, you can see [Add x, add x] if > the > event is not processed in time. > Second, from the spec, I cannot find any text saying we cannot issue > another add extent X if it is still pending. I think there is text saying that the capacity is not released for reuse by the device until it receives a response from the host. Whilst it's not explicit on offers to the same host, I'm not sure that matters. So I don't think it is supposed to queue multiple extents... > From the kernel side, if the first one is accepted, the second one will > get rejected, and there is no issue there. > If the first is rejected for some reason, the second one can get > accepted or rejected and do not need to worry about the first one. > > > Fan >
Re: [PATCH v6 09/12] hw/cxl/events: Add qmp interfaces to add/release dynamic capacity extents
On Mon, 15 Apr 2024 13:06:04 -0700 fan wrote: > From ce75be83e915fbc4dd6e489f976665b81174002b Mon Sep 17 00:00:00 2001 > From: Fan Ni > Date: Tue, 20 Feb 2024 09:48:31 -0800 > Subject: [PATCH 09/13] hw/cxl/events: Add qmp interfaces to add/release > dynamic capacity extents > > To simulate FM functionalities for initiating Dynamic Capacity Add > (Opcode 5604h) and Dynamic Capacity Release (Opcode 5605h) as in CXL spec > r3.1 7.6.7.6.5 and 7.6.7.6.6, we implemented two QMP interfaces to issue > add/release dynamic capacity extents requests. > > With the change, we allow to release an extent only when its DPA range > is contained by a single accepted extent in the device. That is to say, > extent superset release is not supported yet. > > 1. Add dynamic capacity extents: > > For example, the command to add two continuous extents (each 128MiB long) > to region 0 (starting at DPA offset 0) looks like below: > > { "execute": "qmp_capabilities" } > > { "execute": "cxl-add-dynamic-capacity", > "arguments": { > "path": "/machine/peripheral/cxl-dcd0", > "hid": 0, > "selection-policy": 2, > "region-id": 0, > "tag": "", > "extents": [ > { > "offset": 0, > "len": 134217728 > }, > { > "offset": 134217728, > "len": 134217728 > } > ] > } > } > > 2. Release dynamic capacity extents: > > For example, the command to release an extent of size 128MiB from region 0 > (DPA offset 128MiB) looks like below: > > { "execute": "cxl-release-dynamic-capacity", > "arguments": { > "path": "/machine/peripheral/cxl-dcd0", > "hid": 0, > "flags": 1, > "region-id": 0, > "tag": "", > "extents": [ > { > "offset": 134217728, > "len": 134217728 > } > ] > } > } > > Signed-off-by: Fan Ni Nice! 
A few small comments inline - particularly don't be nice to the kernel by blocking things it doesn't understand yet ;) Jonathan > --- > hw/cxl/cxl-mailbox-utils.c | 65 ++-- > hw/mem/cxl_type3.c | 310 +++- > hw/mem/cxl_type3_stubs.c| 20 +++ > include/hw/cxl/cxl_device.h | 22 +++ > include/hw/cxl/cxl_events.h | 18 +++ > qapi/cxl.json | 69 > 6 files changed, 491 insertions(+), 13 deletions(-) > > diff --git a/hw/cxl/cxl-mailbox-utils.c b/hw/cxl/cxl-mailbox-utils.c > index cd9092b6bf..839ae836a1 100644 > --- a/hw/cxl/cxl-mailbox-utils.c > +++ b/hw/cxl/cxl-mailbox-utils.c > /* > * CXL r3.1 Table 8-168: Add Dynamic Capacity Response Input Payload > * CXL r3.1 Table 8-170: Release Dynamic Capacity Input Payload > @@ -1541,6 +1579,7 @@ static CXLRetCode > cxl_dcd_add_dyn_cap_rsp_dry_run(CXLType3Dev *ct3d, > { > uint32_t i; > CXLDCExtent *ent; > +CXLDCExtentGroup *ext_group; > uint64_t dpa, len; > Range range1, range2; > > @@ -1551,9 +1590,13 @@ static CXLRetCode > cxl_dcd_add_dyn_cap_rsp_dry_run(CXLType3Dev *ct3d, > range_init_nofail(&range1, dpa, len); > > /* > - * TODO: once the pending extent list is added, check against > - * the list will be added here. > + * The host-accepted DPA range must be contained by the first extent > + * group in the pending list > */ > +ext_group = QTAILQ_FIRST(&ct3d->dc.extents_pending); > +if (!cxl_extents_contains_dpa_range(&ext_group->list, dpa, len)) { > +return CXL_MBOX_INVALID_PA; > +} > > /* to-be-added range should not overlap with range already accepted > */ > QTAILQ_FOREACH(ent, &ct3d->dc.extents, node) { > @@ -1588,26 +1631,26 @@ static CXLRetCode cmd_dcd_add_dyn_cap_rsp(const > struct cxl_cmd *cmd, > CXLRetCode ret; > > if (in->num_entries_updated == 0) { > -/* > - * TODO: once the pending list is introduced, extents in the > beginning > - * will get wiped out. > - */ > +cxl_extent_group_list_delete_front(&ct3d->dc.extents_pending); > return CXL_MBOX_SUCCESS; > } > > /* Adding extents causes exceeding device's extent tracking ability.
*/ > if (in->num_entries_updated + ct3d->dc.total_extent_count > > CXL_NUM_EXTENTS_SUPPORTED) { > +cxl_extent_group_list_delete_front(&ct3d->dc.extents_pending); > return CXL_MBOX_RESOURCES_EXHAUSTED; > } > > ret = cxl_detect_malformed_extent_list(ct3d, in); > if (ret != CXL_MBOX_SUCCESS) { > +cxl_extent_group_list_delete_front(&ct3d->dc.extents_pending); If it's a bad message from the host, I don't think the device is supposed to do anything with pending extents. > return ret; > } > > ret = cxl_dcd_add_dyn_cap_rsp_dry_run(ct3d, in); > if (ret != CXL_MBOX_SUCCESS) { > +cxl_extent_group_list_delete_front(&ct3d->dc.extents_pending); > return ret; > } > diff --git a/hw/mem/cxl_type3.c b/hw/mem/cxl_type3.c > index
Re: [PATCH v2] hw/mem/cxl_type3: reset dvsecs in ct3d_reset()
On Tue, 9 Apr 2024 15:58:46 +0800 Li Zhijian wrote: > After the kernel commit > 0cab68720598 ("cxl/pci: Fix disabling memory if DVSEC CXL Range does not > match a CFMWS window") > CXL type3 devices cannot be enabled again after the reboot because the > control register (see 8.1.3.2 in CXL specification 2.0 for more details) was > not reset. > > These registers could be changed by the firmware or OS, let them have > their initial value in reboot so that the OS can read their clean status. > > Fixes: e1706ea83da0 ("hw/cxl/device: Add a memory device (8.2.8.5)") > Signed-off-by: Li Zhijian Hi, We need to have a close look at what this is actually doing before considering applying it. I don't have time to get to that this week, but hopefully will find some time later this month. I don't want a partial fix for one particular case that causes us potential trouble in others. Jonathan > --- > root_port, usp and dsp have the same issue, if this patch gets approved, > I will send another patch to fix them later. > > V2: >Add fixes tag.
>Reset all dvsecs registers instead of CTRL only > --- > hw/mem/cxl_type3.c | 11 +++ > 1 file changed, 7 insertions(+), 4 deletions(-) > > diff --git a/hw/mem/cxl_type3.c b/hw/mem/cxl_type3.c > index b0a7e9f11b64..4f09d0b8fedc 100644 > --- a/hw/mem/cxl_type3.c > +++ b/hw/mem/cxl_type3.c > @@ -30,6 +30,7 @@ > #include "hw/pci/msix.h" > > #define DWORD_BYTE 4 > +#define CT3D_CAP_SN_OFFSET PCI_CONFIG_SPACE_SIZE > > /* Default CDAT entries for a memory region */ > enum { > @@ -284,6 +285,10 @@ static void build_dvsecs(CXLType3Dev *ct3d) > range2_size_hi = 0, range2_size_lo = 0, > range2_base_hi = 0, range2_base_lo = 0; > > +cxl_cstate->dvsec_offset = CT3D_CAP_SN_OFFSET; > +if (ct3d->sn != UI64_NULL) { > +cxl_cstate->dvsec_offset += PCI_EXT_CAP_DSN_SIZEOF; > +} > /* > * Volatile memory is mapped as (0x0) > * Persistent memory is mapped at (volatile->size) > @@ -664,10 +669,7 @@ static void ct3_realize(PCIDevice *pci_dev, Error **errp) > > pcie_endpoint_cap_init(pci_dev, 0x80); > if (ct3d->sn != UI64_NULL) { > -pcie_dev_ser_num_init(pci_dev, 0x100, ct3d->sn); > -cxl_cstate->dvsec_offset = 0x100 + 0x0c; > -} else { > -cxl_cstate->dvsec_offset = 0x100; > +pcie_dev_ser_num_init(pci_dev, CT3D_CAP_SN_OFFSET, ct3d->sn); > } > > ct3d->cxl_cstate.pdev = pci_dev; > @@ -907,6 +909,7 @@ static void ct3d_reset(DeviceState *dev) > > cxl_component_register_init_common(reg_state, write_msk, > CXL2_TYPE3_DEVICE); > cxl_device_register_init_t3(ct3d); > +build_dvsecs(ct3d); > > /* > * Bring up an endpoint to target with MCTP over VDM.
Re: [PATCH v6 09/12] hw/cxl/events: Add qmp interfaces to add/release dynamic capacity extents
On Tue, 9 Apr 2024 14:26:51 -0700 fan wrote: > On Fri, Apr 05, 2024 at 01:18:56PM +0100, Jonathan Cameron wrote: > > On Mon, 25 Mar 2024 12:02:27 -0700 > > nifan@gmail.com wrote: > > > > > From: Fan Ni > > > > > > To simulate FM functionalities for initiating Dynamic Capacity Add > > > (Opcode 5604h) and Dynamic Capacity Release (Opcode 5605h) as in CXL spec > > > r3.1 7.6.7.6.5 and 7.6.7.6.6, we implemented two QMP interfaces to issue > > > add/release dynamic capacity extents requests. > > > > > > With the change, we allow to release an extent only when its DPA range > > > is contained by a single accepted extent in the device. That is to say, > > > extent superset release is not supported yet. > > > > > > 1. Add dynamic capacity extents: > > > > > > For example, the command to add two continuous extents (each 128MiB long) > > > to region 0 (starting at DPA offset 0) looks like below: > > > > > > { "execute": "qmp_capabilities" } > > > > > > { "execute": "cxl-add-dynamic-capacity", > > > "arguments": { > > > "path": "/machine/peripheral/cxl-dcd0", > > > "region-id": 0, > > > "extents": [ > > > { > > > "offset": 0, > > > "len": 134217728 > > > }, > > > { > > > "offset": 134217728, > > > "len": 134217728 > > > } > > > > Hi Fan, > > > > I talk more on this inline, but to me this interface takes multiple extents > > so that we can treat them as a single 'offer' of capacity. That is they > > should be linked in the event log with the more flag and the host should > > have to handle them in one go (I known Ira and Navneet's code doesn't handle > > this yet, but that doesn't mean QEMU shouldn't). > > > > Alternative for now would be to only support a single entry. Keep the > > interface defined to take multiple entries but reject it at runtime. > > > > I don't want to end up with a more complex interface in the end just > > because we allowed this form to not set the MORE flag today. 
> > We will need this to do tagged handling and ultimately sharing, so good > > to get it right from the start. > > > > For tagged handling I think the right option is to have the tag alongside > > region-id not in the individual extents. That way the interface is > > naturally > > used to generate the right description to the host. > > > > > ] > > > } > > > } > Hi Jonathan, > Thanks for the detailed comments. > > For the QMP interface, I have one question. > Do we want the interface to follow exactly as shown in > Table 7-70 and Table 7-71 in cxl r3.1? I don't mind if it doesn't as long as it lets us pass reasonable things in to test the kernel code. I'd have the interface designed to allow us to generate the set of records associated with a given 'request'. E.g. All same tag in the same QMP command. If we want multiple sets of such records (and the extents to back them) we can issue multiple calls. Jonathan > > Fan > > > > > > > 2. Release dynamic capacity extents: > > > > > > For example, the command to release an extent of size 128MiB from region 0 > > > (DPA offset 128MiB) looks like below: > > > > > > { "execute": "cxl-release-dynamic-capacity", > > > "arguments": { > > > "path": "/machine/peripheral/cxl-dcd0", > > > "region-id": 0, > > > "extents": [ > > > { > > > "offset": 134217728, > > > "len": 134217728 > > > } > > > ] > > > } > > > } > > > > > > Signed-off-by: Fan Ni > > > > > > > > > /* to-be-added range should not overlap with range already > > > accepted */ > > > QTAILQ_FOREACH(ent, &ct3d->dc.extents, node) { > > > @@ -1585,9 +1586,13 @@ static CXLRetCode cmd_dcd_add_dyn_cap_rsp(const > > > struct cxl_cmd *cmd, > > > CXLDCExtentList *extent_list = &ct3d->dc.extents; > > > uint32_t i; > > > uint64_t dpa, len; > > > +CXLDCExtent *ent; > > > CXLRetCode ret; > > > > > > if (i
Re: [External] Re: [PATCH v11 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info
On Tue, 9 Apr 2024 12:02:31 -0700 "Ho-Ren (Jack) Chuang" wrote: > Hi Jonathan, > > On Tue, Apr 9, 2024 at 9:12 AM Jonathan Cameron > wrote: > > > > On Fri, 5 Apr 2024 15:43:47 -0700 > > "Ho-Ren (Jack) Chuang" wrote: > > > > > On Fri, Apr 5, 2024 at 7:03 AM Jonathan Cameron > > > wrote: > > > > > > > > On Fri, 5 Apr 2024 00:07:06 + > > > > "Ho-Ren (Jack) Chuang" wrote: > > > > > > > > > The current implementation treats emulated memory devices, such as > > > > > CXL1.1 type3 memory, as normal DRAM when they are emulated as normal > > > > > memory > > > > > (E820_TYPE_RAM). However, these emulated devices have different > > > > > characteristics than traditional DRAM, making it important to > > > > > distinguish them. Thus, we modify the tiered memory initialization > > > > > process > > > > > to introduce a delay specifically for CPUless NUMA nodes. This delay > > > > > ensures that the memory tier initialization for these nodes is > > > > > deferred > > > > > until HMAT information is obtained during the boot process. Finally, > > > > > demotion tables are recalculated at the end. > > > > > > > > > > * late_initcall(memory_tier_late_init); > > > > > Some device drivers may have initialized memory tiers between > > > > > `memory_tier_init()` and `memory_tier_late_init()`, potentially > > > > > bringing > > > > > online memory nodes and configuring memory tiers. They should be > > > > > excluded > > > > > in the late init. > > > > > > > > > > * Handle cases where there is no HMAT when creating memory tiers > > > > > There is a scenario where a CPUless node does not provide HMAT > > > > > information. > > > > > If no HMAT is specified, it falls back to using the default DRAM tier. > > > > > > > > > > * Introduce another new lock `default_dram_perf_lock` for adist > > > > > calculation > > > > > In the current implementation, iterating through CPUlist nodes > > > > > requires > > > > > holding the `memory_tier_lock`. 
However, `mt_calc_adistance()` will > > > > > end up > > > > > trying to acquire the same lock, leading to a potential deadlock. > > > > > Therefore, we propose introducing a standalone > > > > > `default_dram_perf_lock` to > > > > > protect `default_dram_perf_*`. This approach not only avoids deadlock > > > > > but also prevents holding a large lock simultaneously. > > > > > > > > > > * Upgrade `set_node_memory_tier` to support additional cases, > > > > > including > > > > > default DRAM, late CPUless, and hot-plugged initializations. > > > > > To cover hot-plugged memory nodes, `mt_calc_adistance()` and > > > > > `mt_find_alloc_memory_type()` are moved into `set_node_memory_tier()` > > > > > to > > > > > handle cases where memtype is not initialized and where HMAT > > > > > information is > > > > > available. > > > > > > > > > > * Introduce `default_memory_types` for those memory types that are not > > > > > initialized by device drivers. > > > > > Because late initialized memory and default DRAM memory need to be > > > > > managed, > > > > > a default memory type is created for storing all memory types that are > > > > > not initialized by device drivers and as a fallback. > > > > > > > > > > Signed-off-by: Ho-Ren (Jack) Chuang > > > > > Signed-off-by: Hao Xiang > > > > > Reviewed-by: "Huang, Ying" > > > > > > > > Hi - one remaining question. Why can't we delay init for all nodes > > > > to either drivers or your fallback late_initcall code. > > > > It would be nice to reduce possible code paths. > > > > > > I try not to change too much of the existing code structure in > > > this patchset. > > > > > > To me, postponing/moving all memory tier registrations to > > > late_initcall() is another possible action item for the next patchset. > > > > > > After tier_mem(), hmat_init() is called, which requires registering > > > `default_dram_type` info. This is when `default_dram_type` is needed. 
> > > However, it is indeed possible to postpone the latter part, > > > set_node_memory_tier(), to `late_initcall()`. So, memory_tier_init() can > > > indeed be split into two parts, and the latter part can be moved to > > > late_initcall() to be processed together. > > > > > > Doing this, all memory-type drivers have to call late_initcall() to > > > register a memory tier. I’m not sure how many there are? > > > > > > What do you guys think? > > > > Gut feeling - if you are going to move it for some cases, move it for > > all of them. Then we only have to test once ;) > > > > J > > Thank you for your reminder! I agree~ That's why I'm considering > changing them in the next patchset because of the amount of changes. > And also, this patchset already contains too many things. Makes sense. (Interestingly we are reaching the same conclusion for the thread that motivated suggesting bringing them all together in the first place!) Get things working in a clean fashion, then consider moving everything to happen at the same time to simplify testing etc. Jonathan
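The two-pass split being discussed can be modeled in isolation. The sketch below is plain C, not kernel code: names are illustrative, and it only captures the control flow — CPU-bearing nodes are tiered in the early pass, CPUless nodes are deferred to a late pass (where HMAT-derived adistance, or the DRAM fallback, would be used), and nodes a driver tiered in between are skipped:

```c
#include <stdbool.h>

enum { NR_NODES = 4 };

struct node {
    bool has_cpu;
    bool tiered;   /* set once the node has been placed in a memory tier */
};

/* Early pass (memory_tier_init() analogue): only nodes with CPUs;
 * CPUless nodes wait until HMAT info can have arrived. */
static void early_tier_init(struct node *n)
{
    for (int i = 0; i < NR_NODES; i++) {
        if (n[i].has_cpu) {
            n[i].tiered = true;
        }
    }
}

/* Late pass (late_initcall() analogue): pick up remaining CPUless nodes,
 * excluding any a device driver already tiered between the two passes. */
static void late_tier_init(struct node *n)
{
    for (int i = 0; i < NR_NODES; i++) {
        if (!n[i].tiered) {
            n[i].tiered = true;
        }
    }
}
```

Moving everything to the late pass, as suggested above, would collapse `early_tier_init()` into `late_tier_init()` at the cost of requiring all memory-type drivers to register late.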
Re: [PATCH v11 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info
On Fri, 5 Apr 2024 15:43:47 -0700 "Ho-Ren (Jack) Chuang" wrote: > On Fri, Apr 5, 2024 at 7:03 AM Jonathan Cameron > wrote: > > > > On Fri, 5 Apr 2024 00:07:06 + > > "Ho-Ren (Jack) Chuang" wrote: > > > > > The current implementation treats emulated memory devices, such as > > > CXL1.1 type3 memory, as normal DRAM when they are emulated as normal > > > memory > > > (E820_TYPE_RAM). However, these emulated devices have different > > > characteristics than traditional DRAM, making it important to > > > distinguish them. Thus, we modify the tiered memory initialization process > > > to introduce a delay specifically for CPUless NUMA nodes. This delay > > > ensures that the memory tier initialization for these nodes is deferred > > > until HMAT information is obtained during the boot process. Finally, > > > demotion tables are recalculated at the end. > > > > > > * late_initcall(memory_tier_late_init); > > > Some device drivers may have initialized memory tiers between > > > `memory_tier_init()` and `memory_tier_late_init()`, potentially bringing > > > online memory nodes and configuring memory tiers. They should be excluded > > > in the late init. > > > > > > * Handle cases where there is no HMAT when creating memory tiers > > > There is a scenario where a CPUless node does not provide HMAT > > > information. > > > If no HMAT is specified, it falls back to using the default DRAM tier. > > > > > > * Introduce another new lock `default_dram_perf_lock` for adist > > > calculation > > > In the current implementation, iterating through CPUlist nodes requires > > > holding the `memory_tier_lock`. However, `mt_calc_adistance()` will end up > > > trying to acquire the same lock, leading to a potential deadlock. > > > Therefore, we propose introducing a standalone `default_dram_perf_lock` to > > > protect `default_dram_perf_*`. This approach not only avoids deadlock > > > but also prevents holding a large lock simultaneously. 
> > > > > > * Upgrade `set_node_memory_tier` to support additional cases, including > > > default DRAM, late CPUless, and hot-plugged initializations. > > > To cover hot-plugged memory nodes, `mt_calc_adistance()` and > > > `mt_find_alloc_memory_type()` are moved into `set_node_memory_tier()` to > > > handle cases where memtype is not initialized and where HMAT information > > > is > > > available. > > > > > > * Introduce `default_memory_types` for those memory types that are not > > > initialized by device drivers. > > > Because late initialized memory and default DRAM memory need to be > > > managed, > > > a default memory type is created for storing all memory types that are > > > not initialized by device drivers and as a fallback. > > > > > > Signed-off-by: Ho-Ren (Jack) Chuang > > > Signed-off-by: Hao Xiang > > > Reviewed-by: "Huang, Ying" > > > > Hi - one remaining question. Why can't we delay init for all nodes > > to either drivers or your fallback late_initcall code. > > It would be nice to reduce possible code paths. > > I try not to change too much of the existing code structure in > this patchset. > > To me, postponing/moving all memory tier registrations to > late_initcall() is another possible action item for the next patchset. > > After tier_mem(), hmat_init() is called, which requires registering > `default_dram_type` info. This is when `default_dram_type` is needed. > However, it is indeed possible to postpone the latter part, > set_node_memory_tier(), to `late_init(). So, memory_tier_init() can > indeed be split into two parts, and the latter part can be moved to > late_initcall() to be processed together. > > Doing this all memory-type drivers have to call late_initcall() to > register a memory tier. I’m not sure how many they are? > > What do you guys think? Gut feeling - if you are going to move it for some cases, move it for all of them. 
Then we only have to test once ;) J > > > > > Jonathan > > > > > > > --- > > > mm/memory-tiers.c | 94 +++ > > > 1 file changed, 70 insertions(+), 24 deletions(-) > > > > > > diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c > > > index 516b144fd45a..6632102bd5c9 100644 > > > --- a/mm/memory-tiers.c > > > +++ b/mm/memory-tiers.c > > > >
Re: [PATCH v6 09/12] hw/cxl/events: Add qmp interfaces to add/release dynamic capacity extents
On Fri, 5 Apr 2024 14:09:23 -0400 Gregory Price wrote: > On Fri, Apr 05, 2024 at 06:44:52PM +0100, Jonathan Cameron wrote: > > On Fri, 5 Apr 2024 12:07:45 -0400 > > Gregory Price wrote: > > > > > 3. (C) Upon Device receiving Release Dynamic Capacity Request > > >a. check for a pending release request. If exists, error. > > > > Not sure that's necessary - can queue as long as the head > > can track if the bits are in a pending release state. > > > > Yeah probably it's fine to just queue the event and everything > downstream just handles it. > > > >b. check that the bits in the MHD bitmap are actually set > > Good. > > > > > >function: qmp_cxl_process_dynamic_capacity > > > > > > 4. (D) Upon Device receiving Release Dynamic Capacity Response > > >a. clear the bits in the mhd bitmap > > >b. remove the pending request from the pending list > > > > > >function: cmd_dcd_release_dyn_cap > > > > > > Something to note: The MHD bitmap is essentially the same as the > > > existing DCD extent bitmap - except that it is located in a shared > > > region of memory (mmap file, shm, whatever - pick one). > > > > I think you will ideally also have a per head one to track head access > > to the things offered by the mhd. > > > > Generally I try not to duplicate state, reduces consistency problems. > > You do still need a shared memory state and a per-head state to capture > per-head data, but the allocation bitmap is really device-global state. There is a separation between 'offered' to a head and 'accepted on that head'. Sure you could track all outstanding offers (if you let more than one be outstanding) at the shared memory, just seemed easier to do that in the per head element. > > Either way you have a race condition when checking the bitmap during a > memory access in the process of adding/releasing capacity - but that's > more an indication of bad host behavior than it is of a bug in the > implementation of the emulated device.
Probably we don't need to > read-lock the bitmap (for access validation), only write-lock. > > My preference, for what it's worth, would be to have a single bitmap > and have it be anonymous-memory for Single-head and file-backed > for Multi-head. I'll have to work out the locking mechanism. I'll go with maybe until I see the code :) J > > ~Gregory
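A minimal sketch of the lockless option mentioned in this thread, assuming one bit per DC block in a word that could live in either anonymous or file-backed shared memory. The names are mine (not the Niagara model's), and a real implementation would span multiple words:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Atomically claim a set of blocks (mask) for one head.  If any block in
 * the mask is already owned, the whole claim fails, so two heads racing
 * for overlapping extents cannot both win. */
static bool claim_blocks(_Atomic uint64_t *bitmap, uint64_t mask)
{
    uint64_t old = atomic_load(bitmap);
    do {
        if (old & mask) {
            return false;   /* double-allocation refused */
        }
    } while (!atomic_compare_exchange_weak(bitmap, &old, old | mask));
    return true;
}

/* Release is a plain atomic clear - done only after the host has
 * responded to the release-extent request, per the flow above. */
static void release_blocks(_Atomic uint64_t *bitmap, uint64_t mask)
{
    atomic_fetch_and(bitmap, ~mask);
}
```

The compare-and-swap loop makes the "oops, failed to get all the bits, roll back" case atomic within one word; a coarse lock around a multi-word bitmap is the simpler alternative discussed above.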
Re: How to use pxb-pcie in correct way?
On Mon, 8 Apr 2024 13:58:00 +0200 Marcin Juszkiewicz wrote: > For quite a while I am experimenting with PCI Express setup on SBSA-Ref > system. And finally decided to write. > > We want to play with NUMA setup and "pxb-pcie" can be assigned to NUMA > node other than cpu0 one. But adding it makes other cards disappear... > > When I boot sbsa-ref I have plain PCIe setup: > > (qemu) info pci >Bus 0, device 0, function 0: > Host bridge: PCI device 1b36:0008 >PCI subsystem 1af4:1100 >id "" >Bus 0, device 1, function 0: > Ethernet controller: PCI device 8086:10d3 >PCI subsystem 8086: >IRQ 255, pin A >BAR0: 32 bit memory at 0x [0x0001fffe]. >BAR1: 32 bit memory at 0x [0x0001fffe]. >BAR2: I/O at 0x [0x001e]. >BAR3: 32 bit memory at 0x [0x3ffe]. >BAR6: 32 bit memory at 0x [0x0003fffe]. >id "" >Bus 0, device 2, function 0: > Display controller: PCI device 1234: >PCI subsystem 1af4:1100 >BAR0: 32 bit prefetchable memory at 0x8000 [0x80ff]. >BAR2: 32 bit memory at 0x81084000 [0x81084fff]. >BAR6: 32 bit memory at 0x [0x7ffe]. >id "" > > Adding extra PCIe card works fine - both just "igb" and "igb" with > "pcie-root-port". > > But adding "pcie-root-port" + "igb" and then "pxb-pcie" makes "igb" > disappear: > > ../code/qemu/build/qemu-system-aarch64 > -monitor telnet::45454,server,nowait > -serial stdio > -device pcie-root-port,id=ULyWl,slot=0,chassis=0 > -device igb,bus=ULyWl > -device pxb-pcie,bus_nr=1 That's setting the base bus number to 1. Very likely to clash with the bus number for the bus below the root port. Set it to bus_nr=128 or something like that. There is no sanity checking for PXBs because the bus enumeration is an EDK2 problem in general - short of enumerating the buses in QEMU there isn't a way for it to tell. J
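Applying that suggestion to the failing invocation gives something like the sketch below — `bus_nr=128` is an arbitrary value chosen to sit above whatever bus numbers firmware enumerates behind bus 0's root ports, and any machine-type/firmware options from the original command line (elided in the report above) would still be needed:

```shell
# Sketch: same device set, but the PXB's base bus number moved out of
# the range EDK2 will assign to the root port's secondary bus.
../code/qemu/build/qemu-system-aarch64 \
    -monitor telnet::45454,server,nowait \
    -serial stdio \
    -device pcie-root-port,id=ULyWl,slot=0,chassis=0 \
    -device igb,bus=ULyWl \
    -device pxb-pcie,bus_nr=128
```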
Re: [PATCH v6 09/12] hw/cxl/events: Add qmp interfaces to add/release dynamic capacity extents
On Fri, 5 Apr 2024 12:07:45 -0400 Gregory Price wrote: > On Fri, Apr 05, 2024 at 01:27:19PM +0100, Jonathan Cameron wrote: > > On Wed, 3 Apr 2024 14:16:25 -0400 > > Gregory Price wrote: > > > > A few follow up comments. > > > > > > > > > +error_setg(errp, "no valid extents to send to process"); > > > > +return; > > > > +} > > > > + > > > > > > I'm looking at adding the MHD extensions around this point, e.g.: > > > > > > /* If MHD cannot allocate requested extents, the cmd fails */ > > > if (type == DC_EVENT_ADD_CAPACITY && dcd->mhd_dcd_extents_allocate && > > > num_extents != dcd->mhd_dcd_extents_allocate(...)) > > > return; > > > > > > where mhd_dcd_extents_allocate checks the MHD block bitmap and tags > > > for correctness (shared // no double-allocations, etc). On success, > > > it guarantees proper ownership. > > > > > > the release path would then be done in the release response path from > > > the host, as opposed to the release event injection. > > > > I think it would be polite to check whether the QMP command on release > > is asking something plausible - makes for an easier > > to use QMP interface. I guess it's not strictly required though. > > What races are there on release? > > The only real critical section, barring force-release being supported, > is when you clear the bits in the device allowing new requests to swipe > those blocks. The appropriate place appears to be after the host kernel > has responded to the release extent request. Agreed you can't release till then, but you can check if it's going to work. I think that's worth doing for ease of use reasons. > > Also need to handle the case of multiple add-requests contending for the > same region, but that's just an "oops failed to get all the bits, roll > back" scenario - easy to handle. > > Could go coarse-grained to just lock access to the bitmap entirely while > operating on it, or be fancy and use atomics to go lockless.
The latter > code already exists in the Niagara model for reference. I'm fine either way, though I'd just use a lock in initial version() > > > We aren't support force release > > for now, and for anything else, it's host specific (unlike add where > > the extra rules kick in). AS such I 'think' a check at command > > time will be valid as long as the host hasn't done an async > > release of capacity between that and the event record. That > > is a race we always have and the host should at most log it and > > not release capacity twice. > > > > Borrowing from the Ira's flow chart, here are the pieces I believe are > needed to implement MHD support for DCD. > > Orchestrator FM Device Host KernelHost User > > | | || | > |-- Add ->|-- Add --->A--- Add --->| | > | Capacity | Extent | Extent | | > | | || | > | |<--Accept--B<--Accept --| | > | | Extent | Extent | | > | | || | > | | ... snip ... | | > | | || | > |-- Remove -->|--Release->C---Release->| | > | Capacity | Extent | Extent | | > | | || | > | |<-Release--D<--Release--| | > | | Extent | Extent | | > | | || | > > 1. (A) Upon Device Receiving Add Capacity Request >a. the device sanity checks the request against local mappings >b. the mhd hook is called to sanity check against global mappings >c. the mhd bitmap is updated, marking the capacity owned by that head > >function: qmp_cxl_process_dynamic_capacity > > 2. (B) Upon Device Receiving Add Dynamic Capacity Response >a. accepted extents are compared to the original request >b. not accepted extents are cleared from the bitmap (local and MHD) >(Note: My understanding is that for now each request = 1 extent) Yeah but that is a restriction I think we need to solve soon. > >function: cmd_dcd_add_dyn_cap_rsp > > 3. (C) Upon Device receiving Release Dynamic Capacity Request >a. check for a p
Re: [RFC PATCH 5/5] cxl/core: add poison injection event handler
On Fri, 15 Mar 2024 10:29:07 +0800 Shiyang Ruan wrote: > On 2024/2/14 0:51, Jonathan Cameron wrote: > > > >> + > >> +void cxl_event_handle_record(struct cxl_memdev *cxlmd, > >> + enum cxl_event_log_type type, > >> + enum cxl_event_type event_type, > >> + const uuid_t *uuid, union cxl_event *evt) > >> +{ > >> + if (event_type == CXL_CPER_EVENT_GEN_MEDIA) { > >>trace_cxl_general_media(cxlmd, type, >gen_media); > >> - else if (event_type == CXL_CPER_EVENT_DRAM) > >> + /* handle poison event */ > >> + if (type == CXL_EVENT_TYPE_FAIL) > >> + cxl_event_handle_poison(cxlmd, >gen_media); > > > > I'm not 100% convinced this necessarily indicates poison. Also > > the text tells us we should see 'an appropriate event'. > > DRAM one seems likely to be chosen by some vendors. > > I think it's right to use DRAM Event Record for volatile-memdev, but > should poison on a persistent-memdev also use DRAM Event Record too? > Though its 'Physical Address' field has the 'Volatile' bit too, which is the > same as General Media Event Record. I am a bit confused about this. That is indeed 'novel' in a DRAM device, but maybe it could be battery backed and have a path to say a flash device that isn't visible to CXL and from which the DRAM is refilled on power restore? Anyhow, doesn't make sense for persistent memory that doesn't correspond to all the other stuff in the DRAM event. > > > > > The fatal check maybe makes it a little more likely (maybe though > > I'm not sure anything says a device must log it to the failure log) > > but it might be Memory Event Type 1, which is the host tried to > > access an invalid address. Sure poison might be returned to that > > error but what would the main kernel memory handling do with it? > > Something is very wrong > > but it's not corrupted device memory. TE state violations are in there > > as well. Sure poison is returned on reads (I think - haven't checked). > > > > If the aim here is to say 'maybe there is poison, better check the > > poison list'.
That is reasonable, but we should ensure things > > like timer expiry are definitely ruled out and rename the function > > to make it clear it might not find poison. > > I forgot to distinguish the 'Transaction Type' here. Host Inject Poison > is 0x04h. And other types should also have their specific handling methods. Yes. If you can use the transaction type, that solves this issue I think. > > > -- > Thanks, > Ruan. > > > > > Jonathan
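The transaction-type gating agreed on above can be sketched minimally as follows. The 0x04h encoding for Host Inject Poison comes straight from the thread; the helper name and the decision to treat only that type as a poison cue are illustrative, and a real handler might also accept other media-error transaction types:

```c
#include <stdbool.h>
#include <stdint.h>

#define CXL_GMER_TRANS_HOST_INJECT_POISON 0x04

/* Only treat a General Media / DRAM event record as a cue to re-check
 * the device poison list when its Transaction Type says the host
 * injected poison; invalid-address accesses, TE state violations and
 * timer expiry are filtered out instead of being inferred from which
 * event log the record happened to arrive in. */
static bool record_suggests_poison(uint8_t transaction_type)
{
    return transaction_type == CXL_GMER_TRANS_HOST_INJECT_POISON;
}
```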
Re: [PATCH] mem/cxl_type3: fix hpa to dpa logic
On Mon, 1 Apr 2024 17:00:50 +0100 Jonathan Cameron via wrote: > On Thu, 28 Mar 2024 06:24:24 + > "Xingtao Yao (Fujitsu)" wrote: > > > Jonathan > > > > thanks for your reply! > > > > > -----Original Message- > > > From: Jonathan Cameron > > > Sent: Wednesday, March 27, 2024 9:28 PM > > > To: Yao, Xingtao/姚 幸涛 > > > Cc: fan...@samsung.com; qemu-devel@nongnu.org; Cao, Quanquan/曹 全全 > > > > > > Subject: Re: [PATCH] mem/cxl_type3: fix hpa to dpa logic > > > > > > On Tue, 26 Mar 2024 21:46:53 -0400 > > > Yao Xingtao wrote: > > > > > > > In 3, 6, 12 interleave ways, we could not access cxl memory properly, > > > > and when the process is running on it, a 'segmentation fault' error will > > > > occur. > > > > > > > > According to the CXL specification '8.2.4.20.13 Decoder Protection', > > > > there are two branches to convert HPA to DPA: > > > > b1: Decoder[m].IW < 8 (for 1, 2, 4, 8, 16 interleave ways) > > > > b2: Decoder[m].IW >= 8 (for 3, 6, 12 interleave ways) > > > > > > > > but only b1 has been implemented. > > > > > > > > To solve this issue, we should implement b2: > > > > DPAOffset[51:IG+8]=HPAOffset[51:IG+IW] / 3 > > > > DPAOffset[IG+7:0]=HPAOffset[IG+7:0] > > > > DPA=DPAOffset + Decoder[n].DPABase > > > > > > > > Links: > > > https://lore.kernel.org/linux-cxl/3e84b919-7631-d1db-3e1d-33000f3f3868@fujitsu.com/ > > > > Signed-off-by: Yao Xingtao > > > > > > Not implementing this was intentional (shouldn't seg fault obviously) but > > > I thought we were not advertising EP support for 3, 6, 12? The HDM > > > Decoder > > > configuration checking is currently terrible so we don't prevent > > > the bits being set (adding device side sanity checks for those decoders > > > has been on the todo list for a long time). There are a lot of ways of > > > programming those that will blow up. > > > > > > Can you confirm that the emulation reports they are supported.
> > > https://elixir.bootlin.com/qemu/v9.0.0-rc1/source/hw/cxl/cxl-component-utils.c#L246 > > > implies it shouldn't and so any software using them is broken. > > yes, the feature is not supported by QEMU, but I can still create a > > 6-interleave-ways region on kernel layer. > > > > I checked the source code of kernel, and found that the kernel did not > > check this bit when committing decoder. > > we may add some check on kernel side. > > ouch. We definitely want that check! The decoder commit will fail > anyway (which QEMU doesn't yet because we don't do all the sanity checks > we should). However failing on commit is nasty as the reason should have > been detected earlier. > > > > > > > > The non power of 2 decodes always made me nervous as the maths is more > > > complex and any changes to that decode will need careful checking. > > > For the power of 2 cases it was a bunch of writes to edge conditions etc > > > and checking the right data landed in the backing stores. > > after applying this modification, I tested some commands by using these > > memory, like 'ls', 'top'.. > > and they can be executed normally, maybe there are some other problems I > > haven't met yet. > > I usually run a bunch of manual tests with devmem2 to ensure the edge cases > are handled > correctly, but I've not really seen any errors that didn't also show up in > running > stressors (e.g. stressng) or just memhog on the memory. Hi Yao, If you have time, please spin a v2 that also sets the relevant flag to say the QEMU emulation supports this interleave. Whilst we test the kernel fixes, we can just drop that patch but longer term I'm fine with having this support in general in the QEMU emulation - so I won't queue it up as a fix, but instead as a feature.
Thanks, Jonathan > > Jonathan > > > > > > > Jonathan > > > > > > > > > > --- > > > > hw/mem/cxl_type3.c | 15 +++ > > > > 1 file changed, 11 insertions(+), 4 deletions(-) > > > > > > > > diff --git a/hw/mem/cxl_type3.c b/hw/mem/cxl_type3.c > > > > index b0a7e9f11b..2c1218fb12 100644 > > > > --- a/hw/mem/cxl_type3.c > > > > +++ b/hw/mem/cxl_type3.c > > > > @@ -805,10 +805,17 @@
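The two decode branches quoted in this thread can be modeled standalone (this is a sketch of the spec formulas, not the QEMU patch itself; `ig` and `iw` are the encoded granularity and ways fields, so the granularity is 2^(ig+8) bytes and `iw >= 8` encodes the 3/6/12-way cases):

```c
#include <stdint.h>

/* Translate an HPA offset within a decoder's range to a DPA.
 * b1 (iw < 8):  DPAOffset[51:IG+8] = HPAOffset[51:IG+8+IW]
 * b2 (iw >= 8): DPAOffset[51:IG+8] = HPAOffset[51:IG+IW] / 3
 * Both:         DPAOffset[IG+7:0]  = HPAOffset[IG+7:0]
 */
static uint64_t hpa_offset_to_dpa(uint64_t hpa_offset, unsigned ig,
                                  unsigned iw, uint64_t dpa_base)
{
    uint64_t low = hpa_offset & ((1ULL << (ig + 8)) - 1);
    uint64_t high;

    if (iw < 8) {
        /* b1: power-of-2 ways - drop the iw target-select bits */
        high = hpa_offset >> (ig + 8 + iw);
    } else {
        /* b2: 3/6/12 ways - divide the remaining granules by 3 */
        high = (hpa_offset >> (ig + iw)) / 3;
    }
    return dpa_base + ((high << (ig + 8)) | low);
}
```

With `ig = 0` (256-byte granules) and `iw = 8` (3-way), HPA granule 3 lands back on the first target at its second granule, which is the kind of edge condition the devmem2 tests mentioned above poke at.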
Re: [PATCH v11 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info
On Fri, 5 Apr 2024 00:07:06 + "Ho-Ren (Jack) Chuang" wrote: > The current implementation treats emulated memory devices, such as > CXL1.1 type3 memory, as normal DRAM when they are emulated as normal memory > (E820_TYPE_RAM). However, these emulated devices have different > characteristics than traditional DRAM, making it important to > distinguish them. Thus, we modify the tiered memory initialization process > to introduce a delay specifically for CPUless NUMA nodes. This delay > ensures that the memory tier initialization for these nodes is deferred > until HMAT information is obtained during the boot process. Finally, > demotion tables are recalculated at the end. > > * late_initcall(memory_tier_late_init); > Some device drivers may have initialized memory tiers between > `memory_tier_init()` and `memory_tier_late_init()`, potentially bringing > online memory nodes and configuring memory tiers. They should be excluded > in the late init. > > * Handle cases where there is no HMAT when creating memory tiers > There is a scenario where a CPUless node does not provide HMAT information. > If no HMAT is specified, it falls back to using the default DRAM tier. > > * Introduce another new lock `default_dram_perf_lock` for adist calculation > In the current implementation, iterating through CPUlist nodes requires > holding the `memory_tier_lock`. However, `mt_calc_adistance()` will end up > trying to acquire the same lock, leading to a potential deadlock. > Therefore, we propose introducing a standalone `default_dram_perf_lock` to > protect `default_dram_perf_*`. This approach not only avoids deadlock > but also prevents holding a large lock simultaneously. > > * Upgrade `set_node_memory_tier` to support additional cases, including > default DRAM, late CPUless, and hot-plugged initializations. 
> To cover hot-plugged memory nodes, `mt_calc_adistance()` and > `mt_find_alloc_memory_type()` are moved into `set_node_memory_tier()` to > handle cases where memtype is not initialized and where HMAT information is > available. > > * Introduce `default_memory_types` for those memory types that are not > initialized by device drivers. > Because late initialized memory and default DRAM memory need to be managed, > a default memory type is created for storing all memory types that are > not initialized by device drivers and as a fallback. > > Signed-off-by: Ho-Ren (Jack) Chuang > Signed-off-by: Hao Xiang > Reviewed-by: "Huang, Ying" Hi - one remaining question. Why can't we delay init for all nodes to either drivers or your fallback late_initcall code. It would be nice to reduce possible code paths. Jonathan > --- > mm/memory-tiers.c | 94 +++ > 1 file changed, 70 insertions(+), 24 deletions(-) > > diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c > index 516b144fd45a..6632102bd5c9 100644 > --- a/mm/memory-tiers.c > +++ b/mm/memory-tiers.c > @@ -855,7 +892,8 @@ static int __init memory_tier_init(void) >* For now we can have 4 faster memory tiers with smaller adistance >* than default DRAM tier. >*/ > - default_dram_type = alloc_memory_type(MEMTIER_ADISTANCE_DRAM); > + default_dram_type = mt_find_alloc_memory_type(MEMTIER_ADISTANCE_DRAM, > + _memory_types); > if (IS_ERR(default_dram_type)) > panic("%s() failed to allocate default DRAM tier\n", __func__); > > @@ -865,6 +903,14 @@ static int __init memory_tier_init(void) >* types assigned. >*/ > for_each_node_state(node, N_MEMORY) { > + if (!node_state(node, N_CPU)) > + /* > + * Defer memory tier initialization on > + * CPUless numa nodes. These will be initialized > + * after firmware and devices are initialized. Could the comment also say why we can't defer them all? (In an odd coincidence we have a similar issue for some CPU hotplug related bring up where review feedback was move all cases later). 
> + */ > + continue; > + > memtier = set_node_memory_tier(node); > if (IS_ERR(memtier)) > /*
Re: [PATCH v11 1/2] memory tier: dax/kmem: introduce an abstract layer for finding, allocating, and putting memory types
On Fri, 5 Apr 2024 00:07:05 + "Ho-Ren (Jack) Chuang" wrote: > Since different memory devices require finding, allocating, and putting > memory types, these common steps are abstracted in this patch, > enhancing the scalability and conciseness of the code. > > Signed-off-by: Ho-Ren (Jack) Chuang > Reviewed-by: "Huang, Ying" Reviewed-by: Jonathan Cameron
Re: [PATCH v6 12/12] hw/mem/cxl_type3: Allow to release extent superset in QMP interface
On Mon, 25 Mar 2024 12:02:30 -0700 nifan@gmail.com wrote: > From: Fan Ni > > Before the change, the QMP interface used for add/release DC extents > only allows releasing an extent whose DPA range is contained by a single > accepted extent in the device. > > With the change, we relax the constraints. As long as the DPA range of > the extent is covered by accepted extents, we allow the release. > > Signed-off-by: Fan Ni Nice. Reviewed-by: Jonathan Cameron > --- > hw/mem/cxl_type3.c | 3 +-- > 1 file changed, 1 insertion(+), 2 deletions(-) > > diff --git a/hw/mem/cxl_type3.c b/hw/mem/cxl_type3.c > index 2628a6f50f..62c2022477 100644 > --- a/hw/mem/cxl_type3.c > +++ b/hw/mem/cxl_type3.c > @@ -1935,8 +1935,7 @@ static void qmp_cxl_process_dynamic_capacity(const char > *path, CxlEventLog log, > "cannot release extent with pending DPA range"); > return; > } > -if (!cxl_extents_contains_dpa_range(>dc.extents, > -dpa, len)) { > +if (!ct3_test_region_block_backed(dcd, dpa, len)) { > error_setg(errp, > "cannot release extent with non-existing DPA > range"); > return;
Re: [PATCH v6 11/12] hw/cxl/cxl-mailbox-utils: Add superset extent release mailbox support
On Mon, 25 Mar 2024 12:02:29 -0700 nifan@gmail.com wrote: > From: Fan Ni > > With the change, we extend the extent release mailbox command processing > to allow more flexible release. As long as the DPA range of the extent to > release is covered by accepted extent(s) in the device, the release can be > performed. > > Signed-off-by: Fan Ni Nothing to add from me. Nice and simple which is great. Jonathan
Re: [PATCH v6 10/12] hw/mem/cxl_type3: Add dpa range validation for accesses to DC regions
On Mon, 25 Mar 2024 12:02:28 -0700 nifan@gmail.com wrote: > From: Fan Ni > > All dpa ranges in the DC regions are invalid to access until an extent Let's be more consistent in commit logs and use DPA, DC, HPA etc. in all caps. It's a bit of a mixture in this series at the moment. > covering the range has been added. I'd expand that to 'has been successfully accepted by the host.' > Add a bitmap for each region to > record whether a DC block in the region has been backed by DC extent. > For the bitmap, a bit in the bitmap represents a DC block. When a DC > extent is added, all the bits of the blocks in the extent will be set, > which will be cleared when the extent is released. > > Reviewed-by: Jonathan Cameron > Signed-off-by: Fan Ni
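The per-region block bitmap this commit describes can be sketched in isolation like so — a toy with a 64-block region in a single word (the real code uses bitmap helpers over QEMU's region structures, and the block size is per-region, not a constant):

```c
#include <stdbool.h>
#include <stdint.h>

#define DC_BLOCK_SIZE (2ULL << 20)   /* example 2 MiB DC block size */

/* Mark the blocks covered by an extent as backed (on accept) or not
 * backed (on release).  One bit per DC block; 64 blocks in this toy. */
static void set_extent_backed(uint64_t *bitmap, uint64_t dpa_offset,
                              uint64_t len, bool backed)
{
    uint64_t first = dpa_offset / DC_BLOCK_SIZE;
    uint64_t nr = len / DC_BLOCK_SIZE;

    for (uint64_t i = first; i < first + nr; i++) {
        if (backed) {
            *bitmap |= 1ULL << i;
        } else {
            *bitmap &= ~(1ULL << i);
        }
    }
}

/* An access (or release request) is only valid when every block in the
 * DPA range is currently backed by an accepted extent. */
static bool range_backed(uint64_t bitmap, uint64_t dpa_offset, uint64_t len)
{
    uint64_t first = dpa_offset / DC_BLOCK_SIZE;
    uint64_t nr = len / DC_BLOCK_SIZE;

    for (uint64_t i = first; i < first + nr; i++) {
        if (!(bitmap & (1ULL << i))) {
            return false;
        }
    }
    return true;
}
```

The same structure also supports the superset-release relaxation later in the series: a release is allowed whenever `range_backed()` holds for the whole requested range, regardless of how many accepted extents contributed the bits.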
Re: [PATCH v6 09/12] hw/cxl/events: Add qmp interfaces to add/release dynamic capacity extents
On Wed, 3 Apr 2024 14:16:25 -0400 Gregory Price wrote: A few follow up comments. > On Mon, Mar 25, 2024 at 12:02:27PM -0700, nifan@gmail.com wrote: > > From: Fan Ni > > > > To simulate FM functionalities for initiating Dynamic Capacity Add > > (Opcode 5604h) and Dynamic Capacity Release (Opcode 5605h) as in CXL spec > > r3.1 7.6.7.6.5 and 7.6.7.6.6, we implemented two QMP interfaces to issue > > add/release dynamic capacity extents requests. > > > ... snip > > + > > +/* > > + * The main function to process dynamic capacity event. Currently DC > > extents > > + * add/release requests are processed. > > + */ > > +static void qmp_cxl_process_dynamic_capacity(const char *path, CxlEventLog > > log, > > + CXLDCEventType type, uint16_t > > hid, > > + uint8_t rid, > > + CXLDCExtentRecordList > > *records, > > + Error **errp) > > +{ > ... snip > > +/* Sanity check and count the extents */ > > +list = records; > > +while (list) { > > +offset = list->value->offset; > > +len = list->value->len; > > +dpa = offset + dcd->dc.regions[rid].base; > > + > > +if (len == 0) { > > +error_setg(errp, "extent with 0 length is not allowed"); > > +return; > > +} > > + > > +if (offset % block_size || len % block_size) { > > +error_setg(errp, "dpa or len is not aligned to region block > > size"); > > +return; > > +} > > + > > +if (offset + len > dcd->dc.regions[rid].len) { > > +error_setg(errp, "extent range is beyond the region end"); > > +return; > > +} > > + > > +/* No duplicate or overlapped extents are allowed */ > > +if (test_any_bits_set(blk_bitmap, offset / block_size, > > + len / block_size)) { > > +error_setg(errp, "duplicate or overlapped extents are > > detected"); > > +return; > > +} > > +bitmap_set(blk_bitmap, offset / block_size, len / block_size); > > + > > +num_extents++; > > I think num_extents is always equal to the length of the list, otherwise > this code will return with error. 
> > Nitpick: > This can be moved to the bottom w/ `list = list->next` to express that a > little more clearly. > > > +if (type == DC_EVENT_RELEASE_CAPACITY) { > > +if (cxl_extents_overlaps_dpa_range(&dcd->dc.extents_pending, > > + dpa, len)) { > > +error_setg(errp, > > + "cannot release extent with pending DPA range"); > > +return; > > +} > > +if (!cxl_extents_contains_dpa_range(&dcd->dc.extents, > > +dpa, len)) { > > +error_setg(errp, > > + "cannot release extent with non-existing DPA > > range"); > > +return; > > +} > > +} > > +list = list->next; > > +} > > + > > +if (num_extents == 0) { > > Since num_extents is always the length of the list, this is equivalent to > `if (!records)` prior to the while loop. Makes it a little more clear that: > > 1. There must be at least 1 extent > 2. All extents must be valid for the command to be serviced. Agreed. > > > +error_setg(errp, "no valid extents to send to process"); > > +return; > > +} > > + > > I'm looking at adding the MHD extensions around this point, e.g.: > > /* If MHD cannot allocate requested extents, the cmd fails */ > if (type == DC_EVENT_ADD_CAPACITY && dcd->mhd_dcd_extents_allocate && > num_extents != dcd->mhd_dcd_extents_allocate(...)) > return; > > where mhd_dcd_extents_allocate checks the MHD block bitmap and tags > for correctness (shared // no double-allocations, etc). On success, > it guarantees proper ownership. > > the release path would then be done in the release response path from > the host, as opposed to the release event injection. I think it would be polite to check the QMP command on release for whether it is asking something plausible - makes for an easier to use QMP interface. I guess it's not strictly required though. What races are there on release? We aren't supporting force release for now, and for anything else, it's host specific (unlike add where the extra rules kick in).
As such, I 'think' a check at command time will be valid as long as the host hasn't done an async release of capacity between that and the event record. That is a race we always have and the host should at most log it and not release capacity twice. > > Do you see any issues with that flow? > > > +/* Create extent list for event being passed to host */ > > +i = 0; > > +list = records; > > +extents =
Re: [PATCH v6 09/12] hw/cxl/events: Add qmp interfaces to add/release dynamic capacity extents
On Mon, 25 Mar 2024 12:02:27 -0700 nifan@gmail.com wrote: > From: Fan Ni > > To simulate FM functionalities for initiating Dynamic Capacity Add > (Opcode 5604h) and Dynamic Capacity Release (Opcode 5605h) as in CXL spec > r3.1 7.6.7.6.5 and 7.6.7.6.6, we implemented two QMP interfaces to issue > add/release dynamic capacity extents requests. > > With the change, we allow to release an extent only when its DPA range > is contained by a single accepted extent in the device. That is to say, > extent superset release is not supported yet. > > 1. Add dynamic capacity extents: > > For example, the command to add two continuous extents (each 128MiB long) > to region 0 (starting at DPA offset 0) looks like below: > > { "execute": "qmp_capabilities" } > > { "execute": "cxl-add-dynamic-capacity", > "arguments": { > "path": "/machine/peripheral/cxl-dcd0", > "region-id": 0, > "extents": [ > { > "offset": 0, > "len": 134217728 > }, > { > "offset": 134217728, > "len": 134217728 > } Hi Fan, I talk more on this inline, but to me this interface takes multiple extents so that we can treat them as a single 'offer' of capacity. That is they should be linked in the event log with the more flag and the host should have to handle them in one go (I known Ira and Navneet's code doesn't handle this yet, but that doesn't mean QEMU shouldn't). Alternative for now would be to only support a single entry. Keep the interface defined to take multiple entries but reject it at runtime. I don't want to end up with a more complex interface in the end just because we allowed this form to not set the MORE flag today. We will need this to do tagged handling and ultimately sharing, so good to get it right from the start. For tagged handling I think the right option is to have the tag alongside region-id not in the individual extents. That way the interface is naturally used to generate the right description to the host. > ] > } > } > > 2. 
Release dynamic capacity extents: > > For example, the command to release an extent of size 128MiB from region 0 > (DPA offset 128MiB) looks like below: > > { "execute": "cxl-release-dynamic-capacity", > "arguments": { > "path": "/machine/peripheral/cxl-dcd0", > "region-id": 0, > "extents": [ > { > "offset": 134217728, > "len": 134217728 > } > ] > } > } > > Signed-off-by: Fan Ni > /* to-be-added range should not overlap with range already accepted > */ > QTAILQ_FOREACH(ent, &ct3d->dc.extents, node) { > @@ -1585,9 +1586,13 @@ static CXLRetCode cmd_dcd_add_dyn_cap_rsp(const struct > cxl_cmd *cmd, > CXLDCExtentList *extent_list = &ct3d->dc.extents; > uint32_t i; > uint64_t dpa, len; > +CXLDCExtent *ent; > CXLRetCode ret; > > if (in->num_entries_updated == 0) { > +/* Always remove the first pending extent when response received. */ > +ent = QTAILQ_FIRST(&ct3d->dc.extents_pending); > +cxl_remove_extent_from_extent_list(&ct3d->dc.extents_pending, ent); > return CXL_MBOX_SUCCESS; > } > > @@ -1604,6 +1609,8 @@ static CXLRetCode cmd_dcd_add_dyn_cap_rsp(const struct > cxl_cmd *cmd, > > ret = cxl_dcd_add_dyn_cap_rsp_dry_run(ct3d, in); > if (ret != CXL_MBOX_SUCCESS) { > +ent = QTAILQ_FIRST(&ct3d->dc.extents_pending); > +cxl_remove_extent_from_extent_list(&ct3d->dc.extents_pending, ent); Ah this deals with the todo I suggest you add to the earlier patch. I'd not mind so much if you hadn't been so thorough on other todo notes ;) Add one in the earlier patch and get rid of it here like you do below. However as I note below I think we need to handle these as groups of extents not single extents. That way we keep an 'offered' set offered at the same time by a single command (and expose to host using the more flag) together and reject them en masse.
> return ret; > } > > @@ -1613,10 +1620,9 @@ static CXLRetCode cmd_dcd_add_dyn_cap_rsp(const struct > cxl_cmd *cmd, > > cxl_insert_extent_to_extent_list(extent_list, dpa, len, NULL, 0); > ct3d->dc.total_extent_count += 1; > -/* > - * TODO: we will add a pending extent list based on event log record > - * and process the list according here. > - */ > + > +ent = QTAILQ_FIRST(&ct3d->dc.extents_pending); > +cxl_remove_extent_from_extent_list(&ct3d->dc.extents_pending, ent); > } > > return CXL_MBOX_SUCCESS; > diff --git a/hw/mem/cxl_type3.c b/hw/mem/cxl_type3.c > index 951bd79a82..74cb64e843 100644 > --- a/hw/mem/cxl_type3.c > +++ b/hw/mem/cxl_type3.c > > static bool cxl_setup_memory(CXLType3Dev *ct3d, Error **errp) > @@ -1449,7 +1454,8 @@ static int ct3d_qmp_cxl_event_log_enc(CxlEventLog log) > return CXL_EVENT_TYPE_FAIL; > case CXL_EVENT_LOG_FATAL: > return CXL_EVENT_TYPE_FATAL; > -/* DCD not yet supported */
Re: [PATCH v6 08/12] hw/cxl/cxl-mailbox-utils: Add mailbox commands to support add/release dynamic capacity response
On Mon, 25 Mar 2024 12:02:26 -0700 nifan@gmail.com wrote: > From: Fan Ni > > Per CXL spec 3.1, two mailbox commands are implemented: > Add Dynamic Capacity Response (Opcode 4802h) 8.2.9.9.9.3, and > Release Dynamic Capacity (Opcode 4803h) 8.2.9.9.9.4. > > For the process of the above two commands, we use two-pass approach. > Pass 1: Check whether the input payload is valid or not; if not, skip > Pass 2 and return mailbox process error. > Pass 2: Do the real work--add or release extents, respectively. > > Signed-off-by: Fan Ni A few additional comments from me. Jonathan > +/* > + * For the extents in the extent list to operate, check whether they are > valid > + * 1. The extent should be in the range of a valid DC region; > + * 2. The extent should not cross multiple regions; > + * 3. The start DPA and the length of the extent should align with the block > + * size of the region; > + * 4. The address range of multiple extents in the list should not overlap. > + */ > +static CXLRetCode cxl_detect_malformed_extent_list(CXLType3Dev *ct3d, > +const CXLUpdateDCExtentListInPl *in) > +{ > +uint64_t min_block_size = UINT64_MAX; > +CXLDCRegion *region = &ct3d->dc.regions[0]; This is immediately overwritten if num_regions != 0 (which I think is checked before calling this function), so no need to initialize it.
> +CXLDCRegion *lastregion = &ct3d->dc.regions[ct3d->dc.num_regions - 1]; > +g_autofree unsigned long *blk_bitmap = NULL; > +uint64_t dpa, len; > +uint32_t i; > + > +for (i = 0; i < ct3d->dc.num_regions; i++) { > +region = &ct3d->dc.regions[i]; > +min_block_size = MIN(min_block_size, region->block_size); > +} > + > +blk_bitmap = bitmap_new((lastregion->base + lastregion->len - > + ct3d->dc.regions[0].base) / min_block_size); > + > +for (i = 0; i < in->num_entries_updated; i++) { > +dpa = in->updated_entries[i].start_dpa; > +len = in->updated_entries[i].len; > + > +region = cxl_find_dc_region(ct3d, dpa, len); > +if (!region) { > +return CXL_MBOX_INVALID_PA; > +} > + > +dpa -= ct3d->dc.regions[0].base; > +if (dpa % region->block_size || len % region->block_size) { > +return CXL_MBOX_INVALID_EXTENT_LIST; > +} > +/* the dpa range already covered by some other extents in the list */ > +if (test_any_bits_set(blk_bitmap, dpa / min_block_size, > +len / min_block_size)) { > +return CXL_MBOX_INVALID_EXTENT_LIST; > +} > +bitmap_set(blk_bitmap, dpa / min_block_size, len / min_block_size); > + } > + > +return CXL_MBOX_SUCCESS; > +} > +/* > + * CXL r3.1 section 8.2.9.9.9.3: Add Dynamic Capacity Response (Opcode 4802h) > + * An extent is added to the extent list and becomes usable only after the > + * response is processed successfully > + */ > +static CXLRetCode cmd_dcd_add_dyn_cap_rsp(const struct cxl_cmd *cmd, > + uint8_t *payload_in, > + size_t len_in, > + uint8_t *payload_out, > + size_t *len_out, > + CXLCCI *cci) > +{ > +CXLUpdateDCExtentListInPl *in = (void *)payload_in; > +CXLType3Dev *ct3d = CXL_TYPE3(cci->d); > +CXLDCExtentList *extent_list = &ct3d->dc.extents; > +uint32_t i; > +uint64_t dpa, len; > +CXLRetCode ret; > + > +if (in->num_entries_updated == 0) { > +return CXL_MBOX_SUCCESS; > +} A zero length response is a rejection of an offered set of extents. Probably want a todo here to say this will wipe out part of the pending list (similar to the one you have below).
> + > +/* Adding extents causes exceeding device's extent tracking ability. */ > +if (in->num_entries_updated + ct3d->dc.total_extent_count > > +CXL_NUM_EXTENTS_SUPPORTED) { > +return CXL_MBOX_RESOURCES_EXHAUSTED; > +} > + > +ret = cxl_detect_malformed_extent_list(ct3d, in); > +if (ret != CXL_MBOX_SUCCESS) { > +return ret; > +} > + > +ret = cxl_dcd_add_dyn_cap_rsp_dry_run(ct3d, in); > +if (ret != CXL_MBOX_SUCCESS) { > +return ret; > +} > + > +for (i = 0; i < in->num_entries_updated; i++) { > +dpa = in->updated_entries[i].start_dpa; > +len = in->updated_entries[i].len; > + > +cxl_insert_extent_to_extent_list(extent_list, dpa, len, NULL, 0); > +ct3d->dc.total_extent_count += 1; > +/* > + * TODO: we will add a pending extent list based on event log record > + * and process the list according here. > + */ > +} > + > +return CXL_MBOX_SUCCESS; > +} > +static CXLRetCode cxl_dc_extent_release_dry_run(CXLType3Dev *ct3d, > +const CXLUpdateDCExtentListInPl *in) > +{ > +CXLDCExtent *ent, *ent_next; > +uint64_t dpa,
Re: [PATCH v6 08/12] hw/cxl/cxl-mailbox-utils: Add mailbox commands to support add/release dynamic capacity response
On Thu, 4 Apr 2024 13:32:23 + Jørgen Hansen wrote: Hi Jørgen, > > +static CXLRetCode cmd_dcd_add_dyn_cap_rsp(const struct cxl_cmd *cmd, > > + uint8_t *payload_in, > > + size_t len_in, > > + uint8_t *payload_out, > > + size_t *len_out, > > + CXLCCI *cci) > > +{ > > +CXLUpdateDCExtentListInPl *in = (void *)payload_in; > > +CXLType3Dev *ct3d = CXL_TYPE3(cci->d); > > +CXLDCExtentList *extent_list = >dc.extents; > > +uint32_t i; > > +uint64_t dpa, len; > > +CXLRetCode ret; > > + > > +if (in->num_entries_updated == 0) { > > +return CXL_MBOX_SUCCESS; > > +} > > The mailbox processing in patch 2 converts from le explicitly, whereas > the mailbox commands here don't. Looking at the existing mailbox > commands, convertion doesn't seem to be rigorously applied, so maybe > that is OK? The early CXL code didn't take this into account much at all. We've sort of been fixing stuff up as we happen to be working on it. Hence some stuff is big endian safe and some not :( Patches welcome, but it would be good to not introduce more cases that need fixing when we eventually clean them all up (and have a big endian test platform to see if we got it right!) Jonathan
Re: [PATCH v6 07/12] hw/mem/cxl_type3: Add DC extent list representative and get DC extent list mailbox support
On Mon, 25 Mar 2024 12:02:25 -0700 nifan@gmail.com wrote: > From: Fan Ni > > Add dynamic capacity extent list representative to the definition of > CXLType3Dev and implement get DC extent list mailbox command per > CXL.spec.3.1:.8.2.9.9.9.2. > > Signed-off-by: Fan Ni One really minor comment inline. Reviewed-by: Jonathan Cameron > > +/* > + * CXL r3.1 section 8.2.9.9.9.2: > + * Get Dynamic Capacity Extent List (Opcode 4801h) > + */ > +static CXLRetCode cmd_dcd_get_dyn_cap_ext_list(const struct cxl_cmd *cmd, > + uint8_t *payload_in, > + size_t len_in, > + uint8_t *payload_out, > + size_t *len_out, > + CXLCCI *cci) > +{ > +CXLType3Dev *ct3d = CXL_TYPE3(cci->d); > +struct { > +uint32_t extent_cnt; > +uint32_t start_extent_id; > +} QEMU_PACKED *in = (void *)payload_in; > +struct { > +uint32_t count; > +uint32_t total_extents; > +uint32_t generation_num; > +uint8_t rsvd[4]; > +CXLDCExtentRaw records[]; > +} QEMU_PACKED *out = (void *)payload_out; > +uint32_t start_extent_id = in->start_extent_id; > +CXLDCExtentList *extent_list = &ct3d->dc.extents; > +uint16_t record_count = 0, i = 0, record_done = 0; > +uint16_t out_pl_len, size; > +CXLDCExtent *ent; > + > +if (start_extent_id > ct3d->dc.total_extent_count) { > +return CXL_MBOX_INVALID_INPUT; > +} > + > +record_count = MIN(in->extent_cnt, > + ct3d->dc.total_extent_count - start_extent_id); > +size = CXL_MAILBOX_MAX_PAYLOAD_SIZE - sizeof(*out); > +if (size / sizeof(out->records[0]) < record_count) { > +record_count = size / sizeof(out->records[0]); > +} Could use another min for this I think?
record_count = MIN(record_count, size / sizeof(out->records[0])); > +out_pl_len = sizeof(*out) + record_count * sizeof(out->records[0]); > + > +stl_le_p(&out->count, record_count); > +stl_le_p(&out->total_extents, ct3d->dc.total_extent_count); > +stl_le_p(&out->generation_num, ct3d->dc.ext_list_gen_seq); > + > +if (record_count > 0) { > +CXLDCExtentRaw *out_rec = &out->records[record_done]; > + > +QTAILQ_FOREACH(ent, extent_list, node) { > +if (i++ < start_extent_id) { > +continue; > +} > +stq_le_p(&out_rec->start_dpa, ent->start_dpa); > +stq_le_p(&out_rec->len, ent->len); > +memcpy(&out_rec->tag, ent->tag, 0x10); > +stw_le_p(&out_rec->shared_seq, ent->shared_seq); > + > +record_done++; > +if (record_done == record_count) { > +break; > +} > +} > +} > + > +*len_out = out_pl_len; > +return CXL_MBOX_SUCCESS; > +}
Re: [PATCH v6 06/12] hw/mem/cxl_type3: Add host backend and address space handling for DC regions
On Mon, 25 Mar 2024 12:02:24 -0700 nifan@gmail.com wrote: > From: Fan Ni > > Add (file/memory backed) host backend, all the dynamic capacity regions > will share a single, large enough host backend. This doesn't parse. I suggest splitting it into 2 sentences: Add (file/memory backed) host backend for DCD. All the dynamic capacity regions will share a single, large enough host backend. > Set up address space for > DC regions to support read/write operations to dynamic capacity for DCD. > > With the change, following supports are added: Oddity of English wrt plurals. With this change, the following support is added. > 1. Add a new property to type3 device "volatile-dc-memdev" to point to host >memory backend for dynamic capacity. Currently, all dc regions share one >host backend. > 2. Add namespace for dynamic capacity for read/write support; > 3. Create cdat entries for each dynamic capacity region; > > Signed-off-by: Fan Ni All comments trivial with the exception of the one about setting the size of the range registers. For now I think just set the flags and we will deal with whatever output we get from the consortium in the long run. With that tweaked.
Reviewed-by: Jonathan Cameron > --- > hw/cxl/cxl-mailbox-utils.c | 16 ++- > hw/mem/cxl_type3.c | 187 +--- > include/hw/cxl/cxl_device.h | 8 ++ > 3 files changed, 172 insertions(+), 39 deletions(-) > > diff --git a/hw/cxl/cxl-mailbox-utils.c b/hw/cxl/cxl-mailbox-utils.c > index 0f2ad58a14..831cef0567 100644 > --- a/hw/cxl/cxl-mailbox-utils.c > +++ b/hw/cxl/cxl-mailbox-utils.c > @@ -622,7 +622,8 @@ static CXLRetCode cmd_firmware_update_get_info(const > struct cxl_cmd *cmd, > diff --git a/hw/mem/cxl_type3.c b/hw/mem/cxl_type3.c > index a9e8bdc436..75ea9b20e1 100644 > --- a/hw/mem/cxl_type3.c > +++ b/hw/mem/cxl_type3.c > @@ -45,7 +45,8 @@ enum { > +if (dc_mr) { > +int i; > +uint64_t region_base = vmr_size + pmr_size; > + > +/* > + * TODO: we assume the dynamic capacity to be volatile for now, > + * non-volatile dynamic capacity will be added if needed in the > + * future. Trivial but I'd make that 2 sentences with a full stop after "now". > assert(len == cur_ent); > > *cdat_table = g_steal_pointer(); > @@ -300,11 +336,24 @@ static void build_dvsecs(CXLType3Dev *ct3d) > range2_size_hi = ct3d->hostpmem->size >> 32; > range2_size_lo = (2 << 5) | (2 << 2) | 0x3 | > (ct3d->hostpmem->size & 0xF000); > +} else if (ct3d->dc.host_dc) { > +range2_size_hi = ct3d->dc.host_dc->size >> 32; > +range2_size_lo = (2 << 5) | (2 << 2) | 0x3 | > + (ct3d->dc.host_dc->size & 0xF000); > } > -} else { > +} else if (ct3d->hostpmem) { > range1_size_hi = ct3d->hostpmem->size >> 32; > range1_size_lo = (2 << 5) | (2 << 2) | 0x3 | > (ct3d->hostpmem->size & 0xF000); > +if (ct3d->dc.host_dc) { > +range2_size_hi = ct3d->dc.host_dc->size >> 32; > +range2_size_lo = (2 << 5) | (2 << 2) | 0x3 | > + (ct3d->dc.host_dc->size & 0xF000); > +} > +} else { > +range1_size_hi = ct3d->dc.host_dc->size >> 32; > +range1_size_lo = (2 << 5) | (2 << 2) | 0x3 | > + (ct3d->dc.host_dc->size & 0xF000); > } As per your cover letter this is a work around for an ambiguity in the spec and what Linux is currently doing with. 
However as per the call the other day, Linux only checks the flags. So I'd set those only and not the size field. We may have to deal with spec errata later, but I don't want to block this series on the corner case in the meantime. Given complexity of DC we'll be waiting for ever if we have to get all clarifications before we land anything! (Quick though those nice folk in the CXL consortium working groups are :)) > @@ -679,9 +746,41 @@ static bool cxl_setup_memory(CXLType3Dev *ct3d, Error > **errp) > g_free(p_name); > } > > -if (!cxl_create_dc_regions(ct3d, errp)) { > -error_setg(errp, "setup DC regions failed"); > -return false; > +ct3d->dc.total_capacity = 0; > +if (ct3d->dc.num_regions) { Trivial suggestion. As dc.num_regions already existed from patch 4, maybe it's worth pushing this if statement back there? It will be harmless short
Re: [RFC PATCH v2 3/6] cxl/core: add report option for cxl_mem_get_poison()
On Wed, 3 Apr 2024 22:56:58 +0800 Shiyang Ruan wrote: > 在 2024/3/30 9:50, Dan Williams 写道: > > Shiyang Ruan wrote: > >> The GMER only has "Physical Address" field, no such one indicates length. > >> So, when a poison event is received, we could use GET_POISON_LIST command > >> to get the poison list. Now driver has cxl_mem_get_poison(), so > >> reuse it and add a parameter 'bool report', report poison record to MCE > >> if set true. > > > > I am not sure I agree with the rationale here because there is no > > correlation between the event being signaled and the current state of > > the poison list. It also establishes race between multiple GMER events, > > i.e. imagine the hardware sends 4 GMER events to communicate a 256B > > poison discovery event. Does the driver need logic to support GMER event > > 2, 3, and 4 if it already say all 256B of poison after processing GMER > > event 1? > > Yes, I didn't thought about that. > > > > > I think the best the driver can do is assume at least 64B of poison > > per-event and depend on multiple notifications to handle larger poison > > lengths. > > Agree. This also makes things easier. > > And for qemu, I'm thinking of making a patch to limit the length of a > poison record when injecting. The length should between 64B to 4KiB per > GMER. And emit many GMERs if length > 4KiB. I'm not keen on such a restriction in QEMU. QEMU is injecting lengths allowed by the specification. That facility is useful for testing the kernel and the QEMU modeling should not be based on what the kernel supports. When you said this I wondered if we had a clever implementation that fused entries in the list, but we don't (I thought about doing so a long time ago but seems I never bothered :) So if you are using QEMU for testing and you don't want to exceed the kernel supported poison lengths, don't inject poison that big. Jonathan > > > > > Otherwise, the poison list is really only useful for pre-populating > > pages to offline after a reboot, i.e. 
to catch the kernel up with the > > state of poison pages after a reboot. > > Got it. > > > -- > Thanks, > Ruan.
Re: [PATCH v10 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info
> > > @@ -858,7 +910,8 @@ static int __init memory_tier_init(void) > > >* For now we can have 4 faster memory tiers with smaller adistance > > >* than default DRAM tier. > > >*/ > > > - default_dram_type = alloc_memory_type(MEMTIER_ADISTANCE_DRAM); > > > + default_dram_type = > > > mt_find_alloc_memory_type(MEMTIER_ADISTANCE_DRAM, > > > + > > > &default_memory_types); > > Unusual indenting. Align with just after ( > > > > Aligning with "(" will exceed 100 columns. Would that be acceptable? I think we are talking at cross purposes. default_dram_type = mt_find_alloc_memory_type(MEMTIER_ADISTANCE_DRAM, &default_memory_types); is what I was suggesting. > > > > if (IS_ERR(default_dram_type)) > > > panic("%s() failed to allocate default DRAM tier\n", > > > __func__); > > > > > > @@ -868,6 +921,14 @@ static int __init memory_tier_init(void) > > >* types assigned. > > >*/ > > > for_each_node_state(node, N_MEMORY) { > > > + if (!node_state(node, N_CPU)) > > > + /* > > > + * Defer memory tier initialization on CPUless numa > > > nodes. > > > + * These will be initialized after firmware and > > > devices are > > > > I think this wraps at just over 80 chars. Seems silly to wrap so tightly > > and not > > quite fit under 80. (this is about 83 chars.) > > > > I can fix this. > I have a question. From my patch, this is <80 chars. However, > in an email, this is >80 chars. Does that mean we need to > count the number of chars in an email, not in a patch? Or if I > missed something? like vim configuration or? 3 tabs + 1 space + the text from * (58) = 24 + 1 + 58 = 83. An advantage of using Claws Mail for kernel stuff is it has a nice per-character ruler at the top of the window. I wonder if you have a different tab indent size? The kernel uses 8 characters. It might explain the few other odd indents if perhaps you have it at 4 in your editor? https://www.kernel.org/doc/html/v4.10/process/coding-style.html Jonathan > > > > + * initialized.
> > > + */ > > > + continue; > > > + > > > memtier = set_node_memory_tier(node); > > > if (IS_ERR(memtier)) > > > /* > > > >
Re: [PATCH v10 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info
A few minor comments inline. > diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h > index a44c03c2ba3a..16769552a338 100644 > --- a/include/linux/memory-tiers.h > +++ b/include/linux/memory-tiers.h > @@ -140,12 +140,13 @@ static inline int mt_perf_to_adistance(struct > access_coordinate *perf, int *adis > return -EIO; > } > > -struct memory_dev_type *mt_find_alloc_memory_type(int adist, struct > list_head *memory_types) > +static inline struct memory_dev_type *mt_find_alloc_memory_type(int adist, > + struct list_head *memory_types) > { > return NULL; > } > > -void mt_put_memory_types(struct list_head *memory_types) > +static inline void mt_put_memory_types(struct list_head *memory_types) > { Why in this patch and not previous one? > > } > diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c > index 974af10cfdd8..44fa10980d37 100644 > --- a/mm/memory-tiers.c > +++ b/mm/memory-tiers.c > @@ -36,6 +36,11 @@ struct node_memory_type_map { > > static DEFINE_MUTEX(memory_tier_lock); > static LIST_HEAD(memory_tiers); > +/* > + * The list is used to store all memory types that are not created > + * by a device driver. > + */ > +static LIST_HEAD(default_memory_types); > static struct node_memory_type_map node_memory_types[MAX_NUMNODES]; > struct memory_dev_type *default_dram_type; > > @@ -108,6 +113,8 @@ static struct demotion_nodes *node_demotion __read_mostly; > > static BLOCKING_NOTIFIER_HEAD(mt_adistance_algorithms); > > +/* The lock is used to protect `default_dram_perf*` info and nid. 
*/ > +static DEFINE_MUTEX(default_dram_perf_lock); > static bool default_dram_perf_error; > static struct access_coordinate default_dram_perf; > static int default_dram_perf_ref_nid = NUMA_NO_NODE; > @@ -505,7 +512,8 @@ static inline void __init_node_memory_type(int node, > struct memory_dev_type *mem > static struct memory_tier *set_node_memory_tier(int node) > { > struct memory_tier *memtier; > - struct memory_dev_type *memtype; > + struct memory_dev_type *mtype = default_dram_type; Does the rename add anything major to the patch? If not I'd leave it alone to reduce the churn and give a more readable patch. If it is worth doing perhaps a precursor patch? > + int adist = MEMTIER_ADISTANCE_DRAM; > pg_data_t *pgdat = NODE_DATA(node); > > > @@ -514,11 +522,20 @@ static struct memory_tier *set_node_memory_tier(int > node) > if (!node_state(node, N_MEMORY)) > return ERR_PTR(-EINVAL); > > - __init_node_memory_type(node, default_dram_type); > + mt_calc_adistance(node, &adist); > + if (node_memory_types[node].memtype == NULL) { > + mtype = mt_find_alloc_memory_type(adist, &default_memory_types); > + if (IS_ERR(mtype)) { > + mtype = default_dram_type; > + pr_info("Failed to allocate a memory type. Fall > back.\n"); > + } > + } > + > + __init_node_memory_type(node, mtype); > > - memtype = node_memory_types[node].memtype; > - node_set(node, memtype->nodes); > - memtier = find_create_memory_tier(memtype); > + mtype = node_memory_types[node].memtype; > + node_set(node, mtype->nodes); > + memtier = find_create_memory_tier(mtype); > if (!IS_ERR(memtier)) > rcu_assign_pointer(pgdat->memtier, memtier); > return memtier; > @@ -655,6 +672,33 @@ void mt_put_memory_types(struct list_head *memory_types) > } > EXPORT_SYMBOL_GPL(mt_put_memory_types); > > +/* > + * This is invoked via `late_initcall()` to initialize memory tiers for > + * CPU-less memory nodes after driver initialization, which is > + * expected to provide `adistance` algorithms.
> + */ > +static int __init memory_tier_late_init(void) > +{ > + int nid; > + > + mutex_lock(&memory_tier_lock); > + for_each_node_state(nid, N_MEMORY) > + if (node_memory_types[nid].memtype == NULL) > + /* > + * Some device drivers may have initialized memory tiers > + * between `memory_tier_init()` and > `memory_tier_late_init()`, > + * potentially bringing online memory nodes and > + * configuring memory tiers. Exclude them here. > + */ Does the comment refer to this path, or to ones where memtype is set? > + set_node_memory_tier(nid); Given the large comment I would add {} to help with readability. You could flip the logic to reduce indent for_each_node_state(nid, N_MEMORY) { if (node_memory_types[nid].memtype) continue; /* * Some device drivers may have initialized memory tiers * between `memory_tier_init()` and `memory_tier_late_init()`, * potentially bringing online memory nodes and * configuring memory tiers. Exclude them
Re: [PATCH v10 1/2] memory tier: dax/kmem: introduce an abstract layer for finding, allocating, and putting memory types
On Tue, 2 Apr 2024 00:17:37 + "Ho-Ren (Jack) Chuang" wrote: > Since different memory devices require finding, allocating, and putting > memory types, these common steps are abstracted in this patch, > enhancing the scalability and conciseness of the code. > > Signed-off-by: Ho-Ren (Jack) Chuang > Reviewed-by: "Huang, Ying" Hi, I know this is a late entry to the discussion but a few comments inline. (sorry I didn't look earlier!) All opportunities to improve code complexity and readability as a result of your factoring out. Jonathan > --- > drivers/dax/kmem.c | 20 ++-- > include/linux/memory-tiers.h | 13 + > mm/memory-tiers.c| 32 > 3 files changed, 47 insertions(+), 18 deletions(-) > > diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c > index 42ee360cf4e3..01399e5b53b2 100644 > --- a/drivers/dax/kmem.c > +++ b/drivers/dax/kmem.c > @@ -55,21 +55,10 @@ static LIST_HEAD(kmem_memory_types); > > static struct memory_dev_type *kmem_find_alloc_memory_type(int adist) > { > - bool found = false; > struct memory_dev_type *mtype; > > mutex_lock(&kmem_memory_type_lock); could use guard(mutex)(&kmem_memory_type_lock); return mt_find_alloc_memory_type(adist, &kmem_memory_types); I'm fine if you ignore this comment though as there may be other functions in here that could take advantage of the cleanup.h stuff in a future patch.
> - list_for_each_entry(mtype, &kmem_memory_types, list) { > - if (mtype->adistance == adist) { > - found = true; > - break; > - } > - } > - if (!found) { > - mtype = alloc_memory_type(adist); > - if (!IS_ERR(mtype)) > - list_add(&mtype->list, &kmem_memory_types); > - } > + mtype = mt_find_alloc_memory_type(adist, &kmem_memory_types); > mutex_unlock(&kmem_memory_type_lock); > > return mtype; > diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h > index 69e781900082..a44c03c2ba3a 100644 > --- a/include/linux/memory-tiers.h > +++ b/include/linux/memory-tiers.h > @@ -48,6 +48,9 @@ int mt_calc_adistance(int node, int *adist); > int mt_set_default_dram_perf(int nid, struct access_coordinate *perf, >const char *source); > int mt_perf_to_adistance(struct access_coordinate *perf, int *adist); > +struct memory_dev_type *mt_find_alloc_memory_type(int adist, > + struct list_head > *memory_types); That indent looks unusual. Align the start of struct with start of int. > +void mt_put_memory_types(struct list_head *memory_types); > #ifdef CONFIG_MIGRATION > int next_demotion_node(int node); > void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets); > @@ -136,5 +139,15 @@ static inline int mt_perf_to_adistance(struct > access_coordinate *perf, int *adis > { > return -EIO; > } > + > +struct memory_dev_type *mt_find_alloc_memory_type(int adist, struct > list_head *memory_types) > +{ > + return NULL; > +} > + > +void mt_put_memory_types(struct list_head *memory_types) > +{ > + No blank line needed here.
> +} > #endif /* CONFIG_NUMA */ > #endif /* _LINUX_MEMORY_TIERS_H */ > diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c > index 0537664620e5..974af10cfdd8 100644 > --- a/mm/memory-tiers.c > +++ b/mm/memory-tiers.c > @@ -623,6 +623,38 @@ void clear_node_memory_type(int node, struct > memory_dev_type *memtype) > } > EXPORT_SYMBOL_GPL(clear_node_memory_type); > > +struct memory_dev_type *mt_find_alloc_memory_type(int adist, struct > list_head *memory_types) Breaking this out as a separate function provides an opportunity to improve it. Maybe a follow-up patch makes sense given it would no longer be a straightforward code move. However in my view it would be simple enough to be obvious even within this patch. > +{ > + bool found = false; > + struct memory_dev_type *mtype; > + > + list_for_each_entry(mtype, memory_types, list) { > + if (mtype->adistance == adist) { > + found = true; Why not return here? return mtype; > + break; > + } > + } > + if (!found) { If returning above, no need for the found variable - just do this unconditionally. + I suggest you flip the logic for a simpler to follow code flow. It's more code but I think a bit easier to read as error handling is out of the main simple flow. mtype = alloc_memory_type(adist); if (IS_ERR(mtype)) return mtype; list_add(&mtype->list, memory_types); return mtype; > + mtype = alloc_memory_type(adist); > + if (!IS_ERR(mtype)) > + list_add(&mtype->list, memory_types); > + } > + > + return mtype; > +} > +EXPORT_SYMBOL_GPL(mt_find_alloc_memory_type); > + > +void mt_put_memory_types(struct
[PATCH 6/6] bios-tables-test: Add data for complex numa test (GI, GP etc)
0242 002h]                        Reserved : 0000
[0F4h 0244 004h]                         Length : 00000078
[0F8h 0248 001h]          Flags (decoded below) : 00
                               Memory Hierarchy : 0
                      Use Minimum Transfer Size : 0
                       Non-sequential Transfers : 0
[0F9h 0249 001h]                      Data Type : 03
[0FAh 0250 001h]          Minimum Transfer Size : 00
[0FBh 0251 001h]                      Reserved1 : 00
[0FCh 0252 004h]  Initiator Proximity Domains # : 00000004
[100h 0256 004h]     Target Proximity Domains # : 00000006
[104h 0260 004h]                      Reserved2 : 00000000
[108h 0264 008h]                Entry Base Unit : 0000000000000004
[110h 0272 004h] Initiator Proximity Domain List : 00000000
[114h 0276 004h] Initiator Proximity Domain List : 00000001
[118h 0280 004h] Initiator Proximity Domain List : 00000003
[11Ch 0284 004h] Initiator Proximity Domain List : 00000005
[120h 0288 004h]    Target Proximity Domain List : 00000000
[124h 0292 004h]    Target Proximity Domain List : 00000001
[128h 0296 004h]    Target Proximity Domain List : 00000002
[12Ch 0300 004h]    Target Proximity Domain List : 00000003
[130h 0304 004h]    Target Proximity Domain List : 00000004
[134h 0308 004h]    Target Proximity Domain List : 00000005
[138h 0312 002h]                          Entry : 00C8
[13Ah 0314 002h]                          Entry : 0000
[13Ch 0316 002h]                          Entry : 0032
[13Eh 0318 002h]                          Entry : 0000
[140h 0320 002h]                          Entry : 0032
[142h 0322 002h]                          Entry : 0064
[144h 0324 002h]                          Entry : 0019
[146h 0326 002h]                          Entry : 0000
[148h 0328 002h]                          Entry : 0064
[14Ah 0330 002h]                          Entry : 0000
[14Ch 0332 002h]                          Entry : 00C8
[14Eh 0334 002h]                          Entry : 0019
[150h 0336 002h]                          Entry : 0064
[152h 0338 002h]                          Entry : 0000
[154h 0340 002h]                          Entry : 0032
[156h 0342 002h]                          Entry : 0000
[158h 0344 002h]                          Entry : 0032
[15Ah 0346 002h]                          Entry : 0064
[15Ch 0348 002h]                          Entry : 0064
[15Eh 0350 002h]                          Entry : 0000
[160h 0352 002h]                          Entry : 0032
[162h 0354 002h]                          Entry : 0000
[164h 0356 002h]                          Entry : 0032
[166h 0358 002h]                          Entry : 00C8

Note the zeros represent entries where the target node has no memory. These
could be suppressed but it isn't 'wrong' to provide them and it is (probably)
permissible under ACPI to hotplug memory into these nodes later.
Signed-off-by: Jonathan Cameron --- tests/qtest/bios-tables-test-allowed-diff.h | 5 - tests/data/acpi/q35/APIC.acpihmat-generic-x | Bin 0 -> 136 bytes tests/data/acpi/q35/CEDT.acpihmat-generic-x | Bin 0 -> 68 bytes tests/data/acpi/q35/DSDT.acpihmat-generic-x | Bin 0 -> 10400 bytes tests/data/acpi/q35/HMAT.acpihmat-generic-x | Bin 0 -> 360 bytes tests/data/acpi/q35/SRAT.acpihmat-generic-x | Bin 0 -> 520 bytes 6 files changed, 5 deletions(-) diff --git a/tests/qtest/bios-tables-test-allowed-diff.h b/tests/qtest/bios-tables-test-allowed-diff.h index a5aa801c99..dfb8523c8b 100644 --- a/tests/qtest/bios-tables-test-allowed-diff.h +++ b/tests/qtest/bios-tables-test-allowed-diff.h @@ -1,6 +1 @@ /* List of comma-separated changed AML files to ignore */ -"tests/data/acpi/q35/APIC.acpihmat-generic-x", -"tests/data/acpi/q35/CEDT.acpihmat-generic-x", -"tests/data/acpi/q35/DSDT.acpihmat-generic-x", -"tests/data/acpi/q35/HMAT.acpihmat-generic-x", -"tests/data/acpi/q35/SRAT.acpihmat-generic-x", diff --git a/tests/data/ac
[PATCH 5/6] bios-tables-test: Add complex SRAT / HMAT test for GI GP
Add a test with 6 nodes to exercise most interesting corner cases of SRAT and HMAT generation including the new Generic Initiator and Generic Port Affinity structures. More details of the set up in the following patch adding the table data. Signed-off-by: Jonathan Cameron --- tests/qtest/bios-tables-test.c | 92 ++ 1 file changed, 92 insertions(+) diff --git a/tests/qtest/bios-tables-test.c b/tests/qtest/bios-tables-test.c index d1ff4db7a2..1651d06b7b 100644 --- a/tests/qtest/bios-tables-test.c +++ b/tests/qtest/bios-tables-test.c @@ -1862,6 +1862,96 @@ static void test_acpi_q35_tcg_acpi_hmat_noinitiator(void) free_test_data(); } +/* Test intended to hit corner cases of SRAT and HMAT */ +static void test_acpi_q35_tcg_acpi_hmat_generic_x(void) +{ +test_data data = {}; + +data.machine = MACHINE_Q35; +data.variant = ".acpihmat-generic-x"; +test_acpi_one(" -machine hmat=on,cxl=on" + " -smp 3,sockets=3" + " -m 128M,maxmem=384M,slots=2" + " -device virtio-rng-pci,id=gidev" + " -device pxb-cxl,bus_nr=64,bus=pcie.0,id=cxl.1" + " -object memory-backend-ram,size=64M,id=ram0" + " -object memory-backend-ram,size=64M,id=ram1" + " -numa node,nodeid=0,cpus=0,memdev=ram0" + " -numa node,nodeid=1" + " -object acpi-generic-initiator,id=gi0,pci-dev=gidev,node=1" + " -numa node,nodeid=2" + " -object acpi-generic-port,id=gp0,pci-bus=cxl.1,node=2" + " -numa node,nodeid=3,cpus=1" + " -numa node,nodeid=4,memdev=ram1" + " -numa node,nodeid=5,cpus=2" + " -numa hmat-lb,initiator=0,target=0,hierarchy=memory," + "data-type=access-latency,latency=10" + " -numa hmat-lb,initiator=0,target=0,hierarchy=memory," + "data-type=access-bandwidth,bandwidth=800M" + " -numa hmat-lb,initiator=0,target=2,hierarchy=memory," + "data-type=access-latency,latency=100" + " -numa hmat-lb,initiator=0,target=2,hierarchy=memory," + "data-type=access-bandwidth,bandwidth=200M" + " -numa hmat-lb,initiator=0,target=4,hierarchy=memory," + "data-type=access-latency,latency=100" + " -numa 
hmat-lb,initiator=0,target=4,hierarchy=memory," + "data-type=access-bandwidth,bandwidth=200M" + " -numa hmat-lb,initiator=0,target=5,hierarchy=memory," + "data-type=access-latency,latency=200" + " -numa hmat-lb,initiator=0,target=5,hierarchy=memory," + "data-type=access-bandwidth,bandwidth=400M" + " -numa hmat-lb,initiator=1,target=0,hierarchy=memory," + "data-type=access-latency,latency=500" + " -numa hmat-lb,initiator=1,target=0,hierarchy=memory," + "data-type=access-bandwidth,bandwidth=100M" + " -numa hmat-lb,initiator=1,target=2,hierarchy=memory," + "data-type=access-latency,latency=50" + " -numa hmat-lb,initiator=1,target=2,hierarchy=memory," + "data-type=access-bandwidth,bandwidth=400M" + " -numa hmat-lb,initiator=1,target=4,hierarchy=memory," + "data-type=access-latency,latency=50" + " -numa hmat-lb,initiator=1,target=4,hierarchy=memory," + "data-type=access-bandwidth,bandwidth=800M" + " -numa hmat-lb,initiator=1,target=5,hierarchy=memory," + "data-type=access-latency,latency=500" + " -numa hmat-lb,initiator=1,target=5,hierarchy=memory," + "data-type=access-bandwidth,bandwidth=100M" + " -numa hmat-lb,initiator=3,target=0,hierarchy=memory," + "data-type=access-latency,latency=20" + " -numa hmat-lb,initiator=3,target=0,hierarchy=memory," + "data-type=access-bandwidth,bandwidth=400M" + " -numa hmat-lb,initiator=3,target=2,hierarchy=memory," + "data-type=access-latency,latency=80" + " -numa hmat-lb,initiator=3,target=2,hierarchy=memory," + "data-type=access-bandwidth
[PATCH 4/6] bios-tables-test: Allow for new acpihmat-generic-x test data.
The test to be added exercises many corners of the SRAT and HMAT table generation. Signed-off-by: Jonathan Cameron --- tests/qtest/bios-tables-test-allowed-diff.h | 5 + tests/data/acpi/q35/APIC.acpihmat-generic-x | 0 tests/data/acpi/q35/CEDT.acpihmat-generic-x | 0 tests/data/acpi/q35/DSDT.acpihmat-generic-x | 0 tests/data/acpi/q35/HMAT.acpihmat-generic-x | 0 tests/data/acpi/q35/SRAT.acpihmat-generic-x | 0 6 files changed, 5 insertions(+) diff --git a/tests/qtest/bios-tables-test-allowed-diff.h b/tests/qtest/bios-tables-test-allowed-diff.h index dfb8523c8b..a5aa801c99 100644 --- a/tests/qtest/bios-tables-test-allowed-diff.h +++ b/tests/qtest/bios-tables-test-allowed-diff.h @@ -1 +1,6 @@ /* List of comma-separated changed AML files to ignore */ +"tests/data/acpi/q35/APIC.acpihmat-generic-x", +"tests/data/acpi/q35/CEDT.acpihmat-generic-x", +"tests/data/acpi/q35/DSDT.acpihmat-generic-x", +"tests/data/acpi/q35/HMAT.acpihmat-generic-x", +"tests/data/acpi/q35/SRAT.acpihmat-generic-x", diff --git a/tests/data/acpi/q35/APIC.acpihmat-generic-x b/tests/data/acpi/q35/APIC.acpihmat-generic-x new file mode 100644 index 00..e69de29bb2 diff --git a/tests/data/acpi/q35/CEDT.acpihmat-generic-x b/tests/data/acpi/q35/CEDT.acpihmat-generic-x new file mode 100644 index 00..e69de29bb2 diff --git a/tests/data/acpi/q35/DSDT.acpihmat-generic-x b/tests/data/acpi/q35/DSDT.acpihmat-generic-x new file mode 100644 index 00..e69de29bb2 diff --git a/tests/data/acpi/q35/HMAT.acpihmat-generic-x b/tests/data/acpi/q35/HMAT.acpihmat-generic-x new file mode 100644 index 00..e69de29bb2 diff --git a/tests/data/acpi/q35/SRAT.acpihmat-generic-x b/tests/data/acpi/q35/SRAT.acpihmat-generic-x new file mode 100644 index 00..e69de29bb2 -- 2.39.2
[PATCH 3/6] hw/acpi: Generic Port Affinity Structure support
These are very similar to the recently added Generic Initiators but instead
of representing an initiator of memory traffic they represent an edge point
beyond which may lie either targets or initiators. Here we add these ports
such that they may be targets of hmat_lb records to describe the latency and
bandwidth from host side initiators to the port.

A discoverable mechanism such as UEFI CDAT read from CXL devices and
switches is used to discover the remainder of the path and the OS can build
up full latency and bandwidth numbers as needed for work and data placement
decisions.

Signed-off-by: Jonathan Cameron
---
 qapi/qom.json| 18 +++ include/hw/acpi/acpi_generic_initiator.h | 18 ++- include/hw/pci/pci_bridge.h | 1 + hw/acpi/acpi_generic_initiator.c | 141 +-- hw/pci-bridge/pci_expander_bridge.c | 1 -
 5 files changed, 141 insertions(+), 38 deletions(-)

diff --git a/qapi/qom.json b/qapi/qom.json
index 85e6b4f84a..5480d9ca24 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -826,6 +826,22 @@
     'data': { 'pci-dev': 'str', 'node': 'uint32' } }
 
+
+##
+# @AcpiGenericPortProperties:
+#
+# Properties for acpi-generic-port objects.
+# +# @pci-bus: PCI bus of the hostbridge associated with this SRAT entry +# +# @node: numa node associated with the PCI device +# +# Since: 9.1 +## +{ 'struct': 'AcpiGenericPortProperties', + 'data': { 'pci-bus': 'str', +'node': 'uint32' } } + ## # @RngProperties: # @@ -944,6 +960,7 @@ { 'enum': 'ObjectType', 'data': [ 'acpi-generic-initiator', +'acpi-generic-port', 'authz-list', 'authz-listfile', 'authz-pam', @@ -1016,6 +1033,7 @@ 'discriminator': 'qom-type', 'data': { 'acpi-generic-initiator': 'AcpiGenericInitiatorProperties', + 'acpi-generic-port': 'AcpiGenericPortProperties', 'authz-list': 'AuthZListProperties', 'authz-listfile': 'AuthZListFileProperties', 'authz-pam': 'AuthZPAMProperties', diff --git a/include/hw/acpi/acpi_generic_initiator.h b/include/hw/acpi/acpi_generic_initiator.h index 26e2bd92d4..49ac448034 100644 --- a/include/hw/acpi/acpi_generic_initiator.h +++ b/include/hw/acpi/acpi_generic_initiator.h @@ -30,6 +30,12 @@ typedef struct AcpiGenericInitiator { AcpiGenericNode parent; } AcpiGenericInitiator; +#define TYPE_ACPI_GENERIC_PORT "acpi-generic-port" + +typedef struct AcpiGenericPort { +AcpiGenericInitiator parent; +} AcpiGenericPort; + /* * ACPI 6.3: * Table 5-81 Flags – Generic Initiator Affinity Structure @@ -49,8 +55,16 @@ typedef enum { * Table 5-80 Device Handle - PCI */ typedef struct PCIDeviceHandle { -uint16_t segment; -uint16_t bdf; +union { +struct { +uint16_t segment; +uint16_t bdf; +}; +struct { +uint64_t hid; +uint32_t uid; +}; +}; } PCIDeviceHandle; void build_srat_generic_pci_initiator(GArray *table_data); diff --git a/include/hw/pci/pci_bridge.h b/include/hw/pci/pci_bridge.h index 5cd452115a..5456e24883 100644 --- a/include/hw/pci/pci_bridge.h +++ b/include/hw/pci/pci_bridge.h @@ -102,6 +102,7 @@ typedef struct PXBPCIEDev { PXBDev parent_obj; } PXBPCIEDev; +#define TYPE_PXB_CXL_BUS "pxb-cxl-bus" #define TYPE_PXB_DEV "pxb" OBJECT_DECLARE_SIMPLE_TYPE(PXBDev, PXB_DEV) diff --git a/hw/acpi/acpi_generic_initiator.c 
b/hw/acpi/acpi_generic_initiator.c index c054e0e27d..85191e90ab 100644 --- a/hw/acpi/acpi_generic_initiator.c +++ b/hw/acpi/acpi_generic_initiator.c @@ -7,6 +7,7 @@ #include "hw/acpi/acpi_generic_initiator.h" #include "hw/acpi/aml-build.h" #include "hw/boards.h" +#include "hw/pci/pci_bridge.h" #include "hw/pci/pci_device.h" #include "qemu/error-report.h" @@ -18,6 +19,10 @@ typedef struct AcpiGenericInitiatorClass { AcpiGenericNodeClass parent_class; } AcpiGenericInitiatorClass; +typedef struct AcpiGenericPortClass { +AcpiGenericInitiatorClass parent; +} AcpiGenericPortClass; + OBJECT_DEFINE_ABSTRACT_TYPE(AcpiGenericNode, acpi_generic_node, ACPI_GENERIC_NODE, OBJECT) @@ -30,6 +35,13 @@ OBJECT_DEFINE_TYPE_WITH_INTERFACES(AcpiGenericInitiator, acpi_generic_initiator, OBJECT_DECLARE_SIMPLE_TYPE(AcpiGenericInitiator, ACPI_GENERIC_INITIATOR) +OBJECT_DEFINE_TYPE_WITH_INTERFACES(AcpiGenericPort, acpi_generic_port, + ACPI_GENERIC_PORT, ACPI_GENERIC_NODE, + { TYPE_USER_CREATABLE }, + { NULL }) + +OBJECT_DECLARE_SIMPLE_TYPE(AcpiGenericPort, ACPI_GENERIC_PORT) + static void acpi_generic_node_init(Object *obj) { AcpiGenericNode *gn = ACPI_GENERIC_NODE(obj); @@ -53,6 +65,14 @@ static void acpi_generic_initiator_finalize(Object *obj) { } +static void acpi_generic_port_init
[PATCH 2/6] hw/acpi: Insert an acpi-generic-node base under acpi-generic-initiator
This will simplify reuse when adding acpi-generic-port. Note that some error_printf() messages will now print acpi-generic-node whereas others will move to type specific cases in next patch so are left alone for now. Signed-off-by: Jonathan Cameron --- include/hw/acpi/acpi_generic_initiator.h | 15 - hw/acpi/acpi_generic_initiator.c | 78 +++- 2 files changed, 62 insertions(+), 31 deletions(-) diff --git a/include/hw/acpi/acpi_generic_initiator.h b/include/hw/acpi/acpi_generic_initiator.h index a304bad73e..26e2bd92d4 100644 --- a/include/hw/acpi/acpi_generic_initiator.h +++ b/include/hw/acpi/acpi_generic_initiator.h @@ -8,15 +8,26 @@ #include "qom/object_interfaces.h" -#define TYPE_ACPI_GENERIC_INITIATOR "acpi-generic-initiator" +/* + * Abstract type to be used as base for + * - acpi-generic-initator + * - acpi-generic-port + */ +#define TYPE_ACPI_GENERIC_NODE "acpi-generic-node" -typedef struct AcpiGenericInitiator { +typedef struct AcpiGenericNode { /* private */ Object parent; /* public */ char *pci_dev; uint16_t node; +} AcpiGenericNode; + +#define TYPE_ACPI_GENERIC_INITIATOR "acpi-generic-initiator" + +typedef struct AcpiGenericInitiator { +AcpiGenericNode parent; } AcpiGenericInitiator; /* diff --git a/hw/acpi/acpi_generic_initiator.c b/hw/acpi/acpi_generic_initiator.c index 18a939b0e5..c054e0e27d 100644 --- a/hw/acpi/acpi_generic_initiator.c +++ b/hw/acpi/acpi_generic_initiator.c @@ -10,45 +10,61 @@ #include "hw/pci/pci_device.h" #include "qemu/error-report.h" -typedef struct AcpiGenericInitiatorClass { +typedef struct AcpiGenericNodeClass { ObjectClass parent_class; +} AcpiGenericNodeClass; + +typedef struct AcpiGenericInitiatorClass { + AcpiGenericNodeClass parent_class; } AcpiGenericInitiatorClass; +OBJECT_DEFINE_ABSTRACT_TYPE(AcpiGenericNode, acpi_generic_node, +ACPI_GENERIC_NODE, OBJECT) + +OBJECT_DECLARE_SIMPLE_TYPE(AcpiGenericNode, ACPI_GENERIC_NODE) + OBJECT_DEFINE_TYPE_WITH_INTERFACES(AcpiGenericInitiator, acpi_generic_initiator, - 
ACPI_GENERIC_INITIATOR, OBJECT, + ACPI_GENERIC_INITIATOR, ACPI_GENERIC_NODE, { TYPE_USER_CREATABLE }, { NULL }) OBJECT_DECLARE_SIMPLE_TYPE(AcpiGenericInitiator, ACPI_GENERIC_INITIATOR) +static void acpi_generic_node_init(Object *obj) +{ +AcpiGenericNode *gn = ACPI_GENERIC_NODE(obj); + +gn->node = MAX_NODES; +gn->pci_dev = NULL; +} + static void acpi_generic_initiator_init(Object *obj) { -AcpiGenericInitiator *gi = ACPI_GENERIC_INITIATOR(obj); +} + +static void acpi_generic_node_finalize(Object *obj) +{ +AcpiGenericNode *gn = ACPI_GENERIC_NODE(obj); -gi->node = MAX_NODES; -gi->pci_dev = NULL; +g_free(gn->pci_dev); } static void acpi_generic_initiator_finalize(Object *obj) { -AcpiGenericInitiator *gi = ACPI_GENERIC_INITIATOR(obj); - -g_free(gi->pci_dev); } -static void acpi_generic_initiator_set_pci_device(Object *obj, const char *val, - Error **errp) +static void acpi_generic_node_set_pci_device(Object *obj, const char *val, + Error **errp) { -AcpiGenericInitiator *gi = ACPI_GENERIC_INITIATOR(obj); +AcpiGenericNode *gn = ACPI_GENERIC_NODE(obj); -gi->pci_dev = g_strdup(val); +gn->pci_dev = g_strdup(val); } - -static void acpi_generic_initiator_set_node(Object *obj, Visitor *v, -const char *name, void *opaque, -Error **errp) +static void acpi_generic_node_set_node(Object *obj, Visitor *v, + const char *name, void *opaque, + Error **errp) { -AcpiGenericInitiator *gi = ACPI_GENERIC_INITIATOR(obj); +AcpiGenericNode *gn = ACPI_GENERIC_NODE(obj); MachineState *ms = MACHINE(qdev_get_machine()); uint32_t value; @@ -58,20 +74,24 @@ static void acpi_generic_initiator_set_node(Object *obj, Visitor *v, if (value >= MAX_NODES) { error_printf("%s: Invalid NUMA node specified\n", - TYPE_ACPI_GENERIC_INITIATOR); + TYPE_ACPI_GENERIC_NODE); exit(1); } -gi->node = value; -ms->numa_state->nodes[gi->node].has_gi = true; +gn->node = value; +ms->numa_state->nodes[gn->node].has_gi = true; } -static void acpi_generic_initiator_class_init(ObjectClass *oc, void *data) +static void 
acpi_generic_node_class_init(ObjectClass *oc, void *data) { object_class_property_add_str(oc, "pci-dev", NULL, -acpi_generic_initiator_set_pci_device); +acpi_generic_node_set_pci_device); object_class_property_
[PATCH 1/6] hw/acpi/GI: Fix trivial parameter alignment issue.
Before making additional modification, tidy up this misleading indentation. Signed-off-by: Jonathan Cameron --- hw/acpi/acpi_generic_initiator.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/hw/acpi/acpi_generic_initiator.c b/hw/acpi/acpi_generic_initiator.c index 17b9a052f5..18a939b0e5 100644 --- a/hw/acpi/acpi_generic_initiator.c +++ b/hw/acpi/acpi_generic_initiator.c @@ -132,7 +132,7 @@ static int build_all_acpi_generic_initiators(Object *obj, void *opaque) dev_handle.segment = 0; dev_handle.bdf = PCI_BUILD_BDF(pci_bus_num(pci_get_bus(pci_dev)), - pci_dev->devfn); + pci_dev->devfn); build_srat_generic_pci_initiator_affinity(table_data, gi->node, _handle); -- 2.39.2
[PATCH 0/6 qemu] acpi: NUMA nodes for CXL HB as GP + complex NUMA test.
ACPI 6.5 introduced Generic Port Affinity Structures to close a system
description gap that was a problem for CXL memory systems. It defines a new
SRAT Affinity structure (and hence allows creation of an ACPI Proximity Node,
which can only be defined via an SRAT structure) for the boundary between a
discoverable fabric and non-discoverable system interconnects etc.

The HMAT data on latency and bandwidth is combined with discoverable
information from the CXL bus (link speeds, lane counts) and CXL devices
(switch port to port characteristics and USP to memory, via CDAT tables read
from the device). QEMU has supported the rest of the elements of this chain
for a while but now the kernel has caught up and we need the missing element
of Generic Ports (this code has been used extensively in testing and
debugging that kernel support, some resulting fixes currently under review).

Generic Port Affinity Structures are very similar to the recently added
Generic Initiator Affinity Structures (GI) so this series factors out much
of that infrastructure for reuse. There are subtle differences (beyond the
obvious structure ID change).

- The ACPI spec example (and linux kernel support) has a Generic Port
  associated not with the CXL root port, but rather with the CXL Host
  bridge. As a result, an ACPI handle is used (rather than the PCI SBDF
  option for GIs). In QEMU the easiest way to get to this is to target the
  root bridge PCI Bus, and conveniently the root bridge bus number is used
  for the UID, allowing us to construct an appropriate entry.

A key addition of this series is a complex NUMA topology example that
stretches the QEMU emulation code for GI, GP and nodes with just CPUs, just
memory, just hot pluggable memory, and a mixture of memory and CPUs.
A similar test showed up a few NUMA related bugs with fixes applied for 9.0 (note that one of these needs linux booted to identify that it rejects the HMAT table and this test is a regression test for the table generation only). https://lore.kernel.org/qemu-devel/2eb6672cfdaea7dacd8e9bb0523887f13b9f85ce.1710282274.git@redhat.com/ https://lore.kernel.org/qemu-devel/74e2845c5f95b0c139c79233ddb65bb17f2dd679.1710282274.git@redhat.com/ Jonathan Cameron (6): hw/acpi/GI: Fix trivial parameter alignment issue. hw/acpi: Insert an acpi-generic-node base under acpi-generic-initiator hw/acpi: Generic Port Affinity Structure support bios-tables-test: Allow for new acpihmat-generic-x test data. bios-tables-test: Add complex SRAT / HMAT test for GI GP bios-tables-test: Add data for complex numa test (GI, GP etc) qapi/qom.json | 18 ++ include/hw/acpi/acpi_generic_initiator.h| 33 +++- include/hw/pci/pci_bridge.h | 1 + hw/acpi/acpi_generic_initiator.c| 199 ++-- hw/pci-bridge/pci_expander_bridge.c | 1 - tests/qtest/bios-tables-test.c | 92 + tests/data/acpi/q35/APIC.acpihmat-generic-x | Bin 0 -> 136 bytes tests/data/acpi/q35/CEDT.acpihmat-generic-x | Bin 0 -> 68 bytes tests/data/acpi/q35/DSDT.acpihmat-generic-x | Bin 0 -> 10400 bytes tests/data/acpi/q35/HMAT.acpihmat-generic-x | Bin 0 -> 360 bytes tests/data/acpi/q35/SRAT.acpihmat-generic-x | Bin 0 -> 520 bytes 11 files changed, 285 insertions(+), 59 deletions(-) create mode 100644 tests/data/acpi/q35/APIC.acpihmat-generic-x create mode 100644 tests/data/acpi/q35/CEDT.acpihmat-generic-x create mode 100644 tests/data/acpi/q35/DSDT.acpihmat-generic-x create mode 100644 tests/data/acpi/q35/HMAT.acpihmat-generic-x create mode 100644 tests/data/acpi/q35/SRAT.acpihmat-generic-x -- 2.39.2
Re: [PATCH 2/2] CXL/cxl_type3: reset DVSEC CXL Control in ct3d_reset
On Tue, 2 Apr 2024 09:46:47 +0800
Li Zhijian wrote:

> After the kernel commit
> 0cab68720598 ("cxl/pci: Fix disabling memory if DVSEC CXL Range does not
> match a CFMWS window")

Fixes tag seems appropriate.

> CXL type3 devices cannot be enabled again after the reboot because this
> flag was not reset.
> 
> This flag could be changed by the firmware or OS, let it have a
> reset(default) value in reboot so that the OS can read its clean status.

Good find. I think we should aim for a fix that is less fragile to future
code rearrangement etc.

> 
> Signed-off-by: Li Zhijian
> ---
>  hw/mem/cxl_type3.c | 14 +-
>  1 file changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/mem/cxl_type3.c b/hw/mem/cxl_type3.c
> index ad2fe7d463fb..3fe136053390 100644
> --- a/hw/mem/cxl_type3.c
> +++ b/hw/mem/cxl_type3.c
> @@ -305,7 +305,8 @@ static void build_dvsecs(CXLType3Dev *ct3d)
> 
>      dvsec = (uint8_t *)&(CXLDVSECDevice){
>          .cap = 0x1e,
> -        .ctrl = 0x2,
> +#define CT3D_DEVSEC_CXL_CTRL 0x2
> +        .ctrl = CT3D_DEVSEC_CXL_CTRL,

Naming doesn't make it clear the define is a reset value / default value.

>          .status2 = 0x2,
>          .range1_size_hi = range1_size_hi,
>          .range1_size_lo = range1_size_lo,
> @@ -906,6 +907,16 @@ MemTxResult cxl_type3_write(PCIDevice *d, hwaddr host_addr, uint64_t data,
>      return address_space_write(as, dpa_offset, attrs, &data, size);
>  }
> 
> +/* Reset DVSEC CXL Control */
> +static void ct3d_dvsec_cxl_ctrl_reset(CXLType3Dev *ct3d)
> +{
> +    uint16_t offset = first_dvsec_offset(ct3d);

This relies too much on the current memory layout. We should be doing a
search of config space to find the right entry, or we should cache a
pointer to the relevant structure when we fill it in the first time.
> +    CXLDVSECDevice *dvsec;
> +
> +    dvsec = (CXLDVSECDevice *)(ct3d->cxl_cstate.pdev->config + offset);
> +    dvsec->ctrl = CT3D_DEVSEC_CXL_CTRL;
> +}
> +
>  static void ct3d_reset(DeviceState *dev)
>  {
>      CXLType3Dev *ct3d = CXL_TYPE3(dev);
> @@ -914,6 +925,7 @@ static void ct3d_reset(DeviceState *dev)
> 
>      cxl_component_register_init_common(reg_state, write_msk, CXL2_TYPE3_DEVICE);
>      cxl_device_register_init_t3(ct3d);
> +    ct3d_dvsec_cxl_ctrl_reset(ct3d);
> 
>      /*
>       * Bring up an endpoint to target with MCTP over VDM.
Re: [PATCH 1/2] CXL/cxl_type3: add first_dvsec_offset() helper
On Tue, 2 Apr 2024 09:46:46 +0800
Li Zhijian wrote:

> It helps to figure out where the first dvsec register is located. In
> addition, replace hardcoded offsets and sizes with existing macros.
> 
> Signed-off-by: Li Zhijian

I agree we should be using the macros. The offset calc is a bit specific to
the chosen memory layout, so not sure it makes sense to break it out to a
separate function. I'll suggest alternative possible approaches in review of
the next patch.

Jonathan

> ---
>  hw/mem/cxl_type3.c | 19 +--
>  1 file changed, 13 insertions(+), 6 deletions(-)
> 
> diff --git a/hw/mem/cxl_type3.c b/hw/mem/cxl_type3.c
> index b0a7e9f11b64..ad2fe7d463fb 100644
> --- a/hw/mem/cxl_type3.c
> +++ b/hw/mem/cxl_type3.c
> @@ -643,6 +643,16 @@ static DOEProtocol doe_cdat_prot[] = {
>      { }
>  };
> 
> +static uint16_t first_dvsec_offset(CXLType3Dev *ct3d)
> +{
> +    uint16_t offset = PCI_CONFIG_SPACE_SIZE;
> +
> +    if (ct3d->sn != UI64_NULL)
> +        offset += PCI_EXT_CAP_DSN_SIZEOF;
> +
> +    return offset;
> +}
> +
>  static void ct3_realize(PCIDevice *pci_dev, Error **errp)
>  {
>      ERRP_GUARD();
> @@ -663,13 +673,10 @@ static void ct3_realize(PCIDevice *pci_dev, Error **errp)
>      pci_config_set_prog_interface(pci_conf, 0x10);
> 
>      pcie_endpoint_cap_init(pci_dev, 0x80);
> -    if (ct3d->sn != UI64_NULL) {
> -        pcie_dev_ser_num_init(pci_dev, 0x100, ct3d->sn);
> -        cxl_cstate->dvsec_offset = 0x100 + 0x0c;
> -    } else {
> -        cxl_cstate->dvsec_offset = 0x100;
> -    }
> +    if (ct3d->sn != UI64_NULL)
> +        pcie_dev_ser_num_init(pci_dev, PCI_CONFIG_SPACE_SIZE, ct3d->sn);
> 
> +    cxl_cstate->dvsec_offset = first_dvsec_offset(ct3d);
>      ct3d->cxl_cstate.pdev = pci_dev;
>      build_dvsecs(ct3d);
> 
Re: [PATCH v2 3/4] hw/cxl/mbox: replace sanitize_running() with cxl_dev_media_disabled()
On Sun, 21 Jan 2024 21:50:00 -0500
Hyeonggon Yoo <42.hye...@gmail.com> wrote:

> On Tue, Jan 9, 2024 at 12:54 PM Jonathan Cameron
> wrote:
> >
> > On Fri, 22 Dec 2023 18:00:50 +0900
> > Hyeonggon Yoo <42.hye...@gmail.com> wrote:
> >
> > > The spec states that reads/writes should have no effect and a part of
> > > commands should be ignored when the media is disabled, not when the
> > > sanitize command is running.
> > >
> > > Introduce cxl_dev_media_disabled() to check if the media is disabled and
> > > replace sanitize_running() with it.
> > >
> > > Make sure that the media has been correctly disabled during sanitation
> > > by adding an assert to __toggle_media(). Now, enabling when already
> > > enabled or vice versa results in an assert() failure.
> > >
> > > Suggested-by: Davidlohr Bueso
> > > Signed-off-by: Hyeonggon Yoo <42.hye...@gmail.com>
> >
> > This applies to
> >
> > hw/cxl: Add get scan media capabilities cmd support.
> >
> > Should I just squash it with that patch in my tree?
> > For now I'm holding it immediately on top of that, but I'm not keen to
> > send messy code upstream unless there is a good reason to retain the
> > history.
>
> Oh, while the diff looks like the patch touches scan_media_running(), it's not.
>
> The proper Fixes: tag will be:
> Fixes: d77176724422 ("hw/cxl: Add support for device sanitation")
>
> > If you are doing this sort of fix series in future, please call out
> > what they fix explicitly. Can't use fixes tags as the commit ids
> > are unstable, but can mention the patch to make my life easier!
>
> Okay, next time I will either add the Fixes tag or add a comment on
> what it fixes.
>
> By the way I guess your latest, public branch is still cxl-2023-11-02, right?
> https://gitlab.com/jic23/qemu/-/tree/cxl-2023-11-02
>
> I assume you adjusted my v2 series, but please let me know if you prefer
> sending v3 against your latest tree.
>
> Thanks,
> Hyeonggon

Side note, in its current form this breaks the switch-cci support in
upstream QEMU. I've finally gotten back to getting ready to look at MMPT
support and ran into a crash as a result. Needs protection with a checked
object_dynamic_cast() to make sure we have a type3 device.

I'll update the patch in my tree.

Thanks,

Jonathan

>
Re: [RFC PATCH-for-9.1 08/29] hw/i386/pc: Move CXLState to PcPciMachineState
On Thu, 28 Mar 2024 16:54:16 +0100
Philippe Mathieu-Daudé wrote:

> CXL depends on PCIe, which isn't available on non-PCI
> machines such as the ISA-only PC one.
> Move CXLState to PcPciMachineState, and move the CXL
> specific calls to pc_pci_machine_initfn() and
> pc_pci_machine_done().
> 
> Signed-off-by: Philippe Mathieu-Daudé

LGTM as a change on its own - I've not reviewed the series in general
though, hence just an ack as an rb feels too strong.

Acked-by: Jonathan Cameron
Re: [PATCH-for-9.0] hw/i386/pc: Restrict CXL to PCI-based machines
On Wed, 27 Mar 2024 17:16:42 +0100
Philippe Mathieu-Daudé wrote:

> CXL is based on PCIe. It is pointless to initialize
> its context on non-PCI machines.
> 
> Signed-off-by: Philippe Mathieu-Daudé

Seems a reasonable restriction.

Acked-by: Jonathan Cameron

Jonathan

> ---
>  hw/i386/pc.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> index e80f02bef4..5c21b0c4db 100644
> --- a/hw/i386/pc.c
> +++ b/hw/i386/pc.c
> @@ -1738,7 +1738,9 @@ static void pc_machine_initfn(Object *obj)
>      pcms->pcspk = isa_new(TYPE_PC_SPEAKER);
>      object_property_add_alias(OBJECT(pcms), "pcspk-audiodev",
>                                OBJECT(pcms->pcspk), "audiodev");
> -    cxl_machine_init(obj, &pcms->cxl_devices_state);
> +    if (pcmc->pci_enabled) {
> +        cxl_machine_init(obj, &pcms->cxl_devices_state);
> +    }
> 
>      pcms->machine_done.notify = pc_machine_done;
>      qemu_add_machine_init_done_notifier(&pcms->machine_done);
Re: [PATCH] mem/cxl_type3: fix hpa to dpa logic
On Thu, 28 Mar 2024 06:24:24 + "Xingtao Yao (Fujitsu)" wrote: > Jonathan > > thanks for your reply! > > > -Original Message- > > From: Jonathan Cameron > > Sent: Wednesday, March 27, 2024 9:28 PM > > To: Yao, Xingtao/姚 幸涛 > > Cc: fan...@samsung.com; qemu-devel@nongnu.org; Cao, Quanquan/曹 全全 > > > > Subject: Re: [PATCH] mem/cxl_type3: fix hpa to dpa logic > > > > On Tue, 26 Mar 2024 21:46:53 -0400 > > Yao Xingtao wrote: > > > > > In 3, 6, 12 interleave ways, we could not access cxl memory properly, > > > and when the process is running on it, a 'segmentation fault' error will > > > occur. > > > > > > According to the CXL specification '8.2.4.20.13 Decoder Protection', > > > there are two branches to convert HPA to DPA: > > > b1: Decoder[m].IW < 8 (for 1, 2, 4, 8, 16 interleave ways) > > > b2: Decoder[m].IW >= 8 (for 3, 6, 12 interleave ways) > > > > > > but only b1 has been implemented. > > > > > > To solve this issue, we should implement b2: > > > DPAOffset[51:IG+8]=HPAOffset[51:IG+IW] / 3 > > > DPAOffset[IG+7:0]=HPAOffset[IG+7:0] > > > DPA=DPAOffset + Decoder[n].DPABase > > > > > > Links: > > https://lore.kernel.org/linux-cxl/3e84b919-7631-d1db-3e1d-33000f3f3868@fujits > > u.com/ > > > Signed-off-by: Yao Xingtao > > > > Not implementing this was intentional (shouldn't seg fault obviously) but > > I thought we were not advertising EP support for 3, 6, 12? The HDM Decoder > > configuration checking is currently terrible so we don't prevent > > the bits being set (adding device side sanity checks for those decoders > > has been on the todo list for a long time). There are a lot of ways of > > programming those that will blow up. > > > > Can you confirm that the emulation reports they are supported. > > https://elixir.bootlin.com/qemu/v9.0.0-rc1/source/hw/cxl/cxl-component-utils.c > > #L246 > > implies it shouldn't and so any software using them is broken. > yes, the feature is not supported by QEMU, but I can still create a > 6-interleave-ways region on kernel layer. 
> 
> I checked the source code of kernel, and found that the kernel did not check
> this bit when committing decoder.
> we may add some check on kernel side.

ouch. We definitely want that check! The decoder commit will fail anyway
(which QEMU doesn't do yet because we don't do all the sanity checks we
should). However, failing on commit is nasty as the reason should have been
detected earlier.

> 
> > 
> > The non power of 2 decodes always made me nervous as the maths is more
> > complex and any changes to that decode will need careful checking.
> > For the power of 2 cases it was a bunch of writes to edge conditions etc
> > and checking the right data landed in the backing stores.

> after applying this modification, I tested some command by using these
> memory, like 'ls', 'top'..
> and they can be executed normally, maybe there are some other problems I
> haven't met yet.

I usually run a bunch of manual tests with devmem2 to ensure the edge cases
are handled correctly, but I've not really seen any errors that didn't also
show up in running stressors (e.g. stressng) or just memhog on the memory.
Jonathan > > > > > Jonathan > > > > > > > --- > > > hw/mem/cxl_type3.c | 15 +++ > > > 1 file changed, 11 insertions(+), 4 deletions(-) > > > > > > diff --git a/hw/mem/cxl_type3.c b/hw/mem/cxl_type3.c > > > index b0a7e9f11b..2c1218fb12 100644 > > > --- a/hw/mem/cxl_type3.c > > > +++ b/hw/mem/cxl_type3.c > > > @@ -805,10 +805,17 @@ static bool cxl_type3_dpa(CXLType3Dev *ct3d, hwaddr > > > > > host_addr, uint64_t *dpa) > > > continue; > > > } > > > > > > -*dpa = dpa_base + > > > -((MAKE_64BIT_MASK(0, 8 + ig) & hpa_offset) | > > > - ((MAKE_64BIT_MASK(8 + ig + iw, 64 - 8 - ig - iw) & > > > hpa_offset) > > > - >> iw)); > > > +if (iw < 8) { > > > +*dpa = dpa_base + > > > +((MAKE_64BIT_MASK(0, 8 + ig) & hpa_offset) | > > > + ((MAKE_64BIT_MASK(8 + ig + iw, 64 - 8 - ig - iw) & > > hpa_offset) > > > + >> iw)); > > > +} else { > > > +*dpa = dpa_base + > > > +((MAKE_64BIT_MASK(0, 8 + ig) & hpa_offset) | > > > + ((((MAKE_64BIT_MASK(ig + iw, 64 - ig - iw) & hpa_offset) > > > + >> (ig + iw)) / 3) << (ig + 8))); > > > +} > > > > > > return true; > > > } >
Re: [PATCH] mem/cxl_type3: fix hpa to dpa logic
On Tue, 26 Mar 2024 21:46:53 -0400 Yao Xingtao wrote: > In 3, 6, 12 interleave ways, we could not access cxl memory properly, > and when the process is running on it, a 'segmentation fault' error will > occur. > > According to the CXL specification '8.2.4.20.13 Decoder Protection', > there are two branches to convert HPA to DPA: > b1: Decoder[m].IW < 8 (for 1, 2, 4, 8, 16 interleave ways) > b2: Decoder[m].IW >= 8 (for 3, 6, 12 interleave ways) > > but only b1 has been implemented. > > To solve this issue, we should implement b2: > DPAOffset[51:IG+8]=HPAOffset[51:IG+IW] / 3 > DPAOffset[IG+7:0]=HPAOffset[IG+7:0] > DPA=DPAOffset + Decoder[n].DPABase > > Links: > https://lore.kernel.org/linux-cxl/3e84b919-7631-d1db-3e1d-33000f3f3...@fujitsu.com/ > Signed-off-by: Yao Xingtao Not implementing this was intentional (shouldn't seg fault obviously) but I thought we were not advertising EP support for 3, 6, 12? The HDM Decoder configuration checking is currently terrible so we don't prevent the bits being set (adding device side sanity checks for those decoders has been on the todo list for a long time). There are a lot of ways of programming those that will blow up. Can you confirm that the emulation reports they are supported. https://elixir.bootlin.com/qemu/v9.0.0-rc1/source/hw/cxl/cxl-component-utils.c#L246 implies it shouldn't and so any software using them is broken. The non power of 2 decodes always made me nervous as the maths is more complex and any changes to that decode will need careful checking. For the power of 2 cases it was a bunch of writes to edge conditions etc and checking the right data landed in the backing stores. 
Jonathan > --- > hw/mem/cxl_type3.c | 15 +++ > 1 file changed, 11 insertions(+), 4 deletions(-) > > diff --git a/hw/mem/cxl_type3.c b/hw/mem/cxl_type3.c > index b0a7e9f11b..2c1218fb12 100644 > --- a/hw/mem/cxl_type3.c > +++ b/hw/mem/cxl_type3.c > @@ -805,10 +805,17 @@ static bool cxl_type3_dpa(CXLType3Dev *ct3d, hwaddr > host_addr, uint64_t *dpa) > continue; > } > > -*dpa = dpa_base + > -((MAKE_64BIT_MASK(0, 8 + ig) & hpa_offset) | > - ((MAKE_64BIT_MASK(8 + ig + iw, 64 - 8 - ig - iw) & hpa_offset) > - >> iw)); > +if (iw < 8) { > +*dpa = dpa_base + > +((MAKE_64BIT_MASK(0, 8 + ig) & hpa_offset) | > + ((MAKE_64BIT_MASK(8 + ig + iw, 64 - 8 - ig - iw) & > hpa_offset) > + >> iw)); > +} else { > +*dpa = dpa_base + > +((MAKE_64BIT_MASK(0, 8 + ig) & hpa_offset) | > + ((((MAKE_64BIT_MASK(ig + iw, 64 - ig - iw) & hpa_offset) > + >> (ig + iw)) / 3) << (ig + 8))); > +} > > return true; > }
Re: [PATCH v2 1/1] cxl/mem: Fix for the index of Clear Event Record Handle
On Mon, 18 Mar 2024 10:29:28 +0800 Yuquan Wang wrote: > The dev_dbg info for Clear Event Records mailbox command would report > the handle of the next record to clear not the current one. > > This was because the index 'i' had incremented before printing the > current handle value. > > Signed-off-by: Yuquan Wang > --- > drivers/cxl/core/mbox.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c > index 9adda4795eb7..b810a6aa3010 100644 > --- a/drivers/cxl/core/mbox.c > +++ b/drivers/cxl/core/mbox.c > @@ -915,7 +915,7 @@ static int cxl_clear_event_record(struct cxl_memdev_state > *mds, > > payload->handles[i++] = gen->hdr.handle; > dev_dbg(mds->cxlds.dev, "Event log '%d': Clearing %u\n", log, > - le16_to_cpu(payload->handles[i])); > + le16_to_cpu(payload->handles[i-1])); Trivial but needs spaces around the -. e.g. [i - 1] Maybe Dan can fix up whilst applying. Otherwise Reviewed-by: Jonathan Cameron > > if (i == max_handles) { > payload->nr_recs = i;
Re: [PATCH v2 2/2] hmat acpi: Fix out of bounds access due to missing use of indirection
On Wed, 13 Mar 2024 21:24:06 +0300 Michael Tokarev wrote: > 07.03.2024 19:03, Jonathan Cameron via wrote: > > With a numa set up such as > > > > -numa nodeid=0,cpus=0 \ > > -numa nodeid=1,memdev=mem \ > > -numa nodeid=2,cpus=1 > > > > and appropriate hmat_lb entries the initiator list is correctly > > computed and written to HMAT as 0,2 but then the LB data is accessed > > using the node id (here 2), landing outside the entry_list array. > > > > Stash the reverse lookup when writing the initiator list and use > > it to get the correct array index. > > > > Fixes: 4586a2cb83 ("hmat acpi: Build System Locality Latency and Bandwidth > > Information Structure(s)") > > Signed-off-by: Jonathan Cameron > > This seems like a -stable material, is it not? Yes. Use case is obscure, but indeed seems suitable for stable. Thanks. Jonathan > > Thanks, > > /mjt > > > --- > > hw/acpi/hmat.c | 6 +- > > 1 file changed, 5 insertions(+), 1 deletion(-) > > > > diff --git a/hw/acpi/hmat.c b/hw/acpi/hmat.c > > index 723ae28d32..b933ae3c06 100644 > > --- a/hw/acpi/hmat.c > > +++ b/hw/acpi/hmat.c > > @@ -78,6 +78,7 @@ static void build_hmat_lb(GArray *table_data, > > HMAT_LB_Info *hmat_lb, > > uint32_t *initiator_list) > > { > > int i, index; > > +uint32_t initiator_to_index[MAX_NODES] = {}; > > HMAT_LB_Data *lb_data; > > uint16_t *entry_list; > > uint32_t base; > > @@ -121,6 +122,8 @@ static void build_hmat_lb(GArray *table_data, > > HMAT_LB_Info *hmat_lb, > > /* Initiator Proximity Domain List */ > > for (i = 0; i < num_initiator; i++) { > > build_append_int_noprefix(table_data, initiator_list[i], 4); > > +/* Reverse mapping for array positions */ > > +initiator_to_index[initiator_list[i]] = i; > > } > > > > /* Target Proximity Domain List */ > > @@ -132,7 +135,8 @@ static void build_hmat_lb(GArray *table_data, > > HMAT_LB_Info *hmat_lb, > > entry_list = g_new0(uint16_t, num_initiator * num_target); > > for (i = 0; i < hmat_lb->list->len; i++) { > > lb_data =
&g_array_index(hmat_lb->list, HMAT_LB_Data, i); > > -index = lb_data->initiator * num_target + lb_data->target; > > +index = initiator_to_index[lb_data->initiator] * num_target + > > +lb_data->target; > > entry_list[index] = (uint16_t)(lb_data->data / hmat_lb->base); > > } >
Re: [PATCH v9 0/7] QEMU CXL Provide mock CXL events and irq support
On Fri, 15 Mar 2024 09:52:28 +0800 Yuquan Wang wrote: > Hello, Jonathan > > During the test of QMP commands for CXL events like > "cxl-inject-general-media-event", > I am confused about the argument "flags". According to "qapi/cxl.json" in > qemu, > this argument represents "Event Record Flags" in Common Event Record Format. > However, it seems like the specific 'Event Record Severity' in this field can > be > different from the value of 'Event Status' in "Event Status Register". > > For instance (take an injection example in the cover letter): > > { "execute": "cxl-inject-general-media-event", > "arguments": { > "path": "/machine/peripheral/cxl-mem0", > "log": "informational", > "flags": 1, > "dpa": 1000, > "descriptor": 3, > "type": 3, > "transaction-type": 192, > "channel": 3, > "device": 5, > "component-id": "iras mem" > }} > > In my understanding, the 'Event Status' is informational and the > 'Event Record Severity' is Warning event, which means these two arguments are > independent of each other. Is my understanding correct? The event status register dictates the notification path (which log). So I think that's "informational" here. Whereas flags is about the specific error. One case where they might be different is where the Related Event Record Handle is set. An error might be reported as 1) Several things that were non fatal (each with their own record) 2) In combination they result in a fatal situation (also has its own record). The QEMU injection shouldn't restrict these combinations more than the spec does (which is not at all!). This same disconnect in error severity is seen in UEFI CPER records for example where there is a containing record with one severity field, but more specific parts of record can have lower (or in theory higher) severity. Jonathan > > Many thanks > Yuquan >
Re: [PATCH v5 09/13] hw/cxl/events: Add qmp interfaces to add/release dynamic capacity extents
On Fri, 8 Mar 2024 20:35:53 -0800 fan wrote: > On Thu, Mar 07, 2024 at 12:45:55PM +0000, Jonathan Cameron wrote: > > ... > > > > > > > +list = records; > > > > > +extents = g_new0(CXLDCExtentRaw, num_extents); > > > > > +while (list) { > > > > > +CXLDCExtent *ent; > > > > > +bool skip_extent = false; > > > > > + > > > > > +offset = list->value->offset; > > > > > +len = list->value->len; > > > > > + > > > > > +extents[i].start_dpa = offset + dcd->dc.regions[rid].base; > > > > > +extents[i].len = len; > > > > > +memset(extents[i].tag, 0, 0x10); > > > > > +extents[i].shared_seq = 0; > > > > > + > > > > > +if (type == DC_EVENT_RELEASE_CAPACITY || > > > > > +type == DC_EVENT_FORCED_RELEASE_CAPACITY) { > > > > > +/* > > > > > + * if the extent is still pending to be added to the > > > > > host, > > > > Odd spacing. > > > > > + * remove it from the pending extent list, so later when > > > > > the add > > > > > + * response for the extent arrives, the device can > > > > > reject the > > > > > + * extent as it is not in the pending list. > > > > > + */ > > > > > +ent = > > > > > cxl_dc_extent_exists(&dcd->dc.extents_pending_to_add, > > > > > +&extents[i]); > > > > > +if (ent) { > > > > > +QTAILQ_REMOVE(&dcd->dc.extents_pending_to_add, ent, > > > > > node); > > > > > +g_free(ent); > > > > > +skip_extent = true; > > > > > +} else if (!cxl_dc_extent_exists(&dcd->dc.extents, > > > > > &extents[i])) { > > > > > +/* If the exact extent is not in the accepted list, > > > > > skip */ > > > > > +skip_extent = true; > > > > > +} > > > > I think we need to reject case of some extents skipped and others not. > > > > That's not supported yet so we need to complain if we get it at least. > > > > Maybe we need > > > > to do two passes so we know this has happened early (or perhaps this is > > > > a later > > > > patch in which case a todo here would help). 
> > > Skip here does not mean the extent is invalid, it just means the extent > > > is still pending to add, so remove them from pending list would be > > > enough to reject the extent, no need to release further. That is based > > > on your feedback on v4. > > Ah. I'd misunderstood. > > Hi Jonathan, > > I think we should not allow to release extents that are still pending to > add. > If we allow it, there is a case that will not work. > Let's see the following case (time order): > 1. Send request to add extent A to host; (A --> pending list) > 2. Send request to release A from the host; (Delete A from pending list, > hoping the following add response for A will fail as there is not a matched > extent in the pending list). Definitely not allow the host to release something it hasn't accepted. Should allow QMP to release such entries though (and same for fmapi when we get there). Any such request from the host should be treated as whatever it says to do if you release an extent that you don't have. > 3. Host send response to the device for the add request, however, for > some reason, it does not accept any of it, so updated list is empty, > spec allows it. Based on the spec, we need to drop the extent at the > head of the event log. Now we have problem. Since extent A is already > dropped from the list, we either cannot drop as the list is empty, which > is not the worst. If we have more extents in the list, we may drop the > one following A, which is for another request. If this happens, all the > following extents will be acked incorrectly as the order has been > shifted. > > Does the above reasoning make sense to you? Absolutely. I got confused here on who was doing release. Host definitely can't release stuff it hasn't successfully accepted. Jonathan > > Fan > > > > > > > > > The loop here is only to collect the extents to sent to t
Re: [PATCH v9 3/3] hw/i386/acpi-build: Add support for SRAT Generic Initiator structures
On Fri, 8 Mar 2024 14:55:25 + wrote: > From: Ankit Agrawal > > The acpi-generic-initiator object is added to allow a host device > to be linked with a NUMA node. QEMU uses it to build the SRAT > Generic Initiator Affinity structure [1]. Add support for i386. > > [1] ACPI Spec 6.3, Section 5.2.16.6 > > Suggested-by: Jonathan Cameron Reviewed-by: Jonathan Cameron > Signed-off-by: Ankit Agrawal > --- > hw/i386/acpi-build.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c > index 1e178341de..b65202fc07 100644 > --- a/hw/i386/acpi-build.c > +++ b/hw/i386/acpi-build.c > @@ -68,6 +68,7 @@ > #include "hw/acpi/utils.h" > #include "hw/acpi/pci.h" > #include "hw/acpi/cxl.h" > +#include "hw/acpi/acpi_generic_initiator.h" > > #include "qom/qom-qobject.h" > #include "hw/i386/amd_iommu.h" > @@ -2056,6 +2057,8 @@ build_srat(GArray *table_data, BIOSLinker *linker, > MachineState *machine) > build_srat_memory(table_data, 0, 0, 0, MEM_AFFINITY_NOFLAGS); > } > > +build_srat_generic_pci_initiator(table_data); > + > /* > * Entry is required for Windows to enable memory hotplug in OS > * and for Linux to enable SWIOTLB when booted with less than
Re: [PULL 53/60] hw/cxl: Standardize all references on CXL r3.1 and minor updates
On Fri, 8 Mar 2024 14:38:55 + Peter Maydell wrote: > On Fri, 8 Mar 2024 at 14:34, Jonathan Cameron > wrote: > > > > On Fri, 8 Mar 2024 13:47:47 + > > Peter Maydell wrote: > > > Is there a way we could write this that would catch this error? > > > I'm thinking maybe something like > > > > > > #define CXL_CREATE_DVSEC(CXL, DEVTYPE, TYPE, DATA) do { \ > > > assert(sizeof(*DATA) == TYPE##_LENGTH); \ > > > cxl_component_create_dvsec(CXL, DEVTYPE, TYPE##_LENGTH, \ > > > TYPE, TYPE##_REVID, (uint8_t*)DATA); \ > > > } while (0) > > > > We should be able to use the length definitions in the original assert. > > I'm not sure why that wasn't done before. I think there were some cases > > where we supported multiple versions and so the length can be shorter > > than the structure definition but that doesn't matter on this one. > > > > So I think minimal fix is u16 of padding and update the assert. > > Can circle back to tidy up the multiple places the value is defined. > > Any mismatch in which the wrong length define is used should be easy > > enough to spot so not sure we need the macro you suggest. > > Well, I mean, you didn't in fact spot the mismatch between > the struct type you were passing and the length value you > were using. That's why I think it would be helpful to > assert() that the size of the struct really does match > the length value you're passing in. At the moment the > code completely throws away the type information the compiler > has by casting the pointer to the struct to a uint8_t*. True, but the original assert at the structure definition would have fired if I'd actually used the define rather than a number :( There is definitely more to do here - but fix wants to be on the light side of all the options. cxl_component_create_dvsec() is an odd function in general as it has more code that varies depending on cxl_dev_type than is shared. So it might just make sense to split it up and provide some more trivial functions for the header writing. 
This is a case of code that has evolved and ended up as a far from ideal solution. We only carry the DVSECHeader in the structures so that the sizes can be read against the spec. It makes the code more complex though so maybe should consider dropping it and making the asserts next to the structure definitions more complex. The asserts in existing function can go (checking it fits etc is done by pcie_add_capability()). If not need something more like //awkward naming is because the second cxl needs to be there to match spec. void cxl_create_pcie_cxl_device_dvsec(CXLComponentState *cxl, CXLDVSECDevice *dvsec) { PCIDevice *pdev = cxl->pdev; uint16_t offset = cxl->dvsec_offset; uint16_t length = sizeof(*dvsec); uint8_t *wmask = pdev->wmask; ///next block can probably be a helper or done in a simpler way. /// A lot of what we have here is just to let us reuse this first call. pcie_add_capability(pdev, PCI_EXT_CAP_ID_DVSEC, 1, offset, length); ///These could be done by writing into dvsec, and memcpy ing more ///but the offset will be even stranger if we do that. pci_set_long(pdev->config + offset + PCIE_DVSEC_HEADER1_OFFSET, (length << 20) | (rev << 16) | CXL_VENDOR_ID); pci_set_word(pdev->config + offset + PCIE_DVSEC_ID_OFFSET, PCIE_CXL_DEVICE_DVSEC); memcpy(pdev->config + offset + sizeof(DVSECHeader), (uint8_t *)dvsec + sizeof(DVSECHeader), length - sizeof(DVSECHeader)); // all the wmask stuff for this structure. } So I'm aiming for more drastic surgery than you were suggesting but not in the fix! Jonathan > > thanks > -- PMM
[PATCH] hw/cxl: Fix missing reserved data in CXL Device DVSEC
The r3.1 specification introduced a new 2 byte field, but to maintain DWORD alignment, an additional 2 reserved bytes were added. Forgot those when updating the structure definition but did include them in the size define, leading to a buffer overrun. Also use the define so that we don't duplicate the value. Fixes: Coverity ID 1534095 buffer overrun Fixes: 8700ee15de ("hw/cxl: Standardize all references on CXL r3.1 and minor updates") Reported-by: Peter Maydell Signed-off-by: Jonathan Cameron --- include/hw/cxl/cxl_pci.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/include/hw/cxl/cxl_pci.h b/include/hw/cxl/cxl_pci.h index 265db6c407..d0855ed78b 100644 --- a/include/hw/cxl/cxl_pci.h +++ b/include/hw/cxl/cxl_pci.h @@ -92,8 +92,9 @@ typedef struct CXLDVSECDevice { uint32_t range2_base_hi; uint32_t range2_base_lo; uint16_t cap3; +uint16_t resv; } QEMU_PACKED CXLDVSECDevice; -QEMU_BUILD_BUG_ON(sizeof(CXLDVSECDevice) != 0x3A); +QEMU_BUILD_BUG_ON(sizeof(CXLDVSECDevice) != PCIE_CXL_DEVICE_DVSEC_LENGTH); /* * CXL r3.1 Section 8.1.5: CXL Extensions DVSEC for Ports -- 2.39.2
Re: [PULL 53/60] hw/cxl: Standardize all references on CXL r3.1 and minor updates
On Fri, 8 Mar 2024 13:47:47 + Peter Maydell wrote: > On Wed, 14 Feb 2024 at 11:16, Michael S. Tsirkin wrote: > > > > From: Jonathan Cameron > > > > Previously not all references mentioned any spec version at all. > > Given r3.1 is the current specification available for evaluation at > > www.computeexpresslink.org update references to refer to that. > > Hopefully this won't become a never ending job. > > > > A few structure definitions have been updated to add new fields. > > Defaults of 0 and read only are valid choices for these new DVSEC > > registers so go with that for now. > > > > There are additional error codes and some of the 'questions' in > > the comments are resolved now. > > > > Update documentation reference to point to the CXL r3.1 specification > > with naming closer to what is on the cover. > > > > For cases where there are structure version numbers, add defines > > so they can be found next to the register definitions. > > Hi; Coverity points out that this change has introduced a > buffer overrun (CID 1534905). In hw/mem/cxl_type3.c:build_dvsecs() > we create a local struct of type CXLDVSecDevice, and then we > pass it to cxl_component_create_dvsec() as the body parameter, > passing it a length argument PCIE_CXL_DEVICE_DVSEC_LENGTH. > > Before this change, both sizeof(CXLDVSecDevice) and > PCIE_CXL_DEVICE_DVSEC_LENGTH were 0x38, so this was fine. > But now... 
> > > diff --git a/include/hw/cxl/cxl_pci.h b/include/hw/cxl/cxl_pci.h > > index ddf01a543b..265db6c407 100644 > > --- a/include/hw/cxl/cxl_pci.h > > +++ b/include/hw/cxl/cxl_pci.h > > @@ -16,9 +16,8 @@ > > #define PCIE_DVSEC_HEADER1_OFFSET 0x4 /* Offset from start of extend cap */ > > #define PCIE_DVSEC_ID_OFFSET 0x8 > > > > -#define PCIE_CXL_DEVICE_DVSEC_LENGTH 0x38 > > -#define PCIE_CXL1_DEVICE_DVSEC_REVID 0 > > -#define PCIE_CXL2_DEVICE_DVSEC_REVID 1 > > +#define PCIE_CXL_DEVICE_DVSEC_LENGTH 0x3C > > +#define PCIE_CXL31_DEVICE_DVSEC_REVID 3 > > > > #define EXTENSIONS_PORT_DVSEC_LENGTH 0x28 > > #define EXTENSIONS_PORT_DVSEC_REVID 0 > > ...PCIE_CXL_DEVICE_DVSEC_LENGTH is 0x3C... Gah. Evil spec change - they defined only one extra u16 worth of data but added padding after it and I missed that in the structure definition. > > > -/* CXL 2.0 - 8.1.3 (ID 0001) */ > > +/* > > + * CXL r3.1 Section 8.1.3: PCIe DVSEC for Devices > > + * DVSEC ID: 0, Revision: 3 > > + */ > > typedef struct CXLDVSECDevice { > > DVSECHeader hdr; > > uint16_t cap; > > @@ -82,10 +91,14 @@ typedef struct CXLDVSECDevice { > > uint32_t range2_size_lo; > > uint32_t range2_base_hi; > > uint32_t range2_base_lo; > > -} CXLDVSECDevice; > > -QEMU_BUILD_BUG_ON(sizeof(CXLDVSECDevice) != 0x38); > > +uint16_t cap3; > > +} QEMU_PACKED CXLDVSECDevice; > > +QEMU_BUILD_BUG_ON(sizeof(CXLDVSECDevice) != 0x3A); (this is the assert I mention below) > > ...and CXLDVSECDevice is only size 0x3A, so we try to read off the > end of the struct. > > What was supposed to happen here? needs an extra uint16_t resv; at the end. 
> > > --- a/hw/mem/cxl_type3.c > > +++ b/hw/mem/cxl_type3.c > > @@ -319,7 +319,7 @@ static void build_dvsecs(CXLType3Dev *ct3d) > > cxl_component_create_dvsec(cxl_cstate, CXL2_TYPE3_DEVICE, > > PCIE_CXL_DEVICE_DVSEC_LENGTH, > > PCIE_CXL_DEVICE_DVSEC, > > - PCIE_CXL2_DEVICE_DVSEC_REVID, dvsec); > > + PCIE_CXL31_DEVICE_DVSEC_REVID, dvsec); > > > > dvsec = (uint8_t *)&(CXLDVSECRegisterLocator){ > > .rsvd = 0, > > Perhaps this call to cxl_component_create_dvsec() was > supposed to have the length argument changed, as seems > to have been done with this other call: > > > @@ -346,9 +346,9 @@ static void build_dvsecs(CXLType3Dev *ct3d) > > .rcvd_mod_ts_data_phase1 = 0xef, /* WTF? */ > > }; > > cxl_component_create_dvsec(cxl_cstate, CXL2_TYPE3_DEVICE, > > - PCIE_FLEXBUS_PORT_DVSEC_LENGTH_2_0, > > + PCIE_CXL3_FLEXBUS_PORT_DVSEC_LENGTH, > > PCIE_FLEXBUS_PORT_DVSEC, > > - PCIE_FLEXBUS_PORT_DVSEC_REVID_2_0, dvsec); > > + PCIE_CXL3_FLEXBUS_PORT_DVSEC_REVID, dvsec); > > } > > static void hdm_decoder_commit(CXLType3Dev *ct3d, int which) > > > and with similar other
Re: Enabling internal errors for VH CXL devices: [was: Re: Questions about CXL RAS injection test in qemu]
On Fri, 8 Mar 2024 10:01:34 +0800 Yuquan Wang wrote: > On 2024-03-07 20:10, jonathan.cameron wrote: > > Hack is fine. Find the relevant device with lspci -tv and then use > > setpci -s 0d:00.0 0x208.l=0 > > to clear all the mask bits for uncorrectable errors. > > Thanks! The suggestions from you and Terry did work! > > BTW, is my understanding below about CXL RAS correct? > > >> 2) The error injected by "pcie_aer_inject_error" is "protocol & link > >> errors" of cxl.io? > >>The error injected by "cxl-inject-uncorrectable-errors" or > >> "cxl-inject-correctable-error" is "protocol & link errors" of cxl.cachemem > >> > > Many thanks > Yuquan > Yes. Note the two CXL errors are actually communicated via AER uncorrectable / correctable internal error combined with data that is available on the EP in the CXL specific registers. Jonathan
Re: [PATCH v3 11/20] util/dsa: Implement DSA task asynchronous submission and wait for completion.
On Thu, 4 Jan 2024 00:44:43 + Hao Xiang wrote: > * Add a DSA task completion callback. > * DSA completion thread will call the task's completion callback > on every task/batch task completion. > * DSA submission path to wait for completion. > * Implement CPU fallback if DSA is not able to complete the task. > > Signed-off-by: Hao Xiang > Signed-off-by: Bryan Zhang Hi, One naming comment inline. You had me confused on how you were handling async processing at the point where this is used. Answer is that I think you aren't! > > +/** > + * @brief Performs buffer zero comparison on a DSA batch task asynchronously. The hardware may be doing it asynchronously but unless that buffer_zero_dsa_wait() call doesn't do what its name suggests, this function is wrapping the async hardware related stuff to make it synchronous. So name it buffer_is_zero_dsa_batch_sync()! Jonathan > + * > + * @param batch_task A pointer to the batch task. > + * @param buf An array of memory buffers. > + * @param count The number of buffers in the array. > + * @param len The buffer length. > + * > + * @return Zero if successful, otherwise non-zero. > + */ > +int > +buffer_is_zero_dsa_batch_async(struct dsa_batch_task *batch_task, > + const void **buf, size_t count, size_t len) > +{ > +if (count <= 0 || count > batch_task->batch_size) { > +return -1; > +} > + > +assert(batch_task != NULL); > +assert(len != 0); > +assert(buf != NULL); > + > +if (count == 1) { > +/* DSA doesn't take batch operation with only 1 task. */ > +buffer_zero_dsa_async(batch_task, buf[0], len); > +} else { > +buffer_zero_dsa_batch_async(batch_task, buf, count, len); > +} > + > +buffer_zero_dsa_wait(batch_task); > +buffer_zero_cpu_fallback(batch_task); > + > +return 0; > +} > + > #endif >
[PATCH v2 2/2] hmat acpi: Fix out of bounds access due to missing use of indirection
With a numa set up such as -numa nodeid=0,cpus=0 \ -numa nodeid=1,memdev=mem \ -numa nodeid=2,cpus=1 and appropriate hmat_lb entries the initiator list is correctly computed and written to HMAT as 0,2 but then the LB data is accessed using the node id (here 2), landing outside the entry_list array. Stash the reverse lookup when writing the initiator list and use it to get the correct array index. Fixes: 4586a2cb83 ("hmat acpi: Build System Locality Latency and Bandwidth Information Structure(s)") Signed-off-by: Jonathan Cameron --- hw/acpi/hmat.c | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/hw/acpi/hmat.c b/hw/acpi/hmat.c index 723ae28d32..b933ae3c06 100644 --- a/hw/acpi/hmat.c +++ b/hw/acpi/hmat.c @@ -78,6 +78,7 @@ static void build_hmat_lb(GArray *table_data, HMAT_LB_Info *hmat_lb, uint32_t *initiator_list) { int i, index; +uint32_t initiator_to_index[MAX_NODES] = {}; HMAT_LB_Data *lb_data; uint16_t *entry_list; uint32_t base; @@ -121,6 +122,8 @@ static void build_hmat_lb(GArray *table_data, HMAT_LB_Info *hmat_lb, /* Initiator Proximity Domain List */ for (i = 0; i < num_initiator; i++) { build_append_int_noprefix(table_data, initiator_list[i], 4); +/* Reverse mapping for array positions */ +initiator_to_index[initiator_list[i]] = i; } /* Target Proximity Domain List */ @@ -132,7 +135,8 @@ static void build_hmat_lb(GArray *table_data, HMAT_LB_Info *hmat_lb, entry_list = g_new0(uint16_t, num_initiator * num_target); for (i = 0; i < hmat_lb->list->len; i++) { lb_data = &g_array_index(hmat_lb->list, HMAT_LB_Data, i); -index = lb_data->initiator * num_target + lb_data->target; +index = initiator_to_index[lb_data->initiator] * num_target + +lb_data->target; entry_list[index] = (uint16_t)(lb_data->data / hmat_lb->base); } -- 2.39.2
[PATCH v2 1/2] hmat acpi: Do not add Memory Proximity Domain Attributes Structure targeting non-existent memory.
If QEMU is started with a proximity node containing CPUs alone, it will provide one of these structures to say memory in this node is directly connected to itself. This description is arguably pointless even if there is memory in the node. If there is no memory present, and hence no SRAT entry, it breaks Linux HMAT parsing and the table is rejected. https://elixir.bootlin.com/linux/v6.7/source/drivers/acpi/numa/hmat.c#L444 Signed-off-by: Jonathan Cameron v2: Fix link in patch description to be stable. --- hw/acpi/hmat.c | 7 +++ 1 file changed, 7 insertions(+) diff --git a/hw/acpi/hmat.c b/hw/acpi/hmat.c index 3042d223c8..723ae28d32 100644 --- a/hw/acpi/hmat.c +++ b/hw/acpi/hmat.c @@ -204,6 +204,13 @@ static void hmat_build_table_structs(GArray *table_data, NumaState *numa_state) build_append_int_noprefix(table_data, 0, 4); /* Reserved */ for (i = 0; i < numa_state->num_nodes; i++) { +/* + * Linux rejects whole HMAT table if a node with no memory + * has one of these structures listing it as a target. + */ +if (!numa_state->nodes[i].node_mem) { +continue; +} flags = 0; if (numa_state->nodes[i].initiator < MAX_NODES) { -- 2.39.2
[PATCH v2 0/2] hw/acpi/hmat: Misc fixes
v2: Fixed a link in patch 1 description so it points somewhere stable. Two unrelated fixes here: 1) Linux really doesn't like it when you claim non-existent memory is directly connected to an initiator (here a CPU). It is a nonsense entry, though I also plan to try and get a relaxation of the condition into the kernel. Maybe we need to care about migration, but I suspect no one cares about this corner case (hence no one noticed the problem!) 2) An access outside of the allocated array when building the latency and bandwidth tables. Given this crashes QEMU for me, I think we are fine with the potential table change. Some notes on 1: - This structure is almost entirely pointless in general - most of the fields were removed in HMAT v2. What remains is meant to convey memory controller location when the memory is in a different Proximity Domain from the memory controller (e.g. a SoC with both HBM and DDR will present 2 NUMA domains but memory controllers will be wherever we describe the CPUs as being - typically with the DDR) Currently QEMU creates these to indicate direct connection between a CPU domain and memory in the same domain. Using the Proximity domain in SRAT conveys the same. This adds no information but it is harmless and avoids migration problems. Notes on 2: - I debated a follow up patch removing the entries in the table for initiators on nodes that don't have any initiators. QEMU won't let you use them as initiators in the LB entries anyway so there is no way to set those entries and they end up reported as 0. OK for Bandwidth as no one is going to use the zero bandwidth channel, but that's a very attractive latency, but that's fine as no one will read the number as there are no initiators? (right?) There is a corner case in ACPI that bites us here. ACPI Proximity domains are only defined in SRAT, but nothing says they need to be fully defined. 
Generic Initiators are optional after all (newish feature) so it was common to use _PXM in DSDT to define where various platform devices were (and PCI but that's still not read by Linux - a story of pain and broken systems for another day). That's fine if they are in a node with CPUs (initiators) but not so much if they happen to be in a memory only node. Today I think the only thing we can make hit this condition in QEMU is a PCI Expander Bridge which doesn't initiate transactions. But things behind it do and there are drivers out there that do buffer placement based on SLIT distances. I'd expect HMAT users to follow soon. It would be nice to think all such systems will use Generic Port Affinity Structures (and I have patches for those to follow shortly) but that's overly optimistic beyond CXL where the kernel will use them and which drove their introduction. Jonathan Cameron (2): hmat acpi: Do not add Memory Proximity Domain Attributes Structure targeting non-existent memory. hmat acpi: Fix out of bounds access due to missing use of indirection hw/acpi/hmat.c | 13 - 1 file changed, 12 insertions(+), 1 deletion(-) -- 2.39.2
[PATCH v3 1/1] target/i386: Enable page walking from MMIO memory
From: Gregory Price

CXL emulation of interleave requires read and write hooks due to the requirement for subpage granularity. The Linux kernel stack now enables using this memory as conventional memory in a separate NUMA node. If a process is deliberately forced to run from that node

  $ numactl --membind=1 ls

the page table walk on i386 fails.

Useful part of backtrace:

  (cpu=cpu@entry=0x56fd9000, fmt=fmt@entry=0x55fe3378 "cpu_io_recompile: could not find TB for pc=%p") at ../../cpu-target.c:359
  (retaddr=0, addr=19595792376, attrs=..., xlat=, cpu=0x56fd9000, out_offset=) at ../../accel/tcg/cputlb.c:1339
  (cpu=0x56fd9000, full=0x7fffee0d96e0, ret_be=ret_be@entry=0, addr=19595792376, size=size@entry=8, mmu_idx=4, type=MMU_DATA_LOAD, ra=0) at ../../accel/tcg/cputlb.c:2030
  (cpu=cpu@entry=0x56fd9000, p=p@entry=0x756fddc0, mmu_idx=, type=type@entry=MMU_DATA_LOAD, memop=, ra=ra@entry=0) at ../../accel/tcg/cputlb.c:2356
  (cpu=cpu@entry=0x56fd9000, addr=addr@entry=19595792376, oi=oi@entry=52, ra=ra@entry=0, access_type=access_type@entry=MMU_DATA_LOAD) at ../../accel/tcg/cputlb.c:2439
  at ../../accel/tcg/ldst_common.c.inc:301
  at ../../target/i386/tcg/sysemu/excp_helper.c:173
  (err=0x756fdf80, out=0x756fdf70, mmu_idx=0, access_type=MMU_INST_FETCH, addr=18446744072116178925, env=0x56fdb7c0) at ../../target/i386/tcg/sysemu/excp_helper.c:578
  (cs=0x56fd9000, addr=18446744072116178925, size=, access_type=MMU_INST_FETCH, mmu_idx=0, probe=, retaddr=0) at ../../target/i386/tcg/sysemu/excp_helper.c:604

Avoid this by plumbing the address all the way down from x86_cpu_tlb_fill(), where it is available as retaddr, to the actual accessors, which provide it to probe_access_full(), which already handles MMIO accesses.

Reviewed-by: Philippe Mathieu-Daudé
Reviewed-by: Richard Henderson
Suggested-by: Peter Maydell
Signed-off-by: Gregory Price
Signed-off-by: Jonathan Cameron
---
v3: No change.
 target/i386/tcg/sysemu/excp_helper.c | 57 ++++++++++++++++++++++++++++++++++---------------------------
 1 file changed, 30 insertions(+), 27 deletions(-)

diff --git a/target/i386/tcg/sysemu/excp_helper.c b/target/i386/tcg/sysemu/excp_helper.c
index 8f7011d966..7a57b7dd10 100644
--- a/target/i386/tcg/sysemu/excp_helper.c
+++ b/target/i386/tcg/sysemu/excp_helper.c
@@ -59,14 +59,14 @@ typedef struct PTETranslate {
     hwaddr gaddr;
 } PTETranslate;
 
-static bool ptw_translate(PTETranslate *inout, hwaddr addr)
+static bool ptw_translate(PTETranslate *inout, hwaddr addr, uint64_t ra)
 {
     CPUTLBEntryFull *full;
     int flags;
 
     inout->gaddr = addr;
     flags = probe_access_full(inout->env, addr, 0, MMU_DATA_STORE,
-                              inout->ptw_idx, true, &inout->haddr, &full, 0);
+                              inout->ptw_idx, true, &inout->haddr, &full, ra);
     if (unlikely(flags & TLB_INVALID_MASK)) {
         TranslateFault *err = inout->err;
 
@@ -82,20 +82,20 @@ static bool ptw_translate(PTETranslate *inout, hwaddr addr)
     return true;
 }
 
-static inline uint32_t ptw_ldl(const PTETranslate *in)
+static inline uint32_t ptw_ldl(const PTETranslate *in, uint64_t ra)
 {
     if (likely(in->haddr)) {
         return ldl_p(in->haddr);
     }
-    return cpu_ldl_mmuidx_ra(in->env, in->gaddr, in->ptw_idx, 0);
+    return cpu_ldl_mmuidx_ra(in->env, in->gaddr, in->ptw_idx, ra);
 }
 
-static inline uint64_t ptw_ldq(const PTETranslate *in)
+static inline uint64_t ptw_ldq(const PTETranslate *in, uint64_t ra)
 {
     if (likely(in->haddr)) {
         return ldq_p(in->haddr);
     }
-    return cpu_ldq_mmuidx_ra(in->env, in->gaddr, in->ptw_idx, 0);
+    return cpu_ldq_mmuidx_ra(in->env, in->gaddr, in->ptw_idx, ra);
 }
 
 /*
@@ -132,7 +132,8 @@ static inline bool ptw_setl(const PTETranslate *in, uint32_t old, uint32_t set)
 }
 
 static bool mmu_translate(CPUX86State *env, const TranslateParams *in,
-                          TranslateResult *out, TranslateFault *err)
+                          TranslateResult *out, TranslateFault *err,
+                          uint64_t ra)
 {
     const target_ulong addr = in->addr;
     const int pg_mode = in->pg_mode;
@@ -164,11 +165,11 @@ static bool mmu_translate(CPUX86State *env, const TranslateParams *in,
              * Page table level 5
              */
             pte_addr = (in->cr3 & ~0xfff) + (((addr >> 48) & 0x1ff) << 3);
-            if (!ptw_translate(&pte_trans, pte_addr)) {
+            if (!ptw_translate(&pte_trans, pte_addr, ra)) {
                 return false;
             }
         restart_5:
-            pte = ptw_ldq(&pte_trans);
+            pte = ptw_ldq(&pte_trans, ra);
             if (!(pte & PG_PRESENT_MASK)) {
                 goto do_fault;
             }
@@ -188,11 +189,11 @@ static bool mmu_translate(CPUX86State *env, const TranslateParams *in,
[PATCH v3 0/1] target/i386: Fix page walking from MMIO memory.
Previously: tcg/i386: Page tables in MMIO memory fixes (CXL)

Richard Henderson picked up patches 1 and 3, which were architecture independent, leaving just this x86-specific patch. No change to the patch. Resending because it's hard to spot individual unapplied patches in a larger series.

Original cover letter (edited).

CXL memory is interleaved at granularities as fine as 64 bytes. To emulate this, each read and write access undergoes address translation similar to that used in physical hardware. This is done using cfmws_ops for a memory region per CXL Fixed Memory Window (the PA address range in the host that is interleaved across host bridges and beyond; the OS programs interleave decoders in the CXL root bridges, switch upstream ports and the corresponding decoders in CXL type 3 devices, which have to know the Host PA to Device PA mappings).

Unfortunately this CXL memory may be used as normal memory, and anything that can end up in RAM can be placed within it. As Linux has become more capable of handling this memory we've started to get quite a few bug reports for the QEMU support. However terrible the performance is, people seem to like running actual software stacks on it :(

This doesn't work for KVM - so for now CXL emulation remains TCG only (unless you are very careful in how it is used!) I plan to add some safety guards at a later date to make it slightly harder for people to shoot themselves in the foot + a more limited set of CXL functionality that is safe (no interleaving!)

Previously we had some issues with TCG reading instructions from CXL memory, but that is now all working. This time the issues are around the page tables being in the CXL memory + DMA buffers being placed in it.

The test setup I've been using is simple 2-way interleave via 2 root ports below a single CXL root complex.
After configuration in Linux these are mapped to their own NUMA node, and

  numactl --membind=1 ls

followed by powering down the machine is sufficient to hit all the bugs addressed in this series.

Thanks to Gregory, Peter and Alex for their help figuring this lot out.

Whilst the thread started back at:
https://lore.kernel.org/all/CAAg4PaqsGZvkDk_=ph+oz-yeeuvcvsrumncagegrkuxe_yo...@mail.gmail.com/
the QEMU part is from:
https://lore.kernel.org/all/20240201130438.1...@huawei.com/

Gregory Price (1):
  target/i386: Enable page walking from MMIO memory

 target/i386/tcg/sysemu/excp_helper.c | 57 ++++++++++++++++++++++++++++++++++---------------------------
 1 file changed, 30 insertions(+), 27 deletions(-)

-- 
2.39.2
[PATCH v2 4/4] physmem: Fix wrong address in large address_space_read/write_cached_slow()
If the access is bigger than the MemoryRegion supports, flatview_read/write_continue() will attempt to update the MemoryRegion, but the address passed to flatview_translate() is relative to the cache, not to the FlatView.

On arm/virt with interleaved CXL memory emulation and virtio-blk-pci this led to the first part of the descriptor being read from the CXL memory and the second part from PA 0x8 which happens to be a blank region of a flash chip and all ffs on this particular configuration. Note this test requires the out-of-tree ARM support for CXL, but the problem is more general.

Avoid this by adding new address_space_read_continue_cached() and address_space_write_continue_cached() which share all the logic with the flatview versions except for the MemoryRegion lookup, which is unnecessary as the MemoryRegionCache only covers one MemoryRegion.

Signed-off-by: Jonathan Cameron
---
v2: Review from Peter Xu
 - Drop additional lookups of the MemoryRegion via address_space_translate_cached() as it will always return the same answer.
 - Drop various parameters that are then unused.
 - Rename addr1 to mr_addr.
 - Drop a fuzz_dma_read_cb(). Could put this back, but it means carrying the address into the inner call, and the only in-tree fuzzer checks if it is normal RAM and if not does nothing anyway. We don't hit this path for normal RAM.
---
 system/physmem.c | 63 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 57 insertions(+), 6 deletions(-)

diff --git a/system/physmem.c b/system/physmem.c
index 1264eab24b..701bea27dd 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -3381,6 +3381,59 @@ static inline MemoryRegion *address_space_translate_cached(
     return section.mr;
 }
 
+/* Called within RCU critical section. */
+static MemTxResult address_space_write_continue_cached(MemTxAttrs attrs,
+                                                       const void *ptr,
+                                                       hwaddr len,
+                                                       hwaddr mr_addr,
+                                                       hwaddr l,
+                                                       MemoryRegion *mr)
+{
+    MemTxResult result = MEMTX_OK;
+    const uint8_t *buf = ptr;
+
+    for (;;) {
+        result |= flatview_write_continue_step(attrs, buf, len, mr_addr, &l,
+                                               mr);
+
+        len -= l;
+        buf += l;
+        mr_addr += l;
+
+        if (!len) {
+            break;
+        }
+
+        l = len;
+    }
+
+    return result;
+}
+
+/* Called within RCU critical section. */
+static MemTxResult address_space_read_continue_cached(MemTxAttrs attrs,
+                                                      void *ptr, hwaddr len,
+                                                      hwaddr mr_addr, hwaddr l,
+                                                      MemoryRegion *mr)
+{
+    MemTxResult result = MEMTX_OK;
+    uint8_t *buf = ptr;
+
+    for (;;) {
+        result |= flatview_read_continue_step(attrs, buf, len, mr_addr, &l, mr);
+        len -= l;
+        buf += l;
+        mr_addr += l;
+
+        if (!len) {
+            break;
+        }
+        l = len;
+    }
+
+    return result;
+}
+
 /* Called from RCU critical section. address_space_read_cached uses this
  * out of line function when the target is an MMIO or IOMMU region.
  */
@@ -3394,9 +3447,8 @@ address_space_read_cached_slow(MemoryRegionCache *cache, hwaddr addr,
     l = len;
     mr = address_space_translate_cached(cache, addr, &mr_addr, &l, false,
                                         MEMTXATTRS_UNSPECIFIED);
-    return flatview_read_continue(cache->fv,
-                                  addr, MEMTXATTRS_UNSPECIFIED, buf, len,
-                                  mr_addr, l, mr);
+    return address_space_read_continue_cached(MEMTXATTRS_UNSPECIFIED,
+                                              buf, len, mr_addr, l, mr);
 }
 
 /* Called from RCU critical section. address_space_write_cached uses this
@@ -3412,9 +3464,8 @@ address_space_write_cached_slow(MemoryRegionCache *cache, hwaddr addr,
     l = len;
     mr = address_space_translate_cached(cache, addr, &mr_addr, &l, true,
                                         MEMTXATTRS_UNSPECIFIED);
-    return flatview_write_continue(cache->fv,
-                                   addr, MEMTXATTRS_UNSPECIFIED, buf, len,
-                                   mr_addr, l, mr);
+    return address_space_write_continue_cached(MEMTXATTRS_UNSPECIFIED,
+                                               buf, len, mr_addr, l, mr);
 }
 
 #define ARG1_DECL                MemoryRegionCache *cache
-- 
2.39.2
[PATCH v2 3/4] physmem: Factor out body of flatview_read/write_continue() loop
This code will be reused for the address_space_cached accessors shortly.

Also reduce the scope of the result variable now we aren't directly calling this in the loop.

Signed-off-by: Jonathan Cameron
---
v2: Thanks to Peter Xu
 - Fix alignment of code.
 - Drop unused addr parameter.
 - Carry through new mr_addr parameter name.
 - RB not picked up as not sure what Peter will think wrt the resulting parameter ordering.
---
 system/physmem.c | 169 ++++++++++++++++++++++++++++++++-----------------------
 1 file changed, 99 insertions(+), 70 deletions(-)

diff --git a/system/physmem.c b/system/physmem.c
index a64a96a3e5..1264eab24b 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -2681,6 +2681,56 @@ static bool flatview_access_allowed(MemoryRegion *mr, MemTxAttrs attrs,
     return false;
 }
 
+static MemTxResult flatview_write_continue_step(MemTxAttrs attrs,
+                                                const uint8_t *buf,
+                                                hwaddr len, hwaddr mr_addr,
+                                                hwaddr *l, MemoryRegion *mr)
+{
+    if (!flatview_access_allowed(mr, attrs, mr_addr, *l)) {
+        return MEMTX_ACCESS_ERROR;
+    }
+
+    if (!memory_access_is_direct(mr, true)) {
+        uint64_t val;
+        MemTxResult result;
+        bool release_lock = prepare_mmio_access(mr);
+
+        *l = memory_access_size(mr, *l, mr_addr);
+        /*
+         * XXX: could force current_cpu to NULL to avoid
+         * potential bugs
+         */
+
+        /*
+         * Assure Coverity (and ourselves) that we are not going to OVERRUN
+         * the buffer by following ldn_he_p().
+         */
+#ifdef QEMU_STATIC_ANALYSIS
+        assert((*l == 1 && len >= 1) ||
+               (*l == 2 && len >= 2) ||
+               (*l == 4 && len >= 4) ||
+               (*l == 8 && len >= 8));
+#endif
+        val = ldn_he_p(buf, *l);
+        result = memory_region_dispatch_write(mr, mr_addr, val,
+                                              size_memop(*l), attrs);
+        if (release_lock) {
+            bql_unlock();
+        }
+
+        return result;
+    } else {
+        /* RAM case */
+        uint8_t *ram_ptr = qemu_ram_ptr_length(mr->ram_block, mr_addr, l,
+                                               false);
+
+        memmove(ram_ptr, buf, *l);
+        invalidate_and_set_dirty(mr, mr_addr, *l);
+
+        return MEMTX_OK;
+    }
+}
+
 /* Called within RCU critical section. */
 static MemTxResult flatview_write_continue(FlatView *fv, hwaddr addr,
                                            MemTxAttrs attrs,
@@ -2692,44 +2742,8 @@ static MemTxResult flatview_write_continue(FlatView *fv, hwaddr addr,
     const uint8_t *buf = ptr;
 
     for (;;) {
-        if (!flatview_access_allowed(mr, attrs, mr_addr, l)) {
-            result |= MEMTX_ACCESS_ERROR;
-            /* Keep going. */
-        } else if (!memory_access_is_direct(mr, true)) {
-            uint64_t val;
-            bool release_lock = prepare_mmio_access(mr);
-
-            l = memory_access_size(mr, l, mr_addr);
-            /* XXX: could force current_cpu to NULL to avoid
-               potential bugs */
-
-            /*
-             * Assure Coverity (and ourselves) that we are not going to OVERRUN
-             * the buffer by following ldn_he_p().
-             */
-#ifdef QEMU_STATIC_ANALYSIS
-            assert((l == 1 && len >= 1) ||
-                   (l == 2 && len >= 2) ||
-                   (l == 4 && len >= 4) ||
-                   (l == 8 && len >= 8));
-#endif
-            val = ldn_he_p(buf, l);
-            result |= memory_region_dispatch_write(mr, mr_addr, val,
-                                                   size_memop(l), attrs);
-            if (release_lock) {
-                bql_unlock();
-            }
-
-
-        } else {
-            /* RAM case */
-
-            uint8_t *ram_ptr = qemu_ram_ptr_length(mr->ram_block, mr_addr, &l,
-                                                   false);
-
-            memmove(ram_ptr, buf, l);
-            invalidate_and_set_dirty(mr, mr_addr, l);
-        }
+        result |= flatview_write_continue_step(attrs, buf, len, mr_addr, &l,
+                                               mr);
 
         len -= l;
         buf += l;
@@ -2763,6 +2777,52 @@ static MemTxResult flatview_write(FlatView *fv, hwaddr addr, MemTxAttrs attrs,
                                    mr_addr, l, mr);
 }
 
+static MemTxResult flatview_read_continue_step(MemTxAttrs attrs, uint8_t *buf,
+                                               hwaddr len, hwaddr mr_addr,
+                                               hwaddr *l,
+                                               MemoryRegion *mr)
+{
+    if (!flatview_access_allowed(mr, attrs, mr_addr, *l)) {
+        return MEMTX_ACCESS_ERROR;
+    }
+
+    if (!memory_access_is_direct(mr, false)) {
[PATCH v2 2/4] physmem: Reduce local variable scope in flatview_read/write_continue()
Precursor to factoring out the inner loops for reuse.

Reviewed-by: Peter Xu
Signed-off-by: Jonathan Cameron
---
v2: Picked up tag from Peter.

 system/physmem.c | 40 ++++++++++++++++++++--------------------
 1 file changed, 20 insertions(+), 20 deletions(-)

diff --git a/system/physmem.c b/system/physmem.c
index 2704b780f6..a64a96a3e5 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -2688,10 +2688,7 @@ static MemTxResult flatview_write_continue(FlatView *fv, hwaddr addr,
                                            hwaddr len, hwaddr mr_addr,
                                            hwaddr l, MemoryRegion *mr)
 {
-    uint8_t *ram_ptr;
-    uint64_t val;
     MemTxResult result = MEMTX_OK;
-    bool release_lock = false;
     const uint8_t *buf = ptr;
 
     for (;;) {
@@ -2699,7 +2696,9 @@ static MemTxResult flatview_write_continue(FlatView *fv, hwaddr addr,
             result |= MEMTX_ACCESS_ERROR;
             /* Keep going. */
         } else if (!memory_access_is_direct(mr, true)) {
-            release_lock |= prepare_mmio_access(mr);
+            uint64_t val;
+            bool release_lock = prepare_mmio_access(mr);
+
             l = memory_access_size(mr, l, mr_addr);
             /* XXX: could force current_cpu to NULL to avoid
                potential bugs */
@@ -2717,18 +2716,21 @@ static MemTxResult flatview_write_continue(FlatView *fv, hwaddr addr,
             val = ldn_he_p(buf, l);
             result |= memory_region_dispatch_write(mr, mr_addr, val,
                                                    size_memop(l), attrs);
+            if (release_lock) {
+                bql_unlock();
+            }
+
+
         } else {
             /* RAM case */
-            ram_ptr = qemu_ram_ptr_length(mr->ram_block, mr_addr, &l, false);
+
+            uint8_t *ram_ptr = qemu_ram_ptr_length(mr->ram_block, mr_addr, &l,
+                                                   false);
+
             memmove(ram_ptr, buf, l);
             invalidate_and_set_dirty(mr, mr_addr, l);
         }
 
-        if (release_lock) {
-            bql_unlock();
-            release_lock = false;
-        }
-
         len -= l;
         buf += l;
         addr += l;
@@ -2767,10 +2769,7 @@ MemTxResult flatview_read_continue(FlatView *fv, hwaddr addr,
                                    hwaddr len, hwaddr mr_addr,
                                    hwaddr l, MemoryRegion *mr)
 {
-    uint8_t *ram_ptr;
-    uint64_t val;
     MemTxResult result = MEMTX_OK;
-    bool release_lock = false;
     uint8_t *buf = ptr;
 
     fuzz_dma_read_cb(addr, len, mr);
@@ -2780,7 +2779,9 @@ MemTxResult flatview_read_continue(FlatView *fv, hwaddr addr,
             /* Keep going. */
         } else if (!memory_access_is_direct(mr, false)) {
             /* I/O case */
-            release_lock |= prepare_mmio_access(mr);
+            uint64_t val;
+            bool release_lock = prepare_mmio_access(mr);
+
             l = memory_access_size(mr, l, mr_addr);
             result |= memory_region_dispatch_read(mr, mr_addr, &val,
                                                   size_memop(l), attrs);
@@ -2796,17 +2797,16 @@ MemTxResult flatview_read_continue(FlatView *fv, hwaddr addr,
                    (l == 8 && len >= 8));
 #endif
             stn_he_p(buf, l, val);
+            if (release_lock) {
+                bql_unlock();
+            }
         } else {
             /* RAM case */
-            ram_ptr = qemu_ram_ptr_length(mr->ram_block, mr_addr, &l, false);
+            uint8_t *ram_ptr = qemu_ram_ptr_length(mr->ram_block, mr_addr, &l,
+                                                   false);
             memcpy(buf, ram_ptr, l);
         }
 
-        if (release_lock) {
-            bql_unlock();
-            release_lock = false;
-        }
-
         len -= l;
         buf += l;
         addr += l;
-- 
2.39.2
[PATCH v2 1/4] physmem: Rename addr1 to more informative mr_addr in flatview_read/write() and similar
The calls to flatview_read/write[_continue]() have parameters addr and addr1, but the names give no indication of what they are addresses of. Rename addr1 to mr_addr to reflect that it is the translated address offset within the MemoryRegion returned by flatview_translate(). Similarly rename the parameter in address_space_read/write_cached_slow().

Suggested-by: Peter Xu
Signed-off-by: Jonathan Cameron
---
v2: New patch.
 - I have kept the renames to only the code I'm touching later in this series, but they could be applied much more widely.
---
 system/physmem.c | 50 +++++++++++++++++++++++++-------------------------
 1 file changed, 25 insertions(+), 25 deletions(-)

diff --git a/system/physmem.c b/system/physmem.c
index 05997a7ca7..2704b780f6 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -2685,7 +2685,7 @@ static bool flatview_access_allowed(MemoryRegion *mr, MemTxAttrs attrs,
 static MemTxResult flatview_write_continue(FlatView *fv, hwaddr addr,
                                            MemTxAttrs attrs,
                                            const void *ptr,
-                                           hwaddr len, hwaddr addr1,
+                                           hwaddr len, hwaddr mr_addr,
                                            hwaddr l, MemoryRegion *mr)
 {
     uint8_t *ram_ptr;
@@ -2695,12 +2695,12 @@ static MemTxResult flatview_write_continue(FlatView *fv, hwaddr addr,
     const uint8_t *buf = ptr;
 
     for (;;) {
-        if (!flatview_access_allowed(mr, attrs, addr1, l)) {
+        if (!flatview_access_allowed(mr, attrs, mr_addr, l)) {
             result |= MEMTX_ACCESS_ERROR;
             /* Keep going. */
         } else if (!memory_access_is_direct(mr, true)) {
             release_lock |= prepare_mmio_access(mr);
-            l = memory_access_size(mr, l, addr1);
+            l = memory_access_size(mr, l, mr_addr);
             /* XXX: could force current_cpu to NULL to avoid
                potential bugs */
@@ -2715,13 +2715,13 @@ static MemTxResult flatview_write_continue(FlatView *fv, hwaddr addr,
                    (l == 8 && len >= 8));
 #endif
             val = ldn_he_p(buf, l);
-            result |= memory_region_dispatch_write(mr, addr1, val,
+            result |= memory_region_dispatch_write(mr, mr_addr, val,
                                                    size_memop(l), attrs);
         } else {
             /* RAM case */
-            ram_ptr = qemu_ram_ptr_length(mr->ram_block, addr1, &l, false);
+            ram_ptr = qemu_ram_ptr_length(mr->ram_block, mr_addr, &l, false);
             memmove(ram_ptr, buf, l);
-            invalidate_and_set_dirty(mr, addr1, l);
+            invalidate_and_set_dirty(mr, mr_addr, l);
         }
 
         if (release_lock) {
@@ -2738,7 +2738,7 @@ static MemTxResult flatview_write_continue(FlatView *fv, hwaddr addr,
         }
 
         l = len;
-        mr = flatview_translate(fv, addr, &addr1, &l, true, attrs);
+        mr = flatview_translate(fv, addr, &mr_addr, &l, true, attrs);
     }
 
     return result;
@@ -2749,22 +2749,22 @@ static MemTxResult flatview_write(FlatView *fv, hwaddr addr, MemTxAttrs attrs,
                                   const void *buf, hwaddr len)
 {
     hwaddr l;
-    hwaddr addr1;
+    hwaddr mr_addr;
     MemoryRegion *mr;
 
     l = len;
-    mr = flatview_translate(fv, addr, &addr1, &l, true, attrs);
+    mr = flatview_translate(fv, addr, &mr_addr, &l, true, attrs);
     if (!flatview_access_allowed(mr, attrs, addr, len)) {
         return MEMTX_ACCESS_ERROR;
     }
     return flatview_write_continue(fv, addr, attrs, buf, len,
-                                   addr1, l, mr);
+                                   mr_addr, l, mr);
 }
 
 /* Called within RCU critical section. */
 MemTxResult flatview_read_continue(FlatView *fv, hwaddr addr,
                                    MemTxAttrs attrs, void *ptr,
-                                   hwaddr len, hwaddr addr1, hwaddr l,
+                                   hwaddr len, hwaddr mr_addr, hwaddr l,
                                    MemoryRegion *mr)
 {
     uint8_t *ram_ptr;
@@ -2775,14 +2775,14 @@ MemTxResult flatview_read_continue(FlatView *fv, hwaddr addr,
     fuzz_dma_read_cb(addr, len, mr);
     for (;;) {
-        if (!flatview_access_allowed(mr, attrs, addr1, l)) {
+        if (!flatview_access_allowed(mr, attrs, mr_addr, l)) {
             result |= MEMTX_ACCESS_ERROR;
             /* Keep going. */
         } else if (!memory_access_is_direct(mr, false)) {
             /* I/O case */
             release_lock |= prepare_mmio_access(mr);
-            l = memory_access_size(mr, l, addr1);
-            result |= memory_region_dispatch_read(mr, addr1, &val,
+            l = memory_access_size(mr, l, mr_addr);
+            result |= memory_region_dispatch_read(mr, mr_addr, &val,
                                                   size_memop(l), attrs);
             /*
@@ -2798,7 +2798,7 @@ MemTxResult flatview_read_cont
[PATCH v2 0/4] physmem: Fix MemoryRegion for second access to cached MMIO Address Space
v2: (Thanks to Peter Xu for reviewing!)
- New patch 1 to rename addr1 to mr_addr in the interests of meaningful naming.
- Take advantage of the fact that a cached address space only allows for a single MR to simplify the new code.
- Various cleanups of indentation etc.
- Cover letter and some patch descriptions updated to reflect changes.
- Changes all called out in specific patches.

Issue seen testing virtio-blk-pci with CXL emulated interleave memory. Tests were done on arm64, but the issue isn't architecture specific. Note that some additional fixes are needed to TCG to be able to run far enough to hit this on arm64 or x86. Most of these are now upstream with the exception of:

target/i386: Enable page walking from MMIO memory
https://lore.kernel.org/qemu-devel/20240219173153.12114-3-jonathan.came...@huawei.com/

The address_space_read_cached_slow() and address_space_write_cached_slow() functions query the MemoryRegion for the cached address space correctly using address_space_translate_cached(), but then call into flatview_read_continue() / flatview_write_continue(). If the access is to an MMIO MemoryRegion and is bigger than the MemoryRegion supports, the loop will query the MemoryRegion for the next access to use. That query uses flatview_translate(), but the address passed is suitable for the cache, not the flatview. On my test setup that meant the second 8 bytes and onwards of the virtio descriptor were read from flash memory at the beginning of the system address map, not the CXL emulated memory where the descriptor was found. The result happened to be all fs, so easy to spot.

Change these calls to assume that the MemoryRegion does not change as multiple accesses are performed to the MemoryRegionCache.

The first patch renames the addr1 parameter to the hopefully more informative mr_addr. To avoid duplicating most of the code, the next 2 patches factor out the common parts of flatview_read_continue() and flatview_write_continue() so they can be reused.
The write path has not been tested, but it is so similar to the read path that I've included it here.

Jonathan Cameron (4):
  physmem: Rename addr1 to more informative mr_addr in flatview_read/write() and similar
  physmem: Reduce local variable scope in flatview_read/write_continue()
  physmem: Factor out body of flatview_read/write_continue() loop
  physmem: Fix wrong address in large address_space_read/write_cached_slow()

 system/physmem.c | 260 ++++++++++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 170 insertions(+), 90 deletions(-)

-- 
2.39.2
Re: [PATCH 3/3] physmem: Fix wrong MR in large address_space_read/write_cached_slow()
On Fri, 1 Mar 2024 13:44:01 +0800 Peter Xu wrote: > On Thu, Feb 15, 2024 at 02:28:17PM +0000, Jonathan Cameron wrote: > > Can we rename the subject? > > physmem: Fix wrong MR in large address_space_read/write_cached_slow() > > IMHO "wrong MR" is misleading, as the MR was wrong only because the address > passed over is wrong at the first place. Perhaps s/MR/addr/? > > > If the access is bigger than the MemoryRegion supports, > > flatview_read/write_continue() will attempt to update the Memory Region. > > but the address passed to flatview_translate() is relative to the cache, not > > to the FlatView. > > > > On arm/virt with interleaved CXL memory emulation and virtio-blk-pci this > > lead to the first part of descriptor being read from the CXL memory and the > > second part from PA 0x8 which happens to be a blank region > > of a flash chip and all ffs on this particular configuration. > > Note this test requires the out of tree ARM support for CXL, but > > the problem is more general. > > > > Avoid this by adding new address_space_read_continue_cached() > > and address_space_write_continue_cached() which share all the logic > > with the flatview versions except for the MemoryRegion lookup. > > > > Signed-off-by: Jonathan Cameron > > --- > > system/physmem.c | 78 > > 1 file changed, 72 insertions(+), 6 deletions(-) > > > > [...] > > > +/* Called within RCU critical section. */ > > +static MemTxResult address_space_read_continue_cached(MemoryRegionCache > > *cache, > > + hwaddr addr, > > + MemTxAttrs attrs, > > + void *ptr, hwaddr > > len, > > + hwaddr addr1, hwaddr > > l, > > + MemoryRegion *mr) > > It looks like "addr" (of flatview AS) is not needed for a cached RW (see > below), because we should have a stable (cached) MR to operate anyway? > > How about we also use "mr_addr" as the single addr of the MR, then drop > addr1? Agreed, but also need to drop the fuzz_dma_read_cb(). 
However given the first thing that is checked by the only in tree fuzzing code is whether we are dealing with RAM, I think that's fine. > > > +{ > > +MemTxResult result = MEMTX_OK; > > +uint8_t *buf = ptr; > > + > > +fuzz_dma_read_cb(addr, len, mr); > > +for (;;) { > > + > > Remove empty line? > > > +result |= flatview_read_continue_step(addr, attrs, buf, len, addr1, > > + , mr); > > +len -= l; > > +buf += l; > > +addr += l; > > + > > +if (!len) { > > +break; > > +} > > +l = len; > > + > > +mr = address_space_translate_cached(cache, addr, , , false, > > +attrs); > > Here IIUC the mr will always be the same as before? If so, maybe "mr_addr > += l" should be enough? > I had the same thought originally but couldn't convince myself that there was no route to end up with a different MR here. I don't yet have a good enough grip on how this all fits together so I particularly appreciate your help. With hindsight I should have called this out as a question in this patch set. Can drop passing in cache as well given it is no longer used within this function. Thanks, Jonathan > (similar comment applies to the writer side too) > > > +} > > + > > +return result; > > +} > > + > > /* Called from RCU critical section. address_space_read_cached uses this > > * out of line function when the target is an MMIO or IOMMU region. > > */ > > @@ -3390,9 +3456,9 @@ address_space_read_cached_slow(MemoryRegionCache > > *cache, hwaddr addr, > > l = len; > > mr = address_space_translate_cached(cache, addr, , , false, > > MEMTXATTRS_UNSPECIFIED); > > -return flatview_read_continue(cache->fv, > > - addr, MEMTXATTRS_UNSPECIFIED, buf, len, > > - addr1, l, mr); > > +return address_space_read_continue_cached(cache, addr, > > + MEMTXATTRS_UNSPECIFIED, buf, > > len, > > +
Re: [PATCH 2/3] physmem: Factor out body of flatview_read/write_continue() loop
On Fri, 1 Mar 2024 13:35:26 +0800 Peter Xu wrote: > On Fri, Mar 01, 2024 at 01:29:04PM +0800, Peter Xu wrote: > > On Thu, Feb 15, 2024 at 02:28:16PM +, Jonathan Cameron wrote: > > > This code will be reused for the address_space_cached accessors > > > shortly. > > > > > > Also reduce scope of result variable now we aren't directly > > > calling this in the loop. > > > > > > Signed-off-by: Jonathan Cameron > > > --- > > > system/physmem.c | 165 --- > > > 1 file changed, 98 insertions(+), 67 deletions(-) > > > > > > diff --git a/system/physmem.c b/system/physmem.c > > > index 39b5ac751e..74f92bb3b8 100644 > > > --- a/system/physmem.c > > > +++ b/system/physmem.c > > > @@ -2677,6 +2677,54 @@ static bool flatview_access_allowed(MemoryRegion > > > *mr, MemTxAttrs attrs, > > > return false; > > > } > > > > > > +static MemTxResult flatview_write_continue_step(hwaddr addr, > > One more thing: this addr var is not used, afaict. We could drop addr1 > below and use this to represent the MR offset. I'm tempted to keep the addr1 where it is in the parameter list just so that it matches up with the caller location but a rename makes a lot of sense. > > I'm wondering whether we should start to use some better namings already > for memory API functions to show obviously what AS it is describing. From > that POV, perhaps rename it to "mr_addr"? I'll add a precursor patch renaming these for the functions this series touches. We can tidy up other cases later. I'll put a note in that patch below the cut to observe that the rename makes sense more widely. I've not picked up the RB given because of the parameter ordering question. 
Thanks, Jonathan > > > > +MemTxAttrs attrs, > > > +const uint8_t *buf, > > > +hwaddr len, hwaddr addr1, > > > +hwaddr *l, MemoryRegion > > > *mr) > > > +{ > > > +if (!flatview_access_allowed(mr, attrs, addr1, *l)) { > > > +return MEMTX_ACCESS_ERROR; > > > +} > > > + > > > +if (!memory_access_is_direct(mr, true)) { > > > +uint64_t val; > > > +MemTxResult result; > > > +bool release_lock = prepare_mmio_access(mr); > > > + > > > +*l = memory_access_size(mr, *l, addr1); > > > +/* XXX: could force current_cpu to NULL to avoid > > > + potential bugs */ > > > + > > > +/* > > > + * Assure Coverity (and ourselves) that we are not going to > > > OVERRUN > > > + * the buffer by following ldn_he_p(). > > > + */ > > > +#ifdef QEMU_STATIC_ANALYSIS > > > +assert((*l == 1 && len >= 1) || > > > + (*l == 2 && len >= 2) || > > > + (*l == 4 && len >= 4) || > > > + (*l == 8 && len >= 8)); > > > +#endif > > > +val = ldn_he_p(buf, *l); > > > +result = memory_region_dispatch_write(mr, addr1, val, > > > + size_memop(*l), attrs); > > > +if (release_lock) { > > > +bql_unlock(); > > > +} > > > + > > > +return result; > > > +} else { > > > +/* RAM case */ > > > +uint8_t *ram_ptr = qemu_ram_ptr_length(mr->ram_block, addr1, l, > > > false); > > > + > > > +memmove(ram_ptr, buf, *l); > > > +invalidate_and_set_dirty(mr, addr1, *l); > > > + > > > +return MEMTX_OK; > > > +} > > > +} > > > + > > > /* Called within RCU critical section. */ > > > static MemTxResult flatview_write_continue(FlatView *fv, hwaddr addr, > > > MemTxAttrs attrs, > > > @@ -2688,42 +2736,9 @@ static MemTxResult > > > flatview_write_continue(FlatView *fv, hwaddr addr, > > > const uint8_t *buf = ptr; > > > > > > for (;;) { > > > -if (!flatview_access_allowed(mr, attrs, addr1, l)) { > > > -result |= MEMTX_ACCESS_ERROR; > > > -/* Keep going. */ > > > -} else if (!memory_access_is_direct
Re: [PATCH v5 09/13] hw/cxl/events: Add qmp interfaces to add/release dynamic capacity extents
> > > +         * remove it from the pending extent list, so later when the add
> > > +         * response for the extent arrives, the device can reject the
> > > +         * extent as it is not in the pending list.
> > > +         */
> > > +        ent = cxl_dc_extent_exists(&dcd->dc.extents_pending_to_add,
> > > +                                   &extents[i]);
> > > +        if (ent) {
> > > +            QTAILQ_REMOVE(&dcd->dc.extents_pending_to_add, ent, node);
> > > +            g_free(ent);
> > > +            skip_extent = true;
> > > +        } else if (!cxl_dc_extent_exists(&dcd->dc.extents, &extents[i])) {
> > > +            /* If the exact extent is not in the accepted list, skip */
> > > +            skip_extent = true;
> > > +        }
> >
> > I think we need to reject the case of some extents skipped and others not.
> > That's not supported yet so we need to complain if we get it at least.
> > Maybe we need to do two passes so we know this has happened early (or
> > perhaps this is a later patch in which case a todo here would help).
>
> If the second skip_extent case, I will reject earlier instead of
> skipping.

That was me misunderstanding the flow. I think this is fine as you have it already.

Jonathan
Re: [PATCH v5 09/13] hw/cxl/events: Add qmp interfaces to add/release dynamic capacity extents
...
> > > +    list = records;
> > > +    extents = g_new0(CXLDCExtentRaw, num_extents);
> > > +    while (list) {
> > > +        CXLDCExtent *ent;
> > > +        bool skip_extent = false;
> > > +
> > > +        offset = list->value->offset;
> > > +        len = list->value->len;
> > > +
> > > +        extents[i].start_dpa = offset + dcd->dc.regions[rid].base;
> > > +        extents[i].len = len;
> > > +        memset(extents[i].tag, 0, 0x10);
> > > +        extents[i].shared_seq = 0;
> > > +
> > > +        if (type == DC_EVENT_RELEASE_CAPACITY ||
> > > +            type == DC_EVENT_FORCED_RELEASE_CAPACITY) {
> > > +            /*
> > > +             * if the extent is still pending to be added to the host,
> >
> > Odd spacing.
> >
> > > +             * remove it from the pending extent list, so later when the add
> > > +             * response for the extent arrives, the device can reject the
> > > +             * extent as it is not in the pending list.
> > > +             */
> > > +            ent = cxl_dc_extent_exists(&dcd->dc.extents_pending_to_add,
> > > +                                       &extents[i]);
> > > +            if (ent) {
> > > +                QTAILQ_REMOVE(&dcd->dc.extents_pending_to_add, ent, node);
> > > +                g_free(ent);
> > > +                skip_extent = true;
> > > +            } else if (!cxl_dc_extent_exists(&dcd->dc.extents, &extents[i])) {
> > > +                /* If the exact extent is not in the accepted list, skip */
> > > +                skip_extent = true;
> > > +            }
> >
> > I think we need to reject the case of some extents skipped and others not.
> > That's not supported yet, so we need to complain if we get it at least.
> > Maybe we need to do two passes so we know this has happened early (or
> > perhaps this is a later patch, in which case a todo here would help).
>
> Skip here does not mean the extent is invalid; it just means the extent
> is still pending to add, so removing them from the pending list is
> enough to reject the extent, no need to release further. That is based
> on your feedback on v4.

Ah. I'd misunderstood.

> The loop here is only to collect the extents to send to the event log.
> But as you said, we need one pass before updating the pending list.
> Actually if we do not allow the above case, where an extent to release is
> still in the pending-to-add list, we can just return here with an error, no
> extra dry run needed.
>
> What do you think?

I think we need a way to back out extents from the pending-to-add list so we
can create the race where they are offered to the OS and it takes forever to
accept, and by the time it does we've removed them.

> > > > > > +
> > > +
> > > +        /* No duplicate or overlapped extents are allowed */
> > > +        if (test_any_bits_set(blk_bitmap, offset / block_size,
> > > +                              len / block_size)) {
> > > +            error_setg(errp, "duplicate or overlapped extents are detected");
> > > +            return;
> > > +        }
> > > +        bitmap_set(blk_bitmap, offset / block_size, len / block_size);
> > > +
> > > +        list = list->next;
> > > +        if (!skip_extent) {
> > > +            i++;
> >
> > Problem is if we skip one in the middle the records will be wrong below.
>
> Why? Only extents that passed the check will be stored in the variable
> extents and processed further, and i updated.
> For skipped ones, since i is not updated, they will be
> overwritten by following valid ones.

Ah. I'd missed the fact you store into the extent without a check on validity
but only move the index on if they were valid. Then rely on not passing a
trailing entry at the end.

It would be more readable, I think, if local variables were used for the
parameters until we've decided not to skip, and then this ended with

    if (!skip_extent) {
        extents[i] = (CXLDCExtentRaw) {
            .start_dpa = ...
            ...
        };
        i++;
    }

We have a local len already, so probably just need

    uint64_t start_dpa = offset + dcd->dc.regions[rid].base;

Also maybe skip_extent_evlog or something like that to explain we are only
skipping that part. Helps people like me who read it completely wrong!

Jonathan
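For illustration, the restructuring suggested in the review — keep candidate values in locals and only assign into the output array once the skip decision has been made — can be sketched standalone. `SketchExtentRaw` and `collect_extent` are simplified stand-ins invented for this example, not the actual QEMU types:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for CXLDCExtentRaw; names are illustrative only. */
typedef struct SketchExtentRaw {
    uint64_t start_dpa;
    uint64_t len;
    uint8_t tag[0x10];
    uint16_t shared_seq;
} SketchExtentRaw;

/*
 * Fill out[i] only once we know the extent is not skipped: keep the
 * candidate values in locals, then assign with a compound literal and
 * bump the index. Returns the (possibly advanced) index.
 */
unsigned collect_extent(SketchExtentRaw *out, unsigned i,
                        uint64_t offset, uint64_t len,
                        uint64_t region_base, bool skip_extent)
{
    uint64_t start_dpa = offset + region_base;

    if (!skip_extent) {
        out[i] = (SketchExtentRaw) {
            .start_dpa = start_dpa,
            .len = len,
            /* .tag and .shared_seq are zeroed by the compound literal */
        };
        i++;
    }
    return i;
}
```

This keeps the "skipped entries are never written" property explicit rather than relying on later entries overwriting slots whose index was never advanced.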
Re: [PATCH v5 08/13] hw/cxl/cxl-mailbox-utils: Add mailbox commands to support add/release dynamic capacity response
> > > +static void cxl_destroy_dc_regions(CXLType3Dev *ct3d)
> > > +{
> > > +    CXLDCExtent *ent;
> > > +
> > > +    while (!QTAILQ_EMPTY(&ct3d->dc.extents)) {
> > > +        ent = QTAILQ_FIRST(&ct3d->dc.extents);
> > > +        cxl_remove_extent_from_extent_list(&ct3d->dc.extents, ent);
> >
> > Isn't this the same as something like
> >     QTAILQ_FOREACH_SAFE(ent, &ct3d->dc.extents, node, tmp) {
> >         cxl_remove_extent_from_extent_list(&ct3d->dc.extents, ent);
> > // This wrapper is small enough I'd be tempted to just have the
> > // code inline at the places it's called.
>
> We will have more to release after we introduce the pending list as well
> as the bitmap. Keep it?

ok.

> Fan
> > >     }
> > > +}
> > > +}
> >
Re: [PATCH v5 08/13] hw/cxl/cxl-mailbox-utils: Add mailbox commands to support add/release dynamic capacity response
On Wed, 6 Mar 2024 13:39:50 -0800 fan wrote:
> > > +            }
> > > +            if (len2) {
> > > +                cxl_insert_extent_to_extent_list(extent_list, dpa + len,
> > > +                                                 len2, NULL, 0);
> > > +                ct3d->dc.total_extent_count += 1;
> > > +            }
> > > +            break;
> >
> > Maybe this makes sense after the support below is added, but at this
> > point in the series
> >     return CXL_MBOX_SUCCESS;
> > then found isn't relevant so can drop that. Looks like you drop it later
> > in the series anyway.
>
> We cannot return directly as we have more extents to release.

Ah, good point. I'd missed the double loop.

> One thing I think I need to add is a dry run to test if any extent in
> the incoming list is not contained by an extent in the extent list and
> return an error before starting to do the real release. The spec just says
> we need to return invalid PA but does not specify whether we should update
> the list until we find a "bad" extent or reject the request directly.
> Current code leaves a situation where we may have updated the extent list
> until we found a "bad" extent to release.

Yes, I'm not sure on the correct answer to this either. My assumption is
that in error cases there are no side effects, but I don't see a clear
statement of that. So I think we are in the world of best practice, not
spec compliance. If we wanted to recover from such an error case we'd have
to verify the current extent list.

I'll fire off a question to relevant folk in the appropriate forum.

Jonathan
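The dry run discussed here — validate every extent in the incoming release request against the device's extent list before mutating any state, so an error has no side effects — can be sketched in isolation. The array-based list and all names (`SketchRange`, `release_request_valid`) are hypothetical simplifications for this example, not the mailbox code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint64_t dpa, len; } SketchRange;

/* True iff [dpa, dpa + len) lies entirely inside some accepted extent. */
bool range_covered(const SketchRange *accepted, size_t n,
                   uint64_t dpa, uint64_t len)
{
    for (size_t i = 0; i < n; i++) {
        if (accepted[i].dpa <= dpa &&
            dpa + len <= accepted[i].dpa + accepted[i].len) {
            return true;
        }
    }
    return false;
}

/*
 * Dry run: reject the whole request before touching any state, so a
 * failure (the spec's "invalid PA" case) leaves the extent list
 * unmodified rather than partially released.
 */
bool release_request_valid(const SketchRange *accepted, size_t n_acc,
                           const SketchRange *req, size_t n_req)
{
    for (size_t i = 0; i < n_req; i++) {
        if (!range_covered(accepted, n_acc, req[i].dpa, req[i].len)) {
            return false; /* would map to CXL_MBOX_INVALID_PA */
        }
    }
    return true; /* safe to perform the real release */
}
```

Only after this pass succeeds would the real removal walk the list and update counts, which matches the "no side effects on error" reading discussed above.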
Re: [PATCH v5 06/13] hw/mem/cxl_type3: Add host backend and address space handling for DC regions
> > > @@ -868,16 +974,24 @@ static int cxl_type3_hpa_to_as_and_dpa(CXLType3Dev *ct3d,
> > >                                         AddressSpace **as,
> > >                                         uint64_t *dpa_offset)
> > >  {
> > > -    MemoryRegion *vmr = NULL, *pmr = NULL;
> > > +    MemoryRegion *vmr = NULL, *pmr = NULL, *dc_mr = NULL;
> > > +    uint64_t vmr_size = 0, pmr_size = 0, dc_size = 0;
> > >
> > >      if (ct3d->hostvmem) {
> > >          vmr = host_memory_backend_get_memory(ct3d->hostvmem);
> > > +        vmr_size = memory_region_size(vmr);
> > >      }
> > >      if (ct3d->hostpmem) {
> > >          pmr = host_memory_backend_get_memory(ct3d->hostpmem);
> > > +        pmr_size = memory_region_size(pmr);
> > > +    }
> > > +    if (ct3d->dc.host_dc) {
> > > +        dc_mr = host_memory_backend_get_memory(ct3d->dc.host_dc);
> > > +        /* Do we want dc_size to be dc_mr->size or not?? */
> >
> > Maybe - definitely don't want to leave this comment here
> > unanswered, and I think you enforce it above anyway.
> >
> > So if we get here ct3d->dc.total_capacity ==
> > memory_region_size(ct3d->dc.host_dc);
> > As such we could just not stash total_capacity at all?
>
> I cannot identify a case where these two will be different. But
> total_capacity is referenced in quite a few places; it may be nice to
> have it so we do not need to call the function to get the value every
> time?

I kind of like having it via one path so that there is no confusion for the
reader, but up to you on this one. The function called is trivial (other
than some magic to handle very large memory regions), so this is just a
readability question, not a perf one.

Whatever you do, don't leave the question behind. Fine to have something
that says they are always the same size if you don't get rid of the
total_capacity representation.

Jonathan
Re: [PATCH v5 0/3] Initial support for SPDM Responders
On Thu, 7 Mar 2024 10:58:56 +1000 Alistair Francis wrote:
> The Security Protocol and Data Model (SPDM) Specification defines
> messages, data objects, and sequences for performing message exchanges
> over a variety of transport and physical media.
> - https://www.dmtf.org/sites/default/files/standards/documents/DSP0274_1.3.0.pdf
>
> SPDM currently supports PCIe DOE and MCTP transports, but it can be
> extended to support others in the future. This series adds
> support to QEMU to connect to an external SPDM instance.
>
> SPDM support can be added to any QEMU device by exposing a
> TCP socket to a SPDM server. The server can then implement the SPDM
> decoding/encoding support, generally using libspdm [1].
>
> This is similar to how the current TPM implementation works and means
> that the heavy lifting of setting up certificate chains, capabilities,
> measurements and complex crypto can be done outside QEMU by a well
> supported and tested library.
>
> This series implements socket support and exposes SPDM for an NVMe device.

Thanks Alistair,

I'm really keen to see this land soon as I have the CXL infrastructure for
this backed up behind it. It will also be needed for PCI (IDE) and CXL link
encryption emulation and most if not all of the confidential computing
stacks with QEMU emulating the host system + peripherals.

I believe it's just waiting for a PCI maintainer Ack at this point? Klaus
said he was happy to take it through NVMe but wanted a PCI Ack first.

Michael / Marcel, if you have time to look at it that would be great.
Thanks, Jonathan > > 1: https://github.com/DMTF/libspdm > > v5: > - Update MAINTAINERS > v4: > - Rebase > v3: > - Spelling fixes > - Support for SPDM-Utils > v2: > - Add cover letter > - A few code fixes based on comments > - Document SPDM-Utils > - A few tweaks and clarifications to the documentation > > Alistair Francis (1): > hw/pci: Add all Data Object Types defined in PCIe r6.0 > > Huai-Cheng Kuo (1): > backends: Initial support for SPDM socket support > > Wilfred Mallawa (1): > hw/nvme: Add SPDM over DOE support > > MAINTAINERS | 6 + > docs/specs/index.rst | 1 + > docs/specs/spdm.rst | 122 > include/hw/pci/pci_device.h | 5 + > include/hw/pci/pcie_doe.h| 5 + > include/sysemu/spdm-socket.h | 44 +++ > backends/spdm-socket.c | 216 +++ > hw/nvme/ctrl.c | 53 + > backends/Kconfig | 4 + > backends/meson.build | 2 + > 10 files changed, 458 insertions(+) > create mode 100644 docs/specs/spdm.rst > create mode 100644 include/sysemu/spdm-socket.h > create mode 100644 backends/spdm-socket.c >
Re: [PATCH v8 2/2] hw/acpi: Implement the SRAT GI affinity structure
On Thu, 7 Mar 2024 03:03:02 + Ankit Agrawal wrote: > >> > >> [1] ACPI Spec 6.3, Section 5.2.16.6 > >> [2] ACPI Spec 6.3, Table 5.80 > >> > >> Cc: Jonathan Cameron > >> Cc: Alex Williamson > >> Cc: Cedric Le Goater > >> Signed-off-by: Ankit Agrawal > > > > I guess we gloss over the bisection breakage due to being able to add > > these nodes and have them used in HMAT as initiators before we have > > added SRAT support. Linux will moan about it and not use such an HMAT > > but meh, it will boot. > > > > You could drag the HMAT change after this but perhaps it's not worth > > bothering. > > Sorry this part isn't clear to me. Are you suggesting we keep the HMAT > changes out from this patch? No - don't drop them. Move them from patch 1 to either patch 2, or to a patch 3 if that ends up looking clearer. I think patch 2 is the right choice though as that enables everything at once. It's valid to have SRAT containing GI entries without the same in HMAT (as HMAT doesn't have to be complete), it's not valid to have HMAT refer to entries that aren't in SRAT. Another thing we may need to do add in the long run is the _OSC support. That's needed for DSDT entries with _PXM associated with a GI only node so that we can make them move node depending on whether or not the Guest OS supports GIs and so will create the nodes. Requires a bit of magic AML to make that work. It used to crash linux if you didn't do that, but that's been fixed for a while I believe. For now we aren't adding any such _PXM entries though so this is just one for the TODO list :) > > > Otherwise LGTM > > Reviewed-by: Jonathan Cameron > > Thanks! > > > Could add x86 support (posted in reply to v7 this morning) > > and sounds like you have the test nearly ready which is great. > > Ok, will add the x86 part as well. I could reuse what you shared > earlier. > > https://gitlab.com/jic23/qemu/-/commit/ccfb4fe22167e035173390cf147d9c226951b9b6 Excellent - thanks! Jonathan > > >
Re: [PATCH v5 12/13] hw/mem/cxl_type3: Allow to release partial extent and extent superset in QMP interface
On Mon, 4 Mar 2024 11:34:07 -0800 nifan@gmail.com wrote:
> From: Fan Ni
>
> Before the change, the QMP interface used for add/release DC extents
> only allows releasing extents that exist in either the pending-to-add
> list or the accepted list in the device, which means the DPA range of the
> extent must match exactly that of an extent in either list. Otherwise,
> the release request will be ignored.
>
> With the change, we relax the constraints. As long as the DPA range of
> the extent to release is covered by extents in one of the two lists
> mentioned above, we allow the release.
>
> Signed-off-by: Fan Ni

Ran out of time today, so just took a very quick look at this. Seemed fine
but similar comments on exit conditions and retry gotos as earlier patches.

> +/*
> + * Remove all extents whose DPA range overlaps the DPA range
> + * [dpa, dpa + len) from the list, and delete the overlapped portion.
> + * Note:
> + * 1. If the removed extent is fully within the DPA range, delete the
> + *    extent;
> + * 2. Otherwise, keep the portion that does not overlap, and insert new
> + *    extents into the list if needed for the non-overlapped part.
> + */
> +static void cxl_delist_extent_by_dpa_range(CXLDCExtentList *list,
> +                                           uint64_t dpa, uint64_t len)
> +{
> +    CXLDCExtent *ent;
>
> -    return NULL;
> +process_leftover:

As before, can we turn this into a while loop so the exit conditions are
more obvious? Based on len, I think.
> +QTAILQ_FOREACH(ent, list, node) { > +if (ent->start_dpa <= dpa && dpa < ent->start_dpa + ent->len) { > +uint64_t ent_start_dpa = ent->start_dpa; > +uint64_t ent_len = ent->len; > +uint64_t len1 = dpa - ent_start_dpa; > + > +cxl_remove_extent_from_extent_list(list, ent); > +if (len1) { > +cxl_insert_extent_to_extent_list(list, ent_start_dpa, > + len1, NULL, 0); > +} > + > +if (dpa + len <= ent_start_dpa + ent_len) { > +uint64_t len2 = ent_start_dpa + ent_len - dpa - len; > +if (len2) { > +cxl_insert_extent_to_extent_list(list, dpa + len, > + len2, NULL, 0); > +} > +} else { > +len = dpa + len - ent_start_dpa - ent_len; > +dpa = ent_start_dpa + ent_len; > +goto process_leftover; > +} > +} > +} > } > > /* > @@ -1915,8 +1966,8 @@ static void qmp_cxl_process_dynamic_capacity(const char > *path, CxlEventLog log, > list = records; > extents = g_new0(CXLDCExtentRaw, num_extents); > while (list) { > -CXLDCExtent *ent; > bool skip_extent = false; > +CXLDCExtentList *extent_list; > > offset = list->value->offset; > len = list->value->len; > @@ -1933,15 +1984,32 @@ static void qmp_cxl_process_dynamic_capacity(const > char *path, CxlEventLog log, > * remove it from the pending extent list, so later when the add > * response for the extent arrives, the device can reject the > * extent as it is not in the pending list. > + * Now, we can handle the case where the extent covers the DPA No need for Now. Anyone reading it is look at the cod here. > + * range of multiple extents in the pending_to_add list. > + * TODO: we do not allow the extent covers range of extents in > + * pending_to_add list and accepted list at the same time for > now. 
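The `goto process_leftover` pattern above can be recast as the suggested `while (len)` loop. Here is a self-contained sketch over a toy fixed-capacity list — `SketchList` and friends are invented for the example and stand in for QEMU's QTAILQ-based extent list:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal fixed-capacity extent list standing in for CXLDCExtentList. */
#define MAX_EXT 16
typedef struct { uint64_t dpa, len; } SketchExt;
typedef struct { SketchExt e[MAX_EXT]; size_t n; } SketchList;

void list_add(SketchList *l, uint64_t dpa, uint64_t len)
{
    l->e[l->n++] = (SketchExt) { dpa, len };
}

void list_del(SketchList *l, size_t i)
{
    l->e[i] = l->e[--l->n]; /* swap-remove; order is not significant here */
}

/*
 * Remove [dpa, dpa + len) from the list, keeping non-overlapping leftovers.
 * The goto becomes a while (len) loop: each pass trims one extent and
 * advances dpa/len past it, so the exit condition is explicit.
 */
void delist_by_dpa_range(SketchList *l, uint64_t dpa, uint64_t len)
{
    while (len) {
        bool found = false;
        for (size_t i = 0; i < l->n; i++) {
            uint64_t s = l->e[i].dpa, el = l->e[i].len;
            if (s <= dpa && dpa < s + el) {
                list_del(l, i);
                if (dpa > s) {
                    list_add(l, s, dpa - s);          /* leading leftover */
                }
                if (dpa + len <= s + el) {
                    if (s + el > dpa + len) {         /* trailing leftover */
                        list_add(l, dpa + len, s + el - dpa - len);
                    }
                    len = 0;                          /* fully handled */
                } else {
                    len = dpa + len - (s + el);       /* carry remainder */
                    dpa = s + el;
                }
                found = true;
                break;
            }
        }
        if (!found) {
            break; /* range not covered; assumes caller validated earlier */
        }
    }
}
```

The loop body is the same splitting logic as the patch; only the control flow differs, with `len` shrinking to zero as the termination condition.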
> */ > -ent = cxl_dc_extent_exists(>dc.extents_pending_to_add, > -[i]); > -if (ent) { > -QTAILQ_REMOVE(>dc.extents_pending_to_add, ent, node); > -g_free(ent); > +extent_list = >dc.extents_pending_to_add; > +if (cxl_test_dpa_range_covered_by_extents(extent_list, > + extents[i].start_dpa, > + extents[i].len)) { > +cxl_delist_extent_by_dpa_range(extent_list, > + extents[i].start_dpa, > + extents[i].len); > +} else if (!ct3_test_region_block_backed(dcd, > extents[i].start_dpa, > + extents[i].len)) { > +/* > + * If the DPA range of the extent is not covered by extents > + * in the accepted list, skip > + */ > skip_extent = true; > -} else if (!cxl_dc_extent_exists(>dc.extents, [i])) > { > -/* If the exact extent is not in the accepted list, skip */ > +} > +} else if (type ==
Re: [PATCH v5 11/13] hw/cxl/cxl-mailbox-utils: Add partial and superset extent release mailbox support
On Mon, 4 Mar 2024 11:34:06 -0800 nifan@gmail.com wrote:
> From: Fan Ni
>
> With the change, we extend the extent release mailbox command processing
> to allow more flexible release. As long as the DPA range of the extent to
> release is covered by valid extent(s) in the device, the release can be
> performed.
>
> Signed-off-by: Fan Ni

Ouch, this is more complex than I was thinking, but seems correct to me.
A few minor comments inline.

Jonathan

> +/*
> + * Detect potential extent overflow caused by extent splits during
> + * processing of extent release requests; also allow releasing a superset
> + * of extents, where the extent to release covers the range of multiple
> + * extents in the device.
> + * Note:
> + * 1. We will reject releasing an extent if some portion of its range is
> + *    not covered by valid extents.
> + * 2. This function is called after cxl_detect_malformed_extent_list, so
> + *    checks already performed there will be skipped.
> + */
> +static CXLRetCode cxl_detect_extent_overflow(const CXLType3Dev *ct3d,
> +    const CXLUpdateDCExtentListInPl *in)

This code is basically dry running the actual removal. Can we just make the
core code the same for both cases? The bit where you update bitmaps and
extent lists at least.
> +{ > +uint64_t nbits, offset; > +const CXLDCRegion *region; > +unsigned long **bitmaps_copied; > +uint64_t dpa, len; > +int i, rid; > +CXLRetCode ret = CXL_MBOX_SUCCESS; > +long extent_cnt_delta = 0; > +CXLDCExtentList tmp_list; > +CXLDCExtent *ent; > + > +QTAILQ_INIT(_list); > +copy_extent_list(_list, >dc.extents); > + > +bitmaps_copied = g_new0(unsigned long *, ct3d->dc.num_regions); > +for (i = 0; i < ct3d->dc.num_regions; i++) { > +region = >dc.regions[i]; > +nbits = region->len / region->block_size; > +bitmaps_copied[i] = bitmap_new(nbits); > +bitmap_copy(bitmaps_copied[i], region->blk_bitmap, nbits); > +} > + > +for (i = 0; i < in->num_entries_updated; i++) { > +dpa = in->updated_entries[i].start_dpa; > +len = in->updated_entries[i].len; > + > +rid = cxl_find_dc_region_id(ct3d, dpa, len); > +region = >dc.regions[rid]; > +offset = (dpa - region->base) / region->block_size; > +nbits = len / region->block_size; > + > +/* Check whether range [dpa, dpa + len) is covered by valid range */ > +if (find_next_zero_bit(bitmaps_copied[rid], offset + nbits, offset) < > + offset + nbits) { > +ret = CXL_MBOX_INVALID_PA; > +goto free_and_exit; > +} > + > +QTAILQ_FOREACH(ent, _list, node) { > +/* Only split within an extent can cause extent count increase */ > +if (ent->start_dpa <= dpa && > +dpa + len <= ent->start_dpa + ent->len) { > +uint64_t ent_start_dpa = ent->start_dpa; > +uint64_t ent_len = ent->len; > +uint64_t len1 = dpa - ent_start_dpa; > +uint64_t len2 = ent_start_dpa + ent_len - dpa - len; > + > +extent_cnt_delta += len1 && len2 ? 2 : (len1 || len2 ? 1 : > 0); I think this is the same as if (len1) extent_cnt_delta++; if (len2) extent_cnt_delta++; extent_cnt_delta--; > +extent_cnt_delta -= 1; > +if (ct3d->dc.total_extent_count + extent_cnt_delta > > +CXL_NUM_EXTENTS_SUPPORTED) { This early overflow detect seems valid to me because a device might run out or resource mid processing the list even if it would fit at the end. Good. 
> +ret = CXL_MBOX_RESOURCES_EXHAUSTED; > +goto free_and_exit; > +} > + > +offset = (ent->start_dpa - region->base) / > region->block_size; > +nbits = ent->len / region->block_size; > +bitmap_clear(bitmaps_copied[rid], offset, nbits); > +cxl_remove_extent_from_extent_list(_list, ent); > + > + if (len1) { > +offset = (dpa - region->base) / region->block_size; > +nbits = len1 / region->block_size; > +bitmap_set(bitmaps_copied[rid], offset, nbits); > +cxl_insert_extent_to_extent_list(_list, > + ent_start_dpa, len1, > + NULL, 0); > + } > + > + if (len2) { > +offset = (dpa + len - region->base) / region->block_size; > +nbits = len2 / region->block_size; > +bitmap_set(bitmaps_copied[rid], offset, nbits); > +cxl_insert_extent_to_extent_list(_list, dpa + len, > + len2, NULL, 0);
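The extent-count arithmetic discussed above is small enough to pin down with a helper. This standalone function (the name is invented) implements the if-based form suggested in the review, which is equivalent to the patch's ternary expression `len1 && len2 ? 2 : (len1 || len2 ? 1 : 0)` followed by the `-= 1`:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Net change in extent count when a release splits one extent into up to
 * two leftovers of sizes len1 and len2 (either may be zero): each
 * non-empty leftover adds one extent, and the original extent goes away.
 */
long split_count_delta(uint64_t len1, uint64_t len2)
{
    long delta = 0;

    if (len1) {
        delta++;
    }
    if (len2) {
        delta++;
    }
    return delta - 1;
}
```

So a release that consumes a whole extent yields -1, a release leaving one leftover yields 0, and a mid-extent release leaving both ends yields +1 — the only case that can push the device toward its extent-count limit.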
Re: [PATCH v5 10/13] hw/mem/cxl_type3: Add dpa range validation for accesses to DC regions
On Mon, 4 Mar 2024 11:34:05 -0800 nifan@gmail.com wrote: > From: Fan Ni > > Not all dpa range in the DC regions is valid to access until an extent All DPA ranges in the DC regions are invalid to access until an extent covering the range has been added. > covering the range has been added. Add a bitmap for each region to > record whether a DC block in the region has been backed by DC extent. > For the bitmap, a bit in the bitmap represents a DC block. When a DC > extent is added, all the bits of the blocks in the extent will be set, > which will be cleared when the extent is released. > > Signed-off-by: Fan Ni Reviewed-by: Jonathan Cameron
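The per-region bitmap scheme described in this commit message — one bit per DC block, set when an extent is added, cleared when it is released, and consulted before allowing access — can be sketched without QEMU's bitmap helpers. `SketchRegion`, `set_range`, and `range_backed` are invented names for illustration:

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define SKETCH_NBITS 64

/* One bit per DC block; a toy stand-in for the per-region blk_bitmap. */
typedef struct {
    uint64_t base;       /* region base DPA */
    uint64_t block_size;
    unsigned long bits[SKETCH_NBITS / BITS_PER_WORD + 1];
} SketchRegion;

/* Set (extent add) or clear (extent release) the blocks in [dpa, dpa+len). */
void set_range(SketchRegion *r, uint64_t dpa, uint64_t len, bool set)
{
    uint64_t first = (dpa - r->base) / r->block_size;
    uint64_t nbits = len / r->block_size;

    for (uint64_t b = first; b < first + nbits; b++) {
        unsigned long mask = 1UL << (b % BITS_PER_WORD);
        size_t idx = b / BITS_PER_WORD;

        if (set) {
            r->bits[idx] |= mask;
        } else {
            r->bits[idx] &= ~mask;
        }
    }
}

/* Access is valid only if every block in [dpa, dpa+len) is extent-backed. */
bool range_backed(const SketchRegion *r, uint64_t dpa, uint64_t len)
{
    uint64_t first = (dpa - r->base) / r->block_size;
    uint64_t nbits = len / r->block_size;

    for (uint64_t b = first; b < first + nbits; b++) {
        if (!(r->bits[b / BITS_PER_WORD] & (1UL << (b % BITS_PER_WORD)))) {
            return false;
        }
    }
    return true;
}
```

This mirrors the patch's semantics at block granularity: a read or write is rejected unless the whole accessed range falls on set bits.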
Re: [PATCH v5 09/13] hw/cxl/events: Add qmp interfaces to add/release dynamic capacity extents
On Mon, 4 Mar 2024 11:34:04 -0800 nifan@gmail.com wrote: > From: Fan Ni > > Since fabric manager emulation is not supported yet, the change implements > the functions to add/release dynamic capacity extents as QMP interfaces. We'll need them anyway, or to implement an fm interface via QMP which is going to be ugly and complex. > > Note: we skips any FM issued extent release request if the exact extent > does not exist in the extent list of the device. We will loose the > restriction later once we have partial release support in the kernel. Maybe the kernel will treat it as a request to release the extent it is tracking that contains it. So we may want to add a way to poke that. Not today though! > > 1. Add dynamic capacity extents: > > For example, the command to add two continuous extents (each 128MiB long) > to region 0 (starting at DPA offset 0) looks like below: > > { "execute": "qmp_capabilities" } > > { "execute": "cxl-add-dynamic-capacity", > "arguments": { > "path": "/machine/peripheral/cxl-dcd0", > "region-id": 0, > "extents": [ > { > "dpa": 0, > "len": 134217728 > }, > { > "dpa": 134217728, > "len": 134217728 > } > ] > } > } > > 2. Release dynamic capacity extents: > > For example, the command to release an extent of size 128MiB from region 0 > (DPA offset 128MiB) look like below: > > { "execute": "cxl-release-dynamic-capacity", > "arguments": { > "path": "/machine/peripheral/cxl-dcd0", > "region-id": 0, > "extents": [ > { > "dpa": 134217728, > "len": 134217728 > } > ] > } > } > > Signed-off-by: Fan Ni ... 
> diff --git a/hw/mem/cxl_type3.c b/hw/mem/cxl_type3.c
> index dccfaaad3a..e9c8994cdb 100644
> --- a/hw/mem/cxl_type3.c
> +++ b/hw/mem/cxl_type3.c
> @@ -674,6 +674,7 @@ static bool cxl_create_dc_regions(CXLType3Dev *ct3d, Error **errp)
>          ct3d->dc.total_capacity += region->len;
>      }
>      QTAILQ_INIT(&ct3d->dc.extents);
> +    QTAILQ_INIT(&ct3d->dc.extents_pending_to_add);
>
>      return true;
>  }
> @@ -686,6 +687,12 @@ static void cxl_destroy_dc_regions(CXLType3Dev *ct3d)
>          ent = QTAILQ_FIRST(&ct3d->dc.extents);
>          cxl_remove_extent_from_extent_list(&ct3d->dc.extents, ent);
>      }
> +
> +    while (!QTAILQ_EMPTY(&ct3d->dc.extents_pending_to_add)) {

QTAILQ_FOREACH_SAFE?

> +        ent = QTAILQ_FIRST(&ct3d->dc.extents_pending_to_add);
> +        cxl_remove_extent_from_extent_list(&ct3d->dc.extents_pending_to_add,
> +                                           ent);
> +    }
>  }

> +/*
> + * The main function to process dynamic capacity events. Currently DC
> + * extent add/release requests are processed.
> + */
> +static void qmp_cxl_process_dynamic_capacity(const char *path, CxlEventLog log,
> +                                             CXLDCEventType type, uint16_t hid,
> +                                             uint8_t rid,
> +                                             CXLDCExtentRecordList *records,
> +                                             Error **errp)
> +{
> +    Object *obj;
> +    CXLEventDynamicCapacity dCap = {};
> +    CXLEventRecordHdr *hdr = &dCap.hdr;
> +    CXLType3Dev *dcd;
> +    uint8_t flags = 1 << CXL_EVENT_TYPE_INFO;
> +    uint32_t num_extents = 0;
> +    CXLDCExtentRecordList *list;
> +    g_autofree CXLDCExtentRaw *extents = NULL;
> +    uint8_t enc_log;
> +    uint64_t offset, len, block_size;
> +    int i;
> +    int rc;

Combine the two lines above.

> +    g_autofree unsigned long *blk_bitmap = NULL;
> +
> +    obj = object_resolve_path(path, NULL);
> +    if (!obj) {
> +        error_setg(errp, "Unable to resolve path");
> +        return;
> +    }

object_resolve_path_type() and skip a step (should do this in various places
in our existing code!)
> +if (!object_dynamic_cast(obj, TYPE_CXL_TYPE3)) { > +error_setg(errp, "Path not point to a valid CXL type3 device"); > +return; > +} > + > +dcd = CXL_TYPE3(obj); > +if (!dcd->dc.num_regions) { > +error_setg(errp, "No dynamic capacity support from the device"); > +return; > +} > + > +rc = ct3d_qmp_cxl_event_log_enc(log); > +if (rc < 0) { > +error_setg(errp, "Unhandled error log type"); > +return; > +} > +enc_log = rc; > + > +if (rid >= dcd->dc.num_regions) { > +error_setg(errp, "region id is too large"); > +return; > +} > +block_size = dcd->dc.regions[rid].block_size; > + > +/* Sanity check and count the extents */ > +list = records; > +while (list) { > +offset = list->value->offset; > +len = list->value->len; > + > +if (len == 0) { > +error_setg(errp, "extent with 0 length is not allowed"); > +return; > +} > + > +if (offset % block_size || len % block_size) { > +error_setg(errp, "dpa or len is not aligned to region block > size"); > +
Re: [PATCH v5 08/13] hw/cxl/cxl-mailbox-utils: Add mailbox commands to support add/release dynamic capacity response
On Mon, 4 Mar 2024 11:34:03 -0800 nifan@gmail.com wrote: > From: Fan Ni > > Per CXL spec 3.1, two mailbox commands are implemented: > Add Dynamic Capacity Response (Opcode 4802h) 8.2.9.9.9.3, and > Release Dynamic Capacity (Opcode 4803h) 8.2.9.9.9.4. > > Signed-off-by: Fan Ni Hmm. So I had a thought which would work for what you have here. See include/qemu/range.h I like the region merging stuff that is also in the list operators but we shouldn't use that because we have other reasons not to fuse ranges (sequence numbering etc) We could make an extent a wrapper around a struct Range though so that we can use the comparison stuff directly. + we can use the list manipulation in there as the basis for a future extent merging infrastructure that is tag and sequence number (if provided - so shared capacity or pmem) aware. Jonathan > --- > + > +/* > + * CXL r3.1 Table 8-168: Add Dynamic Capacity Response Input Payload > + * CXL r3.1 Table 8-170: Release Dynamic Capacity Input Payload > + */ > +typedef struct CXLUpdateDCExtentListInPl { > +uint32_t num_entries_updated; > +uint8_t flags; > +uint8_t rsvd[3]; > +/* CXL r3.1 Table 8-169: Updated Extent */ > +struct { > +uint64_t start_dpa; > +uint64_t len; > +uint8_t rsvd[8]; > +} QEMU_PACKED updated_entries[]; > +} QEMU_PACKED CXLUpdateDCExtentListInPl; > + > +/* > + * For the extents in the extent list to operate, check whether they are > valid > + * 1. The extent should be in the range of a valid DC region; > + * 2. The extent should not cross multiple regions; > + * 3. The start DPA and the length of the extent should align with the block > + * size of the region; > + * 4. The address range of multiple extents in the list should not overlap. Hmm. Interesting. I was thinking a given add / remove command rather than just the extents can't overlap a region. However I can't find text on that so I believe your interpretation is correct. It is only specified for the event records, but that is good enough I think. 
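The "extent as a wrapper around a struct Range" idea could look roughly like this. This is a hypothetical sketch that mirrors the inclusive-bounds convention of QEMU's include/qemu/range.h rather than using it directly, so the `Sketch*` names and layout are illustrative only:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Inclusive-bounds range, mirroring struct Range in include/qemu/range.h. */
typedef struct SketchRange {
    uint64_t lob; /* inclusive lower bound */
    uint64_t upb; /* inclusive upper bound */
} SketchRange;

/*
 * Hypothetical extent-as-wrapper: the Range carries the DPA span, while
 * the DC-specific metadata (tag, sequence number) rides alongside so
 * future merging infrastructure can stay tag- and sequence-aware.
 */
typedef struct SketchDCExtent {
    SketchRange range;
    uint8_t tag[0x10];
    uint16_t shared_seq;
} SketchDCExtent;

bool sketch_range_contains(const SketchRange *outer, const SketchRange *inner)
{
    return outer->lob <= inner->lob && inner->upb <= outer->upb;
}

/* Two extents are merge candidates only when metadata matches and the
 * ranges abut; comparison logic works on the embedded range alone. */
bool sketch_extents_mergeable(const SketchDCExtent *a, const SketchDCExtent *b)
{
    return a->shared_seq == b->shared_seq &&
           memcmp(a->tag, b->tag, sizeof(a->tag)) == 0 &&
           a->range.upb + 1 == b->range.lob;
}
```

The point of the wrapper is that range comparison and containment come from the shared Range machinery, while the merge predicate adds the tag/sequence constraints that plain range fusing would violate.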
We might want to propose tightening the spec on this to allow devices to say no to such complex extent lists. Maybe a nice friendly Memory vendor should query this one if it's a potential problem for real devices. Might not be! > + */ > +static CXLRetCode cxl_detect_malformed_extent_list(CXLType3Dev *ct3d, > +const CXLUpdateDCExtentListInPl *in) > +{ > +uint64_t min_block_size = UINT64_MAX; > +CXLDCRegion *region = >dc.regions[0]; > +CXLDCRegion *lastregion = >dc.regions[ct3d->dc.num_regions - 1]; > +g_autofree unsigned long *blk_bitmap = NULL; > +uint64_t dpa, len; > +uint32_t i; > + > +for (i = 0; i < ct3d->dc.num_regions; i++) { > +region = >dc.regions[i]; > +min_block_size = MIN(min_block_size, region->block_size); > +} > + > +blk_bitmap = bitmap_new((lastregion->base + lastregion->len - > + ct3d->dc.regions[0].base) / min_block_size); > + > +for (i = 0; i < in->num_entries_updated; i++) { > +dpa = in->updated_entries[i].start_dpa; > +len = in->updated_entries[i].len; > + > +region = cxl_find_dc_region(ct3d, dpa, len); > +if (!region) { > +return CXL_MBOX_INVALID_PA; > +} > + > +dpa -= ct3d->dc.regions[0].base; > +if (dpa % region->block_size || len % region->block_size) { > +return CXL_MBOX_INVALID_EXTENT_LIST; > +} > +/* the dpa range already covered by some other extents in the list */ > +if (test_any_bits_set(blk_bitmap, dpa / min_block_size, > +len / min_block_size)) { > +return CXL_MBOX_INVALID_EXTENT_LIST; > +} > +bitmap_set(blk_bitmap, dpa / min_block_size, len / min_block_size); > + } > + > +return CXL_MBOX_SUCCESS; > +} > + > +/* > + * CXL r3.1 section 8.2.9.9.9.3: Add Dynamic Capacity Response (Opcode 4802h) > + * An extent is added to the extent list and becomes usable only after the > + * response is processed successfully > + */ > +static CXLRetCode cmd_dcd_add_dyn_cap_rsp(const struct cxl_cmd *cmd, > + uint8_t *payload_in, > + size_t len_in, > + uint8_t *payload_out, > + size_t *len_out, > + CXLCCI *cci) > +{ > +CXLUpdateDCExtentListInPl *in = 
(void *)payload_in; > +CXLType3Dev *ct3d = CXL_TYPE3(cci->d); > +CXLDCExtentList *extent_list = >dc.extents; > +CXLDCExtent *ent; > +uint32_t i; > +uint64_t dpa, len; > +CXLRetCode ret; > + > +if (in->num_entries_updated == 0) { > +return CXL_MBOX_SUCCESS; > +} > + > +/* Adding extents causes exceeding device's extent tracking ability. */ > +if (in->num_entries_updated + ct3d->dc.total_extent_count > > +CXL_NUM_EXTENTS_SUPPORTED) { > +