RE: [PATCH v17 3/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper

2024-02-07 Thread Tian, Kevin
> From: Ankit Agrawal 
> Sent: Thursday, February 8, 2024 3:13 PM
> >> > +    * Determine how many bytes to be actually read from the
> >> > device memory.
> >> > +    * Read request beyond the actual device memory size is
> >> > filled with ~0,
> >> > +    * while those beyond the actual reported size is skipped.
> >> > +    */
> >> > +   if (offset >= memregion->memlength)
> >> > +   mem_count = 0;
> >>
> >> If mem_count == 0, going through nvgrace_gpu_map_and_read() is not
> >> necessary.
> >
> > Harmless, other than the possibly unnecessary call through to
> > nvgrace_gpu_map_device_mem().  Maybe both
> nvgrace_gpu_map_and_read()
> > and nvgrace_gpu_map_and_write() could conditionally return 0 as their
> > first operation when !mem_count.  Thanks,
> >
> >Alex
> 
> IMO, this seems like adding too much code to reduce the call length for a
> very specific case. If there aren't any strong opinions on this, I'm
> planning to leave this code as it is.

A slight difference: if mem_count == 0 the result should always succeed,
regardless of whether nvgrace_gpu_map_device_mem() succeeds. Of course, if
that fails there is already a big problem, so probably nobody cares about
the subtle difference when reading a non-existent range.

but regarding readability it's still clearer:

if (mem_count)
	nvgrace_gpu_map_and_read();
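The semantic point above can be modeled in a few lines (a userspace sketch, not the driver code; map_would_fail stands in for a hypothetical nvgrace_gpu_map_device_mem() failure):

```c
#include <stddef.h>

/* Hypothetical stand-in for nvgrace_gpu_map_and_read(): returns the
 * number of device bytes read, or a negative errno. With the guard,
 * the mem_count == 0 case succeeds without ever entering the mapping
 * path, so a mapping failure cannot fail an out-of-range read. */
static int map_and_read(size_t mem_count, int map_would_fail)
{
	if (!mem_count)
		return 0;	/* nothing to read from device memory */
	if (map_would_fail)
		return -12;	/* -ENOMEM from the mapping path */
	/* ... map device memory and copy to the user buffer ... */
	return (int)mem_count;
}
```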



RE: [PATCH v17 3/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper

2024-02-07 Thread Tian, Kevin
> From: ank...@nvidia.com 
> Sent: Tuesday, February 6, 2024 7:01 AM
> 
> Note that the usemem memory is added by the VM Nvidia device driver [5]
> to the VM kernel as memblocks. Hence make the usable memory size
> memblock
> aligned.

Is the memblock size defined in the spec, or is it purely a guest implementation choice?

> 
> If the bare metal properties are not present, the driver registers the
> vfio-pci-core function pointers.

So if qemu doesn't generate such a property, the variant driver running
inside the guest will always fall back to the core functions, and the guest
vfio userspace will observe both the resmem and usemem bars. But then there
is nothing in place to prohibit mapping the resmem bar as cacheable.

Should this driver check the presence of either the ACPI property or the
resmem/usemem bars before enabling the variant function pointers?

> +config NVGRACE_GPU_VFIO_PCI
> + tristate "VFIO support for the GPU in the NVIDIA Grace Hopper
> Superchip"
> + depends on ARM64 || (COMPILE_TEST && 64BIT)
> + select VFIO_PCI_CORE
> + help
> +   VFIO support for the GPU in the NVIDIA Grace Hopper Superchip is
> +   required to assign the GPU device using KVM/qemu/etc.

"assign the GPU device to userspace"

> +
> +/* Memory size expected as non cached and reserved by the VM driver */
> +#define RESMEM_SIZE 0x4000
> +#define MEMBLK_SIZE 0x2000

also add a comment for MEMBLK_SIZE

> +
> +struct nvgrace_gpu_vfio_pci_core_device {

will nvgrace refer to a non-gpu device? if not probably all prefixes with
'nvgrace_gpu' can be simplified to 'nvgrace'.

btw following other variant drivers 'vfio' can be removed too.

> +
> +/*
> + * Both the usable (usemem) and the reserved (resmem) device memory
> region
> + * are exposed as a 64b fake BARs in the VM. These fake BARs must respond

s/VM/device/

> + * to the accesses on their respective PCI config space offsets.
> + *
> + * resmem BAR owns PCI_BASE_ADDRESS_2 & PCI_BASE_ADDRESS_3.
> + * usemem BAR owns PCI_BASE_ADDRESS_4 & PCI_BASE_ADDRESS_5.
> + */
> +static ssize_t
> +nvgrace_gpu_read_config_emu(struct vfio_device *core_vdev,
> + char __user *buf, size_t count, loff_t *ppos)
> +{
> + struct nvgrace_gpu_vfio_pci_core_device *nvdev =
> + container_of(core_vdev, struct
> nvgrace_gpu_vfio_pci_core_device,
> +  core_device.vdev);
> + struct mem_region *memregion = NULL;
> + u64 pos = *ppos & VFIO_PCI_OFFSET_MASK;
> + __le64 val64;
> + size_t register_offset;
> + loff_t copy_offset;
> + size_t copy_count;
> + int ret;
> +
> + ret = vfio_pci_core_read(core_vdev, buf, count, ppos);
> + if (ret < 0)
> + return ret;

here if core_read succeeds *ppos has been updated...

> +
> + if (vfio_pci_core_range_intersect_range(pos, count,
> PCI_BASE_ADDRESS_2,
> + sizeof(val64),
> + &copy_offset, &copy_count,
> + &register_offset))
> + memregion =
> nvgrace_gpu_memregion(RESMEM_REGION_INDEX, nvdev);
> + else if (vfio_pci_core_range_intersect_range(pos, count,
> +  PCI_BASE_ADDRESS_4,
> +  sizeof(val64),
> +  &copy_offset,
> &copy_count,
> +  &register_offset))
> + memregion =
> nvgrace_gpu_memregion(USEMEM_REGION_INDEX, nvdev);
> +
> + if (memregion) {
> + val64 = nvgrace_gpu_get_read_value(memregion->bar_size,
> +
> PCI_BASE_ADDRESS_MEM_TYPE_64 |
> +
> PCI_BASE_ADDRESS_MEM_PREFETCH,
> +memregion->bar_val);
> + if (copy_to_user(buf + copy_offset,
> +  (void *)&val64 + register_offset, copy_count))
> + return -EFAULT;

...but here it's not adjusted back upon error.
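A userspace model of one possible fix (stubbed helpers and made-up names; not the actual driver code):

```c
#include <stddef.h>

typedef long long loff_sim;	/* stand-in for loff_t */

/* Stub for vfio_pci_core_read(): advances *ppos by count on success. */
static int core_read_stub(size_t count, loff_sim *ppos)
{
	*ppos += (loff_sim)count;
	return (int)count;
}

/* Sketch of the fix being suggested: remember the starting offset and
 * rewind *ppos if the BAR emulation fails after the core read already
 * advanced it, so the caller never observes a moved offset on error. */
static int read_config_emu(size_t count, loff_sim *ppos, int emu_fails)
{
	loff_sim start = *ppos;
	int ret = core_read_stub(count, ppos);

	if (ret < 0)
		return ret;
	if (emu_fails) {		/* e.g. copy_to_user() failed */
		*ppos = start;		/* undo the offset advance */
		return -14;		/* -EFAULT */
	}
	return ret;
}
```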

> +
> +/*
> + * Read the data from the device memory (mapped either through ioremap
> + * or memremap) into the user buffer.
> + */
> +static int
> +nvgrace_gpu_map_and_read(struct nvgrace_gpu_vfio_pci_core_device
> *nvdev,
> +  char __user *buf, size_t mem_count, loff_t *ppos)
> +{
> + unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
> + u64 offset = *ppos & VFIO_PCI_OFFSET_MASK;
> + int ret;
> +
> + /*
> +  * Handle read on the BAR regions. Map to the target device memory
> +  * physical address and copy to the request read buffer.
> +  */

This duplicates the earlier comment for the function.

> +/*
> + * Read count bytes from the device memory at an offset. The actual device
> + * memory size (available) may not be a power-of-2. So the driver fakes
> + * the size to a power-of-2 (reported) when exposing to a user space driver.
> + *
> + * Reads extending beyond the reported size are truncated; reads starting
> + * beyond the reported size generate 

Re: [PATCH v17 3/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper

2024-02-07 Thread Ankit Agrawal
>> > +/* Memory size expected as non cached and reserved by the VM driver */
>> > +#define RESMEM_SIZE 0x4000
>> > +#define MEMBLK_SIZE 0x2000
>> > +
>>
>> Maybe use SZ_* definitions in linux/size.h
>
> Good suggestion.

Ack.

>>
>> Better move this part to the place between vfio_pci_core_enable() and
>> vfio_pci_core_finish_enable() like the others, to respect the expected
>> device initialization sequence of the life cycle.
>>
>> It doesn't bite anything right now, but if someone changes the behavior
>> of vfio_pci_core_finish_enable() in the future, they would have to
>> propose a patch for this.
>
> Agree.

Good point, will move it.

>>
>> Wouldn't it be better to do the map in the open path?
>
> AIUI the device would typically be used exclusively through the mmap of
> these ranges, these mappings are only for pread/pwrite type accesses,
> so I think it makes sense to map them on demand.

That's right, agree with Alex to keep it on-demand.


>> > +    * Determine how many bytes to be actually read from the
>> > device memory.
>> > +    * Read request beyond the actual device memory size is
>> > filled with ~0,
>> > +    * while those beyond the actual reported size is skipped.
>> > +    */
>> > +   if (offset >= memregion->memlength)
>> > +   mem_count = 0;
>>
>> If mem_count == 0, going through nvgrace_gpu_map_and_read() is not
>> necessary.
>
> Harmless, other than the possibly unnecessary call through to
> nvgrace_gpu_map_device_mem().  Maybe both nvgrace_gpu_map_and_read()
> and nvgrace_gpu_map_and_write() could conditionally return 0 as their
> first operation when !mem_count.  Thanks,
>
>Alex

IMO, this seems like adding too much code to reduce the call length for a
very specific case. If there aren't any strong opinions on this, I'm planning
to leave this code as it is.
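For readers following the thread, the clamp being discussed reduces to the following (an illustrative userspace model, not the driver code):

```c
#include <stddef.h>

/* Model of the read clamp: a request starting at or past the actual
 * device memory length reads 0 device bytes (the remainder is filled
 * with ~0 elsewhere); a request that merely extends past the end is
 * truncated to the bytes actually available. */
static size_t clamp_mem_count(size_t count, size_t offset, size_t memlength)
{
	if (offset >= memlength)
		return 0;
	return count < memlength - offset ? count : memlength - offset;
}
```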



Re: [PATCH v17 3/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper

2024-02-07 Thread Ankit Agrawal
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index 8999497011a2..529ec8966f58 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -23103,6 +23103,12 @@ L:   k...@vger.kernel.org
>>  S:   Maintained
>>  F:   drivers/vfio/platform/
>>
>> +VFIO NVIDIA GRACE GPU DRIVER
>> +M:   Ankit Agrawal 
>> +L:   k...@vger.kernel.org
>> +S:   Supported
>> +F:   drivers/vfio/pci/nvgrace-gpu/
>> +
>
> Entries should be alphabetical.  This will end up colliding with [1] so
> I'll plan to fix it either way.

I will make the change to put it at the right place.


> Otherwise just a couple optional comments from me below.  I see Zhi also
> has a few good comments.  I'd suggest soliciting a review from the other
> variant driver reviewers for this version and maybe we can make v18 the
> final version.  Thanks,
>
> Alex
> 
> [1]https://lore.kernel.org/all/20240205235427.2103714-1-alex.william...@redhat.com/

Great!


>> +static ssize_t
>> +nvgrace_gpu_write_config_emu(struct vfio_device *core_vdev,
>> +  const char __user *buf, size_t count, loff_t 
>> *ppos)
>> +{
>> + struct nvgrace_gpu_vfio_pci_core_device *nvdev =
>> + container_of(core_vdev, struct 
>> nvgrace_gpu_vfio_pci_core_device,
>> +  core_device.vdev);
>> + u64 pos = *ppos & VFIO_PCI_OFFSET_MASK;
>> + size_t register_offset;
>> + loff_t copy_offset;
>> + size_t copy_count;
>> + struct mem_region *memregion = NULL;
>
> Nit, consistency and reverse Christmas tree variable declaration would
> suggest pushing this up in the list, but it's not strictly required.

Ack, I'll make the change.


>> + if (index == USEMEM_REGION_INDEX && !memregion->memaddr) {
>> + memregion->memaddr = memremap(memregion->memphys,
>> +   memregion->memlength,
>> +   MEMREMAP_WB);
>> + if (!memregion->memaddr)
>> + ret = -ENOMEM;
>> + } else if (index == RESMEM_REGION_INDEX && !memregion->ioaddr) {
>> + memregion->ioaddr = ioremap_wc(memregion->memphys,
>> +    memregion->memlength);
>> + if (!memregion->ioaddr)
>> + ret = -ENOMEM;
>> + }
>
> As .memaddr and .ioaddr are a union we can consolidate the NULL test.

Ok, will do that.
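Since the two fields alias each other, the consolidated check Alex suggests might look like this (a userspace sketch with a stubbed mapping; not the driver code):

```c
#include <stddef.h>

enum { USEMEM_REGION_INDEX, RESMEM_REGION_INDEX };

/* Simplified model of the driver's region bookkeeping: .memaddr and
 * .ioaddr alias the same storage, so one NULL test covers both. */
struct mem_region_sim {
	union {
		void *memaddr;	/* memremap() result for usemem */
		void *ioaddr;	/* ioremap_wc() result for resmem */
	};
};

static void *fake_map(void)
{
	static int backing;
	return &backing;	/* stand-in for a successful mapping */
}

static int map_region(struct mem_region_sim *r, int index)
{
	if (!r->memaddr) {	/* consolidated test for either member */
		/* index would select memremap() vs ioremap_wc() here */
		r->memaddr = fake_map();
		if (!r->memaddr)
			return -12;	/* -ENOMEM */
	}
	(void)index;
	return 0;
}
```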



Re: [PATCH v17 2/3] vfio/pci: rename and export range_intesect_range

2024-02-07 Thread Ankit Agrawal
Thanks Kevin.

>> range_intesect_range determines an overlap between two ranges. If an
>
> s/intesect/intersect/

Will fix the typo.

>> + * vfio_pci_core_range_intersect_range() - Determine overlap between a
>> buffer
>> + *  and register offset ranges.
>> + * @range1_start:    start offset of the buffer
>> + * @count1:   number of buffer bytes.
>> + * @range2_start:    start register offset
>> + * @count2:   number of bytes of register
>> + * @start_offset:    start offset of overlap start in the buffer
>> + * @intersect_count: number of overlapping bytes
>> + * @register_offset: start offset of overlap start in register
>> + *
>> + * The function determines an overlap between a register and a buffer.
>> + * range1 represents the buffer and range2 represents register.
>> + *
>> + * Returns: true if there is overlap, false if not.
>> + * The overlap start and range is returned through function args.
>> + */
>> +bool vfio_pci_core_range_intersect_range(loff_t range1_start, size_t count1,
>> +  loff_t range2_start, size_t count2,
>> +  loff_t *start_offset,
>> +  size_t *intersect_count,
>> +  size_t *register_offset)
>
> based on description it's probably clearer to rename:
>
> range1_start -> buf_start
> count1 -> buf_cnt
> range2_start -> reg_start
> count2 -> reg_cnt
> start_offset -> buf_offset
>
> but not big deal, so:

Fine by me. Will rename them.

> Reviewed-by: Kevin Tian 

Thanks!


RE: [PATCH v17 1/3] vfio/pci: rename and export do_io_rw()

2024-02-07 Thread Tian, Kevin
> From: ank...@nvidia.com 
> Sent: Tuesday, February 6, 2024 7:01 AM
> 
> From: Ankit Agrawal 
> 
> do_io_rw() is used to read/write to the device MMIO. The grace hopper
> VFIO PCI variant driver requires this functionality to read/write to
> its memory.
> 
> Rename this as vfio_pci_core functions and export as GPL.
> 
> Signed-off-by: Ankit Agrawal 

Reviewed-by: Kevin Tian 



RE: [PATCH v17 2/3] vfio/pci: rename and export range_intesect_range

2024-02-07 Thread Tian, Kevin
> From: ank...@nvidia.com 
> Sent: Tuesday, February 6, 2024 7:01 AM
> 
> From: Ankit Agrawal 
> 
> range_intesect_range determines an overlap between two ranges. If an

s/intesect/intersect/

> overlap, the helper function returns the overlapping offset and size.
> 
> The VFIO PCI variant driver emulates the PCI config space BAR offset
> registers. These offset may be accessed for read/write with a variety
> of lengths including sub-word sizes from sub-word offsets. The driver
> makes use of this helper function to read/write the targeted part of
> the emulated register.
> 
> Make this a vfio_pci_core function, rename and export as GPL. Also
> update references in virtio driver.
> 
> Signed-off-by: Ankit Agrawal 
> ---
>  drivers/vfio/pci/vfio_pci_config.c | 45 +++
>  drivers/vfio/pci/virtio/main.c | 72 +++---
>  include/linux/vfio_pci_core.h  |  5 +++
>  3 files changed, 76 insertions(+), 46 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_config.c
> b/drivers/vfio/pci/vfio_pci_config.c
> index 672a1804af6a..4fc3c605af13 100644
> --- a/drivers/vfio/pci/vfio_pci_config.c
> +++ b/drivers/vfio/pci/vfio_pci_config.c
> @@ -1966,3 +1966,48 @@ ssize_t vfio_pci_config_rw(struct
> vfio_pci_core_device *vdev, char __user *buf,
> 
>   return done;
>  }
> +
> +/**
> + * vfio_pci_core_range_intersect_range() - Determine overlap between a
> buffer
> + *  and register offset ranges.
> + * @range1_start:    start offset of the buffer
> + * @count1:   number of buffer bytes.
> + * @range2_start:    start register offset
> + * @count2:   number of bytes of register
> + * @start_offset:    start offset of overlap start in the buffer
> + * @intersect_count: number of overlapping bytes
> + * @register_offset: start offset of overlap start in register
> + *
> + * The function determines an overlap between a register and a buffer.
> + * range1 represents the buffer and range2 represents register.
> + *
> + * Returns: true if there is overlap, false if not.
> + * The overlap start and range is returned through function args.
> + */
> +bool vfio_pci_core_range_intersect_range(loff_t range1_start, size_t count1,
> +  loff_t range2_start, size_t count2,
> +  loff_t *start_offset,
> +  size_t *intersect_count,
> +  size_t *register_offset)

based on description it's probably clearer to rename:

range1_start -> buf_start
count1 -> buf_cnt
range2_start -> reg_start
count2 -> reg_cnt
start_offset -> buf_offset

but not big deal, so:

Reviewed-by: Kevin Tian 
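For readers following along, the overlap math the kernel-doc describes can be sketched in plain C (an approximation of the documented semantics, not the kernel implementation):

```c
#include <stdbool.h>
#include <stddef.h>

typedef long long loff_sim;	/* stand-in for loff_t */

/* Overlap of [buf_start, buf_start+buf_cnt) with [reg_start, reg_start+reg_cnt).
 * On overlap, returns the overlap's offset within the buffer (buf_offset),
 * its length (intersect_cnt), and its offset within the register (reg_offset),
 * matching the renamed parameters Kevin suggests. */
static bool range_intersect(loff_sim buf_start, size_t buf_cnt,
			    loff_sim reg_start, size_t reg_cnt,
			    loff_sim *buf_offset, size_t *intersect_cnt,
			    size_t *reg_offset)
{
	loff_sim buf_end = buf_start + (loff_sim)buf_cnt;
	loff_sim reg_end = reg_start + (loff_sim)reg_cnt;
	loff_sim lo = buf_start > reg_start ? buf_start : reg_start;
	loff_sim hi = buf_end < reg_end ? buf_end : reg_end;

	if (lo >= hi)
		return false;	/* ranges do not overlap */
	*buf_offset = lo - buf_start;
	*intersect_cnt = (size_t)(hi - lo);
	*reg_offset = (size_t)(lo - reg_start);
	return true;
}
```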



Re: [PATCH v17 3/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper

2024-02-07 Thread Alex Williamson
On Tue, 6 Feb 2024 04:31:23 +0530
 wrote:

> From: Ankit Agrawal 
> 
> NVIDIA's upcoming Grace Hopper Superchip provides a PCI-like device
> for the on-chip GPU that is the logical OS representation of the
> internal proprietary chip-to-chip cache coherent interconnect.
> 
> The device is peculiar compared to a real PCI device in that whilst
> there is a real 64b PCI BAR1 (comprising region 2 & region 3) on the
> device, it is not used to access device memory once the faster
> chip-to-chip interconnect is initialized (occurs at the time of host
> system boot). The device memory is accessed instead using the chip-to-chip
> interconnect that is exposed as a contiguous physically addressable
> region on the host. This device memory aperture can be obtained from host
> ACPI table using device_property_read_u64(), according to the FW
> specification. Since the device memory is cache coherent with the CPU,
> it can be mmapped into the user VMA with a cacheable mapping using
> remap_pfn_range() and used like regular RAM. The device memory
> is not added to the host kernel, but mapped directly as this reduces
> memory wastage due to struct pages.
> 
> There is also a requirement of a reserved 1G uncached region (termed as
> resmem) to support the Multi-Instance GPU (MIG) feature [1]. This is
> to work around a HW defect. Based on [2], the requisite properties
> (uncached, unaligned access) can be achieved through a VM mapping (S1)
> of NORMAL_NC and host (S2) mapping with MemAttr[2:0]=0b101. To provide
> a different non-cached property to the reserved 1G region, it needs to
> be carved out from the device memory and mapped as a separate region
> in Qemu VMA with pgprot_writecombine(). pgprot_writecombine() sets the
> Qemu VMA page properties (pgprot) as NORMAL_NC.
> 
> Provide a VFIO PCI variant driver that adapts the unique device memory
> representation into a more standard PCI representation facing userspace.
> 
> The variant driver exposes these two regions - the non-cached reserved
> (resmem) and the cached rest of the device memory (termed as usemem) as
> separate VFIO 64b BAR regions. This is divergent from the baremetal
> approach, where the device memory is exposed as a device memory region.
> The decision for a different approach was taken in view of the fact that
> it would necessiate additional code in Qemu to discover and insert those
> regions in the VM IPA, along with the additional VM ACPI DSDT changes to
> communicate the device memory region IPA to the VM workloads. Moreover,
> this behavior would have to be added to a variety of emulators (beyond
> top of tree Qemu) out there desiring grace hopper support.
> 
> Since the device implements 64-bit BAR0, the VFIO PCI variant driver
> maps the uncached carved out region to the next available PCI BAR (i.e.
> comprising of region 2 and 3). The cached device memory aperture is
> assigned BAR region 4 and 5. Qemu will then naturally generate a PCI
> device in the VM with the uncached aperture reported as BAR2 region,
> the cacheable as BAR4. The variant driver provides emulation for these
> fake BARs' PCI config space offset registers.
> 
> The hardware ensures that the system does not crash when the memory
> is accessed with the memory enable turned off. It synthesizes ~0 reads
> and dropped writes on such access. So there is no need to support the
> disablement/enablement of BAR through PCI_COMMAND config space register.
> 
> The memory layout on the host looks like the following:
>                devmem (memlength)
> |--------------------------------------------------|
> |-------------cached------------------------|--NC--|
> |                                           |
> usemem.phys/memphys                         resmem.phys
> 
> PCI BARs need to be aligned to the power-of-2, but the actual memory on the
> device may not. A read or write access to the physical address from the
> last device PFN up to the next power-of-2 aligned physical address
> results in reading ~0 and dropped writes. Note that the GPU device
> driver [6] is capable of knowing the exact device memory size through
> separate means. The device memory size is primarily kept in the system
> ACPI tables for use by the VFIO PCI variant module.
> 
> Note that the usemem memory is added by the VM Nvidia device driver [5]
> to the VM kernel as memblocks. Hence make the usable memory size memblock
> aligned.
> 
> Currently there is no provision in KVM for a S2 mapping with
> MemAttr[2:0]=0b101, but there is an ongoing effort to provide the same [3].
> As previously mentioned, resmem is mapped pgprot_writecombine(), that
> sets the Qemu VMA page properties (pgprot) as NORMAL_NC. Using the
> proposed changes in [4] and [3], KVM marks the region with
> MemAttr[2:0]=0b101 in S2.
> 
> If the bare metal properties are not present, the driver registers the
> vfio-pci-core function pointers.
> 
> This goes along with a qemu series [6] to provide the necessary
> implementation of the Grace Hopper 
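The two size adjustments described in the commit message above, memblock-aligning the usable size and rounding the fake BAR up to a power of two, can be sketched as follows (MEMBLK_SIZE_SIM is an assumed 512M memblock size; these are illustrative helpers, not the driver code):

```c
#include <stdint.h>

#define MEMBLK_SIZE_SIM 0x20000000ULL	/* assumed 512M memblock size */

/* Round x up to a multiple of a; a must be a power of two. */
static uint64_t align_up(uint64_t x, uint64_t a)
{
	return (x + a - 1) & ~(a - 1);
}

/* Smallest power of two >= x, as needed for the fake BAR size. */
static uint64_t roundup_pow2(uint64_t x)
{
	uint64_t p = 1;

	while (p < x)
		p <<= 1;
	return p;
}
```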

Re: [PATCH v17 3/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper

2024-02-07 Thread Alex Williamson
On Thu, 8 Feb 2024 00:32:10 +0200
Zhi Wang  wrote:

> On Tue, 6 Feb 2024 04:31:23 +0530
>  wrote:
> 
> > From: Ankit Agrawal 
> > 
> > NVIDIA's upcoming Grace Hopper Superchip provides a PCI-like device
> > for the on-chip GPU that is the logical OS representation of the
> > internal proprietary chip-to-chip cache coherent interconnect.
> > 
> > The device is peculiar compared to a real PCI device in that whilst
> > there is a real 64b PCI BAR1 (comprising region 2 & region 3) on the
> > device, it is not used to access device memory once the faster
> > chip-to-chip interconnect is initialized (occurs at the time of host
> > system boot). The device memory is accessed instead using the
> > chip-to-chip interconnect that is exposed as a contiguous physically
> > addressable region on the host. This device memory aperture can be
> > obtained from host ACPI table using device_property_read_u64(),
> > according to the FW specification. Since the device memory is cache
> > coherent with the CPU, it can be mmapped into the user VMA with a
> > cacheable mapping using remap_pfn_range() and used like regular
> > RAM. The device memory is not added to the host kernel, but mapped
> > directly as this reduces memory wastage due to struct pages.
> > 
> > There is also a requirement of a reserved 1G uncached region (termed
> > as resmem) to support the Multi-Instance GPU (MIG) feature [1]. This
> > is to work around a HW defect. Based on [2], the requisite properties
> > (uncached, unaligned access) can be achieved through a VM mapping (S1)
> > of NORMAL_NC and host (S2) mapping with MemAttr[2:0]=0b101. To provide
> > a different non-cached property to the reserved 1G region, it needs to
> > be carved out from the device memory and mapped as a separate region
> > in Qemu VMA with pgprot_writecombine(). pgprot_writecombine() sets the
> > Qemu VMA page properties (pgprot) as NORMAL_NC.
> > 
> > Provide a VFIO PCI variant driver that adapts the unique device memory
> > representation into a more standard PCI representation facing
> > userspace.
> > 
> > The variant driver exposes these two regions - the non-cached reserved
> > (resmem) and the cached rest of the device memory (termed as usemem)
> > as separate VFIO 64b BAR regions. This is divergent from the baremetal
> > approach, where the device memory is exposed as a device memory
> > region. The decision for a different approach was taken in view of
> > the fact that it would necessitate additional code in Qemu to discover
> > and insert those regions in the VM IPA, along with the additional VM
> > ACPI DSDT changes to communicate the device memory region IPA to the
> > VM workloads. Moreover, this behavior would have to be added to a
> > variety of emulators (beyond top of tree Qemu) out there desiring
> > grace hopper support.
> > 
> > Since the device implements 64-bit BAR0, the VFIO PCI variant driver
> > maps the uncached carved out region to the next available PCI BAR
> > (i.e. comprising of region 2 and 3). The cached device memory
> > aperture is assigned BAR region 4 and 5. Qemu will then naturally
> > generate a PCI device in the VM with the uncached aperture reported
> > as BAR2 region, the cacheable as BAR4. The variant driver provides
> > emulation for these fake BARs' PCI config space offset registers.
> > 
> > The hardware ensures that the system does not crash when the memory
> > is accessed with the memory enable turned off. It synthesizes ~0 reads
> > and dropped writes on such access. So there is no need to support the
> > disablement/enablement of BAR through PCI_COMMAND config space
> > register.
> > 
> > The memory layout on the host looks like the following:
> >                devmem (memlength)
> > |--------------------------------------------------|
> > |-------------cached------------------------|--NC--|
> > |                                           |
> > usemem.phys/memphys                         resmem.phys
> > 
> > PCI BARs need to be aligned to the power-of-2, but the actual memory
> > on the device may not. A read or write access to the physical address
> > from the last device PFN up to the next power-of-2 aligned physical
> > address results in reading ~0 and dropped writes. Note that the GPU
> > device driver [6] is capable of knowing the exact device memory size
> > through separate means. The device memory size is primarily kept in
> > the system ACPI tables for use by the VFIO PCI variant module.
> > 
> > Note that the usemem memory is added by the VM Nvidia device driver
> > [5] to the VM kernel as memblocks. Hence make the usable memory size
> > memblock aligned.
> > 
> > Currently there is no provision in KVM for a S2 mapping with
> > MemAttr[2:0]=0b101, but there is an ongoing effort to provide the
> > same [3]. As previously mentioned, resmem is mapped
> > pgprot_writecombine(), that sets the Qemu VMA page properties
> > (pgprot) as NORMAL_NC. Using the proposed changes in [4] and [3], KVM
> > marks the region with 

Re: [PATCH v5] remoteproc: Make rproc_get_by_phandle() work for clusters

2024-02-07 Thread Tanmay Shah
I am sorry, I missed the fact that this patch was picked up and is
available on the for-next branch.

Won't be sending a new one now.

Thanks,

Tanmay

On 2/7/24 4:18 PM, Tanmay Shah wrote:
> Rejected-by: Tanmay Shah 
>
> I will send a new v5 with the changelog included.
>
> On 1/30/24 9:48 AM, Tanmay Shah wrote:
> > From: Mathieu Poirier 
> >
> > Multi-cluster remoteproc designs typically have the following DT
> > declaration:
> >
> > remoteproc-cluster {
> > compatible = "soc,remoteproc-cluster";
> >
> > core0: core0 {
> > compatible = "soc,remoteproc-core"
> > memory-region;
> > sram;
> > };
> >
> > core1: core1 {
> > compatible = "soc,remoteproc-core"
> > memory-region;
> > sram;
> > }
> > };
> >
> > A driver exists for the cluster rather than the individual cores
> > themselves so that operation mode and HW specific configurations
> > applicable to the cluster can be made.
> >
> > Because the driver exists at the cluster level and not the individual
> > core level, function rproc_get_by_phandle() fails to return the
> > remoteproc associated with the phandle it is called for.
> >
> > This patch enhances rproc_get_by_phandle() by looking for the cluster's
> > driver when the driver for the immediate remoteproc's parent is not
> > found.
> >
> > Reported-by: Ben Levinsky 
> > Signed-off-by: Mathieu Poirier 
> > Co-developed-by: Tarak Reddy 
> > Signed-off-by: Tarak Reddy 
> > Co-developed-by: Tanmay Shah 
> > Signed-off-by: Tanmay Shah 
> > ---
> >  drivers/remoteproc/remoteproc_core.c | 29 ++--
> >  1 file changed, 27 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/remoteproc/remoteproc_core.c 
> > b/drivers/remoteproc/remoteproc_core.c
> > index 695cce218e8c..f276956f2c5c 100644
> > --- a/drivers/remoteproc/remoteproc_core.c
> > +++ b/drivers/remoteproc/remoteproc_core.c
> > @@ -33,6 +33,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  #include 
> >  #include 
> >  #include 
> > @@ -2112,6 +2113,7 @@ EXPORT_SYMBOL(rproc_detach);
> >  struct rproc *rproc_get_by_phandle(phandle phandle)
> >  {
> > struct rproc *rproc = NULL, *r;
> > +   struct device_driver *driver;
> > struct device_node *np;
> >  
> > np = of_find_node_by_phandle(phandle);
> > @@ -2122,7 +2124,26 @@ struct rproc *rproc_get_by_phandle(phandle phandle)
> > list_for_each_entry_rcu(r, &rproc_list, node) {
> > if (r->dev.parent && device_match_of_node(r->dev.parent, np)) {
> > /* prevent underlying implementation from being removed 
> > */
> > -   if (!try_module_get(r->dev.parent->driver->owner)) {
> > +
> > +   /*
> > +* If the remoteproc's parent has a driver, the
> > +* remoteproc is not part of a cluster and we can use
> > +* that driver.
> > +*/
> > +   driver = r->dev.parent->driver;
> > +
> > +   /*
> > +* If the remoteproc's parent does not have a driver,
> > +* look for the driver associated with the cluster.
> > +*/
> > +   if (!driver) {
> > +   if (r->dev.parent->parent)
> > +   driver = r->dev.parent->parent->driver;
> > +   if (!driver)
> > +   break;
> > +   }
> > +
> > +   if (!try_module_get(driver->owner)) {
> > dev_err(&r->dev, "can't get owner\n");
> > break;
> > }
> > @@ -2533,7 +2554,11 @@ EXPORT_SYMBOL(rproc_free);
> >   */
> >  void rproc_put(struct rproc *rproc)
> >  {
> > -   module_put(rproc->dev.parent->driver->owner);
> > +   if (rproc->dev.parent->driver)
> > +   module_put(rproc->dev.parent->driver->owner);
> > +   else
> > +   module_put(rproc->dev.parent->parent->driver->owner);
> > +
> > put_device(&rproc->dev);
> >  }
> >  EXPORT_SYMBOL(rproc_put);
> >
> > base-commit: 99f59b148871dadb9104366e3d25b120a97f897b



Re: [PATCH v17 3/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper

2024-02-07 Thread Zhi Wang
On Tue, 6 Feb 2024 04:31:23 +0530
 wrote:

> From: Ankit Agrawal 
> 
> NVIDIA's upcoming Grace Hopper Superchip provides a PCI-like device
> for the on-chip GPU that is the logical OS representation of the
> internal proprietary chip-to-chip cache coherent interconnect.
> 
> The device is peculiar compared to a real PCI device in that whilst
> there is a real 64b PCI BAR1 (comprising region 2 & region 3) on the
> device, it is not used to access device memory once the faster
> chip-to-chip interconnect is initialized (occurs at the time of host
> system boot). The device memory is accessed instead using the
> chip-to-chip interconnect that is exposed as a contiguous physically
> addressable region on the host. This device memory aperture can be
> obtained from host ACPI table using device_property_read_u64(),
> according to the FW specification. Since the device memory is cache
> coherent with the CPU, it can be mmapped into the user VMA with a
> cacheable mapping using remap_pfn_range() and used like regular
> RAM. The device memory is not added to the host kernel, but mapped
> directly as this reduces memory wastage due to struct pages.
> 
> There is also a requirement of a reserved 1G uncached region (termed
> as resmem) to support the Multi-Instance GPU (MIG) feature [1]. This
> is to work around a HW defect. Based on [2], the requisite properties
> (uncached, unaligned access) can be achieved through a VM mapping (S1)
> of NORMAL_NC and host (S2) mapping with MemAttr[2:0]=0b101. To provide
> a different non-cached property to the reserved 1G region, it needs to
> be carved out from the device memory and mapped as a separate region
> in Qemu VMA with pgprot_writecombine(). pgprot_writecombine() sets the
> Qemu VMA page properties (pgprot) as NORMAL_NC.
> 
> Provide a VFIO PCI variant driver that adapts the unique device memory
> representation into a more standard PCI representation facing
> userspace.
> 
> The variant driver exposes these two regions - the non-cached reserved
> (resmem) and the cached rest of the device memory (termed as usemem)
> as separate VFIO 64b BAR regions. This is divergent from the baremetal
> approach, where the device memory is exposed as a device memory
> region. The decision for a different approach was taken in view of
> the fact that it would necessitate additional code in Qemu to discover
> and insert those regions in the VM IPA, along with the additional VM
> ACPI DSDT changes to communicate the device memory region IPA to the
> VM workloads. Moreover, this behavior would have to be added to a
> variety of emulators (beyond top of tree Qemu) out there desiring
> grace hopper support.
> 
> Since the device implements 64-bit BAR0, the VFIO PCI variant driver
> maps the uncached carved out region to the next available PCI BAR
> (i.e. comprising of region 2 and 3). The cached device memory
> aperture is assigned BAR region 4 and 5. Qemu will then naturally
> generate a PCI device in the VM with the uncached aperture reported
> as BAR2 region, the cacheable as BAR4. The variant driver provides
> emulation for these fake BARs' PCI config space offset registers.
> 
> The hardware ensures that the system does not crash when the memory
> is accessed with the memory enable turned off. It synthesizes ~0 reads
> and dropped writes on such access. So there is no need to support the
> disablement/enablement of BAR through PCI_COMMAND config space
> register.
> 
> The memory layout on the host looks like the following:
>                devmem (memlength)
> |--------------------------------------------------|
> |-------------cached------------------------|--NC--|
> |                                           |
> usemem.phys/memphys                         resmem.phys
> 
> PCI BARs need to be aligned to the power-of-2, but the actual memory
> on the device may not. A read or write access to the physical address
> from the last device PFN up to the next power-of-2 aligned physical
> address results in reading ~0 and dropped writes. Note that the GPU
> device driver [6] is capable of knowing the exact device memory size
> through separate means. The device memory size is primarily kept in
> the system ACPI tables for use by the VFIO PCI variant module.
> 
> Note that the usemem memory is added by the VM Nvidia device driver
> [5] to the VM kernel as memblocks. Hence make the usable memory size
> memblock aligned.
> 
> Currently there is no provision in KVM for an S2 mapping with
> MemAttr[2:0]=0b101, but there is an ongoing effort to provide the
> same [3]. As previously mentioned, resmem is mapped
> with pgprot_writecombine(), which sets the Qemu VMA page properties
> (pgprot) as NORMAL_NC. Using the proposed changes in [4] and [3], KVM
> marks the region with MemAttr[2:0]=0b101 in S2.
> 
> If the bare metal properties are not present, the driver registers the
> vfio-pci-core function pointers.
> 
> This goes along with a Qemu series [6] that provides the necessary
> implementation of the Grace 

Re: [PATCH v5] remoteproc: Make rproc_get_by_phandle() work for clusters

2024-02-07 Thread Tanmay Shah
Rejected-by: Tanmay Shah 

I will send new v5 with changelog included.

On 1/30/24 9:48 AM, Tanmay Shah wrote:
> From: Mathieu Poirier 
>
> Multi-cluster remoteproc designs typically have the following DT
> declaration:
>
> remoteproc-cluster {
> compatible = "soc,remoteproc-cluster";
>
> core0: core0 {
> compatible = "soc,remoteproc-core"
> memory-region;
> sram;
> };
>
> core1: core1 {
> compatible = "soc,remoteproc-core"
> memory-region;
> sram;
> }
> };
>
> A driver exists for the cluster rather than the individual cores
> themselves so that operation mode and HW specific configurations
> applicable to the cluster can be made.
>
> Because the driver exists at the cluster level and not the individual
> core level, function rproc_get_by_phandle() fails to return the
> remoteproc associated with the phandle it is called for.
>
> This patch enhances rproc_get_by_phandle() by looking for the cluster's
> driver when the driver for the immediate remoteproc's parent is not
> found.
>
> Reported-by: Ben Levinsky 
> Signed-off-by: Mathieu Poirier 
> Co-developed-by: Tarak Reddy 
> Signed-off-by: Tarak Reddy 
> Co-developed-by: Tanmay Shah 
> Signed-off-by: Tanmay Shah 
> ---
>  drivers/remoteproc/remoteproc_core.c | 29 ++--
>  1 file changed, 27 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/remoteproc/remoteproc_core.c 
> b/drivers/remoteproc/remoteproc_core.c
> index 695cce218e8c..f276956f2c5c 100644
> --- a/drivers/remoteproc/remoteproc_core.c
> +++ b/drivers/remoteproc/remoteproc_core.c
> @@ -33,6 +33,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -2112,6 +2113,7 @@ EXPORT_SYMBOL(rproc_detach);
>  struct rproc *rproc_get_by_phandle(phandle phandle)
>  {
>   struct rproc *rproc = NULL, *r;
> + struct device_driver *driver;
>   struct device_node *np;
>  
>   np = of_find_node_by_phandle(phandle);
> @@ -2122,7 +2124,26 @@ struct rproc *rproc_get_by_phandle(phandle phandle)
>   list_for_each_entry_rcu(r, &rproc_list, node) {
>   if (r->dev.parent && device_match_of_node(r->dev.parent, np)) {
>   /* prevent underlying implementation from being removed 
> */
> - if (!try_module_get(r->dev.parent->driver->owner)) {
> +
> + /*
> +  * If the remoteproc's parent has a driver, the
> +  * remoteproc is not part of a cluster and we can use
> +  * that driver.
> +  */
> + driver = r->dev.parent->driver;
> +
> + /*
> +  * If the remoteproc's parent does not have a driver,
> +  * look for the driver associated with the cluster.
> +  */
> + if (!driver) {
> + if (r->dev.parent->parent)
> + driver = r->dev.parent->parent->driver;
> + if (!driver)
> + break;
> + }
> +
> + if (!try_module_get(driver->owner)) {
>   dev_err(&r->dev, "can't get owner\n");
>   break;
>   }
> @@ -2533,7 +2554,11 @@ EXPORT_SYMBOL(rproc_free);
>   */
>  void rproc_put(struct rproc *rproc)
>  {
> - module_put(rproc->dev.parent->driver->owner);
> + if (rproc->dev.parent->driver)
> + module_put(rproc->dev.parent->driver->owner);
> + else
> + module_put(rproc->dev.parent->parent->driver->owner);
> +
>   put_device(&rproc->dev);
>  }
>  EXPORT_SYMBOL(rproc_put);
>
> base-commit: 99f59b148871dadb9104366e3d25b120a97f897b



Re: [PATCH] remoteproc: qcom: Add NOTIFY_FATAL event type to SSR subdevice

2024-02-07 Thread Vignesh Viswanathan



On 11/25/2023 12:24 AM, Vignesh Viswanathan wrote:
> 
> 
> On 9/4/2023 10:23 PM, Vignesh Viswanathan wrote:
>>
>>
>> On 7/16/2023 1:50 AM, Bjorn Andersson wrote:
>>> On Wed, May 03, 2023 at 11:51:46AM +0530, Vignesh Viswanathan wrote:
 Currently the SSR subdevice notifies the client driver on crash of the
 rproc from the recovery workqueue using the BEFORE_SHUTDOWN event.
 However the client driver might be interested to know that the device
 has crashed immediately to pause any further transactions with the
 rproc. This calls for an event to be sent to the driver in the IRQ
 context as soon as the rproc crashes.

>>>
>>> Please make your argumentation more concrete, I can only guess what
>>> client driver you're referring to.
>>>
>>> You can do this either by spelling out which actual problem you're
>>> solving, or better yet, include some patches in the series that actually
>>> uses this interface.
>>>
>>
>> Hi Bjorn,
>>
>> Apologies for the delay in response.
>>
>> The client driver in my scenario is a Wi-Fi driver which is continuously
>> queuing data to the remoteproc and needs to know if remoteproc crashes
>> as soon as possible to stop queuing further data and also dump some 
>> debug statistics on the driver side that could potentially help in
>> debugging why the remoteproc crashed.
>>
>> Also in the case with upcoming Wi-Fi 7 targets with multi-link operation, 
>> the driver might need to know that the remoteproc has crashed
>> instantly to handle some multi-link specific handling.
>>
>> The ath11k/ath12k WLAN drivers today partially have support for handling
>> such FATAL notification but it has not been upstreamed yet.
>>
>> Reference patch: 
>> https://git.codelinaro.org/clo/qsdk/oss/system/feeds/wlan-open/-/blob/win.wlan_host_opensource.1.0/mac80211/patches/031-ath11k-print-stats-on-crash.patch
>>  -- event SUBSYS_PREPARE_FOR_FATAL_SHUTDOWN.
>>
>> Also, Mukesh mentioned earlier that in some MSM targets with PCIe where 
>> latency cannot be tolerated, a similar downstream patch adds support for 
>> "early notifier". If this patch is accepted, the early notifier can also be 
>> replaced to use the same NOTIFY_FATAL event from SSR Subdevice
>>
>> https://git.codelinaro.org/clo/la/kernel/msm-5.15/-/commit/7583d24de337aa1bf7c375a7da706af9b995b9a1#a840754ebb0e24e88adbf48177e1abd0830b72d2
>> https://git.codelinaro.org/clo/la/kernel/msm-5.15/-/commit/257de41c63a5a51a081cc7887cdaa4a46e4d1744
> 
> Hi Bjorn,
> 
> Gentle reminder for this patch.
> 
> Thanks,
> Vignesh

Hi Bjorn,

Could you please help review this patch and let me know if any comments.

Thanks,
Vignesh
>>
>> Thanks,
>> Vignesh
>>
>>> Regards,
>>> Bjorn



Re: [PATCH] tracing: use ring_buffer_record_is_set_on() in tracer_tracing_is_on()

2024-02-07 Thread Steven Rostedt
On Wed, 07 Feb 2024 14:33:21 +0100
Sven Schnelle  wrote:

> My assumption without reading the code is that something like this
> happens:
> 
> CPU0 CPU1
> [ringbuffer enabled]
>  ring_buffer_write()
>  if (atomic_read(&buffer->record_disabled))
> goto out;
> echo 0 > tracing_on
> record_disabled |= RB_BUFFER_OFF
> csum1=`md5sum trace`

Note, CPU1 is running with preemption disabled, so for this to
happen, something really bad happened on CPU0 to delay the preempt-disabled
section long enough to allow the trace to be read. Perhaps we should have
the return of the echo 0 > tracing_on require a synchronize_rcu() to
make sure all ring buffers see it disabled before it returns.

But unless your system is under really heavy stress, causing the
preempt-disabled section to take so long, I highly doubt this was the
race.

-- Steve


>  [adds trace entry to ring buffer,
>   overwriting savedcmd_lines entry because
>   it thinks ring buffer is enabled]
> csum2=`md5sum trace`




Re: [PATCH v5 7/8] selftests/ftrace: add kprobe test cases for VFS type "%pd" and "%pD"

2024-02-07 Thread Google
On Thu, 25 Jan 2024 15:39:22 +0800
Ye Bin  wrote:

> This patch adds test cases for the new print format type "%pd/%pD". The
> test cases cover the following items:
> 1. Test that README lists the "%pd/%pD" type;
> 2. Test the "%pd" type on dput();
> 3. Test the "%pD" type on vfs_read();
> 
> These test cases require CONFIG_HAVE_FUNCTION_ARG_ACCESS_API to be
> enabled.
> 
> Signed-off-by: Ye Bin 
> ---
>  .../ftrace/test.d/kprobe/kprobe_args_vfs.tc   | 43 +++
>  1 file changed, 43 insertions(+)
>  create mode 100644 
> tools/testing/selftests/ftrace/test.d/kprobe/kprobe_args_vfs.tc
> 
> diff --git a/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_args_vfs.tc 
> b/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_args_vfs.tc
> new file mode 100644
> index ..cf0599b90f1a
> --- /dev/null
> +++ b/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_args_vfs.tc
> @@ -0,0 +1,43 @@
> +#!/bin/sh
> +# SPDX-License-Identifier: GPL-2.0
> +# description: Kprobe event VFS type argument
> +# requires: kprobe_events
> +
> +: "Test argument %pd/%pD in README"
> +grep -q "%pd/%pD" README

No. This means, if the kernel doesn't support %pd/%pD, this test fails.
But the latest version of the ftracetest will be used for testing stable
kernels (which will not support this feature).
So you should make test returning 'unsupported' for that case.

You can add a requirement as

# requires: kprobe_events "%pd/%pD":README

(e.g. tools/testing/selftests/ftrace/test.d/dynevent/eprobes_syntax_errors.tc)

Thank you,

> +
> +: "Test argument %pd with name"
> +echo 'p:testprobe dput name=$arg1:%pd' > kprobe_events
> +echo 1 > events/kprobes/testprobe/enable
> +grep -q "1" events/kprobes/testprobe/enable
> +echo 0 > events/kprobes/testprobe/enable
> +grep "dput" trace | grep -q "enable"
> +echo "" > kprobe_events
> +echo "" > trace
> +
> +: "Test argument %pd without name"
> +echo 'p:testprobe dput $arg1:%pd' > kprobe_events
> +echo 1 > events/kprobes/testprobe/enable
> +grep -q "1" events/kprobes/testprobe/enable
> +echo 0 > events/kprobes/testprobe/enable
> +grep "dput" trace | grep -q "enable"
> +echo "" > kprobe_events
> +echo "" > trace
> +
> +: "Test argument %pD with name"
> +echo 'p:testprobe vfs_read name=$arg1:%pD' > kprobe_events
> +echo 1 > events/kprobes/testprobe/enable
> +grep -q "1" events/kprobes/testprobe/enable
> +echo 0 > events/kprobes/testprobe/enable
> +grep "vfs_read" trace | grep -q "enable"
> +echo "" > kprobe_events
> +echo "" > trace
> +
> +: "Test argument %pD without name"
> +echo 'p:testprobe vfs_read $arg1:%pD' > kprobe_events
> +echo 1 > events/kprobes/testprobe/enable
> +grep -q "1"  events/kprobes/testprobe/enable
> +echo 0 > events/kprobes/testprobe/enable
> +grep "vfs_read" trace | grep -q "enable"
> +echo "" > kprobe_events
> +echo "" > trace
> -- 
> 2.31.1
> 
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH v5 5/8] tracing: add new type "%pd/%pD" in readme_msg[]

2024-02-07 Thread Google
On Thu, 25 Jan 2024 15:39:20 +0800
Ye Bin  wrote:

> Signed-off-by: Ye Bin 
> ---
>  kernel/trace/trace.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index 2a7c6fd934e9..13197d3b86bd 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -5745,7 +5745,7 @@ static const char readme_msg[] =
>   "\t   +|-[u](), \\imm-value, 
> \\\"imm-string\"\n"
>   "\t type: s8/16/32/64, u8/16/32/64, x8/16/32/64, char, string, 
> symbol,\n"
>   "\t   b@/, ustring,\n"
> - "\t   symstr, \\[\\]\n"
> + "\t   symstr, %pd/%pD \\[\\]\n"

Almost! you missed the ',' after %pd/%pD. :)

Thank you,


>  #ifdef CONFIG_HIST_TRIGGERS
>   "\tfield:  ;\n"
>   "\tstype: u8/u16/u32/u64, s8/s16/s32/s64, pid_t,\n"
> -- 
> 2.31.1
> 
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH v5 4/8] tracing/probes: support '%pd/%pD' type for fprobe

2024-02-07 Thread Google
On Thu, 25 Jan 2024 15:39:19 +0800
Ye Bin  wrote:

> Support print type '%pd/%pD' for printing a dentry's or file's name.
> 

nit: Looks good to me. but the patch ordering seems a bit strange.
This should be next to [2/8].

Acked-by: Masami Hiramatsu (Google) 

Thank you,

> Signed-off-by: Ye Bin 
> ---
>  kernel/trace/trace_fprobe.c | 6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/kernel/trace/trace_fprobe.c b/kernel/trace/trace_fprobe.c
> index 7d2ddbcfa377..988d68e906ad 100644
> --- a/kernel/trace/trace_fprobe.c
> +++ b/kernel/trace/trace_fprobe.c
> @@ -976,6 +976,7 @@ static int __trace_fprobe_create(int argc, const char 
> *argv[])
>   char gbuf[MAX_EVENT_NAME_LEN];
>   char sbuf[KSYM_NAME_LEN];
>   char abuf[MAX_BTF_ARGS_LEN];
> + char *dbuf = NULL;
>   bool is_tracepoint = false;
>   struct tracepoint *tpoint = NULL;
>   struct traceprobe_parse_context ctx = {
> @@ -1086,6 +1087,10 @@ static int __trace_fprobe_create(int argc, const char 
> *argv[])
>   argv = new_argv;
>   }
>  
> + ret = traceprobe_expand_dentry_args(argc, argv, &dbuf);
> + if (ret)
> + goto out;
> +
>   /* setup a probe */
>   tf = alloc_trace_fprobe(group, event, symbol, tpoint, maxactive,
>   argc, is_return);
> @@ -1131,6 +1136,7 @@ static int __trace_fprobe_create(int argc, const char 
> *argv[])
>   trace_probe_log_clear();
>   kfree(new_argv);
>   kfree(symbol);
> + kfree(dbuf);
>   return ret;
>  
>  parse_error:
> -- 
> 2.31.1
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH v5 1/8] tracing/probes: add traceprobe_expand_dentry_args() helper

2024-02-07 Thread Google
Hi Ye,

On Thu, 25 Jan 2024 15:39:16 +0800
Ye Bin  wrote:

> Add traceprobe_expand_dentry_args() to expand dentry args. This API
> prepares for supporting the "%pd" print format for kprobes.
> 
> Signed-off-by: Ye Bin 
> ---
>  kernel/trace/trace_probe.c | 50 ++
>  kernel/trace/trace_probe.h |  2 ++
>  2 files changed, 52 insertions(+)
> 
> diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
> index 4dc74d73fc1d..0c1caf0f474a 100644
> --- a/kernel/trace/trace_probe.c
> +++ b/kernel/trace/trace_probe.c
> @@ -1565,6 +1565,56 @@ const char **traceprobe_expand_meta_args(int argc, 
> const char *argv[],
>   return ERR_PTR(ret);
>  }
>  
> +/* @buf: *buf must be NULL. Caller must free *buf */
> +int traceprobe_expand_dentry_args(int argc, const char *argv[], char **buf)
> +{
> + int i, used, ret;
> + int bufsize = MAX_DENTRY_ARGS_LEN;

nit: this bufsize is constant, thus you should use MAX_DENTRY_ARGS_LEN
directly. (or add 'const', or maybe you can update bufsize as the
remaining size)

Thanks,

> + char *tmpbuf = NULL;
> +
> + if (*buf)
> + return -EINVAL;
> +
> + used = 0;
> + for (i = 0; i < argc; i++) {
> + if (glob_match("*:%pd", argv[i])) {
> + char *tmp;
> + char *equal;
> +
> + if (!tmpbuf) {
> + tmpbuf = kmalloc(bufsize, GFP_KERNEL);
> + if (!tmpbuf)
> + return -ENOMEM;
> + }
> +
> + tmp = kstrdup(argv[i], GFP_KERNEL);
> + if (!tmp)
> + goto nomem;
> +
> + equal = strchr(tmp, '=');
> + if (equal)
> + *equal = '\0';
> + tmp[strlen(argv[i]) - 4] = '\0';
> + ret = snprintf(tmpbuf + used, bufsize - used,
> +"%s%s+0x0(+0x%zx(%s)):string",
> +equal ? tmp : "", equal ? "=" : "",
> +offsetof(struct dentry, d_name.name),
> +equal ? equal + 1 : tmp);
> + kfree(tmp);
> + if (ret >= bufsize - used)
> + goto nomem;
> + argv[i] = tmpbuf + used;
> + used += ret + 1;
> + }
> + }
> +
> + *buf = tmpbuf;
> + return 0;
> +nomem:
> + kfree(tmpbuf);
> + return -ENOMEM;
> +}
> +
>  void traceprobe_finish_parse(struct traceprobe_parse_context *ctx)
>  {
>   clear_btf_context(ctx);
> diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
> index 850d9ecb6765..5800513439f3 100644
> --- a/kernel/trace/trace_probe.h
> +++ b/kernel/trace/trace_probe.h
> @@ -34,6 +34,7 @@
>  #define MAX_ARRAY_LEN64
>  #define MAX_ARG_NAME_LEN 32
>  #define MAX_BTF_ARGS_LEN 128
> +#define MAX_DENTRY_ARGS_LEN  256
>  #define MAX_STRING_SIZE  PATH_MAX
>  #define MAX_ARG_BUF_LEN  (MAX_TRACE_ARGS * MAX_ARG_NAME_LEN)
>  
> @@ -402,6 +403,7 @@ extern int traceprobe_parse_probe_arg(struct trace_probe 
> *tp, int i,
>  const char **traceprobe_expand_meta_args(int argc, const char *argv[],
>int *new_argc, char *buf, int bufsize,
>struct traceprobe_parse_context *ctx);
> +extern int traceprobe_expand_dentry_args(int argc, const char *argv[], char 
> **buf);
>  
>  extern int traceprobe_update_arg(struct probe_arg *arg);
>  extern void traceprobe_free_probe_arg(struct probe_arg *arg);
> -- 
> 2.31.1
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH net-next v5] virtio_net: Support RX hash XDP hint

2024-02-07 Thread Paolo Abeni
On Wed, 2024-02-07 at 10:54 +0800, Liang Chen wrote:
> On Tue, Feb 6, 2024 at 6:44 PM Paolo Abeni  wrote:
> > 
> > On Sat, 2024-02-03 at 10:56 +0800, Liang Chen wrote:
> > > On Sat, Feb 3, 2024 at 12:20 AM Jesper Dangaard Brouer  
> > > wrote:
> > > > On 02/02/2024 13.11, Liang Chen wrote:
> > [...]
> > > > > @@ -1033,6 +1039,16 @@ static void put_xdp_frags(struct xdp_buff *xdp)
> > > > >   }
> > > > >   }
> > > > > 
> > > > > +static void virtnet_xdp_save_rx_hash(struct virtnet_xdp_buff 
> > > > > *virtnet_xdp,
> > > > > +  struct net_device *dev,
> > > > > +  struct virtio_net_hdr_v1_hash 
> > > > > *hdr_hash)
> > > > > +{
> > > > > + if (dev->features & NETIF_F_RXHASH) {
> > > > > + virtnet_xdp->hash_value = hdr_hash->hash_value;
> > > > > + virtnet_xdp->hash_report = hdr_hash->hash_report;
> > > > > + }
> > > > > +}
> > > > > +
> > > > 
> > > > Would it be possible to store a pointer to hdr_hash in virtnet_xdp_buff,
> > > > with the purpose of delaying extracting this, until and only if XDP
> > > > bpf_prog calls the kfunc?
> > > > 
> > > 
> > > That seems to be the way v1 works,
> > > https://lore.kernel.org/all/20240122102256.261374-1-liangchen.li...@gmail.com/
> > > . But it was pointed out that the inline header may be overwritten by
> > > the xdp prog, so the hash is copied out to maintain its integrity.
> > 
> > Why? isn't XDP supposed to get write access only to the pkt
> > contents/buffer?
> > 
> 
> Normally, an XDP program accesses only the packet data. However,
> there's also an XDP RX Metadata area, referenced by the data_meta
> pointer. This pointer can be adjusted with bpf_xdp_adjust_meta to
> point somewhere ahead of the data buffer, thereby granting the XDP
> program access to the virtio header located immediately before the
> packet data.

AFAICS bpf_xdp_adjust_meta() does not allow moving the meta_data before
xdp->data_hard_start:

https://elixir.bootlin.com/linux/latest/source/net/core/filter.c#L4210

and virtio net set such field after the virtio_net_hdr:

https://elixir.bootlin.com/linux/latest/source/drivers/net/virtio_net.c#L1218
https://elixir.bootlin.com/linux/latest/source/drivers/net/virtio_net.c#L1420

I don't see how the virtio hdr could be touched? Possibly even more
important: if such a thing is possible, I think it should somehow be
denied (for the same reason an H/W nic should prevent XDP from
modifying its own buffer descriptor).

Cheers,

Paolo




Re: [PATCH v7 13/36] function_graph: Have the instances use their own ftrace_ops for filtering

2024-02-07 Thread Google
On Wed,  7 Feb 2024 00:09:54 +0900
"Masami Hiramatsu (Google)"  wrote:

> diff --git a/arch/loongarch/kernel/ftrace_dyn.c 
> b/arch/loongarch/kernel/ftrace_dyn.c
> index 73858c9029cc..81d18b911cc1 100644
> --- a/arch/loongarch/kernel/ftrace_dyn.c
> +++ b/arch/loongarch/kernel/ftrace_dyn.c
> @@ -241,10 +241,17 @@ void prepare_ftrace_return(unsigned long self_addr, 
> unsigned long *parent)
>  void ftrace_graph_func(unsigned long ip, unsigned long parent_ip,
>  struct ftrace_ops *op, struct ftrace_regs *fregs)
>  {
> + struct fgraph_ops *gops = container_of(op, struct fgraph_ops, ops);
>   struct pt_regs *regs = &fregs->regs;
>   unsigned long *parent = (unsigned long *)&regs->regs[1];
>  
> - prepare_ftrace_return(ip, (unsigned long *)parent);
> + if (unlikely(atomic_read(&current->tracing_graph_pause)))
> + return;
> +
> + old = *parent;

Oops, this caused an error. 

> +
> + if (!function_graph_enter_ops(old, ip, 0, parent, gops))

So this should be

if (!function_graph_enter_ops(*parent, ip, 0, parent, gops))

Thanks, 

-- 
Masami Hiramatsu (Google) 



Re: [PATCH] tracing: use ring_buffer_record_is_set_on() in tracer_tracing_is_on()

2024-02-07 Thread Sven Schnelle
Steven Rostedt  writes:

> On Wed, 7 Feb 2024 13:07:36 +0100
> Mete Durlu  wrote:
>
>> wouldn't the following scenario explain the behavior we are seeing.
>> When using event triggers, trace uses lockless ringbuffer control paths.
>> If cmdline update and trace output reading is happening on different
>> cpus, the ordering can get messed up.
>>
>> 1. event happens and trace trigger tells ring buffer to stop writes
>> 2. (on cpu#1)test calculates checksum on current state of trace
>> output.
>> 3. (on cpu#2)not knowing about the trace buffers status yet, writer adds
> one last entry which would collide with a pid in the cmdline map before
>> actually stopping. While (on cpu#1) checksum is being calculated, new
>> saved cmdlines entry is waiting for spinlocks to be unlocked and then
>> gets added.
>> 4. test calculates checksum again and finds that the trace output has
>> changed. <...> is put on collided pid.
>
> But the failure is here:
>
> on=`cat tracing_on`
> if [ $on != "0" ]; then
> fail "Tracing is not off"
> fi

It might be misleading because we're discussing two issues in one
thread. The failure above was one problem, which the initial fix
is for. The other issue we're still seeing is the test below:

> csum1=`md5sum trace`
> sleep $SLEEP_TIME
> csum2=`md5sum trace`
>
> if [ "$csum1" != "$csum2" ]; then
> fail "Tracing file is still changing"
> fi
>
> 1. tracing is off
> 2. do checksum of trace
> 3. sleep
> 4. do another checksum of trace
> 5. compare the two checksums
>
> Now how did they come up differently in that amount of time? The
> saved_cmdlines really should not have been updated.

My assumption without reading the code is that something like this
happens:

CPU0 CPU1
[ringbuffer enabled]
 ring_buffer_write()
>  if (atomic_read(&buffer->record_disabled))
goto out;
echo 0 > tracing_on
record_disabled |= RB_BUFFER_OFF
csum1=`md5sum trace`
 [adds trace entry to ring buffer,
  overwriting savedcmd_lines entry because
  it thinks ring buffer is enabled]
csum2=`md5sum trace`




Re: [PATCH] tracing: use ring_buffer_record_is_set_on() in tracer_tracing_is_on()

2024-02-07 Thread Steven Rostedt
On Wed, 7 Feb 2024 13:07:36 +0100
Mete Durlu  wrote:

> wouldn't the following scenario explain the behavior we are seeing.
> When using event triggers, trace uses lockless ringbuffer control paths.
> If cmdline update and trace output reading is happening on different
> cpus, the ordering can get messed up.
> 
> 1. event happens and trace trigger tells ring buffer to stop writes
> 2. (on cpu#1)test calculates checksum on current state of trace
> output.
> 3. (on cpu#2)not knowing about the trace buffers status yet, writer adds
> one last entry which would collide with a pid in the cmdline map before
> actually stopping. While (on cpu#1) checksum is being calculated, new
> saved cmdlines entry is waiting for spinlocks to be unlocked and then
> gets added.
> 4. test calculates checksum again and finds that the trace output has
> changed. <...> is put on collided pid.

But the failure is here:

on=`cat tracing_on`
if [ $on != "0" ]; then
fail "Tracing is not off"
fi

csum1=`md5sum trace`
sleep $SLEEP_TIME
csum2=`md5sum trace`

if [ "$csum1" != "$csum2" ]; then
fail "Tracing file is still changing"
fi

1. tracing is off
2. do checksum of trace
3. sleep
4. do another checksum of trace
5. compare the two checksums

Now how did they come up differently in that amount of time? The
saved_cmdlines really should not have been updated.

(note, I'm not against the patch, I just want to understand why this
test really failed).

-- Steve



Re: [PATCH] tracing: use ring_buffer_record_is_set_on() in tracer_tracing_is_on()

2024-02-07 Thread Mete Durlu

On 2/7/24 12:09, Steven Rostedt wrote:

On Wed, 07 Feb 2024 06:50:59 +0100
Sven Schnelle  wrote:


Hi Steven,



Not sure whether that is enough, have to test. However, it's not really
a fix, it would just require a bit more load and the test would fail
again. The fundamental problem is that even after disabling tracing
there might be some tracing line added due to the lockless nature of
the ringbuffer. This might then replace some existing cmdline entry.
So maybe we should change the test to ignore the program name when
calculating the checksums.


Have you figured out what caused the cmdlines to change when tracing is
off. It shouldn't be updated even with just 128 entries.

I'm also looking into a way to keep more of a LRU command lines around,
but nothing concrete yet.

-- Steve


Hi,

wouldn't the following scenario explain the behavior we are seeing.
When using event triggers, trace uses lockless ringbuffer control paths.
If cmdline update and trace output reading is happening on different
cpus, the ordering can get messed up.

1. event happens and trace trigger tells ring buffer to stop writes
2. (on cpu#1)test calculates checksum on current state of trace
   output.
3. (on cpu#2)not knowing about the trace buffers status yet, writer adds
   one last entry which would collide with a pid in the cmdline map before
   actually stopping. While (on cpu#1) checksum is being calculated, new
   saved cmdlines entry is waiting for spinlocks to be unlocked and then
   gets added.
4. test calculates checksum again and finds that the trace output has
   changed. <...> is put on collided pid.




Re: [PATCH v3 0/5] PM: domains: Add helpers for multi PM domains to avoid open-coding

2024-02-07 Thread Ulf Hansson
On Tue, 30 Jan 2024 at 13:39, Ulf Hansson  wrote:
>
> Rafael, my plan is to queue up this series via my pmdomain tree. Please let me 
> know
> if you see any issues with that, especially around patch1.
>
> Updates in v3:
> - Added tested-by, reviewed-by and suggested-by tags. No other changes
> have been made.
>
> Updates in v2:
> - Ccing Daniel Baluta and Iuliana Prodan the NXP remoteproc patches to
> requests help with testing.
> - Fixed NULL pointer bug in patch1, pointed out by Nikunj.
> - Added some tested/reviewed-by tags.
>
> Attaching/detaching of a device to multiple PM domains has started to become a
> common operation for many drivers, typically during ->probe() and ->remove().
> In most cases, this has lead to lots of boilerplate code in the drivers.
>
> This series adds a pair of helper functions to manage the attach/detach of a
> device to its multiple PM domains. Moreover, a couple of drivers have been
> converted to use the new helpers as a proof of concept.
>
> Note 1)
> The changes in the drivers have only been compile tested, while the helpers
> have been tested along with a couple of local dummy drivers that I have hacked
> up to model both genpd providers and genpd consumers.
>
> Note 2)
> I was struggling to make up my mind whether we should have a separate helper to attach
> all available power-domains described in DT, rather than providing "NULL" to 
> the
> dev_pm_domain_attach_list(). I decided not to, but please let me know if you
> prefer the other option.
>
> Note 3)
> For OPP integration, as a follow up I am striving to make the
> dev_pm_opp_attach_genpd() redundant. Instead I think we should move towards
> using dev_pm_opp_set_config()->_opp_set_required_devs(), which would allow us 
> to
> use the helpers that $subject series is adding.
>
> Kind regards
> Ulf Hansson
>
> Ulf Hansson (5):
>   PM: domains: Add helper functions to attach/detach multiple PM domains
>   remoteproc: imx_dsp_rproc: Convert to
> dev_pm_domain_attach|detach_list()
>   remoteproc: imx_rproc: Convert to dev_pm_domain_attach|detach_list()
>   remoteproc: qcom_q6v5_adsp: Convert to
> dev_pm_domain_attach|detach_list()
>   media: venus: Convert to dev_pm_domain_attach|detach_list() for vcodec
>
>  drivers/base/power/common.c   | 134 +++
>  drivers/media/platform/qcom/venus/core.c  |  12 +-
>  drivers/media/platform/qcom/venus/core.h  |   7 +-
>  .../media/platform/qcom/venus/pm_helpers.c|  48 ++
>  drivers/remoteproc/imx_dsp_rproc.c|  82 +
>  drivers/remoteproc/imx_rproc.c|  73 +---
>  drivers/remoteproc/qcom_q6v5_adsp.c   | 160 --
>  include/linux/pm_domain.h |  38 +
>  8 files changed, 289 insertions(+), 265 deletions(-)
>
>

I have now applied this to the next branch of my pmdomain tree, to get
it more tested in linux-next.

Please let me know if there are objections to this or if any of you
want to provide an ack/tested/reviewed-by tag, thanks!

Kind regards
Uffe



Re: [PATCH] tracing: use ring_buffer_record_is_set_on() in tracer_tracing_is_on()

2024-02-07 Thread Steven Rostedt
On Wed, 07 Feb 2024 06:50:59 +0100
Sven Schnelle  wrote:

> Hi Steven,

> Not sure whether that is enough, have to test. However, it's not really
> a fix, it would just require a bit more load and the test would fail
> again. The fundamental problem is that even after disabling tracing
> there might be some tracing line added due to the lockless nature of
> the ringbuffer. This might then replace some existing cmdline entry.
> So maybe we should change the test to ignore the program name when
> calculating the checksums.

Have you figured out what caused the cmdlines to change when tracing is
off. It shouldn't be updated even with just 128 entries.

I'm also looking into a way to keep more of a LRU command lines around,
but nothing concrete yet.

-- Steve



Re: [PATCH 0/3] Fairphone 5 PMIC-GLINK support (USB-C, charger, fuel gauge)

2024-02-07 Thread Greg Kroah-Hartman
On Wed, Feb 07, 2024 at 12:20:00AM +0100, Luca Weiss wrote:
> On Tue Jan 2, 2024 at 2:53 PM CET, Greg Kroah-Hartman wrote:
> > On Tue, Jan 02, 2024 at 02:43:24PM +0100, Luca Weiss wrote:
> > > On Tue Jan 2, 2024 at 2:36 PM CET, Greg Kroah-Hartman wrote:
> > > > On Thu, Dec 21, 2023 at 02:45:26PM +0100, Luca Weiss wrote:
> > > > > On Thu Dec 21, 2023 at 1:53 PM CET, Konrad Dybcio wrote:
> > > > > > On 21.12.2023 11:34, Dmitry Baryshkov wrote:
> > > > > > > On Thu, 21 Dec 2023 at 09:33, Luca Weiss 
> > > > > > >  wrote:
> > > > > > >>
> > > > > > >> On Wed Dec 20, 2023 at 1:32 PM CET, Konrad Dybcio wrote:
> > > > > > >>> On 20.12.2023 11:02, Luca Weiss wrote:
> > > > > >  This series adds all the necessary bits to enable USB-C role 
> > > > > >  switching,
> > > > > >  charger and fuel gauge (all via pmic-glink) on Fairphone 5.
> > > > > > 
> > > > > >  One thing that could be made different is the pmic-glink 
> > > > > >  compatible.
> > > > > >  I've chosen to use qcm6490 compatible for it and not sc7280 
> > > > > >  since
> > > > > >  there's plenty of firmware variety on sc7280-based platforms 
> > > > > >  and they
> > > > > >  might require different quirks in the future, so limit this 
> > > > > >  PDOS quirk
> > > > > >  to just qcm6490 for now.
> > > > > > 
> > > > > >  If someone thinks it should be qcom,sc7280-pmic-glink, please 
> > > > > >  let me
> > > > > >  know :)
> > > > > > >>> IMO it's best to continue using the "base soc" (which just so 
> > > > > > >>> happened
> > > > > > >>> to fall onto sc7280 this time around) for all compatibles, 
> > > > > > >>> unless the
> > > > > > >>> derivatives actually had changes
> > > > > > >>
> > > > > > >> Hi Konrad,
> > > > > > >>
> > > > > > >> I think at some point I asked Dmitry what he thought and he 
> > > > > > >> mentioned
> > > > > > >> qcm6490. Even found the message again:
> > > > > > >>
> > > > > > >>> well, since it is a firmware thing, you might want to emphasise 
> > > > > > >>> that.
> > > > > > >>> So from my POV qcm6490 makes more sense
> > > > > > >>
> > > > > > >> But yeah since it's likely that sc7280 firmware behaves the same 
> > > > > > >> as
> > > > > > >> qcm6490 firmware it's probably okay to use sc7280 compatible, 
> > > > > > >> worst case
> > > > > > >> we change it later :) I'll send a v2 with those changes.
> > > > > > > 
> > > > > > > Worst case we end up with sc7280 which has yet another slightly
> > > > > > > different UCSI / PMIC GLINK implementation, but the compatible 
> > > > > > > string
> > > > > > > is already taken.
> > > > > > > I still suppose that this should be a qcm6490-related string.
> > > > > > Right, let's keep qcm then
> > > > > 
> > > > > Ack from my side also. Thanks for the feedback!
> > > >
> > > > This doesn't apply to my tree, where should it be going through?
> > > 
> > > As far as I can see the dependency for the driver commit 1d103d6af241
> > > ("usb: typec: ucsi: fix UCSI on buggy Qualcomm devices") was applied to
> > > Bjorn's qcom tree, so 2/3 should also go there then.
> > > 
> > > Patch 3/3 (arm64 dts) definitely also Bjorn's qcom tree.
> > > 
> > > So that leaves patch 1/3 which Bjorn can probably pick up as well but
> > > looking at git log you also picked up some for that file in the past,
> > > dunno.
> >
> > Ok, for any remaining ones that want to be merged before 6.8-rc1 is out,
> > feel free to add my:
> >
> > Acked-by: Greg Kroah-Hartman 
> >
> > If they don't get picked up by 6.8-rc1, feel free to rebase and send it
> > for me to take through my tree.
> 
> Hi Greg,
> 
> This applies cleanly on -next as of next-20240206 still.
> 
> Could you please pick it up for v6.9? I can also send a v2 with only
> the two remaining patches (dts was applied to qcom by Bjorn already).

v2 with just the remaining patches would be great, thanks.

greg k-h
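[For readers following the compatible discussion above: the resulting node
shape would look roughly like the sketch below. This is illustrative only;
it assumes the binding keeps a generic "qcom,pmic-glink" fallback after the
SoC-specific string, as is conventional for these bindings.]

```dts
pmic-glink {
	/* SoC-specific compatible first, generic fallback second, so a
	 * future sc7280-specific firmware quirk can be keyed on its own
	 * string without touching qcm6490-based boards. */
	compatible = "qcom,qcm6490-pmic-glink", "qcom,pmic-glink";
};
```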



Re: Re: [PATCH] vhost-vdpa: fail enabling virtqueue in certain conditions

2024-02-07 Thread Stefano Garzarella

On Wed, Feb 07, 2024 at 11:27:14AM +0800, Jason Wang wrote:

On Tue, Feb 6, 2024 at 10:52 PM Stefano Garzarella  wrote:


If VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK is not negotiated, we expect
the driver to enable virtqueue before setting DRIVER_OK. If the driver
tries anyway, better to fail right away as soon as we get the ioctl.
Let's also update the documentation to make it clearer.

We had a problem in QEMU for not meeting this requirement, see
https://lore.kernel.org/qemu-devel/20240202132521.32714-1-kw...@redhat.com/


Maybe it's better to only enable cvq when the backend supports
VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK. Eugenio, any comment on this?



Fixes: 9f09fd6171fe ("vdpa: accept VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK backend feature")
Cc: epere...@redhat.com
Signed-off-by: Stefano Garzarella 
---
 include/uapi/linux/vhost_types.h | 3 ++-
 drivers/vhost/vdpa.c | 4 ++++
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/vhost_types.h b/include/uapi/linux/vhost_types.h
index d7656908f730..5df49b6021a7 100644
--- a/include/uapi/linux/vhost_types.h
+++ b/include/uapi/linux/vhost_types.h
@@ -182,7 +182,8 @@ struct vhost_vdpa_iova_range {
 /* Device can be resumed */
 #define VHOST_BACKEND_F_RESUME  0x5
 /* Device supports the driver enabling virtqueues both before and after
- * DRIVER_OK
+ * DRIVER_OK. If this feature is not negotiated, the virtqueues must be
+ * enabled before setting DRIVER_OK.
  */
 #define VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK  0x6
 /* Device may expose the virtqueue's descriptor area, driver area and
diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index bc4a51e4638b..1fba305ba8c1 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -651,6 +651,10 @@ static long vhost_vdpa_vring_ioctl(struct vhost_vdpa *v, 
unsigned int cmd,
case VHOST_VDPA_SET_VRING_ENABLE:
if (copy_from_user(&s, argp, sizeof(s)))
return -EFAULT;
+   if (!vhost_backend_has_feature(vq,
+   VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK) &&
+   (ops->get_status(vdpa) & VIRTIO_CONFIG_S_DRIVER_OK))
+   return -EINVAL;


As discussed, without VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK, we don't
know if parents can do vq_ready after driver_ok.

So maybe we need to keep this behaviour to unbreak some "legacy" userspace?


I'm not sure that's a good idea, since "legacy" userspace is already 
broken when used with a VDUSE device. So we need to fix userspace in 
any case, and IMHO it's better if we start returning an error so the 
user understands what went wrong; with the problem in QEMU it took us 
quite some time to figure out that we couldn't enable a vq after 
DRIVER_OK.


Since userspace has no way to tell whether a vhost-vdpa device is 
backed by VDUSE or not, I think we have only two options: either merge 
this patch or fix VDUSE somehow. But the latter, I think, is more 
complicated/intrusive.


Thanks,
Stefano



For example ifcvf did:

static void ifcvf_vdpa_set_vq_ready(struct vdpa_device *vdpa_dev,
   u16 qid, bool ready)
{
 struct ifcvf_hw *vf = vdpa_to_vf(vdpa_dev);

   ifcvf_set_vq_ready(vf, qid, ready);
}

And it did:

void ifcvf_set_vq_ready(struct ifcvf_hw *hw, u16 qid, bool ready)
{
   struct virtio_pci_common_cfg __iomem *cfg = hw->common_cfg;

   vp_iowrite16(qid, &cfg->queue_select);
   vp_iowrite16(ready, &cfg->queue_enable);
}

Though it didn't advertise VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK?

Adding LingShan for more thought.

Thanks


ops->set_vq_ready(vdpa, idx, s.num);
return 0;
case VHOST_VDPA_GET_VRING_GROUP:
--
2.43.0