[PATCH 3/3] sh: Call paging_init() earlier in the init sequence

2024-02-09 Thread Oreoluwa Babatunde
The unflatten_device_tree() function contains a call to
memblock_alloc(). This is a problem because this allocation is done
before any of the reserved memory is set aside in paging_init().
This means that there is a possibility for memblock to allocate from
any of the memory regions that are supposed to be set aside as reserved.

Hence, move the call to paging_init() to be earlier in the init
sequence so that the reserved memory regions are set aside before any
allocations are done using memblock.

Signed-off-by: Oreoluwa Babatunde 
---
 arch/sh/kernel/setup.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/sh/kernel/setup.c b/arch/sh/kernel/setup.c
index d3175f0..ea40798 100644
--- a/arch/sh/kernel/setup.c
+++ b/arch/sh/kernel/setup.c
@@ -322,6 +322,8 @@ void __init setup_arch(char **cmdline_p)
/* Let earlyprintk output early console messages */
sh_early_platform_driver_probe("earlyprintk", 1, 1);
 
+   paging_init();
+
 #ifdef CONFIG_OF_EARLY_FLATTREE
 #ifdef CONFIG_USE_BUILTIN_DTB
unflatten_and_copy_device_tree();
@@ -330,8 +332,6 @@ void __init setup_arch(char **cmdline_p)
 #endif
 #endif
 
-   paging_init();
-
/* Perform the machine specific initialisation */
if (likely(sh_mv.mv_setup))
sh_mv.mv_setup(cmdline_p);
-- 
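The ordering hazard this series fixes can be illustrated with a small user-space model. This is only a sketch, not kernel code: `toy_alloc()` stands in for memblock_alloc(), `toy_reserve()` for the reservation done from paging_init(), and all `toy_*` names are invented for illustration.

```c
#include <stddef.h>

/* Toy user-space model of the ordering hazard: a bump allocator stands
 * in for memblock. An allocation made before the reserved range is set
 * aside can land inside that range. */

static size_t cursor;        /* next free address in the toy memory map */
static size_t reserved_end;  /* end of the reserved range, once known */

/* Stand-in for paging_init()/early_init_fdt_scan_reserved_mem():
 * record that [0, end) must not be handed out. */
static void toy_reserve(size_t end)
{
	reserved_end = end;
}

/* Stand-in for memblock_alloc(): hand out the next free chunk,
 * skipping memory that is already known to be reserved. */
static size_t toy_alloc(size_t size)
{
	if (cursor < reserved_end)
		cursor = reserved_end;
	size_t addr = cursor;
	cursor += size;
	return addr;
}
```

With the pre-fix ordering, the first allocation lands inside the range that is only reserved afterwards; reordering so the reservation happens first, as the patch does, makes later allocations avoid it.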



[PATCH 2/3] openrisc: Call setup_memory() earlier in the init sequence

2024-02-09 Thread Oreoluwa Babatunde
The unflatten_and_copy_device_tree() function contains a call to
memblock_alloc(). This means that memblock is allocating memory before
any of the reserved memory regions are set aside in the setup_memory()
function, which calls early_init_fdt_scan_reserved_mem(). Therefore,
there is a possibility for memblock to allocate from any of the
reserved memory regions.

Hence, move the call to setup_memory() to be earlier in the init
sequence so that the reserved memory regions are set aside before any
allocations are done using memblock.

Signed-off-by: Oreoluwa Babatunde 
---
 arch/openrisc/kernel/setup.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/openrisc/kernel/setup.c b/arch/openrisc/kernel/setup.c
index 9cf7fb6..be56eaa 100644
--- a/arch/openrisc/kernel/setup.c
+++ b/arch/openrisc/kernel/setup.c
@@ -255,6 +255,9 @@ void calibrate_delay(void)
 
 void __init setup_arch(char **cmdline_p)
 {
+   /* setup memblock allocator */
+   setup_memory();
+
unflatten_and_copy_device_tree();
 
setup_cpuinfo();
@@ -278,9 +281,6 @@ void __init setup_arch(char **cmdline_p)
}
 #endif
 
-   /* setup memblock allocator */
-   setup_memory();
-
/* paging_init() sets up the MMU and marks all pages as reserved */
paging_init();
 
-- 



[PATCH 1/3] loongarch: Call arch_mem_init() before platform_init() in the init sequence

2024-02-09 Thread Oreoluwa Babatunde
The platform_init() function, which is called during device bootup,
contains a few calls to memblock_alloc().
This is an issue because these allocations are done before reserved
memory regions are set aside in arch_mem_init().
This means that there is a possibility for memblock to allocate memory
from any of the reserved memory regions.

Hence, move the call to arch_mem_init() to be earlier in the init
sequence so that all reserved memory is set aside before any allocations
are made with memblock.

Signed-off-by: Oreoluwa Babatunde 
---
 arch/loongarch/kernel/setup.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/loongarch/kernel/setup.c b/arch/loongarch/kernel/setup.c
index edf2bba..66c307c 100644
--- a/arch/loongarch/kernel/setup.c
+++ b/arch/loongarch/kernel/setup.c
@@ -597,8 +597,8 @@ void __init setup_arch(char **cmdline_p)
parse_early_param();
reserve_initrd_mem();
 
-   platform_init();
arch_mem_init(cmdline_p);
+   platform_init();
 
resource_init();
 #ifdef CONFIG_SMP
-- 



[PATCH 0/3] Restructure init sequence to set aside reserved memory earlier

2024-02-09 Thread Oreoluwa Babatunde
The loongarch, openrisc, and sh architectures allocate memory from
memblock before the reserved memory regions have been set aside.
This means that there is a possibility for memblock to allocate from
memory regions that are supposed to be reserved.

This series makes changes to the arch specific setup code to call the
functions responsible for setting aside the reserved memory regions earlier
in the init sequence.
Hence, by the time memblock starts being used to allocate memory, the
reserved memory regions should already be set aside, and it will no
longer be possible for allocations to come from them.

I am currently using an arm64 device, and so I will need assistance from
the relevant arch maintainers to check whether this breaks anything,
from compilation to device bootup.

Oreoluwa Babatunde (3):
  loongarch: Call arch_mem_init() before platform_init() in the init
sequence
  openrisc: Call setup_memory() earlier in the init sequence
  sh: Call paging_init() earlier in the init sequence

 arch/loongarch/kernel/setup.c | 2 +-
 arch/openrisc/kernel/setup.c  | 6 +++---
 arch/sh/kernel/setup.c| 4 ++--
 3 files changed, 6 insertions(+), 6 deletions(-)

-- 



Re: [PATCH V1] vdpa_sim: reset must not run

2024-02-09 Thread Steven Sistare
On 1/22/2024 5:59 AM, Stefano Garzarella wrote:
> On Mon, Jan 22, 2024 at 11:47:22AM +0100, Eugenio Perez Martin wrote:
>> On Mon, Jan 22, 2024 at 11:22 AM Stefano Garzarella wrote:
>>>
>>> On Wed, Jan 17, 2024 at 11:23:23AM -0800, Steve Sistare wrote:
>>> >vdpasim_do_reset sets running to true, which is wrong, as it allows
>>> >vdpasim_kick_vq to post work requests before the device has been
>>> >configured.  To fix, do not set running until VIRTIO_CONFIG_S_FEATURES_OK
>>> >is set.
>>> >
>>> >Fixes: 0c89e2a3a9d0 ("vdpa_sim: Implement suspend vdpa op")
>>> >Signed-off-by: Steve Sistare 
>>> >Reviewed-by: Eugenio Pérez 
>>> >---
>>> > drivers/vdpa/vdpa_sim/vdpa_sim.c | 3 ++-
>>> > 1 file changed, 2 insertions(+), 1 deletion(-)
>>> >
>>> >diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c
>>> >index be2925d0d283..6304cb0b4770 100644
>>> >--- a/drivers/vdpa/vdpa_sim/vdpa_sim.c
>>> >+++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c
>>> >@@ -160,7 +160,7 @@ static void vdpasim_do_reset(struct vdpasim *vdpasim, u32 flags)
>>> >   }
>>> >   }
>>> >
>>> >-  vdpasim->running = true;
>>> >+  vdpasim->running = false;
>>> >   spin_unlock(&vdpasim->iommu_lock);
>>> >
>>> >   vdpasim->features = 0;
>>> >@@ -483,6 +483,7 @@ static void vdpasim_set_status(struct vdpa_device *vdpa, u8 status)
>>> >
>>> >   mutex_lock(&vdpasim->mutex);
>>> >   vdpasim->status = status;
>>> >+  vdpasim->running = (status & VIRTIO_CONFIG_S_FEATURES_OK) != 0;
>>> >   mutex_unlock(&vdpasim->mutex);
>>>
>>> Should we do something similar also in vdpasim_resume() ?
>>>
>>> I mean something like this:
>>>
>>> diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c
>>> index be2925d0d283..55e4633d5442 100644
>>> --- a/drivers/vdpa/vdpa_sim/vdpa_sim.c
>>> +++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c
>>> @@ -520,7 +520,7 @@ static int vdpasim_resume(struct vdpa_device *vdpa)
>>>  int i;
>>>
>>>  mutex_lock(&vdpasim->mutex);
>>> -   vdpasim->running = true;
>>> +   vdpasim->running = (vdpasim->status & VIRTIO_CONFIG_S_FEATURES_OK) != 0;
>>>
>>>  if (vdpasim->pending_kick) {
>>>  /* Process pending descriptors */
>>>
>>> Thanks,
>>> Stefano
>>>
>>
>> The suspend and resume operation should not be called before
>> DRIVER_OK, so maybe we should add that protection at
>> drivers/vhost/vdpa.c actually?
> 
> Yeah, I think so!
> 
> Anyway, IMHO we should at least return an error in vdpa_sim if 
> vdpasim_suspend/resume are called before DRIVER_OK (in another patch of 
> course).

I submitted "vdpa: suspend and resume require DRIVER_OK" to check this in
vdpa.c, so there is no need to check it in the leaf drivers.

I also submitted V2 of this patch, "vdpa_sim: reset must not run".
It checks for DRIVER_OK, instead of FEATURES_OK.

- Steve



[PATCH V1] vdpa: suspend and resume require DRIVER_OK

2024-02-09 Thread Steve Sistare
Calling suspend or resume requires VIRTIO_CONFIG_S_DRIVER_OK, for all
vdpa devices.

Suggested-by: Eugenio Perez Martin 
Signed-off-by: Steve Sistare 
---
 drivers/vhost/vdpa.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index bc4a51e4638b..ce1882acfc3b 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -598,6 +598,9 @@ static long vhost_vdpa_suspend(struct vhost_vdpa *v)
if (!ops->suspend)
return -EOPNOTSUPP;
 
+   if (!(ops->get_status(vdpa) & VIRTIO_CONFIG_S_DRIVER_OK))
+   return -EINVAL;
+
ret = ops->suspend(vdpa);
if (!ret)
v->suspended = true;
@@ -618,6 +621,9 @@ static long vhost_vdpa_resume(struct vhost_vdpa *v)
if (!ops->resume)
return -EOPNOTSUPP;
 
+   if (!(ops->get_status(vdpa) & VIRTIO_CONFIG_S_DRIVER_OK))
+   return -EINVAL;
+
ret = ops->resume(vdpa);
if (!ret)
v->suspended = false;
-- 
2.39.3
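The gate this patch adds can be modeled outside the kernel. This is an illustrative sketch: `toy_suspend()` and `TOY_EINVAL` are invented stand-ins for vhost_vdpa_suspend() and -EINVAL; the virtio status bit values themselves are the real ones from the virtio specification.

```c
#include <stdint.h>

/* Virtio device status bits, as defined by the virtio specification. */
#define VIRTIO_CONFIG_S_ACKNOWLEDGE 1
#define VIRTIO_CONFIG_S_DRIVER      2
#define VIRTIO_CONFIG_S_DRIVER_OK   4
#define VIRTIO_CONFIG_S_FEATURES_OK 8

#define TOY_EINVAL 22  /* mirrors the patch's -EINVAL */

/* Stand-in for vhost_vdpa_suspend(): refuse the operation until the
 * driver has set DRIVER_OK, exactly the check the patch inserts. */
static int toy_suspend(uint8_t status)
{
	if (!(status & VIRTIO_CONFIG_S_DRIVER_OK))
		return -TOY_EINVAL;
	return 0;  /* would call ops->suspend(vdpa) here */
}
```

Note that FEATURES_OK alone is not enough: the device is only fully configured once DRIVER_OK is set, which is why the check tests that bit.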




[PATCH V2] vdpa_sim: reset must not run

2024-02-09 Thread Steve Sistare
vdpasim_do_reset sets running to true, which is wrong, as it allows
vdpasim_kick_vq to post work requests before the device has been
configured.  To fix, do not set running until VIRTIO_CONFIG_S_DRIVER_OK
is set.

Fixes: 0c89e2a3a9d0 ("vdpa_sim: Implement suspend vdpa op")
Signed-off-by: Steve Sistare 
Reviewed-by: Eugenio Pérez 
Acked-by: Jason Wang 
---
 drivers/vdpa/vdpa_sim/vdpa_sim.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c
index be2925d0d283..18584ce70bf0 100644
--- a/drivers/vdpa/vdpa_sim/vdpa_sim.c
+++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c
@@ -160,7 +160,7 @@ static void vdpasim_do_reset(struct vdpasim *vdpasim, u32 flags)
	}
	}
 
-   vdpasim->running = true;
+   vdpasim->running = false;
	spin_unlock(&vdpasim->iommu_lock);
 
	vdpasim->features = 0;
@@ -483,6 +483,7 @@ static void vdpasim_set_status(struct vdpa_device *vdpa, u8 status)
 
	mutex_lock(&vdpasim->mutex);
	vdpasim->status = status;
+   vdpasim->running = (status & VIRTIO_CONFIG_S_DRIVER_OK) != 0;
	mutex_unlock(&vdpasim->mutex);
 }
 
-- 
2.39.3
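The behavior this patch establishes — kicks are ignored until the driver reaches DRIVER_OK, and reset clears the running flag — can be sketched in plain user-space C. All `toy_*` names are invented; only the DRIVER_OK bit value (4) is from the virtio specification.

```c
#include <stdint.h>

#define VIRTIO_CONFIG_S_DRIVER_OK 4  /* virtio spec value */

static int running;          /* models vdpasim->running */
static int kicks_processed;  /* models work posted by vdpasim_kick_vq() */

/* Stand-in for vdpasim_do_reset(): the patch changes this to false. */
static void toy_do_reset(void)
{
	running = 0;
}

/* Stand-in for vdpasim_set_status(): run only once DRIVER_OK is set. */
static void toy_set_status(uint8_t status)
{
	running = (status & VIRTIO_CONFIG_S_DRIVER_OK) != 0;
}

/* Stand-in for vdpasim_kick_vq(): only post work once configured. */
static void toy_kick_vq(void)
{
	if (running)
		kicks_processed++;
}
```

With the old code, `toy_do_reset()` would have set `running` back to 1, letting a kick post work against an unconfigured device.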




RE: CPU data cache across reboot/kexec for pmem/dax devices

2024-02-09 Thread Dan Williams
Mathieu Desnoyers wrote:
> Hi Dan,
> 
> In the context of extracting user-space trace data when the kernel crashes,
> the LTTng user-space tracer recommends using nvdimm/pmem to reserve an area
> of physical (volatile) RAM at boot (memmap=nn[KMG]!ss[KMG]), and use the
> resulting device to create/mount a dax-enabled fs (e.g. ext4).
> 
> We then use this filesystem to mmap() the shared memory files for the tracer.
> 
> I want to make sure that the very last events from the userspace tracer 
> written
> to the memory mapped buffers (mmap()) by userspace are present after a
> warm-reboot (or kexec/kdump).
> 
> Note that the LTTng user-space tracer (LTTng-UST) does *not* issue any clflush
> (or equivalent pmem_persist() from libpmem) for performance reasons: ring 
> buffer
> data is usually overwritten many times before the system actually crashes, and
> the only thing we really need to make sure is that the cache lines are not
> invalidated without write back.
> 
> So I understand that the main use-case for pmem is nvdimm, and that in order 
> to
> guarantee persistence of the data on power off an explicit pmem_persist() is
> needed after each "transaction", but for the needs of tracing, is there some
> kind of architectural guarantee that the data present in the cpu data cache
> is not invalidated prior to write back in each of those scenarios ?
>
> - reboot with bios explicitly not clearing memory,

This one gives me pause, because a trip through the BIOS typically means
lots of resets and other low level magic, so this would likely require
pushing dirty data out of CPU caches prior to entering the BIOS code
paths.

So this either needs explicit cache flushing or mapping the memory with
write-through semantics. The latter is not supported in the stack today.

> - kexec/kdump.

This should maintain the state of CPU caches. As far as the CPU is
concerned it is just long jumping into a new kernel in memory without
resetting any CPU cache state.



[PATCH] dt-bindings: usb: typec: anx7688: start a binding document

2024-02-09 Thread Pavel Machek
Add binding for anx7688 usb type-c bridge. I don't have a datasheet,
but I did best I could.

Signed-off-by: Pavel Machek 

diff --git a/Documentation/devicetree/bindings/usb/analogix,anx7688.yaml b/Documentation/devicetree/bindings/usb/analogix,anx7688.yaml
new file mode 100644
index ..b9d60586937f
--- /dev/null
+++ b/Documentation/devicetree/bindings/usb/analogix,anx7688.yaml
@@ -0,0 +1,140 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/usb/analogix,anx7688.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Analogix ANX7688 Type-C controller
+
+maintainers:
+  - Pavel Machek 
+
+properties:
+  compatible:
+enum:
+  - analogix,anx7688
+
+  reg:
+maxItems: 1
+
+  interrupts:
+maxItems: 1
+
+  reset-gpios:
+maxItems: 1
+  enable-gpios:
+maxItems: 1
+  cabledet-gpios:
+maxItems: 1
+
+  avdd10-supply:
+description:
+  1.0V power supply
+  dvdd10-supply:
+description:
+  1.0V power supply
+  avdd18-supply:
+description:
+  1.8V power supply
+  dvdd18-supply:
+description:
+  1.8V power supply
+  avdd33-supply:
+description:
+  3.3V power supply
+  i2c-supply:
+description:
+  Power supply
+  vconn-supply:
+description:
+  Power supply
+  hdmi_vt-supply:
+description:
+  Power supply
+
+  vbus-supply:
+description:
+  Power supply
+  vbus_in-supply:
+description:
+  Power supply
+
+  connector:
+type: object
+$ref: ../connector/usb-connector.yaml
+unevaluatedProperties: false
+
+description:
+  Properties for usb c connector.
+
+properties:
+  compatible:
+const: usb-c-connector
+
+  power-role: true
+
+  data-role: true
+
+  try-power-role: true
+
+required:
+  - compatible
+
+required:
+  - compatible
+  - reg
+  - connector
+
+additionalProperties: false
+
+examples:
+  - |
+#include 
+#include 
+
+i2c {
+#address-cells = <1>;
+#size-cells = <0>;
+
+typec@2c {
+compatible = "analogix,anx7688";
+reg = <0x2c>;
+interrupts = <8 IRQ_TYPE_EDGE_FALLING>;
+interrupt-parent = <>;
+
+enable-gpios = <&pio 3 10 GPIO_ACTIVE_LOW>; /* PD10 */
+reset-gpios = <&pio 3 6 GPIO_ACTIVE_HIGH>; /* PD6 */
+cabledet-gpios = <&r_pio 0 8 GPIO_ACTIVE_HIGH>; /* PL8 */
+
+avdd10-supply = <&reg_anx1v0>;
+dvdd10-supply = <&reg_anx1v0>;
+avdd18-supply = <&reg_ldo_io1>;
+dvdd18-supply = <&reg_ldo_io1>;
+avdd33-supply = <&reg_dcdc1>;
+i2c-supply = <&reg_ldo_io0>;
+vconn-supply = <&reg_vconn5v0>;
+hdmi_vt-supply = <&reg_dldo1>;
+
+vbus-supply = <&reg_usb_5v>;
+vbus_in-supply = <&usb_power_supply>;
+
+typec_con: connector {
+compatible = "usb-c-connector";
+power-role = "dual";
+data-role = "dual";
+try-power-role = "source";
+
+ports {
+#address-cells = <1>;
+#size-cells = <0>;
+port@0 {
+reg = <0>;
+typec_con_ep: endpoint {
+remote-endpoint = <&usb0_hs_ep>;
+};
+};
+};
+};
+};
+};
+...
diff --git a/Documentation/devicetree/bindings/usb/generic-xhci.yaml b/Documentation/devicetree/bindings/usb/generic-xhci.yaml
index 594ebb3ee432..6ceafa4af292 100644
--- a/Documentation/devicetree/bindings/usb/generic-xhci.yaml
+++ b/Documentation/devicetree/bindings/usb/generic-xhci.yaml
@@ -9,9 +9,6 @@ title: USB xHCI Controller
 maintainers:
   - Mathias Nyman 
 
-allOf:
-  - $ref: usb-xhci.yaml#
-
 properties:
   compatible:
 oneOf:
@@ -25,6 +22,11 @@ properties:
   - marvell,armada-380-xhci
   - marvell,armada-8k-xhci
   - const: generic-xhci
+  - description: Broadcom SoCs with power domains
+items:
+  - enum:
+  - brcm,bcm2711-xhci
+  - const: brcm,xhci-brcm-v2
   - description: Broadcom STB SoCs with xHCI
 enum:
   - brcm,xhci-brcm-v2
@@ -49,6 +51,9 @@ properties:
   - const: core
   - const: reg
 
+  power-domains:
+maxItems: 1
+
 unevaluatedProperties: false
 
 required:
@@ -56,6 +61,20 @@ required:
   - reg
   - interrupts
 
+allOf:
+  - $ref: usb-xhci.yaml#
+  - if:
+  properties:
+compatible:
+  contains:
+const: brcm,bcm2711-xhci
+then:
+  required:
+- power-domains
+else:
+  properties:
+power-domains: false
+
 examples:
   - |
 usb@f0931000 {
diff --git a/Documentation/devicetree/bindings/usb/genesys,gl850g.yaml b/Documentation/devicetree/bindings/usb/genesys,gl850g.yaml
index ee08b9c3721f..37cf5249e526 100644
--- 

Re: [PATCH] usb: typec: anx7688: Add driver for ANX7688 USB-C HDMI bridge

2024-02-09 Thread Pavel Machek
Hi!

> > From: Ondrej Jirman 
> > 
> > This is driver for ANX7688 USB-C HDMI, with flashing and debugging
> > features removed. ANX7688 is a rather critical piece on the PinePhone,
> > there's no display and no battery charging without it.
> 
> Don't remove the flashing part. Some Pinephones come without the firmware
> in the past. Even recently, I've seen some people in the Pine chat
> asking how to flash the firmware on some old PinePhone.
> 
> It's a safe operation that can be done at any time and can only be done
> from the kernel driver.

I can re-add it later, but initial merge will be tricky enough as-is.

> > There's likely more work to be done here, but having basic support
> > in mainline is needed to be able to work on the other stuff
> > (networking, cameras, power management).
> > 
> > Signed-off-by: Ondrej Jirman 
> 
> I should be second in order of sign-offs. Martijn wrote a non-working skeleton
> https://megous.com/git/linux/commit/?h=pp-5.7=30e33cefd7956a2b49fb27008b4af9d868974e58
> driver. Then I picked it up and developed it over years to a working thing.
> Almost none of the skeleton remains.

I believe From: should match the first Signed-off-by. I guess Martijn
should be credited with Co-developed-by or something like that?

> License is GPLv2.

Ok.

> > +   ret = of_property_read_variable_u32_array(dev->of_node, "sink-caps",
> > + anx7688->snk_caps,
> > + 1, ARRAY_SIZE(anx7688->snk_caps));
> > +   if (ret < 0) {
> > +   dev_err(dev, "failed to get sink-caps from DT\n");
> > +   return ret;
> > +   }
> 
> ^^^ The driver will need to follow usb-c-connector bindings and it will need
> a bindings documentation for itself.
> 
> That's one of the missing things that I did not implement, yet.

Yep, I fought with device trees for a few days. I should have the yaml
ready.

Best regards,
Pavel

-- 
People of Russia, stop Putin before his war on Ukraine escalates.


signature.asc
Description: PGP signature


CPU data cache across reboot/kexec for pmem/dax devices

2024-02-09 Thread Mathieu Desnoyers

Hi Dan,

In the context of extracting user-space trace data when the kernel crashes,
the LTTng user-space tracer recommends using nvdimm/pmem to reserve an area
of physical (volatile) RAM at boot (memmap=nn[KMG]!ss[KMG]), and use the
resulting device to create/mount a dax-enabled fs (e.g. ext4).

We then use this filesystem to mmap() the shared memory files for the tracer.

I want to make sure that the very last events from the userspace tracer written
to the memory mapped buffers (mmap()) by userspace are present after a
warm-reboot (or kexec/kdump).

Note that the LTTng user-space tracer (LTTng-UST) does *not* issue any clflush
(or equivalent pmem_persist() from libpmem) for performance reasons: ring buffer
data is usually overwritten many times before the system actually crashes, and
the only thing we really need to make sure is that the cache lines are not
invalidated without write back.

So I understand that the main use-case for pmem is nvdimm, and that in order to
guarantee persistence of the data on power off an explicit pmem_persist() is
needed after each "transaction", but for the needs of tracing, is there some
kind of architectural guarantee that the data present in the cpu data cache
is not invalidated prior to write back in each of those scenarios ?

- reboot with bios explicitly not clearing memory,
- kexec/kdump.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com



Re: [PATCH v17 3/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper

2024-02-09 Thread Alex Williamson
On Fri, 9 Feb 2024 13:19:03 -0400
Jason Gunthorpe  wrote:

> On Fri, Feb 09, 2024 at 08:55:31AM -0700, Alex Williamson wrote:
> > I think Kevin's point is also relative to this latter scenario, in the
> > L1 instance of the nvgrace-gpu driver the mmap of the usemem BAR is
> > cachable, but in the L2 instance of the driver where we only use the
> > vfio-pci-core ops nothing maintains that cachable mapping.  Is that a
> > problem?  An uncached mapping on top of a cachable mapping is often
> > prone to problems.
> 
> On these CPUs the ARM architecture won't permit it, the L0 level
> blocks uncachable using FWB and page table attributes. The VM, no
> matter what it does, cannot make the cachable memory uncachable.

Great, thanks,

Alex




Re: [PATCH v17 3/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper

2024-02-09 Thread Jason Gunthorpe
On Fri, Feb 09, 2024 at 08:55:31AM -0700, Alex Williamson wrote:
> I think Kevin's point is also relative to this latter scenario, in the
> L1 instance of the nvgrace-gpu driver the mmap of the usemem BAR is
> cachable, but in the L2 instance of the driver where we only use the
> vfio-pci-core ops nothing maintains that cachable mapping.  Is that a
> problem?  An uncached mapping on top of a cachable mapping is often
> prone to problems.  

On these CPUs the ARM architecture won't permit it, the L0 level
blocks uncachable using FWB and page table attributes. The VM, no
matter what it does, cannot make the cachable memory uncachable.

Jason



[PATCH v5 0/5] input/touchscreen: imagis: add support for IST3032C

2024-02-09 Thread Karel Balej
From: Karel Balej 

Hello,

this patch series generalizes the Imagis touchscreen driver to support
other Imagis chips, namely IST3038B and IST3032C, which use a slightly
different protocol.

The motivation for IST3032C is the samsung,coreprimevelte smartphone
with which this series has been tested. However, the support for this
device is not yet in-tree, the effort is happening at [1]. Preliminary
version of the regulator driver needed to use the touchscreen on this
phone can be found here [2].

Note that this is a prerequisite for (at least a part of) this series
[3] which among other things implements support for touch keys for
Imagis touchscreens that have it.

[1] https://lore.kernel.org/all/20240110-pxa1908-lkml-v8-0-fea768a59...@skole.hr/
[2] https://lore.kernel.org/all/20231228100208.2932-1-kar...@gimli.ms.mff.cuni.cz/
[3] https://lore.kernel.org/all/20240120-b4-imagis-keys-v2-0-d7fc16f2e...@skole.hr/
---
v5:
- Rebase to v6.8-rc3.
- v4: https://lore.kernel.org/all/20240120191940.3631-1-kar...@gimli.ms.mff.cuni.cz/
v4:
- Rebase to v6.7.
- v3: https://lore.kernel.org/all/20231202125948.10345-1-kar...@gimli.ms.mff.cuni.cz/
- Address feedback and add trailers.
v3:
- Rebase to v6.7-rc3.
- v2: https://lore.kernel.org/all/20231003133440.4696-1-kar...@gimli.ms.mff.cuni.cz/
v2:
- Do not rename the driver.
- Do not hardcode voltage required by the IST3032C.
- Use Markuss' series which generalizes the driver. Link to the original
  series: https://lore.kernel.org/all/20220504152406.8730-1-markuss.br...@gmail.com/
- Separate bindings into separate patch.
- v1: https://lore.kernel.org/all/20230926173531.18715-1-bal...@matfyz.cz/

Karel Balej (2):
  dt-bindings: input/touchscreen: imagis: add compatible for IST3032C
  input/touchscreen: imagis: add support for IST3032C

Markuss Broks (3):
  input/touchscreen: imagis: Correct the maximum touch area value
  dt-bindings: input/touchscreen: Add compatible for IST3038B
  input/touchscreen: imagis: Add support for Imagis IST3038B

 .../input/touchscreen/imagis,ist3038c.yaml|  2 +
 drivers/input/touchscreen/imagis.c| 70 +++
 2 files changed, 60 insertions(+), 12 deletions(-)

-- 
2.43.0




[PATCH v5 3/5] input/touchscreen: imagis: Add support for Imagis IST3038B

2024-02-09 Thread Karel Balej
From: Markuss Broks 

Imagis IST3038B is another variant of Imagis IST3038 IC, which has
a different register interface from IST3038C (possibly firmware defined).
This should also work for IST3044B (though untested); however, other
variants using this interface/protocol (IST3026, IST3032, IST3026B,
IST3032B) have a different format for coordinates, and they would need
additional effort to be supported by this driver.

Signed-off-by: Markuss Broks 
Signed-off-by: Karel Balej 
---

Notes:
v4:
* Sort the definitions in alphanumerical order.

 drivers/input/touchscreen/imagis.c | 58 --
 1 file changed, 47 insertions(+), 11 deletions(-)
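The variant dispatch this patch introduces can be sketched as a small user-space program: a per-variant properties table keyed by compatible string, plus the probe-time chip-ID check. The IST3038B values mirror the defines in the diff below; the `toy_*` names and the lookup are invented stand-ins for device_get_match_data() and imagis_probe().

```c
#include <stddef.h>
#include <string.h>

/* Per-variant data, trimmed to the fields used by the ID check. */
struct toy_imagis_properties {
	unsigned int whoami_cmd;
	unsigned int whoami_val;
	int protocol_b;
};

static const struct toy_imagis_properties ist3038b_data = {
	.whoami_cmd = 0x30,      /* IST3038B_REG_CHIPID */
	.whoami_val = 0x30380b,  /* IST3038B_WHOAMI */
	.protocol_b = 1,
};

struct toy_match {
	const char *compatible;
	const struct toy_imagis_properties *data;
};

static const struct toy_match matches[] = {
	{ "imagis,ist3038b", &ist3038b_data },
	{ NULL, NULL },
};

/* Simplified stand-in for device_get_match_data(). */
static const struct toy_imagis_properties *
toy_get_match_data(const char *compat)
{
	for (const struct toy_match *m = matches; m->compatible; m++)
		if (!strcmp(m->compatible, compat))
			return m->data;
	return NULL;
}

/* Probe-time check, as in imagis_probe(): reject an unknown chip ID. */
static int toy_check_chip(const struct toy_imagis_properties *tdata,
			  unsigned int chip_id)
{
	return chip_id == tdata->whoami_val ? 0 : -1;
}
```

Adding a new variant then amounts to adding one properties struct and one table entry, which is what the later patches in the series do for IST3032C.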

diff --git a/drivers/input/touchscreen/imagis.c b/drivers/input/touchscreen/imagis.c
index e67fd3011027..9af8a6332ae6 100644
--- a/drivers/input/touchscreen/imagis.c
+++ b/drivers/input/touchscreen/imagis.c
@@ -11,9 +11,13 @@
 #include 
 #include 
 
+#define IST3038B_REG_STATUS		0x20
+#define IST3038B_REG_CHIPID		0x30
+#define IST3038B_WHOAMI			0x30380b
+
 #define IST3038C_HIB_ACCESS		(0x800B << 16)
 #define IST3038C_DIRECT_ACCESS		BIT(31)
-#define IST3038C_REG_CHIPID		0x40001000
+#define IST3038C_REG_CHIPID		(0x40001000 | IST3038C_DIRECT_ACCESS)
 #define IST3038C_REG_HIB_BASE		0x3100
 #define IST3038C_REG_TOUCH_STATUS	(IST3038C_REG_HIB_BASE | IST3038C_HIB_ACCESS)
 #define IST3038C_REG_TOUCH_COORD	(IST3038C_REG_HIB_BASE | IST3038C_HIB_ACCESS | 0x8)
@@ -31,8 +35,17 @@
 #define IST3038C_FINGER_COUNT_SHIFT	12
 #define IST3038C_FINGER_STATUS_MASK	GENMASK(9, 0)
 
+struct imagis_properties {
+   unsigned int interrupt_msg_cmd;
+   unsigned int touch_coord_cmd;
+   unsigned int whoami_cmd;
+   unsigned int whoami_val;
+   bool protocol_b;
+};
+
 struct imagis_ts {
struct i2c_client *client;
+   const struct imagis_properties *tdata;
struct input_dev *input_dev;
struct touchscreen_properties prop;
struct regulator_bulk_data supplies[2];
@@ -84,8 +97,7 @@ static irqreturn_t imagis_interrupt(int irq, void *dev_id)
int i;
int error;
 
-   error = imagis_i2c_read_reg(ts, IST3038C_REG_INTR_MESSAGE,
-   &intr_message);
+   error = imagis_i2c_read_reg(ts, ts->tdata->interrupt_msg_cmd, &intr_message);
if (error) {
dev_err(&ts->client->dev,
"failed to read the interrupt message: %d\n", error);
@@ -104,9 +116,13 @@ static irqreturn_t imagis_interrupt(int irq, void *dev_id)
finger_pressed = intr_message & IST3038C_FINGER_STATUS_MASK;
 
for (i = 0; i < finger_count; i++) {
-   error = imagis_i2c_read_reg(ts,
-   IST3038C_REG_TOUCH_COORD + (i * 4),
-   &finger_status);
+   if (ts->tdata->protocol_b)
+   error = imagis_i2c_read_reg(ts,
+   ts->tdata->touch_coord_cmd, &finger_status);
+   else
+   error = imagis_i2c_read_reg(ts,
+   ts->tdata->touch_coord_cmd + (i * 4),
+   &finger_status);
if (error) {
dev_err(&ts->client->dev,
"failed to read coordinates for finger %d: %d\n",
@@ -261,6 +277,12 @@ static int imagis_probe(struct i2c_client *i2c)
 
ts->client = i2c;
 
+   ts->tdata = device_get_match_data(dev);
+   if (!ts->tdata) {
+   dev_err(dev, "missing chip data\n");
+   return -EINVAL;
+   }
+
error = imagis_init_regulators(ts);
if (error) {
dev_err(dev, "regulator init error: %d\n", error);
@@ -279,15 +301,13 @@ static int imagis_probe(struct i2c_client *i2c)
return error;
}
 
-   error = imagis_i2c_read_reg(ts,
-   IST3038C_REG_CHIPID | IST3038C_DIRECT_ACCESS,
-   &chip_id);
+   error = imagis_i2c_read_reg(ts, ts->tdata->whoami_cmd, &chip_id);
if (error) {
dev_err(dev, "chip ID read failure: %d\n", error);
return error;
}
 
-   if (chip_id != IST3038C_WHOAMI) {
+   if (chip_id != ts->tdata->whoami_val) {
dev_err(dev, "unknown chip ID: 0x%x\n", chip_id);
return -EINVAL;
}
@@ -343,9 +363,25 @@ static int imagis_resume(struct device *dev)
 
 static DEFINE_SIMPLE_DEV_PM_OPS(imagis_pm_ops, imagis_suspend, imagis_resume);
 
+static const struct imagis_properties imagis_3038b_data = {
+   .interrupt_msg_cmd = IST3038B_REG_STATUS,
+   .touch_coord_cmd = IST3038B_REG_STATUS,
+   .whoami_cmd = IST3038B_REG_CHIPID,
+   .whoami_val = IST3038B_WHOAMI,
+   .protocol_b = true,
+};
+
+static const struct 

[PATCH v5 4/5] dt-bindings: input/touchscreen: imagis: add compatible for IST3032C

2024-02-09 Thread Karel Balej
From: Karel Balej 

IST3032C is a touchscreen IC which seems mostly compatible with IST3038C
except that it reports a different chip ID value.

Acked-by: Rob Herring 
Signed-off-by: Karel Balej 
---

Notes:
v5:
- Add Rob's trailer.
v4:
- Reword commit description to mention how this IC differs from the
  already supported.

 .../devicetree/bindings/input/touchscreen/imagis,ist3038c.yaml   | 1 +
 1 file changed, 1 insertion(+)

diff --git 
a/Documentation/devicetree/bindings/input/touchscreen/imagis,ist3038c.yaml 
b/Documentation/devicetree/bindings/input/touchscreen/imagis,ist3038c.yaml
index b5372c4eae56..2af71cbcc97d 100644
--- a/Documentation/devicetree/bindings/input/touchscreen/imagis,ist3038c.yaml
+++ b/Documentation/devicetree/bindings/input/touchscreen/imagis,ist3038c.yaml
@@ -18,6 +18,7 @@ properties:
 
   compatible:
 enum:
+  - imagis,ist3032c
   - imagis,ist3038b
   - imagis,ist3038c
 
-- 
2.43.0




[PATCH v5 2/5] dt-bindings: input/touchscreen: Add compatible for IST3038B

2024-02-09 Thread Karel Balej
From: Markuss Broks 

Imagis IST3038B is a variant (firmware?) of Imagis IST3038 IC
differing from IST3038C in its register interface. Add the
compatible for it to the IST3038C bindings.

Signed-off-by: Markuss Broks 
Acked-by: Conor Dooley 
[bal...@matfyz.cz: elaborate chip differences in the commit message]
Signed-off-by: Karel Balej 
---

Notes:
v4:
* Mention how the chip is different in terms of the programming model in
  the commit message.
* Add Conor's trailer.

 .../devicetree/bindings/input/touchscreen/imagis,ist3038c.yaml   | 1 +
 1 file changed, 1 insertion(+)

diff --git 
a/Documentation/devicetree/bindings/input/touchscreen/imagis,ist3038c.yaml 
b/Documentation/devicetree/bindings/input/touchscreen/imagis,ist3038c.yaml
index 0d6b033fd5fb..b5372c4eae56 100644
--- a/Documentation/devicetree/bindings/input/touchscreen/imagis,ist3038c.yaml
+++ b/Documentation/devicetree/bindings/input/touchscreen/imagis,ist3038c.yaml
@@ -18,6 +18,7 @@ properties:
 
   compatible:
 enum:
+  - imagis,ist3038b
   - imagis,ist3038c
 
   reg:
-- 
2.43.0




[PATCH v5 1/5] input/touchscreen: imagis: Correct the maximum touch area value

2024-02-09 Thread Karel Balej
From: Markuss Broks 

As specified in the downstream IST3038B driver and confirmed by testing,
the correct maximum reported touch area value is 16.

Signed-off-by: Markuss Broks 
Signed-off-by: Karel Balej 
---
 drivers/input/touchscreen/imagis.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/input/touchscreen/imagis.c b/drivers/input/touchscreen/imagis.c
index 07111ca24455..e67fd3011027 100644
--- a/drivers/input/touchscreen/imagis.c
+++ b/drivers/input/touchscreen/imagis.c
@@ -210,7 +210,7 @@ static int imagis_init_input_dev(struct imagis_ts *ts)
 
input_set_capability(input_dev, EV_ABS, ABS_MT_POSITION_X);
input_set_capability(input_dev, EV_ABS, ABS_MT_POSITION_Y);
-   input_set_abs_params(input_dev, ABS_MT_TOUCH_MAJOR, 0, 255, 0, 0);
+   input_set_abs_params(input_dev, ABS_MT_TOUCH_MAJOR, 0, 16, 0, 0);
 
touchscreen_parse_properties(input_dev, true, >prop);
if (!ts->prop.max_x || !ts->prop.max_y) {
-- 
2.43.0




[PATCH v5 5/5] input/touchscreen: imagis: add support for IST3032C

2024-02-09 Thread Karel Balej
From: Karel Balej 

IST3032C is a touchscreen chip used for instance in the
samsung,coreprimevelte smartphone, with which this was tested. Add the
chip specific information to the driver.

Reviewed-by: Markuss Broks 
Signed-off-by: Karel Balej 
---

Notes:
v4:
* Change the WHOAMI definition position to preserve alphanumerical order
  of the definitions.
* Add Markuss' Reviewed-by trailer.

 drivers/input/touchscreen/imagis.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/drivers/input/touchscreen/imagis.c b/drivers/input/touchscreen/imagis.c
index 9af8a6332ae6..e1fafa561ee3 100644
--- a/drivers/input/touchscreen/imagis.c
+++ b/drivers/input/touchscreen/imagis.c
@@ -11,6 +11,8 @@
 #include 
 #include 
 
+#define IST3032C_WHOAMI0x32c
+
 #define IST3038B_REG_STATUS0x20
 #define IST3038B_REG_CHIPID0x30
 #define IST3038B_WHOAMI0x30380b
@@ -363,6 +365,13 @@ static int imagis_resume(struct device *dev)
 
 static DEFINE_SIMPLE_DEV_PM_OPS(imagis_pm_ops, imagis_suspend, imagis_resume);
 
+static const struct imagis_properties imagis_3032c_data = {
+   .interrupt_msg_cmd = IST3038C_REG_INTR_MESSAGE,
+   .touch_coord_cmd = IST3038C_REG_TOUCH_COORD,
+   .whoami_cmd = IST3038C_REG_CHIPID,
+   .whoami_val = IST3032C_WHOAMI,
+};
+
 static const struct imagis_properties imagis_3038b_data = {
.interrupt_msg_cmd = IST3038B_REG_STATUS,
.touch_coord_cmd = IST3038B_REG_STATUS,
@@ -380,6 +389,7 @@ static const struct imagis_properties imagis_3038c_data = {
 
 #ifdef CONFIG_OF
 static const struct of_device_id imagis_of_match[] = {
+   { .compatible = "imagis,ist3032c", .data = &imagis_3032c_data },
	{ .compatible = "imagis,ist3038b", .data = &imagis_3038b_data },
	{ .compatible = "imagis,ist3038c", .data = &imagis_3038c_data },
{ },
-- 
2.43.0




[PATCH v16 5/6] Documentation: tracing: Add ring-buffer mapping

2024-02-09 Thread Vincent Donnefort
It is now possible to mmap() a ring-buffer to stream its content. Add
some documentation and a code example.

Signed-off-by: Vincent Donnefort 

diff --git a/Documentation/trace/index.rst b/Documentation/trace/index.rst
index 5092d6c13af5..0b300901fd75 100644
--- a/Documentation/trace/index.rst
+++ b/Documentation/trace/index.rst
@@ -29,6 +29,7 @@ Linux Tracing Technologies
timerlat-tracer
intel_th
ring-buffer-design
+   ring-buffer-map
stm
sys-t
coresight/index
diff --git a/Documentation/trace/ring-buffer-map.rst b/Documentation/trace/ring-buffer-map.rst
new file mode 100644
index ..628254e63830
--- /dev/null
+++ b/Documentation/trace/ring-buffer-map.rst
@@ -0,0 +1,104 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==
+Tracefs ring-buffer memory mapping
+==
+
+:Author: Vincent Donnefort 
+
+Overview
+
+Tracefs ring-buffer memory map provides an efficient method to stream data,
+as no memory copy is necessary. The application mapping the ring-buffer then
+becomes a consumer for that ring-buffer, in a similar fashion to trace_pipe.
+
+Memory mapping setup
+
+The mapping works with a mmap() of the trace_pipe_raw interface.
+
+The first system page of the mapping contains ring-buffer statistics and
+description. It is referred to as the meta-page. One of the most important
+fields of the meta-page is the reader. It contains the sub-buffer ID which
+can be safely read by the mapper (see ring-buffer-design.rst).
+
+The meta-page is followed by all the sub-buffers, ordered by ascending ID.
+It is therefore straightforward to know where the reader starts in the
+mapping:
+
+.. code-block:: c
+
+reader_id = meta->reader->id;
+reader_offset = meta->meta_page_size + reader_id * meta->subbuf_size;
+
+When the application is done with the current reader, it can get a new one
+using the trace_pipe_raw ioctl() TRACE_MMAP_IOCTL_GET_READER. This ioctl
+also updates the meta-page fields.
+
+Limitations
+===
+When a mapping is in place on a Tracefs ring-buffer, it is not possible to
+resize it (either the overall size of the ring-buffer or the size of each
+subbuf). It is also not possible to use snapshot or splice.
+
+Concurrent readers (either another application mapping that ring-buffer or the
+kernel with trace_pipe) are allowed but not recommended. They will compete for
+the ring-buffer and the output is unpredictable.
+
+Example
+===
+
+.. code-block:: c
+
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+#include 
+#include 
+
+#define TRACE_PIPE_RAW "/sys/kernel/tracing/per_cpu/cpu0/trace_pipe_raw"
+
+int main(void)
+{
+int page_size = getpagesize(), fd, reader_id;
+unsigned long meta_len, data_len;
+struct trace_buffer_meta *meta;
+void *map, *reader, *data;
+
+fd = open(TRACE_PIPE_RAW, O_RDONLY | O_NONBLOCK);
+if (fd < 0)
+exit(EXIT_FAILURE);
+
+map = mmap(NULL, page_size, PROT_READ, MAP_SHARED, fd, 0);
+if (map == MAP_FAILED)
+exit(EXIT_FAILURE);
+
+meta = (struct trace_buffer_meta *)map;
+meta_len = meta->meta_page_size;
+
+printf("entries:%llu\n", meta->entries);
+printf("overrun:%llu\n", meta->overrun);
+printf("read:   %llu\n", meta->read);
+printf("nr_subbufs: %u\n", meta->nr_subbufs);
+
+data_len = meta->subbuf_size * meta->nr_subbufs;
+data = mmap(NULL, data_len, PROT_READ, MAP_SHARED, fd, meta_len);
+if (data == MAP_FAILED)
+exit(EXIT_FAILURE);
+
+if (ioctl(fd, TRACE_MMAP_IOCTL_GET_READER) < 0)
+exit(EXIT_FAILURE);
+
+reader_id = meta->reader.id;
+reader = data + meta->subbuf_size * reader_id;
+
+printf("Current reader address: %p\n", reader);
+
+munmap(data, data_len);
+munmap(meta, meta_len);
+close (fd);
+
+return 0;
+}
-- 
2.43.0.687.g38aa6559b0-goog




[PATCH v16 4/6] tracing: Allow user-space mapping of the ring-buffer

2024-02-09 Thread Vincent Donnefort
Currently, user-space extracts data from the ring-buffer via splice,
which is handy for storage or network sharing. However, due to splice
limitations, it is impossible to do real-time analysis without a copy.

A solution for that problem is to let the user-space map the ring-buffer
directly.

The mapping is exposed via the per-CPU file trace_pipe_raw. The first
element of the mapping is the meta-page. It is followed by each
subbuffer constituting the ring-buffer, ordered by their unique page ID:

  * Meta-page -- see include/uapi/linux/trace_mmap.h for a description
  * Subbuf ID 0
  * Subbuf ID 1
 ...

It is therefore easy to translate a subbuf ID into an offset in the
mapping:

  reader_id = meta->reader->id;
  reader_offset = meta->meta_page_size + reader_id * meta->subbuf_size;

When new data is available, the mapper must call a newly introduced ioctl:
TRACE_MMAP_IOCTL_GET_READER. This will update the Meta-page reader ID to
point to the next reader containing unread data.

Mapping will prevent snapshot and buffer size modifications.

Signed-off-by: Vincent Donnefort 

diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h
index 182e05a3004a..7330249257e7 100644
--- a/include/uapi/linux/trace_mmap.h
+++ b/include/uapi/linux/trace_mmap.h
@@ -43,4 +43,6 @@ struct trace_buffer_meta {
__u64   Reserved2;
 };
 
+#define TRACE_MMAP_IOCTL_GET_READER_IO('T', 0x1)
+
 #endif /* _TRACE_MMAP_H_ */
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 4ebf4d0bd14c..2c0deeedf619 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -1175,6 +1175,12 @@ static void tracing_snapshot_instance_cond(struct 
trace_array *tr,
return;
}
 
+   if (tr->mapped) {
+   trace_array_puts(tr, "*** BUFFER MEMORY MAPPED ***\n");
+   trace_array_puts(tr, "*** Can not use snapshot (sorry) ***\n");
+   return;
+   }
+
local_irq_save(flags);
update_max_tr(tr, current, smp_processor_id(), cond_data);
local_irq_restore(flags);
@@ -1307,7 +1313,7 @@ static int tracing_arm_snapshot_locked(struct trace_array 
*tr)
	lockdep_assert_held(&trace_types_lock);

	spin_lock(&tr->snapshot_trigger_lock);
-   if (tr->snapshot == UINT_MAX) {
+   if (tr->snapshot == UINT_MAX || tr->mapped) {
		spin_unlock(&tr->snapshot_trigger_lock);
return -EBUSY;
}
@@ -6533,7 +6539,7 @@ static void tracing_set_nop(struct trace_array *tr)
 {
	if (tr->current_trace == &nop_trace)
return;
-   
+
tr->current_trace->enabled--;
 
if (tr->current_trace->reset)
@@ -8652,15 +8658,31 @@ tracing_buffers_splice_read(struct file *file, loff_t 
*ppos,
return ret;
 }
 
-/* An ioctl call with cmd 0 to the ring buffer file will wake up all waiters */
 static long tracing_buffers_ioctl(struct file *file, unsigned int cmd, 
unsigned long arg)
 {
struct ftrace_buffer_info *info = file->private_data;
	struct trace_iterator *iter = &info->iter;
+   int err;
+
+   if (cmd == TRACE_MMAP_IOCTL_GET_READER) {
+   if (!(file->f_flags & O_NONBLOCK)) {
+   err = ring_buffer_wait(iter->array_buffer->buffer,
+  iter->cpu_file,
+  iter->tr->buffer_percent);
+   if (err)
+   return err;
+   }
 
-   if (cmd)
-   return -ENOIOCTLCMD;
+   return ring_buffer_map_get_reader(iter->array_buffer->buffer,
+ iter->cpu_file);
+   } else if (cmd) {
+   return -ENOTTY;
+   }
 
+   /*
+* An ioctl call with cmd 0 to the ring buffer file will wake up all
+* waiters
+*/
	mutex_lock(&trace_types_lock);
 
iter->wait_index++;
@@ -8673,6 +8695,97 @@ static long tracing_buffers_ioctl(struct file *file, 
unsigned int cmd, unsigned
return 0;
 }
 
+static vm_fault_t tracing_buffers_mmap_fault(struct vm_fault *vmf)
+{
+   struct ftrace_buffer_info *info = vmf->vma->vm_file->private_data;
+   struct trace_iterator *iter = &info->iter;
+   vm_fault_t ret = VM_FAULT_SIGBUS;
+   struct page *page;
+
+   page = ring_buffer_map_fault(iter->array_buffer->buffer, iter->cpu_file,
+vmf->pgoff);
+   if (!page)
+   return ret;
+
+   get_page(page);
+   vmf->page = page;
+   vmf->page->mapping = vmf->vma->vm_file->f_mapping;
+   vmf->page->index = vmf->pgoff;
+
+   return 0;
+}
+
+static void tracing_buffers_mmap_close(struct vm_area_struct *vma)
+{
+   struct ftrace_buffer_info *info = vma->vm_file->private_data;
+   struct trace_iterator *iter = &info->iter;
+   struct trace_array __maybe_unused *tr = iter->tr;
+
+   ring_buffer_unmap(iter->array_buffer->buffer, 

[PATCH v16 3/6] tracing: Add snapshot refcount

2024-02-09 Thread Vincent Donnefort
When a ring-buffer is memory mapped by user-space, no trace or
ring-buffer swap is possible. This means the snapshot feature is
mutually exclusive with the memory mapping. Having a refcount on
snapshot users helps to know whether a mapping is possible or not.

Instead of relying on the global trace_types_lock, a new spinlock is
introduced to serialize accesses to trace_array->snapshot. This is
intended to allow access to that variable in a context where the mmap
lock is already held.

Signed-off-by: Vincent Donnefort 

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 2a7c6fd934e9..4ebf4d0bd14c 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -1300,6 +1300,50 @@ static void free_snapshot(struct trace_array *tr)
tr->allocated_snapshot = false;
 }
 
+static int tracing_arm_snapshot_locked(struct trace_array *tr)
+{
+   int ret;
+
+   lockdep_assert_held(&trace_types_lock);
+
+   spin_lock(&tr->snapshot_trigger_lock);
+   if (tr->snapshot == UINT_MAX) {
+   spin_unlock(&tr->snapshot_trigger_lock);
+   return -EBUSY;
+   }
+
+   tr->snapshot++;
+   spin_unlock(&tr->snapshot_trigger_lock);
+
+   ret = tracing_alloc_snapshot_instance(tr);
+   if (ret) {
+   spin_lock(&tr->snapshot_trigger_lock);
+   tr->snapshot--;
+   spin_unlock(&tr->snapshot_trigger_lock);
+   }
+
+   return ret;
+}
+
+int tracing_arm_snapshot(struct trace_array *tr)
+{
+   int ret;
+
+   mutex_lock(&trace_types_lock);
+   ret = tracing_arm_snapshot_locked(tr);
+   mutex_unlock(&trace_types_lock);
+
+   return ret;
+}
+
+void tracing_disarm_snapshot(struct trace_array *tr)
+{
+   spin_lock(&tr->snapshot_trigger_lock);
+   if (!WARN_ON(!tr->snapshot))
+   tr->snapshot--;
+   spin_unlock(&tr->snapshot_trigger_lock);
+}
+
 /**
  * tracing_alloc_snapshot - allocate snapshot buffer.
  *
@@ -1373,10 +1417,6 @@ int tracing_snapshot_cond_enable(struct trace_array *tr, 
void *cond_data,
 
	mutex_lock(&trace_types_lock);
 
-   ret = tracing_alloc_snapshot_instance(tr);
-   if (ret)
-   goto fail_unlock;
-
if (tr->current_trace->use_max_tr) {
ret = -EBUSY;
goto fail_unlock;
@@ -1395,6 +1435,10 @@ int tracing_snapshot_cond_enable(struct trace_array *tr, 
void *cond_data,
goto fail_unlock;
}
 
+   ret = tracing_arm_snapshot_locked(tr);
+   if (ret)
+   goto fail_unlock;
+
local_irq_disable();
	arch_spin_lock(&tr->max_lock);
tr->cond_snapshot = cond_snapshot;
@@ -1439,6 +1483,8 @@ int tracing_snapshot_cond_disable(struct trace_array *tr)
	arch_spin_unlock(&tr->max_lock);
local_irq_enable();
 
+   tracing_disarm_snapshot(tr);
+
return ret;
 }
 EXPORT_SYMBOL_GPL(tracing_snapshot_cond_disable);
@@ -1481,6 +1527,7 @@ int tracing_snapshot_cond_disable(struct trace_array *tr)
 }
 EXPORT_SYMBOL_GPL(tracing_snapshot_cond_disable);
 #define free_snapshot(tr)  do { } while (0)
+#define tracing_arm_snapshot_locked(tr) ({ -EBUSY; })
 #endif /* CONFIG_TRACER_SNAPSHOT */
 
 void tracer_tracing_off(struct trace_array *tr)
@@ -6593,11 +6640,12 @@ int tracing_set_tracer(struct trace_array *tr, const 
char *buf)
 */
synchronize_rcu();
free_snapshot(tr);
+   tracing_disarm_snapshot(tr);
}
 
-   if (t->use_max_tr && !tr->allocated_snapshot) {
-   ret = tracing_alloc_snapshot_instance(tr);
-   if (ret < 0)
+   if (t->use_max_tr) {
+   ret = tracing_arm_snapshot_locked(tr);
+   if (ret)
goto out;
}
 #else
@@ -6606,8 +6654,13 @@ int tracing_set_tracer(struct trace_array *tr, const 
char *buf)
 
if (t->init) {
ret = tracer_init(t, tr);
-   if (ret)
+   if (ret) {
+#ifdef CONFIG_TRACER_MAX_TRACE
+   if (t->use_max_tr)
+   tracing_disarm_snapshot(tr);
+#endif
goto out;
+   }
}
 
tr->current_trace = t;
@@ -7709,10 +7762,11 @@ tracing_snapshot_write(struct file *filp, const char 
__user *ubuf, size_t cnt,
if (tr->allocated_snapshot)
		ret = resize_buffer_duplicate_size(&tr->max_buffer,
			&tr->array_buffer, iter->cpu_file);
-   else
-   ret = tracing_alloc_snapshot_instance(tr);
-   if (ret < 0)
+
+   ret = tracing_arm_snapshot_locked(tr);
+   if (ret)
break;
+
/* Now, we're going to swap */
if (iter->cpu_file == RING_BUFFER_ALL_CPUS) {
local_irq_disable();
@@ -7722,6 +7776,7 @@ tracing_snapshot_write(struct file *filp, const char 
__user *ubuf, size_t cnt,

[PATCH v16 2/6] ring-buffer: Introducing ring-buffer mapping functions

2024-02-09 Thread Vincent Donnefort
In preparation for allowing user-space to map a ring-buffer, add
a set of mapping functions:

  ring_buffer_{map,unmap}()
  ring_buffer_map_fault()

And controls on the ring-buffer:

  ring_buffer_map_get_reader()  /* swap reader and head */

Mapping the ring-buffer also involves:

  A unique ID for each subbuf of the ring-buffer, currently they are
  only identified through their in-kernel VA.

  A meta-page, where ring-buffer statistics and a description of the
  current reader are stored

The linear mapping exposes the meta-page and each subbuf of the
ring-buffer, ordered by their unique ID, assigned during the
first mapping.

Once mapped, no subbuf can get in or out of the ring-buffer: the buffer
size will remain unmodified and the splice-enabling functions will in
reality simply memcpy the data instead of swapping subbufs.

Signed-off-by: Vincent Donnefort 

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index fa802db216f9..0841ba8bab14 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -6,6 +6,8 @@
 #include 
 #include 
 
+#include 
+
 struct trace_buffer;
 struct ring_buffer_iter;
 
@@ -221,4 +223,9 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct 
hlist_node *node);
 #define trace_rb_cpu_prepare   NULL
 #endif
 
+int ring_buffer_map(struct trace_buffer *buffer, int cpu);
+int ring_buffer_unmap(struct trace_buffer *buffer, int cpu);
+struct page *ring_buffer_map_fault(struct trace_buffer *buffer, int cpu,
+  unsigned long pgoff);
+int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu);
 #endif /* _LINUX_RING_BUFFER_H */
diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h
new file mode 100644
index ..182e05a3004a
--- /dev/null
+++ b/include/uapi/linux/trace_mmap.h
@@ -0,0 +1,46 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _TRACE_MMAP_H_
+#define _TRACE_MMAP_H_
+
+#include 
+
+/**
+ * struct trace_buffer_meta - Ring-buffer Meta-page description
+ * @meta_page_size:Size of this meta-page.
+ * @meta_struct_len:   Size of this structure.
+ * @subbuf_size:   Size of each sub-buffer.
+ * @nr_subbufs:Number of subbufs in the ring-buffer.
+ * @reader.lost_events:Number of events lost at the time of the reader swap.
+ * @reader.id: subbuf ID of the current reader. From 0 to @nr_subbufs - 1
+ * @reader.read:   Number of bytes read on the reader subbuf.
+ * @flags: Placeholder for now, no defined values.
+ * @entries:   Number of entries in the ring-buffer.
+ * @overrun:   Number of entries lost in the ring-buffer.
+ * @read:  Number of entries that have been read.
+ * @Reserved1: Reserved for future use.
+ * @Reserved2: Reserved for future use.
+ */
+struct trace_buffer_meta {
+   __u32   meta_page_size;
+   __u32   meta_struct_len;
+
+   __u32   subbuf_size;
+   __u32   nr_subbufs;
+
+   struct {
+   __u64   lost_events;
+   __u32   id;
+   __u32   read;
+   } reader;
+
+   __u64   flags;
+
+   __u64   entries;
+   __u64   overrun;
+   __u64   read;
+
+   __u64   Reserved1;
+   __u64   Reserved2;
+};
+
+#endif /* _TRACE_MMAP_H_ */
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index ca796675c0a1..4543fc51567d 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -9,6 +9,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -338,6 +339,7 @@ struct buffer_page {
local_t  entries;   /* entries on this page */
unsigned longreal_end;  /* real end of data */
unsigned order; /* order of the page */
+   u32  id;/* ID for external mapping */
struct buffer_data_page *page;  /* Actual data page */
 };
 
@@ -484,6 +486,12 @@ struct ring_buffer_per_cpu {
u64 read_stamp;
/* pages removed since last reset */
unsigned long   pages_removed;
+
+   unsigned intmapped;
+   struct mutexmapping_lock;
+   unsigned long   *subbuf_ids;/* ID to addr */
+   struct trace_buffer_meta*meta_page;
+
/* ring buffer pages to update, > 0 to add, < 0 to remove */
longnr_pages_to_update;
struct list_headnew_pages; /* new pages to add */
@@ -1548,6 +1556,7 @@ rb_allocate_cpu_buffer(struct trace_buffer *buffer, long 
nr_pages, int cpu)
	init_irq_work(&cpu_buffer->irq_work.work, rb_wake_up_waiters);
	init_waitqueue_head(&cpu_buffer->irq_work.waiters);
	init_waitqueue_head(&cpu_buffer->irq_work.full_waiters);
+   mutex_init(&cpu_buffer->mapping_lock);
 

[PATCH v16 1/6] ring-buffer: Zero ring-buffer sub-buffers

2024-02-09 Thread Vincent Donnefort
In preparation for the ring-buffer memory mapping where each subbuf will
be accessible to user-space, zero all the page allocations.

Signed-off-by: Vincent Donnefort 
Reviewed-by: Masami Hiramatsu (Google) 

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index fd4bfe3ecf01..ca796675c0a1 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -1472,7 +1472,8 @@ static int __rb_allocate_pages(struct ring_buffer_per_cpu 
*cpu_buffer,
 
		list_add(&bpage->list, pages);
 
-   page = alloc_pages_node(cpu_to_node(cpu_buffer->cpu), mflags,
+   page = alloc_pages_node(cpu_to_node(cpu_buffer->cpu),
+   mflags | __GFP_ZERO,
cpu_buffer->buffer->subbuf_order);
if (!page)
goto free_pages;
@@ -1557,7 +1558,8 @@ rb_allocate_cpu_buffer(struct trace_buffer *buffer, long 
nr_pages, int cpu)
 
cpu_buffer->reader_page = bpage;
 
-   page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL, 
cpu_buffer->buffer->subbuf_order);
+   page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL | __GFP_ZERO,
+   cpu_buffer->buffer->subbuf_order);
if (!page)
goto fail_free_reader;
bpage->page = page_address(page);
@@ -5525,7 +5527,8 @@ ring_buffer_alloc_read_page(struct trace_buffer *buffer, 
int cpu)
if (bpage->data)
goto out;
 
-   page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL | __GFP_NORETRY,
+   page = alloc_pages_node(cpu_to_node(cpu),
+   GFP_KERNEL | __GFP_NORETRY | __GFP_ZERO,
cpu_buffer->buffer->subbuf_order);
if (!page) {
kfree(bpage);
-- 
2.43.0.687.g38aa6559b0-goog




[PATCH v16 0/6] Introducing trace buffer mapping by user-space

2024-02-09 Thread Vincent Donnefort
The tracing ring-buffers can be stored on disk or sent to the network
without any copy via splice. However, the latter doesn't allow real-time
processing of the traces. A solution is to give userspace direct access
to the ring-buffer pages via a mapping. An application can now become a
consumer of the ring-buffer, in a similar fashion to what trace_pipe
offers.

Support for this new feature can already be found in libtracefs from
version 1.8, when built with EXTRA_CFLAGS=-DFORCE_MMAP_ENABLE.

Vincent

v15 -> v16:
  * Add comment for the dcache flush.
  * Remove now unnecessary WRITE_ONCE for the meta-page.

v14 -> v15:
  * Add meta-page and reader-page flush. Intends to fix the mapping
for VIVT and aliasing-VIPT data caches.
  * -EPERM on VM_EXEC.
  * Fix build warning !CONFIG_TRACER_MAX_TRACE.

v13 -> v14:
  * All cpu_buffer->mapped readers use READ_ONCE (except for swap_cpu)
  * on unmap, sync meta-page teardown with the reader_lock instead of
the synchronize_rcu.
  * Add a dedicated spinlock for trace_array ->snapshot and ->mapped.
(intends to fix a lockdep issue)
  * Add kerneldoc for flags and Reserved fields.
  * Add kselftest for snapshot/map mutual exclusion.

v12 -> v13:
  * Swap subbufs_{touched,lost} for Reserved fields.
  * Add a flag field in the meta-page.
  * Fix CONFIG_TRACER_MAX_TRACE.
  * Rebase on top of trace/urgent.
  * Add a comment for try_unregister_trigger()

v11 -> v12:
  * Fix code sample mmap bug.
  * Add logging in sample code.
  * Reset tracer in selftest.
  * Add a refcount for the snapshot users.
  * Prevent mapping when there are snapshot users and vice versa.
  * Refine the meta-page.
  * Fix types in the meta-page.
  * Collect Reviewed-by.

v10 -> v11:
  * Add Documentation and code sample.
  * Add a selftest.
  * Move all the update to the meta-page into a single
rb_update_meta_page().
  * rb_update_meta_page() is now called from
ring_buffer_map_get_reader() to fix NOBLOCK callers.
  * kerneldoc for struct trace_meta_page.
  * Add a patch to zero all the ring-buffer allocations.

v9 -> v10:
  * Refactor rb_update_meta_page()
  * In-loop declaration for foreach_subbuf_page()
  * Check for cpu_buffer->mapped overflow

v8 -> v9:
  * Fix the unlock path in ring_buffer_map()
  * Fix cpu_buffer cast with rb_work_rq->is_cpu_buffer
  * Rebase on linux-trace/for-next (3cb3091138ca0921c4569bcf7ffa062519639b6a)

v7 -> v8:
  * Drop the subbufs renaming into bpages
  * Use subbuf as a name when relevant

v6 -> v7:
  * Rebase onto lore.kernel.org/lkml/20231215175502.106587...@goodmis.org/
  * Support for subbufs
  * Rename subbufs into bpages

v5 -> v6:
  * Rebase on next-20230802.
  * (unsigned long) -> (void *) cast for virt_to_page().
  * Add a wait for the GET_READER_PAGE ioctl.
  * Move writer fields update (overrun/pages_lost/entries/pages_touched)
in the irq_work.
  * Rearrange id in struct buffer_page.
  * Rearrange the meta-page.
  * ring_buffer_meta_page -> trace_buffer_meta_page.
  * Add meta_struct_len into the meta-page.

v4 -> v5:
  * Trivial rebase onto 6.5-rc3 (previously 6.4-rc3)

v3 -> v4:
  * Add to the meta-page:
   - pages_lost / pages_read (allow to compute how full is the
 ring-buffer)
   - read (allow to compute how many entries can be read)
   - A reader_page struct.
  * Rename ring_buffer_meta_header -> ring_buffer_meta
  * Rename ring_buffer_get_reader_page -> ring_buffer_map_get_reader_page
  * Properly consume events on ring_buffer_map_get_reader_page() with
rb_advance_reader().

v2 -> v3:
  * Remove data page list (for non-consuming read)
** Implies removing order > 0 meta-page
  * Add a new meta page field ->read
  * Rename ring_buffer_meta_page_header into ring_buffer_meta_header

v1 -> v2:
  * Hide data_pages from the userspace struct
  * Fix META_PAGE_MAX_PAGES
  * Support for order > 0 meta-page
  * Add missing page->mapping.

Vincent Donnefort (6):
  ring-buffer: Zero ring-buffer sub-buffers
  ring-buffer: Introducing ring-buffer mapping functions
  tracing: Add snapshot refcount
  tracing: Allow user-space mapping of the ring-buffer
  Documentation: tracing: Add ring-buffer mapping
  ring-buffer/selftest: Add ring-buffer mapping test

 Documentation/trace/index.rst |   1 +
 Documentation/trace/ring-buffer-map.rst   | 104 ++
 include/linux/ring_buffer.h   |   7 +
 include/uapi/linux/trace_mmap.h   |  48 +++
 kernel/trace/ring_buffer.c| 351 +-
 kernel/trace/trace.c  | 221 +--
 kernel/trace/trace.h  |   9 +-
 kernel/trace/trace_events_trigger.c   |  58 ++-
 tools/testing/selftests/ring-buffer/Makefile  |   8 +
 tools/testing/selftests/ring-buffer/config|   2 +
 .../testing/selftests/ring-buffer/map_test.c  | 273 ++
 11 files changed, 1036 insertions(+), 46 deletions(-)
 create mode 100644 Documentation/trace/ring-buffer-map.rst
 create mode 

Re: [PATCH v17 3/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper

2024-02-09 Thread Alex Williamson
On Fri, 9 Feb 2024 09:20:22 +
Ankit Agrawal  wrote:

> Thanks Kevin for the review. Comments inline.
> 
> >>
> >> Note that the usemem memory is added by the VM Nvidia device driver [5]
> >> to the VM kernel as memblocks. Hence make the usable memory size
> >> memblock aligned.  
> >
> > Is memblock size defined in spec or purely a guest implementation choice?  
> 
> The MEMBLOCK value is a hardwired, constant ABI value between the GPU
> FW and the VFIO driver.
> 
> >>
> >> If the bare metal properties are not present, the driver registers the
> >> vfio-pci-core function pointers.  
> >
> > so if qemu doesn't generate such property the variant driver running
> > inside guest will always go to use core functions and guest vfio userspace
> > will observe both resmem and usemem bars. But then there is nothing
> > in field to prohibit mapping resmem bar as cacheable.
> >
> > should this driver check the presence of either ACPI property or
> > resmem/usemem bars to enable variant function pointers?  
> 
> Maybe I am missing something here; but if the ACPI property is absent,
> the real physical BARs present on the device will be exposed by the
> vfio-pci-core functions to the VM. So I think if the variant driver is run
> within the VM, it should not see the fake usemem and resmem BARs.

There are two possibilities here, either we're assigning the pure
physical device from a host that does not have the ACPI properties or
we're performing a nested assignment.  In the former case we're simply
passing along the unmodified physical BARs.  In the latter case we're
actually passing through the fake BARs, the virtualization of the
device has already happened in the level 1 assignment.

I think Kevin's point is also relevant to this latter scenario: in the
L1 instance of the nvgrace-gpu driver the mmap of the usemem BAR is
cachable, but in the L2 instance of the driver where we only use the
vfio-pci-core ops nothing maintains that cachable mapping.  Is that a
problem?  An uncached mapping on top of a cachable mapping is often
prone to problems.  Thanks,

Alex




Re: [RFC V1 05/13] vhost-vdpa: VHOST_IOTLB_REMAP

2024-02-09 Thread Steven Sistare
On 1/16/2024 1:14 PM, Eugenio Perez Martin wrote:
> On Wed, Jan 10, 2024 at 9:40 PM Steve Sistare  
> wrote:
>>
>> When device ownership is passed to a new process via VHOST_NEW_OWNER,
>> some devices need to know the new userland addresses of the dma mappings.
>> Define the new iotlb message type VHOST_IOTLB_REMAP to update the uaddr
>> of a mapping.  The new uaddr must address the same memory object as
>> originally mapped.
> 
> I think this new ioctl is very interesting, and can be used to
> optimize some parts of live migration with shadow virtqueue if it
> allows to actually replace the uaddr. Would you be interested in that
> capability?

Sure.  Please share your thoughts on how it would be used.  I don't follow,
because with live migration you are creating a new vdpa instance with new
shadow queues on the destination, versus relocating old shadow queues (which
we could do during live update, which preserves the same vdpa instance).

(Sorry for the late response, I stashed this email and forgot to respond.)

>> The user must suspend the device before the old address is invalidated,
>> and cannot resume it until after VHOST_IOTLB_REMAP is called, but this
>> requirement is not enforced by the API.
>>
>> Signed-off-by: Steve Sistare 
>> ---
>>  drivers/vhost/vdpa.c | 34 
>>  include/uapi/linux/vhost_types.h | 11 ++-
>>  2 files changed, 44 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
>> index faed6471934a..ec5ca20bd47d 100644
>> --- a/drivers/vhost/vdpa.c
>> +++ b/drivers/vhost/vdpa.c
>> @@ -1219,6 +1219,37 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
>>
>>  }
>>
>> +static int vhost_vdpa_process_iotlb_remap(struct vhost_vdpa *v,
>> + struct vhost_iotlb *iotlb,
>> + struct vhost_iotlb_msg *msg)
>> +{
>> +   struct vdpa_device *vdpa = v->vdpa;
>> +   const struct vdpa_config_ops *ops = vdpa->config;
>> +   u32 asid = iotlb_to_asid(iotlb);
>> +   u64 start = msg->iova;
>> +   u64 last = start + msg->size - 1;
>> +   struct vhost_iotlb_map *map;
>> +   int r = 0;
>> +
>> +   if (msg->perm || !msg->size)
>> +   return -EINVAL;
>> +
>> +   map = vhost_iotlb_itree_first(iotlb, start, last);
>> +   if (!map)
>> +   return -ENOENT;
>> +
>> +   if (map->start != start || map->last != last)
>> +   return -EINVAL;
>> +
>> +   /* batch will finish with remap.  non-batch must do it now. */
>> +   if (!v->in_batch)
>> +   r = ops->set_map(vdpa, asid, iotlb);
> 
> I'm missing how these cases are handled:
>
> * The device does not expose set_map but dma_map / dma_unmap (you can
> check this case with the simulator)

I chose not to support dma_map, because set_map looks to be more commonly
supported, and batch-mode plus set_map is the most efficient way to support
remapping.
I could define a dma_remap entry point if folks think that is important.
Regardless, I will check that ops->set_map != NULL before calling it.

> * The device uses platform iommu (!set_map && !dma_unmap vdpa_ops).

I believe this just works, because iommu_map() never saves the userland
address, so it does not need to be informed when uaddr changes.  That is
certainly true for the code path:

  vhost_vdpa_pa_map()
vhost_vdpa_map()
  if !dma_map && !set_map
iommu_map()

However, this code path confuses me:

  vhost_vdpa_process_iotlb_update()
if (vdpa->use_va)
  vhost_vdpa_va_map(... uaddr)
vhost_vdpa_map(... uaddr)
  iommu_map(... uaddr)

AFAICT uaddr is never translated to physical.
Maybe use_va is always false if !dma_map && !set_map  ?

- Steve

>> +   if (!r)
>> +   map->addr = msg->uaddr;
>> +
>> +   return r;
>> +}
>> +
>>  static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
>>struct vhost_iotlb *iotlb,
>>struct vhost_iotlb_msg *msg)
>> @@ -1298,6 +1329,9 @@ static int vhost_vdpa_process_iotlb_msg(struct 
>> vhost_dev *dev, u32 asid,
>> ops->set_map(vdpa, asid, iotlb);
>> v->in_batch = false;
>> break;
>> +   case VHOST_IOTLB_REMAP:
>> +   r = vhost_vdpa_process_iotlb_remap(v, iotlb, msg);
>> +   break;
>> default:
>> r = -EINVAL;
>> break;
>> diff --git a/include/uapi/linux/vhost_types.h 
>> b/include/uapi/linux/vhost_types.h
>> index 9177843951e9..35908315ff55 100644
>> --- a/include/uapi/linux/vhost_types.h
>> +++ b/include/uapi/linux/vhost_types.h
>> @@ -79,7 +79,7 @@ struct vhost_iotlb_msg {
>>  /*
>>   * VHOST_IOTLB_BATCH_BEGIN and VHOST_IOTLB_BATCH_END allow modifying
>>   * multiple mappings in one go: beginning with
>> - * VHOST_IOTLB_BATCH_BEGIN, followed by any number 

Re: [PATCH v15 2/6] ring-buffer: Introducing ring-buffer mapping functions

2024-02-09 Thread Steven Rostedt
On Fri,  9 Feb 2024 10:21:58 +
Vincent Donnefort  wrote:

> +static void rb_update_meta_page(struct ring_buffer_per_cpu *cpu_buffer)
> +{
> + struct trace_buffer_meta *meta = cpu_buffer->meta_page;
> +
> + WRITE_ONCE(meta->reader.read, cpu_buffer->reader_page->read);
> + WRITE_ONCE(meta->reader.id, cpu_buffer->reader_page->id);
> + WRITE_ONCE(meta->reader.lost_events, cpu_buffer->lost_events);
> +
> + WRITE_ONCE(meta->entries, local_read(&cpu_buffer->entries));
> + WRITE_ONCE(meta->overrun, local_read(&cpu_buffer->overrun));
> + WRITE_ONCE(meta->read, cpu_buffer->read);
> +

Small nit, but we should add a comment to the two flush_dcache_folio() calls:

/* Some archs do not have data cache coherency between kernel and user space */
> + flush_dcache_folio(virt_to_folio(cpu_buffer->meta_page));
> +}
> +


-- Steve



Re: [v2 PATCH 0/1] ALSA: virtio: add support for audio controls

2024-02-09 Thread Takashi Iwai
On Mon, 15 Jan 2024 14:36:53 +0100,
Aiswarya Cyriac wrote:
> 
> From: Anton Yakovlev 
> 
> Changes in v2
> - Fix reporting of incorrect number of items for ENUMERATED controls
> 
> Implementation of support for audio controls in accordance with the
> extension of the virtio sound device specification[1] planned for
> virtio-v1.3-cs01.
> 
> Design of virtual audio controls is based on and derived from ALSA
> audio controls. It allows the driver to perform all standard operations,
> such as reading and writing audio control value, as well as working with
> metadata (represented in the TLV form).
> 
> The driver part was tested on top of the Linux 5.10 kernel.
> 
> As a device part was used OpenSynergy proprietary implementation.
> 
> [1] https://lists.oasis-open.org/archives/virtio-comment/202104/msg00013.html
> 
> Anton Yakovlev (1):
>   ALSA: virtio: add support for audio controls

Applied now.  Thanks.


Takashi



Re: [PATCH net-next v5] virtio_net: Support RX hash XDP hint

2024-02-09 Thread Paolo Abeni
On Fri, 2024-02-09 at 18:39 +0800, Liang Chen wrote:
> On Wed, Feb 7, 2024 at 10:27 PM Paolo Abeni  wrote:
> > 
> > On Wed, 2024-02-07 at 10:54 +0800, Liang Chen wrote:
> > > On Tue, Feb 6, 2024 at 6:44 PM Paolo Abeni  wrote:
> > > > 
> > > > On Sat, 2024-02-03 at 10:56 +0800, Liang Chen wrote:
> > > > > On Sat, Feb 3, 2024 at 12:20 AM Jesper Dangaard Brouer 
> > > > >  wrote:
> > > > > > On 02/02/2024 13.11, Liang Chen wrote:
> > > > [...]
> > > > > > > @@ -1033,6 +1039,16 @@ static void put_xdp_frags(struct xdp_buff 
> > > > > > > *xdp)
> > > > > > >   }
> > > > > > >   }
> > > > > > > 
> > > > > > > +static void virtnet_xdp_save_rx_hash(struct virtnet_xdp_buff 
> > > > > > > *virtnet_xdp,
> > > > > > > +  struct net_device *dev,
> > > > > > > +  struct virtio_net_hdr_v1_hash 
> > > > > > > *hdr_hash)
> > > > > > > +{
> > > > > > > + if (dev->features & NETIF_F_RXHASH) {
> > > > > > > + virtnet_xdp->hash_value = hdr_hash->hash_value;
> > > > > > > + virtnet_xdp->hash_report = hdr_hash->hash_report;
> > > > > > > + }
> > > > > > > +}
> > > > > > > +
> > > > > > 
> > > > > > Would it be possible to store a pointer to hdr_hash in 
> > > > > > virtnet_xdp_buff,
> > > > > > with the purpose of delaying extracting this, until and only if XDP
> > > > > > bpf_prog calls the kfunc?
> > > > > > 
> > > > > 
> > > > > That seems to be the way v1 works,
> > > > > https://lore.kernel.org/all/20240122102256.261374-1-liangchen.li...@gmail.com/
> > > > > . But it was pointed out that the inline header may be overwritten by
> > > > > the xdp prog, so the hash is copied out to maintain its integrity.
> > > > 
> > > > Why? isn't XDP supposed to get write access only to the pkt
> > > > contents/buffer?
> > > > 
> > > 
> > > Normally, an XDP program accesses only the packet data. However,
> > > there's also an XDP RX Metadata area, referenced by the data_meta
> > > pointer. This pointer can be adjusted with bpf_xdp_adjust_meta to
> > > point somewhere ahead of the data buffer, thereby granting the XDP
> > > program access to the virtio header located immediately before the
> > 
> > AFAICS bpf_xdp_adjust_meta() does not allow moving the meta_data before
> > xdp->data_hard_start:
> > 
> > https://elixir.bootlin.com/linux/latest/source/net/core/filter.c#L4210
> > 
> > and virtio net set such field after the virtio_net_hdr:
> > 
> > https://elixir.bootlin.com/linux/latest/source/drivers/net/virtio_net.c#L1218
> > https://elixir.bootlin.com/linux/latest/source/drivers/net/virtio_net.c#L1420
> > 
> > I don't see how the virtio hdr could be touched? Possibly even more
> > important: if such thing is possible, I think is should be somewhat
> > denied (for the same reason an H/W nic should prevent XDP from
> > modifying its own buffer descriptor).
> 
> Thank you for highlighting this concern. The header layout differs
> slightly between small and mergeable mode. Taking 'mergeable mode' as
> an example, after calling xdp_prepare_buff the layout of xdp_buff
> would be as depicted in the diagram below,
> 
>       buf
>        |
>        v
> +--------------+---------------+---------+
> | xdp headroom | virtio header | packet  |
> | (256 bytes)  | (20 bytes)    | content |
> +--------------+---------------+---------+
> ^                              ^
> |                              |
> data_hard_start                data
>                                data_meta
> 
> If 'bpf_xdp_adjust_meta' repositions the 'data_meta' pointer a little
> towards 'data_hard_start', it would point to the inline header, thus
> potentially allowing the XDP program to access the inline header.

I see. That layout was completely unexpected to me.

AFAICS the virtio_net driver tries to avoid accessing/using the
virtio_net_hdr after the XDP program execution, so nothing tragic
should happen.

@Michael, @Jason, I guess the above is like that by design? Isn't it a
bit fragile?

Thanks!

Paolo




[PATCH v2] tracing: Fix wasted memory in saved_cmdlines logic

2024-02-09 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

While looking at improving the saved_cmdlines cache I found a huge amount
of wasted memory that should be used for the cmdlines.

The tracing data saves pids during the trace. At sched switch, if a trace
occurred, it will save the comm of the task that did the trace. This is
saved in a "cache" that maps pids to comms and exposed to user space via
the /sys/kernel/tracing/saved_cmdlines file. Currently it only caches by
default 128 comms.

The structure that uses this creates an array to store the pids using
PID_MAX_DEFAULT (which is usually set to 32768). This causes the structure
to be of the size of 131104 bytes on 64 bit machines.

In hex: 131104 = 0x20020, and since the kernel allocates generic memory in
powers of two, the kernel would allocate 0x40000 or 262144 bytes to store
this structure. That leaves 131040 bytes of wasted space.

Worse, the structure points to an allocated array to store the comm names,
which is 16 bytes times the amount of names to save (currently 128), which
is 2048 bytes. Instead of allocating a separate array, make the structure
end with a variable length string and use the extra space for that.

This is similar to a recommendation that Linus had made about eventfs_inode 
names:

  
https://lore.kernel.org/all/20240130190355.11486-5-torva...@linux-foundation.org/

Instead of allocating a separate string array to hold the saved comms,
have the structure end with: char saved_cmdlines[]; and round up to the
next power of two over sizeof(struct saved_cmdlines_buffer) +
num_cmdlines * TASK_COMM_LEN. It will use this extra space for the
saved_cmdline portion.

Now, instead of saving only 128 comms by default, by using this wasted
space at the end of the structure it can save over 8000 comms and even
saves space by removing the need for allocating the other array.

Cc: sta...@vger.kernel.org
Fixes: 939c7a4f04fcd ("tracing: Introduce saved_cmdlines_size file")
Signed-off-by: Steven Rostedt (Google) 
---
Changes since v1: 
https://lore.kernel.org/linux-trace-kernel/20240208105328.7e73f...@rorschach.local.home

- Added back error check of s->map_cmdline_to_pid allocation failure

 kernel/trace/trace.c | 75 ++--
 1 file changed, 37 insertions(+), 38 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 2a7c6fd934e9..9ff8a439d674 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -2320,7 +2320,7 @@ struct saved_cmdlines_buffer {
unsigned *map_cmdline_to_pid;
unsigned cmdline_num;
int cmdline_idx;
-   char *saved_cmdlines;
+   char saved_cmdlines[];
 };
 static struct saved_cmdlines_buffer *savedcmd;
 
@@ -2334,47 +2334,58 @@ static inline void set_cmdline(int idx, const char 
*cmdline)
strncpy(get_saved_cmdlines(idx), cmdline, TASK_COMM_LEN);
 }
 
-static int allocate_cmdlines_buffer(unsigned int val,
-   struct saved_cmdlines_buffer *s)
+static void free_saved_cmdlines_buffer(struct saved_cmdlines_buffer *s)
+{
+   int order = get_order(sizeof(*s) + s->cmdline_num * TASK_COMM_LEN);
+
+   kfree(s->map_cmdline_to_pid);
+   free_pages((unsigned long)s, order);
+}
+
+static struct saved_cmdlines_buffer *allocate_cmdlines_buffer(unsigned int val)
 {
+   struct saved_cmdlines_buffer *s;
+   struct page *page;
+   int orig_size, size;
+   int order;
+
+   /* Figure out how much is needed to hold the given number of cmdlines */
+   orig_size = sizeof(*s) + val * TASK_COMM_LEN;
+   order = get_order(orig_size);
+   size = 1 << (order + PAGE_SHIFT);
+   page = alloc_pages(GFP_KERNEL, order);
+   if (!page)
+   return NULL;
+
+   s = page_address(page);
+   memset(s, 0, sizeof(*s));
+
+   /* Round up to actual allocation */
+   val = (size - sizeof(*s)) / TASK_COMM_LEN;
+   s->cmdline_num = val;
+
s->map_cmdline_to_pid = kmalloc_array(val,
  sizeof(*s->map_cmdline_to_pid),
  GFP_KERNEL);
-   if (!s->map_cmdline_to_pid)
-   return -ENOMEM;
-
-   s->saved_cmdlines = kmalloc_array(TASK_COMM_LEN, val, GFP_KERNEL);
-   if (!s->saved_cmdlines) {
-   kfree(s->map_cmdline_to_pid);
-   return -ENOMEM;
+   if (!s->map_cmdline_to_pid) {
+   free_saved_cmdlines_buffer(s);
+   return NULL;
}
 
s->cmdline_idx = 0;
-   s->cmdline_num = val;
memset(&s->map_pid_to_cmdline, NO_CMDLINE_MAP,
   sizeof(s->map_pid_to_cmdline));
memset(s->map_cmdline_to_pid, NO_CMDLINE_MAP,
   val * sizeof(*s->map_cmdline_to_pid));
 
-   return 0;
+   return s;
 }
 
 static int trace_create_savedcmd(void)
 {
-   int ret;
-
-   savedcmd = kmalloc(sizeof(*savedcmd), GFP_KERNEL);
-   if (!savedcmd)
-   return -ENOMEM;
-
-   

Re: [PATCH net-next v5] virtio_net: Support RX hash XDP hint

2024-02-09 Thread Liang Chen
On Wed, Feb 7, 2024 at 10:27 PM Paolo Abeni  wrote:
>
> On Wed, 2024-02-07 at 10:54 +0800, Liang Chen wrote:
> > On Tue, Feb 6, 2024 at 6:44 PM Paolo Abeni  wrote:
> > >
> > > On Sat, 2024-02-03 at 10:56 +0800, Liang Chen wrote:
> > > > On Sat, Feb 3, 2024 at 12:20 AM Jesper Dangaard Brouer 
> > > >  wrote:
> > > > > On 02/02/2024 13.11, Liang Chen wrote:
> > > [...]
> > > > > > @@ -1033,6 +1039,16 @@ static void put_xdp_frags(struct xdp_buff 
> > > > > > *xdp)
> > > > > >   }
> > > > > >   }
> > > > > >
> > > > > > +static void virtnet_xdp_save_rx_hash(struct virtnet_xdp_buff 
> > > > > > *virtnet_xdp,
> > > > > > +  struct net_device *dev,
> > > > > > +  struct virtio_net_hdr_v1_hash 
> > > > > > *hdr_hash)
> > > > > > +{
> > > > > > + if (dev->features & NETIF_F_RXHASH) {
> > > > > > + virtnet_xdp->hash_value = hdr_hash->hash_value;
> > > > > > + virtnet_xdp->hash_report = hdr_hash->hash_report;
> > > > > > + }
> > > > > > +}
> > > > > > +
> > > > >
> > > > > Would it be possible to store a pointer to hdr_hash in 
> > > > > virtnet_xdp_buff,
> > > > > with the purpose of delaying extracting this, until and only if XDP
> > > > > bpf_prog calls the kfunc?
> > > > >
> > > >
> > > > That seems to be the way v1 works,
> > > > https://lore.kernel.org/all/20240122102256.261374-1-liangchen.li...@gmail.com/
> > > > . But it was pointed out that the inline header may be overwritten by
> > > > the xdp prog, so the hash is copied out to maintain its integrity.
> > >
> > > Why? isn't XDP supposed to get write access only to the pkt
> > > contents/buffer?
> > >
> >
> > Normally, an XDP program accesses only the packet data. However,
> > there's also an XDP RX Metadata area, referenced by the data_meta
> > pointer. This pointer can be adjusted with bpf_xdp_adjust_meta to
> > point somewhere ahead of the data buffer, thereby granting the XDP
> > program access to the virtio header located immediately before the
>
> AFAICS bpf_xdp_adjust_meta() does not allow moving the meta_data before
> xdp->data_hard_start:
>
> https://elixir.bootlin.com/linux/latest/source/net/core/filter.c#L4210
>
> and virtio net set such field after the virtio_net_hdr:
>
> https://elixir.bootlin.com/linux/latest/source/drivers/net/virtio_net.c#L1218
> https://elixir.bootlin.com/linux/latest/source/drivers/net/virtio_net.c#L1420
>
> I don't see how the virtio hdr could be touched? Possibly even more
> important: if such thing is possible, I think is should be somewhat
> denied (for the same reason an H/W nic should prevent XDP from
> modifying its own buffer descriptor).

Thank you for highlighting this concern. The header layout differs
slightly between small and mergeable mode. Taking 'mergeable mode' as
an example, after calling xdp_prepare_buff the layout of xdp_buff
would be as depicted in the diagram below,

      buf
       |
       v
+--------------+---------------+---------+
| xdp headroom | virtio header | packet  |
| (256 bytes)  | (20 bytes)    | content |
+--------------+---------------+---------+
^                              ^
|                              |
data_hard_start                data
                               data_meta

If 'bpf_xdp_adjust_meta' repositions the 'data_meta' pointer a little
towards 'data_hard_start', it would point to the inline header, thus
potentially allowing the XDP program to access the inline header.

We will take a closer look on how to prevent the inline header from
being altered, possibly by borrowing some ideas from other
xdp_metadata_ops implementation.


Thanks,
Liang

>
> Cheers,
>
> Paolo
>



[PATCH v15 5/6] Documentation: tracing: Add ring-buffer mapping

2024-02-09 Thread Vincent Donnefort
It is now possible to mmap() a ring-buffer to stream its content. Add
some documentation and a code example.

Signed-off-by: Vincent Donnefort 

diff --git a/Documentation/trace/index.rst b/Documentation/trace/index.rst
index 5092d6c13af5..0b300901fd75 100644
--- a/Documentation/trace/index.rst
+++ b/Documentation/trace/index.rst
@@ -29,6 +29,7 @@ Linux Tracing Technologies
timerlat-tracer
intel_th
ring-buffer-design
+   ring-buffer-map
stm
sys-t
coresight/index
diff --git a/Documentation/trace/ring-buffer-map.rst 
b/Documentation/trace/ring-buffer-map.rst
new file mode 100644
index ..628254e63830
--- /dev/null
+++ b/Documentation/trace/ring-buffer-map.rst
@@ -0,0 +1,104 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================================
+Tracefs ring-buffer memory mapping
+==================================
+
+:Author: Vincent Donnefort 
+
+Overview
+========
+Tracefs ring-buffer memory map provides an efficient method to stream data
+as no memory copy is necessary. The application mapping the ring-buffer then
+becomes a consumer for that ring-buffer, in a similar fashion to trace_pipe.
+
+Memory mapping setup
+====================
+The mapping works with a mmap() of the trace_pipe_raw interface.
+
+The first system page of the mapping contains ring-buffer statistics and
+description. It is referred to as the meta-page. One of the most important
+fields of
+the meta-page is the reader. It contains the sub-buffer ID which can be safely
+read by the mapper (see ring-buffer-design.rst).
+
+The meta-page is followed by all the sub-buffers, ordered by ascending ID. It is
+therefore effortless to know where the reader starts in the mapping:
+
+.. code-block:: c
+
+    reader_id = meta->reader.id;
+    reader_offset = meta->meta_page_size + reader_id * meta->subbuf_size;
+
+When the application is done with the current reader, it can get a new one using
+the trace_pipe_raw ioctl() TRACE_MMAP_IOCTL_GET_READER. This ioctl also updates
+the meta-page fields.
+
+Limitations
+===========
+When a mapping is in place on a Tracefs ring-buffer, it is not possible to
+resize it (either by increasing the entire size of the ring-buffer or of
+each subbuf). It is also not possible to use snapshot or splice.
+
+Concurrent readers (either another application mapping that ring-buffer or the
+kernel with trace_pipe) are allowed but not recommended. They will compete for
+the ring-buffer and the output is unpredictable.
+
+Example
+=======
+
+.. code-block:: c
+
+    #include <fcntl.h>
+    #include <stdio.h>
+    #include <stdlib.h>
+    #include <unistd.h>
+
+    #include <linux/trace_mmap.h>
+
+    #include <sys/mman.h>
+    #include <sys/ioctl.h>
+
+    #define TRACE_PIPE_RAW "/sys/kernel/tracing/per_cpu/cpu0/trace_pipe_raw"
+
+    int main(void)
+    {
+        int page_size = getpagesize(), fd, reader_id;
+        unsigned long meta_len, data_len;
+        struct trace_buffer_meta *meta;
+        void *map, *reader, *data;
+
+        fd = open(TRACE_PIPE_RAW, O_RDONLY | O_NONBLOCK);
+        if (fd < 0)
+            exit(EXIT_FAILURE);
+
+        map = mmap(NULL, page_size, PROT_READ, MAP_SHARED, fd, 0);
+        if (map == MAP_FAILED)
+            exit(EXIT_FAILURE);
+
+        meta = (struct trace_buffer_meta *)map;
+        meta_len = meta->meta_page_size;
+
+        printf("entries:    %llu\n", meta->entries);
+        printf("overrun:    %llu\n", meta->overrun);
+        printf("read:       %llu\n", meta->read);
+        printf("nr_subbufs: %u\n", meta->nr_subbufs);
+
+        data_len = meta->subbuf_size * meta->nr_subbufs;
+        data = mmap(NULL, data_len, PROT_READ, MAP_SHARED, fd, meta_len);
+        if (data == MAP_FAILED)
+            exit(EXIT_FAILURE);
+
+        if (ioctl(fd, TRACE_MMAP_IOCTL_GET_READER) < 0)
+            exit(EXIT_FAILURE);
+
+        reader_id = meta->reader.id;
+        reader = data + meta->subbuf_size * reader_id;
+
+        printf("Current reader address: %p\n", reader);
+
+        munmap(data, data_len);
+        munmap(meta, meta_len);
+        close(fd);
+
+        return 0;
+    }
-- 
2.43.0.687.g38aa6559b0-goog




[PATCH v15 4/6] tracing: Allow user-space mapping of the ring-buffer

2024-02-09 Thread Vincent Donnefort
Currently, user-space extracts data from the ring-buffer via splice,
which is handy for storage or network sharing. However, due to splice
limitations, it is impossible to do real-time analysis without a copy.

A solution for that problem is to let the user-space map the ring-buffer
directly.

The mapping is exposed via the per-CPU file trace_pipe_raw. The first
element of the mapping is the meta-page. It is followed by each
subbuffer constituting the ring-buffer, ordered by their unique page ID:

  * Meta-page -- include/uapi/linux/trace_mmap.h for a description
  * Subbuf ID 0
  * Subbuf ID 1
 ...

It is therefore easy to translate a subbuf ID into an offset in the
mapping:

  reader_id = meta->reader.id;
  reader_offset = meta->meta_page_size + reader_id * meta->subbuf_size;

When new data is available, the mapper must call a newly introduced ioctl:
TRACE_MMAP_IOCTL_GET_READER. This will update the Meta-page reader ID to
point to the next reader containing unread data.

Mapping will prevent snapshot and buffer size modifications.

Signed-off-by: Vincent Donnefort 

diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h
index 182e05a3004a..7330249257e7 100644
--- a/include/uapi/linux/trace_mmap.h
+++ b/include/uapi/linux/trace_mmap.h
@@ -43,4 +43,6 @@ struct trace_buffer_meta {
__u64   Reserved2;
 };
 
+#define TRACE_MMAP_IOCTL_GET_READER	_IO('T', 0x1)
+
 #endif /* _TRACE_MMAP_H_ */
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 4ebf4d0bd14c..2c0deeedf619 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -1175,6 +1175,12 @@ static void tracing_snapshot_instance_cond(struct 
trace_array *tr,
return;
}
 
+   if (tr->mapped) {
+   trace_array_puts(tr, "*** BUFFER MEMORY MAPPED ***\n");
+   trace_array_puts(tr, "*** Can not use snapshot (sorry) ***\n");
+   return;
+   }
+
local_irq_save(flags);
update_max_tr(tr, current, smp_processor_id(), cond_data);
local_irq_restore(flags);
@@ -1307,7 +1313,7 @@ static int tracing_arm_snapshot_locked(struct trace_array 
*tr)
lockdep_assert_held(&trace_types_lock);
 
spin_lock(&tr->snapshot_trigger_lock);
-   if (tr->snapshot == UINT_MAX) {
+   if (tr->snapshot == UINT_MAX || tr->mapped) {
spin_unlock(&tr->snapshot_trigger_lock);
return -EBUSY;
}
@@ -6533,7 +6539,7 @@ static void tracing_set_nop(struct trace_array *tr)
 {
if (tr->current_trace == _trace)
return;
-   
+
tr->current_trace->enabled--;
 
if (tr->current_trace->reset)
@@ -8652,15 +8658,31 @@ tracing_buffers_splice_read(struct file *file, loff_t 
*ppos,
return ret;
 }
 
-/* An ioctl call with cmd 0 to the ring buffer file will wake up all waiters */
 static long tracing_buffers_ioctl(struct file *file, unsigned int cmd, 
unsigned long arg)
 {
struct ftrace_buffer_info *info = file->private_data;
struct trace_iterator *iter = &info->iter;
+   int err;
+
+   if (cmd == TRACE_MMAP_IOCTL_GET_READER) {
+   if (!(file->f_flags & O_NONBLOCK)) {
+   err = ring_buffer_wait(iter->array_buffer->buffer,
+  iter->cpu_file,
+  iter->tr->buffer_percent);
+   if (err)
+   return err;
+   }
 
-   if (cmd)
-   return -ENOIOCTLCMD;
+   return ring_buffer_map_get_reader(iter->array_buffer->buffer,
+ iter->cpu_file);
+   } else if (cmd) {
+   return -ENOTTY;
+   }
 
+   /*
+* An ioctl call with cmd 0 to the ring buffer file will wake up all
+* waiters
+*/
mutex_lock(&trace_types_lock);
 
iter->wait_index++;
@@ -8673,6 +8695,97 @@ static long tracing_buffers_ioctl(struct file *file, 
unsigned int cmd, unsigned
return 0;
 }
 
+static vm_fault_t tracing_buffers_mmap_fault(struct vm_fault *vmf)
+{
+   struct ftrace_buffer_info *info = vmf->vma->vm_file->private_data;
+   struct trace_iterator *iter = &info->iter;
+   vm_fault_t ret = VM_FAULT_SIGBUS;
+   struct page *page;
+
+   page = ring_buffer_map_fault(iter->array_buffer->buffer, iter->cpu_file,
+vmf->pgoff);
+   if (!page)
+   return ret;
+
+   get_page(page);
+   vmf->page = page;
+   vmf->page->mapping = vmf->vma->vm_file->f_mapping;
+   vmf->page->index = vmf->pgoff;
+
+   return 0;
+}
+
+static void tracing_buffers_mmap_close(struct vm_area_struct *vma)
+{
+   struct ftrace_buffer_info *info = vma->vm_file->private_data;
+   struct trace_iterator *iter = &info->iter;
+   struct trace_array __maybe_unused *tr = iter->tr;
+
+   ring_buffer_unmap(iter->array_buffer->buffer, 

[PATCH v15 3/6] tracing: Add snapshot refcount

2024-02-09 Thread Vincent Donnefort
When a ring-buffer is memory mapped by user-space, no trace or
ring-buffer swap is possible. This means the snapshot feature is
mutually exclusive with the memory mapping. Having a refcount on
snapshot users will help to know if a mapping is possible or not.

Instead of relying on the global trace_types_lock, a new spinlock is
introduced to serialize accesses to trace_array->snapshot. This intends
to allow access to that variable in a context where the mmap lock is
already held.

Signed-off-by: Vincent Donnefort 

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 2a7c6fd934e9..4ebf4d0bd14c 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -1300,6 +1300,50 @@ static void free_snapshot(struct trace_array *tr)
tr->allocated_snapshot = false;
 }
 
+static int tracing_arm_snapshot_locked(struct trace_array *tr)
+{
+   int ret;
+
+   lockdep_assert_held(&trace_types_lock);
+
+   spin_lock(&tr->snapshot_trigger_lock);
+   if (tr->snapshot == UINT_MAX) {
+   spin_unlock(&tr->snapshot_trigger_lock);
+   return -EBUSY;
+   }
+
+   tr->snapshot++;
+   spin_unlock(&tr->snapshot_trigger_lock);
+
+   ret = tracing_alloc_snapshot_instance(tr);
+   if (ret) {
+   spin_lock(&tr->snapshot_trigger_lock);
+   tr->snapshot--;
+   spin_unlock(&tr->snapshot_trigger_lock);
+   }
+
+   return ret;
+}
+
+int tracing_arm_snapshot(struct trace_array *tr)
+{
+   int ret;
+
+   mutex_lock(&trace_types_lock);
+   ret = tracing_arm_snapshot_locked(tr);
+   mutex_unlock(&trace_types_lock);
+
+   return ret;
+}
+
+void tracing_disarm_snapshot(struct trace_array *tr)
+{
+   spin_lock(&tr->snapshot_trigger_lock);
+   if (!WARN_ON(!tr->snapshot))
+   tr->snapshot--;
+   spin_unlock(&tr->snapshot_trigger_lock);
+}
+
 /**
  * tracing_alloc_snapshot - allocate snapshot buffer.
  *
@@ -1373,10 +1417,6 @@ int tracing_snapshot_cond_enable(struct trace_array *tr, 
void *cond_data,
 
mutex_lock(&trace_types_lock);
 
-   ret = tracing_alloc_snapshot_instance(tr);
-   if (ret)
-   goto fail_unlock;
-
if (tr->current_trace->use_max_tr) {
ret = -EBUSY;
goto fail_unlock;
@@ -1395,6 +1435,10 @@ int tracing_snapshot_cond_enable(struct trace_array *tr, 
void *cond_data,
goto fail_unlock;
}
 
+   ret = tracing_arm_snapshot_locked(tr);
+   if (ret)
+   goto fail_unlock;
+
local_irq_disable();
arch_spin_lock(&tr->max_lock);
tr->cond_snapshot = cond_snapshot;
@@ -1439,6 +1483,8 @@ int tracing_snapshot_cond_disable(struct trace_array *tr)
arch_spin_unlock(&tr->max_lock);
local_irq_enable();
 
+   tracing_disarm_snapshot(tr);
+
return ret;
 }
 EXPORT_SYMBOL_GPL(tracing_snapshot_cond_disable);
@@ -1481,6 +1527,7 @@ int tracing_snapshot_cond_disable(struct trace_array *tr)
 }
 EXPORT_SYMBOL_GPL(tracing_snapshot_cond_disable);
 #define free_snapshot(tr)  do { } while (0)
+#define tracing_arm_snapshot_locked(tr) ({ -EBUSY; })
 #endif /* CONFIG_TRACER_SNAPSHOT */
 
 void tracer_tracing_off(struct trace_array *tr)
@@ -6593,11 +6640,12 @@ int tracing_set_tracer(struct trace_array *tr, const 
char *buf)
 */
synchronize_rcu();
free_snapshot(tr);
+   tracing_disarm_snapshot(tr);
}
 
-   if (t->use_max_tr && !tr->allocated_snapshot) {
-   ret = tracing_alloc_snapshot_instance(tr);
-   if (ret < 0)
+   if (t->use_max_tr) {
+   ret = tracing_arm_snapshot_locked(tr);
+   if (ret)
goto out;
}
 #else
@@ -6606,8 +6654,13 @@ int tracing_set_tracer(struct trace_array *tr, const 
char *buf)
 
if (t->init) {
ret = tracer_init(t, tr);
-   if (ret)
+   if (ret) {
+#ifdef CONFIG_TRACER_MAX_TRACE
+   if (t->use_max_tr)
+   tracing_disarm_snapshot(tr);
+#endif
goto out;
+   }
}
 
tr->current_trace = t;
@@ -7709,10 +7762,11 @@ tracing_snapshot_write(struct file *filp, const char 
__user *ubuf, size_t cnt,
if (tr->allocated_snapshot)
ret = resize_buffer_duplicate_size(&tr->max_buffer,
&tr->array_buffer, iter->cpu_file);
-   else
-   ret = tracing_alloc_snapshot_instance(tr);
-   if (ret < 0)
+
+   ret = tracing_arm_snapshot_locked(tr);
+   if (ret)
break;
+
/* Now, we're going to swap */
if (iter->cpu_file == RING_BUFFER_ALL_CPUS) {
local_irq_disable();
@@ -7722,6 +7776,7 @@ tracing_snapshot_write(struct file *filp, const char 
__user *ubuf, size_t cnt,

[PATCH v15 2/6] ring-buffer: Introducing ring-buffer mapping functions

2024-02-09 Thread Vincent Donnefort
In preparation for allowing the user-space to map a ring-buffer, add
a set of mapping functions:

  ring_buffer_{map,unmap}()
  ring_buffer_map_fault()

And controls on the ring-buffer:

  ring_buffer_map_get_reader()  /* swap reader and head */

Mapping the ring-buffer also involves:

  A unique ID for each subbuf of the ring-buffer, currently they are
  only identified through their in-kernel VA.

  A meta-page, where ring-buffer statistics and a description of the
  current reader are stored

The linear mapping exposes the meta-page, and each subbuf of the
ring-buffer, ordered following their unique ID, assigned during the
first mapping.

Once mapped, no subbuf can get in or out of the ring-buffer: the buffer
size will remain unmodified and the splice enabling functions will in
reality simply memcpy the data instead of swapping subbufs.

Signed-off-by: Vincent Donnefort 

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index fa802db216f9..0841ba8bab14 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -6,6 +6,8 @@
 #include 
 #include 
 
+#include <uapi/linux/trace_mmap.h>
+
 struct trace_buffer;
 struct ring_buffer_iter;
 
@@ -221,4 +223,9 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct 
hlist_node *node);
 #define trace_rb_cpu_prepare   NULL
 #endif
 
+int ring_buffer_map(struct trace_buffer *buffer, int cpu);
+int ring_buffer_unmap(struct trace_buffer *buffer, int cpu);
+struct page *ring_buffer_map_fault(struct trace_buffer *buffer, int cpu,
+  unsigned long pgoff);
+int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu);
 #endif /* _LINUX_RING_BUFFER_H */
diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h
new file mode 100644
index ..182e05a3004a
--- /dev/null
+++ b/include/uapi/linux/trace_mmap.h
@@ -0,0 +1,46 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _TRACE_MMAP_H_
+#define _TRACE_MMAP_H_
+
+#include <linux/types.h>
+
+/**
+ * struct trace_buffer_meta - Ring-buffer Meta-page description
+ * @meta_page_size:     Size of this meta-page.
+ * @meta_struct_len:    Size of this structure.
+ * @subbuf_size:        Size of each sub-buffer.
+ * @nr_subbufs:         Number of subbufs in the ring-buffer.
+ * @reader.lost_events: Number of events lost at the time of the reader swap.
+ * @reader.id:          subbuf ID of the current reader. From 0 to @nr_subbufs - 1
+ * @reader.read:        Number of bytes read on the reader subbuf.
+ * @flags:              Placeholder for now, no defined values.
+ * @entries:            Number of entries in the ring-buffer.
+ * @overrun:            Number of entries lost in the ring-buffer.
+ * @read:               Number of entries that have been read.
+ * @Reserved1:          Reserved for future use.
+ * @Reserved2:          Reserved for future use.
+ */
+struct trace_buffer_meta {
+   __u32   meta_page_size;
+   __u32   meta_struct_len;
+
+   __u32   subbuf_size;
+   __u32   nr_subbufs;
+
+   struct {
+   __u64   lost_events;
+   __u32   id;
+   __u32   read;
+   } reader;
+
+   __u64   flags;
+
+   __u64   entries;
+   __u64   overrun;
+   __u64   read;
+
+   __u64   Reserved1;
+   __u64   Reserved2;
+};
+
+#endif /* _TRACE_MMAP_H_ */
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index ca796675c0a1..bd23e78e8ec2 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -9,6 +9,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -338,6 +339,7 @@ struct buffer_page {
local_t  entries;   /* entries on this page */
unsigned longreal_end;  /* real end of data */
unsigned order; /* order of the page */
+   u32  id;/* ID for external mapping */
struct buffer_data_page *page;  /* Actual data page */
 };
 
@@ -484,6 +486,12 @@ struct ring_buffer_per_cpu {
u64 read_stamp;
/* pages removed since last reset */
unsigned long   pages_removed;
+
+   unsigned intmapped;
+   struct mutexmapping_lock;
+   unsigned long   *subbuf_ids;/* ID to addr */
+   struct trace_buffer_meta*meta_page;
+
/* ring buffer pages to update, > 0 to add, < 0 to remove */
longnr_pages_to_update;
struct list_headnew_pages; /* new pages to add */
@@ -1548,6 +1556,7 @@ rb_allocate_cpu_buffer(struct trace_buffer *buffer, long 
nr_pages, int cpu)
init_irq_work(&cpu_buffer->irq_work.work, rb_wake_up_waiters);
init_waitqueue_head(&cpu_buffer->irq_work.waiters);
init_waitqueue_head(&cpu_buffer->irq_work.full_waiters);
+   mutex_init(&cpu_buffer->mapping_lock);
 

[PATCH v15 1/6] ring-buffer: Zero ring-buffer sub-buffers

2024-02-09 Thread Vincent Donnefort
In preparation for the ring-buffer memory mapping where each subbuf will
be accessible to user-space, zero all the page allocations.

Signed-off-by: Vincent Donnefort 
Reviewed-by: Masami Hiramatsu (Google) 

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index fd4bfe3ecf01..ca796675c0a1 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -1472,7 +1472,8 @@ static int __rb_allocate_pages(struct ring_buffer_per_cpu 
*cpu_buffer,
 
list_add(&bpage->list, pages);
 
-   page = alloc_pages_node(cpu_to_node(cpu_buffer->cpu), mflags,
+   page = alloc_pages_node(cpu_to_node(cpu_buffer->cpu),
+   mflags | __GFP_ZERO,
cpu_buffer->buffer->subbuf_order);
if (!page)
goto free_pages;
@@ -1557,7 +1558,8 @@ rb_allocate_cpu_buffer(struct trace_buffer *buffer, long 
nr_pages, int cpu)
 
cpu_buffer->reader_page = bpage;
 
-   page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL, 
cpu_buffer->buffer->subbuf_order);
+   page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL | __GFP_ZERO,
+   cpu_buffer->buffer->subbuf_order);
if (!page)
goto fail_free_reader;
bpage->page = page_address(page);
@@ -5525,7 +5527,8 @@ ring_buffer_alloc_read_page(struct trace_buffer *buffer, 
int cpu)
if (bpage->data)
goto out;
 
-   page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL | __GFP_NORETRY,
+   page = alloc_pages_node(cpu_to_node(cpu),
+   GFP_KERNEL | __GFP_NORETRY | __GFP_ZERO,
cpu_buffer->buffer->subbuf_order);
if (!page) {
kfree(bpage);
-- 
2.43.0.687.g38aa6559b0-goog




[PATCH v15 0/6] Introducing trace buffer mapping by user-space

2024-02-09 Thread Vincent Donnefort
The tracing ring-buffers can be stored on disk or sent to network
without any copy via splice. However the later doesn't allow real time
processing of the traces. A solution is to give userspace direct access
to the ring-buffer pages via a mapping. An application can now become a
consumer of the ring-buffer, in a similar fashion to what trace_pipe
offers.

Support for this new feature can already be found in libtracefs from
version 1.8, when built with EXTRA_CFLAGS=-DFORCE_MMAP_ENABLE.

Vincent

v14 -> v15:
  * Add meta-page and reader-page flush. Intends to fix the mapping
for VIVT and aliasing-VIPT data caches.
  * -EPERM on VM_EXEC.
  * Fix build warning !CONFIG_TRACER_MAX_TRACE.

v13 -> v14:
  * All cpu_buffer->mapped readers use READ_ONCE (except for swap_cpu)
  * on unmap, sync meta-page teardown with the reader_lock instead of
the synchronize_rcu.
  * Add a dedicated spinlock for trace_array ->snapshot and ->mapped.
(intends to fix a lockdep issue)
  * Add kerneldoc for flags and Reserved fields.
  * Add kselftest for snapshot/map mutual exclusion.

v12 -> v13:
  * Swap subbufs_{touched,lost} for Reserved fields.
  * Add a flag field in the meta-page.
  * Fix CONFIG_TRACER_MAX_TRACE.
  * Rebase on top of trace/urgent.
  * Add a comment for try_unregister_trigger()

v11 -> v12:
  * Fix code sample mmap bug.
  * Add logging in sample code.
  * Reset tracer in selftest.
  * Add a refcount for the snapshot users.
  * Prevent mapping when there are snapshot users and vice versa.
  * Refine the meta-page.
  * Fix types in the meta-page.
  * Collect Reviewed-by.

v10 -> v11:
  * Add Documentation and code sample.
  * Add a selftest.
  * Move all the update to the meta-page into a single
rb_update_meta_page().
  * rb_update_meta_page() is now called from
ring_buffer_map_get_reader() to fix NOBLOCK callers.
  * kerneldoc for struct trace_meta_page.
  * Add a patch to zero all the ring-buffer allocations.

v9 -> v10:
  * Refactor rb_update_meta_page()
  * In-loop declaration for foreach_subbuf_page()
  * Check for cpu_buffer->mapped overflow

v8 -> v9:
  * Fix the unlock path in ring_buffer_map()
  * Fix cpu_buffer cast with rb_work_rq->is_cpu_buffer
  * Rebase on linux-trace/for-next (3cb3091138ca0921c4569bcf7ffa062519639b6a)

v7 -> v8:
  * Drop the subbufs renaming into bpages
  * Use subbuf as a name when relevant

v6 -> v7:
  * Rebase onto lore.kernel.org/lkml/20231215175502.106587...@goodmis.org/
  * Support for subbufs
  * Rename subbufs into bpages

v5 -> v6:
  * Rebase on next-20230802.
  * (unsigned long) -> (void *) cast for virt_to_page().
  * Add a wait for the GET_READER_PAGE ioctl.
  * Move writer fields update (overrun/pages_lost/entries/pages_touched)
in the irq_work.
  * Rearrange id in struct buffer_page.
  * Rearrange the meta-page.
  * ring_buffer_meta_page -> trace_buffer_meta_page.
  * Add meta_struct_len into the meta-page.

v4 -> v5:
  * Trivial rebase onto 6.5-rc3 (previously 6.4-rc3)

v3 -> v4:
  * Add to the meta-page:
   - pages_lost / pages_read (allow to compute how full is the
 ring-buffer)
   - read (allow to compute how many entries can be read)
   - A reader_page struct.
  * Rename ring_buffer_meta_header -> ring_buffer_meta
  * Rename ring_buffer_get_reader_page -> ring_buffer_map_get_reader_page
  * Properly consume events on ring_buffer_map_get_reader_page() with
rb_advance_reader().

v2 -> v3:
  * Remove data page list (for non-consuming read)
** Implies removing order > 0 meta-page
  * Add a new meta page field ->read
  * Rename ring_buffer_meta_page_header into ring_buffer_meta_header

v1 -> v2:
  * Hide data_pages from the userspace struct
  * Fix META_PAGE_MAX_PAGES
  * Support for order > 0 meta-page
  * Add missing page->mapping.

Vincent Donnefort (6):
  ring-buffer: Zero ring-buffer sub-buffers
  ring-buffer: Introducing ring-buffer mapping functions
  tracing: Add snapshot refcount
  tracing: Allow user-space mapping of the ring-buffer
  Documentation: tracing: Add ring-buffer mapping
  ring-buffer/selftest: Add ring-buffer mapping test

 Documentation/trace/index.rst |   1 +
 Documentation/trace/ring-buffer-map.rst   | 104 ++
 include/linux/ring_buffer.h   |   7 +
 include/uapi/linux/trace_mmap.h   |  48 +++
 kernel/trace/ring_buffer.c| 349 +-
 kernel/trace/trace.c  | 221 +--
 kernel/trace/trace.h  |   9 +-
 kernel/trace/trace_events_trigger.c   |  58 ++-
 tools/testing/selftests/ring-buffer/Makefile  |   8 +
 tools/testing/selftests/ring-buffer/config|   2 +
 .../testing/selftests/ring-buffer/map_test.c  | 273 ++
 11 files changed, 1034 insertions(+), 46 deletions(-)
 create mode 100644 Documentation/trace/ring-buffer-map.rst
 create mode 100644 include/uapi/linux/trace_mmap.h
 create mode 100644 tools/testing/selftests/ring-buffer/Makefile
 

Re: [PATCH v17 3/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper

2024-02-09 Thread Ankit Agrawal
>> >
>> > IMO, this seems like adding too much code to reduce the call length for a
>> > very specific case. If there aren't any strong opinions on this, I'm
>> > planning to leave this code as it is.
>>
>> A slight difference: if mem_count==0 the result should always succeed,
>> no matter whether nvgrace_gpu_map_device_mem() succeeds or not. Of course,
>> if it fails it's already a big problem, and probably nobody cares about the
>> subtle difference when reading a non-existent range.
>>
>> But regarding readability, it's still clearer:
>>
>> if (mem_count)
>>   nvgrace_gpu_map_and_read();
>>
>
> The below has better flow imo vs conditionalizing the call to
> map_and_read/write and subsequent error handling, but I don't think
> either adds too much code.  Thanks,
>
> Alex
>
> --- a/drivers/vfio/pci/nvgrace-gpu/main.c
> +++ b/drivers/vfio/pci/nvgrace-gpu/main.c
> @@ -429,6 +429,9 @@ nvgrace_gpu_map_and_read(struct nvgrace_gpu_vfio_pci_core_device *nvdev,
>     u64 offset = *ppos & VFIO_PCI_OFFSET_MASK;
>     int ret;
>
> +   if (!mem_count)
> +   return 0;
> +
>     /*
>     * Handle read on the BAR regions. Map to the target device memory
>  * physical address and copy to the request read buffer.
> @@ -547,6 +550,9 @@ nvgrace_gpu_map_and_write(struct nvgrace_gpu_vfio_pci_core_device *nvdev,
>     loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
>     int ret;
>
> +   if (!mem_count)
> +   return 0;
> +
>     ret = nvgrace_gpu_map_device_mem(index, nvdev);
>     if (ret)
>    return ret;

Sure, will update it as mentioned above.


Re: [PATCH v17 3/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper

2024-02-09 Thread Ankit Agrawal
Thanks Kevin for the review. Comments inline.

>>
>> Note that the usemem memory is added by the VM Nvidia device driver [5]
>> to the VM kernel as memblocks. Hence make the usable memory size
>> memblock aligned.
>
> Is memblock size defined in spec or purely a guest implementation choice?

The MEMBLOCK value is a hardwired, constant ABI value between the GPU
FW and the VFIO driver.

>>
>> If the bare metal properties are not present, the driver registers the
>> vfio-pci-core function pointers.
>
> So if qemu doesn't generate such a property, the variant driver running
> inside the guest will always fall back to the core functions, and guest vfio
> userspace will observe both the resmem and usemem BARs. But then there is
> nothing in place to prohibit mapping the resmem BAR as cacheable.
>
> should this driver check the presence of either ACPI property or
> resmem/usemem bars to enable variant function pointers?

Maybe I am missing something here, but if the ACPI property is absent,
the real physical BARs present on the device will be exposed by the
vfio-pci-core functions to the VM. So I think if the variant driver is run
within the VM, it should not see the fake usemem and resmem BARs.

>> +config NVGRACE_GPU_VFIO_PCI
>> + tristate "VFIO support for the GPU in the NVIDIA Grace Hopper Superchip"
>> + depends on ARM64 || (COMPILE_TEST && 64BIT)
>> + select VFIO_PCI_CORE
>> + help
>> +   VFIO support for the GPU in the NVIDIA Grace Hopper Superchip is
>> +   required to assign the GPU device using KVM/qemu/etc.
>
> "assign the GPU device to userspace"

Ack, will make the change.

>> +
>> +/* Memory size expected as non cached and reserved by the VM driver */
>> +#define RESMEM_SIZE 0x4000
>> +#define MEMBLK_SIZE 0x2000
>
> also add a comment for MEMBLK_SIZE

Yes.

>> +
>> +struct nvgrace_gpu_vfio_pci_core_device {
>
> will nvgrace refer to a non-gpu device? if not probably all prefixes with
> 'nvgrace_gpu' can be simplified to 'nvgrace'.

nvgrace_gpu indicates the GPU associated with the Grace CPU here. That is
one of the reasons behind keeping the nomenclature nvgrace_gpu. Calling
it nvgrace may not be apt, as Grace is a CPU.

> btw following other variant drivers 'vfio' can be removed too.

Ack.

>> +
>> +/*
>> + * Both the usable (usemem) and the reserved (resmem) device memory region
>> + * are 64b fake BARs in the VM. These fake BARs must respond
>
> s/VM/device/

I can make it clearer with something like "64b fake device BARs in the VM".

>> + int ret;
>> +
>> + ret = vfio_pci_core_read(core_vdev, buf, count, ppos);
>> + if (ret < 0)
>> + return ret;
>
> here if core_read succeeds *ppos has been updated...

Thanks for pointing that out, will fix the code handling *ppos.

>> + * Read the data from the device memory (mapped either through ioremap
>> + * or memremap) into the user buffer.
>> + */
>> +static int
>> +nvgrace_gpu_map_and_read(struct nvgrace_gpu_vfio_pci_core_device *nvdev,
>> +  char __user *buf, size_t mem_count, loff_t *ppos)
>> +{
>> + unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
>> + u64 offset = *ppos & VFIO_PCI_OFFSET_MASK;
>> + int ret;
>> +
>> + /*
>> +  * Handle read on the BAR regions. Map to the target device memory
>> +  * physical address and copy to the request read buffer.
>> +  */
>
> duplicate with the earlier comment for the function.

Ack.

>> +/*
>> + * Read count bytes from the device memory at an offset. The actual device
>> + * memory size (available) may not be a power-of-2. So the driver fakes
>> + * the size to a power-of-2 (reported) when exposing to a user space driver.
>> + *
>> + * Reads extending beyond the reported size are truncated; reads starting
>> + * beyond the reported size generate -EINVAL; reads extending beyond the
>> + * actual device size are filled with ~0.
>
> slightly clearer to order the description: read starting beyond reported
> size, then read extending beyond device size, and read extending beyond
> reported size.

Yes, will make the change.

>> +  * exposing the rest (termed as usable memory and represented using usemem)
>> +  * as cacheable 64b BAR (region 4 and 5).
>> +  *
>> +  *   devmem (memlength)
>> +  * |-|
>> +  * |   |
>> +  * usemem.phys/memphys resmem.phys
>
> there is no usemem.phys and resmem.phys

Silly miss. Will fix that.

>> +  */
>> + nvdev->usemem.memphys = memphys;
>> +
>> + /*
>> +  * The device memory exposed to the VM is added to the kernel by the
>> +  * VM driver module in chunks of memory block size. Only the usable
>> +  * memory (usemem) is added to the kernel for usage by the VM
>> +  * workloads. Make the usable memory size memblock aligned.
>> +  */
>
> If memblock size is defined by hw spec then say so.
> otherwise this