Re: [Nouveau] [PATCH v2 8/9] PCI: Exclude PCIe ports used for tunneling in pcie_bandwidth_available()

2023-11-06 Thread Mika Westerberg
On Tue, Nov 07, 2023 at 07:45:26AM +0200, Mika Westerberg wrote:
> Hi,
> 
> On Mon, Nov 06, 2023 at 07:56:52PM +0100, Lukas Wunner wrote:
> > On Mon, Nov 06, 2023 at 12:44:25PM -0600, Mario Limonciello wrote:
> > > Tangentially related; the link speed is currently symmetric but there are
> > > two sysfs files.  Mika left a comment in drivers/thunderbolt/switch.c it may
> > > be asymmetric in the future. So we may need to keep that in mind on any
> > > design that builds on top of them.
> > 
> > Aren't asymmetric Thunderbolt speeds just a DisplayPort thing?
> 
> No, they affect the whole fabric. We have the initial code for
> asymmetric switching in v6.7-rc1.
> 
> > > As 'thunderbolt' can be a module or built in, we need to bring code into PCI
> > > core so that it works in early boot before it loads.
> > 
> > tb_switch_get_generation() is small enough that it could be moved to the
> > PCI core.  I doubt that we need to make thunderbolt built-in only
> > or move a large amount of code to the PCI core.
> 
> If at all possible I would like to avoid this and littering PCI side
> with non-PCI stuff. There could be other similar "mediums" in the future
> where you can transfer packets of "native" protocols such as PCIe so
> instead of making it Thunderbolt/USB4 specific it should be generic
> enough to support future extensions.
> 
> In case of Thunderbolt/USB4 there is no real way to figure out how much
> bandwidth each PCIe tunnel gets (it is kind of bulk traffic that gets
> what is left from isochronous protocols) so I would not even try that
> and instead use the real PCIe links in pcie_bandwidth_available() and
> skip all the "virtual" ones.

Actually can we call the new function something like pci_link_is_virtual()
instead and make pcie_bandwidth_available() call it? That would be more
future proof IMHO.
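
For illustration only, a minimal sketch of what such a helper and its use in
pcie_bandwidth_available() could look like, assuming the is_tunneled bit
proposed earlier in this series (this is a hedged sketch, not a posted patch):

/* Hedged sketch: report whether the link behind this device is a "virtual"
 * one (today that means a USB4/Thunderbolt PCIe tunnel; other transport
 * types could be added later without touching the callers). */
static bool pci_link_is_virtual(struct pci_dev *dev)
{
	return dev->is_tunneled;	/* bit proposed earlier in the series */
}

/* pcie_bandwidth_available() would then skip such links while walking up:
 *
 *	while (dev) {
 *		if (!pci_link_is_virtual(dev))
 *			bw = pcie_calc_bw_limits(dev, bw, limiting_dev,
 *						 speed, width);
 *		dev = pci_upstream_bridge(dev);
 *	}
 */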


Re: [Nouveau] [PATCH v2 8/9] PCI: Exclude PCIe ports used for tunneling in pcie_bandwidth_available()

2023-11-06 Thread Mika Westerberg
Hi,

On Mon, Nov 06, 2023 at 07:56:52PM +0100, Lukas Wunner wrote:
> On Mon, Nov 06, 2023 at 12:44:25PM -0600, Mario Limonciello wrote:
> > Tangentially related; the link speed is currently symmetric but there are
> > two sysfs files.  Mika left a comment in drivers/thunderbolt/switch.c it may
> > be asymmetric in the future. So we may need to keep that in mind on any
> > design that builds on top of them.
> 
> Aren't asymmetric Thunderbolt speeds just a DisplayPort thing?

No, they affect the whole fabric. We have the initial code for
asymmetric switching in v6.7-rc1.

> > As 'thunderbolt' can be a module or built in, we need to bring code into PCI
> > core so that it works in early boot before it loads.
> 
> tb_switch_get_generation() is small enough that it could be moved to the
> PCI core.  I doubt that we need to make thunderbolt built-in only
> or move a large amount of code to the PCI core.

If at all possible I would like to avoid this and littering PCI side
with non-PCI stuff. There could be other similar "mediums" in the future
where you can transfer packets of "native" protocols such as PCIe so
instead of making it Thunderbolt/USB4 specific it should be generic
enough to support future extensions.

In case of Thunderbolt/USB4 there is no real way to figure out how much
bandwidth each PCIe tunnel gets (it is kind of bulk traffic that gets
what is left from isochronous protocols) so I would not even try that
and instead use the real PCIe links in pcie_bandwidth_available() and
skip all the "virtual" ones.


[Nouveau] [PATCH] nouveau: use an rwlock for the event lock.

2023-11-06 Thread Dave Airlie
From: Dave Airlie 

This allows it to break the following circular locking dependency.

Aug 10 07:01:29 dg1test kernel: ======================================================
Aug 10 07:01:29 dg1test kernel: WARNING: possible circular locking dependency detected
Aug 10 07:01:29 dg1test kernel: 6.4.0-rc7+ #10 Not tainted
Aug 10 07:01:29 dg1test kernel: ------------------------------------------------------
Aug 10 07:01:29 dg1test kernel: wireplumber/2236 is trying to acquire lock:
Aug 10 07:01:29 dg1test kernel: ffff8fca5320da18 (&fctx->lock){-...}-{2:2}, at: nouveau_fence_wait_uevent_handler+0x2b/0x100 [nouveau]
Aug 10 07:01:29 dg1test kernel:
but task is already holding lock:
Aug 10 07:01:29 dg1test kernel: ffff8fca41208610 (&event->list_lock#2){-...}-{2:2}, at: nvkm_event_ntfy+0x50/0xf0 [nouveau]
Aug 10 07:01:29 dg1test kernel:
which lock already depends on the new lock.
Aug 10 07:01:29 dg1test kernel:
the existing dependency chain (in reverse order) is:
Aug 10 07:01:29 dg1test kernel:
-> #3 (&event->list_lock#2){-...}-{2:2}:
Aug 10 07:01:29 dg1test kernel:_raw_spin_lock_irqsave+0x4b/0x70
Aug 10 07:01:29 dg1test kernel:nvkm_event_ntfy+0x50/0xf0 [nouveau]
Aug 10 07:01:29 dg1test kernel:ga100_fifo_nonstall_intr+0x24/0x30 [nouveau]
Aug 10 07:01:29 dg1test kernel:nvkm_intr+0x12c/0x240 [nouveau]
Aug 10 07:01:29 dg1test kernel:__handle_irq_event_percpu+0x88/0x240
Aug 10 07:01:29 dg1test kernel:handle_irq_event+0x38/0x80
Aug 10 07:01:29 dg1test kernel:handle_edge_irq+0xa3/0x240
Aug 10 07:01:29 dg1test kernel:__common_interrupt+0x72/0x160
Aug 10 07:01:29 dg1test kernel:common_interrupt+0x60/0xe0
Aug 10 07:01:29 dg1test kernel:asm_common_interrupt+0x26/0x40
Aug 10 07:01:29 dg1test kernel:
-> #2 (&device->intr.lock){-...}-{2:2}:
Aug 10 07:01:29 dg1test kernel:_raw_spin_lock_irqsave+0x4b/0x70
Aug 10 07:01:29 dg1test kernel:nvkm_inth_allow+0x2c/0x80 [nouveau]
Aug 10 07:01:29 dg1test kernel:nvkm_event_ntfy_state+0x181/0x250 [nouveau]
Aug 10 07:01:29 dg1test kernel:nvkm_event_ntfy_allow+0x63/0xd0 [nouveau]
Aug 10 07:01:29 dg1test kernel:nvkm_uevent_mthd+0x4d/0x70 [nouveau]
Aug 10 07:01:29 dg1test kernel:nvkm_ioctl+0x10b/0x250 [nouveau]
Aug 10 07:01:29 dg1test kernel:nvif_object_mthd+0xa8/0x1f0 [nouveau]
Aug 10 07:01:29 dg1test kernel:nvif_event_allow+0x2a/0xa0 [nouveau]
Aug 10 07:01:29 dg1test kernel:nouveau_fence_enable_signaling+0x78/0x80 [nouveau]
Aug 10 07:01:29 dg1test kernel:__dma_fence_enable_signaling+0x5e/0x100
Aug 10 07:01:29 dg1test kernel:dma_fence_add_callback+0x4b/0xd0
Aug 10 07:01:29 dg1test kernel:nouveau_cli_work_queue+0xae/0x110 [nouveau]
Aug 10 07:01:29 dg1test kernel:nouveau_gem_object_close+0x1d1/0x2a0 [nouveau]
Aug 10 07:01:29 dg1test kernel:drm_gem_handle_delete+0x70/0xe0 [drm]
Aug 10 07:01:29 dg1test kernel:drm_ioctl_kernel+0xa5/0x150 [drm]
Aug 10 07:01:29 dg1test kernel:drm_ioctl+0x256/0x490 [drm]
Aug 10 07:01:29 dg1test kernel:nouveau_drm_ioctl+0x5a/0xb0 [nouveau]
Aug 10 07:01:29 dg1test kernel:__x64_sys_ioctl+0x91/0xd0
Aug 10 07:01:29 dg1test kernel:do_syscall_64+0x3c/0x90
Aug 10 07:01:29 dg1test kernel:entry_SYSCALL_64_after_hwframe+0x72/0xdc
Aug 10 07:01:29 dg1test kernel:
-> #1 (&event->refs_lock#4){}-{2:2}:
Aug 10 07:01:29 dg1test kernel:_raw_spin_lock_irqsave+0x4b/0x70
Aug 10 07:01:29 dg1test kernel:nvkm_event_ntfy_state+0x37/0x250 [nouveau]
Aug 10 07:01:29 dg1test kernel:nvkm_event_ntfy_allow+0x63/0xd0 [nouveau]
Aug 10 07:01:29 dg1test kernel:nvkm_uevent_mthd+0x4d/0x70 [nouveau]
Aug 10 07:01:29 dg1test kernel:nvkm_ioctl+0x10b/0x250 [nouveau]
Aug 10 07:01:29 dg1test kernel:nvif_object_mthd+0xa8/0x1f0 [nouveau]
Aug 10 07:01:29 dg1test kernel:nvif_event_allow+0x2a/0xa0 [nouveau]
Aug 10 07:01:29 dg1test kernel:nouveau_fence_enable_signaling+0x78/0x80 [nouveau]
Aug 10 07:01:29 dg1test kernel:__dma_fence_enable_signaling+0x5e/0x100
Aug 10 07:01:29 dg1test kernel:dma_fence_add_callback+0x4b/0xd0
Aug 10 07:01:29 dg1test kernel:nouveau_cli_work_queue+0xae/0x110 [nouveau]
Aug 10 07:01:29 dg1test kernel:nouveau_gem_object_close+0x1d1/0x2a0 [nouveau]
Aug 10 07:01:29 dg1test kernel:drm_gem_handle_delete+0x70/0xe0 [drm]
Aug 10 07:01:29 dg1test kernel:drm_ioctl_kernel+0xa5/0x150 [drm]
Aug 10 07:01:29 dg1test kernel:drm_ioctl+0x256/0x490 [drm]
Aug 10 07:01:29 dg1test kernel:nouveau_drm_ioctl+0x5a/0xb0 [nouveau]
Aug 10 07:01:29 dg1test kernel:__x64_sys_ioctl+0x91/0xd0
Aug 10 07:01:29 dg1test kernel:do_syscall_64+0x3c/0x90
Aug 10 07:01:29 dg1test kernel: 
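
The patch body itself is cut off by the archive. Conceptually, the change the
subject describes is a spinlock-to-rwlock conversion for the event lock, so
that the interrupt-time notify path only needs the read side of the lock. A
simplified, hedged sketch of that technique (names are illustrative, not the
exact nvkm code):

#include <linux/spinlock.h>
#include <linux/list.h>

/* Illustrative only: the notify-list lock becomes an rwlock so the
 * interrupt-time traversal takes the read side, while add/remove of
 * notifiers takes the write side. */
struct example_event {
	rwlock_t list_lock;		/* was: spinlock_t */
	struct list_head ntfy;
};

static void example_event_ntfy(struct example_event *event)	/* IRQ path */
{
	unsigned long flags;

	read_lock_irqsave(&event->list_lock, flags);
	/* walk event->ntfy and call the handlers */
	read_unlock_irqrestore(&event->list_lock, flags);
}

static void example_event_ntfy_add(struct example_event *event,
				   struct list_head *entry)
{
	unsigned long flags;

	write_lock_irqsave(&event->list_lock, flags);
	list_add_tail(entry, &event->ntfy);
	write_unlock_irqrestore(&event->list_lock, flags);
}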

Re: [Nouveau] [PATCH v2 8/9] PCI: Exclude PCIe ports used for tunneling in pcie_bandwidth_available()

2023-11-06 Thread Lukas Wunner
On Mon, Nov 06, 2023 at 12:44:25PM -0600, Mario Limonciello wrote:
> Tangentially related; the link speed is currently symmetric but there are
> two sysfs files.  Mika left a comment in drivers/thunderbolt/switch.c it may
> be asymmetric in the future. So we may need to keep that in mind on any
> design that builds on top of them.

Aren't asymmetric Thunderbolt speeds just a DisplayPort thing?


> As 'thunderbolt' can be a module or built in, we need to bring code into PCI
> core so that it works in early boot before it loads.

tb_switch_get_generation() is small enough that it could be moved to the
PCI core.  I doubt that we need to make thunderbolt built-in only
or move a large amount of code to the PCI core.

Thanks,

Lukas


Re: [Nouveau] [PATCH v2 8/9] PCI: Exclude PCIe ports used for tunneling in pcie_bandwidth_available()

2023-11-06 Thread Mario Limonciello

On 11/6/2023 12:10, Lukas Wunner wrote:

On Fri, Nov 03, 2023 at 02:07:57PM -0500, Mario Limonciello wrote:

The USB4 spec specifies that PCIe ports that are used for tunneling
PCIe traffic over USB4 fabric will be hardcoded to advertise 2.5GT/s and
behave as a PCIe Gen1 device. The actual performance of these ports is
controlled by the fabric implementation.

Downstream drivers such as amdgpu which utilize pcie_bandwidth_available()
to program the device will always find the PCIe ports used for
tunneling as a limiting factor potentially leading to incorrect
performance decisions.

To prevent problems in downstream drivers check explicitly for ports
being used for PCIe tunneling and skip them when looking for bandwidth
limitations of the hierarchy. If the only device connected is a root port
used for tunneling then report that device.


I think a better approach would be to define three new bandwidths for
Thunderbolt in enum pci_bus_speed and add appropriate descriptions in
pci_speed_string().  Those three bandwidths would be 10 GBit/s for
Thunderbolt 1, 20 GBit/s for Thunderbolt 2, 40 GBit/s for Thunderbolt 3
and 4.


It's an interesting idea, but there are a few shortcomings I can think of.

1) The USB4 specification doesn't actually require 40 Gb/s support; this 
is only a Thunderbolt 4 requirement.


https://www.tomsguide.com/features/thunderbolt-4-vs-usb4-whats-the-difference

The TBT/USB4 link speed can be discovered, but it's a property of 
the *switch*, not of the PCIe tunneling port.


Tangentially related; the link speed is currently symmetric but there 
are two sysfs files.  Mika left a comment in 
drivers/thunderbolt/switch.c it may be asymmetric in the future. So we 
may need to keep that in mind on any design that builds on top of them.


On an AMD Phoenix system connected to a TBT3 Alpine Ridge based eGPU 
enclosure I can see:


$ cat /sys/bus/thunderbolt/devices/1-0/generation
4
$ cat /sys/bus/thunderbolt/devices/1-2/generation
3
$ cat /sys/bus/thunderbolt/devices/1-2/tx_speed
20.0 Gb/s
$ cat /sys/bus/thunderbolt/devices/1-2/rx_speed
20.0 Gb/s

2) This works until you end up with USB4 v2, which supports 80 Gbit/s.

So this might mean an extra 80 Gb/s enum and porting some variation of 
usb4_switch_version() outside of the thunderbolt driver so that PCI core 
can use it too.




Code to determine the Thunderbolt generation from the PCI ID already exists
in tb_switch_get_generation().


As 'thunderbolt' can be a module or built in, we need to bring code into 
PCI core so that it works in early boot before it loads.


On the presumption that no more "new" TBT3 devices will be released to 
the market I suppose that *a lot* of that code could come over to PCI 
core especially if we can bring some variation of usb4_switch_version() 
over too.


The other option is that we can stop allowing thunderbolt as a module 
and require it to be built-in. If we do this we can export symbols from 
it and use some of them in PCI core too.




This will not only address the amdgpu issue you're trying to solve,
but also emit an accurate speed from __pcie_print_link_status().

The speed you're reporting with your approach is not necessarily
accurate because the next non-tunneled device in the hierarchy might
be connected with a far higher PCIe speed than what the Thunderbolt
fabric allows.
Thanks,

Lukas




Re: [Nouveau] [PATCH v2 6/9] PCI: Rename is_thunderbolt to is_tunneled

2023-11-06 Thread Lukas Wunner
On Mon, Nov 06, 2023 at 10:59:13AM -0600, Mario Limonciello wrote:
> On 11/5/2023 11:39, Lukas Wunner wrote:
> > On Fri, Nov 03, 2023 at 02:07:55PM -0500, Mario Limonciello wrote:
> > > The `is_thunderbolt` bit has been used to indicate that a PCIe device
> > > contained the Intel VSEC which is used by various parts of the kernel
> > > to change behavior. To later allow usage with USB4 controllers as well,
> > > rename this to `is_tunneled`.
> > 
> > This doesn't seem to make sense.  is_thunderbolt indicates that a device
> > is part of a Thunderbolt controller.  See the code comment:
> > 
> > > - unsigned int is_thunderbolt:1;   /* Thunderbolt controller */
> > 
> > A Thunderbolt controller is not necessarily tunneled.  The PCIe switch,
> > NHI and XHCI of the Thunderbolt host controller are not tunneled at all.
> 
> I could really use some clarification which PCIe devices actually contain
> the Intel VSEC.
> 
> Is it in all 3 of those PCIe devices and not just the switch?

Yes, I've just double-checked Light Ridge, Cactus Ridge, Alpine Ridge.

Thanks,

Lukas


Re: [Nouveau] [PATCH v2 8/9] PCI: Exclude PCIe ports used for tunneling in pcie_bandwidth_available()

2023-11-06 Thread Lukas Wunner
On Fri, Nov 03, 2023 at 02:07:57PM -0500, Mario Limonciello wrote:
> The USB4 spec specifies that PCIe ports that are used for tunneling
> PCIe traffic over USB4 fabric will be hardcoded to advertise 2.5GT/s and
> behave as a PCIe Gen1 device. The actual performance of these ports is
> controlled by the fabric implementation.
> 
> Downstream drivers such as amdgpu which utilize pcie_bandwidth_available()
> to program the device will always find the PCIe ports used for
> tunneling as a limiting factor potentially leading to incorrect
> performance decisions.
> 
> To prevent problems in downstream drivers check explicitly for ports
> being used for PCIe tunneling and skip them when looking for bandwidth
> limitations of the hierarchy. If the only device connected is a root port
> used for tunneling then report that device.

I think a better approach would be to define three new bandwidths for
Thunderbolt in enum pci_bus_speed and add appropriate descriptions in
pci_speed_string().  Those three bandwidths would be 10 GBit/s for
Thunderbolt 1, 20 GBit/s for Thunderbolt 2, 40 GBit/s for Thunderbolt 3
and 4.
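
For illustration, a standalone sketch of that idea (names, values and strings
here are invented for the example, not taken from any posted patch; the
80 Gb/s USB4 v2 case is raised elsewhere in the thread):

enum example_tbt_speed {
	EXAMPLE_TBT_GEN1 = 1,	/* Thunderbolt 1:   10 Gb/s */
	EXAMPLE_TBT_GEN2,	/* Thunderbolt 2:   20 Gb/s */
	EXAMPLE_TBT_GEN3,	/* Thunderbolt 3/4: 40 Gb/s */
	EXAMPLE_USB4_V2,	/* USB4 v2:         80 Gb/s */
};

static const char *example_tbt_speed_string(enum example_tbt_speed speed)
{
	switch (speed) {
	case EXAMPLE_TBT_GEN1: return "10.0 Gb/s Thunderbolt 1";
	case EXAMPLE_TBT_GEN2: return "20.0 Gb/s Thunderbolt 2";
	case EXAMPLE_TBT_GEN3: return "40.0 Gb/s Thunderbolt 3/4";
	case EXAMPLE_USB4_V2:  return "80.0 Gb/s USB4 v2";
	default:               return "Unknown";
	}
}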

Code to determine the Thunderbolt generation from the PCI ID already exists
in tb_switch_get_generation().

This will not only address the amdgpu issue you're trying to solve,
but also emit an accurate speed from __pcie_print_link_status().

The speed you're reporting with your approach is not necessarily
accurate because the next non-tunneled device in the hierarchy might
be connected with a far higher PCIe speed than what the Thunderbolt
fabric allows.

Thanks,

Lukas


Re: [Nouveau] [PATCH v2 6/9] PCI: Rename is_thunderbolt to is_tunneled

2023-11-06 Thread Mario Limonciello

On 11/5/2023 11:39, Lukas Wunner wrote:

On Fri, Nov 03, 2023 at 02:07:55PM -0500, Mario Limonciello wrote:

The `is_thunderbolt` bit has been used to indicate that a PCIe device
contained the Intel VSEC which is used by various parts of the kernel
to change behavior. To later allow usage with USB4 controllers as well,
rename this to `is_tunneled`.


This doesn't seem to make sense.  is_thunderbolt indicates that a device
is part of a Thunderbolt controller.  See the code comment:


-   unsigned int is_thunderbolt:1;   /* Thunderbolt controller */


A Thunderbolt controller is not necessarily tunneled.  The PCIe switch,
NHI and XHCI of the Thunderbolt host controller are not tunneled at all.

Thanks,

Lukas


I could really use some clarification which PCIe devices actually 
contain the Intel VSEC.


Is it in all 3 of those PCIe devices and not just the switch?

If so, I think I would rather introduce a separate bit.  So after this
series we would have (a rough sketch follows the list below):


is_tunneled:1
is_thunderbolt:1
no_command_complete:1

* TBT1 devices would set no_command_complete
  - The consumer would be pcie_init()
* All devices with the Intel VSEC would set is_thunderbolt and the two 
consumers would be:

 - apple-gmux.c
 - pci_bridge_d3_possible()
* USB4 devices and PCIe switches with the VSEC would set is_tunneled.
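
A rough, illustrative-only sketch of those three bits as standalone flags
(this is not the actual struct pci_dev diff; the real bits would live there):

/* Illustrative only; field names follow the proposal above. */
struct example_tunnel_flags {
	unsigned int is_tunneled:1;		/* PCIe tunneled over a USB4/TBT link    */
	unsigned int is_thunderbolt:1;		/* device carries the Intel TBT VSEC     */
	unsigned int no_command_complete:1;	/* TBT1 NCCS quirk, read by pcie_init()  */
};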



Re: [Nouveau] [PATCH v2 8/9] PCI: Exclude PCIe ports used for tunneling in pcie_bandwidth_available()

2023-11-06 Thread Mario Limonciello

On 11/6/2023 06:52, Ilpo Järvinen wrote:

On Fri, 3 Nov 2023, Mario Limonciello wrote:


The USB4 spec specifies that PCIe ports that are used for tunneling
PCIe traffic over USB4 fabric will be hardcoded to advertise 2.5GT/s and
behave as a PCIe Gen1 device. The actual performance of these ports is
controlled by the fabric implementation.

Downstream drivers such as amdgpu which utilize pcie_bandwidth_available()
to program the device will always find the PCIe ports used for
tunneling as a limiting factor potentially leading to incorrect
performance decisions.

To prevent problems in downstream drivers check explicitly for ports
being used for PCIe tunneling and skip them when looking for bandwidth
limitations of the hierarchy. If the only device connected is a root port
used for tunneling then report that device.

Downstream drivers could make this change on their own but then they
wouldn't be able to detect other potential speed bottlenecks from the
hierarchy without duplicating pcie_bandwidth_available() logic.

Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2925#note_2145860
Link: https://www.usb.org/document-library/usb4r-specification-v20
   USB4 V2 with Errata and ECN through June 2023
   Section 11.2.1
Signed-off-by: Mario Limonciello 
---
  drivers/pci/pci.c | 74 +++
  1 file changed, 49 insertions(+), 25 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index d9aa5a39f585..15e37164ce56 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -6223,6 +6223,35 @@ int pcie_set_mps(struct pci_dev *dev, int mps)
  }
  EXPORT_SYMBOL(pcie_set_mps);
  
+static u32 pcie_calc_bw_limits(struct pci_dev *dev, u32 bw,

+  struct pci_dev **limiting_dev,
+  enum pci_bus_speed *speed,
+  enum pcie_link_width *width)
+{
+   enum pcie_link_width next_width;
+   enum pci_bus_speed next_speed;
+   u32 next_bw;
+   u16 lnksta;
+
+   pcie_capability_read_word(dev, PCI_EXP_LNKSTA, &lnksta);
+   next_speed = pcie_link_speed[lnksta & PCI_EXP_LNKSTA_CLS];
+   next_width = (lnksta & PCI_EXP_LNKSTA_NLW) >> PCI_EXP_LNKSTA_NLW_SHIFT;
+   next_bw = next_width * PCIE_SPEED2MBS_ENC(next_speed);
+
+   /* Check if current device limits the total bandwidth */
+   if (!bw || next_bw <= bw) {
+   bw = next_bw;
+   if (limiting_dev)
+   *limiting_dev = dev;
+   if (speed)
+   *speed = next_speed;
+   if (width)
+   *width = next_width;
+   }
+
+   return bw;
+}
+
  /**
   * pcie_bandwidth_available - determine minimum link settings of a PCIe
   *  device and its bandwidth limitation
@@ -6236,47 +6265,42 @@ EXPORT_SYMBOL(pcie_set_mps);
   * limiting_dev, speed, and width pointers are supplied) information about
   * that point.  The bandwidth returned is in Mb/s, i.e., megabits/second of
   * raw bandwidth.
+ *
+ * This excludes the bandwidth calculation that has been returned from a
+ * PCIe device used for transmitting tunneled PCIe traffic over a Thunderbolt
+ * or USB4 link that is part of larger hierarchy. The calculation is excluded
+ * because the USB4 specification specifies that the max speed returned from
+ * PCIe configuration registers for the tunneling link is always PCI 1x 2.5 GT/s.
+ * When only tunneled devices are present, the bandwidth returned is the
+ * bandwidth available from the first tunneled device.
   */
   u32 pcie_bandwidth_available(struct pci_dev *dev, struct pci_dev **limiting_dev,
 enum pci_bus_speed *speed,
 enum pcie_link_width *width)
  {
-   u16 lnksta;
-   enum pci_bus_speed next_speed;
-   enum pcie_link_width next_width;
-   u32 bw, next_bw;
+   struct pci_dev *tdev = NULL;
+   u32 bw = 0;
  
  	if (speed)

*speed = PCI_SPEED_UNKNOWN;
if (width)
*width = PCIE_LNK_WIDTH_UNKNOWN;
  
-	bw = 0;

-
while (dev) {
-   pcie_capability_read_word(dev, PCI_EXP_LNKSTA, &lnksta);
-
-   next_speed = pcie_link_speed[lnksta & PCI_EXP_LNKSTA_CLS];
-   next_width = (lnksta & PCI_EXP_LNKSTA_NLW) >>
-   PCI_EXP_LNKSTA_NLW_SHIFT;
-
-   next_bw = next_width * PCIE_SPEED2MBS_ENC(next_speed);
-
-   /* Check if current device limits the total bandwidth */
-   if (!bw || next_bw <= bw) {
-   bw = next_bw;
-
-   if (limiting_dev)
-   *limiting_dev = dev;
-   if (speed)
-   *speed = next_speed;
-   if (width)
-   *width = next_width;
+   if (dev->is_tunneled) {
+   if (!tdev)
+   

Re: [Nouveau] [PATCH v2 5/9] PCI: pciehp: Move check for is_thunderbolt into a quirk

2023-11-06 Thread Mario Limonciello

On 11/6/2023 06:41, Ilpo Järvinen wrote:

On Fri, 3 Nov 2023, Mario Limonciello wrote:


commit 493fb50e958c ("PCI: pciehp: Assume NoCompl+ for Thunderbolt
ports") added a check into pciehp code to explicitly set NoCompl+
for all Intel Thunderbolt controllers, including those that don't
need it.

This overloaded the purpose of the `is_thunderbolt` member of
`struct pci_device` because that means that any controller that
identifies as thunderbolt would set NoCompl+ even if it doesn't
suffer this deficiency. As that commit helpfully specifies all the
controllers with the problem, move them into a PCI quirk.

Signed-off-by: Mario Limonciello 
---
  drivers/pci/hotplug/pciehp_hpc.c |  6 +-
  drivers/pci/quirks.c | 20 
  include/linux/pci.h  |  1 +
  3 files changed, 22 insertions(+), 5 deletions(-)

diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
index fd713abdfb9f..23a92d681d1c 100644
--- a/drivers/pci/hotplug/pciehp_hpc.c
+++ b/drivers/pci/hotplug/pciehp_hpc.c
@@ -991,11 +991,7 @@ struct controller *pcie_init(struct pcie_device *dev)
if (pdev->hotplug_user_indicators)
slot_cap &= ~(PCI_EXP_SLTCAP_AIP | PCI_EXP_SLTCAP_PIP);
  
-	/*

-* We assume no Thunderbolt controllers support Command Complete events,
-* but some controllers falsely claim they do.
-*/
-   if (pdev->is_thunderbolt)
+   if (pdev->no_command_complete)
slot_cap |= PCI_EXP_SLTCAP_NCCS;
  
  	ctrl->slot_cap = slot_cap;

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index eeec1d6f9023..4bbf6e33ca11 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -3807,6 +3807,26 @@ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 
PCI_DEVICE_ID_INTEL_CACTUS_RIDGE_4C
  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_PORT_RIDGE,
quirk_thunderbolt_hotplug_msi);
  
+/*

+ * We assume no Thunderbolt controllers support Command Complete events,
+ * but some controllers falsely claim they do.


IMO, this wording makes little sense with the new code. How about taking
some text from the original commit's changelog:

/*
  * Certain Thunderbolt 1 controllers falsely claim to support Command
  * Completed events.
  */

The code change looks fine.



Ack, will change.
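
For reference, a hedged sketch of what such a quirk might look like (the hunk
with the actual fixup is truncated above, and the device list here is purely
illustrative; the real list would come from commit 493fb50e958c):

/* Illustrative only: a final fixup setting the proposed no_command_complete
 * bit for a TBT1 controller that falsely claims Command Completed support. */
static void quirk_no_command_completed(struct pci_dev *pdev)
{
	pdev->no_command_complete = 1;
}
DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL,
			PCI_DEVICE_ID_INTEL_CACTUS_RIDGE_4C,
			quirk_no_command_completed);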


Re: [Nouveau] [PATCH v2 1/9] drm/nouveau: Switch from pci_is_thunderbolt_attached() to dev_is_removable()

2023-11-06 Thread Mario Limonciello

On 11/6/2023 10:47, Mika Westerberg wrote:

On Mon, Nov 06, 2023 at 02:25:24PM +0200, Ilpo Järvinen wrote:

On Fri, 3 Nov 2023, Mario Limonciello wrote:


pci_is_thunderbolt_attached() only works for Intel TBT devices. Switch to
using dev_is_removable() to be able to detect USB4 devices as well.


Please extend this with more details. I had to lookup the TBT change to
be able to make any guess why you're doing this (and it's still a guess
at best).


Also please write it as Thunderbolt not TBT so that it is clear what we
are talking about. Ditto to all patches.


Ack, thanks I will add more detail to these first few patches and make 
that s/TBT/Thunderbolt/ change when I rebase this on 6.7-rc1.
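
For illustration, the kind of substitution these first patches make (a hedged
sketch, not the literal nouveau diff; the surrounding helper is hypothetical):

#include <linux/device.h>
#include <linux/pci.h>

/* Hypothetical helper: decide whether the GPU may be hot-removed. */
static bool gpu_is_removable(struct pci_dev *pdev)
{
	/* Old check, Intel Thunderbolt (VSEC-based) devices only:
	 *	return pci_is_thunderbolt_attached(pdev);
	 * New check, also covers USB4 hosts, based on the generic
	 * "removable" device property set by the PCI core: */
	return dev_is_removable(&pdev->dev);
}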




Re: [Nouveau] [PATCH v2 1/9] drm/nouveau: Switch from pci_is_thunderbolt_attached() to dev_is_removable()

2023-11-06 Thread Mika Westerberg
On Mon, Nov 06, 2023 at 02:25:24PM +0200, Ilpo Järvinen wrote:
> On Fri, 3 Nov 2023, Mario Limonciello wrote:
> 
> > pci_is_thunderbolt_attached() only works for Intel TBT devices. Switch to
> > using dev_is_removable() to be able to detect USB4 devices as well.
> 
> Please extend this with more details. I had to lookup the TBT change to 
> be able to make any guess why you're doing this (and it's still a guess 
> at best).

Also please write it as Thunderbolt not TBT so that it is clear what we
are talking about. Ditto to all patches.


Re: [Nouveau] [PATCH v2 3/9] PCI: Drop pci_is_thunderbolt_attached()

2023-11-06 Thread Mario Limonciello

On 11/6/2023 06:33, Ilpo Järvinen wrote:

On Fri, 3 Nov 2023, Mario Limonciello wrote:


All callers have switched to dev_is_removable() for detecting
hotpluggable PCIe devices.

Signed-off-by: Mario Limonciello 
---
  include/linux/pci.h | 22 --
  1 file changed, 22 deletions(-)

diff --git a/include/linux/pci.h b/include/linux/pci.h
index b56417276042..530b0a360514 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -2616,28 +2616,6 @@ static inline bool pci_ari_enabled(struct pci_bus *bus)
return bus->self && bus->self->ari_enabled;
  }
  
-/**

- * pci_is_thunderbolt_attached - whether device is on a Thunderbolt daisy chain
- * @pdev: PCI device to check
- *
- * Walk upwards from @pdev and check for each encountered bridge if it's part
- * of a Thunderbolt controller.  Reaching the host bridge means @pdev is not
- * Thunderbolt-attached.  (But rather soldered to the mainboard usually.)
- */
-static inline bool pci_is_thunderbolt_attached(struct pci_dev *pdev)
-{
-   struct pci_dev *parent = pdev;
-
-   if (pdev->is_thunderbolt)
-   return true;
-
-   while ((parent = pci_upstream_bridge(parent)))
-   if (parent->is_thunderbolt)
-   return true;
-
-   return false;
-}
-
  #if defined(CONFIG_PCIEPORTBUS) || defined(CONFIG_EEH)
  void pci_uevent_ers(struct pci_dev *pdev, enum  pci_ers_result err_type);
  #endif



I don't think all callers have been removed. Ah, lkp has caught the same
problem.



As I mentioned in the cover letter this series is done on 6.6 + a patch 
going into 6.7-rc1.  The LKP report will drop off when I rebase the 
series on 6.7-rc1.


As it's not yet in Linus' tree here is that patch so you can see it:

https://gitlab.freedesktop.org/agd5f/linux/-/commit/7b1c6263eaf4fd64ffe1cafdc504a42ee4bfbb33


Re: [Nouveau] [PATCH drm-misc-next v8 09/12] drm/gpuvm: reference count drm_gpuvm structures

2023-11-06 Thread Danilo Krummrich
On Mon, Nov 06, 2023 at 04:10:50PM +0100, Christian König wrote:
> Am 06.11.23 um 15:11 schrieb Danilo Krummrich:
> > On Mon, Nov 06, 2023 at 02:05:13PM +0100, Christian König wrote:
> > > Am 06.11.23 um 13:16 schrieb Danilo Krummrich:
> > > > [SNIP]
> > > > This reference count just prevents that the VM is freed as long as other
> > > > resources are attached to it that carry a VM pointer, such as mappings and
> > > > VM_BOs. The motivation for that are VM_BOs. For mappings it's indeed a bit
> > > > paranoid, but it doesn't hurt either and keeps it consistent.
> > > Ah! Yeah, we have similar semantics in amdgpu as well.
> > > 
> > > But we keep the reference to the root GEM object and not the VM.
> > > 
> > > Ok, that makes much more sense than keeping one reference for each mapping.
> > > 
> > > > > Because of this the mapping should *never* have a reference to the 
> > > > > VM, but
> > > > > rather the VM destroys all mapping when it is destroyed itself.
> > > > > 
> > > > > > Hence, If the VM is still alive at a point where you don't expect 
> > > > > > it to
> > > > > > be, then it's
> > > > > > simply a driver bug.
> > > > > Driver bugs is just what I try to prevent here. When individual 
> > > > > mappings
> > > > > keep the VM structure alive then drivers are responsible to clean 
> > > > > them up,
> > > > > if the VM cleans up after itself then we don't need to worry about it 
> > > > > in the
> > > > > driver.
> > > > Drivers are *always* responsible for that. This has nothing to do with 
> > > > whether
> > > > the VM is reference counted or not. GPUVM can't clean up mappings after 
> > > > itself.
> > > Why not?
> > I feel like we're talking past each other here, at least to some extent.
> > However, I can't yet see where exactly the misunderstanding resides.
> 
> +1
> 
> > > At least in amdgpu we have it exactly like that. E.g. the higher level can
> > > cleanup the BO_VM structure at any time possible, even when there are
> > > mappings.
> > What do you mean with "cleanup the VM_BO structure" exactly?
> > 
> > The VM_BO structure keeps track of all the mappings mapped in the VM_BO's VM
> > being backed by the VM_BO's GEM object. And the GEM objects keeps a list of
> > the corresponding VM_BOs.
> > 
> > Hence, as long as there are mappings that this VM_BO keeps track of, this 
> > VM_BO
> > should stay alive.
> 
> No, exactly the other way around. When the VM_BO structure is destroyed the
> mappings are destroyed with them.

This seems to be the same misunderstanding as with the VM reference count.

It seems to me that you want to say that for amdgpu it seems to be a use-case
to get rid of all mappings backed by a given BO and mapped in a given VM, hence
a VM_BO. You can do that. There's even a helper for that in GPUVM.

But also in this case you first need to get rid of all mappings before you
*free* the VM_BO - GPUVM ensures that.

> 
> Otherwise you would need to destroy each individual mapping separately
> before teardown which is quite inefficient.

Not sure what you mean, but I don't see a difference between walking all VM_BOs
and removing their mappings and walking the VM's tree of mappings and removing
each of them. Comes down to the same effort in the end. But surely can go both
ways if you know all the existing VM_BOs.

> 
> > > The VM then keeps track which areas still need to be invalidated
> > > in the physical representation of the page tables.
> > And the VM does that through its tree of mappings (struct drm_gpuva). Hence, if
> > the VM would just remove those structures on cleanup by itself, you'd lose the
> > ability of cleaning up the page tables. Unless you track this separately, which
> > would make the whole tracking of GPUVM itself kinda pointless.
> 
> But how do you then keep track of areas which are freed and needs to be
> updated so that nobody can access the underlying memory any more?

"areas which are freed", what do refer to? What do yo mean with that?

Do you mean areas of the VA space not containing mappings? Why would I need to
track them explicitly? When the mapping is removed the corresponding page tables
/ page table entries are gone as well, hence no subsequent access to the
underlying memory would be possible.

> 
> > > I would expect that the generalized GPU VM handling would need something
> > > similar. If we leave that to the driver then each driver would have to
> > > implement that stuff on it's own again.
> > Similar to what? What exactly do you think can be generalized here?
> 
> Similar to how amdgpu works.

I don't think it's quite fair to just throw the "look at what amdgpu does"
argument at me. What am I supposed to do? Read and understand *every* detail of
*every* driver?

Did you read through the GPUVM code? That's an honest question and I'm asking it
because I feel like you're picking up some details from commit messages and
start questioning them (and that's perfectly fine and absolutely 

Re: [Nouveau] [PATCH drm-misc-next v8 09/12] drm/gpuvm: reference count drm_gpuvm structures

2023-11-06 Thread Christian König

Am 06.11.23 um 15:11 schrieb Danilo Krummrich:

On Mon, Nov 06, 2023 at 02:05:13PM +0100, Christian König wrote:

Am 06.11.23 um 13:16 schrieb Danilo Krummrich:

[SNIP]
This reference count just prevents that the VM is freed as long as other
resources are attached to it that carry a VM pointer, such as mappings and
VM_BOs. The motivation for that are VM_BOs. For mappings it's indeed a bit
paranoid, but it doesn't hurt either and keeps it consistent.

Ah! Yeah, we have similar semantics in amdgpu as well.

But we keep the reference to the root GEM object and not the VM.

Ok, that makes much more sense than keeping one reference for each mapping.


Because of this the mapping should *never* have a reference to the VM, but
rather the VM destroys all mapping when it is destroyed itself.


Hence, If the VM is still alive at a point where you don't expect it to
be, then it's
simply a driver bug.

Driver bugs is just what I try to prevent here. When individual mappings
keep the VM structure alive then drivers are responsible to clean them up,
if the VM cleans up after itself then we don't need to worry about it in the
driver.

Drivers are *always* responsible for that. This has nothing to do with whether
the VM is reference counted or not. GPUVM can't clean up mappings after itself.

Why not?

I feel like we're talking past each other here, at least to some extent.
However, I can't yet see where exactly the misunderstanding resides.


+1


At least in amdgpu we have it exactly like that. E.g. the higher level can
cleanup the BO_VM structure at any time possible, even when there are
mappings.

What do you mean with "cleanup the VM_BO structure" exactly?

The VM_BO structure keeps track of all the mappings mapped in the VM_BO's VM
being backed by the VM_BO's GEM object. And the GEM objects keeps a list of
the corresponding VM_BOs.

Hence, as long as there are mappings that this VM_BO keeps track of, this VM_BO
should stay alive.


No, exactly the other way around. When the VM_BO structure is destroyed 
the mappings are destroyed with them.


Otherwise you would need to destroy each individual mapping separately 
before teardown which is quite inefficient.



The VM then keeps track which areas still need to be invalidated
in the physical representation of the page tables.

And the VM does that through its tree of mappings (struct drm_gpuva). Hence, if
the VM would just remove those structures on cleanup by itself, you'd lose the
ability of cleaning up the page tables. Unless you track this separately, which
would make the whole tracking of GPUVM itself kinda pointless.


But how do you then keep track of areas which are freed and needs to be 
updated so that nobody can access the underlying memory any more?



I would expect that the generalized GPU VM handling would need something
similar. If we leave that to the driver then each driver would have to
implement that stuff on it's own again.

Similar to what? What exactly do you think can be generalized here?


Similar to how amdgpu works.

From what I can see you are basically re-inventing everything we 
already have in there and asking the same questions we stumbled over 
years ago.



If the driver left mappings, GPUVM would just leak them without reference count.
It doesn't know about the driver's surrounding structures, nor does it know about
attached resources such as PT(E)s.

What are we talking with the word "mapping"? The BO_VM structure? Or each
individual mapping?

An individual mapping represented by struct drm_gpuva.


Yeah than that certainly doesn't work. See below.


E.g. what we need to prevent is that VM structure (or the root GEM object)
is released while VM_BOs are still around. That's what I totally agree on.

But each individual mapping is a different story. Userspace can create so
many of them that we probably could even overrun a 32bit counter quite
easily.

REFCOUNT_MAX is specified as 0x7fff_ffff. I agree there can be a lot of
mappings, but (including the VM_BO references) more than 2.147.483.647 per VM?


IIRC on amdgpu we can create something like 100k mappings per second and 
each takes ~64 bytes.


So you just need 128GiB of memory and approx 20 seconds to let the 
kernel run into a refcount overrun.


The worst I've seen in a real world game was around 19k mappings, but 
that doesn't mean that this here can't be exploited.


What can be done is to keep one reference per VM_BO structure, but I 
think per mapping is rather unrealistic.


Regards,
Christian.




Re: [Nouveau] [PATCH drm-misc-next v8 09/12] drm/gpuvm: reference count drm_gpuvm structures

2023-11-06 Thread Danilo Krummrich
On Mon, Nov 06, 2023 at 02:05:13PM +0100, Christian König wrote:
> Am 06.11.23 um 13:16 schrieb Danilo Krummrich:
> > [SNIP]
> > This reference count just prevents that the VM is freed as long as other
> > resources are attached to it that carry a VM pointer, such as mappings and
> > VM_BOs. The motivation for that are VM_BOs. For mappings it's indeed a bit
> > paranoid, but it doesn't hurt either and keeps it consistent.
> 
> Ah! Yeah, we have similar semantics in amdgpu as well.
> 
> But we keep the reference to the root GEM object and not the VM.
> 
> Ok, that makes much more sense than keeping one reference for each mapping.
> 
> > > Because of this the mapping should *never* have a reference to the VM, but
> > > rather the VM destroys all mapping when it is destroyed itself.
> > > 
> > > > Hence, If the VM is still alive at a point where you don't expect it to
> > > > be, then it's
> > > > simply a driver bug.
> > > Driver bugs is just what I try to prevent here. When individual mappings
> > > keep the VM structure alive then drivers are responsible to clean them up,
> > > if the VM cleans up after itself then we don't need to worry about it in 
> > > the
> > > driver.
> > Drivers are *always* responsible for that. This has nothing to do with 
> > whether
> > the VM is reference counted or not. GPUVM can't clean up mappings after 
> > itself.
> 
> Why not?

I feel like we're talking past each other here, at least to some extent.
However, I can't yet see where exactly the misunderstanding resides.

> 
> At least in amdgpu we have it exactly like that. E.g. the higher level can
> cleanup the BO_VM structure at any time possible, even when there are
> mappings.

What do you mean with "cleanup the VM_BO structure" exactly?

The VM_BO structure keeps track of all the mappings mapped in the VM_BO's VM
being backed by the VM_BO's GEM object. And the GEM objects keeps a list of
the corresponding VM_BOs.

Hence, as long as there are mappings that this VM_BO keeps track of, this VM_BO
should stay alive.
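
For readers following the thread, a very rough sketch of the relationship
being argued about (heavily simplified and illustrative; the real definitions
are struct drm_gpuvm, drm_gpuvm_bo, drm_gpuva and drm_gem_object):

#include <linux/list.h>

struct example_vm {
	struct list_head mappings;	/* stand-in for the VM's tree of drm_gpuva */
};

struct example_bo {
	struct list_head vm_bos;	/* all (VM, BO) pairs this object is part of */
};

struct example_vm_bo {
	struct example_vm *vm;		/* holds a reference on the VM       */
	struct example_bo *obj;		/* the backing GEM object            */
	struct list_head gpuvas;	/* all mappings of obj inside vm     */
	struct list_head entry;		/* links into obj->vm_bos            */
};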

> The VM then keeps track which areas still need to be invalidated
> in the physical representation of the page tables.

And the VM does that through its tree of mappings (struct drm_gpuva). Hence, if
the VM would just remove those structures on cleanup by itself, you'd lose the
ability of cleaning up the page tables. Unless you track this separately, which
would make the whole tracking of GPUVM itself kinda pointless.

> 
> I would expect that the generalized GPU VM handling would need something
> similar. If we leave that to the driver then each driver would have to
> implement that stuff on it's own again.

Similar to what? What exactly do you think can be generalized here?

> 
> > If the driver left mappings, GPUVM would just leak them without reference count.
> > It doesn't know about the driver's surrounding structures, nor does it know about
> > attached resources such as PT(E)s.
> 
> What are we talking with the word "mapping"? The BO_VM structure? Or each
> individual mapping?

An individual mapping represented by struct drm_gpuva.

> 
> E.g. what we need to prevent is that VM structure (or the root GEM object)
> is released while VM_BOs are still around. That's what I totally agree on.
> 
> But each individual mapping is a different story. Userspace can create so
> many of them that we probably could even overrun a 32bit counter quite
> easily.

REFCOUNT_MAX is specified as 0x7fff_ffff. I agree there can be a lot of
mappings, but (including the VM_BO references) more than 2.147.483.647 per VM?

> 
> > > When the mapping is destroyed with the VM drivers can't mess this common
> > > operation up. That's why this is more defensive.
> > > 
> > > What is a possible requirement is that external code needs to keep
> > > references to the VM, but *never* the VM to itself through the mappings. I
> > > would consider that a major bug in the component.
> > Obviously, you just (want to) apply a different semantics to this reference
> > count. It is meant to reflect that the VM structure can be freed, rather than
> > that the VM can be cleaned up. If you want the latter, you can have a driver-specific
> > reference count for that in the exact same way as it was before this patch.
> 
> Yeah, it becomes clear that you try to solve some different problem than I
> have expected.
> 
> Regards,
> Christian.
> 
> > 
> > > Regards,
> > > Christian.
> > > 
> > > > > Reference counting is nice when you don't know who else is referring
> > > > > to your VM, but the cost is that you also don't know when the object
> > > > > will guardedly be destroyed.
> > > > > 
> > > > > I can trivially work around this by saying that the generic GPUVM
> > > > > object has a different lifetime than the amdgpu specific object, but
> > > > > that opens up doors for use after free again.
> > > > If your driver never touches the VM's reference count and exits the VM
> > > > with a clean state
> > > > (no mappings and no VM_BOs left), 

Re: [Nouveau] [PATCH drm-misc-next v8 09/12] drm/gpuvm: reference count drm_gpuvm structures

2023-11-06 Thread Christian König

Am 06.11.23 um 13:16 schrieb Danilo Krummrich:

[SNIP]
This reference count just prevents that the VM is freed as long as other
resources are attached to it that carry a VM pointer, such as mappings and
VM_BOs. The motivation for that are VM_BOs. For mappings it's indeed a bit
paranoid, but it doesn't hurt either and keeps it consistent.


Ah! Yeah, we have similar semantics in amdgpu as well.

But we keep the reference to the root GEM object and not the VM.

Ok, that makes much more sense than keeping one reference for each mapping.


Because of this the mapping should *never* have a reference to the VM, but
rather the VM destroys all mapping when it is destroyed itself.


Hence, If the VM is still alive at a point where you don't expect it to
be, then it's
simply a driver bug.

Driver bugs is just what I try to prevent here. When individual mappings
keep the VM structure alive then drivers are responsible to clean them up,
if the VM cleans up after itself then we don't need to worry about it in the
driver.

Drivers are *always* responsible for that. This has nothing to do with whether
the VM is reference counted or not. GPUVM can't clean up mappings after itself.


Why not?

At least in amdgpu we have it exactly like that. E.g. the higher level 
can cleanup the BO_VM structure at any time possible, even when there 
are mappings. The VM then keeps track which areas still need to be 
invalidated in the physical representation of the page tables.


I would expect that the generalized GPU VM handling would need something 
similar. If we leave that to the driver then each driver would have to 
implement that stuff on it's own again.



If the driver left mappings, GPUVM would just leak them without reference count.
It doesn't know about the driver's surrounding structures, nor does it know about
attached resources such as PT(E)s.


What are we talking with the word "mapping"? The BO_VM structure? Or 
each individual mapping?


E.g. what we need to prevent is that VM structure (or the root GEM 
object) is released while VM_BOs are still around. That's what I totally 
agree on.


But each individual mapping is a different story. Userspace can create 
so many of them that we probably could even overrun a 32bit counter 
quite easily.



When the mapping is destroyed with the VM drivers can't mess this common
operation up. That's why this is more defensive.

What is a possible requirement is that external code needs to keep
references to the VM, but *never* the VM to itself through the mappings. I
would consider that a major bug in the component.

Obviously, you just (want to) apply a different semantics to this reference
count. It is meant to reflect that the VM structure can be freed, rather than
that the VM can be cleaned up. If you want the latter, you can have a driver-specific
reference count for that in the exact same way as it was before this patch.


Yeah, it becomes clear that you try to solve some different problem than 
I have expected.


Regards,
Christian.




Regards,
Christian.


Reference counting is nice when you don't know who else is referring
to your VM, but the cost is that you also don't know when the object
will guardedly be destroyed.

I can trivially work around this by saying that the generic GPUVM
object has a different lifetime than the amdgpu specific object, but
that opens up doors for use after free again.

If your driver never touches the VM's reference count and exits the VM
with a clean state
(no mappings and no VM_BOs left), effectively, this is the same as
having no reference
count.

In the very worst case you could argue that we trade a potential UAF
*and* memory leak
(no reference count) with *only* a memory leak (with reference count),
which to me seems
reasonable.


Regards,
Christian.


Thanks,
Christian.

[1]https://lore.kernel.org/dri-devel/6fa058a4-20d3-44b9-af58-755cfb375...@redhat.com/



Signed-off-by: Danilo Krummrich
---
    drivers/gpu/drm/drm_gpuvm.c    | 44
+++---
    drivers/gpu/drm/nouveau/nouveau_uvmm.c | 20 +---
    include/drm/drm_gpuvm.h    | 31 +-
    3 files changed, 78 insertions(+), 17 deletions(-)

diff --git a/drivers/gpu/drm/drm_gpuvm.c
b/drivers/gpu/drm/drm_gpuvm.c
index 53e2c406fb04..6a88eafc5229 100644
--- a/drivers/gpu/drm/drm_gpuvm.c
+++ b/drivers/gpu/drm/drm_gpuvm.c
@@ -746,6 +746,8 @@ drm_gpuvm_init(struct drm_gpuvm *gpuvm, const char *name,
    gpuvm->rb.tree = RB_ROOT_CACHED;
    INIT_LIST_HEAD(&gpuvm->rb.list);
+    kref_init(&gpuvm->kref);
+
    gpuvm->name = name ? name : "unknown";
    gpuvm->flags = flags;
    gpuvm->ops = ops;
@@ -770,15 +772,8 @@ drm_gpuvm_init(struct drm_gpuvm *gpuvm, const char *name,
    }
    EXPORT_SYMBOL_GPL(drm_gpuvm_init);
-/**
- * drm_gpuvm_destroy() - cleanup a &drm_gpuvm
- * @gpuvm: pointer to the &drm_gpuvm to clean up
- *
- * Note that it is a bug to call this function on a manager that still
- * holds GPU VA mappings.
- */
-void

Re: [Nouveau] [PATCH drm-misc-next v8 09/12] drm/gpuvm: reference count drm_gpuvm structures

2023-11-06 Thread Danilo Krummrich
On Mon, Nov 06, 2023 at 10:14:29AM +0100, Christian König wrote:
> Am 03.11.23 um 16:34 schrieb Danilo Krummrich:
> [SNIP]
> > > 
> > > Especially we most likely don't want the VM to live longer than the
> > > application which originally used it. If you make the GPUVM an
> > > independent object you actually open up driver abuse for the
> > > lifetime of this.
> > 
> > Right, we don't want that. But I don't see how the reference count
> > prevents that.
> 
> It doesn't prevents that, it's just not the most defensive approach.
> 
> > 
> > Independent object is relative. struct drm_gpuvm is still embedded into a driver
> > specific structure. It's working the same way as with struct drm_gem_object.
> > 
> > > 
> > > Additional to that see below for a quite real problem with this.
> > > 
> > > > > Background is that the most common use case I see is that
> > > > > this object is
> > > > > embedded into something else and a reference count is then
> > > > > not really a good
> > > > > idea.
> > > > Do you have a specific use-case in mind where this would interfere?
> > > 
> > > Yes, absolutely. For an example see amdgpu_mes_self_test(), here we
> > > initialize a temporary amdgpu VM for an in kernel unit test which
> > > runs during driver load.
> > > 
> > > When the function returns I need to guarantee that the VM is
> > > destroyed or otherwise I will mess up normal operation.
> > 
> > Nothing prevents that. The reference counting is well defined. If the
> > driver did not
> > take additional references (which is clearly up to the driver taking
> > care of) and all
> > VM_BOs and mappings are cleaned up, the reference count is guaranteed to
> > be 1 at this
> > point.
> > 
> > Also note that if the driver would have not cleaned up all VM_BOs and
> > mappings before
> > shutting down the VM, it would have been a bug anyways and the driver
> > would potentially
> > leak memory and UAF issues.
> 
> Exactly that's what I'm talking about why I think this is an extremely bad
> idea.
> 
> It's a perfectly normal operation to shut down the VM while there are still
> mappings. This is just what happens when you kill an application.

Shut down the VM in terms of removing existing mappings, doing driver-specific
cleanup operations, etc. But not freeing the VM structure yet. That's what you're
going to do after you have cleaned up everything.

So, when your application gets killed, you just call your driver_vm_destroy()
function, cleaning up all mappings etc. and then you call drm_gpuvm_put(). And
if you did a decent job cleaning things up (removing existing mappings etc.)
this drm_gpuvm_put() call will result in the vm_free() callback being called.
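
A hedged sketch of that flow, with a made-up driver-side wrapper (only struct
drm_gpuvm and drm_gpuvm_put() come from this series; everything else here,
including the helper names, is hypothetical):

#include <drm/drm_gpuvm.h>

/* Illustrative only: a hypothetical driver VM wrapper. */
struct driver_vm {
	struct drm_gpuvm base;
	/* driver-specific page-table state, etc. */
};

/* Hypothetical: remove all mappings and VM_BOs; each removal drops the
 * reference that object held on the VM. Details are driver-specific. */
static void driver_vm_unmap_all(struct driver_vm *vm)
{
}

static void driver_vm_destroy(struct driver_vm *vm)
{
	driver_vm_unmap_all(vm);

	/* Drop the driver's own reference. If the cleanup above was
	 * complete, this is the last reference and the vm_free()
	 * callback frees the structure. */
	drm_gpuvm_put(&vm->base);
}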

This reference count just prevents that the VM is freed as long as other
resources are attached to it that carry a VM pointer, such as mappings and
VM_BOs. The motivation for that are VM_BOs. For mappings it's indeed a bit
paranoid, but it doesn't hurt either and keeps it consistent.

> 
> Because of this the mapping should *never* have a reference to the VM, but
> rather the VM destroys all mapping when it is destroyed itself.
> 
> > Hence, If the VM is still alive at a point where you don't expect it to
> > be, then it's
> > simply a driver bug.
> 
> Driver bugs is just what I try to prevent here. When individual mappings
> keep the VM structure alive then drivers are responsible to clean them up,
> if the VM cleans up after itself then we don't need to worry about it in the
> driver.

Drivers are *always* responsible for that. This has nothing to do with whether
the VM is reference counted or not. GPUVM can't clean up mappings after itself.
If the driver left mappings, GPUVM would just leak them without reference count.
It doesn't know about the driver's surrounding structures, nor does it know about
attached resources such as PT(E)s. 

> 
> When the mapping is destroyed with the VM drivers can't mess this common
> operation up. That's why this is more defensive.
> 
> What is a possible requirement is that external code needs to keep
> references to the VM, but *never* the VM to itself through the mappings. I
> would consider that a major bug in the component.

Obviously, you just (want to) apply a different semantics to this reference
count. It is meant to reflect that the VM structure can be freed, rather than
that the VM can be cleaned up. If you want the latter, you can have a driver-specific
reference count for that in the exact same way as it was before this patch.

> 
> Regards,
> Christian.
> 
> > 
> > > 
> > > Reference counting is nice when you don't know who else is referring
> > > to your VM, but the cost is that you also don't know when the object
> > > will guardedly be destroyed.
> > > 
> > > I can trivially work around this by saying that the generic GPUVM
> > > object has a different lifetime than the amdgpu specific object, but
> > > that opens up doors for use after free again.
> > 
> > If your driver never touches the VM's reference count and exits the VM
> > 

Re: [Nouveau] [PATCH drm-misc-next v8 09/12] drm/gpuvm: reference count drm_gpuvm structures

2023-11-06 Thread Christian König

Am 03.11.23 um 16:34 schrieb Danilo Krummrich:
[SNIP]


Especially we most likely don't want the VM to live longer than the 
application which originally used it. If you make the GPUVM an 
independent object you actually open up driver abuse for the lifetime 
of this.


Right, we don't want that. But I don't see how the reference count 
prevents that.


It doesn't prevents that, it's just not the most defensive approach.



Independent object is relative. struct drm_gpuvm is still embedded into a driver
specific structure. It's working the same way as with struct drm_gem_object.




Additional to that see below for a quite real problem with this.

Background is that the most common use case I see is that this 
object is
embedded into something else and a reference count is then not 
really a good

idea.

Do you have a specific use-case in mind where this would interfere?


Yes, absolutely. For an example see amdgpu_mes_self_test(), here we 
initialize a temporary amdgpu VM for an in kernel unit test which 
runs during driver load.


When the function returns I need to guarantee that the VM is 
destroyed or otherwise I will mess up normal operation.


Nothing prevents that. The reference counting is well defined. If the 
driver did not
take additional references (which is clearly up to the driver taking 
care of) and all
VM_BOs and mappings are cleaned up, the reference count is guaranteed 
to be 1 at this

point.

Also note that if the driver would have not cleaned up all VM_BOs and 
mappings before
shutting down the VM, it would have been a bug anyways and the driver 
would potentially

leak memory and UAF issues.


Exactly that's what I'm talking about why I think this is an extremely 
bad idea.


It's a perfectly normal operation to shut down the VM while there are still 
mappings. This is just what happens when you kill an application.


Because of this the mapping should *never* have a reference to the VM, 
but rather the VM destroys all mapping when it is destroyed itself.


Hence, If the VM is still alive at a point where you don't expect it 
to be, then it's

simply a driver bug.


Driver bugs is just what I try to prevent here. When individual mappings 
keep the VM structure alive then drivers are responsible to clean them 
up, if the VM cleans up after itself then we don't need to worry about 
it in the driver.


When the mapping is destroyed with the VM drivers can't mess this common 
operation up. That's why this is more defensive.


What is a possible requirement is that external code needs to keep 
references to the VM, but *never* the VM to itself through the mappings. 
I would consider that a major bug in the component.


Regards,
Christian.





Reference counting is nice when you don't know who else is referring 
to your VM, but the cost is that you also don't know when the object 
will guardedly be destroyed.


I can trivially work around this by saying that the generic GPUVM 
object has a different lifetime than the amdgpu specific object, but 
that opens up doors for use after free again.


If your driver never touches the VM's reference count and exits the VM 
with a clean state
(no mappings and no VM_BOs left), effectively, this is the same as 
having no reference

count.

In the very worst case you could argue that we trade a potential UAF 
*and* memory leak 
(no reference count) with *only* a memory leak (with reference count), 
which to me seems

reasonable.



Regards,
Christian.


Thanks,
Christian.
[1]https://lore.kernel.org/dri-devel/6fa058a4-20d3-44b9-af58-755cfb375...@redhat.com/ 




Signed-off-by: Danilo Krummrich
---
   drivers/gpu/drm/drm_gpuvm.c    | 44 
+++---

   drivers/gpu/drm/nouveau/nouveau_uvmm.c | 20 +---
   include/drm/drm_gpuvm.h    | 31 +-
   3 files changed, 78 insertions(+), 17 deletions(-)

diff --git a/drivers/gpu/drm/drm_gpuvm.c 
b/drivers/gpu/drm/drm_gpuvm.c

index 53e2c406fb04..6a88eafc5229 100644
--- a/drivers/gpu/drm/drm_gpuvm.c
+++ b/drivers/gpu/drm/drm_gpuvm.c
@@ -746,6 +746,8 @@ drm_gpuvm_init(struct drm_gpuvm *gpuvm, const char *name,

   gpuvm->rb.tree = RB_ROOT_CACHED;
   INIT_LIST_HEAD(&gpuvm->rb.list);
+    kref_init(&gpuvm->kref);
+
   gpuvm->name = name ? name : "unknown";
   gpuvm->flags = flags;
   gpuvm->ops = ops;
@@ -770,15 +772,8 @@ drm_gpuvm_init(struct drm_gpuvm *gpuvm, const char *name,

   }
   EXPORT_SYMBOL_GPL(drm_gpuvm_init);
-/**
- * drm_gpuvm_destroy() - cleanup a &drm_gpuvm
- * @gpuvm: pointer to the &drm_gpuvm to clean up
- *
- * Note that it is a bug to call this function on a manager that still
- * holds GPU VA mappings.
- */
-void
-drm_gpuvm_destroy(struct drm_gpuvm *gpuvm)
+static void
+drm_gpuvm_fini(struct drm_gpuvm *gpuvm)
   {
   gpuvm->name = NULL;
@@ -790,7 +785,33 @@ drm_gpuvm_destroy(struct drm_gpuvm *gpuvm)
   drm_gem_object_put(gpuvm->r_obj);
   }
-EXPORT_SYMBOL_GPL(drm_gpuvm_destroy);
+
+static void
+drm_gpuvm_free(struct kref *kref)
+{