Re: [Nouveau] [PATCH v2 8/9] PCI: Exclude PCIe ports used for tunneling in pcie_bandwidth_available()
On Tue, Nov 07, 2023 at 07:45:26AM +0200, Mika Westerberg wrote:
> Hi,
>
> On Mon, Nov 06, 2023 at 07:56:52PM +0100, Lukas Wunner wrote:
> > On Mon, Nov 06, 2023 at 12:44:25PM -0600, Mario Limonciello wrote:
> > > Tangentially related; the link speed is currently symmetric but there are
> > > two sysfs files. Mika left a comment in drivers/thunderbolt/switch.c it may
> > > be asymmetric in the future. So we may need to keep that in mind on any
> > > design that builds on top of them.
> >
> > Aren't asymmetric Thunderbolt speeds just a DisplayPort thing?
>
> No, they affect the whole fabric. We have the initial code for
> asymmetric switching in v6.7-rc1.
>
> > > As 'thunderbolt' can be a module or built in, we need to bring code into PCI
> > > core so that it works in early boot before it loads.
> >
> > tb_switch_get_generation() is small enough that it could be moved to the
> > PCI core. I doubt that we need to make thunderbolt built-in only
> > or move a large amount of code to the PCI core.
>
> If at all possible I would like to avoid this and littering PCI side
> with non-PCI stuff. There could be other similar "mediums" in the future
> where you can transfer packets of "native" protocols such as PCIe so
> instead of making it Thunderbolt/USB4 specific it should be generic
> enough to support future extensions.
>
> In case of Thunderbolt/USB4 there is no real way to figure out how much
> bandwidth each PCIe tunnel gets (it is kind of bulk traffic that gets
> what is left from isochronous protocols) so I would not even try that
> and instead use the real PCIe links in pcie_bandwidth_available() and
> skip all the "virtual" ones.

Actually can we call the new function something like pci_link_is_virtual()
instead and make pcie_bandwidth_available() call it? That would be more
future proof IMHO.
Re: [Nouveau] [PATCH v2 8/9] PCI: Exclude PCIe ports used for tunneling in pcie_bandwidth_available()
Hi,

On Mon, Nov 06, 2023 at 07:56:52PM +0100, Lukas Wunner wrote:
> On Mon, Nov 06, 2023 at 12:44:25PM -0600, Mario Limonciello wrote:
> > Tangentially related; the link speed is currently symmetric but there are
> > two sysfs files. Mika left a comment in drivers/thunderbolt/switch.c it may
> > be asymmetric in the future. So we may need to keep that in mind on any
> > design that builds on top of them.
>
> Aren't asymmetric Thunderbolt speeds just a DisplayPort thing?

No, they affect the whole fabric. We have the initial code for
asymmetric switching in v6.7-rc1.

> > As 'thunderbolt' can be a module or built in, we need to bring code into PCI
> > core so that it works in early boot before it loads.
>
> tb_switch_get_generation() is small enough that it could be moved to the
> PCI core. I doubt that we need to make thunderbolt built-in only
> or move a large amount of code to the PCI core.

If at all possible I would like to avoid this and littering PCI side
with non-PCI stuff. There could be other similar "mediums" in the future
where you can transfer packets of "native" protocols such as PCIe so
instead of making it Thunderbolt/USB4 specific it should be generic
enough to support future extensions.

In case of Thunderbolt/USB4 there is no real way to figure out how much
bandwidth each PCIe tunnel gets (it is kind of bulk traffic that gets
what is left from isochronous protocols) so I would not even try that
and instead use the real PCIe links in pcie_bandwidth_available() and
skip all the "virtual" ones.
[Nouveau] [PATCH] nouveau: use an rwlock for the event lock.
From: Dave Airlie

This allows it to break the following circular locking dependency.

Aug 10 07:01:29 dg1test kernel: ======================================================
Aug 10 07:01:29 dg1test kernel: WARNING: possible circular locking dependency detected
Aug 10 07:01:29 dg1test kernel: 6.4.0-rc7+ #10 Not tainted
Aug 10 07:01:29 dg1test kernel: ------------------------------------------------------
Aug 10 07:01:29 dg1test kernel: wireplumber/2236 is trying to acquire lock:
Aug 10 07:01:29 dg1test kernel: 8fca5320da18 (&fctx->lock){-...}-{2:2}, at: nouveau_fence_wait_uevent_handler+0x2b/0x100 [nouveau]
Aug 10 07:01:29 dg1test kernel:
Aug 10 07:01:29 dg1test kernel: but task is already holding lock:
Aug 10 07:01:29 dg1test kernel: 8fca41208610 (&event->list_lock#2){-...}-{2:2}, at: nvkm_event_ntfy+0x50/0xf0 [nouveau]
Aug 10 07:01:29 dg1test kernel:
Aug 10 07:01:29 dg1test kernel: which lock already depends on the new lock.
Aug 10 07:01:29 dg1test kernel:
Aug 10 07:01:29 dg1test kernel: the existing dependency chain (in reverse order) is:
Aug 10 07:01:29 dg1test kernel:
Aug 10 07:01:29 dg1test kernel: -> #3 (&event->list_lock#2){-...}-{2:2}:
Aug 10 07:01:29 dg1test kernel:        _raw_spin_lock_irqsave+0x4b/0x70
Aug 10 07:01:29 dg1test kernel:        nvkm_event_ntfy+0x50/0xf0 [nouveau]
Aug 10 07:01:29 dg1test kernel:        ga100_fifo_nonstall_intr+0x24/0x30 [nouveau]
Aug 10 07:01:29 dg1test kernel:        nvkm_intr+0x12c/0x240 [nouveau]
Aug 10 07:01:29 dg1test kernel:        __handle_irq_event_percpu+0x88/0x240
Aug 10 07:01:29 dg1test kernel:        handle_irq_event+0x38/0x80
Aug 10 07:01:29 dg1test kernel:        handle_edge_irq+0xa3/0x240
Aug 10 07:01:29 dg1test kernel:        __common_interrupt+0x72/0x160
Aug 10 07:01:29 dg1test kernel:        common_interrupt+0x60/0xe0
Aug 10 07:01:29 dg1test kernel:        asm_common_interrupt+0x26/0x40
Aug 10 07:01:29 dg1test kernel:
Aug 10 07:01:29 dg1test kernel: -> #2 (&device->intr.lock){-...}-{2:2}:
Aug 10 07:01:29 dg1test kernel:        _raw_spin_lock_irqsave+0x4b/0x70
Aug 10 07:01:29 dg1test kernel:        nvkm_inth_allow+0x2c/0x80 [nouveau]
Aug 10 07:01:29 dg1test kernel:        nvkm_event_ntfy_state+0x181/0x250 [nouveau]
Aug 10 07:01:29 dg1test kernel:        nvkm_event_ntfy_allow+0x63/0xd0 [nouveau]
Aug 10 07:01:29 dg1test kernel:        nvkm_uevent_mthd+0x4d/0x70 [nouveau]
Aug 10 07:01:29 dg1test kernel:        nvkm_ioctl+0x10b/0x250 [nouveau]
Aug 10 07:01:29 dg1test kernel:        nvif_object_mthd+0xa8/0x1f0 [nouveau]
Aug 10 07:01:29 dg1test kernel:        nvif_event_allow+0x2a/0xa0 [nouveau]
Aug 10 07:01:29 dg1test kernel:        nouveau_fence_enable_signaling+0x78/0x80 [nouveau]
Aug 10 07:01:29 dg1test kernel:        __dma_fence_enable_signaling+0x5e/0x100
Aug 10 07:01:29 dg1test kernel:        dma_fence_add_callback+0x4b/0xd0
Aug 10 07:01:29 dg1test kernel:        nouveau_cli_work_queue+0xae/0x110 [nouveau]
Aug 10 07:01:29 dg1test kernel:        nouveau_gem_object_close+0x1d1/0x2a0 [nouveau]
Aug 10 07:01:29 dg1test kernel:        drm_gem_handle_delete+0x70/0xe0 [drm]
Aug 10 07:01:29 dg1test kernel:        drm_ioctl_kernel+0xa5/0x150 [drm]
Aug 10 07:01:29 dg1test kernel:        drm_ioctl+0x256/0x490 [drm]
Aug 10 07:01:29 dg1test kernel:        nouveau_drm_ioctl+0x5a/0xb0 [nouveau]
Aug 10 07:01:29 dg1test kernel:        __x64_sys_ioctl+0x91/0xd0
Aug 10 07:01:29 dg1test kernel:        do_syscall_64+0x3c/0x90
Aug 10 07:01:29 dg1test kernel:        entry_SYSCALL_64_after_hwframe+0x72/0xdc
Aug 10 07:01:29 dg1test kernel:
Aug 10 07:01:29 dg1test kernel: -> #1 (&event->refs_lock#4){....}-{2:2}:
Aug 10 07:01:29 dg1test kernel:        _raw_spin_lock_irqsave+0x4b/0x70
Aug 10 07:01:29 dg1test kernel:        nvkm_event_ntfy_state+0x37/0x250 [nouveau]
Aug 10 07:01:29 dg1test kernel:        nvkm_event_ntfy_allow+0x63/0xd0 [nouveau]
Aug 10 07:01:29 dg1test kernel:        nvkm_uevent_mthd+0x4d/0x70 [nouveau]
Aug 10 07:01:29 dg1test kernel:        nvkm_ioctl+0x10b/0x250 [nouveau]
Aug 10 07:01:29 dg1test kernel:        nvif_object_mthd+0xa8/0x1f0 [nouveau]
Aug 10 07:01:29 dg1test kernel:        nvif_event_allow+0x2a/0xa0 [nouveau]
Aug 10 07:01:29 dg1test kernel:        nouveau_fence_enable_signaling+0x78/0x80 [nouveau]
Aug 10 07:01:29 dg1test kernel:        __dma_fence_enable_signaling+0x5e/0x100
Aug 10 07:01:29 dg1test kernel:        dma_fence_add_callback+0x4b/0xd0
Aug 10 07:01:29 dg1test kernel:        nouveau_cli_work_queue+0xae/0x110 [nouveau]
Aug 10 07:01:29 dg1test kernel:        nouveau_gem_object_close+0x1d1/0x2a0 [nouveau]
Aug 10 07:01:29 dg1test kernel:        drm_gem_handle_delete+0x70/0xe0 [drm]
Aug 10 07:01:29 dg1test kernel:        drm_ioctl_kernel+0xa5/0x150 [drm]
Aug 10 07:01:29 dg1test kernel:        drm_ioctl+0x256/0x490 [drm]
Aug 10 07:01:29 dg1test kernel:        nouveau_drm_ioctl+0x5a/0xb0 [nouveau]
Aug 10 07:01:29 dg1test kernel:        __x64_sys_ioctl+0x91/0xd0
Aug 10 07:01:29 dg1test kernel:        do_syscall_64+0x3c/0x90
Aug 10 07:01:29 dg1test kernel:
Re: [Nouveau] [PATCH v2 8/9] PCI: Exclude PCIe ports used for tunneling in pcie_bandwidth_available()
On Mon, Nov 06, 2023 at 12:44:25PM -0600, Mario Limonciello wrote:
> Tangentially related; the link speed is currently symmetric but there are
> two sysfs files. Mika left a comment in drivers/thunderbolt/switch.c it may
> be asymmetric in the future. So we may need to keep that in mind on any
> design that builds on top of them.

Aren't asymmetric Thunderbolt speeds just a DisplayPort thing?

> As 'thunderbolt' can be a module or built in, we need to bring code into PCI
> core so that it works in early boot before it loads.

tb_switch_get_generation() is small enough that it could be moved to the
PCI core. I doubt that we need to make thunderbolt built-in only
or move a large amount of code to the PCI core.

Thanks,

Lukas
Re: [Nouveau] [PATCH v2 8/9] PCI: Exclude PCIe ports used for tunneling in pcie_bandwidth_available()
On 11/6/2023 12:10, Lukas Wunner wrote:
> On Fri, Nov 03, 2023 at 02:07:57PM -0500, Mario Limonciello wrote:
> > The USB4 spec specifies that PCIe ports that are used for tunneling
> > PCIe traffic over USB4 fabric will be hardcoded to advertise 2.5GT/s and
> > behave as a PCIe Gen1 device. The actual performance of these ports is
> > controlled by the fabric implementation.
> >
> > Downstream drivers such as amdgpu which utilize pcie_bandwidth_available()
> > to program the device will always find the PCIe ports used for
> > tunneling as a limiting factor potentially leading to incorrect
> > performance decisions.
> >
> > To prevent problems in downstream drivers check explicitly for ports
> > being used for PCIe tunneling and skip them when looking for bandwidth
> > limitations of the hierarchy. If the only device connected is a root port
> > used for tunneling then report that device.
>
> I think a better approach would be to define three new bandwidths for
> Thunderbolt in enum pci_bus_speed and add appropriate descriptions in
> pci_speed_string(). Those three bandwidths would be 10 GBit/s for
> Thunderbolt 1, 20 GBit/s for Thunderbolt 2, 40 GBit/s for Thunderbolt 3
> and 4.

It's an interesting idea, but there are a few shortcomings I can think of.

1) The USB4 specification doesn't actually require 40GB/s support; this is
only a Thunderbolt 4 requirement.

https://www.tomsguide.com/features/thunderbolt-4-vs-usb4-whats-the-difference

The TBT/USB4 link speed can be discovered, but it's a property of the
*switch*, not of the PCIe tunneling port.

Tangentially related; the link speed is currently symmetric but there are
two sysfs files. Mika left a comment in drivers/thunderbolt/switch.c it may
be asymmetric in the future. So we may need to keep that in mind on any
design that builds on top of them.

On an AMD Phoenix system connected to a TBT3 Alpine Ridge based eGPU
enclosure I can see:

$ cat /sys/bus/thunderbolt/devices/1-0/generation
4
$ cat /sys/bus/thunderbolt/devices/1-2/generation
3
$ cat /sys/bus/thunderbolt/devices/1-2/tx_speed
20.0 Gb/s
$ cat /sys/bus/thunderbolt/devices/1-2/rx_speed
20.0 Gb/s

2) This works until you end up with USB4v2 which supports 80GBit/s. So this
might mean an extra 80GB/s enum and porting some variation of
usb4_switch_version() outside of the thunderbolt driver so that PCI core
can use it too.

> Code to determine the Thunderbolt generation from the PCI ID already
> exists in tb_switch_get_generation().

As 'thunderbolt' can be a module or built in, we need to bring code into
PCI core so that it works in early boot before it loads.

On the presumption that no more "new" TBT3 devices will be released to the
market I suppose that *a lot* of that code could come over to PCI core,
especially if we can bring some variation of usb4_switch_version() over
too.

The other option is that we can stop allowing thunderbolt as a module and
require it to be built-in. If we do this we can export symbols from it and
use some of them in PCI core too.

> This will not only address the amdgpu issue you're trying to solve, but
> also emit an accurate speed from __pcie_print_link_status(). The speed
> you're reporting with your approach is not necessarily accurate because
> the next non-tunneled device in the hierarchy might be connected with a
> far higher PCIe speed than what the Thunderbolt fabric allows.
>
> Thanks,
>
> Lukas
Re: [Nouveau] [PATCH v2 6/9] PCI: Rename is_thunderbolt to is_tunneled
On Mon, Nov 06, 2023 at 10:59:13AM -0600, Mario Limonciello wrote:
> On 11/5/2023 11:39, Lukas Wunner wrote:
> > On Fri, Nov 03, 2023 at 02:07:55PM -0500, Mario Limonciello wrote:
> > > The `is_thunderbolt` bit has been used to indicate that a PCIe device
> > > contained the Intel VSEC which is used by various parts of the kernel
> > > to change behavior. To later allow usage with USB4 controllers as well,
> > > rename this to `is_tunneled`.
> >
> > This doesn't seem to make sense. is_thunderbolt indicates that a device
> > is part of a Thunderbolt controller. See the code comment:
> >
> > > -	unsigned int		is_thunderbolt:1;	/* Thunderbolt controller */
> >
> > A Thunderbolt controller is not necessarily tunneled. The PCIe switch,
> > NHI and XHCI of the Thunderbolt host controller are not tunneled at all.
>
> I could really use some clarification which PCIe devices actually contain
> the Intel VSEC.
>
> Is it in all 3 of those PCIe devices and not just the switch?

Yes, I've just double-checked Light Ridge, Cactus Ridge, Alpine Ridge.

Thanks,

Lukas
Re: [Nouveau] [PATCH v2 8/9] PCI: Exclude PCIe ports used for tunneling in pcie_bandwidth_available()
On Fri, Nov 03, 2023 at 02:07:57PM -0500, Mario Limonciello wrote:
> The USB4 spec specifies that PCIe ports that are used for tunneling
> PCIe traffic over USB4 fabric will be hardcoded to advertise 2.5GT/s and
> behave as a PCIe Gen1 device. The actual performance of these ports is
> controlled by the fabric implementation.
>
> Downstream drivers such as amdgpu which utilize pcie_bandwidth_available()
> to program the device will always find the PCIe ports used for
> tunneling as a limiting factor potentially leading to incorrect
> performance decisions.
>
> To prevent problems in downstream drivers check explicitly for ports
> being used for PCIe tunneling and skip them when looking for bandwidth
> limitations of the hierarchy. If the only device connected is a root port
> used for tunneling then report that device.

I think a better approach would be to define three new bandwidths for
Thunderbolt in enum pci_bus_speed and add appropriate descriptions in
pci_speed_string(). Those three bandwidths would be 10 GBit/s for
Thunderbolt 1, 20 GBit/s for Thunderbolt 2, 40 GBit/s for Thunderbolt 3
and 4.

Code to determine the Thunderbolt generation from the PCI ID already
exists in tb_switch_get_generation().

This will not only address the amdgpu issue you're trying to solve, but
also emit an accurate speed from __pcie_print_link_status(). The speed
you're reporting with your approach is not necessarily accurate because
the next non-tunneled device in the hierarchy might be connected with a
far higher PCIe speed than what the Thunderbolt fabric allows.

Thanks,

Lukas
Re: [Nouveau] [PATCH v2 6/9] PCI: Rename is_thunderbolt to is_tunneled
On 11/5/2023 11:39, Lukas Wunner wrote:
> On Fri, Nov 03, 2023 at 02:07:55PM -0500, Mario Limonciello wrote:
> > The `is_thunderbolt` bit has been used to indicate that a PCIe device
> > contained the Intel VSEC which is used by various parts of the kernel
> > to change behavior. To later allow usage with USB4 controllers as well,
> > rename this to `is_tunneled`.
>
> This doesn't seem to make sense. is_thunderbolt indicates that a device
> is part of a Thunderbolt controller. See the code comment:
>
> > -	unsigned int		is_thunderbolt:1;	/* Thunderbolt controller */
>
> A Thunderbolt controller is not necessarily tunneled. The PCIe switch,
> NHI and XHCI of the Thunderbolt host controller are not tunneled at all.
>
> Thanks,
>
> Lukas

I could really use some clarification which PCIe devices actually contain
the Intel VSEC.

Is it in all 3 of those PCIe devices and not just the switch?

If so, I think I would rather introduce a separate bit. So after this
series we would have:

is_tunneled:1
is_thunderbolt:1
no_command_complete:1

* TBT1 devices would set no_command_complete
  - The consumer would be pcie_init()
* All devices with the Intel VSEC would set is_thunderbolt and the two
  consumers would be:
  - apple-gmux.c
  - pci_bridge_d3_possible()
* USB4 devices and PCIe switches with the VSEC would set is_tunneled.
Re: [Nouveau] [PATCH v2 8/9] PCI: Exclude PCIe ports used for tunneling in pcie_bandwidth_available()
On 11/6/2023 06:52, Ilpo Järvinen wrote:
> On Fri, 3 Nov 2023, Mario Limonciello wrote:
> > The USB4 spec specifies that PCIe ports that are used for tunneling
> > PCIe traffic over USB4 fabric will be hardcoded to advertise 2.5GT/s and
> > behave as a PCIe Gen1 device. The actual performance of these ports is
> > controlled by the fabric implementation.
> >
> > Downstream drivers such as amdgpu which utilize pcie_bandwidth_available()
> > to program the device will always find the PCIe ports used for
> > tunneling as a limiting factor potentially leading to incorrect
> > performance decisions.
> >
> > To prevent problems in downstream drivers check explicitly for ports
> > being used for PCIe tunneling and skip them when looking for bandwidth
> > limitations of the hierarchy. If the only device connected is a root port
> > used for tunneling then report that device.
> >
> > Downstream drivers could make this change on their own but then they
> > wouldn't be able to detect other potential speed bottlenecks from the
> > hierarchy without duplicating pcie_bandwidth_available() logic.
> > Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2925#note_2145860
> > Link: https://www.usb.org/document-library/usb4r-specification-v20
> >       USB4 V2 with Errata and ECN through June 2023, Section 11.2.1
> > Signed-off-by: Mario Limonciello
> > ---
> >  drivers/pci/pci.c | 74 +++
> >  1 file changed, 49 insertions(+), 25 deletions(-)
> >
> > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > index d9aa5a39f585..15e37164ce56 100644
> > --- a/drivers/pci/pci.c
> > +++ b/drivers/pci/pci.c
> > @@ -6223,6 +6223,35 @@ int pcie_set_mps(struct pci_dev *dev, int mps)
> >  }
> >  EXPORT_SYMBOL(pcie_set_mps);
> >
> > +static u32 pcie_calc_bw_limits(struct pci_dev *dev, u32 bw,
> > +			       struct pci_dev **limiting_dev,
> > +			       enum pci_bus_speed *speed,
> > +			       enum pcie_link_width *width)
> > +{
> > +	enum pcie_link_width next_width;
> > +	enum pci_bus_speed next_speed;
> > +	u32 next_bw;
> > +	u16 lnksta;
> > +
> > +	pcie_capability_read_word(dev, PCI_EXP_LNKSTA, &lnksta);
> > +	next_speed = pcie_link_speed[lnksta & PCI_EXP_LNKSTA_CLS];
> > +	next_width = (lnksta & PCI_EXP_LNKSTA_NLW) >> PCI_EXP_LNKSTA_NLW_SHIFT;
> > +	next_bw = next_width * PCIE_SPEED2MBS_ENC(next_speed);
> > +
> > +	/* Check if current device limits the total bandwidth */
> > +	if (!bw || next_bw <= bw) {
> > +		bw = next_bw;
> > +		if (limiting_dev)
> > +			*limiting_dev = dev;
> > +		if (speed)
> > +			*speed = next_speed;
> > +		if (width)
> > +			*width = next_width;
> > +	}
> > +
> > +	return bw;
> > +}
> > +
> >  /**
> >   * pcie_bandwidth_available - determine minimum link settings of a PCIe
> >   *			      device and its bandwidth limitation
> > @@ -6236,47 +6265,42 @@ EXPORT_SYMBOL(pcie_set_mps);
> >   * limiting_dev, speed, and width pointers are supplied) information about
> >   * that point. The bandwidth returned is in Mb/s, i.e., megabits/second of
> >   * raw bandwidth.
> > + *
> > + * This excludes the bandwidth calculation that has been returned from a
> > + * PCIe device used for transmitting tunneled PCIe traffic over a Thunderbolt
> > + * or USB4 link that is part of larger hierarchy. The calculation is excluded
> > + * because the USB4 specification specifies that the max speed returned from
> > + * PCIe configuration registers for the tunneling link is always PCI 1x 2.5 GT/s.
> > + * When only tunneled devices are present, the bandwidth returned is the
> > + * bandwidth available from the first tunneled device.
> >   */
> >  u32 pcie_bandwidth_available(struct pci_dev *dev, struct pci_dev **limiting_dev,
> >  			     enum pci_bus_speed *speed,
> >  			     enum pcie_link_width *width)
> >  {
> > -	u16 lnksta;
> > -	enum pci_bus_speed next_speed;
> > -	enum pcie_link_width next_width;
> > -	u32 bw, next_bw;
> > +	struct pci_dev *tdev = NULL;
> > +	u32 bw = 0;
> >
> >  	if (speed)
> >  		*speed = PCI_SPEED_UNKNOWN;
> >  	if (width)
> >  		*width = PCIE_LNK_WIDTH_UNKNOWN;
> >
> > -	bw = 0;
> > -
> >  	while (dev) {
> > -		pcie_capability_read_word(dev, PCI_EXP_LNKSTA, &lnksta);
> > -
> > -		next_speed = pcie_link_speed[lnksta & PCI_EXP_LNKSTA_CLS];
> > -		next_width = (lnksta & PCI_EXP_LNKSTA_NLW) >>
> > -			PCI_EXP_LNKSTA_NLW_SHIFT;
> > -
> > -		next_bw = next_width * PCIE_SPEED2MBS_ENC(next_speed);
> > -
> > -		/* Check if current device limits the total bandwidth */
> > -		if (!bw || next_bw <= bw) {
> > -			bw = next_bw;
> > -
> > -			if (limiting_dev)
> > -				*limiting_dev = dev;
> > -			if (speed)
> > -				*speed = next_speed;
> > -			if (width)
> > -				*width = next_width;
> > +		if (dev->is_tunneled) {
> > +			if (!tdev)
> > +
Re: [Nouveau] [PATCH v2 5/9] PCI: pciehp: Move check for is_thunderbolt into a quirk
On 11/6/2023 06:41, Ilpo Järvinen wrote:
> On Fri, 3 Nov 2023, Mario Limonciello wrote:
> > commit 493fb50e958c ("PCI: pciehp: Assume NoCompl+ for Thunderbolt ports")
> > added a check into pciehp code to explicitly set NoCompl+ for all Intel
> > Thunderbolt controllers, including those that don't need it.
> >
> > This overloaded the purpose of the `is_thunderbolt` member of
> > `struct pci_device` because that means that any controller that
> > identifies as thunderbolt would set NoCompl+ even if it doesn't suffer
> > this deficiency.
> >
> > As that commit helpfully specifies all the controllers with the problem,
> > move them into a PCI quirk.
> >
> > Signed-off-by: Mario Limonciello
> > ---
> >  drivers/pci/hotplug/pciehp_hpc.c |  6 +-
> >  drivers/pci/quirks.c             | 20
> >  include/linux/pci.h              |  1 +
> >  3 files changed, 22 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
> > index fd713abdfb9f..23a92d681d1c 100644
> > --- a/drivers/pci/hotplug/pciehp_hpc.c
> > +++ b/drivers/pci/hotplug/pciehp_hpc.c
> > @@ -991,11 +991,7 @@ struct controller *pcie_init(struct pcie_device *dev)
> >  	if (pdev->hotplug_user_indicators)
> >  		slot_cap &= ~(PCI_EXP_SLTCAP_AIP | PCI_EXP_SLTCAP_PIP);
> >
> > -	/*
> > -	 * We assume no Thunderbolt controllers support Command Complete events,
> > -	 * but some controllers falsely claim they do.
> > -	 */
> > -	if (pdev->is_thunderbolt)
> > +	if (pdev->no_command_complete)
> >  		slot_cap |= PCI_EXP_SLTCAP_NCCS;
> >
> >  	ctrl->slot_cap = slot_cap;
> >
> > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> > index eeec1d6f9023..4bbf6e33ca11 100644
> > --- a/drivers/pci/quirks.c
> > +++ b/drivers/pci/quirks.c
> > @@ -3807,6 +3807,26 @@ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_CACTUS_RIDGE_4C
> >  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_PORT_RIDGE,
> >  			quirk_thunderbolt_hotplug_msi);
> >
> > +/*
> > + * We assume no Thunderbolt controllers support Command Complete events,
> > + * but some controllers falsely claim they do.
>
> IMO, this wording makes little sense with the new code. How about taking
> some text from the original commit's changelog:
>
> /*
>  * Certain Thunderbolt 1 controllers falsely claim to support Command
>  * Completed events.
>  */
>
> The code change looks fine.

Ack, will change.
Re: [Nouveau] [PATCH v2 1/9] drm/nouveau: Switch from pci_is_thunderbolt_attached() to dev_is_removable()
On 11/6/2023 10:47, Mika Westerberg wrote:
> On Mon, Nov 06, 2023 at 02:25:24PM +0200, Ilpo Järvinen wrote:
> > On Fri, 3 Nov 2023, Mario Limonciello wrote:
> > > pci_is_thunderbolt_attached() only works for Intel TBT devices. Switch
> > > to using dev_is_removable() to be able to detect USB4 devices as well.
> >
> > Please extend this with more details. I had to lookup the TBT change to
> > be able to make any guess why you're doing this (and it's still a guess
> > at best).
>
> Also please write it as Thunderbolt not TBT so that it is clear what we
> are talking about. Ditto to all patches.

Ack, thanks. I will add more detail to these first few patches and make
that s/TBT/Thunderbolt/ change when I rebase this on 6.7-rc1.
Re: [Nouveau] [PATCH v2 1/9] drm/nouveau: Switch from pci_is_thunderbolt_attached() to dev_is_removable()
On Mon, Nov 06, 2023 at 02:25:24PM +0200, Ilpo Järvinen wrote:
> On Fri, 3 Nov 2023, Mario Limonciello wrote:
>
> > pci_is_thunderbolt_attached() only works for Intel TBT devices. Switch to
> > using dev_is_removable() to be able to detect USB4 devices as well.
>
> Please extend this with more details. I had to lookup the TBT change to
> be able to make any guess why you're doing this (and it's still a guess
> at best).

Also please write it as Thunderbolt not TBT so that it is clear what we
are talking about. Ditto to all patches.
Re: [Nouveau] [PATCH v2 3/9] PCI: Drop pci_is_thunderbolt_attached()
On 11/6/2023 06:33, Ilpo Järvinen wrote:
> On Fri, 3 Nov 2023, Mario Limonciello wrote:
> > All callers have switched to dev_is_removable() for detecting
> > hotpluggable PCIe devices.
> >
> > Signed-off-by: Mario Limonciello
> > ---
> >  include/linux/pci.h | 22 --
> >  1 file changed, 22 deletions(-)
> >
> > diff --git a/include/linux/pci.h b/include/linux/pci.h
> > index b56417276042..530b0a360514 100644
> > --- a/include/linux/pci.h
> > +++ b/include/linux/pci.h
> > @@ -2616,28 +2616,6 @@ static inline bool pci_ari_enabled(struct pci_bus *bus)
> >  	return bus->self && bus->self->ari_enabled;
> >  }
> >
> > -/**
> > - * pci_is_thunderbolt_attached - whether device is on a Thunderbolt daisy chain
> > - * @pdev: PCI device to check
> > - *
> > - * Walk upwards from @pdev and check for each encountered bridge if it's part
> > - * of a Thunderbolt controller. Reaching the host bridge means @pdev is not
> > - * Thunderbolt-attached. (But rather soldered to the mainboard usually.)
> > - */
> > -static inline bool pci_is_thunderbolt_attached(struct pci_dev *pdev)
> > -{
> > -	struct pci_dev *parent = pdev;
> > -
> > -	if (pdev->is_thunderbolt)
> > -		return true;
> > -
> > -	while ((parent = pci_upstream_bridge(parent)))
> > -		if (parent->is_thunderbolt)
> > -			return true;
> > -
> > -	return false;
> > -}
> > -
> >  #if defined(CONFIG_PCIEPORTBUS) || defined(CONFIG_EEH)
> >  void pci_uevent_ers(struct pci_dev *pdev, enum pci_ers_result err_type);
> >  #endif
>
> I don't think all callers have been removed.

Ah, lkp has caught the same problem.

As I mentioned in the cover letter this series is done on 6.6 + a patch
going into 6.7-rc1. The LKP report will drop off when I rebase the series
on 6.7-rc1.

As it's not yet in Linus' tree here is that patch so you can see it:

https://gitlab.freedesktop.org/agd5f/linux/-/commit/7b1c6263eaf4fd64ffe1cafdc504a42ee4bfbb33
Re: [Nouveau] [PATCH drm-misc-next v8 09/12] drm/gpuvm: reference count drm_gpuvm structures
On Mon, Nov 06, 2023 at 04:10:50PM +0100, Christian König wrote:
> Am 06.11.23 um 15:11 schrieb Danilo Krummrich:
> > On Mon, Nov 06, 2023 at 02:05:13PM +0100, Christian König wrote:
> > > Am 06.11.23 um 13:16 schrieb Danilo Krummrich:
> > > > [SNIP]
> > > > This reference count just prevents that the VM is freed as long as
> > > > other ressources are attached to it that carry a VM pointer, such as
> > > > mappings and VM_BOs. The motivation for that are VM_BOs. For mappings
> > > > it's indeed a bit paranoid, but it doesn't hurt either and keeps it
> > > > consistant.
> > >
> > > Ah! Yeah, we have similar semantics in amdgpu as well.
> > >
> > > But we keep the reference to the root GEM object and not the VM.
> > >
> > > Ok, that makes much more sense then keeping one reference for each
> > > mapping.
> > >
> > > > > Because of this the mapping should *never* have a reference to the
> > > > > VM, but rather the VM destroys all mapping when it is destroyed
> > > > > itself.
> > > > >
> > > > > > Hence, If the VM is still alive at a point where you don't expect
> > > > > > it to be, then it's simply a driver bug.
> > > > >
> > > > > Driver bugs is just what I try to prevent here. When individual
> > > > > mappings keep the VM structure alive then drivers are responsible
> > > > > to clean them up, if the VM cleans up after itself then we don't
> > > > > need to worry about it in the driver.
> > > >
> > > > Drivers are *always* responsible for that. This has nothing to do
> > > > with whether the VM is reference counted or not. GPUVM can't clean up
> > > > mappings after itself.
> > >
> > > Why not?
> >
> > I feel like we're talking past each other here, at least to some extent.
> > However, I can't yet see where exactly the misunderstanding resides.
>
> +1
>
> > > At least in amdgpu we have it exactly like that. E.g. the higher level
> > > can cleanup the BO_VM structure at any time possible, even when there
> > > are mappings.
> >
> > What do you mean with "cleanup the VM_BO structure" exactly?
> >
> > The VM_BO structure keeps track of all the mappings mapped in the
> > VM_BO's VM being backed by the VM_BO's GEM object. And the GEM object
> > keeps a list of the corresponding VM_BOs.
> >
> > Hence, as long as there are mappings that this VM_BO keeps track of,
> > this VM_BO should stay alive.
>
> No, exactly the other way around. When the VM_BO structure is destroyed
> the mappings are destroyed with them.

This seems to be the same misunderstanding as with the VM reference count.

It seems to me that you want to say that for amdgpu it seems to be a
use-case to get rid of all mappings backed by a given BO and mapped in a
given VM, hence a VM_BO. You can do that. There's even a helper for that
in GPUVM.

But also in this case you first need to get rid of all mappings before you
*free* the VM_BO - GPUVM ensures that.

> Otherwise you would need to destroy each individual mapping separately
> before teardown which is quite inefficient.

Not sure what you mean, but I don't see a difference between walking all
VM_BOs and removing their mappings and walking the VM's tree of mappings
and removing each of them. Comes down to the same effort in the end. But
surely can go both ways if you know all the existing VM_BOs.

> > > The VM then keeps track which areas still need to be invalidated in
> > > the physical representation of the page tables.
> >
> > And the VM does that through its tree of mappings (struct drm_gpuva).
> > Hence, if the VM would just remove those structures on cleanup by
> > itself, you'd loose the ability of cleaning up the page tables. Unless,
> > you track this separately, which would make the whole tracking of GPUVM
> > itself kinda pointless.
>
> But how do you then keep track of areas which are freed and need to be
> updated so that nobody can access the underlying memory any more?

"Areas which are freed", what do you refer to? What do you mean with that?
Do you mean areas of the VA space not containing mappings?

Why would I need to track them explicitly? When the mapping is removed the
corresponding page tables / page table entries are gone as well, hence no
subsequent access to the underlaying memory would be possible.

> > > I would expect that the generalized GPU VM handling would need
> > > something similar. If we leave that to the driver then each driver
> > > would have to implement that stuff on it's own again.
> >
> > Similar to what? What exactly do you think can be generalized here?
>
> Similar to how amdgpu works.

I don't think it's quite fair to just throw the "look at what amdgpu does"
argument at me. What am I supposed to do? Read and understand *every*
detail of *every* driver?

Did you read through the GPUVM code? That's a honest question and I'm
asking it because I feel like you're picking up some details from commit
messages and start questioning them (and that's perfectly fine and
absolutely
Re: [Nouveau] [PATCH drm-misc-next v8 09/12] drm/gpuvm: reference count drm_gpuvm structures
Am 06.11.23 um 15:11 schrieb Danilo Krummrich: On Mon, Nov 06, 2023 at 02:05:13PM +0100, Christian König wrote: Am 06.11.23 um 13:16 schrieb Danilo Krummrich: [SNIP] This reference count just prevents that the VM is freed as long as other ressources are attached to it that carry a VM pointer, such as mappings and VM_BOs. The motivation for that are VM_BOs. For mappings it's indeed a bit paranoid, but it doesn't hurt either and keeps it consistant. Ah! Yeah, we have similar semantics in amdgpu as well. But we keep the reference to the root GEM object and not the VM. Ok, that makes much more sense then keeping one reference for each mapping. Because of this the mapping should *never* have a reference to the VM, but rather the VM destroys all mapping when it is destroyed itself. Hence, If the VM is still alive at a point where you don't expect it to be, then it's simply a driver bug. Driver bugs is just what I try to prevent here. When individual mappings keep the VM structure alive then drivers are responsible to clean them up, if the VM cleans up after itself then we don't need to worry about it in the driver. Drivers are *always* responsible for that. This has nothing to do with whether the VM is reference counted or not. GPUVM can't clean up mappings after itself. Why not? I feel like we're talking past each other here, at least to some extend. However, I can't yet see where exactly the misunderstanding resides. +1 At least in amdgpu we have it exactly like that. E.g. the higher level can cleanup the BO_VM structure at any time possible, even when there are mappings. What do you mean with "cleanup the VM_BO structue" exactly? The VM_BO structure keeps track of all the mappings mapped in the VM_BO's VM being backed by the VM_BO's GEM object. And the GEM objects keeps a list of the corresponding VM_BOs. Hence, as long as there are mappings that this VM_BO keeps track of, this VM_BO should stay alive. No, exactly the other way around. 
When the VM_BO structure is destroyed, the mappings are destroyed with it. Otherwise you would need to destroy each individual mapping separately before teardown, which is quite inefficient. The VM then keeps track of which areas still need to be invalidated in the physical representation of the page tables. And the VM does that through its tree of mappings (struct drm_gpuva). Hence, if the VM would just remove those structures on cleanup by itself, you'd lose the ability to clean up the page tables. Unless you track this separately, which would make the whole tracking of GPUVM itself kinda pointless. But how do you then keep track of areas which are freed and need to be updated so that nobody can access the underlying memory any more? I would expect that the generalized GPU VM handling would need something similar. If we leave that to the driver then each driver would have to implement that stuff on its own again. Similar to what? What exactly do you think can be generalized here? Similar to how amdgpu works. From what I can see you are basically re-inventing everything we already have in there and asking the same questions we stumbled over years ago. If the driver left mappings, GPUVM would just leak them without a reference count. It doesn't know about the driver's surrounding structures, nor does it know about attached resources such as PT(E)s. What are we talking about with the word "mapping"? The VM_BO structure? Or each individual mapping? An individual mapping represented by struct drm_gpuva. Yeah, then that certainly doesn't work. See below. E.g. what we need to prevent is that the VM structure (or the root GEM object) is released while VM_BOs are still around. That's what I totally agree on. But each individual mapping is a different story. Userspace can create so many of them that we probably could even overrun a 32-bit counter quite easily. REFCOUNT_MAX is specified as 0x7fff_ffff. 
I agree there can be a lot of mappings, but (including the VM_BO references) more than 2,147,483,647 per VM? IIRC on amdgpu we can create something like 100k mappings per second and each takes ~64 bytes. So you just need 128GiB of memory and approx 20 seconds to let the kernel run into a refcount overrun. The worst I've seen in a real-world game was around 19k mappings, but that doesn't mean that this can't be exploited. What can be done is to keep one reference per VM_BO structure, but I think per mapping is rather unrealistic. Regards, Christian.
Re: [Nouveau] [PATCH drm-misc-next v8 09/12] drm/gpuvm: reference count drm_gpuvm structures
On Mon, Nov 06, 2023 at 02:05:13PM +0100, Christian König wrote: > Am 06.11.23 um 13:16 schrieb Danilo Krummrich: > > [SNIP] > > This reference count just prevents that the VM is freed as long as other > > resources are attached to it that carry a VM pointer, such as mappings and > > VM_BOs. The motivation for that are VM_BOs. For mappings it's indeed a bit > > paranoid, but it doesn't hurt either and keeps it consistent. > > Ah! Yeah, we have similar semantics in amdgpu as well. > > But we keep the reference to the root GEM object and not the VM. > > Ok, that makes much more sense than keeping one reference for each mapping. > > > Because of this the mapping should *never* have a reference to the VM, but > > > rather the VM destroys all mappings when it is destroyed itself. > > > > > > > Hence, if the VM is still alive at a point where you don't expect it to > > > > be, then it's > > > > simply a driver bug. > > > Driver bugs are just what I try to prevent here. When individual mappings > > > keep the VM structure alive then drivers are responsible to clean them up, > > > if the VM cleans up after itself then we don't need to worry about it in > > > the > > > driver. > > Drivers are *always* responsible for that. This has nothing to do with > > whether > > the VM is reference counted or not. GPUVM can't clean up mappings after > > itself. > > Why not? I feel like we're talking past each other here, at least to some extent. However, I can't yet see where exactly the misunderstanding resides. > > At least in amdgpu we have it exactly like that. E.g. the higher level can > clean up the BO_VM structure at any time, even when there are > mappings. What do you mean by "clean up the VM_BO structure" exactly? The VM_BO structure keeps track of all the mappings mapped in the VM_BO's VM being backed by the VM_BO's GEM object. And the GEM object keeps a list of the corresponding VM_BOs. 
Hence, as long as there are mappings that this VM_BO keeps track of, this VM_BO should stay alive. > The VM then keeps track of which areas still need to be invalidated > in the physical representation of the page tables. And the VM does that through its tree of mappings (struct drm_gpuva). Hence, if the VM would just remove those structures on cleanup by itself, you'd lose the ability to clean up the page tables. Unless you track this separately, which would make the whole tracking of GPUVM itself kinda pointless. > > I would expect that the generalized GPU VM handling would need something > similar. If we leave that to the driver then each driver would have to > implement that stuff on its own again. Similar to what? What exactly do you think can be generalized here? > > > If the driver left mappings, GPUVM would just leak them without a reference > > count. > > It doesn't know about the driver's surrounding structures, nor does it know > > about > > attached resources such as PT(E)s. > > What are we talking about with the word "mapping"? The VM_BO structure? Or each > individual mapping? An individual mapping represented by struct drm_gpuva. > > E.g. what we need to prevent is that the VM structure (or the root GEM object) > is released while VM_BOs are still around. That's what I totally agree on. > > But each individual mapping is a different story. Userspace can create so > many of them that we probably could even overrun a 32-bit counter quite > easily. REFCOUNT_MAX is specified as 0x7fff_ffff. I agree there can be a lot of mappings, but (including the VM_BO references) more than 2,147,483,647 per VM? > > > > When the mapping is destroyed with the VM drivers can't mess this common > > > operation up. That's why this is more defensive. > > > > > > What is a possible requirement is that external code needs to keep > > > references to the VM, but *never* the VM to itself through the mappings. I > > > would consider that a major bug in the component. 
> > Obviously, you just (want to) apply different semantics to this reference > > count. It is meant to reflect that the VM structure can be freed, instead > > of that the > > VM can be cleaned up. If you want the latter, you can have a driver-specific > > reference count for that in the exact same way as it was before this patch. > > Yeah, it becomes clear that you try to solve some different problem than I > expected. > > Regards, > Christian. > > > > > > Regards, > > > Christian. > > > > > > > > Reference counting is nice when you don't know who else is referring > > > > > to your VM, but the cost is that you also don't know when the object > > > > > will actually be destroyed. > > > > > > > > > > I can trivially work around this by saying that the generic GPUVM > > > > > object has a different lifetime than the amdgpu specific object, but > > > > > that opens up doors for use-after-free again. > > > > If your driver never touches the VM's reference count and exits the VM > > > > with a clean state > > > > (no mappings and no VM_BOs left),
Re: [Nouveau] [PATCH drm-misc-next v8 09/12] drm/gpuvm: reference count drm_gpuvm structures
Am 06.11.23 um 13:16 schrieb Danilo Krummrich: [SNIP] This reference count just prevents that the VM is freed as long as other resources are attached to it that carry a VM pointer, such as mappings and VM_BOs. The motivation for that are VM_BOs. For mappings it's indeed a bit paranoid, but it doesn't hurt either and keeps it consistent. Ah! Yeah, we have similar semantics in amdgpu as well. But we keep the reference to the root GEM object and not the VM. Ok, that makes much more sense than keeping one reference for each mapping. Because of this the mapping should *never* have a reference to the VM, but rather the VM destroys all mappings when it is destroyed itself. Hence, if the VM is still alive at a point where you don't expect it to be, then it's simply a driver bug. Driver bugs are just what I try to prevent here. When individual mappings keep the VM structure alive then drivers are responsible to clean them up, if the VM cleans up after itself then we don't need to worry about it in the driver. Drivers are *always* responsible for that. This has nothing to do with whether the VM is reference counted or not. GPUVM can't clean up mappings after itself. Why not? At least in amdgpu we have it exactly like that. E.g. the higher level can clean up the BO_VM structure at any time, even when there are mappings. The VM then keeps track of which areas still need to be invalidated in the physical representation of the page tables. I would expect that the generalized GPU VM handling would need something similar. If we leave that to the driver then each driver would have to implement that stuff on its own again. If the driver left mappings, GPUVM would just leak them without a reference count. It doesn't know about the driver's surrounding structures, nor does it know about attached resources such as PT(E)s. What are we talking about with the word "mapping"? The VM_BO structure? Or each individual mapping? E.g. what we need to prevent is that the VM structure (or the root GEM object) is released while VM_BOs are still around. That's what I totally agree on. But each individual mapping is a different story. Userspace can create so many of them that we probably could even overrun a 32-bit counter quite easily. When the mapping is destroyed with the VM drivers can't mess this common operation up. That's why this is more defensive. What is a possible requirement is that external code needs to keep references to the VM, but *never* the VM to itself through the mappings. I would consider that a major bug in the component. Obviously, you just (want to) apply different semantics to this reference count. It is meant to reflect that the VM structure can be freed, instead of that the VM can be cleaned up. If you want the latter, you can have a driver-specific reference count for that in the exact same way as it was before this patch. Yeah, it becomes clear that you try to solve some different problem than I expected. Regards, Christian. Regards, Christian. Reference counting is nice when you don't know who else is referring to your VM, but the cost is that you also don't know when the object will actually be destroyed. I can trivially work around this by saying that the generic GPUVM object has a different lifetime than the amdgpu specific object, but that opens up doors for use-after-free again. If your driver never touches the VM's reference count and exits the VM with a clean state (no mappings and no VM_BOs left), effectively, this is the same as having no reference count. In the very worst case you could argue that we trade a potential UAF *and* memory leak (no reference count) with *only* a memory leak (with reference count), which to me seems reasonable. Regards, Christian. Thanks, Christian. 
[1] https://lore.kernel.org/dri-devel/6fa058a4-20d3-44b9-af58-755cfb375...@redhat.com/

Signed-off-by: Danilo Krummrich
---
 drivers/gpu/drm/drm_gpuvm.c            | 44 +++---
 drivers/gpu/drm/nouveau/nouveau_uvmm.c | 20 +---
 include/drm/drm_gpuvm.h                | 31 +-
 3 files changed, 78 insertions(+), 17 deletions(-)

diff --git a/drivers/gpu/drm/drm_gpuvm.c b/drivers/gpu/drm/drm_gpuvm.c
index 53e2c406fb04..6a88eafc5229 100644
--- a/drivers/gpu/drm/drm_gpuvm.c
+++ b/drivers/gpu/drm/drm_gpuvm.c
@@ -746,6 +746,8 @@ drm_gpuvm_init(struct drm_gpuvm *gpuvm, const char *name,
 	gpuvm->rb.tree = RB_ROOT_CACHED;
 	INIT_LIST_HEAD(&gpuvm->rb.list);
 
+	kref_init(&gpuvm->kref);
+
 	gpuvm->name = name ? name : "unknown";
 	gpuvm->flags = flags;
 	gpuvm->ops = ops;
@@ -770,15 +772,8 @@ drm_gpuvm_init(struct drm_gpuvm *gpuvm, const char *name,
 }
 EXPORT_SYMBOL_GPL(drm_gpuvm_init);
 
-/**
- * drm_gpuvm_destroy() - cleanup a &drm_gpuvm
- * @gpuvm: pointer to the &drm_gpuvm to clean up
- *
- * Note that it is a bug to call this function on a manager that still
- * holds GPU VA mappings.
- */
-void
Re: [Nouveau] [PATCH drm-misc-next v8 09/12] drm/gpuvm: reference count drm_gpuvm structures
On Mon, Nov 06, 2023 at 10:14:29AM +0100, Christian König wrote: > Am 03.11.23 um 16:34 schrieb Danilo Krummrich: > [SNIP] > > > > > > Especially we most likely don't want the VM to live longer than the > > > application which originally used it. If you make the GPUVM an > > > independent object you actually open up driver abuse for the > > > lifetime of this. > > > > Right, we don't want that. But I don't see how the reference count > > prevents that. > > It doesn't prevent that; it's just not the most defensive approach. > > > > > Independent object is relative. struct drm_gpuvm is still embedded into > > a driver-specific > > structure. It's working the same way as with struct > > drm_gem_object. > > > > > > > > Additional to that see below for a quite real problem with this. > > > > > > > > Background is that the most common use case I see is that > > > > > this object is > > > > > embedded into something else and a reference count is then > > > > > not really a good > > > > > idea. > > > > Do you have a specific use-case in mind where this would interfere? > > > > > > Yes, absolutely. For an example see amdgpu_mes_self_test(), here we > > > initialize a temporary amdgpu VM for an in-kernel unit test which > > > runs during driver load. > > > > > > When the function returns I need to guarantee that the VM is > > > destroyed or otherwise I will mess up normal operation. > > > > Nothing prevents that. The reference counting is well defined. If the > > driver did not > > take additional references (which is clearly up to the driver taking > > care of) and all > > VM_BOs and mappings are cleaned up, the reference count is guaranteed to > > be 1 at this > > point. > > > > Also note that if the driver had not cleaned up all VM_BOs and > > mappings before > > shutting down the VM, it would have been a bug anyways and the driver > > would potentially > > leak memory and cause UAF issues. 
> > Exactly that's what I'm talking about and why I think this is an extremely bad > idea. > > It's a perfectly normal operation to shut down the VM while there are still > mappings. This is just what happens when you kill an application. Shut down the VM in terms of removing existing mappings, doing driver-specific cleanup operations, etc. But not freeing the VM structure yet. That's what you're gonna do after you've cleaned up everything. So, when your application gets killed, you just call your driver_vm_destroy() function, cleaning up all mappings etc. and then you call drm_gpuvm_put(). And if you did a decent job cleaning things up (removing existing mappings etc.) this drm_gpuvm_put() call will result in the vm_free() callback being called. This reference count just prevents that the VM is freed as long as other resources are attached to it that carry a VM pointer, such as mappings and VM_BOs. The motivation for that are VM_BOs. For mappings it's indeed a bit paranoid, but it doesn't hurt either and keeps it consistent. > > Because of this the mapping should *never* have a reference to the VM, but > rather the VM destroys all mappings when it is destroyed itself. > > > Hence, if the VM is still alive at a point where you don't expect it to > > be, then it's > > simply a driver bug. > > Driver bugs are just what I try to prevent here. When individual mappings > keep the VM structure alive then drivers are responsible to clean them up, > if the VM cleans up after itself then we don't need to worry about it in the > driver. Drivers are *always* responsible for that. This has nothing to do with whether the VM is reference counted or not. GPUVM can't clean up mappings after itself. If the driver left mappings, GPUVM would just leak them without a reference count. It doesn't know about the driver's surrounding structures, nor does it know about attached resources such as PT(E)s. > > When the mapping is destroyed with the VM drivers can't mess this common > operation up. 
That's why this is more defensive. > > What is a possible requirement is that external code needs to keep > references to the VM, but *never* the VM to itself through the mappings. I > would consider that a major bug in the component. Obviously, you just (want to) apply different semantics to this reference count. It is meant to reflect that the VM structure can be freed, instead of that the VM can be cleaned up. If you want the latter, you can have a driver-specific reference count for that in the exact same way as it was before this patch. > > Regards, > Christian. > > > > > > > > > Reference counting is nice when you don't know who else is referring > > > to your VM, but the cost is that you also don't know when the object > > > will actually be destroyed. > > > > > > I can trivially work around this by saying that the generic GPUVM > > > object has a different lifetime than the amdgpu specific object, but > > > that opens up doors for use-after-free again. > > > > If your driver never touches the VM's reference count and exits the VM > >
Re: [Nouveau] [PATCH drm-misc-next v8 09/12] drm/gpuvm: reference count drm_gpuvm structures
Am 03.11.23 um 16:34 schrieb Danilo Krummrich: [SNIP] Especially we most likely don't want the VM to live longer than the application which originally used it. If you make the GPUVM an independent object you actually open up driver abuse for the lifetime of this. Right, we don't want that. But I don't see how the reference count prevents that. It doesn't prevent that; it's just not the most defensive approach. Independent object is relative. struct drm_gpuvm is still embedded into a driver-specific structure. It's working the same way as with struct drm_gem_object. Additional to that see below for a quite real problem with this. Background is that the most common use case I see is that this object is embedded into something else and a reference count is then not really a good idea. Do you have a specific use-case in mind where this would interfere? Yes, absolutely. For an example see amdgpu_mes_self_test(), here we initialize a temporary amdgpu VM for an in-kernel unit test which runs during driver load. When the function returns I need to guarantee that the VM is destroyed or otherwise I will mess up normal operation. Nothing prevents that. The reference counting is well defined. If the driver did not take additional references (which is clearly up to the driver taking care of) and all VM_BOs and mappings are cleaned up, the reference count is guaranteed to be 1 at this point. Also note that if the driver had not cleaned up all VM_BOs and mappings before shutting down the VM, it would have been a bug anyways and the driver would potentially leak memory and cause UAF issues. Exactly that's what I'm talking about and why I think this is an extremely bad idea. It's a perfectly normal operation to shut down the VM while there are still mappings. This is just what happens when you kill an application. Because of this the mapping should *never* have a reference to the VM, but rather the VM destroys all mappings when it is destroyed itself. Hence, if the VM is still alive at a point where you don't expect it to be, then it's simply a driver bug. Driver bugs are just what I try to prevent here. When individual mappings keep the VM structure alive then drivers are responsible to clean them up, if the VM cleans up after itself then we don't need to worry about it in the driver. When the mapping is destroyed with the VM drivers can't mess this common operation up. That's why this is more defensive. What is a possible requirement is that external code needs to keep references to the VM, but *never* the VM to itself through the mappings. I would consider that a major bug in the component. Regards, Christian. Reference counting is nice when you don't know who else is referring to your VM, but the cost is that you also don't know when the object will actually be destroyed. I can trivially work around this by saying that the generic GPUVM object has a different lifetime than the amdgpu specific object, but that opens up doors for use-after-free again. If your driver never touches the VM's reference count and exits the VM with a clean state (no mappings and no VM_BOs left), effectively, this is the same as having no reference count. In the very worst case you could argue that we trade a potential UAF *and* memory leak (no reference count) with *only* a memory leak (with reference count), which to me seems reasonable. Regards, Christian. Thanks, Christian. 
[1] https://lore.kernel.org/dri-devel/6fa058a4-20d3-44b9-af58-755cfb375...@redhat.com/

Signed-off-by: Danilo Krummrich
---
 drivers/gpu/drm/drm_gpuvm.c            | 44 +++---
 drivers/gpu/drm/nouveau/nouveau_uvmm.c | 20 +---
 include/drm/drm_gpuvm.h                | 31 +-
 3 files changed, 78 insertions(+), 17 deletions(-)

diff --git a/drivers/gpu/drm/drm_gpuvm.c b/drivers/gpu/drm/drm_gpuvm.c
index 53e2c406fb04..6a88eafc5229 100644
--- a/drivers/gpu/drm/drm_gpuvm.c
+++ b/drivers/gpu/drm/drm_gpuvm.c
@@ -746,6 +746,8 @@ drm_gpuvm_init(struct drm_gpuvm *gpuvm, const char *name,
 	gpuvm->rb.tree = RB_ROOT_CACHED;
 	INIT_LIST_HEAD(&gpuvm->rb.list);
 
+	kref_init(&gpuvm->kref);
+
 	gpuvm->name = name ? name : "unknown";
 	gpuvm->flags = flags;
 	gpuvm->ops = ops;
@@ -770,15 +772,8 @@ drm_gpuvm_init(struct drm_gpuvm *gpuvm, const char *name,
 }
 EXPORT_SYMBOL_GPL(drm_gpuvm_init);
 
-/**
- * drm_gpuvm_destroy() - cleanup a &drm_gpuvm
- * @gpuvm: pointer to the &drm_gpuvm to clean up
- *
- * Note that it is a bug to call this function on a manager that still
- * holds GPU VA mappings.
- */
-void
-drm_gpuvm_destroy(struct drm_gpuvm *gpuvm)
+static void
+drm_gpuvm_fini(struct drm_gpuvm *gpuvm)
 {
 	gpuvm->name = NULL;
@@ -790,7 +785,33 @@ drm_gpuvm_destroy(struct drm_gpuvm *gpuvm)
 
 	drm_gem_object_put(gpuvm->r_obj);
 }
-EXPORT_SYMBOL_GPL(drm_gpuvm_destroy);
+
+static void
+drm_gpuvm_free(struct kref *kref)
+{