Re: [PATCH v2 5/8] drm/amdgpu: Refactor sysfs removal
On 12/2/20 1:20 PM, Greg KH wrote:

On Wed, Dec 02, 2020 at 01:02:06PM -0500, Andrey Grodzovsky wrote:

On 12/2/20 12:34 PM, Greg KH wrote:

On Wed, Dec 02, 2020 at 10:48:01AM -0500, Andrey Grodzovsky wrote:

On 11/11/20 10:34 AM, Greg KH wrote:

On Wed, Nov 11, 2020 at 10:13:13AM -0500, Andrey Grodzovsky wrote:

On 11/10/20 12:59 PM, Greg KH wrote:

On Tue, Nov 10, 2020 at 12:54:21PM -0500, Andrey Grodzovsky wrote:

Hi, back to this after a long context switch for some higher priority stuff.

So here I was eventually able to drop all this code, and this change here
https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=amd-staging-drm-next-device-unplug&id=61852c8a59b4dd89d637693552c73175b9f2ccd6
was enough for me. Seems like while device_remove_file can handle the use case where the file and the parent directory are already gone, sysfs_remove_group goes down in flames in that case due to kobj->sd being unset on device removal.

A driver shouldn't ever have to remove individual sysfs groups; the driver core/bus logic should do it for them automatically.

And whenever a driver calls a sysfs_* call, that's a hint that something is not working properly.

Do you mean that while the driver creates the groups and files explicitly from its different subsystems, it should not explicitly remove each one of them, because all of them should be removed at once (and recursively) when the device is being removed?

Individual drivers should never add groups/files in sysfs; the driver core should do it properly for you if you have everything set up properly. And yes, the driver core will automatically remove them as well.

Please use the default groups attribute for your bus/subsystem and this will happen automagically.

Hi Greg, I tried your suggestion to hang amdgpu's sysfs attributes on the default attributes in struct device.groups, but it turns out not to be usable, since by the time I have access to the struct device from amdgpu code it has already been initialized by the PCI core (i.e. past the point where device_add->device_add_attrs->device_add_groups with dev->groups is called), and so I can't really use it.

That's odd, why can't you just set the groups pointer in your pci_driver structure? That's what it is there for, right?

I am probably missing something, but amdgpu sysfs attrs are per device, not per driver.

Oops, you are right, you want the 'dev_groups' field. Looks like pci doesn't export that directly, so you can do:

	.driver = {
		.dev_groups = my_device_groups,
	},

in your pci_driver structure. Or I'm sure the PCI driver maintainer would take a patch like 7d9c1d2f7aca ("USB: add support for dev_groups to struct usb_device_driver") was done for the USB subsystem, as diving into the "raw" .driver pointer isn't really that clean or nice in my opinion.

Looks like exactly what I need. I will probably start by assigning the raw pointer just to push my work ahead, and in parallel will submit the same kind of patch as yours for PCI subsystem review, as the rework to switch to it is really minimal.

Andrey

thanks,

greg k-h
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel
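Sketched out, the dev_groups wiring discussed in the thread above would look roughly like the following. The attribute, group, and driver names here are hypothetical, not actual amdgpu code; only `dev_groups`, `DEVICE_ATTR_RO()`, and `ATTRIBUTE_GROUPS()` are real kernel interfaces, and reaching into the embedded `struct device_driver` is the interim step Greg describes (since `struct pci_driver` had no `dev_groups` field of its own at the time of this thread):

```c
#include <linux/device.h>
#include <linux/pci.h>
#include <linux/sysfs.h>

/* Hypothetical per-device attribute; a real driver would read the value
 * from its driver-private data via dev_get_drvdata(dev). */
static ssize_t serial_number_show(struct device *dev,
				  struct device_attribute *attr, char *buf)
{
	return sysfs_emit(buf, "0\n");
}
static DEVICE_ATTR_RO(serial_number);

static struct attribute *my_dev_attrs[] = {
	&dev_attr_serial_number.attr,
	NULL,
};
ATTRIBUTE_GROUPS(my_dev);	/* defines my_dev_groups from my_dev_attrs */

static const struct pci_device_id my_pci_ids[] = {
	{ PCI_DEVICE(0x1002, 0x0000) },	/* placeholder vendor/device IDs */
	{ }
};

static struct pci_driver my_pci_driver = {
	.name     = "my_pci_driver",
	.id_table = my_pci_ids,
	/* .probe / .remove as usual ... */
	.driver   = {
		/* Driver core creates these files after probe() succeeds and
		 * removes them automatically on unbind/hot-unplug, so the
		 * driver never calls sysfs_remove_group() itself. */
		.dev_groups = my_dev_groups,
	},
};
```

With this wiring, attribute lifetime is owned entirely by the driver core, so the sysfs_remove_group() crash described in the thread never has a chance to happen.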
Re: [PATCH v2 5/8] drm/amdgpu: Refactor sysfs removal
On Wed, Dec 02, 2020 at 01:02:06PM -0500, Andrey Grodzovsky wrote:
> 
> On 12/2/20 12:34 PM, Greg KH wrote:
> > On Wed, Dec 02, 2020 at 10:48:01AM -0500, Andrey Grodzovsky wrote:
> > > On 11/11/20 10:34 AM, Greg KH wrote:
> > > > On Wed, Nov 11, 2020 at 10:13:13AM -0500, Andrey Grodzovsky wrote:
> > > > > On 11/10/20 12:59 PM, Greg KH wrote:
> > > > > > On Tue, Nov 10, 2020 at 12:54:21PM -0500, Andrey Grodzovsky wrote:
> > > > > > > Hi, back to this after a long context switch for some higher
> > > > > > > priority stuff.
> > > > > > >
> > > > > > > So here I was able eventually to drop all this code and this
> > > > > > > change here
> > > > > > > https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=amd-staging-drm-next-device-unplug&id=61852c8a59b4dd89d637693552c73175b9f2ccd6
> > > > > > > was enough for me. Seems like while device_remove_file can handle
> > > > > > > the use case where the file and the parent directory already gone,
> > > > > > > sysfs_remove_group goes down in flames in that case
> > > > > > > due to kobj->sd being unset on device removal.
> > > > > > A driver shouldn't ever have to remove individual sysfs groups, the
> > > > > > driver core/bus logic should do it for them automatically.
> > > > > >
> > > > > > And whenever a driver calls a sysfs_* call, that's a hint that
> > > > > > something is not working properly.
> > > > >
> > > > > Do you mean that while the driver creates the groups and files
> > > > > explicitly from it's different subsystems it should not explicitly
> > > > > remove each one of them because all of them should be removed at once
> > > > > (and recursively) when the device is being removed ?
> > > > Individual drivers should never add groups/files in sysfs, the driver
> > > > core should do it properly for you if you have everything set up
> > > > properly. And yes, the driver core will automatically remove them as
> > > > well.
> > > >
> > > > Please use the default groups attribute for your bus/subsystem and this
> > > > will happen automagically.
> > >
> > > Hi Greg, I tried your suggestion to hang amdgpu's sysfs
> > > attributes on default attributes in struct device.groups but turns out
> > > it's not usable since by the
> > > time i have access to struct device from amdgpu code it has already been
> > > initialized by pci core
> > > (i.e. past the point where device_add->device_add_attrs->device_add_groups
> > > with dev->groups is called)
> > > and so i can't really use it.
> > 
> > That's odd, why can't you just set the groups pointer in your pci_driver
> > structure? That's what it is there for, right?
> 
> I am probably missing something but amdgpu sysfs attrs are per device not
> per driver

Oops, you are right, you want the 'dev_groups' field. Looks like pci
doesn't export that directly, so you can do:

	.driver = {
		.dev_groups = my_device_groups,
	},

in your pci_driver structure. Or I'm sure the PCI driver maintainer
would take a patch like 7d9c1d2f7aca ("USB: add support for dev_groups
to struct usb_device_driver") was done for the USB subsystem, as diving
into the "raw" .driver pointer isn't really that clean or nice in my
opinion.

thanks,

greg k-h
Re: [PATCH v2 5/8] drm/amdgpu: Refactor sysfs removal
On 12/2/20 12:34 PM, Greg KH wrote:

On Wed, Dec 02, 2020 at 10:48:01AM -0500, Andrey Grodzovsky wrote:

On 11/11/20 10:34 AM, Greg KH wrote:

On Wed, Nov 11, 2020 at 10:13:13AM -0500, Andrey Grodzovsky wrote:

On 11/10/20 12:59 PM, Greg KH wrote:

On Tue, Nov 10, 2020 at 12:54:21PM -0500, Andrey Grodzovsky wrote:

Hi, back to this after a long context switch for some higher priority stuff.

So here I was able eventually to drop all this code and this change here
https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=amd-staging-drm-next-device-unplug&id=61852c8a59b4dd89d637693552c73175b9f2ccd6
was enough for me. Seems like while device_remove_file can handle the use case where the file and the parent directory are already gone, sysfs_remove_group goes down in flames in that case due to kobj->sd being unset on device removal.

A driver shouldn't ever have to remove individual sysfs groups, the driver core/bus logic should do it for them automatically.

And whenever a driver calls a sysfs_* call, that's a hint that something is not working properly.

Do you mean that while the driver creates the groups and files explicitly from its different subsystems it should not explicitly remove each one of them because all of them should be removed at once (and recursively) when the device is being removed ?

Individual drivers should never add groups/files in sysfs, the driver core should do it properly for you if you have everything set up properly. And yes, the driver core will automatically remove them as well.

Please use the default groups attribute for your bus/subsystem and this will happen automagically.

Hi Greg, I tried your suggestion to hang amdgpu's sysfs attributes on default attributes in struct device.groups, but it turns out not to be usable, since by the time I have access to struct device from amdgpu code it has already been initialized by the pci core (i.e. past the point where device_add->device_add_attrs->device_add_groups with dev->groups is called) and so I can't really use it.

That's odd, why can't you just set the groups pointer in your pci_driver structure? That's what it is there for, right?

I am probably missing something, but amdgpu sysfs attrs are per device, not per driver, and their life cycle is bound to the device, and their location in the sysfs topology is under each device. Putting them as driver default attrs will not put them in their current per-device location and won't make them automatically be destroyed once a particular device goes away, no ?

Andrey

What I can only think of using is creating my own struct attribute_group ** array in amdgpu where I aggregate all amdgpu sysfs attributes, call device_add_groups at the end of the amdgpu pci probe with that array, and on device remove call device_remove_groups with the same array.

Horrid, no, see above :)

thanks,

greg k-h
Re: [PATCH v2 5/8] drm/amdgpu: Refactor sysfs removal
On Wed, Dec 02, 2020 at 10:48:01AM -0500, Andrey Grodzovsky wrote:
> 
> On 11/11/20 10:34 AM, Greg KH wrote:
> > On Wed, Nov 11, 2020 at 10:13:13AM -0500, Andrey Grodzovsky wrote:
> > > On 11/10/20 12:59 PM, Greg KH wrote:
> > > > On Tue, Nov 10, 2020 at 12:54:21PM -0500, Andrey Grodzovsky wrote:
> > > > > Hi, back to this after a long context switch for some higher priority
> > > > > stuff.
> > > > >
> > > > > So here I was able eventually to drop all this code and this change
> > > > > here
> > > > > https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=amd-staging-drm-next-device-unplug&id=61852c8a59b4dd89d637693552c73175b9f2ccd6
> > > > > was enough for me. Seems like while device_remove_file can handle the
> > > > > use case where the file and the parent directory already gone,
> > > > > sysfs_remove_group goes down in flames in that case
> > > > > due to kobj->sd being unset on device removal.
> > > > A driver shouldn't ever have to remove individual sysfs groups, the
> > > > driver core/bus logic should do it for them automatically.
> > > >
> > > > And whenever a driver calls a sysfs_* call, that's a hint that something
> > > > is not working properly.
> > >
> > > Do you mean that while the driver creates the groups and files explicitly
> > > from it's different subsystems it should not explicitly remove each
> > > one of them because all of them should be removed at once (and
> > > recursively) when the device is being removed ?
> > Individual drivers should never add groups/files in sysfs, the driver
> > core should do it properly for you if you have everything set up
> > properly. And yes, the driver core will automatically remove them as
> > well.
> >
> > Please use the default groups attribute for your bus/subsystem and this
> > will happen automagically.
> 
> Hi Greg, I tried your suggestion to hang amdgpu's sysfs
> attributes on default attributes in struct device.groups but turns out it's
> not usable since by the
> time i have access to struct device from amdgpu code it has already been
> initialized by pci core
> (i.e. past the point where device_add->device_add_attrs->device_add_groups
> with dev->groups is called)
> and so i can't really use it.

That's odd, why can't you just set the groups pointer in your pci_driver
structure? That's what it is there for, right?

> What I can only think of using is creating my own struct attribute_group **
> array in amdgpu where I aggregate all
> amdgpu sysfs attributes, call device_add_groups in the end of amgpu pci
> probe with that array and on device remove call
> device_remove_groups with the same array.

Horrid, no, see above :)

thanks,

greg k-h
Re: [PATCH v2 5/8] drm/amdgpu: Refactor sysfs removal
On 11/11/20 10:34 AM, Greg KH wrote:

On Wed, Nov 11, 2020 at 10:13:13AM -0500, Andrey Grodzovsky wrote:

On 11/10/20 12:59 PM, Greg KH wrote:

On Tue, Nov 10, 2020 at 12:54:21PM -0500, Andrey Grodzovsky wrote:

Hi, back to this after a long context switch for some higher priority stuff.

So here I was able eventually to drop all this code and this change here
https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=amd-staging-drm-next-device-unplug&id=61852c8a59b4dd89d637693552c73175b9f2ccd6
was enough for me. Seems like while device_remove_file can handle the use case where the file and the parent directory are already gone, sysfs_remove_group goes down in flames in that case due to kobj->sd being unset on device removal.

A driver shouldn't ever have to remove individual sysfs groups, the driver core/bus logic should do it for them automatically.

And whenever a driver calls a sysfs_* call, that's a hint that something is not working properly.

Do you mean that while the driver creates the groups and files explicitly from its different subsystems it should not explicitly remove each one of them because all of them should be removed at once (and recursively) when the device is being removed ?

Individual drivers should never add groups/files in sysfs, the driver core should do it properly for you if you have everything set up properly. And yes, the driver core will automatically remove them as well.

Please use the default groups attribute for your bus/subsystem and this will happen automagically.

Hi Greg, I tried your suggestion to hang amdgpu's sysfs attributes on default attributes in struct device.groups, but it turns out not to be usable, since by the time I have access to struct device from amdgpu code it has already been initialized by the pci core (i.e. past the point where device_add->device_add_attrs->device_add_groups with dev->groups is called) and so I can't really use it.

What I can only think of using is creating my own struct attribute_group ** array in amdgpu where I aggregate all amdgpu sysfs attributes, call device_add_groups at the end of the amdgpu pci probe with that array, and on device remove call device_remove_groups with the same array.

Do you maybe have a better suggestion for me ?

Andrey

thanks,

greg k-h
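The aggregated-array workaround Andrey describes would look roughly like the sketch below; the group names, attributes, and probe/remove functions are placeholders, not actual amdgpu code. (As Greg notes elsewhere in the thread, he considers this approach horrid compared to letting the driver core own the groups, so this is shown only to make the proposal concrete.)

```c
#include <linux/device.h>
#include <linux/pci.h>
#include <linux/sysfs.h>

/* Placeholder attributes standing in for those each amdgpu subsystem
 * would contribute. */
static struct attribute *thermal_attrs[] = { /* ... */ NULL };
static const struct attribute_group thermal_group = {
	.name  = "thermal",
	.attrs = thermal_attrs,
};

static struct attribute *fw_attrs[] = { /* ... */ NULL };
static const struct attribute_group fw_group = {
	.name  = "fw_version",
	.attrs = fw_attrs,
};

/* One aggregate, NULL-terminated array covering the whole driver. */
static const struct attribute_group *all_device_groups[] = {
	&thermal_group,
	&fw_group,
	NULL,
};

static int my_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	/* ... normal device init ... */

	/* Add every group in one shot at the end of probe. */
	return device_add_groups(&pdev->dev, all_device_groups);
}

static void my_remove(struct pci_dev *pdev)
{
	/* Mirror-image removal in the remove callback. */
	device_remove_groups(&pdev->dev, all_device_groups);

	/* ... rest of teardown ... */
}
```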
Re: [PATCH v2 5/8] drm/amdgpu: Refactor sysfs removal
On 11/11/20 11:06 AM, Greg KH wrote:

On Wed, Nov 11, 2020 at 10:45:53AM -0500, Andrey Grodzovsky wrote:

On 11/11/20 10:34 AM, Greg KH wrote:

On Wed, Nov 11, 2020 at 10:13:13AM -0500, Andrey Grodzovsky wrote:

On 11/10/20 12:59 PM, Greg KH wrote:

On Tue, Nov 10, 2020 at 12:54:21PM -0500, Andrey Grodzovsky wrote:

Hi, back to this after a long context switch for some higher priority stuff.

So here I was able eventually to drop all this code and this change here
https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=amd-staging-drm-next-device-unplug&id=61852c8a59b4dd89d637693552c73175b9f2ccd6
was enough for me. Seems like while device_remove_file can handle the use case where the file and the parent directory are already gone, sysfs_remove_group goes down in flames in that case due to kobj->sd being unset on device removal.

A driver shouldn't ever have to remove individual sysfs groups, the driver core/bus logic should do it for them automatically.

And whenever a driver calls a sysfs_* call, that's a hint that something is not working properly.

Do you mean that while the driver creates the groups and files explicitly from its different subsystems it should not explicitly remove each one of them because all of them should be removed at once (and recursively) when the device is being removed ?

Individual drivers should never add groups/files in sysfs, the driver core should do it properly for you if you have everything set up properly. And yes, the driver core will automatically remove them as well.

Please use the default groups attribute for your bus/subsystem and this will happen automagically.

Googling for default groups attributes I found this -
https://www.linuxfoundation.org/blog/2013/06/how-to-create-a-sysfs-file-correctly/

Odd, mirror of the original article:
http://kroah.com/log/blog/2013/06/26/how-to-create-a-sysfs-file-correctly/

Would this be what you suggest for us ? Specifically for our case the struct device's groups seems the right solution, as different devices might have slightly different sysfs attributes.

That's what the is_visible() callback in your attribute group is for, to tell the kernel if an individual sysfs attribute should be created or not.

I see, this looks like a good improvement to our current way of managing sysfs. Since this change is somewhat fundamental and requires good testing, I prefer to deal with it separately from my current work on device unplug, and so I will put it on TODO right after finishing this work.

Andrey

thanks,

greg k-h
Re: [PATCH v2 5/8] drm/amdgpu: Refactor sysfs removal
On Wed, Nov 11, 2020 at 10:45:53AM -0500, Andrey Grodzovsky wrote:
> 
> On 11/11/20 10:34 AM, Greg KH wrote:
> > On Wed, Nov 11, 2020 at 10:13:13AM -0500, Andrey Grodzovsky wrote:
> > > On 11/10/20 12:59 PM, Greg KH wrote:
> > > > On Tue, Nov 10, 2020 at 12:54:21PM -0500, Andrey Grodzovsky wrote:
> > > > > Hi, back to this after a long context switch for some higher priority
> > > > > stuff.
> > > > >
> > > > > So here I was able eventually to drop all this code and this change
> > > > > here
> > > > > https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=amd-staging-drm-next-device-unplug&id=61852c8a59b4dd89d637693552c73175b9f2ccd6
> > > > > was enough for me. Seems like while device_remove_file can handle the
> > > > > use case where the file and the parent directory already gone,
> > > > > sysfs_remove_group goes down in flames in that case
> > > > > due to kobj->sd being unset on device removal.
> > > > A driver shouldn't ever have to remove individual sysfs groups, the
> > > > driver core/bus logic should do it for them automatically.
> > > >
> > > > And whenever a driver calls a sysfs_* call, that's a hint that something
> > > > is not working properly.
> > >
> > > Do you mean that while the driver creates the groups and files explicitly
> > > from it's different subsystems it should not explicitly remove each
> > > one of them because all of them should be removed at once (and
> > > recursively) when the device is being removed ?
> > Individual drivers should never add groups/files in sysfs, the driver
> > core should do it properly for you if you have everything set up
> > properly. And yes, the driver core will automatically remove them as
> > well.
> >
> > Please use the default groups attribute for your bus/subsystem and this
> > will happen automagically.
> 
> Googling for default groups attributes i found this -
> https://www.linuxfoundation.org/blog/2013/06/how-to-create-a-sysfs-file-correctly/

Odd, mirror of the original article:
http://kroah.com/log/blog/2013/06/26/how-to-create-a-sysfs-file-correctly/

> Would this be what you suggest for us ? Specifically for our case the struct
> device's groups seems the right solution as different devices
> might have slightly different sysfs attributes.

That's what the is_visible() callback in your attribute group is for, to
tell the kernel if an individual sysfs attribute should be created or
not.

thanks,

greg k-h
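A minimal sketch of the is_visible() mechanism Greg refers to, which lets one shared attribute_group serve devices with slightly different attribute sets. The attribute names (serial_number, overclock) and the capability check are made up for illustration:

```c
#include <linux/device.h>
#include <linux/sysfs.h>

static ssize_t serial_number_show(struct device *dev,
				  struct device_attribute *attr, char *buf)
{
	return sysfs_emit(buf, "0\n");	/* placeholder value */
}
static DEVICE_ATTR_RO(serial_number);

static ssize_t overclock_show(struct device *dev,
			      struct device_attribute *attr, char *buf)
{
	return sysfs_emit(buf, "0\n");	/* placeholder value */
}
static DEVICE_ATTR_RO(overclock);

/* Hypothetical per-device capability check; a real driver would inspect
 * its driver-private data here. */
static bool device_supports_overclock(struct device *dev)
{
	return false;
}

/* Called once per attribute when the group is registered: return 0 to
 * hide the file on this particular device, or the mode to create it. */
static umode_t my_attr_is_visible(struct kobject *kobj,
				  struct attribute *attr, int index)
{
	struct device *dev = kobj_to_dev(kobj);

	if (attr == &dev_attr_overclock.attr && !device_supports_overclock(dev))
		return 0;		/* not created on this device */

	return attr->mode;		/* otherwise keep declared permissions */
}

static struct attribute *my_attrs[] = {
	&dev_attr_serial_number.attr,
	&dev_attr_overclock.attr,
	NULL,
};

static const struct attribute_group my_group = {
	.attrs      = my_attrs,
	.is_visible = my_attr_is_visible,
};
```

The group stays static and shared; the per-device variation lives entirely in the callback, so the driver core can still create and remove everything automatically.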
Re: [PATCH v2 5/8] drm/amdgpu: Refactor sysfs removal
On 11/11/20 10:34 AM, Greg KH wrote:

On Wed, Nov 11, 2020 at 10:13:13AM -0500, Andrey Grodzovsky wrote:

On 11/10/20 12:59 PM, Greg KH wrote:

On Tue, Nov 10, 2020 at 12:54:21PM -0500, Andrey Grodzovsky wrote:

Hi, back to this after a long context switch for some higher priority stuff.

So here I was able eventually to drop all this code and this change here
https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=amd-staging-drm-next-device-unplug&id=61852c8a59b4dd89d637693552c73175b9f2ccd6
was enough for me. Seems like while device_remove_file can handle the use case where the file and the parent directory are already gone, sysfs_remove_group goes down in flames in that case due to kobj->sd being unset on device removal.

A driver shouldn't ever have to remove individual sysfs groups, the driver core/bus logic should do it for them automatically.

And whenever a driver calls a sysfs_* call, that's a hint that something is not working properly.

Do you mean that while the driver creates the groups and files explicitly from its different subsystems it should not explicitly remove each one of them because all of them should be removed at once (and recursively) when the device is being removed ?

Individual drivers should never add groups/files in sysfs, the driver core should do it properly for you if you have everything set up properly. And yes, the driver core will automatically remove them as well.

Please use the default groups attribute for your bus/subsystem and this will happen automagically.

Googling for default groups attributes I found this -
https://www.linuxfoundation.org/blog/2013/06/how-to-create-a-sysfs-file-correctly/

Would this be what you suggest for us ? Specifically for our case the struct device's groups seems the right solution, as different devices might have slightly different sysfs attributes.

Andrey

thanks,

greg k-h
Re: [PATCH v2 5/8] drm/amdgpu: Refactor sysfs removal
On Wed, Nov 11, 2020 at 10:13:13AM -0500, Andrey Grodzovsky wrote:
> 
> On 11/10/20 12:59 PM, Greg KH wrote:
> > On Tue, Nov 10, 2020 at 12:54:21PM -0500, Andrey Grodzovsky wrote:
> > > Hi, back to this after a long context switch for some higher priority
> > > stuff.
> > >
> > > So here I was able eventually to drop all this code and this change here
> > > https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=amd-staging-drm-next-device-unplug&id=61852c8a59b4dd89d637693552c73175b9f2ccd6
> > > was enough for me. Seems like while device_remove_file can handle the use
> > > case where the file and the parent directory already gone,
> > > sysfs_remove_group goes down in flames in that case
> > > due to kobj->sd being unset on device removal.
> > A driver shouldn't ever have to remove individual sysfs groups, the
> > driver core/bus logic should do it for them automatically.
> >
> > And whenever a driver calls a sysfs_* call, that's a hint that something
> > is not working properly.
> 
> Do you mean that while the driver creates the groups and files explicitly
> from it's different subsystems it should not explicitly remove each
> one of them because all of them should be removed at once (and
> recursively) when the device is being removed ?

Individual drivers should never add groups/files in sysfs, the driver
core should do it properly for you if you have everything set up
properly. And yes, the driver core will automatically remove them as
well.

Please use the default groups attribute for your bus/subsystem and this
will happen automagically.

thanks,

greg k-h
Re: [PATCH v2 5/8] drm/amdgpu: Refactor sysfs removal
On 11/10/20 12:59 PM, Greg KH wrote:

On Tue, Nov 10, 2020 at 12:54:21PM -0500, Andrey Grodzovsky wrote:

Hi, back to this after a long context switch for some higher priority stuff.

So here I was able eventually to drop all this code and this change here
https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=amd-staging-drm-next-device-unplug&id=61852c8a59b4dd89d637693552c73175b9f2ccd6
was enough for me. Seems like while device_remove_file can handle the use case where the file and the parent directory are already gone, sysfs_remove_group goes down in flames in that case due to kobj->sd being unset on device removal.

A driver shouldn't ever have to remove individual sysfs groups, the driver core/bus logic should do it for them automatically.

And whenever a driver calls a sysfs_* call, that's a hint that something is not working properly.

Do you mean that while the driver creates the groups and files explicitly from its different subsystems it should not explicitly remove each one of them because all of them should be removed at once (and recursively) when the device is being removed ?

Andrey

Also, run your patch above through checkpatch.pl before submitting it :)

thanks,

greg k-h
Re: [PATCH v2 5/8] drm/amdgpu: Refactor sysfs removal
On Tue, Nov 10, 2020 at 12:54:21PM -0500, Andrey Grodzovsky wrote:
> Hi, back to this after a long context switch for some higher priority stuff.
> 
> So here I was able eventually to drop all this code and this change here
> https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=amd-staging-drm-next-device-unplug&id=61852c8a59b4dd89d637693552c73175b9f2ccd6
> was enough for me. Seems like while device_remove_file can handle the use
> case where the file and the parent directory already gone,
> sysfs_remove_group goes down in flames in that case
> due to kobj->sd being unset on device removal.

A driver shouldn't ever have to remove individual sysfs groups, the
driver core/bus logic should do it for them automatically.

And whenever a driver calls a sysfs_* call, that's a hint that something
is not working properly.

Also, run your patch above through checkpatch.pl before submitting it :)

thanks,

greg k-h
Re: [PATCH v2 5/8] drm/amdgpu: Refactor sysfs removal
Hi, back to this after a long context switch for some higher priority stuff.

So here I was able eventually to drop all this code and this change here
https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=amd-staging-drm-next-device-unplug&id=61852c8a59b4dd89d637693552c73175b9f2ccd6
was enough for me. Seems like while device_remove_file can handle the use case where the file and the parent directory are already gone, sysfs_remove_group goes down in flames in that case due to kobj->sd being unset on device removal.

Andrey

On 6/24/20 2:11 AM, Greg KH wrote:

But why are things being removed twice?

Not sure I understand what removed twice ? I remove only once per sysfs attribute.

This code path shows that the kernel is trying to remove a file that is not present, so someone removed it already...

thanks,

greg k-h
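The asymmetry described above comes down to sysfs_remove_group() starting its lookup from kobj->sd, which kernfs clears when the device's sysfs directory is removed, while device_remove_file() ends up in kernfs code that tolerates the missing parent. A driver-side guard would look roughly like the sketch below (group and function names are hypothetical); it is shown only to make the failure mode concrete, since per Greg's advice the real fix is to not remove groups by hand at all:

```c
#include <linux/device.h>
#include <linux/sysfs.h>

/* Illustrative only: skip the group removal once the device's sysfs
 * directory is gone, to avoid the NULL kobj->sd dereference seen in the
 * oops traces in this thread.  Attribute lifetime should really be left
 * to the driver core instead. */
static void my_sysfs_teardown(struct device *dev,
			      const struct attribute_group *grp)
{
	if (dev->kobj.sd)	/* sysfs directory still present? */
		sysfs_remove_group(&dev->kobj, grp);
}
```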
Re: [PATCH v2 5/8] drm/amdgpu: Refactor sysfs removal
On 6/24/20 2:11 AM, Greg KH wrote:

On Tue, Jun 23, 2020 at 11:04:30PM -0400, Andrey Grodzovsky wrote:

On 6/23/20 2:05 AM, Greg KH wrote:

On Tue, Jun 23, 2020 at 12:51:00AM -0400, Andrey Grodzovsky wrote:

On 6/22/20 12:45 PM, Greg KH wrote:

On Mon, Jun 22, 2020 at 12:07:25PM -0400, Andrey Grodzovsky wrote:

On 6/22/20 7:21 AM, Greg KH wrote:

On Mon, Jun 22, 2020 at 11:51:24AM +0200, Daniel Vetter wrote:

On Sun, Jun 21, 2020 at 02:03:05AM -0400, Andrey Grodzovsky wrote:

Track sysfs files in a list so they all can be removed during pci remove, since otherwise their removal after that causes a crash because the parent folder was already removed during pci remove.

Huh? That should not happen, do you have a backtrace of that crash?

2 examples in the attached trace.

Odd, how did you trigger these?

By manually triggering PCI remove from sysfs:

cd /sys/bus/pci/devices/\:05\:00.0 && echo 1 > remove

For some reason, I didn't think that video/drm devices could handle hot-remove like this. The "old" PCI hotplug specification explicitly said that video devices were not supported, has that changed?

And this whole issue is probably tied to the larger issue that Daniel was asking me about, when it came to device lifetimes and the drm layer, so odds are we need to fix that up first before worrying about trying to support this crazy request, right?
:) [ 925.738225 <0.188086>] BUG: kernel NULL pointer dereference, address: 0090 [ 925.738232 <0.07>] #PF: supervisor read access in kernel mode [ 925.738236 <0.04>] #PF: error_code(0x) - not-present page [ 925.738240 <0.04>] PGD 0 P4D 0 [ 925.738245 <0.05>] Oops: [#1] SMP PTI [ 925.738249 <0.04>] CPU: 7 PID: 2547 Comm: amdgpu_test Tainted: G W OE 5.5.0-rc7-dev-kfd+ #50 [ 925.738256 <0.07>] Hardware name: System manufacturer System Product Name/RAMPAGE IV FORMULA, BIOS 4804 12/30/2013 [ 925.738266 <0.10>] RIP: 0010:kernfs_find_ns+0x18/0x110 [ 925.738270 <0.04>] Code: a6 cf ff 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 41 57 41 56 49 89 f6 41 55 41 54 49 89 fd 55 53 49 89 d4 <0f> b7 af 90 00 00 00 8b 05 8f ee 6b 01 48 8b 5f 68 66 83 e5 20 41 [ 925.738282 <0.12>] RSP: 0018:ad6d0118fb00 EFLAGS: 00010246 [ 925.738287 <0.05>] RAX: RBX: RCX: 2098a12076864b7e [ 925.738292 <0.05>] RDX: RSI: b6606b31 RDI: [ 925.738297 <0.05>] RBP: b6606b31 R08: b5379d10 R09: [ 925.738302 <0.05>] R10: ad6d0118fb38 R11: 9a75f64820a8 R12: [ 925.738307 <0.05>] R13: R14: b6606b31 R15: 9a7612b06130 [ 925.738313 <0.06>] FS: 7f3eca4e8700() GS:9a763dbc() knlGS: [ 925.738319 <0.06>] CS: 0010 DS: ES: CR0: 80050033 [ 925.738323 <0.04>] CR2: 0090 CR3: 35e5a005 CR4: 000606e0 [ 925.738329 <0.06>] Call Trace: [ 925.738334 <0.05>] kernfs_find_and_get_ns+0x2e/0x50 [ 925.738339 <0.05>] sysfs_remove_group+0x25/0x80 [ 925.738344 <0.05>] sysfs_remove_groups+0x29/0x40 [ 925.738350 <0.06>] free_msi_irqs+0xf5/0x190 [ 925.738354 <0.04>] pci_disable_msi+0xe9/0x120 So the PCI core is trying to clean up attributes that it had registered, which is fine. But we can't seem to find the attributes? Were they already removed somewhere else? that's odd. Yes, as i pointed above i am emulating device remove from sysfs and this triggers pci device remove sequence and as part of that my specific device folder (05:00.0) is removed from the sysfs tree. But why are things being removed twice? 
Not sure I understand what is removed twice? I remove only once per sysfs attribute. This code path shows that the kernel is trying to remove a file that is not present, so someone removed it already... thanks, greg k-h That's a mystery for me too... Andrey
Re: [PATCH v2 5/8] drm/amdgpu: Refactor sysfs removal
On Tue, Jun 23, 2020 at 11:04:30PM -0400, Andrey Grodzovsky wrote: > > On 6/23/20 2:05 AM, Greg KH wrote: > > On Tue, Jun 23, 2020 at 12:51:00AM -0400, Andrey Grodzovsky wrote: > > > On 6/22/20 12:45 PM, Greg KH wrote: > > > > On Mon, Jun 22, 2020 at 12:07:25PM -0400, Andrey Grodzovsky wrote: > > > > > On 6/22/20 7:21 AM, Greg KH wrote: > > > > > > On Mon, Jun 22, 2020 at 11:51:24AM +0200, Daniel Vetter wrote: > > > > > > > On Sun, Jun 21, 2020 at 02:03:05AM -0400, Andrey Grodzovsky wrote: > > > > > > > > Track sysfs files in a list so they all can be removed during > > > > > > > > pci remove > > > > > > > > since otherwise their removal after that causes crash because > > > > > > > > parent > > > > > > > > folder was already removed during pci remove. > > > > > > Huh? That should not happen, do you have a backtrace of that crash? > > > > > 2 examples in the attached trace. > > > > Odd, how did you trigger these? > > > > > > By manually triggering PCI remove from sysfs > > > > > > cd /sys/bus/pci/devices/\:05\:00.0 && echo 1 > remove > > For some reason, I didn't think that video/drm devices could handle > > hot-remove like this. The "old" PCI hotplug specification explicitly > > said that video devices were not supported, has that changed? > > > > And this whole issue is probably tied to the larger issue that Daniel > > was asking me about, when it came to device lifetimes and the drm layer, > > so odds are we need to fix that up first before worrying about trying to > > support this crazy request, right? 
:) > > > > > > > [ 925.738225 <0.188086>] BUG: kernel NULL pointer dereference, > > > > > address: 0090 > > > > > [ 925.738232 <0.07>] #PF: supervisor read access in kernel > > > > > mode > > > > > [ 925.738236 <0.04>] #PF: error_code(0x) - not-present > > > > > page > > > > > [ 925.738240 <0.04>] PGD 0 P4D 0 > > > > > [ 925.738245 <0.05>] Oops: [#1] SMP PTI > > > > > [ 925.738249 <0.04>] CPU: 7 PID: 2547 Comm: amdgpu_test > > > > > Tainted: GW OE 5.5.0-rc7-dev-kfd+ #50 > > > > > [ 925.738256 <0.07>] Hardware name: System manufacturer > > > > > System Product Name/RAMPAGE IV FORMULA, BIOS 4804 12/30/2013 > > > > > [ 925.738266 <0.10>] RIP: 0010:kernfs_find_ns+0x18/0x110 > > > > > [ 925.738270 <0.04>] Code: a6 cf ff 0f 1f 44 00 00 66 2e 0f > > > > > 1f 84 00 00 00 00 00 66 66 66 66 90 41 57 41 56 49 89 f6 41 55 41 54 > > > > > 49 89 fd 55 53 49 89 d4 <0f> b7 af 90 00 00 00 8b 05 8f ee 6b 01 48 > > > > > 8b 5f 68 66 83 e5 20 41 > > > > > [ 925.738282 <0.12>] RSP: 0018:ad6d0118fb00 EFLAGS: > > > > > 00010246 > > > > > [ 925.738287 <0.05>] RAX: RBX: > > > > > RCX: 2098a12076864b7e > > > > > [ 925.738292 <0.05>] RDX: RSI: > > > > > b6606b31 RDI: > > > > > [ 925.738297 <0.05>] RBP: b6606b31 R08: > > > > > b5379d10 R09: > > > > > [ 925.738302 <0.05>] R10: ad6d0118fb38 R11: > > > > > 9a75f64820a8 R12: > > > > > [ 925.738307 <0.05>] R13: R14: > > > > > b6606b31 R15: 9a7612b06130 > > > > > [ 925.738313 <0.06>] FS: 7f3eca4e8700() > > > > > GS:9a763dbc() knlGS: > > > > > [ 925.738319 <0.06>] CS: 0010 DS: ES: CR0: > > > > > 80050033 > > > > > [ 925.738323 <0.04>] CR2: 0090 CR3: > > > > > 35e5a005 CR4: 000606e0 > > > > > [ 925.738329 <0.06>] Call Trace: > > > > > [ 925.738334 <0.05>] kernfs_find_and_get_ns+0x2e/0x50 > > > > > [ 925.738339 <0.05>] sysfs_remove_group+0x25/0x80 > > > > > [ 925.738344 <0.05>] sysfs_remove_groups+0x29/0x40 > > > > > [ 925.738350 <0.06>] free_msi_irqs+0xf5/0x190 > > > > > [ 925.738354 <0.04>] pci_disable_msi+0xe9/0x120 > > > > So the PCI core is 
trying to clean up attributes that it had registered, > > > > which is fine. But we can't seem to find the attributes? Were they > > > > already removed somewhere else? > > > > > > > > that's odd. > > > > > > Yes, as i pointed above i am emulating device remove from sysfs and this > > > triggers pci device remove sequence and as part of that my specific device > > > folder (05:00.0) is removed from the sysfs tree. > > But why are things being removed twice? > > > Not sure I understand what removed twice ? I remove only once per sysfs > attribute. This code path shows that the kernel is trying to remove a file that is not present, so someone removed it already... thanks, greg k-h
Re: [PATCH v2 5/8] drm/amdgpu: Refactor sysfs removal
On 6/23/20 2:05 AM, Greg KH wrote: On Tue, Jun 23, 2020 at 12:51:00AM -0400, Andrey Grodzovsky wrote: On 6/22/20 12:45 PM, Greg KH wrote: On Mon, Jun 22, 2020 at 12:07:25PM -0400, Andrey Grodzovsky wrote: On 6/22/20 7:21 AM, Greg KH wrote: On Mon, Jun 22, 2020 at 11:51:24AM +0200, Daniel Vetter wrote: On Sun, Jun 21, 2020 at 02:03:05AM -0400, Andrey Grodzovsky wrote: Track sysfs files in a list so they all can be removed during pci remove since otherwise their removal after that causes crash because parent folder was already removed during pci remove. Huh? That should not happen, do you have a backtrace of that crash? 2 examples in the attached trace. Odd, how did you trigger these? By manually triggering PCI remove from sysfs cd /sys/bus/pci/devices/\:05\:00.0 && echo 1 > remove For some reason, I didn't think that video/drm devices could handle hot-remove like this. The "old" PCI hotplug specification explicitly said that video devices were not supported, has that changed? And this whole issue is probably tied to the larger issue that Daniel was asking me about, when it came to device lifetimes and the drm layer, so odds are we need to fix that up first before worrying about trying to support this crazy request, right? 
:) [ 925.738225 <0.188086>] BUG: kernel NULL pointer dereference, address: 0090 [ 925.738232 <0.07>] #PF: supervisor read access in kernel mode [ 925.738236 <0.04>] #PF: error_code(0x) - not-present page [ 925.738240 <0.04>] PGD 0 P4D 0 [ 925.738245 <0.05>] Oops: [#1] SMP PTI [ 925.738249 <0.04>] CPU: 7 PID: 2547 Comm: amdgpu_test Tainted: G W OE 5.5.0-rc7-dev-kfd+ #50 [ 925.738256 <0.07>] Hardware name: System manufacturer System Product Name/RAMPAGE IV FORMULA, BIOS 4804 12/30/2013 [ 925.738266 <0.10>] RIP: 0010:kernfs_find_ns+0x18/0x110 [ 925.738270 <0.04>] Code: a6 cf ff 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 41 57 41 56 49 89 f6 41 55 41 54 49 89 fd 55 53 49 89 d4 <0f> b7 af 90 00 00 00 8b 05 8f ee 6b 01 48 8b 5f 68 66 83 e5 20 41 [ 925.738282 <0.12>] RSP: 0018:ad6d0118fb00 EFLAGS: 00010246 [ 925.738287 <0.05>] RAX: RBX: RCX: 2098a12076864b7e [ 925.738292 <0.05>] RDX: RSI: b6606b31 RDI: [ 925.738297 <0.05>] RBP: b6606b31 R08: b5379d10 R09: [ 925.738302 <0.05>] R10: ad6d0118fb38 R11: 9a75f64820a8 R12: [ 925.738307 <0.05>] R13: R14: b6606b31 R15: 9a7612b06130 [ 925.738313 <0.06>] FS: 7f3eca4e8700() GS:9a763dbc() knlGS: [ 925.738319 <0.06>] CS: 0010 DS: ES: CR0: 80050033 [ 925.738323 <0.04>] CR2: 0090 CR3: 35e5a005 CR4: 000606e0 [ 925.738329 <0.06>] Call Trace: [ 925.738334 <0.05>] kernfs_find_and_get_ns+0x2e/0x50 [ 925.738339 <0.05>] sysfs_remove_group+0x25/0x80 [ 925.738344 <0.05>] sysfs_remove_groups+0x29/0x40 [ 925.738350 <0.06>] free_msi_irqs+0xf5/0x190 [ 925.738354 <0.04>] pci_disable_msi+0xe9/0x120 So the PCI core is trying to clean up attributes that it had registered, which is fine. But we can't seem to find the attributes? Were they already removed somewhere else? that's odd. Yes, as i pointed above i am emulating device remove from sysfs and this triggers pci device remove sequence and as part of that my specific device folder (05:00.0) is removed from the sysfs tree. But why are things being removed twice? 
Not sure I understand what removed twice ? I remove only once per sysfs attribute. Andrey [ 925.738406 <0.52>] amdgpu_irq_fini+0xe3/0xf0 [amdgpu] [ 925.738453 <0.47>] tonga_ih_sw_fini+0xe/0x30 [amdgpu] [ 925.738490 <0.37>] amdgpu_device_fini_late+0x14b/0x440 [amdgpu] [ 925.738529 <0.39>] amdgpu_driver_release_kms+0x16/0x40 [amdgpu] [ 925.738548 <0.19>] drm_dev_put+0x5b/0x80 [drm] [ 925.738558 <0.10>] drm_release+0xc6/0xd0 [drm] [ 925.738563 <0.05>] __fput+0xc6/0x260 [ 925.738568 <0.05>] task_work_run+0x79/0xb0 [ 925.738573 <0.05>] do_exit+0x3d0/0xc60 [ 925.738578 <0.05>] do_group_exit+0x47/0xb0 [ 925.738583 <0.05>] get_signal+0x18b/0xc30 [ 925.738589 <0.06>] do_signal+0x36/0x6a0 [ 925.738593 <0.04>] ? force_sig_info_to_task+0xbc/0xd0 [ 925.738597 <0.04>] ? signal_wake_up_state+0x15/0x30 [ 925.738603 <0.06>] exit_to_usermode_loop+0x6f/0xc0 [ 925.738608 <0.05>] prepare_exit_to_usermode+0xc7/0x110 [ 925.738613 <0.05>] ret_from_intr+0x25/0x35 [ 925.738617 <0.04>] RIP: 0033:0x417369 [ 925.738621 <0.04>] Code: Bad RIP value. [
Re: [PATCH v2 5/8] drm/amdgpu: Refactor sysfs removal
On Tue, Jun 23, 2020 at 12:51:00AM -0400, Andrey Grodzovsky wrote: > > On 6/22/20 12:45 PM, Greg KH wrote: > > On Mon, Jun 22, 2020 at 12:07:25PM -0400, Andrey Grodzovsky wrote: > > > On 6/22/20 7:21 AM, Greg KH wrote: > > > > On Mon, Jun 22, 2020 at 11:51:24AM +0200, Daniel Vetter wrote: > > > > > On Sun, Jun 21, 2020 at 02:03:05AM -0400, Andrey Grodzovsky wrote: > > > > > > Track sysfs files in a list so they all can be removed during pci > > > > > > remove > > > > > > since otherwise their removal after that causes crash because parent > > > > > > folder was already removed during pci remove. > > > > Huh? That should not happen, do you have a backtrace of that crash? > > > > > > 2 examples in the attached trace. > > Odd, how did you trigger these? > > > By manually triggering PCI remove from sysfs > > cd /sys/bus/pci/devices/\:05\:00.0 && echo 1 > remove For some reason, I didn't think that video/drm devices could handle hot-remove like this. The "old" PCI hotplug specification explicitly said that video devices were not supported, has that changed? And this whole issue is probably tied to the larger issue that Daniel was asking me about, when it came to device lifetimes and the drm layer, so odds are we need to fix that up first before worrying about trying to support this crazy request, right? 
:) > > > [ 925.738225 <0.188086>] BUG: kernel NULL pointer dereference, > > > address: 0090 > > > [ 925.738232 <0.07>] #PF: supervisor read access in kernel mode > > > [ 925.738236 <0.04>] #PF: error_code(0x) - not-present page > > > [ 925.738240 <0.04>] PGD 0 P4D 0 > > > [ 925.738245 <0.05>] Oops: [#1] SMP PTI > > > [ 925.738249 <0.04>] CPU: 7 PID: 2547 Comm: amdgpu_test Tainted: > > > GW OE 5.5.0-rc7-dev-kfd+ #50 > > > [ 925.738256 <0.07>] Hardware name: System manufacturer System > > > Product Name/RAMPAGE IV FORMULA, BIOS 4804 12/30/2013 > > > [ 925.738266 <0.10>] RIP: 0010:kernfs_find_ns+0x18/0x110 > > > [ 925.738270 <0.04>] Code: a6 cf ff 0f 1f 44 00 00 66 2e 0f 1f > > > 84 00 00 00 00 00 66 66 66 66 90 41 57 41 56 49 89 f6 41 55 41 54 49 89 > > > fd 55 53 49 89 d4 <0f> b7 af 90 00 00 00 8b 05 8f ee 6b 01 48 8b 5f 68 66 > > > 83 e5 20 41 > > > [ 925.738282 <0.12>] RSP: 0018:ad6d0118fb00 EFLAGS: 00010246 > > > [ 925.738287 <0.05>] RAX: RBX: > > > RCX: 2098a12076864b7e > > > [ 925.738292 <0.05>] RDX: RSI: b6606b31 > > > RDI: > > > [ 925.738297 <0.05>] RBP: b6606b31 R08: b5379d10 > > > R09: > > > [ 925.738302 <0.05>] R10: ad6d0118fb38 R11: 9a75f64820a8 > > > R12: > > > [ 925.738307 <0.05>] R13: R14: b6606b31 > > > R15: 9a7612b06130 > > > [ 925.738313 <0.06>] FS: 7f3eca4e8700() > > > GS:9a763dbc() knlGS: > > > [ 925.738319 <0.06>] CS: 0010 DS: ES: CR0: > > > 80050033 > > > [ 925.738323 <0.04>] CR2: 0090 CR3: 35e5a005 > > > CR4: 000606e0 > > > [ 925.738329 <0.06>] Call Trace: > > > [ 925.738334 <0.05>] kernfs_find_and_get_ns+0x2e/0x50 > > > [ 925.738339 <0.05>] sysfs_remove_group+0x25/0x80 > > > [ 925.738344 <0.05>] sysfs_remove_groups+0x29/0x40 > > > [ 925.738350 <0.06>] free_msi_irqs+0xf5/0x190 > > > [ 925.738354 <0.04>] pci_disable_msi+0xe9/0x120 > > So the PCI core is trying to clean up attributes that it had registered, > > which is fine. But we can't seem to find the attributes? Were they > > already removed somewhere else? > > > > that's odd. 
> > > Yes, as i pointed above i am emulating device remove from sysfs and this > triggers pci device remove sequence and as part of that my specific device > folder (05:00.0) is removed from the sysfs tree. But why are things being removed twice? > > > [ 925.738406 <0.52>] amdgpu_irq_fini+0xe3/0xf0 [amdgpu] > > > [ 925.738453 <0.47>] tonga_ih_sw_fini+0xe/0x30 [amdgpu] > > > [ 925.738490 <0.37>] amdgpu_device_fini_late+0x14b/0x440 > > > [amdgpu] > > > [ 925.738529 <0.39>] amdgpu_driver_release_kms+0x16/0x40 > > > [amdgpu] > > > [ 925.738548 <0.19>] drm_dev_put+0x5b/0x80 [drm] > > > [ 925.738558 <0.10>] drm_release+0xc6/0xd0 [drm] > > > [ 925.738563 <0.05>] __fput+0xc6/0x260 > > > [ 925.738568 <0.05>] task_work_run+0x79/0xb0 > > > [ 925.738573 <0.05>] do_exit+0x3d0/0xc60 > > > [ 925.738578 <0.05>] do_group_exit+0x47/0xb0 > > > [ 925.738583 <0.05>] get_signal+0x18b/0xc30 > > > [ 925.738589 <0.06>] do_signal+0x36/0x6a0 > > > [ 925.738593 <0.04>] ? force_sig_info_to_task+0xbc/0xd0 > > > [ 925.738597 <
Re: [PATCH v2 5/8] drm/amdgpu: Refactor sysfs removal
On 6/22/20 12:45 PM, Greg KH wrote: On Mon, Jun 22, 2020 at 12:07:25PM -0400, Andrey Grodzovsky wrote: On 6/22/20 7:21 AM, Greg KH wrote: On Mon, Jun 22, 2020 at 11:51:24AM +0200, Daniel Vetter wrote: On Sun, Jun 21, 2020 at 02:03:05AM -0400, Andrey Grodzovsky wrote: Track sysfs files in a list so they all can be removed during pci remove since otherwise their removal after that causes crash because parent folder was already removed during pci remove. Huh? That should not happen, do you have a backtrace of that crash? 2 examples in the attached trace. Odd, how did you trigger these? By manually triggering PCI remove from sysfs cd /sys/bus/pci/devices/\:05\:00.0 && echo 1 > remove [ 925.738225 <0.188086>] BUG: kernel NULL pointer dereference, address: 0090 [ 925.738232 <0.07>] #PF: supervisor read access in kernel mode [ 925.738236 <0.04>] #PF: error_code(0x) - not-present page [ 925.738240 <0.04>] PGD 0 P4D 0 [ 925.738245 <0.05>] Oops: [#1] SMP PTI [ 925.738249 <0.04>] CPU: 7 PID: 2547 Comm: amdgpu_test Tainted: G W OE 5.5.0-rc7-dev-kfd+ #50 [ 925.738256 <0.07>] Hardware name: System manufacturer System Product Name/RAMPAGE IV FORMULA, BIOS 4804 12/30/2013 [ 925.738266 <0.10>] RIP: 0010:kernfs_find_ns+0x18/0x110 [ 925.738270 <0.04>] Code: a6 cf ff 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 41 57 41 56 49 89 f6 41 55 41 54 49 89 fd 55 53 49 89 d4 <0f> b7 af 90 00 00 00 8b 05 8f ee 6b 01 48 8b 5f 68 66 83 e5 20 41 [ 925.738282 <0.12>] RSP: 0018:ad6d0118fb00 EFLAGS: 00010246 [ 925.738287 <0.05>] RAX: RBX: RCX: 2098a12076864b7e [ 925.738292 <0.05>] RDX: RSI: b6606b31 RDI: [ 925.738297 <0.05>] RBP: b6606b31 R08: b5379d10 R09: [ 925.738302 <0.05>] R10: ad6d0118fb38 R11: 9a75f64820a8 R12: [ 925.738307 <0.05>] R13: R14: b6606b31 R15: 9a7612b06130 [ 925.738313 <0.06>] FS: 7f3eca4e8700() GS:9a763dbc() knlGS: [ 925.738319 <0.06>] CS: 0010 DS: ES: CR0: 80050033 [ 925.738323 <0.04>] CR2: 0090 CR3: 35e5a005 CR4: 000606e0 [ 925.738329 <0.06>] Call Trace: [ 
925.738334 <0.05>] kernfs_find_and_get_ns+0x2e/0x50 [ 925.738339 <0.05>] sysfs_remove_group+0x25/0x80 [ 925.738344 <0.05>] sysfs_remove_groups+0x29/0x40 [ 925.738350 <0.06>] free_msi_irqs+0xf5/0x190 [ 925.738354 <0.04>] pci_disable_msi+0xe9/0x120 So the PCI core is trying to clean up attributes that it had registered, which is fine. But we can't seem to find the attributes? Were they already removed somewhere else? that's odd. Yes, as i pointed above i am emulating device remove from sysfs and this triggers pci device remove sequence and as part of that my specific device folder (05:00.0) is removed from the sysfs tree. [ 925.738406 <0.52>] amdgpu_irq_fini+0xe3/0xf0 [amdgpu] [ 925.738453 <0.47>] tonga_ih_sw_fini+0xe/0x30 [amdgpu] [ 925.738490 <0.37>] amdgpu_device_fini_late+0x14b/0x440 [amdgpu] [ 925.738529 <0.39>] amdgpu_driver_release_kms+0x16/0x40 [amdgpu] [ 925.738548 <0.19>] drm_dev_put+0x5b/0x80 [drm] [ 925.738558 <0.10>] drm_release+0xc6/0xd0 [drm] [ 925.738563 <0.05>] __fput+0xc6/0x260 [ 925.738568 <0.05>] task_work_run+0x79/0xb0 [ 925.738573 <0.05>] do_exit+0x3d0/0xc60 [ 925.738578 <0.05>] do_group_exit+0x47/0xb0 [ 925.738583 <0.05>] get_signal+0x18b/0xc30 [ 925.738589 <0.06>] do_signal+0x36/0x6a0 [ 925.738593 <0.04>] ? force_sig_info_to_task+0xbc/0xd0 [ 925.738597 <0.04>] ? signal_wake_up_state+0x15/0x30 [ 925.738603 <0.06>] exit_to_usermode_loop+0x6f/0xc0 [ 925.738608 <0.05>] prepare_exit_to_usermode+0xc7/0x110 [ 925.738613 <0.05>] ret_from_intr+0x25/0x35 [ 925.738617 <0.04>] RIP: 0033:0x417369 [ 925.738621 <0.04>] Code: Bad RIP value. 
[ 925.738625 <0.04>] RSP: 002b:7ffdd6bf0900 EFLAGS: 00010246 [ 925.738629 <0.04>] RAX: 7f3eca509000 RBX: 001e RCX: 7f3ec95ba260 [ 925.738634 <0.05>] RDX: 7f3ec9889790 RSI: 000a RDI: [ 925.738639 <0.05>] RBP: 7ffdd6bf0990 R08: 7f3ec9889780 R09: 7f3eca4e8700 [ 925.738645 <0.06>] R10: 035c R11: 0246 R12: 021c6170 [ 925.738650 <0.05>] R13: 7ffdd6bf0c00 R14: R15: [ 40.880899 <0.04>] BUG: kernel NULL pointer dereference, address: 0090 [ 40.880906 <0.07>]
Re: [PATCH v2 5/8] drm/amdgpu: Refactor sysfs removal
On Mon, Jun 22, 2020 at 12:07:25PM -0400, Andrey Grodzovsky wrote: > > On 6/22/20 7:21 AM, Greg KH wrote: > > On Mon, Jun 22, 2020 at 11:51:24AM +0200, Daniel Vetter wrote: > > > On Sun, Jun 21, 2020 at 02:03:05AM -0400, Andrey Grodzovsky wrote: > > > > Track sysfs files in a list so they all can be removed during pci remove > > > > since otherwise their removal after that causes crash because parent > > > > folder was already removed during pci remove. > > Huh? That should not happen, do you have a backtrace of that crash? > > > 2 examples in the attached trace. Odd, how did you trigger these? > [ 925.738225 <0.188086>] BUG: kernel NULL pointer dereference, address: > 0090 > [ 925.738232 <0.07>] #PF: supervisor read access in kernel mode > [ 925.738236 <0.04>] #PF: error_code(0x) - not-present page > [ 925.738240 <0.04>] PGD 0 P4D 0 > [ 925.738245 <0.05>] Oops: [#1] SMP PTI > [ 925.738249 <0.04>] CPU: 7 PID: 2547 Comm: amdgpu_test Tainted: G > W OE 5.5.0-rc7-dev-kfd+ #50 > [ 925.738256 <0.07>] Hardware name: System manufacturer System > Product Name/RAMPAGE IV FORMULA, BIOS 4804 12/30/2013 > [ 925.738266 <0.10>] RIP: 0010:kernfs_find_ns+0x18/0x110 > [ 925.738270 <0.04>] Code: a6 cf ff 0f 1f 44 00 00 66 2e 0f 1f 84 00 > 00 00 00 00 66 66 66 66 90 41 57 41 56 49 89 f6 41 55 41 54 49 89 fd 55 53 49 > 89 d4 <0f> b7 af 90 00 00 00 8b 05 8f ee 6b 01 48 8b 5f 68 66 83 e5 20 41 > [ 925.738282 <0.12>] RSP: 0018:ad6d0118fb00 EFLAGS: 00010246 > [ 925.738287 <0.05>] RAX: RBX: > RCX: 2098a12076864b7e > [ 925.738292 <0.05>] RDX: RSI: b6606b31 > RDI: > [ 925.738297 <0.05>] RBP: b6606b31 R08: b5379d10 > R09: > [ 925.738302 <0.05>] R10: ad6d0118fb38 R11: 9a75f64820a8 > R12: > [ 925.738307 <0.05>] R13: R14: b6606b31 > R15: 9a7612b06130 > [ 925.738313 <0.06>] FS: 7f3eca4e8700() > GS:9a763dbc() knlGS: > [ 925.738319 <0.06>] CS: 0010 DS: ES: CR0: > 80050033 > [ 925.738323 <0.04>] CR2: 0090 CR3: 35e5a005 > CR4: 000606e0 > [ 925.738329 <0.06>] Call Trace: > [ 925.738334 <0.05>] 
kernfs_find_and_get_ns+0x2e/0x50 > [ 925.738339 <0.05>] sysfs_remove_group+0x25/0x80 > [ 925.738344 <0.05>] sysfs_remove_groups+0x29/0x40 > [ 925.738350 <0.06>] free_msi_irqs+0xf5/0x190 > [ 925.738354 <0.04>] pci_disable_msi+0xe9/0x120 So the PCI core is trying to clean up attributes that it had registered, which is fine. But we can't seem to find the attributes? Were they already removed somewhere else? that's odd. > [ 925.738406 <0.52>] amdgpu_irq_fini+0xe3/0xf0 [amdgpu] > [ 925.738453 <0.47>] tonga_ih_sw_fini+0xe/0x30 [amdgpu] > [ 925.738490 <0.37>] amdgpu_device_fini_late+0x14b/0x440 [amdgpu] > [ 925.738529 <0.39>] amdgpu_driver_release_kms+0x16/0x40 [amdgpu] > [ 925.738548 <0.19>] drm_dev_put+0x5b/0x80 [drm] > [ 925.738558 <0.10>] drm_release+0xc6/0xd0 [drm] > [ 925.738563 <0.05>] __fput+0xc6/0x260 > [ 925.738568 <0.05>] task_work_run+0x79/0xb0 > [ 925.738573 <0.05>] do_exit+0x3d0/0xc60 > [ 925.738578 <0.05>] do_group_exit+0x47/0xb0 > [ 925.738583 <0.05>] get_signal+0x18b/0xc30 > [ 925.738589 <0.06>] do_signal+0x36/0x6a0 > [ 925.738593 <0.04>] ? force_sig_info_to_task+0xbc/0xd0 > [ 925.738597 <0.04>] ? signal_wake_up_state+0x15/0x30 > [ 925.738603 <0.06>] exit_to_usermode_loop+0x6f/0xc0 > [ 925.738608 <0.05>] prepare_exit_to_usermode+0xc7/0x110 > [ 925.738613 <0.05>] ret_from_intr+0x25/0x35 > [ 925.738617 <0.04>] RIP: 0033:0x417369 > [ 925.738621 <0.04>] Code: Bad RIP value. 
> [ 925.738625 <0.04>] RSP: 002b:7ffdd6bf0900 EFLAGS: 00010246 > [ 925.738629 <0.04>] RAX: 7f3eca509000 RBX: 001e > RCX: 7f3ec95ba260 > [ 925.738634 <0.05>] RDX: 7f3ec9889790 RSI: 000a > RDI: > [ 925.738639 <0.05>] RBP: 7ffdd6bf0990 R08: 7f3ec9889780 > R09: 7f3eca4e8700 > [ 925.738645 <0.06>] R10: 035c R11: 0246 > R12: 021c6170 > [ 925.738650 <0.05>] R13: 7ffdd6bf0c00 R14: > R15: > > > > > [ 40.880899 <0.04>] BUG: kernel NULL pointer dereference, address: > 0090 > [ 40.880906 <0.07>] #PF: supervisor read access in kernel mode > [ 40.880910 <0.04>] #PF: error_code(0x) - not-present page > [ 40.880915 <0.05>] PGD 0 P4D 0 >
Re: [PATCH v2 5/8] drm/amdgpu: Refactor sysfs removal
On 6/22/20 7:21 AM, Greg KH wrote: On Mon, Jun 22, 2020 at 11:51:24AM +0200, Daniel Vetter wrote: On Sun, Jun 21, 2020 at 02:03:05AM -0400, Andrey Grodzovsky wrote: Track sysfs files in a list so they all can be removed during pci remove since otherwise their removal after that causes crash because parent folder was already removed during pci remove. Huh? That should not happen, do you have a backtrace of that crash? 2 examples in the attached trace. Andrey Signed-off-by: Andrey Grodzovsky Uh I thought sysfs just gets yanked completely. Please check with Greg KH whether hand-rolling all this really is the right solution here ... Feels very wrong. I thought this was all supposed to work by adding attributes before publishing the sysfs node, and then letting sysfs clean up everything. Not by cleaning up manually yourself. Yes, that is supposed to be the correct thing to do. Adding Greg for an authoritative answer. -Daniel --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 13 +++ drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c | 7 +- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 35 drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c | 12 ++ drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 8 ++- drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c | 17 -- drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c | 13 ++- drivers/gpu/drm/amd/amdgpu/df_v3_6.c | 10 +--- 8 files changed, 99 insertions(+), 16 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h index 604a681..ba3775f 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h @@ -726,6 +726,15 @@ struct amd_powerplay { #define AMDGPU_RESET_MAGIC_NUM 64 #define AMDGPU_MAX_DF_PERFMONS 4 + +struct amdgpu_sysfs_list_node { + struct list_head head; + struct device_attribute *attr; +}; You know we have lists of attributes already, called attribute groups, if you really wanted to do something like this. But, I don't think so. 
Either way, don't hand-roll your own stuff that the driver core has provided for you for a decade or more, that's just foolish :) + +#define AMDGPU_DEVICE_ATTR_LIST_NODE(_attr) \ + struct amdgpu_sysfs_list_node dev_attr_handle_##_attr = {.attr = _attr_##_attr} + struct amdgpu_device { struct device *dev; struct drm_device *ddev; @@ -992,6 +1001,10 @@ struct amdgpu_device { charproduct_number[16]; charproduct_name[32]; charserial[16]; + + struct list_head sysfs_files_list; + struct mutex sysfs_files_list_lock; + }; static inline struct amdgpu_device *amdgpu_ttm_adev(struct ttm_bo_device *bdev) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c index fdd52d8..c1549ee 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c @@ -1950,8 +1950,10 @@ static ssize_t amdgpu_atombios_get_vbios_version(struct device *dev, return snprintf(buf, PAGE_SIZE, "%s\n", ctx->vbios_version); } + static DEVICE_ATTR(vbios_version, 0444, amdgpu_atombios_get_vbios_version, NULL); +static AMDGPU_DEVICE_ATTR_LIST_NODE(vbios_version); /** * amdgpu_atombios_fini - free the driver info and callbacks for atombios @@ -1972,7 +1974,6 @@ void amdgpu_atombios_fini(struct amdgpu_device *adev) adev->mode_info.atom_context = NULL; kfree(adev->mode_info.atom_card_info); adev->mode_info.atom_card_info = NULL; - device_remove_file(adev->dev, _attr_vbios_version); } /** @@ -2038,6 +2039,10 @@ int amdgpu_atombios_init(struct amdgpu_device *adev) return ret; } + mutex_lock(>sysfs_files_list_lock); + list_add_tail(_attr_handle_vbios_version.head, >sysfs_files_list); + mutex_unlock(>sysfs_files_list_lock); + return 0; } diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index e7b9065..3173046 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -2928,6 +2928,12 @@ static const struct attribute 
*amdgpu_dev_attributes[] = { NULL }; +static AMDGPU_DEVICE_ATTR_LIST_NODE(product_name); +static AMDGPU_DEVICE_ATTR_LIST_NODE(product_number); +static AMDGPU_DEVICE_ATTR_LIST_NODE(serial_number); +static AMDGPU_DEVICE_ATTR_LIST_NODE(pcie_replay_count); + + /** * amdgpu_device_init - initialize the driver * @@ -3029,6 +3035,9 @@ int amdgpu_device_init(struct amdgpu_device *adev, INIT_LIST_HEAD(>shadow_list); mutex_init(>shadow_list_lock); + INIT_LIST_HEAD(>sysfs_files_list); + mutex_init(>sysfs_files_list_lock); +
Re: [PATCH v2 5/8] drm/amdgpu: Refactor sysfs removal
Am 21.06.20 um 08:03 schrieb Andrey Grodzovsky:
> Track sysfs files in a list so they all can be removed during pci remove
> since otherwise their removal after that causes crash because parent
> folder was already removed during pci remove.

That looks extremely fishy to me. It sounds like we just don't remove
stuff in the right order.

Christian.

> Signed-off-by: Andrey Grodzovsky
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h          | 13 +++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c |  7 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c   | 35
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c  | 12 ++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c      |  8 ++-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c | 17 --
>  drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c     | 13 ++-
>  drivers/gpu/drm/amd/amdgpu/df_v3_6.c         | 10 +---
>  8 files changed, 99 insertions(+), 16 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 604a681..ba3775f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -726,6 +726,15 @@ struct amd_powerplay {
>
>  #define AMDGPU_RESET_MAGIC_NUM 64
>  #define AMDGPU_MAX_DF_PERFMONS 4
> +
> +struct amdgpu_sysfs_list_node {
> +	struct list_head head;
> +	struct device_attribute *attr;
> +};
> +
> +#define AMDGPU_DEVICE_ATTR_LIST_NODE(_attr) \
> +	struct amdgpu_sysfs_list_node dev_attr_handle_##_attr = {.attr = &dev_attr_##_attr}
> +
>  struct amdgpu_device {
>  	struct device *dev;
>  	struct drm_device *ddev;
> @@ -992,6 +1001,10 @@ struct amdgpu_device {
>  	char	product_number[16];
>  	char	product_name[32];
>  	char	serial[16];
> +
> +	struct list_head	sysfs_files_list;
> +	struct mutex		sysfs_files_list_lock;
> +
>  };
>
>  static inline struct amdgpu_device *amdgpu_ttm_adev(struct ttm_bo_device *bdev)
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c
> index fdd52d8..c1549ee 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c
> @@ -1950,8 +1950,10 @@ static ssize_t amdgpu_atombios_get_vbios_version(struct device *dev,
>  	return snprintf(buf, PAGE_SIZE, "%s\n", ctx->vbios_version);
>  }
>
> +
>  static DEVICE_ATTR(vbios_version, 0444, amdgpu_atombios_get_vbios_version, NULL);
> +static AMDGPU_DEVICE_ATTR_LIST_NODE(vbios_version);
>
>  /**
>   * amdgpu_atombios_fini - free the driver info and callbacks for atombios
> @@ -1972,7 +1974,6 @@ void amdgpu_atombios_fini(struct amdgpu_device *adev)
>  	adev->mode_info.atom_context = NULL;
>  	kfree(adev->mode_info.atom_card_info);
>  	adev->mode_info.atom_card_info = NULL;
> -	device_remove_file(adev->dev, &dev_attr_vbios_version);
>  }
>
>  /**
> @@ -2038,6 +2039,10 @@ int amdgpu_atombios_init(struct amdgpu_device *adev)
>  		return ret;
>  	}
>
> +	mutex_lock(&adev->sysfs_files_list_lock);
> +	list_add_tail(&dev_attr_handle_vbios_version.head, &adev->sysfs_files_list);
> +	mutex_unlock(&adev->sysfs_files_list_lock);
> +
>  	return 0;
>  }
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index e7b9065..3173046 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -2928,6 +2928,12 @@ static const struct attribute *amdgpu_dev_attributes[] = {
>  	NULL
>  };
>
> +static AMDGPU_DEVICE_ATTR_LIST_NODE(product_name);
> +static AMDGPU_DEVICE_ATTR_LIST_NODE(product_number);
> +static AMDGPU_DEVICE_ATTR_LIST_NODE(serial_number);
> +static AMDGPU_DEVICE_ATTR_LIST_NODE(pcie_replay_count);
> +
> +
>  /**
>   * amdgpu_device_init - initialize the driver
>   *
> @@ -3029,6 +3035,9 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>  	INIT_LIST_HEAD(&adev->shadow_list);
>  	mutex_init(&adev->shadow_list_lock);
>
> +	INIT_LIST_HEAD(&adev->sysfs_files_list);
> +	mutex_init(&adev->sysfs_files_list_lock);
> +
>  	INIT_DELAYED_WORK(&adev->delayed_init_work,
>  			  amdgpu_device_delayed_init_work_handler);
>  	INIT_DELAYED_WORK(&adev->gfx.gfx_off_delay_work,
> @@ -3281,6 +3290,13 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>  	if (r) {
>  		dev_err(adev->dev, "Could not create amdgpu device attr\n");
>  		return r;
> +	} else {
> +		mutex_lock(&adev->sysfs_files_list_lock);
> +		list_add_tail(&dev_attr_handle_product_name.head, &adev->sysfs_files_list);
> +		list_add_tail(&dev_attr_handle_product_number.head, &adev->sysfs_files_list);
> +		list_add_tail(&dev_attr_handle_serial_number.head, &adev->sysfs_files_list);
> +		list_add_tail(&dev_attr_handle_pcie_replay_count.head, &adev->sysfs_files_list);
> +		mutex_unlock(&adev->sysfs_files_list_lock);
> +	}
Re: [PATCH v2 5/8] drm/amdgpu: Refactor sysfs removal
On Mon, Jun 22, 2020 at 11:51:24AM +0200, Daniel Vetter wrote:
> On Sun, Jun 21, 2020 at 02:03:05AM -0400, Andrey Grodzovsky wrote:
> > Track sysfs files in a list so they all can be removed during pci remove
> > since otherwise their removal after that causes crash because parent
> > folder was already removed during pci remove.

Huh?  That should not happen, do you have a backtrace of that crash?

> > Signed-off-by: Andrey Grodzovsky
>
> Uh I thought sysfs just gets yanked completely. Please check with Greg KH
> whether hand-rolling all this really is the right solution here ... Feels
> very wrong. I thought this was all supposed to work by adding attributes
> before publishing the sysfs node, and then letting sysfs clean up
> everything. Not by cleaning up manually yourself.

Yes, that is supposed to be the correct thing to do.

> Adding Greg for an authoritative answer.
> -Daniel
>
> > +struct amdgpu_sysfs_list_node {
> > +	struct list_head head;
> > +	struct device_attribute *attr;
> > +};

You know we have lists of attributes already, called attribute groups,
if you really wanted to do something like this.

But, I don't think so.  Either way, don't hand-roll your own stuff that
the driver core has provided for you for a decade or more, that's just
foolish :)
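[Editor's note: the attribute-group mechanism Greg refers to looks roughly like the sketch below. Names and the visibility test are illustrative, not the actual amdgpu conversion; the point is that attributes conditionally created by different subsystems can instead sit in one group whose .is_visible callback lets the core hide the ones that don't apply, so the driver never creates or removes individual files.]

	/* Sketch only: attribute names and visibility logic are illustrative. */
	static ssize_t vbios_version_show(struct device *dev,
					  struct device_attribute *attr, char *buf)
	{
		return snprintf(buf, PAGE_SIZE, "%s\n", "N/A");
	}
	static DEVICE_ATTR_RO(vbios_version);

	static ssize_t serial_number_show(struct device *dev,
					  struct device_attribute *attr, char *buf)
	{
		return snprintf(buf, PAGE_SIZE, "%s\n", "N/A");
	}
	static DEVICE_ATTR_RO(serial_number);

	static struct attribute *amdgpu_attrs[] = {
		&dev_attr_vbios_version.attr,
		&dev_attr_serial_number.attr,
		NULL
	};

	/*
	 * The core calls this once per attribute; returning 0 hides the file,
	 * so conditional attributes need no per-subsystem create/remove calls.
	 */
	static umode_t amdgpu_attr_is_visible(struct kobject *kobj,
					      struct attribute *attr, int n)
	{
		/* e.g. hide serial_number on hardware that has none */
		return attr->mode;
	}

	static const struct attribute_group amdgpu_attr_group = {
		.attrs      = amdgpu_attrs,
		.is_visible = amdgpu_attr_is_visible,
	};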
Re: [PATCH v2 5/8] drm/amdgpu: Refactor sysfs removal
On Sun, Jun 21, 2020 at 02:03:05AM -0400, Andrey Grodzovsky wrote:
> Track sysfs files in a list so they all can be removed during pci remove
> since otherwise their removal after that causes crash because parent
> folder was already removed during pci remove.
>
> Signed-off-by: Andrey Grodzovsky

Uh I thought sysfs just gets yanked completely. Please check with Greg KH
whether hand-rolling all this really is the right solution here ... Feels
very wrong. I thought this was all supposed to work by adding attributes
before publishing the sysfs node, and then letting sysfs clean up
everything. Not by cleaning up manually yourself.

Adding Greg for an authoritative answer.
-Daniel
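[Editor's note: the model Daniel describes — attributes registered before the device is published and torn down automatically by the core — is what the dev_groups field of struct device_driver provides on recent kernels (it was added around v5.4; verify against your tree). A minimal sketch with illustrative names, not the actual amdgpu code:]

	static ssize_t pcie_replay_count_show(struct device *dev,
					      struct device_attribute *attr, char *buf)
	{
		return snprintf(buf, PAGE_SIZE, "0\n");	/* placeholder value */
	}
	static DEVICE_ATTR_RO(pcie_replay_count);

	static struct attribute *amdgpu_dev_attrs[] = {
		&dev_attr_pcie_replay_count.attr,
		NULL
	};
	ATTRIBUTE_GROUPS(amdgpu_dev);	/* generates amdgpu_dev_groups[] */

	static struct pci_driver amdgpu_pci_driver = {
		.name              = "amdgpu",
		/* probe/remove/id_table omitted in this sketch */
		.driver.dev_groups = amdgpu_dev_groups,
	};

[With this, the core creates the files before probe() returns and removes them on unbind or hot-unplug, in the right order, with no manual bookkeeping in the driver.]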