Re: [PATCH v2 5/8] drm/amdgpu: Refactor sysfs removal

2020-12-02 Thread Andrey Grodzovsky


On 12/2/20 1:20 PM, Greg KH wrote:

On Wed, Dec 02, 2020 at 01:02:06PM -0500, Andrey Grodzovsky wrote:

On 12/2/20 12:34 PM, Greg KH wrote:

On Wed, Dec 02, 2020 at 10:48:01AM -0500, Andrey Grodzovsky wrote:

On 11/11/20 10:34 AM, Greg KH wrote:

On Wed, Nov 11, 2020 at 10:13:13AM -0500, Andrey Grodzovsky wrote:

On 11/10/20 12:59 PM, Greg KH wrote:

On Tue, Nov 10, 2020 at 12:54:21PM -0500, Andrey Grodzovsky wrote:

Hi, back to this after a long context switch for some higher priority stuff.

So here I was eventually able to drop all this code, and this change here
https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=amd-staging-drm-next-device-unplug&id=61852c8a59b4dd89d637693552c73175b9f2ccd6
was enough for me. It seems that while device_remove_file can handle the
case where the file and the parent directory are already gone,
sysfs_remove_group goes down in flames in that case
because kobj->sd is unset on device removal.

A driver shouldn't ever have to remove individual sysfs groups, the
driver core/bus logic should do it for them automatically.

And whenever a driver calls a sysfs_* call, that's a hint that something
is not working properly.

Do you mean that while the driver creates the groups and files explicitly
from its different subsystems, it should not explicitly remove each
one of them, because all of them should be removed at once (and
recursively) when the device is being removed?

Individual drivers should never add groups/files in sysfs, the driver
core should do it properly for you if you have everything set up
properly.  And yes, the driver core will automatically remove them as
well.

Please use the default groups attribute for your bus/subsystem and this
will happen automagically.
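
For illustration, a minimal sketch of what such a default attribute group can
look like on the driver side; the attribute name serial_number, its show()
body, and the my_dev prefix are made up for this example and are not taken
from the amdgpu patches:

    #include <linux/device.h>
    #include <linux/kernel.h>
    #include <linux/sysfs.h>

    /* Hypothetical read-only attribute; a real driver would report device state. */
    static ssize_t serial_number_show(struct device *dev,
                                      struct device_attribute *attr, char *buf)
    {
            return scnprintf(buf, PAGE_SIZE, "0000\n");
    }
    static DEVICE_ATTR_RO(serial_number);

    static struct attribute *my_dev_attrs[] = {
            &dev_attr_serial_number.attr,
            NULL,
    };
    ATTRIBUTE_GROUPS(my_dev);   /* generates my_dev_group and my_dev_groups */

The resulting my_dev_groups array is what the driver core can install and tear
down for you, which is what makes explicit sysfs_create_group()/
sysfs_remove_group() calls in the driver unnecessary.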

Hi Greg, I tried your suggestion to hang amdgpu's sysfs
attributes on the default attributes in struct device.groups, but it turns out
it's not usable, since by the
time I have access to struct device from amdgpu code it has already been
initialized by the PCI core
(i.e. past the point where device_add->device_add_attrs->device_add_groups
with dev->groups is called),
and so I can't really use it.

That's odd, why can't you just set the groups pointer in your pci_driver
structure?  That's what it is there for, right?

I am probably missing something, but amdgpu sysfs attrs are per device, not
per driver.

Oops, you are right, you want the 'dev_groups' field.  Looks like pci
doesn't export that directly, so you can do:
.driver = {
	.dev_groups = my_device_groups,
},
in your pci_driver structure.

Or I'm sure the PCI driver maintainer would take a patch like
7d9c1d2f7aca ("USB: add support for dev_groups to struct
usb_device_driver"), which was done for the USB subsystem, as diving into the
"raw" .driver pointer isn't really that clean or nice in my opinion.



Looks like exactly what I need. I will probably start with assigning the raw
pointer just to push my work ahead, and in parallel will submit the same kind
of patch as yours for PCI subsystem review, as the rework to switch to it is
really minimal.
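
A rough, hypothetical sketch of the idea behind such a PCI patch, mirroring
what commit 7d9c1d2f7aca did for usb_device_driver; the _sketch names are
stand-ins and this is not the patch that was actually sent for review:

    #include <linux/device.h>

    /* Stand-in for struct pci_driver with a new dev_groups member. */
    struct pci_driver_sketch {
            const char *name;
            const struct attribute_group **dev_groups;  /* new: per-device groups */
            struct device_driver driver;
    };

    /* Stand-in for the registration path: forward the new pointer to the
     * embedded struct device_driver so the driver core picks it up. */
    static int pci_register_driver_sketch(struct pci_driver_sketch *drv)
    {
            drv->driver.name = drv->name;
            drv->driver.dev_groups = drv->dev_groups;
            return driver_register(&drv->driver);
    }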

Andrey




thanks,

greg k-h



Re: [PATCH v2 5/8] drm/amdgpu: Refactor sysfs removal

2020-12-02 Thread Greg KH
On Wed, Dec 02, 2020 at 01:02:06PM -0500, Andrey Grodzovsky wrote:
> 
> On 12/2/20 12:34 PM, Greg KH wrote:
> > On Wed, Dec 02, 2020 at 10:48:01AM -0500, Andrey Grodzovsky wrote:
> > > On 11/11/20 10:34 AM, Greg KH wrote:
> > > > On Wed, Nov 11, 2020 at 10:13:13AM -0500, Andrey Grodzovsky wrote:
> > > > > On 11/10/20 12:59 PM, Greg KH wrote:
> > > > > > On Tue, Nov 10, 2020 at 12:54:21PM -0500, Andrey Grodzovsky wrote:
> > > > > > > Hi, back to this after a long context switch for some higher 
> > > > > > > priority stuff.
> > > > > > > 
> > > > > > > So here I was able eventually to drop all this code and this 
> > > > > > > change here 
> > > > > > > https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=amd-staging-drm-next-device-unplug&id=61852c8a59b4dd89d637693552c73175b9f2ccd6
> > > > > > > was enough for me. Seems like while device_remove_file can handle 
> > > > > > > the use
> > > > > > > case where the file and the parent directory already gone,
> > > > > > > sysfs_remove_group goes down in flames in that case
> > > > > > > due to kobj->sd being unset on device removal.
> > > > > > A driver shouldn't ever have to remove individual sysfs groups, the
> > > > > > driver core/bus logic should do it for them automatically.
> > > > > > 
> > > > > > And whenever a driver calls a sysfs_* call, that's a hint that 
> > > > > > something
> > > > > > is not working properly.
> > > > > 
> > > > > Do you mean that while the driver creates the groups and files 
> > > > > explicitly
> > > > > from it's different subsystems it should not explicitly remove each
> > > > > one of them because all of them should be removed at once (and
> > > > > recursively) when the device is being removed ?
> > > > Individual drivers should never add groups/files in sysfs, the driver
> > > > core should do it properly for you if you have everything set up
> > > > properly.  And yes, the driver core will automatically remove them as
> > > > well.
> > > > 
> > > > Please use the default groups attribute for your bus/subsystem and this
> > > > will happen automagically.
> > > 
> > > Hi Greg, I tried your suggestion to hang amdgpu's sysfs
> > > attributes on default attributes in struct device.groups but turns out 
> > > it's
> > > not usable since by the
> > > time i have access to struct device from amdgpu code it has already been
> > > initialized by pci core
> > > (i.e.  past the point where 
> > > device_add->device_add_attrs->device_add_groups
> > > with dev->groups is called)
> > > and so i can't really use it.
> > That's odd, why can't you just set the groups pointer in your pci_driver
> > structure?  That's what it is there for, right?
> 
> I am probably missing something but amdgpu sysfs attrs are per device not
> per driver

Oops, you are right, you want the 'dev_groups' field.  Looks like pci
doesn't export that directly, so you can do:
.driver = {
	.dev_groups = my_device_groups,
},
in your pci_driver structure.

Or I'm sure the PCI driver maintainer would take a patch like
7d9c1d2f7aca ("USB: add support for dev_groups to struct
usb_device_driver"), which was done for the USB subsystem, as diving into the
"raw" .driver pointer isn't really that clean or nice in my opinion.

thanks,

greg k-h


Re: [PATCH v2 5/8] drm/amdgpu: Refactor sysfs removal

2020-12-02 Thread Andrey Grodzovsky


On 12/2/20 12:34 PM, Greg KH wrote:

On Wed, Dec 02, 2020 at 10:48:01AM -0500, Andrey Grodzovsky wrote:

On 11/11/20 10:34 AM, Greg KH wrote:

On Wed, Nov 11, 2020 at 10:13:13AM -0500, Andrey Grodzovsky wrote:

On 11/10/20 12:59 PM, Greg KH wrote:

On Tue, Nov 10, 2020 at 12:54:21PM -0500, Andrey Grodzovsky wrote:

Hi, back to this after a long context switch for some higher priority stuff.

So here I was eventually able to drop all this code, and this change here
https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=amd-staging-drm-next-device-unplug&id=61852c8a59b4dd89d637693552c73175b9f2ccd6
was enough for me. It seems that while device_remove_file can handle the
case where the file and the parent directory are already gone,
sysfs_remove_group goes down in flames in that case
because kobj->sd is unset on device removal.

A driver shouldn't ever have to remove individual sysfs groups, the
driver core/bus logic should do it for them automatically.

And whenever a driver calls a sysfs_* call, that's a hint that something
is not working properly.


Do you mean that while the driver creates the groups and files explicitly
from its different subsystems, it should not explicitly remove each
one of them, because all of them should be removed at once (and
recursively) when the device is being removed?

Individual drivers should never add groups/files in sysfs, the driver
core should do it properly for you if you have everything set up
properly.  And yes, the driver core will automatically remove them as
well.

Please use the default groups attribute for your bus/subsystem and this
will happen automagically.


Hi Greg, I tried your suggestion to hang amdgpu's sysfs
attributes on the default attributes in struct device.groups, but it turns out
it's not usable, since by the
time I have access to struct device from amdgpu code it has already been
initialized by the PCI core
(i.e. past the point where device_add->device_add_attrs->device_add_groups
with dev->groups is called),
and so I can't really use it.

That's odd, why can't you just set the groups pointer in your pci_driver
structure?  That's what it is there for, right?


I am probably missing something, but amdgpu sysfs attrs are per device, not per
driver; their life cycle is bound to the device, and their location in the
sysfs topology is under each device. Putting them as driver default attrs will
not put them in their current per-device location
and won't make them automatically be destroyed once a particular device goes
away, no?


Andrey





What I can only think of using is creating my own struct attribute_group **
array in amdgpu where I aggregate all
amdgpu sysfs attributes, call device_add_groups at the end of the amdgpu PCI
probe with that array, and on device remove call
device_remove_groups with the same array.

Horrid, no, see above :)

thanks,

greg k-h



Re: [PATCH v2 5/8] drm/amdgpu: Refactor sysfs removal

2020-12-02 Thread Greg KH
On Wed, Dec 02, 2020 at 10:48:01AM -0500, Andrey Grodzovsky wrote:
> 
> On 11/11/20 10:34 AM, Greg KH wrote:
> > On Wed, Nov 11, 2020 at 10:13:13AM -0500, Andrey Grodzovsky wrote:
> > > On 11/10/20 12:59 PM, Greg KH wrote:
> > > > On Tue, Nov 10, 2020 at 12:54:21PM -0500, Andrey Grodzovsky wrote:
> > > > > Hi, back to this after a long context switch for some higher priority 
> > > > > stuff.
> > > > > 
> > > > > So here I was able eventually to drop all this code and this change 
> > > > > here 
> > > > > https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=amd-staging-drm-next-device-unplug&id=61852c8a59b4dd89d637693552c73175b9f2ccd6
> > > > > was enough for me. Seems like while device_remove_file can handle the 
> > > > > use
> > > > > case where the file and the parent directory already gone,
> > > > > sysfs_remove_group goes down in flames in that case
> > > > > due to kobj->sd being unset on device removal.
> > > > A driver shouldn't ever have to remove individual sysfs groups, the
> > > > driver core/bus logic should do it for them automatically.
> > > > 
> > > > And whenever a driver calls a sysfs_* call, that's a hint that something
> > > > is not working properly.
> > > 
> > > 
> > > Do you mean that while the driver creates the groups and files explicitly
> > > from it's different subsystems it should not explicitly remove each
> > > one of them because all of them should be removed at once (and
> > > recursively) when the device is being removed ?
> > Individual drivers should never add groups/files in sysfs, the driver
> > core should do it properly for you if you have everything set up
> > properly.  And yes, the driver core will automatically remove them as
> > well.
> > 
> > Please use the default groups attribute for your bus/subsystem and this
> > will happen automagically.
> 
> 
> Hi Greg, I tried your suggestion to hang amdgpu's sysfs
> attributes on default attributes in struct device.groups but turns out it's
> not usable since by the
> time i have access to struct device from amdgpu code it has already been
> initialized by pci core
> (i.e.  past the point where device_add->device_add_attrs->device_add_groups
> with dev->groups is called)
> and so i can't really use it.

That's odd, why can't you just set the groups pointer in your pci_driver
structure?  That's what it is there for, right?

> What I can only think of using is creating my own struct attribute_group **
> array in amdgpu where I aggregate all
> amdgpu sysfs attributes, call device_add_groups in the end of amgpu pci
> probe with that array and on device remove call
> device_remove_groups with the same array.

Horrid, no, see above :)

thanks,

greg k-h


Re: [PATCH v2 5/8] drm/amdgpu: Refactor sysfs removal

2020-12-02 Thread Andrey Grodzovsky


On 11/11/20 10:34 AM, Greg KH wrote:

On Wed, Nov 11, 2020 at 10:13:13AM -0500, Andrey Grodzovsky wrote:

On 11/10/20 12:59 PM, Greg KH wrote:

On Tue, Nov 10, 2020 at 12:54:21PM -0500, Andrey Grodzovsky wrote:

Hi, back to this after a long context switch for some higher priority stuff.

So here I was eventually able to drop all this code, and this change here
https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=amd-staging-drm-next-device-unplug&id=61852c8a59b4dd89d637693552c73175b9f2ccd6
was enough for me. It seems that while device_remove_file can handle the
case where the file and the parent directory are already gone,
sysfs_remove_group goes down in flames in that case
because kobj->sd is unset on device removal.

A driver shouldn't ever have to remove individual sysfs groups, the
driver core/bus logic should do it for them automatically.

And whenever a driver calls a sysfs_* call, that's a hint that something
is not working properly.



Do you mean that while the driver creates the groups and files explicitly
from its different subsystems, it should not explicitly remove each
one of them, because all of them should be removed at once (and
recursively) when the device is being removed?

Individual drivers should never add groups/files in sysfs, the driver
core should do it properly for you if you have everything set up
properly.  And yes, the driver core will automatically remove them as
well.

Please use the default groups attribute for your bus/subsystem and this
will happen automagically.



Hi Greg, I tried your suggestion to hang amdgpu's sysfs
attributes on the default attributes in struct device.groups, but it turns out
it's not usable, since by the
time I have access to struct device from amdgpu code it has already been
initialized by the PCI core
(i.e. past the point where device_add->device_add_attrs->device_add_groups
with dev->groups is called),
and so I can't really use it.

What I can only think of using is creating my own struct attribute_group **
array in amdgpu where I aggregate all
amdgpu sysfs attributes, call device_add_groups at the end of the amdgpu PCI
probe with that array, and on device remove call
device_remove_groups with the same array.

Do you maybe have a better suggestion for me?

Andrey




thanks,

greg k-h




Re: [PATCH v2 5/8] drm/amdgpu: Refactor sysfs removal

2020-11-11 Thread Andrey Grodzovsky


On 11/11/20 11:06 AM, Greg KH wrote:

On Wed, Nov 11, 2020 at 10:45:53AM -0500, Andrey Grodzovsky wrote:

On 11/11/20 10:34 AM, Greg KH wrote:

On Wed, Nov 11, 2020 at 10:13:13AM -0500, Andrey Grodzovsky wrote:

On 11/10/20 12:59 PM, Greg KH wrote:

On Tue, Nov 10, 2020 at 12:54:21PM -0500, Andrey Grodzovsky wrote:

Hi, back to this after a long context switch for some higher priority stuff.

So here I was eventually able to drop all this code, and this change here
https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=amd-staging-drm-next-device-unplug&id=61852c8a59b4dd89d637693552c73175b9f2ccd6
was enough for me. It seems that while device_remove_file can handle the
case where the file and the parent directory are already gone,
sysfs_remove_group goes down in flames in that case
because kobj->sd is unset on device removal.

A driver shouldn't ever have to remove individual sysfs groups, the
driver core/bus logic should do it for them automatically.

And whenever a driver calls a sysfs_* call, that's a hint that something
is not working properly.


Do you mean that while the driver creates the groups and files explicitly
from its different subsystems, it should not explicitly remove each
one of them, because all of them should be removed at once (and
recursively) when the device is being removed?

Individual drivers should never add groups/files in sysfs, the driver
core should do it properly for you if you have everything set up
properly.  And yes, the driver core will automatically remove them as
well.

Please use the default groups attribute for your bus/subsystem and this
will happen automagically.

Googling for default groups attributes I found this -
https://www.linuxfoundation.org/blog/2013/06/how-to-create-a-sysfs-file-correctly/

Odd, mirror of the original article:

http://kroah.com/log/blog/2013/06/26/how-to-create-a-sysfs-file-correctly/


Would this be what you suggest for us? Specifically for our case the struct
device's groups seems the right solution, as different devices
might have slightly different sysfs attributes.

That's what the is_visible() callback in your attribute group is for, to
tell the kernel if an individual sysfs attribute should be created or
not.
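
For illustration, a minimal sketch of an is_visible() callback;
dev_attr_fan_speed, my_dev_attrs and my_feature_supported() are hypothetical
names, not amdgpu code:

    static umode_t my_attr_is_visible(struct kobject *kobj,
                                      struct attribute *attr, int index)
    {
            struct device *dev = kobj_to_dev(kobj);

            /* Hide this attribute on devices that lack the feature. */
            if (attr == &dev_attr_fan_speed.attr && !my_feature_supported(dev))
                    return 0;

            /* Keep everything else with its default permissions. */
            return attr->mode;
    }

    static const struct attribute_group my_dev_group = {
            .attrs      = my_dev_attrs,
            .is_visible = my_attr_is_visible,
    };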


I see, this looks like a good improvement to our current way of managing sysfs.
Since this change is somewhat fundamental and requires good testing, I prefer
to deal with it separately from my current work on device unplug, so I will put
it on my TODO right after finishing this work.

Andrey




thanks,

greg k-h



Re: [PATCH v2 5/8] drm/amdgpu: Refactor sysfs removal

2020-11-11 Thread Greg KH
On Wed, Nov 11, 2020 at 10:45:53AM -0500, Andrey Grodzovsky wrote:
> 
> On 11/11/20 10:34 AM, Greg KH wrote:
> > On Wed, Nov 11, 2020 at 10:13:13AM -0500, Andrey Grodzovsky wrote:
> > > On 11/10/20 12:59 PM, Greg KH wrote:
> > > > On Tue, Nov 10, 2020 at 12:54:21PM -0500, Andrey Grodzovsky wrote:
> > > > > Hi, back to this after a long context switch for some higher priority 
> > > > > stuff.
> > > > > 
> > > > > So here I was able eventually to drop all this code and this change 
> > > > > here 
> > > > > https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=amd-staging-drm-next-device-unplug&id=61852c8a59b4dd89d637693552c73175b9f2ccd6
> > > > > was enough for me. Seems like while device_remove_file can handle the 
> > > > > use
> > > > > case where the file and the parent directory already gone,
> > > > > sysfs_remove_group goes down in flames in that case
> > > > > due to kobj->sd being unset on device removal.
> > > > A driver shouldn't ever have to remove individual sysfs groups, the
> > > > driver core/bus logic should do it for them automatically.
> > > > 
> > > > And whenever a driver calls a sysfs_* call, that's a hint that something
> > > > is not working properly.
> > > 
> > > 
> > > Do you mean that while the driver creates the groups and files explicitly
> > > from it's different subsystems it should not explicitly remove each
> > > one of them because all of them should be removed at once (and
> > > recursively) when the device is being removed ?
> > Individual drivers should never add groups/files in sysfs, the driver
> > core should do it properly for you if you have everything set up
> > properly.  And yes, the driver core will automatically remove them as
> > well.
> > 
> > Please use the default groups attribute for your bus/subsystem and this
> > will happen automagically.
> 
> Googling for default groups attributes i found this - 
> https://www.linuxfoundation.org/blog/2013/06/how-to-create-a-sysfs-file-correctly/

Odd, mirror of the original article:

http://kroah.com/log/blog/2013/06/26/how-to-create-a-sysfs-file-correctly/

> Would this be what you suggest for us ? Specifically for our case the struct
> device's  groups  seems the right solution as different devices
> might have slightly diffreent sysfs attributes.

That's what the is_visible() callback in your attribute group is for, to
tell the kernel if an individual sysfs attribute should be created or
not.

thanks,

greg k-h


Re: [PATCH v2 5/8] drm/amdgpu: Refactor sysfs removal

2020-11-11 Thread Andrey Grodzovsky


On 11/11/20 10:34 AM, Greg KH wrote:

On Wed, Nov 11, 2020 at 10:13:13AM -0500, Andrey Grodzovsky wrote:

On 11/10/20 12:59 PM, Greg KH wrote:

On Tue, Nov 10, 2020 at 12:54:21PM -0500, Andrey Grodzovsky wrote:

Hi, back to this after a long context switch for some higher priority stuff.

So here I was eventually able to drop all this code, and this change here
https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=amd-staging-drm-next-device-unplug&id=61852c8a59b4dd89d637693552c73175b9f2ccd6
was enough for me. It seems that while device_remove_file can handle the
case where the file and the parent directory are already gone,
sysfs_remove_group goes down in flames in that case
because kobj->sd is unset on device removal.

A driver shouldn't ever have to remove individual sysfs groups, the
driver core/bus logic should do it for them automatically.

And whenever a driver calls a sysfs_* call, that's a hint that something
is not working properly.



Do you mean that while the driver creates the groups and files explicitly
from its different subsystems, it should not explicitly remove each
one of them, because all of them should be removed at once (and
recursively) when the device is being removed?

Individual drivers should never add groups/files in sysfs, the driver
core should do it properly for you if you have everything set up
properly.  And yes, the driver core will automatically remove them as
well.

Please use the default groups attribute for your bus/subsystem and this
will happen automagically.


Googling for default groups attributes I found this -
https://www.linuxfoundation.org/blog/2013/06/how-to-create-a-sysfs-file-correctly/
Would this be what you suggest for us? Specifically for our case the struct
device's groups seems the right solution, as different devices
might have slightly different sysfs attributes.

Andrey




thanks,

greg k-h



Re: [PATCH v2 5/8] drm/amdgpu: Refactor sysfs removal

2020-11-11 Thread Greg KH
On Wed, Nov 11, 2020 at 10:13:13AM -0500, Andrey Grodzovsky wrote:
> 
> On 11/10/20 12:59 PM, Greg KH wrote:
> > On Tue, Nov 10, 2020 at 12:54:21PM -0500, Andrey Grodzovsky wrote:
> > > Hi, back to this after a long context switch for some higher priority 
> > > stuff.
> > > 
> > > So here I was able eventually to drop all this code and this change here 
> > > https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=amd-staging-drm-next-device-unplug&id=61852c8a59b4dd89d637693552c73175b9f2ccd6
> > > was enough for me. Seems like while device_remove_file can handle the use
> > > case where the file and the parent directory already gone,
> > > sysfs_remove_group goes down in flames in that case
> > > due to kobj->sd being unset on device removal.
> > A driver shouldn't ever have to remove individual sysfs groups, the
> > driver core/bus logic should do it for them automatically.
> > 
> > And whenever a driver calls a sysfs_* call, that's a hint that something
> > is not working properly.
> 
> 
> 
> Do you mean that while the driver creates the groups and files explicitly
> from it's different subsystems it should not explicitly remove each
> one of them because all of them should be removed at once (and
> recursively) when the device is being removed ?

Individual drivers should never add groups/files in sysfs, the driver
core should do it properly for you if you have everything set up
properly.  And yes, the driver core will automatically remove them as
well.

Please use the default groups attribute for your bus/subsystem and this
will happen automagically.

thanks,

greg k-h


Re: [PATCH v2 5/8] drm/amdgpu: Refactor sysfs removal

2020-11-11 Thread Andrey Grodzovsky



On 11/10/20 12:59 PM, Greg KH wrote:

On Tue, Nov 10, 2020 at 12:54:21PM -0500, Andrey Grodzovsky wrote:

Hi, back to this after a long context switch for some higher priority stuff.

So here I was eventually able to drop all this code, and this change here
https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=amd-staging-drm-next-device-unplug&id=61852c8a59b4dd89d637693552c73175b9f2ccd6
was enough for me. It seems that while device_remove_file can handle the
case where the file and the parent directory are already gone,
sysfs_remove_group goes down in flames in that case
because kobj->sd is unset on device removal.

A driver shouldn't ever have to remove individual sysfs groups, the
driver core/bus logic should do it for them automatically.

And whenever a driver calls a sysfs_* call, that's a hint that something
is not working properly.




Do you mean that while the driver creates the groups and files explicitly from
its different subsystems, it should not explicitly remove each one of them,
because all of them should
be removed at once (and recursively) when the device is being removed?

Andrey




Also, run your patch above through checkpatch.pl before submitting it :)

thanks,

greg k-h



Re: [PATCH v2 5/8] drm/amdgpu: Refactor sysfs removal

2020-11-10 Thread Greg KH
On Tue, Nov 10, 2020 at 12:54:21PM -0500, Andrey Grodzovsky wrote:
> Hi, back to this after a long context switch for some higher priority stuff.
> 
> So here I was able eventually to drop all this code and this change here 
> https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=amd-staging-drm-next-device-unplug&id=61852c8a59b4dd89d637693552c73175b9f2ccd6
> was enough for me. Seems like while device_remove_file can handle the use
> case where the file and the parent directory already gone,
> sysfs_remove_group goes down in flames in that case
> due to kobj->sd being unset on device removal.

A driver shouldn't ever have to remove individual sysfs groups, the
driver core/bus logic should do it for them automatically.

And whenever a driver calls a sysfs_* call, that's a hint that something
is not working properly.

Also, run your patch above through checkpatch.pl before submitting it :)

thanks,

greg k-h


Re: [PATCH v2 5/8] drm/amdgpu: Refactor sysfs removal

2020-11-10 Thread Andrey Grodzovsky

Hi, back to this after a long context switch for some higher priority stuff.

So here I was eventually able to drop all this code, and this change here
https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=amd-staging-drm-next-device-unplug&id=61852c8a59b4dd89d637693552c73175b9f2ccd6
was enough for me. It seems that while device_remove_file can handle the use
case where the file and the parent directory are already gone,
sysfs_remove_group goes down in flames in that case
because kobj->sd is unset on device removal.

Andrey

On 6/24/20 2:11 AM, Greg KH wrote:

But why are things being removed twice?

Not sure I understand; what is removed twice? I remove each sysfs
attribute only once.

This code path shows that the kernel is trying to remove a file that is
not present, so someone removed it already...

thanks,

greg k-h



Re: [PATCH v2 5/8] drm/amdgpu: Refactor sysfs removal

2020-06-24 Thread Andrey Grodzovsky



On 6/24/20 2:11 AM, Greg KH wrote:

On Tue, Jun 23, 2020 at 11:04:30PM -0400, Andrey Grodzovsky wrote:

On 6/23/20 2:05 AM, Greg KH wrote:

On Tue, Jun 23, 2020 at 12:51:00AM -0400, Andrey Grodzovsky wrote:

On 6/22/20 12:45 PM, Greg KH wrote:

On Mon, Jun 22, 2020 at 12:07:25PM -0400, Andrey Grodzovsky wrote:

On 6/22/20 7:21 AM, Greg KH wrote:

On Mon, Jun 22, 2020 at 11:51:24AM +0200, Daniel Vetter wrote:

On Sun, Jun 21, 2020 at 02:03:05AM -0400, Andrey Grodzovsky wrote:

Track sysfs files in a list so they can all be removed during pci remove,
since otherwise removing them after that causes a crash because the parent
folder was already removed during pci remove.

Huh?  That should not happen, do you have a backtrace of that crash?

2 examples in the attached trace.

Odd, how did you trigger these?

By manually triggering PCI remove from sysfs

cd /sys/bus/pci/devices/\:05\:00.0 && echo 1 > remove

For some reason, I didn't think that video/drm devices could handle
hot-remove like this.  The "old" PCI hotplug specification explicitly
said that video devices were not supported, has that changed?

And this whole issue is probably tied to the larger issue that Daniel
was asking me about, when it came to device lifetimes and the drm layer,
so odds are we need to fix that up first before worrying about trying to
support this crazy request, right?  :)


[  925.738225 <0.188086>] BUG: kernel NULL pointer dereference, address: 
0090
[  925.738232 <0.07>] #PF: supervisor read access in kernel mode
[  925.738236 <0.04>] #PF: error_code(0x) - not-present page
[  925.738240 <0.04>] PGD 0 P4D 0
[  925.738245 <0.05>] Oops:  [#1] SMP PTI
[  925.738249 <0.04>] CPU: 7 PID: 2547 Comm: amdgpu_test Tainted: G 
   W  OE 5.5.0-rc7-dev-kfd+ #50
[  925.738256 <0.07>] Hardware name: System manufacturer System Product 
Name/RAMPAGE IV FORMULA, BIOS 4804 12/30/2013
[  925.738266 <0.10>] RIP: 0010:kernfs_find_ns+0x18/0x110
[  925.738270 <0.04>] Code: a6 cf ff 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 
00 00 66 66 66 66 90 41 57 41 56 49 89 f6 41 55 41 54 49 89 fd 55 53 49 89 d4 <0f> b7 
af 90 00 00 00 8b 05 8f ee 6b 01 48 8b 5f 68 66 83 e5 20 41
[  925.738282 <0.12>] RSP: 0018:ad6d0118fb00 EFLAGS: 00010246
[  925.738287 <0.05>] RAX:  RBX:  RCX: 
2098a12076864b7e
[  925.738292 <0.05>] RDX:  RSI: b6606b31 RDI: 

[  925.738297 <0.05>] RBP: b6606b31 R08: b5379d10 R09: 

[  925.738302 <0.05>] R10: ad6d0118fb38 R11: 9a75f64820a8 R12: 

[  925.738307 <0.05>] R13:  R14: b6606b31 R15: 
9a7612b06130
[  925.738313 <0.06>] FS:  7f3eca4e8700() 
GS:9a763dbc() knlGS:
[  925.738319 <0.06>] CS:  0010 DS:  ES:  CR0: 80050033
[  925.738323 <0.04>] CR2: 0090 CR3: 35e5a005 CR4: 
000606e0
[  925.738329 <0.06>] Call Trace:
[  925.738334 <0.05>]  kernfs_find_and_get_ns+0x2e/0x50
[  925.738339 <0.05>]  sysfs_remove_group+0x25/0x80
[  925.738344 <0.05>]  sysfs_remove_groups+0x29/0x40
[  925.738350 <0.06>]  free_msi_irqs+0xf5/0x190
[  925.738354 <0.04>]  pci_disable_msi+0xe9/0x120

So the PCI core is trying to clean up attributes that it had registered,
which is fine.  But we can't seem to find the attributes?  Were they
already removed somewhere else?

that's odd.

Yes, as I pointed out above, I am emulating device remove from sysfs, and this
triggers the pci device remove sequence; as part of that, my specific device
folder (05:00.0) is removed from the sysfs tree.

But why are things being removed twice?


Not sure I understand; what is removed twice? I remove each sysfs
attribute only once.

This code path shows that the kernel is trying to remove a file that is
not present, so someone removed it already...

thanks,

greg k-h



That's a mystery for me too...

Andrey




Re: [PATCH v2 5/8] drm/amdgpu: Refactor sysfs removal

2020-06-24 Thread Greg KH
On Tue, Jun 23, 2020 at 11:04:30PM -0400, Andrey Grodzovsky wrote:
> 
> On 6/23/20 2:05 AM, Greg KH wrote:
> > On Tue, Jun 23, 2020 at 12:51:00AM -0400, Andrey Grodzovsky wrote:
> > > On 6/22/20 12:45 PM, Greg KH wrote:
> > > > On Mon, Jun 22, 2020 at 12:07:25PM -0400, Andrey Grodzovsky wrote:
> > > > > On 6/22/20 7:21 AM, Greg KH wrote:
> > > > > > On Mon, Jun 22, 2020 at 11:51:24AM +0200, Daniel Vetter wrote:
> > > > > > > On Sun, Jun 21, 2020 at 02:03:05AM -0400, Andrey Grodzovsky wrote:
> > > > > > > > Track sysfs files in a list so they all can be removed during 
> > > > > > > > pci remove
> > > > > > > > since otherwise their removal after that causes crash because 
> > > > > > > > parent
> > > > > > > > folder was already removed during pci remove.
> > > > > > Huh?  That should not happen, do you have a backtrace of that crash?
> > > > > 2 examples in the attached trace.
> > > > Odd, how did you trigger these?
> > > 
> > > By manually triggering PCI remove from sysfs
> > > 
> > > cd /sys/bus/pci/devices/\:05\:00.0 && echo 1 > remove
> > For some reason, I didn't think that video/drm devices could handle
> > hot-remove like this.  The "old" PCI hotplug specification explicitly
> > said that video devices were not supported, has that changed?
> > 
> > And this whole issue is probably tied to the larger issue that Daniel
> > was asking me about, when it came to device lifetimes and the drm layer,
> > so odds are we need to fix that up first before worrying about trying to
> > support this crazy request, right?  :)
> > 
> > > > > [  925.738225 <0.188086>] BUG: kernel NULL pointer dereference, 
> > > > > address: 0090
> > > > > [  925.738232 <0.07>] #PF: supervisor read access in kernel 
> > > > > mode
> > > > > [  925.738236 <0.04>] #PF: error_code(0x) - not-present 
> > > > > page
> > > > > [  925.738240 <0.04>] PGD 0 P4D 0
> > > > > [  925.738245 <0.05>] Oops:  [#1] SMP PTI
> > > > > [  925.738249 <0.04>] CPU: 7 PID: 2547 Comm: amdgpu_test 
> > > > > Tainted: GW  OE 5.5.0-rc7-dev-kfd+ #50
> > > > > [  925.738256 <0.07>] Hardware name: System manufacturer 
> > > > > System Product Name/RAMPAGE IV FORMULA, BIOS 4804 12/30/2013
> > > > > [  925.738266 <0.10>] RIP: 0010:kernfs_find_ns+0x18/0x110
> > > > > [  925.738270 <0.04>] Code: a6 cf ff 0f 1f 44 00 00 66 2e 0f 
> > > > > 1f 84 00 00 00 00 00 66 66 66 66 90 41 57 41 56 49 89 f6 41 55 41 54 
> > > > > 49 89 fd 55 53 49 89 d4 <0f> b7 af 90 00 00 00 8b 05 8f ee 6b 01 48 
> > > > > 8b 5f 68 66 83 e5 20 41
> > > > > [  925.738282 <0.12>] RSP: 0018:ad6d0118fb00 EFLAGS: 
> > > > > 00010246
> > > > > [  925.738287 <0.05>] RAX:  RBX: 
> > > > >  RCX: 2098a12076864b7e
> > > > > [  925.738292 <0.05>] RDX:  RSI: 
> > > > > b6606b31 RDI: 
> > > > > [  925.738297 <0.05>] RBP: b6606b31 R08: 
> > > > > b5379d10 R09: 
> > > > > [  925.738302 <0.05>] R10: ad6d0118fb38 R11: 
> > > > > 9a75f64820a8 R12: 
> > > > > [  925.738307 <0.05>] R13:  R14: 
> > > > > b6606b31 R15: 9a7612b06130
> > > > > [  925.738313 <0.06>] FS:  7f3eca4e8700() 
> > > > > GS:9a763dbc() knlGS:
> > > > > [  925.738319 <0.06>] CS:  0010 DS:  ES:  CR0: 
> > > > > 80050033
> > > > > [  925.738323 <0.04>] CR2: 0090 CR3: 
> > > > > 35e5a005 CR4: 000606e0
> > > > > [  925.738329 <0.06>] Call Trace:
> > > > > [  925.738334 <0.05>]  kernfs_find_and_get_ns+0x2e/0x50
> > > > > [  925.738339 <0.05>]  sysfs_remove_group+0x25/0x80
> > > > > [  925.738344 <0.05>]  sysfs_remove_groups+0x29/0x40
> > > > > [  925.738350 <0.06>]  free_msi_irqs+0xf5/0x190
> > > > > [  925.738354 <0.04>]  pci_disable_msi+0xe9/0x120
> > > > So the PCI core is trying to clean up attributes that it had registered,
> > > > which is fine.  But we can't seem to find the attributes?  Were they
> > > > already removed somewhere else?
> > > > 
> > > > that's odd.
> > > 
> > > Yes, as i pointed above i am emulating device remove from sysfs and this
> > > triggers pci device remove sequence and as part of that my specific device
> > > folder (05:00.0) is removed from the sysfs tree.
> > But why are things being removed twice?
> 
> 
> Not sure I understand what removed twice ? I remove only once per sysfs 
> attribute.

This code path shows that the kernel is trying to remove a file that is
not present, so someone removed it already...

thanks,

greg k-h


Re: [PATCH v2 5/8] drm/amdgpu: Refactor sysfs removal

2020-06-23 Thread Andrey Grodzovsky


On 6/23/20 2:05 AM, Greg KH wrote:

On Tue, Jun 23, 2020 at 12:51:00AM -0400, Andrey Grodzovsky wrote:

On 6/22/20 12:45 PM, Greg KH wrote:

On Mon, Jun 22, 2020 at 12:07:25PM -0400, Andrey Grodzovsky wrote:

On 6/22/20 7:21 AM, Greg KH wrote:

On Mon, Jun 22, 2020 at 11:51:24AM +0200, Daniel Vetter wrote:

On Sun, Jun 21, 2020 at 02:03:05AM -0400, Andrey Grodzovsky wrote:

Track sysfs files in a list so they can all be removed during pci remove,
since otherwise removing them after that causes a crash because the parent
folder was already removed during pci remove.

Huh?  That should not happen, do you have a backtrace of that crash?

2 examples in the attached trace.

Odd, how did you trigger these?


By manually triggering PCI remove from sysfs

cd /sys/bus/pci/devices/\:05\:00.0 && echo 1 > remove

For some reason, I didn't think that video/drm devices could handle
hot-remove like this.  The "old" PCI hotplug specification explicitly
said that video devices were not supported, has that changed?

And this whole issue is probably tied to the larger issue that Daniel
was asking me about, when it came to device lifetimes and the drm layer,
so odds are we need to fix that up first before worrying about trying to
support this crazy request, right?  :)


[  925.738225 <0.188086>] BUG: kernel NULL pointer dereference, address: 
0090
[  925.738232 <0.07>] #PF: supervisor read access in kernel mode
[  925.738236 <0.04>] #PF: error_code(0x) - not-present page
[  925.738240 <0.04>] PGD 0 P4D 0
[  925.738245 <0.05>] Oops:  [#1] SMP PTI
[  925.738249 <0.04>] CPU: 7 PID: 2547 Comm: amdgpu_test Tainted: G 
   W  OE 5.5.0-rc7-dev-kfd+ #50
[  925.738256 <0.07>] Hardware name: System manufacturer System Product 
Name/RAMPAGE IV FORMULA, BIOS 4804 12/30/2013
[  925.738266 <0.10>] RIP: 0010:kernfs_find_ns+0x18/0x110
[  925.738270 <0.04>] Code: a6 cf ff 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 
00 00 66 66 66 66 90 41 57 41 56 49 89 f6 41 55 41 54 49 89 fd 55 53 49 89 d4 <0f> b7 
af 90 00 00 00 8b 05 8f ee 6b 01 48 8b 5f 68 66 83 e5 20 41
[  925.738282 <0.12>] RSP: 0018:ad6d0118fb00 EFLAGS: 00010246
[  925.738287 <0.05>] RAX:  RBX:  RCX: 
2098a12076864b7e
[  925.738292 <0.05>] RDX:  RSI: b6606b31 RDI: 

[  925.738297 <0.05>] RBP: b6606b31 R08: b5379d10 R09: 

[  925.738302 <0.05>] R10: ad6d0118fb38 R11: 9a75f64820a8 R12: 

[  925.738307 <0.05>] R13:  R14: b6606b31 R15: 
9a7612b06130
[  925.738313 <0.06>] FS:  7f3eca4e8700() 
GS:9a763dbc() knlGS:
[  925.738319 <0.06>] CS:  0010 DS:  ES:  CR0: 80050033
[  925.738323 <0.04>] CR2: 0090 CR3: 35e5a005 CR4: 
000606e0
[  925.738329 <0.06>] Call Trace:
[  925.738334 <0.05>]  kernfs_find_and_get_ns+0x2e/0x50
[  925.738339 <0.05>]  sysfs_remove_group+0x25/0x80
[  925.738344 <0.05>]  sysfs_remove_groups+0x29/0x40
[  925.738350 <0.06>]  free_msi_irqs+0xf5/0x190
[  925.738354 <0.04>]  pci_disable_msi+0xe9/0x120

So the PCI core is trying to clean up attributes that it had registered,
which is fine.  But we can't seem to find the attributes?  Were they
already removed somewhere else?

that's odd.


Yes, as I pointed out above, I am emulating device remove from sysfs, and this
triggers the pci device remove sequence; as part of that, my specific device
folder (05:00.0) is removed from the sysfs tree.

But why are things being removed twice?



Not sure I understand; what is removed twice? I remove each sysfs
attribute only once.

Andrey





[  925.738406 <0.52>]  amdgpu_irq_fini+0xe3/0xf0 [amdgpu]
[  925.738453 <0.47>]  tonga_ih_sw_fini+0xe/0x30 [amdgpu]
[  925.738490 <0.37>]  amdgpu_device_fini_late+0x14b/0x440 [amdgpu]
[  925.738529 <0.39>]  amdgpu_driver_release_kms+0x16/0x40 [amdgpu]
[  925.738548 <0.19>]  drm_dev_put+0x5b/0x80 [drm]
[  925.738558 <0.10>]  drm_release+0xc6/0xd0 [drm]
[  925.738563 <0.05>]  __fput+0xc6/0x260
[  925.738568 <0.05>]  task_work_run+0x79/0xb0
[  925.738573 <0.05>]  do_exit+0x3d0/0xc60
[  925.738578 <0.05>]  do_group_exit+0x47/0xb0
[  925.738583 <0.05>]  get_signal+0x18b/0xc30
[  925.738589 <0.06>]  do_signal+0x36/0x6a0
[  925.738593 <0.04>]  ? force_sig_info_to_task+0xbc/0xd0
[  925.738597 <0.04>]  ? signal_wake_up_state+0x15/0x30
[  925.738603 <0.06>]  exit_to_usermode_loop+0x6f/0xc0
[  925.738608 <0.05>]  prepare_exit_to_usermode+0xc7/0x110
[  925.738613 <0.05>]  ret_from_intr+0x25/0x35
[  925.738617 <0.04>] RIP: 0033:0x417369
[  925.738621 <0.04>] Code: Bad RIP value.
[  

Re: [PATCH v2 5/8] drm/amdgpu: Refactor sysfs removal

2020-06-23 Thread Greg KH
On Tue, Jun 23, 2020 at 12:51:00AM -0400, Andrey Grodzovsky wrote:
> 
> On 6/22/20 12:45 PM, Greg KH wrote:
> > On Mon, Jun 22, 2020 at 12:07:25PM -0400, Andrey Grodzovsky wrote:
> > > On 6/22/20 7:21 AM, Greg KH wrote:
> > > > On Mon, Jun 22, 2020 at 11:51:24AM +0200, Daniel Vetter wrote:
> > > > > On Sun, Jun 21, 2020 at 02:03:05AM -0400, Andrey Grodzovsky wrote:
> > > > > > Track sysfs files in a list so they all can be removed during pci 
> > > > > > remove
> > > > > > since otherwise their removal after that causes crash because parent
> > > > > > folder was already removed during pci remove.
> > > > Huh?  That should not happen, do you have a backtrace of that crash?
> > > 
> > > 2 examples in the attached trace.
> > Odd, how did you trigger these?
> 
> 
> By manually triggering PCI remove from sysfs
> 
> cd /sys/bus/pci/devices/\:05\:00.0 && echo 1 > remove

For some reason, I didn't think that video/drm devices could handle
hot-remove like this.  The "old" PCI hotplug specification explicitly
said that video devices were not supported, has that changed?

And this whole issue is probably tied to the larger issue that Daniel
was asking me about, when it came to device lifetimes and the drm layer,
so odds are we need to fix that up first before worrying about trying to
support this crazy request, right?  :)

> > > [  925.738225 <0.188086>] BUG: kernel NULL pointer dereference, 
> > > address: 0090
> > > [  925.738232 <0.07>] #PF: supervisor read access in kernel mode
> > > [  925.738236 <0.04>] #PF: error_code(0x) - not-present page
> > > [  925.738240 <0.04>] PGD 0 P4D 0
> > > [  925.738245 <0.05>] Oops:  [#1] SMP PTI
> > > [  925.738249 <0.04>] CPU: 7 PID: 2547 Comm: amdgpu_test Tainted: 
> > > GW  OE 5.5.0-rc7-dev-kfd+ #50
> > > [  925.738256 <0.07>] Hardware name: System manufacturer System 
> > > Product Name/RAMPAGE IV FORMULA, BIOS 4804 12/30/2013
> > > [  925.738266 <0.10>] RIP: 0010:kernfs_find_ns+0x18/0x110
> > > [  925.738270 <0.04>] Code: a6 cf ff 0f 1f 44 00 00 66 2e 0f 1f 
> > > 84 00 00 00 00 00 66 66 66 66 90 41 57 41 56 49 89 f6 41 55 41 54 49 89 
> > > fd 55 53 49 89 d4 <0f> b7 af 90 00 00 00 8b 05 8f ee 6b 01 48 8b 5f 68 66 
> > > 83 e5 20 41
> > > [  925.738282 <0.12>] RSP: 0018:ad6d0118fb00 EFLAGS: 00010246
> > > [  925.738287 <0.05>] RAX:  RBX:  
> > > RCX: 2098a12076864b7e
> > > [  925.738292 <0.05>] RDX:  RSI: b6606b31 
> > > RDI: 
> > > [  925.738297 <0.05>] RBP: b6606b31 R08: b5379d10 
> > > R09: 
> > > [  925.738302 <0.05>] R10: ad6d0118fb38 R11: 9a75f64820a8 
> > > R12: 
> > > [  925.738307 <0.05>] R13:  R14: b6606b31 
> > > R15: 9a7612b06130
> > > [  925.738313 <0.06>] FS:  7f3eca4e8700() 
> > > GS:9a763dbc() knlGS:
> > > [  925.738319 <0.06>] CS:  0010 DS:  ES:  CR0: 
> > > 80050033
> > > [  925.738323 <0.04>] CR2: 0090 CR3: 35e5a005 
> > > CR4: 000606e0
> > > [  925.738329 <0.06>] Call Trace:
> > > [  925.738334 <0.05>]  kernfs_find_and_get_ns+0x2e/0x50
> > > [  925.738339 <0.05>]  sysfs_remove_group+0x25/0x80
> > > [  925.738344 <0.05>]  sysfs_remove_groups+0x29/0x40
> > > [  925.738350 <0.06>]  free_msi_irqs+0xf5/0x190
> > > [  925.738354 <0.04>]  pci_disable_msi+0xe9/0x120
> > So the PCI core is trying to clean up attributes that it had registered,
> > which is fine.  But we can't seem to find the attributes?  Were they
> > already removed somewhere else?
> > 
> > that's odd.
> 
> 
> Yes, as i pointed above i am emulating device remove from sysfs and this
> triggers pci device remove sequence and as part of that my specific device
> folder (05:00.0) is removed from the sysfs tree.

But why are things being removed twice?

> > > [  925.738406 <0.52>]  amdgpu_irq_fini+0xe3/0xf0 [amdgpu]
> > > [  925.738453 <0.47>]  tonga_ih_sw_fini+0xe/0x30 [amdgpu]
> > > [  925.738490 <0.37>]  amdgpu_device_fini_late+0x14b/0x440 
> > > [amdgpu]
> > > [  925.738529 <0.39>]  amdgpu_driver_release_kms+0x16/0x40 
> > > [amdgpu]
> > > [  925.738548 <0.19>]  drm_dev_put+0x5b/0x80 [drm]
> > > [  925.738558 <0.10>]  drm_release+0xc6/0xd0 [drm]
> > > [  925.738563 <0.05>]  __fput+0xc6/0x260
> > > [  925.738568 <0.05>]  task_work_run+0x79/0xb0
> > > [  925.738573 <0.05>]  do_exit+0x3d0/0xc60
> > > [  925.738578 <0.05>]  do_group_exit+0x47/0xb0
> > > [  925.738583 <0.05>]  get_signal+0x18b/0xc30
> > > [  925.738589 <0.06>]  do_signal+0x36/0x6a0
> > > [  925.738593 <0.04>]  ? force_sig_info_to_task+0xbc/0xd0
> > > [  925.738597 <

Re: [PATCH v2 5/8] drm/amdgpu: Refactor sysfs removal

2020-06-22 Thread Andrey Grodzovsky


On 6/22/20 12:45 PM, Greg KH wrote:

On Mon, Jun 22, 2020 at 12:07:25PM -0400, Andrey Grodzovsky wrote:

On 6/22/20 7:21 AM, Greg KH wrote:

On Mon, Jun 22, 2020 at 11:51:24AM +0200, Daniel Vetter wrote:

On Sun, Jun 21, 2020 at 02:03:05AM -0400, Andrey Grodzovsky wrote:

Track sysfs files in a list so they can all be removed during pci remove,
since otherwise removing them after that causes a crash because the parent
folder was already removed during pci remove.

Huh?  That should not happen, do you have a backtrace of that crash?


2 examples in the attached trace.

Odd, how did you trigger these?



By manually triggering PCI remove from sysfs

cd /sys/bus/pci/devices/\:05\:00.0 && echo 1 > remove






[  925.738225 <0.188086>] BUG: kernel NULL pointer dereference, address: 
0090
[  925.738232 <0.07>] #PF: supervisor read access in kernel mode
[  925.738236 <0.04>] #PF: error_code(0x) - not-present page
[  925.738240 <0.04>] PGD 0 P4D 0
[  925.738245 <0.05>] Oops:  [#1] SMP PTI
[  925.738249 <0.04>] CPU: 7 PID: 2547 Comm: amdgpu_test Tainted: G 
   W  OE 5.5.0-rc7-dev-kfd+ #50
[  925.738256 <0.07>] Hardware name: System manufacturer System Product 
Name/RAMPAGE IV FORMULA, BIOS 4804 12/30/2013
[  925.738266 <0.10>] RIP: 0010:kernfs_find_ns+0x18/0x110
[  925.738270 <0.04>] Code: a6 cf ff 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 
00 00 66 66 66 66 90 41 57 41 56 49 89 f6 41 55 41 54 49 89 fd 55 53 49 89 d4 <0f> b7 
af 90 00 00 00 8b 05 8f ee 6b 01 48 8b 5f 68 66 83 e5 20 41
[  925.738282 <0.12>] RSP: 0018:ad6d0118fb00 EFLAGS: 00010246
[  925.738287 <0.05>] RAX:  RBX:  RCX: 
2098a12076864b7e
[  925.738292 <0.05>] RDX:  RSI: b6606b31 RDI: 

[  925.738297 <0.05>] RBP: b6606b31 R08: b5379d10 R09: 

[  925.738302 <0.05>] R10: ad6d0118fb38 R11: 9a75f64820a8 R12: 

[  925.738307 <0.05>] R13:  R14: b6606b31 R15: 
9a7612b06130
[  925.738313 <0.06>] FS:  7f3eca4e8700() 
GS:9a763dbc() knlGS:
[  925.738319 <0.06>] CS:  0010 DS:  ES:  CR0: 80050033
[  925.738323 <0.04>] CR2: 0090 CR3: 35e5a005 CR4: 
000606e0
[  925.738329 <0.06>] Call Trace:
[  925.738334 <0.05>]  kernfs_find_and_get_ns+0x2e/0x50
[  925.738339 <0.05>]  sysfs_remove_group+0x25/0x80
[  925.738344 <0.05>]  sysfs_remove_groups+0x29/0x40
[  925.738350 <0.06>]  free_msi_irqs+0xf5/0x190
[  925.738354 <0.04>]  pci_disable_msi+0xe9/0x120

So the PCI core is trying to clean up attributes that it had registered,
which is fine.  But we can't seem to find the attributes?  Were they
already removed somewhere else?

that's odd.



Yes, as I pointed out above, I am emulating device remove from sysfs, and this
triggers the pci device remove sequence; as part of that, my specific device
folder (05:00.0) is removed from the sysfs tree.






[  925.738406 <0.52>]  amdgpu_irq_fini+0xe3/0xf0 [amdgpu]
[  925.738453 <0.47>]  tonga_ih_sw_fini+0xe/0x30 [amdgpu]
[  925.738490 <0.37>]  amdgpu_device_fini_late+0x14b/0x440 [amdgpu]
[  925.738529 <0.39>]  amdgpu_driver_release_kms+0x16/0x40 [amdgpu]
[  925.738548 <0.19>]  drm_dev_put+0x5b/0x80 [drm]
[  925.738558 <0.10>]  drm_release+0xc6/0xd0 [drm]
[  925.738563 <0.05>]  __fput+0xc6/0x260
[  925.738568 <0.05>]  task_work_run+0x79/0xb0
[  925.738573 <0.05>]  do_exit+0x3d0/0xc60
[  925.738578 <0.05>]  do_group_exit+0x47/0xb0
[  925.738583 <0.05>]  get_signal+0x18b/0xc30
[  925.738589 <0.06>]  do_signal+0x36/0x6a0
[  925.738593 <0.04>]  ? force_sig_info_to_task+0xbc/0xd0
[  925.738597 <0.04>]  ? signal_wake_up_state+0x15/0x30
[  925.738603 <0.06>]  exit_to_usermode_loop+0x6f/0xc0
[  925.738608 <0.05>]  prepare_exit_to_usermode+0xc7/0x110
[  925.738613 <0.05>]  ret_from_intr+0x25/0x35
[  925.738617 <0.04>] RIP: 0033:0x417369
[  925.738621 <0.04>] Code: Bad RIP value.
[  925.738625 <0.04>] RSP: 002b:7ffdd6bf0900 EFLAGS: 00010246
[  925.738629 <0.04>] RAX: 7f3eca509000 RBX: 001e RCX: 
7f3ec95ba260
[  925.738634 <0.05>] RDX: 7f3ec9889790 RSI: 000a RDI: 

[  925.738639 <0.05>] RBP: 7ffdd6bf0990 R08: 7f3ec9889780 R09: 
7f3eca4e8700
[  925.738645 <0.06>] R10: 035c R11: 0246 R12: 
021c6170
[  925.738650 <0.05>] R13: 7ffdd6bf0c00 R14:  R15: 





[   40.880899 <0.04>] BUG: kernel NULL pointer dereference, address: 
0090
[   40.880906 <0.07>] 

Re: [PATCH v2 5/8] drm/amdgpu: Refactor sysfs removal

2020-06-22 Thread Greg KH
On Mon, Jun 22, 2020 at 12:07:25PM -0400, Andrey Grodzovsky wrote:
> 
> On 6/22/20 7:21 AM, Greg KH wrote:
> > On Mon, Jun 22, 2020 at 11:51:24AM +0200, Daniel Vetter wrote:
> > > On Sun, Jun 21, 2020 at 02:03:05AM -0400, Andrey Grodzovsky wrote:
> > > > Track sysfs files in a list so they all can be removed during pci remove
> > > > since otherwise their removal after that causes crash because parent
> > > > folder was already removed during pci remove.
> > Huh?  That should not happen, do you have a backtrace of that crash?
> 
> 
> 2 examples in the attached trace.

Odd, how did you trigger these?


> [  925.738225 <0.188086>] BUG: kernel NULL pointer dereference, address: 
> 0090
> [  925.738232 <0.07>] #PF: supervisor read access in kernel mode
> [  925.738236 <0.04>] #PF: error_code(0x) - not-present page
> [  925.738240 <0.04>] PGD 0 P4D 0 
> [  925.738245 <0.05>] Oops:  [#1] SMP PTI
> [  925.738249 <0.04>] CPU: 7 PID: 2547 Comm: amdgpu_test Tainted: G   
>  W  OE 5.5.0-rc7-dev-kfd+ #50
> [  925.738256 <0.07>] Hardware name: System manufacturer System 
> Product Name/RAMPAGE IV FORMULA, BIOS 4804 12/30/2013
> [  925.738266 <0.10>] RIP: 0010:kernfs_find_ns+0x18/0x110
> [  925.738270 <0.04>] Code: a6 cf ff 0f 1f 44 00 00 66 2e 0f 1f 84 00 
> 00 00 00 00 66 66 66 66 90 41 57 41 56 49 89 f6 41 55 41 54 49 89 fd 55 53 49 
> 89 d4 <0f> b7 af 90 00 00 00 8b 05 8f ee 6b 01 48 8b 5f 68 66 83 e5 20 41
> [  925.738282 <0.12>] RSP: 0018:ad6d0118fb00 EFLAGS: 00010246
> [  925.738287 <0.05>] RAX:  RBX:  
> RCX: 2098a12076864b7e
> [  925.738292 <0.05>] RDX:  RSI: b6606b31 
> RDI: 
> [  925.738297 <0.05>] RBP: b6606b31 R08: b5379d10 
> R09: 
> [  925.738302 <0.05>] R10: ad6d0118fb38 R11: 9a75f64820a8 
> R12: 
> [  925.738307 <0.05>] R13:  R14: b6606b31 
> R15: 9a7612b06130
> [  925.738313 <0.06>] FS:  7f3eca4e8700() 
> GS:9a763dbc() knlGS:
> [  925.738319 <0.06>] CS:  0010 DS:  ES:  CR0: 
> 80050033
> [  925.738323 <0.04>] CR2: 0090 CR3: 35e5a005 
> CR4: 000606e0
> [  925.738329 <0.06>] Call Trace:
> [  925.738334 <0.05>]  kernfs_find_and_get_ns+0x2e/0x50
> [  925.738339 <0.05>]  sysfs_remove_group+0x25/0x80
> [  925.738344 <0.05>]  sysfs_remove_groups+0x29/0x40
> [  925.738350 <0.06>]  free_msi_irqs+0xf5/0x190
> [  925.738354 <0.04>]  pci_disable_msi+0xe9/0x120

So the PCI core is trying to clean up attributes that it had registered,
which is fine.  But we can't seem to find the attributes?  Were they
already removed somewhere else?

that's odd.

> [  925.738406 <0.52>]  amdgpu_irq_fini+0xe3/0xf0 [amdgpu]
> [  925.738453 <0.47>]  tonga_ih_sw_fini+0xe/0x30 [amdgpu]
> [  925.738490 <0.37>]  amdgpu_device_fini_late+0x14b/0x440 [amdgpu]
> [  925.738529 <0.39>]  amdgpu_driver_release_kms+0x16/0x40 [amdgpu]
> [  925.738548 <0.19>]  drm_dev_put+0x5b/0x80 [drm]
> [  925.738558 <0.10>]  drm_release+0xc6/0xd0 [drm]
> [  925.738563 <0.05>]  __fput+0xc6/0x260
> [  925.738568 <0.05>]  task_work_run+0x79/0xb0
> [  925.738573 <0.05>]  do_exit+0x3d0/0xc60
> [  925.738578 <0.05>]  do_group_exit+0x47/0xb0
> [  925.738583 <0.05>]  get_signal+0x18b/0xc30
> [  925.738589 <0.06>]  do_signal+0x36/0x6a0
> [  925.738593 <0.04>]  ? force_sig_info_to_task+0xbc/0xd0
> [  925.738597 <0.04>]  ? signal_wake_up_state+0x15/0x30
> [  925.738603 <0.06>]  exit_to_usermode_loop+0x6f/0xc0
> [  925.738608 <0.05>]  prepare_exit_to_usermode+0xc7/0x110
> [  925.738613 <0.05>]  ret_from_intr+0x25/0x35
> [  925.738617 <0.04>] RIP: 0033:0x417369
> [  925.738621 <0.04>] Code: Bad RIP value.
> [  925.738625 <0.04>] RSP: 002b:7ffdd6bf0900 EFLAGS: 00010246
> [  925.738629 <0.04>] RAX: 7f3eca509000 RBX: 001e RCX: 7f3ec95ba260
> [  925.738634 <0.05>] RDX: 7f3ec9889790 RSI: 000a RDI: 
> [  925.738639 <0.05>] RBP: 7ffdd6bf0990 R08: 7f3ec9889780 R09: 7f3eca4e8700
> [  925.738645 <0.06>] R10: 035c R11: 0246 R12: 021c6170
> [  925.738650 <0.05>] R13: 7ffdd6bf0c00 R14:  R15: 
> 
> 
> 
> 
> [   40.880899 <0.04>] BUG: kernel NULL pointer dereference, address: 0090
> [   40.880906 <0.07>] #PF: supervisor read access in kernel mode
> [   40.880910 <0.04>] #PF: error_code(0x) - not-present page
> [   40.880915 <0.05>] PGD 0 P4D 0 
> 

Re: [PATCH v2 5/8] drm/amdgpu: Refactor sysfs removal

2020-06-22 Thread Andrey Grodzovsky


On 6/22/20 7:21 AM, Greg KH wrote:

On Mon, Jun 22, 2020 at 11:51:24AM +0200, Daniel Vetter wrote:

On Sun, Jun 21, 2020 at 02:03:05AM -0400, Andrey Grodzovsky wrote:

Track sysfs files in a list so they can all be removed during pci remove,
since otherwise removing them afterwards crashes because the parent
folder was already removed during pci remove.

Huh?  That should not happen, do you have a backtrace of that crash?



2 examples in the attached trace.

Andrey





Signed-off-by: Andrey Grodzovsky 

Uh I thought sysfs just gets yanked completely. Please check with Greg KH
whether hand-rolling all this really is the right solution here ... Feels
very wrong. I thought this was all supposed to work by adding attributes
before publishing the sysfs node, and then letting sysfs clean up
everything. Not by cleaning up manually yourself.

Yes, that is supposed to be the correct thing to do.


Adding Greg for an authoritative answer.
-Daniel


---
  drivers/gpu/drm/amd/amdgpu/amdgpu.h  | 13 +++
  drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c |  7 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c   | 35 
  drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c  | 12 ++
  drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c  |  8 ++-
  drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c | 17 --
  drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c | 13 ++-
  drivers/gpu/drm/amd/amdgpu/df_v3_6.c | 10 +---
  8 files changed, 99 insertions(+), 16 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 604a681..ba3775f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -726,6 +726,15 @@ struct amd_powerplay {
  
  #define AMDGPU_RESET_MAGIC_NUM 64

  #define AMDGPU_MAX_DF_PERFMONS 4
+
+struct amdgpu_sysfs_list_node {
+   struct list_head head;
+   struct device_attribute *attr;
+};

You know we have lists of attributes already, called attribute groups,
if you really wanted to do something like this.  But, I don't think so.

Either way, don't hand-roll your own stuff that the driver core has
provided for you for a decade or more, that's just foolish :)


+
+#define AMDGPU_DEVICE_ATTR_LIST_NODE(_attr) \
+   struct amdgpu_sysfs_list_node dev_attr_handle_##_attr = {.attr = &dev_attr_##_attr}
+
  struct amdgpu_device {
struct device   *dev;
struct drm_device   *ddev;
@@ -992,6 +1001,10 @@ struct amdgpu_device {
charproduct_number[16];
charproduct_name[32];
charserial[16];
+
+   struct list_head sysfs_files_list;
+   struct mutex sysfs_files_list_lock;
+
  };
  
  static inline struct amdgpu_device *amdgpu_ttm_adev(struct ttm_bo_device *bdev)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c
index fdd52d8..c1549ee 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c
@@ -1950,8 +1950,10 @@ static ssize_t amdgpu_atombios_get_vbios_version(struct 
device *dev,
return snprintf(buf, PAGE_SIZE, "%s\n", ctx->vbios_version);
  }
  
+

  static DEVICE_ATTR(vbios_version, 0444, amdgpu_atombios_get_vbios_version,
   NULL);
+static AMDGPU_DEVICE_ATTR_LIST_NODE(vbios_version);
  
  /**

   * amdgpu_atombios_fini - free the driver info and callbacks for atombios
@@ -1972,7 +1974,6 @@ void amdgpu_atombios_fini(struct amdgpu_device *adev)
adev->mode_info.atom_context = NULL;
kfree(adev->mode_info.atom_card_info);
adev->mode_info.atom_card_info = NULL;
-   device_remove_file(adev->dev, &dev_attr_vbios_version);
  }
  
  /**

@@ -2038,6 +2039,10 @@ int amdgpu_atombios_init(struct amdgpu_device *adev)
return ret;
}
  
+	mutex_lock(&adev->sysfs_files_list_lock);
+	list_add_tail(&dev_attr_handle_vbios_version.head, &adev->sysfs_files_list);
+	mutex_unlock(&adev->sysfs_files_list_lock);
+
return 0;
  }
  
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

index e7b9065..3173046 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2928,6 +2928,12 @@ static const struct attribute *amdgpu_dev_attributes[] = 
{
NULL
  };
  
+static AMDGPU_DEVICE_ATTR_LIST_NODE(product_name);

+static AMDGPU_DEVICE_ATTR_LIST_NODE(product_number);
+static AMDGPU_DEVICE_ATTR_LIST_NODE(serial_number);
+static AMDGPU_DEVICE_ATTR_LIST_NODE(pcie_replay_count);
+
+
  /**
   * amdgpu_device_init - initialize the driver
   *
@@ -3029,6 +3035,9 @@ int amdgpu_device_init(struct amdgpu_device *adev,
INIT_LIST_HEAD(&adev->shadow_list);
mutex_init(&adev->shadow_list_lock);
  
+	INIT_LIST_HEAD(&adev->sysfs_files_list);
+	mutex_init(&adev->sysfs_files_list_lock);
+
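
The hunk that actually walks this list at pci remove time is not among the
quoted chunks; presumably it is something along the lines of the following
sketch (an assumption for illustration, not code from the patch):

/* Hypothetical drain loop, using the list/lock added above. */
static void amdgpu_sysfs_files_fini(struct amdgpu_device *adev)
{
	struct amdgpu_sysfs_list_node *node, *tmp;

	mutex_lock(&adev->sysfs_files_list_lock);
	list_for_each_entry_safe(node, tmp, &adev->sysfs_files_list, head) {
		device_remove_file(adev->dev, node->attr);
		list_del(&node->head);
	}
	mutex_unlock(&adev->sysfs_files_list_lock);
}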
  

Re: [PATCH v2 5/8] drm/amdgpu: Refactor sysfs removal

2020-06-22 Thread Christian König

Am 21.06.20 um 08:03 schrieb Andrey Grodzovsky:

Track sysfs files in a list so they can all be removed during pci remove,
since otherwise removing them afterwards crashes because the parent
folder was already removed during pci remove.


That looks extremely fishy to me.

It sounds like we just don't remove stuff in the right order.

Christian.



Signed-off-by: Andrey Grodzovsky 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu.h  | 13 +++
  drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c |  7 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c   | 35 
  drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c  | 12 ++
  drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c  |  8 ++-
  drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c | 17 --
  drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c | 13 ++-
  drivers/gpu/drm/amd/amdgpu/df_v3_6.c | 10 +---
  8 files changed, 99 insertions(+), 16 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 604a681..ba3775f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -726,6 +726,15 @@ struct amd_powerplay {
  
  #define AMDGPU_RESET_MAGIC_NUM 64

  #define AMDGPU_MAX_DF_PERFMONS 4
+
+struct amdgpu_sysfs_list_node {
+   struct list_head head;
+   struct device_attribute *attr;
+};
+
+#define AMDGPU_DEVICE_ATTR_LIST_NODE(_attr) \
+   struct amdgpu_sysfs_list_node dev_attr_handle_##_attr = {.attr = &dev_attr_##_attr}
+
  struct amdgpu_device {
struct device   *dev;
struct drm_device   *ddev;
@@ -992,6 +1001,10 @@ struct amdgpu_device {
charproduct_number[16];
charproduct_name[32];
charserial[16];
+
+   struct list_head sysfs_files_list;
+   struct mutex sysfs_files_list_lock;
+
  };
  
  static inline struct amdgpu_device *amdgpu_ttm_adev(struct ttm_bo_device *bdev)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c
index fdd52d8..c1549ee 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c
@@ -1950,8 +1950,10 @@ static ssize_t amdgpu_atombios_get_vbios_version(struct 
device *dev,
return snprintf(buf, PAGE_SIZE, "%s\n", ctx->vbios_version);
  }
  
+

  static DEVICE_ATTR(vbios_version, 0444, amdgpu_atombios_get_vbios_version,
   NULL);
+static AMDGPU_DEVICE_ATTR_LIST_NODE(vbios_version);
  
  /**

   * amdgpu_atombios_fini - free the driver info and callbacks for atombios
@@ -1972,7 +1974,6 @@ void amdgpu_atombios_fini(struct amdgpu_device *adev)
adev->mode_info.atom_context = NULL;
kfree(adev->mode_info.atom_card_info);
adev->mode_info.atom_card_info = NULL;
-   device_remove_file(adev->dev, &dev_attr_vbios_version);
  }
  
  /**

@@ -2038,6 +2039,10 @@ int amdgpu_atombios_init(struct amdgpu_device *adev)
return ret;
}
  
+	mutex_lock(&adev->sysfs_files_list_lock);
+	list_add_tail(&dev_attr_handle_vbios_version.head, &adev->sysfs_files_list);
+	mutex_unlock(&adev->sysfs_files_list_lock);
+
return 0;
  }
  
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

index e7b9065..3173046 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2928,6 +2928,12 @@ static const struct attribute *amdgpu_dev_attributes[] = 
{
NULL
  };
  
+static AMDGPU_DEVICE_ATTR_LIST_NODE(product_name);

+static AMDGPU_DEVICE_ATTR_LIST_NODE(product_number);
+static AMDGPU_DEVICE_ATTR_LIST_NODE(serial_number);
+static AMDGPU_DEVICE_ATTR_LIST_NODE(pcie_replay_count);
+
+
  /**
   * amdgpu_device_init - initialize the driver
   *
@@ -3029,6 +3035,9 @@ int amdgpu_device_init(struct amdgpu_device *adev,
INIT_LIST_HEAD(&adev->shadow_list);
mutex_init(&adev->shadow_list_lock);
  
+	INIT_LIST_HEAD(&adev->sysfs_files_list);
+	mutex_init(&adev->sysfs_files_list_lock);
+
INIT_DELAYED_WORK(&adev->delayed_init_work,
  amdgpu_device_delayed_init_work_handler);
INIT_DELAYED_WORK(&adev->gfx.gfx_off_delay_work,
@@ -3281,6 +3290,13 @@ int amdgpu_device_init(struct amdgpu_device *adev,
if (r) {
dev_err(adev->dev, "Could not create amdgpu device attr\n");
return r;
+   } else {
+		mutex_lock(&adev->sysfs_files_list_lock);
+		list_add_tail(&dev_attr_handle_product_name.head, &adev->sysfs_files_list);
+		list_add_tail(&dev_attr_handle_product_number.head, &adev->sysfs_files_list);
+		list_add_tail(&dev_attr_handle_serial_number.head, &adev->sysfs_files_list);
+		list_add_tail(&dev_attr_handle_pcie_replay_count.head, &adev->sysfs_files_list);
+		mutex_unlock(&adev->sysfs_files_list_lock);
}
  
  	if 

Re: [PATCH v2 5/8] drm/amdgpu: Refactor sysfs removal

2020-06-22 Thread Greg KH
On Mon, Jun 22, 2020 at 11:51:24AM +0200, Daniel Vetter wrote:
> On Sun, Jun 21, 2020 at 02:03:05AM -0400, Andrey Grodzovsky wrote:
> > Track sysfs files in a list so they can all be removed during pci remove,
> > since otherwise removing them afterwards crashes because the parent
> > folder was already removed during pci remove.

Huh?  That should not happen, do you have a backtrace of that crash?

> > Signed-off-by: Andrey Grodzovsky 
> 
> Uh I thought sysfs just gets yanked completely. Please check with Greg KH
> whether hand-rolling all this really is the right solution here ... Feels
> very wrong. I thought this was all supposed to work by adding attributes
> before publishing the sysfs node, and then letting sysfs clean up
> everything. Not by cleaning up manually yourself.

Yes, that is supposed to be the correct thing to do.
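
Concretely, the pattern being described is roughly the following (a sketch,
assuming the bus passes dev_groups through to the driver core; the
amdgpu_example_* names are hypothetical):

#include <linux/device.h>
#include <linux/pci.h>
#include <linux/sysfs.h>

static ssize_t vbios_version_show(struct device *dev,
				  struct device_attribute *attr, char *buf)
{
	return snprintf(buf, PAGE_SIZE, "...\n");	/* placeholder body */
}
static DEVICE_ATTR_RO(vbios_version);

static struct attribute *amdgpu_example_dev_attrs[] = {
	&dev_attr_vbios_version.attr,
	NULL,
};
ATTRIBUTE_GROUPS(amdgpu_example_dev);	/* emits amdgpu_example_dev_groups[] */

static struct pci_driver amdgpu_example_pci_driver = {
	.name	= "amdgpu-example",
	/* the core creates these files before the device is announced and
	 * removes them on unbind, so the driver never calls sysfs_* itself */
	.driver	= { .dev_groups = amdgpu_example_dev_groups },
};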

> 
> Adding Greg for an authoritative answer.
> -Daniel
> 
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu.h  | 13 +++
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c |  7 +-
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c   | 35 
> > 
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c  | 12 ++
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c  |  8 ++-
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c | 17 --
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c | 13 ++-
> >  drivers/gpu/drm/amd/amdgpu/df_v3_6.c | 10 +---
> >  8 files changed, 99 insertions(+), 16 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > index 604a681..ba3775f 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > @@ -726,6 +726,15 @@ struct amd_powerplay {
> >  
> >  #define AMDGPU_RESET_MAGIC_NUM 64
> >  #define AMDGPU_MAX_DF_PERFMONS 4
> > +
> > +struct amdgpu_sysfs_list_node {
> > +   struct list_head head;
> > +   struct device_attribute *attr;
> > +};

You know we have lists of attributes already, called attribute groups,
if you really wanted to do something like this.  But, I don't think so.

Either way, don't hand-roll your own stuff that the driver core has
provided for you for a decade or more, that's just foolish :)

> > +
> > +#define AMDGPU_DEVICE_ATTR_LIST_NODE(_attr) \
> > +   struct amdgpu_sysfs_list_node dev_attr_handle_##_attr = {.attr = &dev_attr_##_attr}
> > +
> >  struct amdgpu_device {
> > struct device   *dev;
> > struct drm_device   *ddev;
> > @@ -992,6 +1001,10 @@ struct amdgpu_device {
> > charproduct_number[16];
> > charproduct_name[32];
> > charserial[16];
> > +
> > +   struct list_head sysfs_files_list;
> > +   struct mutex sysfs_files_list_lock;
> > +
> >  };
> >  
> >  static inline struct amdgpu_device *amdgpu_ttm_adev(struct ttm_bo_device 
> > *bdev)
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c 
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c
> > index fdd52d8..c1549ee 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c
> > @@ -1950,8 +1950,10 @@ static ssize_t 
> > amdgpu_atombios_get_vbios_version(struct device *dev,
> > return snprintf(buf, PAGE_SIZE, "%s\n", ctx->vbios_version);
> >  }
> >  
> > +
> >  static DEVICE_ATTR(vbios_version, 0444, amdgpu_atombios_get_vbios_version,
> >NULL);
> > +static AMDGPU_DEVICE_ATTR_LIST_NODE(vbios_version);
> >  
> >  /**
> >   * amdgpu_atombios_fini - free the driver info and callbacks for atombios
> > @@ -1972,7 +1974,6 @@ void amdgpu_atombios_fini(struct amdgpu_device *adev)
> > adev->mode_info.atom_context = NULL;
> > kfree(adev->mode_info.atom_card_info);
> > adev->mode_info.atom_card_info = NULL;
> > -   device_remove_file(adev->dev, &dev_attr_vbios_version);
> >  }
> >  
> >  /**
> > @@ -2038,6 +2039,10 @@ int amdgpu_atombios_init(struct amdgpu_device *adev)
> > return ret;
> > }
> >  
> > +   mutex_lock(&adev->sysfs_files_list_lock);
> > +   list_add_tail(&dev_attr_handle_vbios_version.head, &adev->sysfs_files_list);
> > +   mutex_unlock(&adev->sysfs_files_list_lock);
> > +
> > return 0;
> >  }
> >  
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index e7b9065..3173046 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -2928,6 +2928,12 @@ static const struct attribute 
> > *amdgpu_dev_attributes[] = {
> > NULL
> >  };
> >  
> > +static AMDGPU_DEVICE_ATTR_LIST_NODE(product_name);
> > +static AMDGPU_DEVICE_ATTR_LIST_NODE(product_number);
> > +static AMDGPU_DEVICE_ATTR_LIST_NODE(serial_number);
> > +static AMDGPU_DEVICE_ATTR_LIST_NODE(pcie_replay_count);
> > +
> > +
> >  /**
> >   * amdgpu_device_init - initialize the driver
> >   *
> > @@ 

Re: [PATCH v2 5/8] drm/amdgpu: Refactor sysfs removal

2020-06-22 Thread Daniel Vetter
On Sun, Jun 21, 2020 at 02:03:05AM -0400, Andrey Grodzovsky wrote:
> Track sysfs files in a list so they can all be removed during pci remove,
> since otherwise removing them afterwards crashes because the parent
> folder was already removed during pci remove.
> 
> Signed-off-by: Andrey Grodzovsky 

Uh I thought sysfs just gets yanked completely. Please check with Greg KH
whether hand-rolling all this really is the right solution here ... Feels
very wrong. I thought this was all supposed to work by adding attributes
before publishing the sysfs node, and then letting sysfs clean up
everything. Not by cleaning up manually yourself.

Adding Greg for an authoritative answer.
-Daniel

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h  | 13 +++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c |  7 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c   | 35 
> 
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c  | 12 ++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c  |  8 ++-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c | 17 --
>  drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c | 13 ++-
>  drivers/gpu/drm/amd/amdgpu/df_v3_6.c | 10 +---
>  8 files changed, 99 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 604a681..ba3775f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -726,6 +726,15 @@ struct amd_powerplay {
>  
>  #define AMDGPU_RESET_MAGIC_NUM 64
>  #define AMDGPU_MAX_DF_PERFMONS 4
> +
> +struct amdgpu_sysfs_list_node {
> + struct list_head head;
> + struct device_attribute *attr;
> +};
> +
> +#define AMDGPU_DEVICE_ATTR_LIST_NODE(_attr) \
> + struct amdgpu_sysfs_list_node dev_attr_handle_##_attr = {.attr = &dev_attr_##_attr}
> +
>  struct amdgpu_device {
>   struct device   *dev;
>   struct drm_device   *ddev;
> @@ -992,6 +1001,10 @@ struct amdgpu_device {
>   charproduct_number[16];
>   charproduct_name[32];
>   charserial[16];
> +
> + struct list_head sysfs_files_list;
> + struct mutex sysfs_files_list_lock;
> +
>  };
>  
>  static inline struct amdgpu_device *amdgpu_ttm_adev(struct ttm_bo_device 
> *bdev)
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c
> index fdd52d8..c1549ee 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c
> @@ -1950,8 +1950,10 @@ static ssize_t 
> amdgpu_atombios_get_vbios_version(struct device *dev,
>   return snprintf(buf, PAGE_SIZE, "%s\n", ctx->vbios_version);
>  }
>  
> +
>  static DEVICE_ATTR(vbios_version, 0444, amdgpu_atombios_get_vbios_version,
>  NULL);
> +static AMDGPU_DEVICE_ATTR_LIST_NODE(vbios_version);
>  
>  /**
>   * amdgpu_atombios_fini - free the driver info and callbacks for atombios
> @@ -1972,7 +1974,6 @@ void amdgpu_atombios_fini(struct amdgpu_device *adev)
>   adev->mode_info.atom_context = NULL;
>   kfree(adev->mode_info.atom_card_info);
>   adev->mode_info.atom_card_info = NULL;
> - device_remove_file(adev->dev, &dev_attr_vbios_version);
>  }
>  
>  /**
> @@ -2038,6 +2039,10 @@ int amdgpu_atombios_init(struct amdgpu_device *adev)
>   return ret;
>   }
>  
> + mutex_lock(&adev->sysfs_files_list_lock);
> + list_add_tail(&dev_attr_handle_vbios_version.head, &adev->sysfs_files_list);
> + mutex_unlock(&adev->sysfs_files_list_lock);
> +
>   return 0;
>  }
>  
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index e7b9065..3173046 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -2928,6 +2928,12 @@ static const struct attribute *amdgpu_dev_attributes[] 
> = {
>   NULL
>  };
>  
> +static AMDGPU_DEVICE_ATTR_LIST_NODE(product_name);
> +static AMDGPU_DEVICE_ATTR_LIST_NODE(product_number);
> +static AMDGPU_DEVICE_ATTR_LIST_NODE(serial_number);
> +static AMDGPU_DEVICE_ATTR_LIST_NODE(pcie_replay_count);
> +
> +
>  /**
>   * amdgpu_device_init - initialize the driver
>   *
> @@ -3029,6 +3035,9 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>   INIT_LIST_HEAD(&adev->shadow_list);
>   mutex_init(&adev->shadow_list_lock);
>  
> + INIT_LIST_HEAD(&adev->sysfs_files_list);
> + mutex_init(&adev->sysfs_files_list_lock);
> +
>   INIT_DELAYED_WORK(&adev->delayed_init_work,
> amdgpu_device_delayed_init_work_handler);
>   INIT_DELAYED_WORK(&adev->gfx.gfx_off_delay_work,
> @@ -3281,6 +3290,13 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>   if (r) {
>   dev_err(adev->dev, "Could not create amdgpu device attr\n");
>   return r;
> + } else {
> +