Re: [RFC PATCH] pci: Proof of concept at fixing pci_enable_device/bridge races

2018-08-19 Thread Guenter Roeck
On Thu, Aug 16, 2018 at 09:38:41AM +1000, Benjamin Herrenschmidt wrote:
> On Wed, 2018-08-15 at 15:40 -0700, Guenter Roeck wrote:
> > On Thu, Aug 16, 2018 at 07:50:13AM +1000, Benjamin Herrenschmidt wrote:
> > > (Resent with lkml on copy)
> > > 
> > > [Note: This isn't meant to be merged, it needs splitting at the very
> > > least, see below]
> > > 
> > > This is something I cooked up quickly today to test whether it would fix
> > > my problems with a large number of switch and NVME devices on POWER.
> > > 
> > 
> > Is that a problem that can be reproduced with a qemu setup ?
> 
> With difficulty... mt-tcg might help, but you need a rather large
> system to reproduce it.
> 
> My repro-case is a 2-socket POWER9 system (about 40 cores off the top
> of my head, so 160 threads) with 72 NVME devices underneath a tree of
> switches (I don't have the system at hand today to check how many).
> 
> It's possible to observe it I suppose on a smaller system (in theory a
> single bridge with 2 devices is enough) but in practice the timing is
> extremely hard to hit.
> 
> You need a combination of:
> 
>   - The bridges come up disabled (which is the case when Linux does the
> resource assignment, such as on POWER but not on x86 unless it's
> hotplug)
> 
>   - The nvme devices try to enable them simultaneously
> 
> Also, the resulting error is a UR; I don't know how well qemu models
> that.
> 
Not well enough, apparently. I tried for a while, registering as many
nvme drives as the system would take, but I was not able to reproduce
the problem with qemu. It was worth a try, though.

Guenter


Re: [RFC PATCH] pci: Proof of concept at fixing pci_enable_device/bridge races

2018-08-16 Thread Benjamin Herrenschmidt
On Thu, 2018-08-16 at 22:07 -0500, Bjorn Helgaas wrote:
> [+cc Srinath, Guenter, Jens, Lukas, Konstantin, Marta, Pierre-Yves]
> 
> On Thu, Aug 16, 2018 at 07:50:13AM +1000, Benjamin Herrenschmidt wrote:
> > [Note: This isn't meant to be merged, it needs splitting at the very
> > least, see below]
> > 
> > This is something I cooked up quickly today to test whether it would fix
> > my problems with a large number of switch and NVME devices on POWER.
> > 
> > So far it does...
> > 
> > The issue (as discussed in the Re: PCIe enable device races thread) is
> > that pci_enable_device and related functions along with pci_set_master
> > and pci_enable_bridge are fundamentally racy.
> > 
> > There is no locking protecting the state of the device in pci_dev, and
> > if multiple devices under the same bridge try to enable it simultaneously,
> > some of them will potentially start accessing it before it has actually
> > been enabled.
> > 
> > Now there are a LOT more fields in pci_dev that aren't covered by any
> > form of locking.
> 
> Most of the PCI core relies on the assumption that only a single
> thread touches a device at a time.  This is generally true of the core
> during enumeration because enumeration is single-threaded.  It's
> generally true in normal driver operation because the core shouldn't
> touch a device after a driver claims it.

Mostly :-) There are a few exceptions though.

> But there are several exceptions, and I think we need to understand
> those scenarios before applying locks willy-nilly.

We need to stop creating ad-hoc locks. We have a good handle already on
the main enable/disable and bus master scenario, and the race with
is_added.

Ignore the patch itself, it has at least 2 bugs with PM, I'll send a
series improving things a bit later.

> One big exception is that enabling device A may also touch an upstream
> bridge B.  That causes things like your issue and Srinath's issue
> where drivers simultaneously enabling two devices below the same
> bridge corrupt the bridge's state [1].  Marta reported essentially the
> same issue [2].
> 
> Hari's issue [3] seems related to a race between a driver work queue
> and the core enumerating the device.  I should have pushed harder to
> understand this; I feel like we papered over the immediate problem
> without clearing up the larger issue of why the core enumeration path
> (pci_bus_add_device()) is running at the same time as a driver.

It's not. What is happening is that is_added is set by
pci_bus_add_device() after it has bound the driver. An easy fix would
have been to move it up instead:

diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
index 737d1c52f002..ff4d536d43fc 100644
--- a/drivers/pci/bus.c
+++ b/drivers/pci/bus.c
@@ -323,16 +323,16 @@ void pci_bus_add_device(struct pci_dev *dev)
pci_proc_attach_device(dev);
pci_bridge_d3_update(dev);
 
+   dev->is_added = 1;
dev->match_driver = true;
retval = device_attach(&dev->dev);
if (retval < 0 && retval != -EPROBE_DEFER) {
+   dev->is_added = 0;
pci_warn(dev, "device attach failed (%d)\n", retval);
pci_proc_detach_device(dev);
pci_remove_sysfs_dev_files(dev);
return;
}
-
-   dev->is_added = 1;
 }
 EXPORT_SYMBOL_GPL(pci_bus_add_device);

(Untested).

Note: another advantage of the above is that the current code has an
odd asymmetry: is_added is currently set after we attach but also
cleared after we detach.

If we want to keep the flag being set after attaching, then we do
indeed need to protect it against concurrent accesses to the other
fields that share its storage word.
The easiest way to do that would have been to remove the :1, like this:

-   unsigned int is_added:1;
+   unsigned int is_added;
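
To make that hazard concrete, here is a minimal stand-alone
illustration (hypothetical field names, not the actual struct pci_dev
layout) of why adjacent :1 bitfields cannot safely be written from two
threads:

struct flags {
        unsigned int is_added:1;        /* written by thread A */
        unsigned int is_busmaster:1;    /* written by thread B */
};

static struct flags f;

/*
 * Each assignment compiles to a non-atomic read-modify-write of the
 * shared storage word, so the two stores can clobber each other's bit
 * when they interleave:
 */
static void thread_a(void) { f.is_added = 1; }
static void thread_b(void) { f.is_busmaster = 1; }

Widening is_added to a plain unsigned int gives it its own storage
unit, so a store to it no longer rewrites the neighbouring flags
(though that alone still provides no ordering guarantees).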

Again, none of these approaches involves the invasive patch you merged,
which uses an atomic operation that provides a false sense of security:
you are seemingly "protected", while in fact you only protect the field
itself and provide no protection for the overall concurrency of the
callers, which might still clash in different ways.

Finally, we could also move is_added under the protection of the new
mutex I propose adding, but that would really only work as long as
we group all the :1 fields protected by that mutex together inside
struct pci_dev, so as to avoid collisions with other fields being
modified.

All of the above are preferable to what you merged.

> DPC/AER error handling adds more cases where the core potentially
> accesses devices asynchronously to the driver.
>
> User-mode sysfs controls like ASPM are also asynchronous to drivers.
> 
> Even setpci is a potential issue, though I don't know how far we want
> to go to protect it.  I think we should make setpci taint the kernel
> anyway.

I wouldn't bother too much about it.

> It might be nice if we had some sort of writeup of the locking
> strategy as a whole.
> 
> [1] 
> 

Re: [RFC PATCH] pci: Proof of concept at fixing pci_enable_device/bridge races

2018-08-16 Thread Bjorn Helgaas
[+cc Srinath, Guenter, Jens, Lukas, Konstantin, Marta, Pierre-Yves]

On Thu, Aug 16, 2018 at 07:50:13AM +1000, Benjamin Herrenschmidt wrote:
> [Note: This isn't meant to be merged, it needs splitting at the very
> least, see below]
> 
> This is something I cooked up quickly today to test whether it would fix
> my problems with a large number of switch and NVME devices on POWER.
> 
> So far it does...
> 
> The issue (as discussed in the Re: PCIe enable device races thread) is
> that pci_enable_device and related functions along with pci_set_master
> and pci_enable_bridge are fundamentally racy.
> 
> There is no locking protecting the state of the device in pci_dev, and
> if multiple devices under the same bridge try to enable it simultaneously,
> some of them will potentially start accessing it before it has actually
> been enabled.
> 
> Now there are a LOT more fields in pci_dev that aren't covered by any
> form of locking.

Most of the PCI core relies on the assumption that only a single
thread touches a device at a time.  This is generally true of the core
during enumeration because enumeration is single-threaded.  It's
generally true in normal driver operation because the core shouldn't
touch a device after a driver claims it.

But there are several exceptions, and I think we need to understand
those scenarios before applying locks willy-nilly.

One big exception is that enabling device A may also touch an upstream
bridge B.  That causes things like your issue and Srinath's issue
where drivers simultaneously enabling two devices below the same
bridge corrupt the bridge's state [1].  Marta reported essentially the
same issue [2].
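
One way that state gets corrupted is a plain lost update on the
bridge's COMMAND register: both children's enable paths perform an
unlocked read-modify-write of the same register.  A simplified sketch
of the pattern (illustrative helper, not the exact mainline code):

static void racy_command_rmw(struct pci_dev *bridge, u16 bits)
{
        u16 cmd;

        /*
         * Thread A (setting PCI_COMMAND_MEMORY) and thread B (setting
         * PCI_COMMAND_MASTER) can both read the same old value here...
         */
        pci_read_config_word(bridge, PCI_COMMAND, &cmd);
        cmd |= bits;
        /* ...and the later of the two writes drops the other's bit. */
        pci_write_config_word(bridge, PCI_COMMAND, cmd);
}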

Hari's issue [3] seems related to a race between a driver work queue
and the core enumerating the device.  I should have pushed harder to
understand this; I feel like we papered over the immediate problem
without clearing up the larger issue of why the core enumeration path
(pci_bus_add_device()) is running at the same time as a driver.

DPC/AER error handling adds more cases where the core potentially
accesses devices asynchronously to the driver.

User-mode sysfs controls like ASPM are also asynchronous to drivers.

Even setpci is a potential issue, though I don't know how far we want
to go to protect it.  I think we should make setpci taint the kernel
anyway.

It might be nice if we had some sort of writeup of the locking
strategy as a whole.

[1] https://lkml.kernel.org/r/1501858648-8-1-git-send-email-srinath.man...@broadcom.com
[2] https://lkml.kernel.org/r/744877924.5841545.1521630049567.javamail.zim...@kalray.eu
[3] https://lkml.kernel.org/r/1530608741-30664-2-git-send-email-hari.v...@broadcom.com


Re: [RFC PATCH] pci: Proof of concept at fixing pci_enable_device/bridge races

2018-08-15 Thread Benjamin Herrenschmidt
On Wed, 2018-08-15 at 15:40 -0700, Guenter Roeck wrote:
> On Thu, Aug 16, 2018 at 07:50:13AM +1000, Benjamin Herrenschmidt wrote:
> > (Resent with lkml on copy)
> > 
> > [Note: This isn't meant to be merged, it needs splitting at the very
> > least, see below]
> > 
> > This is something I cooked up quickly today to test whether it would fix
> > my problems with a large number of switch and NVME devices on POWER.
> > 
> 
> Is that a problem that can be reproduced with a qemu setup ?

With difficulty... mt-tcg might help, but you need a rather large
system to reproduce it.

My repro-case is a 2-socket POWER9 system (about 40 cores off the top
of my head, so 160 threads) with 72 NVME devices underneath a tree of
switches (I don't have the system at hand today to check how many).

It's possible to observe it I suppose on a smaller system (in theory a
single bridge with 2 devices is enough) but in practice the timing is
extremely hard to hit.

You need a combination of:

  - The bridges come up disabled (which is the case when Linux does the
resource assignment, such as on POWER but not on x86 unless it's
hotplug)

  - The nvme devices try to enable them simultaneously

Also, the resulting error is a UR; I don't know how well qemu models
that.

On the above system, I get usually *one* device failing due to the race
out of 72, and not on every boot.

However, the bug is known (see Bjorn's reply to the other thread) "Re:
PCIe enable device races (Was: [PATCH v3] PCI: Data corruption
happening due to race condition)" on linux-pci, so I'm not the only one
with a repro-case around.

Cheers,
Ben.




Re: [RFC PATCH] pci: Proof of concept at fixing pci_enable_device/bridge races

2018-08-15 Thread Guenter Roeck
On Thu, Aug 16, 2018 at 07:50:13AM +1000, Benjamin Herrenschmidt wrote:
> (Resent with lkml on copy)
> 
> [Note: This isn't meant to be merged, it needs splitting at the very
> least, see below]
> 
> This is something I cooked up quickly today to test whether it would fix
> my problems with a large number of switch and NVME devices on POWER.
> 

Is that a problem that can be reproduced with a qemu setup ?

Thanks,
Guenter


[RFC PATCH] pci: Proof of concept at fixing pci_enable_device/bridge races

2018-08-15 Thread Benjamin Herrenschmidt
(Resent with lkml on copy)

[Note: This isn't meant to be merged, it needs splitting at the very
least, see below]

This is something I cooked up quickly today to test whether it would fix
my problems with a large number of switch and NVME devices on POWER.

So far it does...

The issue (as discussed in the "Re: PCIe enable device races" thread) is
that pci_enable_device and related functions, along with pci_set_master
and pci_enable_bridge, are fundamentally racy.

There is no locking protecting the state of the device in pci_dev, and
if multiple devices under the same bridge try to enable it simultaneously,
some of them will potentially start accessing it before it has actually
been enabled.
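
To make the window explicit, here is a simplified sketch of the
bridge-enable path as it stands, assuming two endpoint drivers (call
them A and B) below the same bridge probing in parallel.  This is a
reduction, not the exact mainline code:

/*
 * The check-then-act below is unsynchronized: enable_cnt is bumped
 * before the bridge's config writes complete, so B can see the bridge
 * as "enabled" while A is still setting it up.
 */
static void current_pci_enable_bridge(struct pci_dev *bridge)
{
        if (pci_is_enabled(bridge))     /* B can observe "enabled" here   */
                return;                 /* while A is still doing the     */
                                        /* config writes further down ... */
        if (pci_enable_device(bridge))  /* A: sets the bridge's MEM/IO    */
                return;
        pci_set_master(bridge);         /* A: then bus mastering          */
}

Driver B's device then starts issuing MMIO/DMA through a bridge whose
COMMAND bits are not all set yet, and the access comes back as a UR.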

Now there are a LOT more fields in pci_dev that aren't covered by any
form of locking.

Additionally, some other parts such as ASPM or SR-IOV seem to be trying to do
their own ad-hoc locking, but they manipulate bit fields in pci_dev that
may share storage words with other, unrelated bit fields, racily.

So I think we need to introduce a pci_dev mutex aimed at synchronizing
the main enable/master state of a given device, and possibly also the
bazillion other state bits held inside pci_dev.

I suggest a progressive approach:

 1- First we add the mutex, and put a red line comment "/* - */" in
struct pci_dev, at the bottom (roughly sketched below, after this list)

 2- We move the basic enable/disable/set_master stuff under that lock and
move the corresponding fields in pci_dev below the line comment.

 3- Progressively, as we untangle them and verify their locking, we move
individual fields below the line so that we have a good visibility of
what has been addressed/audited already and what not
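
As a rough sketch of what steps 1 and 2 could look like (field names
and placement here are illustrative assumptions, not the actual patch):

struct pci_dev {
        /* ... existing, not-yet-audited fields stay above the line ... */

        /* ----------------------------------------------------------- */
        /* Everything below this line is protected by dev->lock.       */
        struct mutex    lock;           /* from <linux/mutex.h>         */
        unsigned int    is_busmaster:1; /* enable/master state bits     */
        /* more :1 fields migrate below the line as they are audited   */
};

The enable/disable/set_master paths would then take dev->lock around
their read-modify-write of that state.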

Note: I would very much like to keep most of the probing/resource allocation
single threaded, at least within a given domain. I know we have some attempts
at locking in the hotplug code for that; I'm not ready to tackle that or
change it at this stage.

We might be able to also address the is_added issues (Hari Vyas' stuff) using
the same mutex; I haven't looked into the details.

Not-signed-off-by: Benjamin Herrenschmidt 
---

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 316496e99da9..9c4895c40f1d 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1348,9 +1348,13 @@ static int do_pci_enable_device(struct pci_dev *dev, int bars)
  */
 int pci_reenable_device(struct pci_dev *dev)
 {
+   int rc = 0;
+
+   mutex_lock(&dev->lock);
if (pci_is_enabled(dev))
-   return do_pci_enable_device(dev, (1 << PCI_NUM_RESOURCES) - 1);
-   return 0;
+   rc = do_pci_enable_device(dev, (1 << PCI_NUM_RESOURCES) - 1);
+   mutex_unlock(&dev->lock);
+   return rc;
 }
 EXPORT_SYMBOL(pci_reenable_device);
 
@@ -1363,12 +1367,6 @@ static void pci_enable_bridge(struct pci_dev *dev)
if (bridge)
pci_enable_bridge(bridge);
 
-   if (pci_is_enabled(dev)) {
-   if (!dev->is_busmaster)
-   pci_set_master(dev);
-   return;
-   }
-
retval = pci_enable_device(dev);
if (retval)
pci_err(dev, "Error enabling bridge (%d), continuing\n",
@@ -1379,9 +1377,16 @@ static void pci_enable_bridge(struct pci_dev *dev)
 static int pci_enable_device_flags(struct pci_dev *dev, unsigned long flags)
 {
struct pci_dev *bridge;
-   int err;
+   int err = 0;
int i, bars = 0;
 
+   /* Handle upstream bridges first to avoid locking issues */
+   bridge = pci_upstream_bridge(dev);
+   if (bridge)
+   pci_enable_bridge(bridge);
+
+   mutex_lock(&dev->lock);
+
/*
 * Power state could be unknown at this point, either due to a fresh
 * boot or a device removal call.  So get the current power state
@@ -1394,12 +1399,9 @@ static int pci_enable_device_flags(struct pci_dev *dev, unsigned long flags)
dev->current_state = (pmcsr & PCI_PM_CTRL_STATE_MASK);
}
 
+   /* Already enabled ? */
if (atomic_inc_return(&dev->enable_cnt) > 1)
-   return 0;   /* already enabled */
-
-   bridge = pci_upstream_bridge(dev);
-   if (bridge)
-   pci_enable_bridge(bridge);
+   goto bail;
 
/* only skip sriov related */
for (i = 0; i <= PCI_ROM_RESOURCE; i++)
@@ -1412,6 +1414,8 @@ static int pci_enable_device_flags(struct pci_dev *dev, unsigned long flags)
err = do_pci_enable_device(dev, bars);
if (err < 0)
atomic_dec(&dev->enable_cnt);
+ bail:
+   mutex_unlock(&dev->lock);
return err;
 }
 
@@ -1618,6 +1622,7 @@ static void do_pci_disable_device(struct pci_dev *dev)
if (pci_command & PCI_COMMAND_MASTER) {
pci_command &= ~PCI_COMMAND_MASTER;
pci_write_config_word(dev, PCI_COMMAND, pci_command);
+   dev->is_busmaster = 0;
}
 
pcibios_disable_device(dev);
@@ -1632,8 +1637,10 @@ static void 
