Re: [PATCH] net: Fix NETDEV_CHANGE notifier usage causing spurious arp flush
On Wed, Jul 02, 2014 at 10:03:49AM +0300, Dan Aloni wrote:
> On Tue, Jul 01, 2014 at 09:39:43PM -0700, Loic Prylli wrote:
> > A bug was introduced in the NETDEV_CHANGE notifier sequence causing the
> > arp table to be sometimes spuriously cleared (including manual arp
> > entries marked permanent), upon network link carrier changes.
> >
> > The changed argument for the notifier was applied only to a single
> > caller of NETDEV_CHANGE, missing among others netdev_state_change().
> > So upon net_carrier events induced by the network, which are
> > triggering a call to netdev_state_change(), arp_netdev_event() would
> > decide whether or not to clear the arp cache based on random/junk
> > stack values (a kind of read buffer overflow).
> [..]
> >  {
> >  	if (dev->flags & IFF_UP) {
> > -		call_netdevice_notifiers(NETDEV_CHANGE, dev);
> > +		struct netdev_notifier_change_info change_info;
> > +
> > +		change_info.flags_changed = 0;
>
> I think it would be safer to do:
>
> 	struct netdev_notifier_change_info change_info = {};
>
> So that when future fields are added to the struct and this call-site
> happens to be forgotten, they will get 0 by default rather than
> random stack values.

Thanks for the review. I will take the suggestion into account.

For the record, another (possibly bigger) trap in the preexisting code that remains (and caused the bug) is that NETDEV_CHANGE is the only netdev notifier with a different, special calling sequence. Since calls to the NETDEV_CHANGE notifier have been reduced over time to the two instances in net/core/dev.c, it hopefully won't be a problem (and fixing that maintainability issue would be out of scope for this simple, low-risk bug fix).

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
[PATCH] net: Fix NETDEV_CHANGE notifier usage causing spurious arp flush
A bug was introduced in the NETDEV_CHANGE notifier sequence causing the
arp table to be sometimes spuriously cleared (including manual arp
entries marked permanent), upon network link carrier changes.

The changed argument for the notifier was applied only to a single
caller of NETDEV_CHANGE, missing among others netdev_state_change().
So upon net_carrier events induced by the network, which are
triggering a call to netdev_state_change(), arp_netdev_event() would
decide whether or not to clear the arp cache based on random/junk
stack values (a kind of read buffer overflow).

Fixes: be9efd365328 ("net: pass changed flags along with NETDEV_CHANGE event")
Fixes: 6c8b4e3ff81b ("arp: flush arp cache on IFF_NOARP change")
Signed-off-by: Loic Prylli
---
 net/core/dev.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 30eedf6..63129d4 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -148,6 +148,9 @@ struct list_head ptype_all __read_mostly;	/* Taps */
 static struct list_head offload_base __read_mostly;

 static int netif_rx_internal(struct sk_buff *skb);
+static int call_netdevice_notifiers_info(unsigned long val,
+					 struct net_device *dev,
+					 struct netdev_notifier_info *info);

 /*
  *	The @dev_base_head list is protected by @dev_base_lock and the rtnl
@@ -1207,7 +1210,11 @@ EXPORT_SYMBOL(netdev_features_change);
 void netdev_state_change(struct net_device *dev)
 {
 	if (dev->flags & IFF_UP) {
-		call_netdevice_notifiers(NETDEV_CHANGE, dev);
+		struct netdev_notifier_change_info change_info;
+
+		change_info.flags_changed = 0;
+		call_netdevice_notifiers_info(NETDEV_CHANGE, dev,
+					      &change_info.info);
 		rtmsg_ifinfo(RTM_NEWLINK, dev, 0, GFP_KERNEL);
 	}
 }
--
2.0.0.526.g5318336
Re: [Patch v2] Make PCI extended config space (MMCONFIG) a driver opt-in
On 1/15/2008 2:38 PM, Linus Torvalds wrote:
> On Tue, 15 Jan 2008, Tony Camuso wrote:
>> Linus is confident that conf1 is not going away for at least the next
>> five years.
>
> Not on PC's. Small birds tell me that there can be all these non-PC
> x86 subarchitectures that may or may not have conf1.
>
> Linus

But is there an ACPI-compliant architecture that only offers mmconfig for configuration-space access and no other fallback method (i.e. no conf1, no BIOS, ...)? 2.6.24 supports mmconfig for:
- ACPI systems with an MCFG table
- a couple of chipsets discovered by conf1

If a system has no conf1, but does not have e820+ACPI+MCFG, or does have some method other than mmconfig, it was already irrelevant in the discussion of Ivan's initial patch in December (because that system was either never supported or not impacted, and we were trying to fix bugs, not introduce support for a new class of systems).

Maybe Arjan could share his knowledge, and tell us what system he was thinking about (and whether it needed to be supported by 2.6.24) when saying: "When (and I'm saying "when" not "if") systems arrive that only have MMCONFIG for some of the devices."

Anyway, Ivan's patch + Matthew's extensions handle that non-PC arch. That combination is advocated by at least:
  Ivan Kokshaysky
  Matthew Wilcox
  Tony Camuso
  Loic Prylli
Even Arjan has said that while he prefers his patch (saying it's more conservative), he does not see an existing problem with the Ivan/Matthew combination. [ Simpler, less ambitious fixes can be forgotten if nothing can be done for 2.6.24; I can understand that choice. ]

The list of problems I see with Arjan's patch is:

- no word on whether the existing Linux drivers/pci/pcie/aer code should be converted to opt-in;

- mmconfig still needs to be revisited to sort out the mix of mmconfig+conf1+third-method access;

- you cannot test whether ext-conf-space is available without taking risks: when pci_enable_ext_config() is called, even legacy-conf-space is switched to the new method. So some administrator action (lspci -v + maybe-other-flag) or some driver action (that can optionally use ext-conf-space but does not *rely* on it) could cause some devices to totally disappear (if some pci hierarchy is handled by mmconfig as a 0x... section as seen on many amd machines). Matthew/Ivan will simply, in the worst case, detect that ext-conf-space is not available in pci_cfg_space_size(); legacy-conf-space will still work (and that 0x... section is perfectly *safe* to query; tell me if you need more details of why);

- it introduces a new user API, and a new kernel API, while in practice there is no evidence that it brings any benefit compared to Ivan/Matthew.

IMHO, making "pci=nommconf" the default behaviour is better than Arjan's patch: for the exaggerated 99.99% of users he claims don't need ext-conf-space, that's obviously as good. And many of the others would benefit from the ability to test whether ext-conf-space is available, and optionally use it, without taking the risk of crashing something, so something else is better for them.

With Arjan's patch, in 10 years we might still have to use an extra option (or some other action) when using lspci to display extended caps, and we would still run the risk of crashing some old machine when doing so (unless maybe a blacklist of some sort is added, making the newly introduced API completely useless soon, or unless we keep the painful bitmaps in mmconfig, potentially ending up with 3 sets of pci-ops).

Loic
Re: [Patch v2] Make PCI extended config space (MMCONFIG) a driver opt-in
On 1/14/2008 6:04 PM, Adrian Bunk wrote:
>> I thought so, but due to the way that things are initialised, mmconfig
>> happens before conf1. conf1 is known to be usable, but hasn't set
>> raw_pci_ops at this point. Confusing, and not ideal, but fixing this
>> isn't in scope for 2.6.24.
>
> ... *ahem*
>
> I don't think anything of what was discussed in this thread would be in
> scope for 2.6.24 (unless Linus wants to let the bunny that brings eggs
> release 2.6.24).
>
> cu
> Adrian

Why not put into 2.6.24 a simple fix for the last known remaining mmconfig problems? There have mostly been three bugs related to mmconfig:
- BIOS/hardware: exaggerated MCFG claims: solved long ago
- hardware: buggy CRS+mmconfig chipset: fix included last month
- Linux code: mmconfig incompatible with live BAR-probing: *not fixed*

It would be ironic not to fix the only one that is really confined to the Linux code. Everybody more or less agrees that *any* of the patches submitted so far solves the known problems and will not cause regressions. The only long discussion is about how best to prevent the effect of an "imaginary" fourth bug, and by nature that's a controversial topic.

For 2.6.24, if nothing more than a few lines can be done, either make pci=nommconf the default and add a pci=mmconf option, or/and apply one of the easiest patches to review, i.e. Tony's, so small I copy it again below (using 0x40 or 0x100 for the comparison does not really matter; personally I would change it to 0x100 to be like Ivan's patch, but either is much better than nothing). Replacing some mmconfig accesses by conf1 cannot cause any regression.

Loic

P.S.: With that patch, conf1-less x86 systems requiring mmconfig would not be supported. But they are like UFOs. There are plenty of them in the galaxy, but earth sightings are not convincing enough for 2.6.24 support; they can wait for 2.6.25.
diff --git a/arch/x86/pci/mmconfig_32.c b/arch/x86/pci/mmconfig_32.c
index 1bf5816..4474979 100644
--- a/arch/x86/pci/mmconfig_32.c
+++ b/arch/x86/pci/mmconfig_32.c
@@ -73,7 +73,7 @@ static int pci_mmcfg_read(unsigned int seg, unsigned int bus,
 	}

 	base = get_base_addr(seg, bus, devfn);
-	if (!base)
+	if ((!base) || (reg < 0x40))
 		return pci_conf1_read(seg,bus,devfn,reg,len,value);

 	spin_lock_irqsave(&pci_config_lock, flags);
@@ -106,7 +106,7 @@ static int pci_mmcfg_write(unsigned int seg, unsigned int bus,
 		return -EINVAL;

 	base = get_base_addr(seg, bus, devfn);
-	if (!base)
+	if ((!base) || (reg < 0x40))
 		return pci_conf1_write(seg,bus,devfn,reg,len,value);

 	spin_lock_irqsave(&pci_config_lock, flags);
diff --git a/arch/x86/pci/mmconfig_64.c b/arch/x86/pci/mmconfig_64.c
index 4095e4d..4ad1fcb 100644
--- a/arch/x86/pci/mmconfig_64.c
+++ b/arch/x86/pci/mmconfig_64.c
@@ -61,7 +61,7 @@ static int pci_mmcfg_read(unsigned int seg, unsigned int bus,
 	}

 	addr = pci_dev_base(seg, bus, devfn);
-	if (!addr)
+	if ((!addr) || (reg < 0x40))
 		return pci_conf1_read(seg,bus,devfn,reg,len,value);

 	switch (len) {
@@ -89,7 +89,7 @@ static int pci_mmcfg_write(unsigned int seg, unsigned int bus,
 		return -EINVAL;

 	addr = pci_dev_base(seg, bus, devfn);
-	if (!addr)
+	if ((!addr) || (reg < 0x40))
 		return pci_conf1_write(seg,bus,devfn,reg,len,value);

 	switch (len) {
Re: [Patch v2] Make PCI extended config space (MMCONFIG) a driver opt-in
On 1/13/2008 3:43 PM, Matthew Wilcox wrote:
> On Sun, Jan 13, 2008 at 10:41:24AM -0800, Arjan van de Ven wrote:
>> Note: There is not a 100% overlap between "need" and "will not be used
>> in the patches that use legacy for < 256". In the other patches
>> posted, extended config space will be used in cases where it won't be
>> with my patch. (Most obvious one is an "lspci -vx" from automated
>> scripts).
>
> I believe you to be mistaken in this belief. If you take Ivan's patch,
> conf1 is used for all accesses below 256 bytes. lspci -x only dumps
> config space up to 64 bytes; lspci - is needed to show extended pci
> config space.

I agree with Arjan about that "not a 100% overlap". It is about the extra ext-conf-space access done while probing in drivers/pci/probe.c:

	dev->cfg_size = pci_cfg_space_size(dev);

(and lspci -v will also query/show the list of extended caps for pci-x/pcie-x devices that have some, provided the kernel can access ext-conf-space).

With Ivan's patch, that line would still cause one extended-conf-space access at offset 256 for pcie/pci-x2 devices (to check the ability to query ext-space). Arjan's "opt-in" patch would prevent that extra access. IMHO that access is OK and harmless in all cases, since we are already protected by the MCFG/e820 checks, but I agree one can express a different opinion based on trying to prevent "never-seen/potential" hardware/BIOS bugs. FWIW, it is also there that I suggested excluding PCI-X2 devices (when restricted to pcie, that access while probing cannot even cause the harmless master-abort/0x...), but there is a small trade-off.

Loic
Re: [Patch v2] Make PCI extended config space (MMCONFIG) a driver opt-in
On 1/13/2008 1:41 PM, Arjan van de Ven wrote:
> On Sun, 13 Jan 2008 13:23:35 -0500
> Loic Prylli <[EMAIL PROTECTED]> wrote:
>> Matthew pointed to a patch that basically does what you suggested;
>> only one comment on your mail is left after that:
>>
>> 2) The pci_enable_ext_config() function, the dev->ext_cfg_space
>> field, and the sysfs interface should be removed from the patch.
>> There has never been a problem report of crashes or any undefined
>> behaviour while trying to access ext-conf-space; all the problems
>> were about *using mmconfig to access legacy-conf-space*.
>
> This entirely misses the point of why I made the patch. The point is
> NOT that devices are buggy. The point is that right now, 99.99% of the
> machines out there do NOT need extended config space (no matter how it
> gets accessed). The point of my patch was to make people who don't
> need extended config space not have to deal with it anymore.

I think I got your point the first time, and I agree it is sound. But in my subjective and biased opinion, I just think ext-conf-space is already useful and widespread enough (being used is not the same as being strictly required for basic operation) for your proposed trade-off to not be optimal (protecting against "future/non-proven" hardware bugs, i.e. bringing non-proven benefits, at the expense of making life harder for ext-conf-space users while adding extra API/code).

To take an example from the linux tree: the drivers/pci/pcie/aer code uses ext-conf-space for every pcie root (currently several distributions enable it by default). Does that mean opt-in would be automatically activated for most pcie hierarchies (defeating most of the benefits of being opt-in), or do we just disable that code by default? Will lspci -v automatically opt-in all pcie devices (right now, by default, it tries to list the extended capabilities for pcie and pcix), or do we now require manual explicit sysfs operations to get the whole thing? Or is it an additional flag to lspci (if so, will that flag also apply to pcix, possibly causing a crash for lspci -v - on some machines)?

> Note: There is not a 100% overlap between "need" and "will not be used
> in the patches that use legacy for < 256". In the other patches
> posted, extended config space will be used in cases where it won't be
> with my patch. (Most obvious one is an "lspci -vx" from automated
> scripts).

To go one step in your direction, I have already argued in a couple of emails that I would prefer not to implement ext-conf-space access for any PCI-X devices (removing PCI-X2 from pci_ext_cfg_size), because there we are trying to support devices that we don't really know exist or will ever exist. And protecting against "unproven bugs" makes more sense when it only removes "unproven benefits".

> Is that a problem? We've had 2 years of mess, with one not-enough
> patch after another. There still are problems TODAY (eg in
> 2.6.24-rc7). The patch that falls back to an alternative method for
> below 256 is no doubt a step in the right direction. (Although I'm not
> all that happy about mixing access types, it's not provably
> incorrect.) Is it enough? I'm not sure.

FWIW, I have in my tree a patch almost identical to Ivan's dated "December 2005". Because of the constant activity on the mmconfig front (which I thought would make it obsolete), I never took the effort of suggesting it before one month ago (I am not a regular reader of linux-kernel). I admit nobody else should view it that way, but for me, rather than the latest attempt at fixing mmconfig, it's a patch first used two years ago that would arguably have prevented all the problems that have been reported since then.

Besides, recent mails show that hypothetically we could even leave the existing conf-space code unchanged, since the only known remaining bug is the one associated with BAR probing, and it could be addressed by:

http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc6/2.6.24-rc6-mm1/broken-out/pci-disable-decoding-during-sizing-of-bars.patch

[ Thanks to Robert Hancock and Grant Grundler for explaining to me the history of BAR probing last month ]

Even if that BAR-probing patch were applied (maybe it needs to be more combat-proven), by default it still seems better not to use mmconfig for legacy-conf-space access, but going two extra precaution steps beyond what seems necessary might be excessive.

> Only time can tell I suppose, but the risk side is that if it is not
> enough, users who don't need the extended config space for
> functionality will suffer the bugs AGAIN.

You can indeed never exclude that possibility 100%, but if they see a problem again, it is likely to be a new category of hardware/BIOS bugs never seen before.

Loic
Re: [Patch v2] Make PCI extended config space (MMCONFIG) a driver opt-in
On 1/12/2008 12:45 PM, Arjan van de Ven wrote:
> On Sat, 12 Jan 2008 17:40:30 +0300
> Ivan Kokshaysky <[EMAIL PROTECTED]> wrote:
>> +	if (reg < 256)
>> +		return pci_conf1_read(seg,bus,devfn,reg,len,value);
>> +
>
> btw this is my main objection to your patch; it intertwines the conf1
> and mmconfig code even more. When (and I'm saying "when" not "if")
> systems arrive that only have MMCONFIG for some of the devices, we'll
> have to detangle this again, and I'm really not looking forward to
> that.

conf1 has been a hardcoded dependency of mmconfig for years. Ivan's patch does not make it worse (in fact it considerably simplifies that code, making it easier to untangle later).

IMHO, either your patch or Ivan's can be a good base, but:

1) For your remark above to be given any consideration, your patch should be modified to remove the hardcoded conf1 from the *current* mmconfig code, otherwise we end up with 3 sets of ops (mmconfig + conf1 + a possible third set of operations) intertwined in a confusing manner. And removing that dependency is not a straightforward operation unless you also do 2):

2) The pci_enable_ext_config() function, the dev->ext_cfg_space field, and the sysfs interface should be removed from the patch. There has never been a problem report of crashes or any undefined behaviour while trying to access ext-conf-space; all the problems were about *using mmconfig to access legacy-conf-space*. The "if (dev->cfg_space_ext > 0)" checks can instead be replaced by "if (reg >= 256)". Otherwise, when using per-device explicit enabling, just *checking* whether ext-conf-space is available by calling pci_enable_ext_config() will make some of the old problems of *losing legacy conf-space* come back: you would have introduced a new user-space and kernel API while only solving half the problems, not a good deal.

If you do 1) and 2), then you really support the good properties you claimed:
- You can use mmconfig for ext-space and something else for legacy-space.
- You can use mmconfig for everything (for instance if conf1 is not implemented).

Of course it is just as straightforward to modify Ivan's patch to also have the same properties.

Loic
Re: [Patch v2] Make PCI extended config space (MMCONFIG) a driver opt-in
On 1/13/2008 1:41 PM, Arjan van de Ven wrote:
> On Sun, 13 Jan 2008 13:23:35 -0500 Loic Prylli <[EMAIL PROTECTED]> wrote:
>> Matthew pointed a patch that basically does what you suggested; only
>> one comment on your mail left after that:
>>
>> 2) the pci_enable_ext_config() function and dev->ext_cfg_space field,
>> sysfs interface should be removed from the patch. There has never been
>> a problem report of crashes or any undefined behaviour while trying to
>> access ext-conf-space, all the problems were *using mmconfig to access
>> legacy-conf-space*.
>
> This entirely misses the point of why I made the patch. The point is
> NOT that devices are buggy. The point is that right now, 99.99% of the
> machines out there do NOT need extended config space (no matter how it
> gets accessed). The point of my patch was to make people who don't
> need extended config space not have to deal with it anymore.

I think I got your point the first time, and I agree it is sound. But in my subjective and biased opinion, I just think ext-conf-space is already useful and widespread enough (being used is not the same as being strictly required for basic operation) for your proposed tradeoff to not be optimal (protecting against future/non-proven hardware bugs, i.e. bringing non-proven benefits, at the expense of making life harder for ext-conf-space users while bringing additional extra API/code).

To take an example from the linux tree: the drivers/pci/pcie/aer code uses ext-conf-space for every pcie-root (currently several distributions enable it by default). Does that mean opt-in would be automatically activated for most pcie hierarchies (defeating most of the benefits of being opt-in), or do we just disable that code by default? Will lspci -v automatically opt-in all pcie devices (right now by default it tries to list the extended capabilities for pcie and pcix), or do we now require manual explicit sysfs operations to get the whole thing? Is it an additional flag to lspci (if so, will that flag also apply to pcix, possibly causing a crash for lspci -v -opt-in-all-potential-ext-devices on some machines)?

> Note: There is not a 100% overlap between "need" and "will not be
> used" in the patches that use legacy for < 256. In the other patches
> posted, extended config space will be used in cases where it won't be
> with my patch. (Most obvious one is an lspci -vx from automated
> scripts).

To go one step in your direction, I have already argued in a couple of emails that I would prefer to not implement ext-conf-space access for any PCI-X devices (removing PCI-X2 from pci_ext_cfg_size), because there we are trying to support devices that we don't really know exist or will ever exist. And protecting against unproven bugs makes more sense when it only removes unproven benefits.

> Is that a problem? We've had 2 years of mess, with one not-enough
> patch after another. There still are problems TODAY (e.g. in
> 2.6.24-rc7). The patch that falls back to an alternative method for
> below 256 is no doubt a step in the right direction (although I'm not
> all that happy about mixing access types, it's not provably
> incorrect). Is it enough? I'm not sure.

FWIW, I have in my tree a patch almost identical to Ivan's dated December 2005. Because of the constant activity on the mmconfig front (which I thought would make it obsolete), I never took the effort of suggesting it before one month ago (I am not a regular user of linux-kernel). I admit nobody else should view it that way, but for me, rather than the last attempt at fixing mmconfig, it's a patch first used two years ago that would have arguably prevented all the problems that have been reported since then.

Besides, recent mails show that hypothetically we could even not change anything in the existing conf-space code, since the only known remaining bug is the one associated with bar probing, and could be addressed by:
http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc6/2.6.24-rc6-mm1/broken-out/pci-disable-decoding-during-sizing-of-bars.patch
[ Thanks to Robert Hancock and Grant Grundler for explaining to me the history of bar-probing last month ]

Even if that bar-probing patch were applied (maybe it needs to be more combat-proven), by default it still seems better to not use mmconfig for legacy-conf-space access, but going two extra precaution steps beyond what seems necessary might be excessive.

> Only time can tell I suppose, but the risk side is that if it is not
> enough, users who don't need the extended config space for
> functionality will suffer the bugs AGAIN.

You can indeed never exclude 100% that possibility, but if they see a problem again, it is likely to be a new category of hardware/BIOS bugs never seen before.

Loic
Re: [Patch v2] Make PCI extended config space (MMCONFIG) a driver opt-in
On 1/13/2008 3:43 PM, Matthew Wilcox wrote:
> On Sun, Jan 13, 2008 at 10:41:24AM -0800, Arjan van de Ven wrote:
>> Note: There is not a 100% overlap between "need" and "will not be
>> used" in the patches that use legacy for < 256. In the other patches
>> posted, extended config space will be used in cases where it won't be
>> with my patch. (Most obvious one is an lspci -vx from automated
>> scripts).
>
> I believe you to be mistaken in this belief. If you take Ivan's patch,
> conf1 is used for all accesses below 256 bytes. lspci -x only dumps
> config space up to 64 bytes; lspci -xxxx is needed to show extended
> pci config space.

I agree with Arjan about that "not a 100% overlap". It is about the extra ext-conf-space access done while probing in drivers/pci/probe.c:

	dev->cfg_size = pci_cfg_space_size(dev);

(and lspci -v will also query/show the list of extended-caps for pci-x/pcie devices that have some, provided the kernel can access ext-conf-space). With Ivan's patch, that line would still cause one extended-conf-space access at offset 256 for pcie/pci-x2 devices (to check the ability to query ext-space). Arjan's opt-in patch would prevent that extra access. IMHO that access is OK and harmless in all cases, as we are already protected by MCFG/e820 checks, but I agree one can express a different opinion based on trying to prevent never-seen/potential hardware/BIOS bugs. FWIW, it is also there that I suggested excluding PCI-X2 devices (when restricted to pcie, that access while probing cannot even cause the harmless master-abort/0xffffffff), but there is a small trade-off.

Loic
Re: [Patch v2] Make PCI extended config space (MMCONFIG) a driver opt-in
On 1/13/2008 1:01 AM, Matthew Wilcox wrote:
> On Sat, Dec 29, 2007 at 12:12:19AM +0300, Ivan Kokshaysky wrote:
>> On Fri, Dec 28, 2007 at 12:40:53PM -0500, Loic Prylli wrote:
>>> One thing that could be changed in pci_cfg_space_size() is to avoid
>>> making a special case for PCI-X 266MHz/533MHz (assume cfg_size == 256
>>> for such devices too, reserve extended cfg-space for pci-express
>>> devices).
>>
>> I agree, we should remove it. IIRC, this PCI-X check was written long
>> ago with some draft (not a final spec) in hands. Matthew?
>
> I have what I believe to be the released version of PCI-X 2.0a (July
> 22, 2003). It is quite clear that Mode 2 devices (ie those running at
> 266MHz or 533MHz) are required to support all 4096 bytes of extended
> config space. More to the point, I don't think we have any bug reports
> suggesting that PCI-X Mode 2 devices/bridges have any problems.

As PCI-X2 bridge/chipset, I only know about the AMD-8132 (from what I understand it does PCI-X Mode 2), and some obscure IBM enterprise chipset (I am sure there are a few more). Too bad for the spec, but we definitely know for sure the AMD-8132 doesn't do ext-space (and makes it unusable for any device behind it).

> There are relatively few of them in existence, and my impression is
> that PCI-X2 is only being implemented on server-class machines.

True.

> 'Consumer grade' equipment is where all the problems lie anyway.

mmconfig has been a pain on the servers too (there are a lot of server-class amd machines using one pcie/mmconfig chipset + amd-8131/2).

> While the PCI-X 2.0a spec does not define any Extended Capability IDs,
> it simply states that "This field is a PCI-SIG defined ID number that
> indicates the nature and format of the Extended Capabilities List
> item". The PCIe spec does define Extended Capability IDs, and I would
> think it's entirely appropriate to use the same IDs for PCI-X Mode 2
> devices.

Sure, it might be needed on PCI-X2. But contrary to pcie (where the drivers/pci/pcie/aer subsystem already uses ext-conf-space, and other usages are bound to increase), needing ext-conf-space in the future on pci-x2 is quite unlikely (pcie is long-lived, whereas PCI-X2 was short-lived, obsoleted by PCI-E, and nobody has mentioned yet an example of using ext-registers with a PCI-X2 device). I was only mentioning that because of the very small trade-off: if you don't exclude PCI-X2, on platforms with the amd-8132 + bad MCFG, you might trigger a cfg-read == 0xffffffff/master-abort in pci_cfg_space_size() for such devices with Ivan's patch. This is harmless, because a lot of similar master-aborts happen during PCI probing anyway, so one more won't change anything. Anyway, I am equally happy with keeping pci_cfg_space_size() as it is.

Loic
Re: [Patch v2] Make PCI extended config space (MMCONFIG) a driver opt-in
On 12/28/2007 1:06 AM, Benjamin Herrenschmidt wrote:
> On Thu, 2007-12-27 at 21:37 -0800, Linus Torvalds wrote:
>> On Fri, 28 Dec 2007, Benjamin Herrenschmidt wrote:
>>> I have embedded boards where proper CRS operation is critical since
>>> the kernel brings the PCIe link up itself, and thus is likely to hit
>>> devices still in the middle of CRS.
>>
>> .. but that's perfectly fine. A PCI-E bridge will certainly retry it
>> in hardware (or it isn't a PCI-E bridge!).
>
> Only a handful of times in many bridges I've seen.

Yes, retry implementations are not universal. But bridges with CRS-visibility might be even less common (both are optional/implementation-specific features). To make some choices for our own devices, we experimented at Myricom with CRS on a few chipsets to see what they implement:

broadcom/serverworks 2000/1000:	no-retry/no-visibility
nvidia CK804/IO804:		no-retry/no-visibility
nvidia MPC55/IO55:		no-retry/no-visibility
intel-7520:			does-retry/no-visibility
intel-975X:			does-retry/no-visibility
intel-5000series:		does-retry/no-visibility
intel X38:			does-retry/no-visibility
AMD/ATI RD790:			no-retry/no-visibility

[ In all no-retry cases, when a CRS is somehow returned to the pcie-root, the requester will get a 0xffffffff; depending on the chipset, other error bits (malformed-tlp, completion-timeout) might be triggered. ]

From the data reported on this mailing-list, it seems like the ATI chipset tries to implement CRS-visibility, but it is obviously buggy. Not knowing whether there is any chipset with the visibility feature but without the retry capability, and given that CRS is irrelevant for most Linux platforms (it only matters just after power-on, long before Linux is started in the common case), it seems fair that CRS-visibility-enabling should only be added in the specific code of the specific embedded platforms (with no BIOS/firmware of any kind) that might need it.

>> So I'm going to disable that thing. If there is some _other_ PCI-E
>> bridge that is simply buggy, and cannot handle the hw retry itself or
>> is just otherwise dodgy, we can have a white-list for cases where it
>> really needs to be done, but the current code is just bogus.
>
> If you disable it, then isn't there also a problem with PCIE->PCI-X
> bridges which will stop issuing CRS when they should? (not sure here,
> I may be a bit confused).

This is mostly independent: allowing a PCIE->PCI-X bridge to generate CRS is a different bit (bit 15 of the PCIe devctl register on the bridge). FWIW, Linux does not seem to touch it, and it defaults to zero, so it seems most current PCIE->PCI-X bridges will never generate a CRS (some BIOSes might set it, but not the couple of platforms I looked at). Again, the choice of setting here seems something better left to the specific BIOS/embedded-code for a given platform.

Loic
Re: [Patch v2] Make PCI extended config space (MMCONFIG) a driver opt-in
On 12/28/2007 11:38 AM, Ivan Kokshaysky wrote:
> On Fri, Dec 28, 2007 at 08:14:18AM -0800, Arjan van de Ven wrote:
>> it removes code by removing quirks / known not working stuff..

The only quirk I see removed is a bitmap with an arbitrary size (that we don't really know is sufficient for every system), and that is only built using a comparison between mmconf and type1 accesses. IMHO, there is zero knowledge in that removed code (no knowledge about specific chipsets that work or don't work, or misleading BIOSes).

> This not working stuff gets detected at probe time - see
> drivers/pci/probe.c:pci_cfg_space_size().

This indeed avoids most invalid mmconf attempts (for extended-conf-space probing that goes through pci_find_ext_capability() or from user-space). One could think of adding a cfg_size check in pci_read_config_*/pci_write_config_*, but IMHO that would be useless, since a direct read/write of a known register in the extended-config-space can only happen for a pci-express device, and there is ample evidence that such accesses always work (more exactly, MCFG can always be trusted for pcie devices).

One thing that could be changed in pci_cfg_space_size() is to avoid making a special case for PCI-X 266MHz/533MHz (assume cfg_size == 256 for such devices too, reserve extended cfg-space for pci-express devices). There are good reasons to think no such PCI-X 266MHz/533MHz device will ever have an extended-space (no capability IDs were ever defined in the PCI-X 2.0 spec, no new revision is planned). Such a check would avoid the possibility of trying an extended-conf-space access for PCI-X 2.0 devices behind an amd-8132 or similar (such accesses would just return -1, but some objections were raised about doing anything like that other than at initialization time, even if there are ample reasons to argue it would be harmless).

Loic
[PATCH] Fix x86 iounmap() calling ioremap_change_attr() with a size that's too big.
get_vm_area() allocates one extra guard page and stores the augmented size in area->size. But the ioremap/iounmap code on x86 relies on finding the original size in area->size (otherwise it changes the attributes of some random device in the linear map by using a wrong size in ioremap_change_attr()). The problem is avoided by not using an extra guard page for VM_IOREMAP allocations (if somebody can think of an easy way to store the original size across ioremap()/iounmap() calls, that could be a more elegant solution).

Signed-off-by: Loic Prylli <[EMAIL PROTECTED]>
---
 mm/vmalloc.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index af77e17..efd0093 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -196,9 +196,10 @@ static struct vm_struct *__get_vm_area_node(unsigned long size, unsigned long fl
 		return NULL;
 
 	/*
-	 * We always allocate a guard page.
+	 * We allocate a guard page (except for ioremap() which relies on area->size == size)
 	 */
-	size += PAGE_SIZE;
+	if (!(flags & VM_IOREMAP))
+		size += PAGE_SIZE;
 
 	write_lock(&vmlist_lock);
 	for (p = &vmlist; (tmp = *p) != NULL; p = &tmp->next) {
-- 
1.5.2.2
Re: [Patch v2] Make PCI extended config space (MMCONFIG) a driver opt-in
On 12/27/2007 1:58 PM, Linus Torvalds wrote:
> There was a thread called "PCI vendor id == 1 regression?" (Kai Ruhnau
> was the main reporter for that one). But looking back, it seems that
> one didn't hit the kernel mailing list, so I guess google cannot pick
> it up. I can forward all the emails if you want, but quite frankly,
> you don't really want to. It boils down to:
>
> Stephen Hemminger:
>   "There have been two reports with different hardware of the PCI
>    vendor id of 0001 showing up. I got a report on sky2, and Francois
>    saw a similar problem on r8169. In one case, it happened only with
>    the 2.6.23 kernel, the correct id was returned by older kernels.
>
>    This is a heads up, there may be a PCI problem. Or just
>    some users smoking strange fall leaves."
>
> And then one of the reporters:
>
>   "Good kernel:
>
>    02:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8056
>      PCI-E Gigabit Ethernet Controller (rev 12)
>    00: ab 11 64 43 07 00 10 00 12 00 00 02 01 00 00 00
>
>    Bad kernel:
>
>    02:00.0 Ethernet controller: Unknown device 0001:4364 (rev 12)
>    00: 01 00 64 43 07 00 10 00 12 00 00 02 01 00 00 00"

The root pcie port implementation is obviously buggy. But did we confirm whether that hardware bug might be partly related to the "configuration-retry-status" pcie-root handling as introduced/described in:
http://marc.info/?l=linux-kernel&m=110541914926842&w=2

Does the 0001 vendor-id still show up if pci_enable_crs() has never been called? Does anybody know what was the original rationale for calling pci_enable_crs() by default?

Loic
Re: [Fwd: Re: [PATCH 0/5]PCI: x86 MMCONFIG]
On 12/23/2007 3:55 PM, Matthew Wilcox wrote: > On Sun, Dec 23, 2007 at 03:16:24PM -0500, Loic Prylli wrote: > >> I just realized one thing: the bar sizing code in pci_read_bases() (that >> writes 0x in the bars) does not seem to disable the >> PCI_COMMAND_MEM/PCI_COMMAND_IO bits in the cmd register before >> manipulating the BARs. And it seems nobody else ensures they are >> disabled at this point either (or am I missing something?). >> > > Right, we don't. Ivan and Greg are convinced that doing so can break > some machines. > It might indeed be scary to suddenly change the command bits on some old "Host bridges" and maybe other classes of bridges. I was too optimistic about the triviality of the fix. Still an obvious improvement to the pci_read_bases() code would be to not try sizing non-existings BARs (that way if a device has no memory bars, we never need to touch the mem-enable bit). In addition, if you exclude Host bridges and other bridges from temporary mem-disabling while sizing their BARs, then doing temp disabling for the remaining devices looks less scary than the alternative (messing live with the bars decoding range especially for regular devices integrated to the main chipset). It seems to me worth making that init code safer independantly of the mmconfig issues. > Now that contradicts some information we've been told before; can you > post those traces? That would argue in favour of disabling memspace > when configuring BARs. > Here is a trace where COMMAND/STATUS and 0x10 and 0x14 registers accesses were captured. Comments in brackets are my interpretation of what's happening. That was with windows 2003 64bit. 
Loic

[ reset at 607 seconds ]
[BIOS: read status, COMMAND = 0 ]
626.442718:cfg-read:0x6.w=0x10
626.442805:cfg-write:0x4.w=0x0
[BIOS: sizing BAR0]
626.448325:cfg-read:0x10.l=0x8
626.448365:cfg-write:0x10.l=0xfffffffe
626.448439:cfg-read:0x10.l=0xff000008
626.448493:cfg-write:0x10.l=0x8
[BIOS: sizing non-existent BAR1]
626.448552:cfg-read:0x14.l=0x0
626.448591:cfg-write:0x14.l=0xfffffffe
626.448647:cfg-read:0x14.l=0x0
626.448687:cfg-write:0x14.l=0x0
[BIOS: sizing BAR0]
626.449596:cfg-read:0x10.l=0x8
626.449635:cfg-write:0x10.l=0xfffffffe
626.449710:cfg-read:0x10.l=0xff000008
626.449764:cfg-write:0x10.l=0x8
[BIOS: sizing non-existent BAR1]
626.449832:cfg-read:0x14.l=0x0
626.449872:cfg-write:0x14.l=0xfffffffe
626.449928:cfg-read:0x14.l=0x0
626.449967:cfg-write:0x14.l=0x0
[BIOS: sizing BAR0]
626.451822:cfg-read:0x10.l=0x8
626.451862:cfg-write:0x10.l=0xfffffffe
626.451936:cfg-read:0x10.l=0xff000008
626.451990:cfg-write:0x10.l=0x8
[BIOS: sizing non-existent BAR1]
626.452058:cfg-read:0x14.l=0x0
626.452098:cfg-write:0x14.l=0xfffffffe
626.452154:cfg-read:0x14.l=0x0
626.452194:cfg-write:0x14.l=0x0
[BIOS: sizing and assigning BAR0]
626.454899:cfg-read:0x10.l=0x8
626.454939:cfg-write:0x10.l=0xfffffffe
626.455013:cfg-read:0x10.l=0xff000008
626.455068:cfg-write:0x10.l=0x8
626.455138:cfg-write:0x10.l=0xfa000000
[BIOS: sizing non-existent BAR1]
626.455211:cfg-read:0x14.l=0x0
626.455251:cfg-write:0x14.l=0xfffffffe
626.455307:cfg-read:0x14.l=0x0
626.455346:cfg-write:0x14.l=0x0
[BIOS: sizing and restoring BAR0]
628.024774:cfg-read:0x10.l=0xfa000008
628.024831:cfg-write:0x10.l=0xfffffffe
628.024908:cfg-read:0x10.l=0xff000008
628.024964:cfg-write:0x10.l=0xfa000008
[BIOS: sizing non-existent BAR1]
628.025049:cfg-read:0x14.l=0x0
628.025091:cfg-write:0x14.l=0xfffffffe
628.025149:cfg-read:0x14.l=0x0
628.025190:cfg-write:0x14.l=0x0
[BIOS: clear status ]
628.025714:cfg-write:0x6.w=0xffff
[BIOS: enable SERR/IO+MEMORY+MASTER+SPECIAL+INVALIDATE ]
628.025788:cfg-write:0x4.w=0x11f
[BIOS: enable MEMORY+MASTER (after seeing IO/special/invalidate were read-only 0) ]
629.028114:cfg-read:0x4.b=0x6
629.028160:cfg-write:0x4.b=0x6
[BIOS: read status, clear PARITY/SERR status bits ]
629.028490:cfg-read:0x7.b=0x0
629.028536:cfg-write:0x7.b=0xc1
[BIOS: enable PARITY detection ]
629.028818:cfg-read:0x4.b=0x6
629.028863:cfg-write:0x4.b=0x46
[BIOS: enable SERR (already enabled anyway) ]
629.028920:cfg-read:0x5.b=0x1
629.028966:cfg-write:0x5.b=0x1
629.032338:cfg-read:0x6.w=0x10
[ WINDOWS starting ]
661.965932:cfg-read:0x4.l=0x100146
661.965986:cfg-read:0x10.l=0xfa000008
661.966040:cfg-read:0x14.l=0x0
662.112971:cfg-read:0x4.l=0x100146
662.113024:cfg-read:0x10.l=0xfa000008
662.113078:cfg-read:0x14.l=0x0
662.155903:cfg-read:0x4.l=0x100146
662.156066:cfg-read:0x4.l=0x100146
662.156120:cfg-read:0x10.l=0xfa000008
662.156174:cfg-read:0x14.l=0x0
662.156337:cfg-read:0x4.l=0x100146
662.156391:cfg-read:0x10.l=0xfa000008
662.156445:cfg-read:0x14.l=0x0
[ WINDOWS: disable MEMORY + MASTER ]
662.156519:cfg-write:0x4.l=0x140
[ WINDOWS: size BAR0 and BAR1]
662.156588:cfg-write:0x10.l=0xffffffff
662.156662:cfg-write:0x14.l=0xffffffff
662.156755:cfg-read:0x4.l=0x100140
662.156809:cfg-read:0x10.l=0xff000008
662.156863:cfg-read:0x14.l=0x0
[ WINDOWS: disable MEMORY + MASTER ]
662.156937
Re: [Fwd: Re: [PATCH 0/5]PCI: x86 MMCONFIG]
On 12/20/2007 1:16 PM, Matthew Wilcox wrote:
> Oh, that's the same bug others (including me) have been complaining
> about.
>
> http://marc.info/?l=linux-kernel&m=118809338631160&w=2
>
>> It hangs in exactly the same place every time.
>>
>> I am surmising that the write to that BAR is causing a MCE.
>
> Bad deduction. What's happening is that the write to the BAR is causing
> it to overlap the decode for mmconfig space. So the mmconfig write to
> set the BAR back never gets through.
>
> I have a different idea to fix this problem. Instead of writing
> 0xffffffff, we could look for an unused bit of space in the E820 map and
> write, say, 0xdfffffff to the low 32 bits of a BAR. Then it wouldn't
> overlap, and we could find its size using MMCONFIG.
>
> Does anyone know how Windows handles these machines?

I just realized one thing: the BAR sizing code in pci_read_bases() (that writes 0xffffffff in the BARs) does not seem to disable the PCI_COMMAND_MEM/PCI_COMMAND_IO bits in the cmd register before manipulating the BARs. And it seems nobody else ensures they are disabled at this point either (or am I missing something?).

Touching the BARs while they are enabled would be buggy behaviour on our part, and something trivial to fix. And it might well fix that particular problem (it's fair play for the machine to crash if we create a decoding conflict; simply disabling the cmd bits in pci_read_bases() should remove that conflict).

FWIW, to partially answer your last question, Windows does disable mem-space and/or IO-space when sizing the BARs of a device (I have some traces of configuration-space accesses taken on a Windows machine for one of the PCI busses).

Loic
Re: [patch] Make MMCONFIG space (extended PCI config space) a driver opt-in issue
On 12/23/2007 12:44 AM, Jeff Garzik wrote:
>> By possibly using different implementations for the two ranges you avoid
>> introducing a new API, you avoid taking risks with mmconf when you don't
>> need it. That doesn't preclude using mmconf for everything either if the
>> user requests it or based on enough knowledge of the system to be sure
>> nothing will break.
>
> Are you prepared to guarantee that freely mixing mmconfig and type1
> config accesses at runtime will always work, on all chipsets? I'm
> talking about silicon here, not kernel software.

Your question is whether I expect a mix of type1 (for legacy conf space) and mmconfig (for reg >= 256) to trigger some bug in silicon causing flakiness (why not possible random corruption) that would not happen when only using type1 or mmconfig. Obviously, I cannot prove or guarantee the absence of subtle silicon bugs. But I can say:

- The documentation says mixing them is OK; the pci-express docs insist on paying attention to ordering when mixing (which is easy on x86), acknowledging one is able to mix them.
- All current kernels have been using a mix of "type1" and "mmconf" accesses to the same device, even if that's only used during mmconfig initialization.
- On AMD platforms with the typical combination of mmconf-aware and non-mmconf-aware chipsets, you always have a mix of "type1" and "mmconf" accesses all the time with standard kernels (not on the same bus).
- Mixing such accesses on the same device at runtime has happened on thousands of systems for Myricom device/software users.

Are you prepared to guarantee no processor will ever have a weird silicon problem that's triggered by any Linux change?
Mixing mmconf/type-1 is an approach that any mmconf-able hardware is supposed to support. It has been tried on at least most server/workstation chipsets typically used in the last two years, so the burden of proof is on actually finding some silicon that cannot do that properly before rejecting the approach based on hardware fears (otherwise you are on a slippery slope).

> Furthermore, is it best for our users to find problems with mixed
> config accesses -- not at boot time, not at driver load time, but at
> some random time when some random driver does its first extended
> config space access? IMO, no. If you are going to fail, failing in a
> predictable, visible way is best.

A failure happening during driver initialization or some very specific identifiable driver event (the first ext-conf-space access is not something popping up randomly in the life of a driver) is predictable and visible. I am not sure what kind of problem you are afraid of anyway (ext-conf-space not available?), so it's hard to talk generally. Needless to say, I never said not to test mmconf on all pci-express devices advertised in MCFG during initialization time (preferably after all PCI-memory-space initialization is done, since there is uncertainty about whether mmconf might conflict with that in some corner cases).

> Failures are more predictable and more consistent with an all-or-none
> scenario. The all-or-none solutions are the least complex on the
> software side, and far more widely tested than any mixed config access
> scheme.

Disabling ext-conf-space access altogether by default (that's the default you propose, right?) is definitely the safest approach. But it's the least useful too. Caricaturing, I am the one advocating the real all-or-none scenario: always use type1 for legacy conf-space (never mmconf).
You are saying that for legacy-conf-space access, some people should use mmconf and other people should use type1 based on their kernel configuration or command line, and that the choice depends on something unrelated (whether they feel they might need extended-conf-space access at some point, be it even for lspci/setpci).

Loic
Re: [patch] Make MMCONFIG space (extended PCI config space) a driver opt-in issue
On 12/22/2007 11:52 PM, Jeff Garzik wrote:
> Absolutely.
>
> But regardless of problems, enabling should be done globally, not per
> device...

The "enabling globally" requirement, i.e. neither per-device nor depending on reg >= 256, seems a very debatable assumption (IMHO a big mistake) that seems to be responsible for many of the problems seen in the past. There might be, for a very long time, AMD architectures where extended-conf-space access works and is beneficial for pci-express devices (and where mmconf is not supported by the hardware on non-pci-express devices). You are basically saying you don't ever want to support extended-conf-space globally for those systems, where it would be so easy to use mmconf *only* when attempting *extended-conf-space* accesses (reg >= 256) to some device (that provides a strong guarantee that you will never break anything unless somebody actually tries to use the extended-conf-space).

Supporting extended-conf-space is independent of the issue of using mmconf for legacy conf-space. There is no real reason to use the same method to access both. I have seen several arguments implying that there is, and they all seem really bogus to me. Not only are the two ranges (< 256 and >= 256) structurally independent (you have totally independent capability lists that are independently organized in each of them), but even if they were not, there is no consistency issue that cannot be dealt with using a memory barrier, and the concern about taking an extra branch for each pci-conf-space access is also bogus once you look at the numbers.

By possibly using different implementations for the two ranges you avoid introducing a new API, and you avoid taking risks with mmconf when you don't need it. That doesn't preclude using mmconf for everything either, if the user requests it or based on enough knowledge of the system to be sure nothing will break.
Loic
Re: [patch] Make MMCONFIG space (extended PCI config space) a driver opt-in issue
On 12/22/2007 11:13 PM, Jeff Garzik wrote:
> The facts as they exist today:
>
> 1) Existing 256-byte config space devices have been known to put
> capabilities in the high end (>= 0xc8) of config space.
>
> 2) It is legal for PCI-Express to put capabilities anywhere in PCI
> config space, including extended config space. (I hope our PCI cap
> walking code checks for overruns...)

You make it sound almost as if the capability list that starts in regular conf-space could cross into extended conf-space (>= 256). That's not true: the capability lists in the regular conf-space and the extended conf-space are really separate. They use a different structure for linking (a different number of bits to define the capability IDs), a different starting point, and different capability-ID definition tables. The regular conf-space and the extended conf-space are really independent.

> 3) Most new machines ship with PCI-Express devices with extended
> config space.
>
> Therefore it is provably /possible/, and is indeed logical to conclude
> that capabilities in extended config space will follow the same
> pattern that existing hw designers have been following... but only
> once the current OS's have stable extended-config-space support.
>
> Maybe that day will never come, but it is nonetheless quite possible
> within today's PCI Express spec for this to happen.

I agree with that statement. In fact it is already quite useful today. I am doing a lot of support activities where extended-conf-space is a must for troubleshooting. It was important enough for us that we have user tools that allow us to access mmconfig-space for pci-express even on systems that don't advertise an MCFG attribute (as long as the chipset supports it; we have reverse-engineered the location of the "mmconfig bar" for a few chipsets including nvidia chipsets, for Intel it is well documented, and there are a couple of others).
Loic
Re: [Fwd: Re: [PATCH 0/5]PCI: x86 MMCONFIG]
On 12/20/2007 9:15 PM, Robert Hancock wrote:
>> Suggested Workaround
>>
>> It is strongly recommended that system designers do not connect the
>> AMD-8132 and devices that use extended configuration space MMIO BARs
>> (ex: HyperTransport-to-PCI Express® bridges) to the same processor
>> HyperTransport link.
>>
>> Fix Planned
>> No
>
> That does sound fairly definitive. I have to wonder why certain system
> designers then didn't follow their strong recommendation..

Just curious, do you know of any system where that recommendation was not followed? On all motherboards where I have seen an AMD-8131 or an AMD-8132, they were alone on their HyperTransport link, and the other "northbridges" (more precisely, HyperTransport-to-pci-express or pci-whatever bridges, often nvidia) with an "MMCONFIG BAR" were on one of the other available HyperTransport links in the system.

Loic
Re: [Fwd: Re: [PATCH 0/5]PCI: x86 MMCONFIG]
On 12/20/2007 6:21 PM, Tony Camuso wrote:
> And the MMCONFIG problem with enterprise systems and workstations, where
> we do control the BIOS (for the most part), is due to known bugs in
> certain versions of certain chipsets, HT1000 and AMD8132 among them, not
> the BIOS.

The lack of MMCONFIG support is indeed because some HyperTransport chipsets lack that support. But there are some BIOSes out there that advertise support for all busses in their MCFG ACPI attribute (even the busses managed by some amd8131 in a mixed nvidia-ck804/amd8131 motherboard), and the BIOS seems at least faulty for advertising a capability that does not exist.

Loic
Re: [Fwd: Re: [PATCH 0/5]PCI: x86 MMCONFIG]
On 12/20/2007 4:00 PM, Matthew Wilcox wrote:
> On Thu, Dec 20, 2007 at 03:56:29PM -0500, Loic Prylli wrote:
>> I know the final device is not aware of how the config request was
>> originated. I am just saying platforms built around the Intel 82801
>> chipset (ICH2) don't support mmconfig at all. I would also not be
>> surprised if the platforms where tg3 needs TG3_FLAG_PCIX_TARGET_HWBUG
>> might also not support mmconfig (but for this second case, it's only
>> speculation based on seeing a couple of posts about
>> TG3_FLAG_PCIX_TARGET_HWBUG where AMD HyperTransport/PCI-X chipsets were
>> mentioned). If you know of a platform that supports mmconfig, and where
>> the tg3 does need to use pci-conf-space relatively intensively, I'll be
>> happy to be corrected.
>
> tg3 is available as an add-in pci card. i have one. i can plug it into
> a machine that does support mmconfig.

That doesn't tell for sure your NIC has the specific rev that would cause the hwbug workaround to be used. But let's assume so; your combination would still work correctly (maybe with slightly non-optimal network performance). A non-default mmconf=always option could help maximize performance again.

My starting point was that more systems could be supported out of the box (vs. not working at all) by using "type 1" more widely. This would not break anybody. And it would not affect performance except on what I initially called "obscure hardware or systems". I already acknowledged in the previous email the possibility of a performance impact, but it is still not clear to me whether it would be widespread. The example you mention is a buggy revision of a chip that was designed to use memory-mapped IO, and has to use config-space instead because the memory-mapped IO implementation is buggy (in most cases only when used in combination with certain chipsets). How much should a modest performance impact there influence the decision-making?
Usually it's better to trade some performance for stability than the reverse, and you lose the performance anyway when so many people start using pci=nommconf by default on all their installs. Loic
Re: [Fwd: Re: [PATCH 0/5]PCI: x86 MMCONFIG]
On 12/20/2007 3:15 PM, Matthew Wilcox wrote: > On Thu, Dec 20, 2007 at 03:05:52PM -0500, Loic Prylli wrote: > >> I am not familiar with the tg3 driver; just from a 5-minute look, >> it seems the typical cases where the pci-conf-space is used >> intensively are with some rev in combination with the 82801 >> (TG3_FLG2_ICH_WORKAROUND) which I don't think supports mmconfig anyway, >> as well as some very specific PCI-X combinations >> (TG3_FLAG_PCIX_TARGET_HWBUG) which are also very unlikely to support >> mmconfig. >> > > It's not a question of whether the card supports mmconfig or not -- the > card can't tell whether a first-256-byte pci config transaction was initiated > through mmconfig, type1, type2, or even a bios call. > I know the final device is not aware of how the config request was originated. I am just saying platforms built around the Intel 82801 chipset (ICH2) don't support mmconfig at all. I would also not be surprised if the platforms where tg3 needs TG3_FLAG_PCIX_TARGET_HWBUG might also not support mmconfig (but for this second case, it's only speculation based on seeing a couple posts about TG3_FLAG_PCIX_TARGET_HWBUG where amd hypertransport/PCI-X chipsets were mentioned). If you know of a platform that supports mmconfig, and where the tg3 does need to use pci-conf-space relatively intensively, I'll be happy to be corrected. > I'm just hacking together an implementation based on Ivan's suggestion > to always use type 1 for the first 64 bytes. > > Whatever you think is best. If you are only trying to address a BAR-sizing/mmconfig temporary conflict, and think the "MCFG BIOS" or other mmconfig problems are under control, that makes sense. If the need arises for somebody, an option to select between 64 and 256 can be added later. 
Loic
Re: [Fwd: Re: [PATCH 0/5]PCI: x86 MMCONFIG]
On 12/20/2007 2:08 PM, Matthew Wilcox wrote: > On Thu, Dec 20, 2007 at 02:04:31PM -0500, Tony Camuso wrote: > >> Also, this solution also would allow us to remove the unreachable_devices() >> routine and bitmap. >> > > Not really ... we probe reading address 0x100 to see if the device > supports extended config space or not. So we need to make that fail > gracefully for the amd7111 case. > pci_cfg_space_size() is only done for PCI-express or PCI-X mode 2 devices, so you still have eliminated the bulk of the problems, which are typically handling the legacy busses on modern machines. I don't know what the amd7111 is. > >> Does anybody see a down side to this? >> > > It'll be slower than it would be if we used mmconfig directly. Now yes, > nobody should be using pci config space in performance critical paths > ... but see the tg3 driver. > I am not familiar with the tg3 driver; just from a 5-minute look, it seems the typical cases where the pci-conf-space is used intensively are with some rev in combination with the 82801 (TG3_FLG2_ICH_WORKAROUND) which I don't think supports mmconfig anyway, as well as some very specific PCI-X combinations (TG3_FLAG_PCIX_TARGET_HWBUG) which are also very unlikely to support mmconfig. Even if I am wrong for the tg3, I don't really think mmconfig vs type1 could make a noticeable performance difference on any common systems (obscure systems or hardware where it could potentially have a performance impact could use a non-default configuration). Loic
Re: [Fwd: Re: [PATCH 0/5]PCI: x86 MMCONFIG]
On 12/20/2007 1:16 PM, Matthew Wilcox wrote: > > Bad deduction. What's happening is that the write to the BAR is causing > it to overlap the decode for mmconfig space. So the mmconfig write to > set the BAR back never gets through. > > I have a different idea to fix this problem. Instead of writing > 0x, we could look for an unused bit of space in the E820 map and > write, say, 0xdfff to the low 32-bits of a BAR. Then it wouldn't > overlap, and we could find its size using MMCONFIG. > > Does anyone know how Windows handles these machines? Obviously, if it's > using MMCONFIG, it'd have the same problems. Does it just use type 1 > for initial sizing? Or does it use type 1 for all accesses below 256 > bytes? > Always using type 1 for accesses below 256 bytes looks like a very attractive solution. I know we had a lot of older kernels over the last two years that we patched like that (we needed MMCONFIG for our own device development purposes, but we also needed our machines to boot and discover all devices reliably). Recent kernels work fine out of the box on all hardware we have, but all this sometimes tricky and apparently endless MMCONFIG work (in big part because of buggy BIOSes) would probably become relatively easy by limiting the aim to having MMCONFIG work when it is required (for cfg-space accesses >= 256). Loic
Re: [Fwd: Re: [PATCH 0/5]PCI: x86 MMCONFIG]
On 12/20/2007 9:15 PM, Robert Hancock wrote: > > Suggested Workaround: It is strongly recommended that system designers do not > > connect the AMD-8132 and devices that use extended configuration space MMIO > > BARs (ex: HyperTransport-to-PCI Express® bridges) to the same processor > > HyperTransport link. Fix Planned: No > That does sound fairly definitive. I have to wonder why certain system > designers then didn't follow their strong recommendation.. Just curious, do you know of any system where that recommendation was not followed? On all motherboards where I have seen an AMD-8131 or an AMD-8132, they were alone on their hypertransport link, and other northbridges (more precisely hypertransport-to-pci-express or pci-whatever, often nvidia) with a MMCONFIG BAR were on one of the other available hypertransport links in the system. Loic
Re: MSI problem since 2.6.21 for devices not providing a mask in their MSI capability
On 10/3/2007 11:58 PM, Eric W. Biederman wrote: > Right. And INTx has such a pending bit as well. I guess I figured > if MSI was enabled transferring it over would be the obvious thing to > do. > The INTx pending and disable bits were only added starting with PCI 2.3, so in PCI-2.2 and PCI-X 1.0{,a,b} those bits don't exist at all, and there is still a significant number of such devices still in use or on the market. I agree it would look natural (and that probably happens for a lot of or most devices) to transfer the interrupt state from INTx to MSI, but I don't think you can rely on it without making some assumptions about device interrupt management that are outside the scope of the spec. > Of course it might even make sense simply to refuse to enable MSI > if there is not a masking capability present. > The possibility of masking for MSI was only specified (and then only as something optional) starting with PCI-3.0; PCI Express 1.0 and 1.0a are based on the older PCI-2.3 and corresponding devices are very unlikely to have it. So there might still be a majority of devices in the field with no MSI masking capability in the different PCI categories: conventional-PCI, PCI-X, PCI-Express. > >>> The PCI spec requires disabling/masking the msi when reprogramming it. >>> So as a general rule we can not do better. >>> >>> >> Do you have a reference for that requirement? The spec only vaguely >> associates MSI programming with "configuration", but I haven't found any >> explicit indication that it should not work. >> > I would have to look it up again but it said that the result is only > defined in the case when it is disabled/masked, when I looked a couple > of months ago. > > I found this quote in PCI-3.0/6.8.3.5: "For MSI-X, a function is permitted to cache Address and Data values from unmasked MSI-X Table entries. 
However, anytime software unmasks a currently masked MSI-X Table entry either by clearing its Mask bit or by clearing the Function Mask bit, the function must update any Address or Data values that it cached from that entry. If software changes the Address or Data value of an entry while the entry is unmasked, the result is undefined." I haven't seen a caching possibility mentioned for the MSI case, so apart from the problem with multi-word changes, maybe changing the Address or Data can be done at any time for MSI. >> I don't see how you can disable MSI through the control bit (which is >> equivalent to switching the device to INTx whether or not the INTx >> disable bit is set in PCI_COMMAND) in the middle of operations, not tell >> the driver, and not risk losing interrupts (unless you rely on much >> more than the spec). >> > > I will relook. My impression is that bit is defined as MSI enable. > Not mode switch. Although myrinet has clearly implemented it as > mode switch. > It is indeed defined as MSI-enable, but that's not a contradiction with being equivalent to a "mode switch between INTx and MSI" (ignoring MSI-X in that context). The spec seems to define the following "modes":
MSI-enable = 1, INTx-disable = x : MSI mode
MSI-enable = 0, INTx-disable = 0 : INTx mode with INTx-signal == INTx-pending
MSI-enable = 0, INTx-disable = 1 : INTx "polling/diag" mode using INTx pending bit
The only specificity of Myrinet is having relatively independent logic for the two modes, while at the same time requiring any pending INTx to be acked before starting any kind of new interrupt. > Interesting. So an irq fires before the driver has finished > processing the last instance of the irq. This is very close to a > screaming irq and something we may actually want to deal with. 
> > In our case it is true that the device can fire a bounded number of MSIs without acks (but not an infinite number: there are a limited number of interrupt tokens, and furthermore the interrupt rate is limited with a configurable minimum time between interrupts which defaults to ~10us). I suspect a race with other interrupts was involved, because otherwise that minimum inter-interrupt delay would prevent entering that code path. I think even a more casual interrupt scheme (with an explicit ack required for each interrupt before generating a new one) can also exercise that code path, since between the return from the handler and the clearing of IRQ_PROGRESS, there is an opportunity for the next interrupt to happen. To detect a crazy device generating storms of edge interrupts, I guess note_interrupt() could be called during this "reentrant detection" if masking was made conditional. Loic P.S.: just a little more context: in all Myrinet hardware, enough of the interrupt functionality is implemented in firmware that we can avoid losing interrupts whenever MSI-enable is toggled, and we already started distributing a firmware-based software update for users running linux >= 2.6.21 and using MSI. So for Myrinet the problem is more or less already closed. The only motivation for starting the thread was that it seemed a possibility that other
Re: MSI problem since 2.6.21 for devices not providing a mask in their MSI capability
On 10/3/2007 5:49 PM, Eric W. Biederman wrote: > Loic Prylli <[EMAIL PROTECTED]> writes: > > >> Hi, >> >> We observe a problem with MSI since kernel 2.6.21 where interrupts would >> randomly stop working. We have tracked it down to the new >> msi_set_mask_bit definition in 2.6.21. In the MSI case with a device not >> providing a "native" MSI mask, it was a no-op before, and now it >> disables MSI in the MSI-ctl register, which according to the PCI spec is >> interpreted as reverting the device to legacy interrupts. If such a >> device tries to generate a new interrupt during the "masked" window, the >> device will try a legacy interrupt which is generally >> ignored/never-acked, causing interrupts to no longer work for the >> device/driver combination (even after the enable bit is restored). >> > > We should also be leaving the INTx irqs disabled. So no irq > should be generated. > Even if the INTx line is not raised, you cannot rely on the device to retain memory of an interrupt triggered while MSI is disabled, and expect it to fire it in MSI form later when MSI is reenabled. The PCI spec does not provide any implicit or explicit guarantee about the MSI enable flag that would allow it to be used for temporary masking without running the risk of losing such interrupts. Moreover, even if you eventually call the interrupt handler to recover a lost interrupt, having switched the device to INTx mode (whether or not the INTx line was forced down with the corresponding pci-command bit) without informing the driver can (and will in our case) break interrupt handshaking, because MSI and INTx interrupts are not acked in the same way (INTx requires an extra step that we don't do for MSI and that the device will still expect unless going through driver init again). > If you have a mask bit implemented you are required to be > able to refire it after the msi is enabled. Indeed the masking case is well-defined by the spec (including the operation of the pending bits). 
And my subject was definitely restricted to devices without that masking capability. > I don't recall > the requirements for when both intx and msi irqs are both > disabled. Intuitively I would expect no irq message to > be generated, and at most the card would need to be polled > manually to recognize a device event happened. > > Certainly firing an irq and having it get completely lost is > unfortunate, and a major pain if you are trying to use the > card. > > As for the previous no-op behavior that was a bug. > OK, the no-op was a bug, but using the enable-bit for temporary masking purposes still feels like a bug too. I am afraid the only safe solution might be to prohibit any operation that absolutely requires masking if real masking is not available. Maybe the set_affinity method should simply be disabled for devices not supporting masking (unless there is an option of doing it without masking, for instance by guaranteeing only one word of the MSI capability is changed). > >> Is there anything apart from irq migration that strongly requires >> masking? Is it possible to do the irq migration without masking? >> > > enable_irq/disable_irq. Although we can get away with a software > emulation there and those are only needed if the driver calls them. > I don't think there is a problem here; no sane driver would depend on receiving edge interrupts triggered while irqs were explicitly disabled. > The PCI spec requires disabling/masking the msi when reprogramming it. > So as a general rule we can not do better. Do you have a reference for that requirement? The spec only vaguely associates MSI programming with "configuration", but I haven't found any explicit indication that it should not work. > Further because we are > writing to multiple pci config registers the only way we can safely > reprogram the message is with the msi disabled/masked on the card in > some fashion. > That's indeed a show-stopper. 
> I suspect what needs to happen is a spec search to verify that the > current linux behavior is at least reasonable within the spec. > I don't see how you can disable MSI through the control bit (which is equivalent to switching the device to INTx whether or not the INTx disable bit is set in PCI_COMMAND) in the middle of operations, not tell the driver, and not risk losing interrupts (unless you rely on much more than the spec). > I don't want to break anyone's hardware, but at the same time I want us > to be careful and in spec for the default case. > > The interrupt while doing set_affinity masking would certainly cause a problem for the device we use (MSI-enable switch between INTx and MSI mode, and both interru
MSI problem since 2.6.21 for devices not providing a mask in their MSI capability
Hi, We observe a problem with MSI since kernel 2.6.21 where interrupts would randomly stop working. We have tracked it down to the new msi_set_mask_bit definition in 2.6.21. In the MSI case with a device not providing a "native" MSI mask, it was a no-op before, and now it disables MSI in the MSI-ctl register, which according to the PCI spec is interpreted as reverting the device to legacy interrupts. If such a device tries to generate a new interrupt during the "masked" window, the device will try a legacy interrupt which is generally ignored/never-acked, causing interrupts to no longer work for the device/driver combination (even after the enable bit is restored). Is there anything apart from irq migration that strongly requires masking? Is it possible to do the irq migration without masking? Loic
Re: MSI problem since 2.6.21 for devices not providing a mask in their MSI capability
On 10/3/2007 5:49 PM, Eric W. Biederman wrote:
> Loic Prylli <[EMAIL PROTECTED]> writes:
>
>> Hi, We observe a problem with MSI since kernel 2.6.21 where interrupts
>> would randomly stop working. We have tracked it down to the new
>> msi_set_mask_bit definition in 2.6.21. In the MSI case with a device
>> not providing a native MSI mask, it was a no-op before, and now it
>> disables MSI in the MSI-ctl register which according to the PCI spec
>> is interpreted as reverting the device to legacy interrupts. If such
>> a device tries to generate a new interrupt during the masked window,
>> the device will try a legacy interrupt which is generally
>> ignored/never acked and causes interrupts to no longer work for the
>> device/driver combination (even after the enable bit is restored).
>
> We should also be leaving the INTx irqs disabled. So no irq should be
> generated.

Even if the INTx line is not raised, you cannot rely on the device to retain memory of an interrupt triggered while MSI is disabled, and expect it to fire it in MSI form later when MSI is re-enabled. The PCI spec does not provide any implicit or explicit guarantee about the MSI enable flag that would allow it to be used for temporary masking without running the risk of losing such interrupts.

Moreover, even if you eventually call the interrupt handler to recover a lost interrupt, having switched the device to INTx mode (whether or not the INTx line was forced down with the corresponding pci-command bit) without informing the driver can (and will in our case) break interrupt handshaking, because MSI and INTx interrupts are not acked in the same way (INTx requires an extra step that we don't do for MSI and that the device will still expect unless going through driver init again).

> If you have a mask bit implemented you are required to be able to
> refire it after the msi is enabled.

Indeed, the masking case is well-defined by the spec (including the operation of the pending bits). And my subject was definitely restricted to devices without that masking capability.

> I don't recall the requirements for when both intx and msi irqs are
> both disabled. Intuitively I would expect no irq message to be
> generated, and at most the card would need to be polled manually to
> recognize a device event happened. Certainly firing an irq and having
> it get completely lost is unfortunate, and a major pain if you are
> trying to use the card. As for the previous no-op behavior that was a
> bug.

OK, the no-op was a bug, but using the enable bit for temporary masking purposes still feels like a bug. I am afraid the only safe solution might be to prohibit any operation that absolutely requires masking if real masking is not available. Maybe the set_affinity method should simply be disabled for devices not supporting masking (unless there is an option of doing it without masking, for instance by guaranteeing only one word of the MSI capability is changed).

>> Is there anything apart from irq migration that strongly requires
>> masking? Is it possible to do the irq migration without masking?
>
> enable_irq/disable_irq. Although we can get away with a software
> emulation there and those are only needed if the driver calls them.

I don't think there is a problem here, no sane driver would depend on receiving edge interrupts triggered while irqs were explicitly disabled.

> The PCI spec requires disabling/masking the msi when reprogramming it.
> So as a general rule we can not do better.

Do you have a reference for that requirement? The spec only vaguely associates MSI programming with configuration, but I haven't found any explicit indication that it should not work.

> Further because we are writing to multiple pci config registers the
> only way we can safely reprogram the message is with the msi
> disabled/masked on the card in some fashion.

That's indeed a show-stopper.

> I suspect what needs to happen is a spec search to verify that the
> current linux behavior is at least reasonable within the spec.

I don't see how you can disable MSI through the control bit (which is equivalent to switching the device to INTx, whether or not the INTx disable bit is set in PCI_COMMAND) in the middle of operations, not tell the driver, and not risk losing interrupts (unless you rely on much more than the spec).

> I don't want to break anyones hardware, but at the same time I want us
> to be careful and in spec for the default case.

The interrupt arriving while set_affinity does its masking would certainly cause a problem for the device we use (MSI-enable switches between INTx and MSI mode, and the two kinds of interrupts are not acked the same way, assuming they would even be delivered to the driver), but I got some new data: upon further examination, the lost interrupts we have seen seem in fact to be caused at a different time:
- the problem is the mask_ack_irq() done in handle_edge_irq() when a new interrupt arrives before the IRQ_PROGRESS bit is cleared
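The mask_ack_irq() scenario in that last point can be modeled as a tiny state machine. This is a sketch of the generic edge-flow handling with invented names, not the actual kernel code; for a mask-less MSI device the "mask" below is really the MSI enable bit, i.e. a switch back to INTx.

```c
#include <stdbool.h>

/* Simplified per-irq state, mirroring IRQ_INPROGRESS / IRQ_PENDING /
 * masked state in an edge flow handler. */
struct edge_irq {
    bool inprogress;
    bool pending;
    bool masked;
    int  handled;   /* number of handler invocations */
};

/* A CPU takes the interrupt.  Returns false if servicing is already in
 * progress, in which case the new edge is latched and the vector masked
 * (the mask_ack_irq() path the mail blames for lost interrupts). */
static bool edge_irq_enter(struct edge_irq *irq)
{
    if (irq->inprogress) {
        irq->pending = true;
        irq->masked  = true;
        return false;
    }
    irq->inprogress = true;
    irq->handled++;            /* run the driver handler */
    return true;
}

/* Servicing finishes: unmask and replay any latched edge. */
static void edge_irq_exit(struct edge_irq *irq)
{
    while (irq->pending) {
        irq->pending = false;
        irq->masked  = false;  /* unmask...                 */
        irq->handled++;        /* ...and rerun the handler  */
    }
    irq->inprogress = false;
}
```

In the model the replay makes the second edge safe; with a mask-less MSI device the "mask" step itself can lose the edge, which is the failure mode reported.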
Re: PAT support for i386 and x86_64
On 8/7/2007 4:30 AM, Andi Kleen wrote:
> On Mon, Aug 06, 2007 at 10:03:15PM -0400, Cédric Augonnet wrote:
>
>> Hi all,
>>
>> For quite a while now, there as been numerous attempt to introduce
>> support for Page Attribute Table (PAT) in the Linux kernel, whereas
>> most other OS already have some support for this feature. Such a
>> proposition popping up periodically, perhaps it would be an
>> opportunity to fix that lack for once.
>
> The trouble is you need to avoid conflicting attributes, otherwise
> you risk cache corruption. This means the direct mapping needs to be
> fixed up and the kernel needs to keep track of the ranges to prevent
> conflicts.

I don't see why we have to worry about cache corruption in the case at hand. Write-combining is needed to map io (typically pci-mem regions) which are never mapped cachable anywhere, including in the linear map. If somebody for some reason needs to play with special attributes on regular RAM for which inconsistent aliasing could be a problem:
- please explain why that consistency issue is mentioned in the context of write-combining/PAT: the problem already potentially exists through the use of the _PAGE_PCD attribute, and having an extra WC choice should not make the problem worse or better (note that with the initial patch that WC/PAT combination is only exploited in pci_mmap_page_range(), which rightfully doesn't seem to care about cachable attribute consistency anyway).

> Also when there is already a MTRR it might not work due to the
> complicated rules of MTRR<->PAT interaction.

Some PAT<->MTRR cases are messy, but getting a WC type through PAT seems to straightforwardly take precedence over any MTRR type, doesn't it?

> Then there are old CPU errata that need to be handled etc.

We mentioned that point in the introduction of the patch. We have looked at the documented PAT errata that exist for the Pentium-II, Pentium-III, Pentium-M, and some early Pentium-IV processors. While there are minor variations in the description of the bug depending on the processor, they all fit the following description: "under some circumstances the PAT bit might be ignored". The patch purposefully puts write-combining at PAT6, so if the conditions are there for the errata to trigger, PAT2 (UC-) will be selected by the processor and the corresponding accesses will simply be uncacheable instead of write-combining, which doesn't affect correctness. We would certainly appreciate having any other errata we missed mentioned here for reference.

> There are also some other issues.

Introducing something like ioremap_wc() or an "sfence" wmb_wc() was excluded from the initial patch on purpose. It would be the logical next step (but involves possible driver API extensions), so the proposed patch was limited to making use of the new WC attribute by really handling the write_combine argument of pci_mmap_page_range(). That seemed to generate enough objections to start with. There is at least one mostly cosmetic problem in the patch in pci_mmap_page_range(), where huge pages should not be a concern.

> You didn't solve all that at all. If it was as simple as your patch
> we would have long done it already.

I am sorry, but after this and other messages on the list, I still don't understand why a simple approach hasn't been made available already:
- the attribute consistency issue seems independent of using PAT to create WC mappings (in particular, the possibility of mixing by accident WC and UC aliases has always existed; having a broken driver make such aliases through a new PAT-based attribute combination hardly changes anything, same for mixing uncachable and cachable aliases, where the new WC/PAT combination doesn't change anything).

Sure, there would be some logical follow-ups after a simple patch (like providing a proper ioremap() interface, maybe a new kind of barrier, and handling the PAT bit properly in arch/x86/mm/pageattr.c), but those follow-ups were purposefully excluded, and none seems very complex either. So if there is any remaining issue with doing something "as simple as this patch", please clarify it.

Best regards,
Loic
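The errata argument above (WC at PAT6 degrading to UC- at PAT2) follows directly from how the page-table bits index the PAT MSR. A minimal sketch, with illustrative memory-type names rather than the MSR's actual byte encodings, and assuming the power-on default layout for entries 0-3:

```c
/* Symbolic x86 memory types (names ours, not the MSR encodings) */
enum mem_type { WB, WT, UC_MINUS, UC, WC };

/* PAT table as the patch proposes it: defaults kept, WC placed at
 * entry 6 (the power-on default for PAT4-7 mirrors PAT0-3). */
static const enum mem_type pat_table[8] = {
    WB, WT, UC_MINUS, UC,   /* entries 0-3: power-on defaults  */
    WB, WT, WC,       UC,   /* entries 4-7: WC swapped into 6  */
};

/* Page-table bits select a PAT entry: index = PAT<<2 | PCD<<1 | PWT.
 * The documented errata amount to the PAT bit being ignored, i.e. the
 * index losing its high bit. */
static enum mem_type effective_type(int pat, int pcd, int pwt, int errata)
{
    int idx = (pat << 2) | (pcd << 1) | pwt;
    if (errata)
        idx &= 3;           /* PAT bit ignored: falls back to 0-3 */
    return pat_table[idx];
}
```

So on an errata-afflicted CPU a would-be WC mapping (index 6) silently becomes UC- (index 2): slower, but never cache-incoherent, which is the correctness claim made above.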
Re: [PATCH] MTRR: Fix race causing set_mtrr to go into infinite loop
On 6/25/2007 6:34 PM, Andi Kleen wrote:
> On Tuesday 26 June 2007 00:05:17 Chuck Ebbert wrote:
>
>> On 06/25/2007 05:38 PM, Loic Prylli wrote:
>>
>> [cc: Andi]
>>
>>> Processors synchronization in set_mtrr requires the .gate field
>>> to be set after .count field is properly initialized. Without an
>>> explicit barrier, the compiler was reordering those memory stores.
>>> That was sometimes causing a processor (in ipi_handler) to see the
>>> .gate change and decrement .count before the latter is set by
>>> set_mtrr() (which then hangs in an infinite loop with irqs disabled).
>
> Hmm, perhaps we should just put the smp_wmb into atomic_set().
> Near all other atomic operations have memory barriers too. I think
> that would be the better fix.
>
> -Andi

In Documentation/atomic_ops.txt, atomic_set/atomic_read are described as nothing more than a type-safe assignment or read, without any extra semantics. For the other atomic operations, the rule is that any atomic operation that doesn't return a value doesn't come with a barrier (and any operation that returns the atomic value must have memory barriers). So I guess you are suggesting to change the doc and the implementation for all arches.

I should admit I did not know before that a number of atomic operations do not imply memory barriers. But the extra cost might not be completely negligible, especially if, for consistency with the other "barrier-implied" atomic operations, a new memory barrier is put both before and after. Are you suggesting changing just atomic_set(), or also the other barrier-free atomic operations: "atomic_dec", "atomic_inc", "atomic_add", "atomic_sub"?

Independently of what is done to atomic, what about not making the .gate field an atomic_t, but a simple "int" in the mtrr code, since the only operations done on it are atomic_read and atomic_set?

Loic
[PATCH] MTRR: Fix race causing set_mtrr to go into infinite loop
Processors synchronization in set_mtrr requires the .gate field to be set after the .count field is properly initialized. Without an explicit barrier, the compiler was reordering those memory stores. That was sometimes causing a processor (in ipi_handler) to see the .gate change and decrement .count before the latter is set by set_mtrr() (which then hangs in an infinite loop with irqs disabled).

Signed-off-by: Loic Prylli <[EMAIL PROTECTED]>
---
 arch/i386/kernel/cpu/mtrr/main.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/arch/i386/kernel/cpu/mtrr/main.c b/arch/i386/kernel/cpu/mtrr/main.c
index 55b0051..75dc6d5 100644
--- a/arch/i386/kernel/cpu/mtrr/main.c
+++ b/arch/i386/kernel/cpu/mtrr/main.c
@@ -229,6 +229,8 @@ static void set_mtrr(unsigned int reg, unsigned long base,
 	data.smp_size = size;
 	data.smp_type = type;
 	atomic_set(&data.count, num_booting_cpus() - 1);
+	/* make sure data.count is visible before unleashing other CPUs */
+	smp_wmb();
 	atomic_set(&data.gate, 0);

 	/* Start the ball rolling on other CPUs */
@@ -242,6 +244,7 @@ static void set_mtrr(unsigned int reg, unsigned long base,
 	/* ok, reset count and toggle gate */
 	atomic_set(&data.count, num_booting_cpus() - 1);
+	smp_wmb();
 	atomic_set(&data.gate, 1);

 	/* do our MTRR business */
@@ -260,6 +263,7 @@ static void set_mtrr(unsigned int reg, unsigned long base,
 		cpu_relax();
 	atomic_set(&data.count, num_booting_cpus() - 1);
+	smp_wmb();
 	atomic_set(&data.gate, 0);

 	/*
--
1.5.2.2
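The ordering the patch enforces can be reproduced in user space with C11 atomics. This is a sketch of the same gate/count rendezvous pattern, not kernel code: a release fence stands in for smp_wmb(), and the names are invented for the model.

```c
#include <stdatomic.h>

/* Model of the set_mtrr() rendezvous: .count must be initialized and
 * visible before .gate flips, which the added smp_wmb() (here a
 * release fence) guarantees. */
struct mtrr_sync {
    atomic_int count;
    atomic_int gate;
};

/* set_mtrr() side: arm the rendezvous for the other CPUs. */
static void master_open_gate(struct mtrr_sync *s, int other_cpus)
{
    atomic_store_explicit(&s->count, other_cpus, memory_order_relaxed);
    /* make sure count is visible before unleashing other CPUs */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&s->gate, 1, memory_order_relaxed);
}

/* ipi_handler() side: returns waiters remaining after this CPU checks
 * in, or -1 if the gate is still closed (the caller would spin). */
static int handler_check_in(struct mtrr_sync *s)
{
    if (atomic_load_explicit(&s->gate, memory_order_relaxed) == 0)
        return -1;
    /* pairs with the release fence: count is now guaranteed valid */
    atomic_thread_fence(memory_order_acquire);
    return atomic_fetch_sub(&s->count, 1) - 1;
}
```

Without the release fence, the count store may become visible after the gate store, and a handler can decrement a stale count — the exact hang the patch fixes.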
mask/unmasking while servicing MSI(-X) unnecessary?
While looking into using MSI-X for our myri10ge driver, we have seen that the msi(x) code (drivers/pci/msi.c) masks the MSI(-X) vector while servicing an interrupt. We are not sure why this masking is needed (for instance, no such thing is done for "edge IOAPIC" interrupts). There seem to be several mechanisms already, each individually protecting against "losing an interrupt" without masking:
- the "x86" architecture is able to queue 2 interrupt messages. That guarantees an interrupt handler will always start after the last MSI received (even in the case of a big burst of MSI messages).
- Even if there wasn't that interrupt queuing, ack_APIC_irq() could be moved into the ack() method. Then things would work without masking even on a hypothetical platform where a new interrupt is completely ignored (with no IRR-like register) while servicing the same vector (anyway, since this "msi" code is already tied to "x86" architecture specificities, that hypothetical platform might not be relevant).
- Almost every driver/device has its own way of acknowledging interrupts anyway, which also by itself makes the masking/unmasking unnecessary.
- The masking by itself is racy unless the driver interrupt handler starts by making sure the masking request has reached the device with some kind of MMIO read. Such an MMIO read is normally the kind of costly request you are happy to get rid of in the MSI model.

So if it is not useful, it might be better to avoid that systematic mask/unmask pair. This masking has a small but measurable performance impact for our device/driver combo. Would you agree to suppress that masking (sample patch following)? Or otherwise, is there a possibility of making it optional on a per-device basis?

Thanks for any comment!

Loic Prylli
Myricom, Inc.

[PATCH] Don't mask the interrupt vector while servicing an MSI interrupt.
Signed-off-by: Loic Prylli <[EMAIL PROTECTED]>
---
 drivers/pci/msi.c |   19 ++++++-------------
 1 files changed, 6 insertions(+), 13 deletions(-)

98e9a27091c3ccdc8a8a72cea9ab9086fb258af3
diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index a83c1f5..af446e3 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -200,19 +200,12 @@ static void shutdown_msi_irq(unsigned in
 	spin_unlock_irqrestore(&msi_lock, flags);
 }

-static void end_msi_irq_wo_maskbit(unsigned int vector)
+static void end_msi_irq(unsigned int vector)
 {
 	move_native_irq(vector);
 	ack_APIC_irq();
 }

-static void end_msi_irq_w_maskbit(unsigned int vector)
-{
-	move_native_irq(vector);
-	unmask_MSI_irq(vector);
-	ack_APIC_irq();
-}
-
 static void do_nothing(unsigned int vector)
 {
 }
@@ -227,8 +220,8 @@ static struct hw_interrupt_type msix_irq
 	.shutdown	= shutdown_msi_irq,
 	.enable		= unmask_MSI_irq,
 	.disable	= mask_MSI_irq,
-	.ack		= mask_MSI_irq,
-	.end		= end_msi_irq_w_maskbit,
+	.ack		= do_nothing,
+	.end		= end_msi_irq,
 	.set_affinity	= set_msi_affinity
 };
@@ -243,8 +236,8 @@ static struct hw_interrupt_type msi_irq_
 	.shutdown	= shutdown_msi_irq,
 	.enable		= unmask_MSI_irq,
 	.disable	= mask_MSI_irq,
-	.ack		= mask_MSI_irq,
-	.end		= end_msi_irq_w_maskbit,
+	.ack		= do_nothing,
+	.end		= end_msi_irq,
 	.set_affinity	= set_msi_affinity
 };
@@ -260,7 +253,7 @@ static struct hw_interrupt_type msi_irq_
 	.enable		= do_nothing,
 	.disable	= do_nothing,
 	.ack		= do_nothing,
-	.end		= end_msi_irq_wo_maskbit,
+	.end		= end_msi_irq,
 	.set_affinity	= set_msi_affinity
 };
--
1.3.2
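The first argument above — the local APIC effectively latching one extra message per vector while one is in service — can be sketched with ISR/IRR-style bits. This is a rough model with invented names; real APIC behavior is more involved.

```c
#include <stdbool.h>

struct msi_vector {
    bool in_service;  /* ISR-like bit: handler currently running  */
    bool pending;     /* IRR-like bit: one more edge latched      */
    int  runs;        /* handler invocations                      */
};

/* An MSI message arrives for the vector. */
static void msi_arrive(struct msi_vector *v)
{
    if (v->in_service) {
        v->pending = true;   /* latched; further edges coalesce */
    } else {
        v->in_service = true;
        v->runs++;           /* handler starts */
    }
}

/* The driver handler finishes and the CPU writes EOI. */
static void msi_eoi(struct msi_vector *v)
{
    v->in_service = false;
    if (v->pending) {        /* replay the latched edge */
        v->pending = false;
        v->in_service = true;
        v->runs++;
    }
}
```

However large the burst during service, the handler is guaranteed one more run after the last message, which is why the mail argues the mask/unmask pair around each interrupt buys nothing.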