Re: [PATCH] net: Fix NETDEV_CHANGE notifier usage causing spurious arp flush
On Wed, Jul 02, 2014 at 10:03:49AM +0300, Dan Aloni wrote:
> On Tue, Jul 01, 2014 at 09:39:43PM -0700, Loic Prylli wrote:
> > A bug was introduced in the NETDEV_CHANGE notifier sequence causing the
> > arp table to be sometimes spuriously cleared (including manual arp
> > entries marked permanent), upon network link carrier changes.
> >
> > The changed argument for the notifier was applied only to a single
> > caller of NETDEV_CHANGE, missing among others netdev_state_change().
> > So upon net_carrier events induced by the network, which are
> > triggering a call to netdev_state_change(), arp_netdev_event() would
> > decide whether or not to clear the arp cache based on random/junk
> > stack values (a kind of read buffer overflow).
> [..]
> >  {
> >  	if (dev->flags & IFF_UP) {
> > -		call_netdevice_notifiers(NETDEV_CHANGE, dev);
> > +		struct netdev_notifier_change_info change_info;
> > +
> > +		change_info.flags_changed = 0;
>
> I think it would be safer to do:
>
> 	struct netdev_notifier_change_info change_info = {};
>
> So that when future fields are added to the struct and this call-site
> happens to be forgotten, they will get 0 by default rather than
> random stack values.

Thanks for the review. I will take the suggestion into account.

For the record, another (possibly bigger) trap in the preexisting code that remains (and caused the bug) is that NETDEV_CHANGE is the only netdev notifier with a different, special calling sequence. Since calls to the NETDEV_CHANGE notifier have been reduced over time to the two instances in net/core/dev.c, it hopefully won't be a problem (and fixing that maintainability issue would be out of scope for this simple, low-risk bug fix).

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
[PATCH] net: Fix NETDEV_CHANGE notifier usage causing spurious arp flush
A bug was introduced in the NETDEV_CHANGE notifier sequence causing the
arp table to be sometimes spuriously cleared (including manual arp
entries marked permanent), upon network link carrier changes.

The changed argument for the notifier was applied only to a single
caller of NETDEV_CHANGE, missing among others netdev_state_change().
So upon net_carrier events induced by the network, which are
triggering a call to netdev_state_change(), arp_netdev_event() would
decide whether or not to clear the arp cache based on random/junk
stack values (a kind of read buffer overflow).

Fixes: be9efd365328 ("net: pass changed flags along with NETDEV_CHANGE event")
Fixes: 6c8b4e3ff81b ("arp: flush arp cache on IFF_NOARP change")
Signed-off-by: Loic Prylli
---
 net/core/dev.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 30eedf6..63129d4 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -148,6 +148,9 @@ struct list_head ptype_all __read_mostly;	/* Taps */
 static struct list_head offload_base __read_mostly;

 static int netif_rx_internal(struct sk_buff *skb);
+static int call_netdevice_notifiers_info(unsigned long val,
+					 struct net_device *dev,
+					 struct netdev_notifier_info *info);

 /*
  *	The @dev_base_head list is protected by @dev_base_lock and the rtnl
@@ -1207,7 +1210,11 @@ EXPORT_SYMBOL(netdev_features_change);
 void netdev_state_change(struct net_device *dev)
 {
 	if (dev->flags & IFF_UP) {
-		call_netdevice_notifiers(NETDEV_CHANGE, dev);
+		struct netdev_notifier_change_info change_info;
+
+		change_info.flags_changed = 0;
+		call_netdevice_notifiers_info(NETDEV_CHANGE, dev,
+					      &change_info.info);
 		rtmsg_ifinfo(RTM_NEWLINK, dev, 0, GFP_KERNEL);
 	}
 }
--
2.0.0.526.g5318336
Re: [Patch v2] Make PCI extended config space (MMCONFIG) a driver opt-in
On 1/15/2008 2:38 PM, Linus Torvalds wrote:
> On Tue, 15 Jan 2008, Tony Camuso wrote:
>> Linus is confident that conf1 is not going away for at least the next
>> five years.
>
> Not on PC's. Small birds tell me that there can be all these non-PC
> x86 subarchitectures that may or may not have conf1.
>
> Linus

But is there an ACPI-compliant architecture that only offers mmconfig for configuration-space access and no other fallback method (i.e. no conf1, no BIOS, ...)? 2.6.24 supports mmconfig for:
- ACPI systems with an MCFG table
- a couple of chipsets discovered by conf1

If a system has no conf1, but does not have e820+ACPI+MCFG, or does have some method other than mmconfig, it was already irrelevant in the discussion of Ivan's initial patch in December (because that system was either never supported or not impacted, and we were trying to fix bugs, not introduce support for a new class of systems).

Maybe Arjan could share his knowledge, and tell us what system he was thinking about (and whether it needed to be supported by 2.6.24) when saying: "When (and I'm saying "when" not "if") systems arrive that only have MMCONFIG for some of the devices."

Anyway, Ivan's patch + Matthew's extensions handle that non-PC arch. That combination is advocated by at least:
  Ivan Kokshaysky
  Matthew Wilcox
  Tony Camuso
  Loic Prylli
Even Arjan has said that while he prefers his patch (saying it's more conservative), he does not see an existing problem with the Ivan/Matthew combination. [ Simpler, less ambitious fixes can be forgotten if nothing can be done for 2.6.24; I can understand that choice. ]

The list of problems I see with Arjan's patch is:

- no word on whether the existing Linux drivers/pci/pcie/aer code should be converted to opt-in;

- mmconfig still needs to be revisited to sort out the mix of mmconfig+conf1+third-method access;

- you cannot test whether ext-conf-space is available without taking risks: when pci_enable_ext_config() is called, even legacy-conf-space is switched to the new method. So some administrator action (lspci -v + maybe-other-flag) or some driver action (that can optionally use ext-conf-space but does not *rely* on it) could cause some devices to totally disappear (if some pci hierarchy is handled by mmconfig as a 0x... section as seen on many amd machines). Matthew/Ivan will simply, in the worst case, detect that ext-conf-space is not available in pci_cfg_space_size(); legacy-conf-space will still work (and that 0x... section is perfectly *safe* to query; tell me if you need more details of why);

- it introduces a new user API, and a new kernel API, while in practice there is no evidence that it brings any benefit compared to Ivan/Matthew.

IMHO, making "pci=nommconf" the default behaviour is better than Arjan's patch: for the exaggerated 99.99% of users he claims don't need ext-conf-space, that's obviously as good. And many of the others would benefit from the ability to test whether ext-conf-space is available, and optionally use it, without taking the risk of crashing something, so something else is better for them.

With Arjan's patch, in 10 years we might still have to use an extra option (or some other action) when using lspci to display extended caps, and we would still run the risk of crashing some old machine when doing so (unless maybe a blacklist of some sort is added, making the newly introduced API completely useless soon, or unless we keep the painful bitmaps in mmconfig, potentially ending up with 3 sets of pci-ops).

Loic
Re: [Patch v2] Make PCI extended config space (MMCONFIG) a driver opt-in
On 1/14/2008 6:04 PM, Adrian Bunk wrote:
>> I thought so, but due to the way that things are initialised, mmconfig
>> happens before conf1. conf1 is known to be usable, but hasn't set
>> raw_pci_ops at this point. Confusing, and not ideal, but fixing this
>> isn't in scope for 2.6.24.
>
> ... *ahem*
>
> I don't think anything of what was discussed in this thread would be in
> scope for 2.6.24 (unless Linus wants to let the bunny that brings eggs
> release 2.6.24).
>
> cu
> Adrian

Why not put into 2.6.24 a simple fix for the last known remaining mmconfig problems? There have mostly been three bugs related to mmconfig:
- BIOS/hardware: exaggerated MCFG claims: solved long ago
- hardware: buggy CRS+mmconfig chipset: fix included last month
- Linux code: mmconfig incompatible with live BAR-probing: *not fixed*

It would be ironic not to fix the only one that is really confined to the Linux code. Everybody more or less agrees that *any* of the patches submitted so far solves the known problems and will not cause regressions. The only long discussion is about how best to prevent the effect of an "imaginary" fourth bug, and by nature that's a controversial topic.

For 2.6.24, if nothing more than a few lines can be done, either make pci=nommconf the default and add a pci=mmconf option, or/and apply one of the easiest patches to review, i.e. Tony's, so small I copy it again below (using 0x40 or 0x100 for the comparison does not really matter; personally I would change it to 0x100 to be like Ivan's patch, but either is much better than nothing). Replacing some mmconfig accesses by conf1 cannot cause any regression.

Loic

P.S.: With that patch, conf1-less x86 systems requiring mmconfig would not be supported. But they are like UFOs. There are plenty of them in the galaxy, but earth sightings are not convincing enough for 2.6.24 support; they can wait for 2.6.25.
diff --git a/arch/x86/pci/mmconfig_32.c b/arch/x86/pci/mmconfig_32.c
index 1bf5816..4474979 100644
--- a/arch/x86/pci/mmconfig_32.c
+++ b/arch/x86/pci/mmconfig_32.c
@@ -73,7 +73,7 @@ static int pci_mmcfg_read(unsigned int seg, unsigned int bus,
 	}

 	base = get_base_addr(seg, bus, devfn);
-	if (!base)
+	if ((!base) || (reg < 0x40))
 		return pci_conf1_read(seg,bus,devfn,reg,len,value);

 	spin_lock_irqsave(&pci_config_lock, flags);
@@ -106,7 +106,7 @@ static int pci_mmcfg_write(unsigned int seg, unsigned int bus,
 		return -EINVAL;

 	base = get_base_addr(seg, bus, devfn);
-	if (!base)
+	if ((!base) || (reg < 0x40))
 		return pci_conf1_write(seg,bus,devfn,reg,len,value);

 	spin_lock_irqsave(&pci_config_lock, flags);
diff --git a/arch/x86/pci/mmconfig_64.c b/arch/x86/pci/mmconfig_64.c
index 4095e4d..4ad1fcb 100644
--- a/arch/x86/pci/mmconfig_64.c
+++ b/arch/x86/pci/mmconfig_64.c
@@ -61,7 +61,7 @@ static int pci_mmcfg_read(unsigned int seg, unsigned int bus,
 	}

 	addr = pci_dev_base(seg, bus, devfn);
-	if (!addr)
+	if ((!addr) || (reg < 0x40))
 		return pci_conf1_read(seg,bus,devfn,reg,len,value);

 	switch (len) {
@@ -89,7 +89,7 @@ static int pci_mmcfg_write(unsigned int seg, unsigned int bus,
 		return -EINVAL;

 	addr = pci_dev_base(seg, bus, devfn);
-	if (!addr)
+	if ((!addr) || (reg < 0x40))
 		return pci_conf1_write(seg,bus,devfn,reg,len,value);

 	switch (len) {
Re: [Patch v2] Make PCI extended config space (MMCONFIG) a driver opt-in
On 1/13/2008 3:43 PM, Matthew Wilcox wrote:
> On Sun, Jan 13, 2008 at 10:41:24AM -0800, Arjan van de Ven wrote:
>> Note: There is not a 100% overlap between "need" and "will not be used
>> in the patches that use legacy for < 256". In the other patches
>> posted, extended config space will be used in cases where it won't be
>> with my patch. (Most obvious one is an "lspci -vx" from automated
>> scripts).
>
> I believe you to be mistaken in this belief. If you take Ivan's patch,
> conf1 is used for all accesses below 256 bytes. lspci -x only dumps
> config space up to 64 bytes; lspci - is needed to show extended pci
> config space.

I agree with Arjan about that "not a 100% overlap". It is about the extra ext-conf-space access done while probing in drivers/pci/probe.c:

	dev->cfg_size = pci_cfg_space_size(dev);

(and lspci -v will also query/show the list of extended caps for pci-x/pcie-x devices that have some, provided the kernel can access ext-conf-space).

With Ivan's patch, that line would still cause one extended-conf-space access at offset 256 for pcie/pci-x2 devices (to check the ability to query ext-space). Arjan's "opt-in" patch would prevent that extra access. IMHO that access is OK and harmless in all cases, since we are already protected by the MCFG/e820 checks, but I agree one can express a different opinion based on trying to prevent "never-seen/potential" hardware/BIOS bugs. FWIW, it is also there that I suggested excluding PCI-X2 devices (when restricted to pcie, that access while probing cannot even cause the harmless master-abort/0x...), but there is a small trade-off.

Loic
Re: [Patch v2] Make PCI extended config space (MMCONFIG) a driver opt-in
On 1/13/2008 1:41 PM, Arjan van de Ven wrote:
> On Sun, 13 Jan 2008 13:23:35 -0500
> Loic Prylli <[EMAIL PROTECTED]> wrote:
>> Matthew pointed to a patch that basically does what you suggested;
>> only one comment on your mail is left after that:
>>
>> 2) The pci_enable_ext_config() function, the dev->ext_cfg_space
>> field, and the sysfs interface should be removed from the patch.
>> There has never been a problem report of crashes or any undefined
>> behaviour while trying to access ext-conf-space; all the problems
>> were about *using mmconfig to access legacy-conf-space*.
>
> This entirely misses the point of why I made the patch. The point is
> NOT that devices are buggy. The point is that right now, 99.99% of the
> machines out there do NOT need extended config space (no matter how it
> gets accessed). The point of my patch was to make people who don't
> need extended config space not have to deal with it anymore.

I think I got your point the first time, and I agree it is sound. But in my subjective and biased opinion, I just think ext-conf-space is already useful and widespread enough (being used is not the same as being strictly required for basic operation) for your proposed trade-off to not be optimal (protecting against "future/non-proven" hardware bugs, i.e. bringing non-proven benefits, at the expense of making life harder for ext-conf-space users while adding extra API/code).

To take an example from the linux tree: the drivers/pci/pcie/aer code uses ext-conf-space for every pcie root (currently several distributions enable it by default). Does that mean opt-in would be automatically activated for most pcie hierarchies (defeating most of the benefits of being opt-in), or do we just disable that code by default? Will lspci -v automatically opt-in all pcie devices (right now, by default, it tries to list the extended capabilities for pcie and pcix), or do we now require manual explicit sysfs operations to get the whole thing? Or is it an additional flag to lspci (if so, will that flag also apply to pcix, possibly causing a crash for lspci -v - on some machines)?

> Note: There is not a 100% overlap between "need" and "will not be used
> in the patches that use legacy for < 256". In the other patches
> posted, extended config space will be used in cases where it won't be
> with my patch. (Most obvious one is an "lspci -vx" from automated
> scripts).

To go one step in your direction, I have already argued in a couple of emails that I would prefer not to implement ext-conf-space access for any PCI-X devices (removing PCI-X2 from pci_ext_cfg_size), because there we are trying to support devices that we don't really know exist or will ever exist. And protecting against "unproven bugs" makes more sense when it only removes "unproven benefits".

> Is that a problem? We've had 2 years of mess, with one not-enough
> patch after another. There still are problems TODAY (eg in
> 2.6.24-rc7). The patch that falls back to an alternative method for
> below 256 is no doubt a step in the right direction. (Although I'm not
> all that happy about mixing access types, it's not provably
> incorrect.) Is it enough? I'm not sure.

FWIW, I have in my tree a patch almost identical to Ivan's dated "December 2005". Because of the constant activity on the mmconfig front (which I thought would make it obsolete), I never took the effort of suggesting it before one month ago (I am not a regular reader of linux-kernel). I admit nobody else should view it that way, but for me, rather than the latest attempt at fixing mmconfig, it's a patch first used two years ago that would arguably have prevented all the problems that have been reported since then.

Besides, recent mails show that hypothetically we could even leave the existing conf-space code unchanged, since the only known remaining bug is the one associated with BAR probing, and it could be addressed by:

http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc6/2.6.24-rc6-mm1/broken-out/pci-disable-decoding-during-sizing-of-bars.patch

[ Thanks to Robert Hancock and Grant Grundler for explaining to me the history of BAR probing last month ]

Even if that BAR-probing patch were applied (maybe it needs to be more combat-proven), by default it still seems better not to use mmconfig for legacy-conf-space access, but going two extra precaution steps beyond what seems necessary might be excessive.

> Only time can tell I suppose, but the risk side is that if it is not
> enough, users who don't need the extended config space for
> functionality will suffer the bugs AGAIN.

You can indeed never exclude that possibility 100%, but if they see a problem again, it is likely to be a new category of hardware/BIOS bugs never seen before.

Loic
Re: [Patch v2] Make PCI extended config space (MMCONFIG) a driver opt-in
On 1/12/2008 12:45 PM, Arjan van de Ven wrote:
> On Sat, 12 Jan 2008 17:40:30 +0300
> Ivan Kokshaysky <[EMAIL PROTECTED]> wrote:
>> +	if (reg < 256)
>> +		return pci_conf1_read(seg,bus,devfn,reg,len,value);
>> +
>
> btw this is my main objection to your patch; it intertwines the conf1
> and mmconfig code even more. When (and I'm saying "when" not "if")
> systems arrive that only have MMCONFIG for some of the devices, we'll
> have to detangle this again, and I'm really not looking forward to
> that.

conf1 has been a hardcoded dependency of mmconfig for years. Ivan's patch does not make it worse (in fact it considerably simplifies that code, making it easier to untangle later).

IMHO, either your patch or Ivan's can be a good base, but:

1) For your remark above to be given any consideration, your patch should be modified to remove the hardcoded conf1 from the *current* mmconfig code, otherwise we end up with 3 sets of ops (mmconfig + conf1 + a possible third set of operations) intertwined in a confusing manner. And removing that dependency is not a straightforward operation unless you also do 2):

2) The pci_enable_ext_config() function, the dev->ext_cfg_space field, and the sysfs interface should be removed from the patch. There has never been a problem report of crashes or any undefined behaviour while trying to access ext-conf-space; all the problems were about *using mmconfig to access legacy-conf-space*. The "if (dev->cfg_space_ext > 0)" checks can instead be replaced by "if (reg >= 256)". Otherwise, when using per-device explicit enabling, just *checking* whether ext-conf-space is available by calling pci_enable_ext_config() will make some of the old problems of *losing legacy conf-space* come back: you would have introduced a new user-space and kernel API while only solving half the problems, not a good deal.

If you do 1) and 2), then you really support the good properties you claimed:
- You can use mmconfig for ext-space and something else for legacy-space.
- You can use mmconfig for everything (for instance if conf1 is not implemented).

Of course it is just as straightforward to modify Ivan's patch to also have the same properties.

Loic
Re: [Patch v2] Make PCI extended config space (MMCONFIG) a driver opt-in
On 1/13/2008 1:41 PM, Arjan van de Ven wrote:
> On Sun, 13 Jan 2008 13:23:35 -0500 Loic Prylli <[EMAIL PROTECTED]> wrote:
>> Matthew pointed a patch that basically does what you suggested; only
>> one comment on your mail left after that:
>>
>> 2) the pci_enable_ext_config() function and dev->ext_cfg_space field,
>> sysfs interface should be removed from the patch. There has never been
>> a problem report of crashes or any undefined behaviour while trying to
>> access ext-conf-space, all the problems were *using mmconfig to access
>> legacy-conf-space*.
>
> This entirely misses the point of why I made the patch. The point is
> NOT that devices are buggy. The point is that right now, 99.99% of the
> machines out there do NOT need extended config space (no matter how it
> gets accessed). The point of my patch was to make people who don't
> need extended config space not have to deal with it anymore.

I think I got your point the first time, and I agree it is sound. But in my subjective and biased opinion, I just think ext-conf-space is already useful and widespread enough (being used is not the same as being strictly required for basic operation) for your proposed tradeoff to not be optimal (protecting against future/non-proven hardware bugs, i.e. bringing non-proven benefits, at the expense of making life harder for ext-conf-space users while bringing additional extra API/code).

To take an example from the linux tree: the drivers/pci/pcie/aer code uses ext-conf-space for every pcie-root (currently several distributions enable it by default). Does that mean opt-in would be automatically activated for most pcie hierarchies (defeating most of the benefits of being opt-in), or do we just disable that code by default? Will lspci -v automatically opt-in all pcie devices (right now by default it tries to list the extended capabilities for pcie and pcix), or do we now require manual explicit sysfs operations to get the whole thing? Is it an additional flag to lspci (if so, will that flag also apply to pcix, possibly causing a crash for lspci -v -opt-in-all-potential-ext-devices on some machines)?

> Note: There is not a 100% overlap between "need" and "will not be
> used" in the patches that use legacy for < 256. In the other patches
> posted, extended config space will be used in cases where it won't be
> with my patch. (Most obvious one is an lspci -vx from automated
> scripts).

To go one step in your direction, I have already argued in a couple of emails that I would prefer to not implement ext-conf-space access for any PCI-X devices (removing PCI-X2 from pci_ext_cfg_size), because there we are trying to support devices that we don't really know exist or will ever exist. And protecting against unproven bugs makes more sense when it only removes unproven benefits.

> Is that a problem? We've had 2 years of mess, with one not-enough
> patch after another. There still are problems TODAY (e.g. in
> 2.6.24-rc7). The patch that falls back to an alternative method for
> below 256 is no doubt a step in the right direction (although I'm not
> all that happy about mixing access types, it's not provably
> incorrect). Is it enough? I'm not sure.

FWIW, I have in my tree a patch almost identical to Ivan's dated December 2005. Because of the constant activity on the mmconfig front (which I thought would make it obsolete), I never took the effort of suggesting it before one month ago (I am not a regular user of linux-kernel). I admit nobody else should view it that way, but for me, rather than the last attempt at fixing mmconfig, it's a patch first used two years ago that would have arguably prevented all the problems that have been reported since then.

Besides, recent mails show that hypothetically we could even not change anything in the existing conf-space code, since the only known remaining bug is the one associated with bar probing, and could be addressed by:
http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc6/2.6.24-rc6-mm1/broken-out/pci-disable-decoding-during-sizing-of-bars.patch
[ Thanks to Robert Hancock and Grant Grundler for explaining to me the history of bar-probing last month ]

Even if that bar-probing patch were applied (maybe it needs to be more combat-proven), by default it still seems better to not use mmconfig for legacy-conf-space access, but going two extra precaution steps beyond what seems necessary might be excessive.

> Only time can tell I suppose, but the risk side is that if it is not
> enough, users who don't need the extended config space for
> functionality will suffer the bugs AGAIN.

You can indeed never exclude 100% that possibility, but if they see a problem again, it is likely to be a new category of hardware/BIOS bugs never seen before.

Loic
Re: [Patch v2] Make PCI extended config space (MMCONFIG) a driver opt-in
On 1/13/2008 3:43 PM, Matthew Wilcox wrote:
> On Sun, Jan 13, 2008 at 10:41:24AM -0800, Arjan van de Ven wrote:
>> Note: There is not a 100% overlap between "need" and "will not be
>> used" in the patches that use legacy for < 256. In the other patches
>> posted, extended config space will be used in cases where it won't be
>> with my patch. (Most obvious one is an lspci -vx from automated
>> scripts).
>
> I believe you to be mistaken in this belief. If you take Ivan's patch,
> conf1 is used for all accesses below 256 bytes. lspci -x only dumps
> config space up to 64 bytes; lspci -xxxx is needed to show extended
> pci config space.

I agree with Arjan about that "not a 100% overlap". It is about the extra ext-conf-space access done while probing in drivers/pci/probe.c:

	dev->cfg_size = pci_cfg_space_size(dev);

(and lspci -v will also query/show the list of extended-caps for pci-x/pcie devices that have some, provided the kernel can access ext-conf-space). With Ivan's patch, that line would still cause one extended-conf-space access at offset 256 for pcie/pci-x2 devices (to check the ability to query ext-space). Arjan's opt-in patch would prevent that extra access. IMHO that access is OK and harmless in all cases, as we are already protected by MCFG/e820 checks, but I agree one can express a different opinion based on trying to prevent never-seen/potential hardware/BIOS bugs. FWIW, it is also there that I suggested excluding PCI-X2 devices (when restricted to pcie, that access while probing cannot even cause the harmless master-abort/0xffffffff), but there is a small trade-off.

Loic
Re: [Patch v2] Make PCI extended config space (MMCONFIG) a driver opt-in
On 1/13/2008 1:01 AM, Matthew Wilcox wrote:
> On Sat, Dec 29, 2007 at 12:12:19AM +0300, Ivan Kokshaysky wrote:
>> On Fri, Dec 28, 2007 at 12:40:53PM -0500, Loic Prylli wrote:
>>> One thing that could be changed in pci_cfg_space_size() is to avoid
>>> making a special case for PCI-X 266MHz/533MHz (assume cfg_size == 256
>>> for such devices too, reserve extended cfg-space for pci-express
>>> devices).
>>
>> I agree, we should remove it. IIRC, this PCI-X check was written long
>> ago with some draft (not a final spec) in hands. Matthew?
>
> I have what I believe to be the released version of PCI-X 2.0a (July
> 22, 2003). It is quite clear that Mode 2 devices (ie those running at
> 266MHz or 533MHz) are required to support all 4096 bytes of extended
> config space. More to the point, I don't think we have any bug reports
> suggesting that PCI-X Mode 2 devices/bridges have any problems.

As PCI-X2 bridge/chipset, I only know about the AMD-8132 (from what I understand it does PCI-X Mode 2), and some obscure IBM enterprise chipset (I am sure there are a few more). Too bad for the spec, but we definitely know for sure the AMD-8132 doesn't do ext-space (and makes it unusable for any device behind it).

> There are relatively few of them in existence, and my impression is
> that PCI-X2 is only being implemented on server-class machines.

True.

> 'Consumer grade' equipment is where all the problems lie anyway.

mmconfig has been a pain on the servers too (there are a lot of server-class amd machines using one pcie/mmconfig chipset + amd-8131/2).

> While the PCI-X 2.0a spec does not define any Extended Capability IDs,
> it simply states that "This field is a PCI-SIG defined ID number that
> indicates the nature and format of the Extended Capabilities List
> item". The PCIe spec does define Extended Capability IDs, and I would
> think it's entirely appropriate to use the same IDs for PCI-X Mode 2
> devices.

Sure, it might be needed on PCI-X2. But contrary to pcie (where the drivers/pci/pcie/aer subsystem already uses ext-conf-space, and other usages are bound to increase), needing ext-conf-space in the future on pci-x2 is quite unlikely (pcie is long-lived, whereas PCI-X2 was short-lived, obsoleted by PCI-E, and nobody has mentioned yet an example of using ext-registers with a PCI-X2 device). I was only mentioning that because of the very small trade-off: if you don't exclude PCI-X2, on platforms with the amd-8132 + bad MCFG, you might trigger a cfg-read == 0xffffffff/master-abort in pci_cfg_space_size() for such devices with Ivan's patch. This is harmless, because a lot of similar master-aborts happen during PCI probing anyway, so one more won't change anything. Anyway, I am equally happy with keeping pci_cfg_space_size() as it is.

Loic
Re: [Patch v2] Make PCI extended config space (MMCONFIG) a driver opt-in
On 12/28/2007 1:06 AM, Benjamin Herrenschmidt wrote:
> On Thu, 2007-12-27 at 21:37 -0800, Linus Torvalds wrote:
>> On Fri, 28 Dec 2007, Benjamin Herrenschmidt wrote:
>>> I have embedded boards where proper CRS operation is critical since
>>> the kernel brings the PCIe link up itself, and thus is likely to hit
>>> devices still in the middle of CRS.
>>
>> .. but that's perfectly fine. A PCI-E bridge will certainly retry it
>> in hardware (or it isn't a PCI-E bridge!).
>
> Only a handful of times in many bridges I've seen.

Yes, retry implementations are not universal. But bridges with CRS-visibility might be even less common (both are optional/implementation-specific features). To make some choices for our own devices, we experimented at Myricom with CRS on a few chipsets to see what they implement:

broadcom/serverworks 2000/1000:	no-retry/no-visibility
nvidia CK804/IO804:		no-retry/no-visibility
nvidia MPC55/IO55:		no-retry/no-visibility
intel-7520:			does-retry/no-visibility
intel-975X:			does-retry/no-visibility
intel-5000series:		does-retry/no-visibility
intel X38:			does-retry/no-visibility
AMD/ATI RD790:			no-retry/no-visibility

[ In all no-retry cases, when a CRS is somehow returned to the pcie-root, the requester will get a 0xffffffff; depending on the chipset, other error bits (malformed-tlp, completion-timeout) might be triggered. ]

From the data reported on this mailing-list, it seems like the ATI chipset tries to implement CRS-visibility, but it is obviously buggy. Not knowing whether there is any chipset with the visibility feature but without the retry capability, and given that CRS is irrelevant for most Linux platforms (it only matters just after power-on, long before Linux is started in the common case), it seems fair that CRS-visibility-enabling should only be added in the specific code of the specific embedded platforms (with no BIOS/firmware of any kind) that might need it.

>> So I'm going to disable that thing. If there is some _other_ PCI-E
>> bridge that is simply buggy, and cannot handle the hw retry itself or
>> is just otherwise dodgy, we can have a white-list for cases where it
>> really needs to be done, but the current code is just bogus.
>
> If you disable it, then isn't there also a problem with PCIE->PCI-X
> bridges which will stop issuing CRS when they should? (not sure here,
> I may be a bit confused).

This is mostly independent: allowing a PCIE->PCI-X bridge to generate CRS is a different bit (bit 15 of the PCIe devctl register on the bridge). FWIW, Linux does not seem to touch it, and it defaults to zero, so it seems most current PCIE->PCI-X bridges will never generate a CRS (some BIOSes might set it, but not the couple of platforms I looked at). Again, the choice of setting here seems something better left to the specific BIOS/embedded-code for a given platform.

Loic
Re: [Patch v2] Make PCI extended config space (MMCONFIG) a driver opt-in
On 12/28/2007 11:38 AM, Ivan Kokshaysky wrote:
> On Fri, Dec 28, 2007 at 08:14:18AM -0800, Arjan van de Ven wrote:
>> it removes code by removing quirks / known not working stuff..

The only quirk I see removed is a bitmap with an arbitrary size (that we don't really know is sufficient for every system), and that is only built using a comparison between mmconf and type1 accesses. IMHO, there is zero knowledge in that removed code (no knowledge about specific chipsets that work or don't work, or misleading BIOSes).

> This not working stuff gets detected at probe time - see
> drivers/pci/probe.c:pci_cfg_space_size().

This indeed avoids most invalid mmconf attempts (for extended-conf-space probing that goes through pci_find_ext_capability() or from user-space). One could think of adding a cfg_size check in pci_read_config_*/pci_write_config_*, but IMHO that would be useless, since a direct read/write of a known register in the extended-config-space can only happen for a pci-express device, and there is ample evidence that such accesses always work (more exactly, MCFG can always be trusted for pcie devices).

One thing that could be changed in pci_cfg_space_size() is to avoid making a special case for PCI-X 266MHz/533MHz (assume cfg_size == 256 for such devices too, reserve extended cfg-space for pci-express devices). There are good reasons to think no such PCI-X 266MHz/533MHz device will ever have an extended-space (no capability IDs were ever defined in the PCI-X 2.0 spec, no new revision is planned). Such a check would avoid the possibility of trying an extended-conf-space access for PCI-X 2.0 devices behind an amd-8132 or similar (such accesses would just return -1, but some objections were raised about doing anything like that other than at initialization time, even if there are ample reasons to argue it would be harmless).

Loic
[PATCH] Fix x86 iounmap() calling ioremap_change_attr() with a size that's too big.
get_vm_area() allocates one extra guard page and stores the augmented size in area->size. But the ioremap/iounmap code on x86 relies on finding the original size in area->size (otherwise it changes the attributes of some random device in the linear map by using a wrong size in ioremap_change_attr()). The problem is avoided by not using an extra guard page for VM_IOREMAP allocations (if somebody can think of an easy way to store the original size across ioremap()/iounmap() calls, that could be a more elegant solution).

Signed-off-by: Loic Prylli <[EMAIL PROTECTED]>
---
 mm/vmalloc.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index af77e17..efd0093 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -196,9 +196,10 @@ static struct vm_struct *__get_vm_area_node(unsigned long size, unsigned long fl
 		return NULL;
 
 	/*
-	 * We always allocate a guard page.
+	 * We allocate a guard page (except for ioremap() which relies on area->size == size)
 	 */
-	size += PAGE_SIZE;
+	if (!(flags & VM_IOREMAP))
+		size += PAGE_SIZE;
 
 	write_lock(&vmlist_lock);
 	for (p = &vmlist; (tmp = *p) != NULL; p = &tmp->next) {
-- 
1.5.2.2
Re: [Patch v2] Make PCI extended config space (MMCONFIG) a driver opt-in
On 12/27/2007 1:58 PM, Linus Torvalds wrote:
> There was a thread called "PCI vendor id == 1 regression?" (Kai Ruhnau
> was the main reporter for that one). But looking back, it seems that
> one didn't hit the kernel mailing list, so I guess google cannot pick
> it up. I can forward all the emails if you want, but quite frankly,
> you don't really want to. It boils down to:
>
> Stephen Hemminger:
>   "There have been two reports with different hardware of the PCI
>    vendor id of 0001 showing up. I got a report on sky2, and Francois
>    saw a similar problem on r8169. In one case, it happened only with
>    the 2.6.23 kernel, the correct id was returned by older kernels.
>
>    This is a heads up, there may be a PCI problem. Or just
>    some users smoking strange fall leaves."
>
> And then one of the reporters:
>
>   "Good kernel:
>
>    02:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8056
>      PCI-E Gigabit Ethernet Controller (rev 12)
>    00: ab 11 64 43 07 00 10 00 12 00 00 02 01 00 00 00
>
>    Bad kernel:
>
>    02:00.0 Ethernet controller: Unknown device 0001:4364 (rev 12)
>    00: 01 00 64 43 07 00 10 00 12 00 00 02 01 00 00 00"

The root pcie port implementation is obviously buggy. But did we confirm whether that hardware bug might be partly related to the "configuration-retry-status" pcie-root handling as introduced/described in:
http://marc.info/?l=linux-kernel&m=110541914926842&w=2

Does the 0001 vendor-id still show up if pci_enable_crs() has never been called? Does anybody know what was the original rationale for calling pci_enable_crs() by default?

Loic
Re: [Fwd: Re: [PATCH 0/5]PCI: x86 MMCONFIG]
On 12/23/2007 3:55 PM, Matthew Wilcox wrote: > On Sun, Dec 23, 2007 at 03:16:24PM -0500, Loic Prylli wrote: > >> I just realized one thing: the bar sizing code in pci_read_bases() (that >> writes 0x in the bars) does not seem to disable the >> PCI_COMMAND_MEM/PCI_COMMAND_IO bits in the cmd register before >> manipulating the BARs. And it seems nobody else ensures they are >> disabled at this point either (or am I missing something?). >> > > Right, we don't. Ivan and Greg are convinced that doing so can break > some machines. > It might indeed be scary to suddenly change the command bits on some old "Host bridges" and maybe other classes of bridges. I was too optimistic about the triviality of the fix. Still an obvious improvement to the pci_read_bases() code would be to not try sizing non-existings BARs (that way if a device has no memory bars, we never need to touch the mem-enable bit). In addition, if you exclude Host bridges and other bridges from temporary mem-disabling while sizing their BARs, then doing temp disabling for the remaining devices looks less scary than the alternative (messing live with the bars decoding range especially for regular devices integrated to the main chipset). It seems to me worth making that init code safer independantly of the mmconfig issues. > Now that contradicts some information we've been told before; can you > post those traces? That would argue in favour of disabling memspace > when configuring BARs. > Here is a trace where COMMAND/STATUS and 0x10 and 0x14 registers accesses were captured. Comments in brackets are my interpretation of what's happening. That was with windows 2003 64bit. 
Loic

[ reset at 607 seconds ]
[BIOS: read status, COMMAND = 0 ]
626.442718:cfg-read:0x6.w=0x10
626.442805:cfg-write:0x4.w=0x0
[BIOS: sizing BAR0]
626.448325:cfg-read:0x10.l=0x8
626.448365:cfg-write:0x10.l=0xfffffffe
626.448439:cfg-read:0x10.l=0xff000008
626.448493:cfg-write:0x10.l=0x8
[BIOS: sizing non-existent BAR1]
626.448552:cfg-read:0x14.l=0x0
626.448591:cfg-write:0x14.l=0xfffffffe
626.448647:cfg-read:0x14.l=0x0
626.448687:cfg-write:0x14.l=0x0
[BIOS: sizing BAR0]
626.449596:cfg-read:0x10.l=0x8
626.449635:cfg-write:0x10.l=0xfffffffe
626.449710:cfg-read:0x10.l=0xff000008
626.449764:cfg-write:0x10.l=0x8
[BIOS: sizing non-existent BAR1]
626.449832:cfg-read:0x14.l=0x0
626.449872:cfg-write:0x14.l=0xfffffffe
626.449928:cfg-read:0x14.l=0x0
626.449967:cfg-write:0x14.l=0x0
[BIOS: sizing BAR0]
626.451822:cfg-read:0x10.l=0x8
626.451862:cfg-write:0x10.l=0xfffffffe
626.451936:cfg-read:0x10.l=0xff000008
626.451990:cfg-write:0x10.l=0x8
[BIOS: sizing non-existent BAR1]
626.452058:cfg-read:0x14.l=0x0
626.452098:cfg-write:0x14.l=0xfffffffe
626.452154:cfg-read:0x14.l=0x0
626.452194:cfg-write:0x14.l=0x0
[BIOS: sizing and assigning BAR0]
626.454899:cfg-read:0x10.l=0x8
626.454939:cfg-write:0x10.l=0xfffffffe
626.455013:cfg-read:0x10.l=0xff000008
626.455068:cfg-write:0x10.l=0x8
626.455138:cfg-write:0x10.l=0xfa000000
[BIOS: sizing non-existent BAR1]
626.455211:cfg-read:0x14.l=0x0
626.455251:cfg-write:0x14.l=0xfffffffe
626.455307:cfg-read:0x14.l=0x0
626.455346:cfg-write:0x14.l=0x0
[BIOS: sizing and restoring BAR0]
628.024774:cfg-read:0x10.l=0xfa000008
628.024831:cfg-write:0x10.l=0xfffffffe
628.024908:cfg-read:0x10.l=0xff000008
628.024964:cfg-write:0x10.l=0xfa000008
[BIOS: sizing non-existent BAR1]
628.025049:cfg-read:0x14.l=0x0
628.025091:cfg-write:0x14.l=0xfffffffe
628.025149:cfg-read:0x14.l=0x0
628.025190:cfg-write:0x14.l=0x0
[BIOS: clear status ]
628.025714:cfg-write:0x6.w=0xffff
[BIOS: enable SERR/IO+MEMORY+MASTER+SPECIAL+INVALIDATE ]
628.025788:cfg-write:0x4.w=0x11f
[BIOS: enable MEMORY+MASTER (after seeing IO/special/invalidate were read-only 0) ]
629.028114:cfg-read:0x4.b=0x6
629.028160:cfg-write:0x4.b=0x6
[BIOS: read status, clear PARITY/SERR status bits ]
629.028490:cfg-read:0x7.b=0x0
629.028536:cfg-write:0x7.b=0xc1
[BIOS: enable PARITY detection ]
629.028818:cfg-read:0x4.b=0x6
629.028863:cfg-write:0x4.b=0x46
[BIOS: enable SERR (already enabled anyway) ]
629.028920:cfg-read:0x5.b=0x1
629.028966:cfg-write:0x5.b=0x1
629.032338:cfg-read:0x6.w=0x10
[ WINDOWS starting ]
661.965932:cfg-read:0x4.l=0x100146
661.965986:cfg-read:0x10.l=0xfa000008
661.966040:cfg-read:0x14.l=0x0
662.112971:cfg-read:0x4.l=0x100146
662.113024:cfg-read:0x10.l=0xfa000008
662.113078:cfg-read:0x14.l=0x0
662.155903:cfg-read:0x4.l=0x100146
662.156066:cfg-read:0x4.l=0x100146
662.156120:cfg-read:0x10.l=0xfa000008
662.156174:cfg-read:0x14.l=0x0
662.156337:cfg-read:0x4.l=0x100146
662.156391:cfg-read:0x10.l=0xfa000008
662.156445:cfg-read:0x14.l=0x0
[ WINDOWS: disable MEMORY + MASTER ]
662.156519:cfg-write:0x4.l=0x140
[ WINDOWS: size BAR0 and BAR1]
662.156588:cfg-write:0x10.l=0xffffffff
662.156662:cfg-write:0x14.l=0xffffffff
662.156755:cfg-read:0x4.l=0x100140
662.156809:cfg-read:0x10.l=0xff000008
662.156863:cfg-read:0x14.l=0x0
[ WINDOWS: disable MEMORY + MASTER ]
662.156937
Re: [Fwd: Re: [PATCH 0/5]PCI: x86 MMCONFIG]
On 12/20/2007 1:16 PM, Matthew Wilcox wrote:
> Oh, that's the same bug others (including me) have been complaining
> about.
>
> http://marc.info/?l=linux-kernel&m=118809338631160&w=2
>
>> It hangs in exactly the same place every time.
>>
>> I am surmising that the write to that BAR is causing a MCE.
>
> Bad deduction. What's happening is that the write to the BAR is causing
> it to overlap the decode for mmconfig space. So the mmconfig write to
> set the BAR back never gets through.
>
> I have a different idea to fix this problem. Instead of writing
> 0xffffffff, we could look for an unused bit of space in the E820 map and
> write, say, 0xdfffffff to the low 32 bits of a BAR. Then it wouldn't
> overlap, and we could find its size using MMCONFIG.
>
> Does anyone know how Windows handles these machines?

I just realized one thing: the BAR sizing code in pci_read_bases() (that writes 0xffffffff in the BARs) does not seem to disable the PCI_COMMAND_MEM/PCI_COMMAND_IO bits in the cmd register before manipulating the BARs. And it seems nobody else ensures they are disabled at this point either (or am I missing something?).

Touching the BARs while they are enabled would be buggy behaviour on our part, and something trivial to fix. And it might well fix that particular problem (it's fair play for the machine to crash if we create a decoding conflict; simply disabling the cmd bits in pci_read_bases() should remove that conflict).

FWIW, to partially answer your last question, Windows does disable mem-space and/or IO-space when sizing the BARs of a device (I have some traces of configuration-space accesses taken on a Windows machine for one of the PCI busses).

Loic
Re: [patch] Make MMCONFIG space (extended PCI config space) a driver opt-in issue
On 12/23/2007 12:44 AM, Jeff Garzik wrote:
>> By possibly using different implementations for the two ranges you avoid
>> introducing a new API, you avoid taking risks with mmconf when you don't
>> need it. That doesn't preclude using mmconf for everything either if the
>> user requests it or based on enough knowledge of the system to be sure
>> nothing will break.
>
> Are you prepared to guarantee that freely mixing mmconfig and type1
> config accesses at runtime will always work, on all chipsets? I'm
> talking about silicon here, not kernel software.

Your question is whether I expect a mix of type1 (for legacy conf space) and mmconfig (for reg >= 256) to trigger some bug in silicon causing flakiness (why not possible random corruption) that would not happen when only using type1 or mmconfig. Obviously, I cannot prove or guarantee the absence of subtle silicon bugs. But I can say:

- The documentation says mixing them is OK; the pci-express docs insist on paying attention to ordering when mixing (which is easy on x86), acknowledging one is able to mix them.
- All current kernels have been using a mix of "type1" and "mmconf" accesses to the same device, even if that's only used during mmconfig initialization.
- On AMD platforms with the typical combination of mmconf-aware and non-mmconf-aware chipsets, you always have a mix of "type1" and "mmconf" accesses all the time with standard kernels (not on the same bus).
- Mixing such accesses on the same device at runtime has happened on thousands of systems for Myricom device/software users.

Are you prepared to guarantee no processor will ever have a weird silicon problem that's triggered by any Linux change?
Mixing mmconf/type-1 is an approach that any mmconf-able hardware is supposed to support. It has been tried on at least most server/workstation chipsets typically used in the last two years, so the burden of proof is on actually finding some silicon that cannot do that properly before rejecting the approach based on hardware fears (otherwise you are on a slippery slope).

> Furthermore, is it best for our users to find problems with mixed
> config accesses -- not at boot time, not at driver load time, but at
> some random time when some random driver does its first extended
> config space access? IMO, no. If you are going to fail, failing in a
> predictable, visible way is best.

A failure happening during driver initialization or some very specific identifiable driver event (the first ext-conf-space access is not something popping up randomly in the life of a driver) is predictable and visible. I am not sure what kind of problem you are afraid of anyway (ext-conf-space not available?), so it's hard to talk generally. Needless to say, I never said not to test mmconf on all pci-express devices advertised in MCFG during initialization time (preferably after all PCI-memory-space initialization is done, since there is uncertainty about whether mmconf might conflict with that in some corner cases).

> Failures are more predictable and more consistent with an all-or-none
> scenario. The all-or-none solutions are the least complex on the
> software side, and far more widely tested than any mixed config access
> scheme.

Disabling ext-conf-space access altogether by default (that's the default you propose, right?) is definitely the safest approach. But it's the least useful too. Caricaturing, I am the one advocating the real all-or-none scenario: always use type1 for legacy conf-space (never mmconf).
You are saying that for legacy-conf-space access, some people should use mmconf and other people should use type1 based on their kernel configuration or command line, and that the choice depends on something unrelated (whether they feel they might need extended-conf-space access at some point, be it even for lspci/setpci).

Loic
Re: [patch] Make MMCONFIG space (extended PCI config space) a driver opt-in issue
On 12/22/2007 11:52 PM, Jeff Garzik wrote:
> Absolutely.
>
> But regardless of problems, enabling should be done globally, not per
> device...

The "enabling globally" requirement, i.e. neither per-device nor depending on reg >= 256, seems a very debatable assumption (IMHO a big mistake) that seems to be responsible for many of the problems seen in the past. There might be, for a very long time, AMD architectures where extended-conf-space access works and is beneficial for pci-express devices (and where mmconf is not supported by the hardware on non-pci-express devices). You are basically saying you don't ever want to support extended-conf-space globally for those systems, where it would be so easy to use mmconf *only* when attempting *extended-conf-space* accesses (reg >= 256) to some device (that provides a strong guarantee that you will never break anything unless somebody actually tries to use the extended-conf-space).

Supporting extended-conf-space is independent of the issue of using mmconf for legacy conf-space. There is no real reason to use the same method to access both. I have seen several arguments implying that there is, and they all seem really bogus to me. Not only are the two ranges (< 256 and >= 256) structurally independent (you have totally independent capability lists that are independently organized in each of them), but even if they were not, there is no consistency issue that cannot be dealt with using a memory barrier, and the concern about taking an extra branch for each pci-conf-space access is also bogus once you look at the numbers.

By possibly using different implementations for the two ranges you avoid introducing a new API, and you avoid taking risks with mmconf when you don't need it. That doesn't preclude using mmconf for everything either, if the user requests it or based on enough knowledge of the system to be sure nothing will break.
Loic
Re: [patch] Make MMCONFIG space (extended PCI config space) a driver opt-in issue
On 12/22/2007 11:13 PM, Jeff Garzik wrote:
> The facts as they exist today:
>
> 1) Existing 256-byte config space devices have been known to put
> capabilities in the high end (>= 0xc8) of config space.
>
> 2) It is legal for PCI-Express to put capabilities anywhere in PCI
> config space, including extended config space. (I hope our PCI cap
> walking code checks for overruns...)

You make it sound almost as if the capability list that starts in regular conf-space could cross into extended conf-space (>= 256). That's not true: the capability lists in the regular conf-space and the extended conf-space are really separate. They use a different structure for linking (a different number of bits to define the capability IDs), a different starting point, and different capability-ID definition tables. The regular conf-space and the extended conf-space are really independent.

> 3) Most new machines ship with PCI-Express devices with extended
> config space.
>
> Therefore it is provably /possible/, and is indeed logical to conclude
> that capabilities in extended config space will follow the same
> pattern that existing hw designers have been following... but only
> once the current OS's have stable extended-config-space support.
>
> Maybe that day will never come, but it is nonetheless quite possible
> within today's PCI Express spec for this to happen.

I agree with that statement. In fact it is already quite useful today. I am doing a lot of support activities where extended-conf-space is a must for troubleshooting. It was important enough for us that we have user tools that allow us to access mmconfig-space for pci-express even on systems that don't advertise an MCFG attribute (as long as the chipset supports it; we have reverse-engineered the location of the "mmconfig bar" for a few chipsets including nvidia chipsets, for Intel it is well documented, and there are a couple of others).
Loic
Re: [Fwd: Re: [PATCH 0/5]PCI: x86 MMCONFIG]
On 12/20/2007 9:15 PM, Robert Hancock wrote:
>> Suggested Workaround
>>
>> It is strongly recommended that system designers do not connect the
>> AMD-8132 and devices that use extended configuration space MMIO BARs
>> (ex: HyperTransport-to-PCI Express® bridges) to the same processor
>> HyperTransport link.
>>
>> Fix Planned
>> No
>
> That does sound fairly definitive. I have to wonder why certain system
> designers then didn't follow their strong recommendation..

Just curious, do you know of any system where that recommendation was not followed? On all motherboards where I have seen an AMD-8131 or an AMD-8132, they were alone on their HyperTransport link, and the other "northbridges" (more precisely, HyperTransport-to-pci-express or pci-whatever bridges, often nvidia) with an "MMCONFIG BAR" were on one of the other available HyperTransport links in the system.

Loic
Re: [Fwd: Re: [PATCH 0/5]PCI: x86 MMCONFIG]
On 12/20/2007 6:21 PM, Tony Camuso wrote:
> And the MMCONFIG problem with enterprise systems and workstations, where
> we do control the BIOS (for the most part), is due to known bugs in
> certain versions of certain chipsets, HT1000 and AMD8132 among them, not
> the BIOS.

The lack of MMCONFIG support is indeed because some HyperTransport chipsets lack that support. But there are some BIOSes out there that advertise support for all busses in their MCFG ACPI attribute (even the busses managed by some amd8131 in a mixed nvidia-ck804/amd8131 motherboard), and the BIOS seems at least faulty for advertising a capability that does not exist.

Loic
Re: [Fwd: Re: [PATCH 0/5]PCI: x86 MMCONFIG]
On 12/20/2007 4:00 PM, Matthew Wilcox wrote:
> On Thu, Dec 20, 2007 at 03:56:29PM -0500, Loic Prylli wrote:
>> I know the final device is not aware of how the config request was
>> originated. I am just saying platforms built around the Intel 82801
>> chipset (ICH2) don't support mmconfig at all. I would also not be
>> surprised if the platforms where tg3 needs TG3_FLAG_PCIX_TARGET_HWBUG
>> might also not support mmconfig (but for this second case, it's only
>> speculation based on seeing a couple of posts about
>> TG3_FLAG_PCIX_TARGET_HWBUG where AMD HyperTransport/PCI-X chipsets were
>> mentioned). If you know of a platform that supports mmconfig, and where
>> the tg3 does need to use pci-conf-space relatively intensively, I'll be
>> happy to be corrected.
>
> tg3 is available as an add-in pci card. i have one. i can plug it into
> a machine that does support mmconfig.

That doesn't tell for sure your NIC has the specific rev that would cause the hwbug workaround to be used. But let's assume so; your combination would still work correctly (maybe with slightly non-optimal network performance). A non-default mmconf=always option could help maximize performance again.

My starting point was that more systems could be supported out of the box (vs. not working at all) by using "type 1" more widely. This would not break anybody. And it would not affect performance except on what I initially called "obscure hardware or systems". I already acknowledged in the previous email the possibility of a performance impact, but it is still not clear to me whether it would be widespread. The example you mention is a buggy revision of a chip that was designed to use memory-mapped IO, and has to use config-space instead because the memory-mapped IO implementation is buggy (in most cases only when used in combination with certain chipsets). How much should a modest performance impact there influence the decision-making?
Usually it's better to trade some performance for stability than the reverse, and you lose the performance anyway when so many people start using pci=nommconf by default on all their installs. Loic
Re: [Fwd: Re: [PATCH 0/5]PCI: x86 MMCONFIG]
On 12/20/2007 3:15 PM, Matthew Wilcox wrote: > On Thu, Dec 20, 2007 at 03:05:52PM -0500, Loic Prylli wrote: > >> I am not familiar with the tg3 driver; just from a 5-minute look, >> it seems the typical cases where the pci-conf-space is used >> intensively are with some rev in combination with the 82801 >> (TG3_FLG2_ICH_WORKAROUND) which I don't think supports mmconfig anyway, >> as well as some very specific PCI-X combinations >> (TG3_FLAG_PCIX_TARGET_HWBUG) which are also very unlikely to support >> mmconfig. >> > > It's not a question of whether the card supports mmconfig or not -- the > card can't tell whether a first-256-byte pci config transaction was initiated > through mmconfig, type1, type2, or even a bios call. > I know the final device is not aware of how the config request was originated. I am just saying platforms built around the Intel 82801 chipset (ICH2) don't support mmconfig at all. I would also not be surprised if the platforms where tg3 needs TG3_FLAG_PCIX_TARGET_HWBUG might also not support mmconfig (but for this second case, it's only speculation based on seeing a couple posts about TG3_FLAG_PCIX_TARGET_HWBUG where amd hypertransport/PCI-X chipsets were mentioned). If you know of a platform that supports mmconfig, and where the tg3 does need to use pci-conf-space relatively intensively, I'll be happy to be corrected. > I'm just hacking together an implementation based on Ivan's suggestion > to always use type 1 for the first 64 bytes. > > Whatever you think is best. If you are only trying to address a BAR-sizing/mmconfig temporary conflict, and think the "MCFG BIOS" or other mmconfig problems are under control, that makes sense. If the need arises for somebody, an option to select between 64 and 256 can be added later. 
Loic
Re: [Fwd: Re: [PATCH 0/5]PCI: x86 MMCONFIG]
On 12/20/2007 2:08 PM, Matthew Wilcox wrote: > On Thu, Dec 20, 2007 at 02:04:31PM -0500, Tony Camuso wrote: > >> Also, this solution also would allow us to remove the unreachable_devices() >> routine and bitmap. >> > > Not really ... we probe reading address 0x100 to see if the device > supports extended config space or not. So we need to make that fail > gracefully for the amd7111 case. > pci_cfg_space_size() is only done for PCI-express or PCI-X mode 2 devices, so you still have eliminated the bulk of the problems, which are typically handling the legacy busses on modern machines. I don't know what the amd7111 is. > >> Does anybody see a down side to this? >> > > It'll be slower than it would be if we used mmconfig directly. Now yes, > nobody should be using pci config space in performance critical paths > ... but see the tg3 driver. > I am not familiar with the tg3 driver; just from a 5-minute look, it seems the typical cases where the pci-conf-space is used intensively are with some rev in combination with the 82801 (TG3_FLG2_ICH_WORKAROUND) which I don't think supports mmconfig anyway, as well as some very specific PCI-X combinations (TG3_FLAG_PCIX_TARGET_HWBUG) which are also very unlikely to support mmconfig. Even if I am wrong for the tg3, I don't really think mmconfig vs type1 could make a noticeable performance difference on any common systems (obscure systems or hardware where it could potentially have a performance impact could use a non-default configuration). Loic
Re: [Fwd: Re: [PATCH 0/5]PCI: x86 MMCONFIG]
On 12/20/2007 1:16 PM, Matthew Wilcox wrote: > > Bad deduction. What's happening is that the write to the BAR is causing > it to overlap the decode for mmconfig space. So the mmconfig write to > set the BAR back never gets through. > > I have a different idea to fix this problem. Instead of writing > 0x, we could look for an unused bit of space in the E820 map and > write, say, 0xdfff to the low 32-bits of a BAR. Then it wouldn't > overlap, and we could find its size using MMCONFIG. > > Does anyone know how Windows handles these machines? Obviously, if it's > using MMCONFIG, it'd have the same problems. Does it just use type 1 > for initial sizing? Or does it use type 1 for all accesses below 256 > bytes? > Always using type 1 for accesses below 256 bytes looks like a very attractive solution. I know we had a lot of older kernels over the last two years that we patched like that (we needed MMCONFIG for our own device development purposes, but we also needed our machines to boot and discover all devices reliably). Recent kernels work fine out of the box on all hardware we have, but all this sometimes tricky and apparently endless MMCONFIG work (in big part because of buggy BIOSes) would probably become relatively easy by limiting the aim to having MMCONFIG work when it is required (for cfg-space accesses >= 256). Loic
Re: [Fwd: Re: [PATCH 0/5]PCI: x86 MMCONFIG]
On 12/20/2007 9:15 PM, Robert Hancock wrote: > > Suggested Workaround: It is strongly recommended that system designers do not > > connect the AMD-8132 and devices that use extended configuration space MMIO > > BARs (ex: HyperTransport-to-PCI Express® bridges) to the same processor > > HyperTransport link. Fix Planned: No > That does sound fairly definitive. I have to wonder why certain system > designers then didn't follow their strong recommendation.. Just curious, do you know of any system where that recommendation was not followed? On all motherboards where I have seen an AMD-8131 or an AMD-8132, they were alone on their hypertransport link, and other northbridges (more precisely hypertransport-to-pci-express or pci-whatever, often nvidia) with a MMCONFIG BAR were on one of the other available hypertransport links in the system. Loic
Re: MSI problem since 2.6.21 for devices not providing a mask in their MSI capability
On 10/3/2007 11:58 PM, Eric W. Biederman wrote: > Right. And INTx has such a pending bit as well. I guess I figured > if MSI was enabled transferring it over would be the obvious thing to > do. > The INTx pending and disable bits were only added starting with PCI 2.3, so in PCI-2.2 and PCI-X 1.0{,a,b} those bits don't exist at all, and there is still a significant number of such devices still in use or on the market. I agree it would look natural (and that probably happens for a lot of or most devices) to transfer the interrupt state from INTx to MSI, but I don't think you can rely on it without making some assumptions about device interrupt management that are outside the scope of the spec. > Of course it might even make sense simply to refuse to enable MSI > if there is not a masking capability present. > The possibility of masking for MSI was only specified (and then only as something optional) starting with PCI-3.0; PCI Express 1.0 and 1.0a are based on the older PCI-2.3 and corresponding devices are very unlikely to have it. So there might still be a majority of devices in the field with no MSI masking capability in the different PCI categories: conventional-PCI, PCI-X, PCI-Express. > >>> The PCI spec requires disabling/masking the msi when reprogramming it. >>> So as a general rule we can not do better. >>> >>> >> Do you have a reference for that requirement? The spec only vaguely >> associates MSI programming with "configuration", but I haven't found any >> explicit indication that it should not work. >> > I would have to look it up again but it said that the result is only > defined in the case when it is disabled/masked, when I looked a couple > of months ago. > > I found this quote in PCI-3.0/6.8.3.5: "For MSI-X, a function is permitted to cache Address and Data values from unmasked MSI-X Table entries. 
However, anytime software unmasks a currently masked MSI-X Table entry either by clearing its Mask bit or by clearing the Function Mask bit, the function must update any Address or Data values that it cached from that entry. If software changes the Address or Data value of an entry while the entry is unmasked, the result is undefined." I haven't seen a caching possibility mentioned for the MSI case, so apart from the problem with multi-word changes, maybe changing the Address or Data can be done at any time for MSI. >> I don't see how you can disable MSI through the control bit (which is >> equivalent to switching the device to INTx whether or not the INTx >> disable bit is set in PCI_COMMAND) in the middle of operations, not tell >> the driver, and not risk losing interrupts (unless you rely on much >> more than the spec). >> > > I will relook. My impression is that bit is defined as MSI enable. > Not mode switch. Although myrinet has clearly implemented it as > mode switch. > It is indeed defined as MSI-enable, but that's not a contradiction with being equivalent to a "mode switch between INTx and MSI" (ignoring MSI-X in that context). The spec seems to define the following "modes":
MSI-enable = 1, INTx-disable = x : MSI mode
MSI-enable = 0, INTx-disable = 0 : INTx mode with INTx-signal == INTx-pending
MSI-enable = 0, INTx-disable = 1 : INTx "polling/diag" mode using INTx pending bit
The only specificity of Myrinet is having relatively independent logic for the two modes, while at the same time requiring any pending INTx to be acked before starting any kind of new interrupt. > Interesting. So an irq fires before the driver has finished > processing the last instance of the irq. This is very close to a > screaming irq and something we may actually want to deal with. 
> > In our case it is true that the device can fire a bounded number of MSIs without acks (but not an infinite number: there are a limited number of interrupt tokens, and furthermore the interrupt rate is limited with a configurable minimum time between interrupts which defaults to ~10us). I suspect a race with other interrupts was involved, because otherwise that minimum inter-interrupt delay would prevent entering that code path. I think even a more casual interrupt scheme (with an explicit ack required for each interrupt before generating a new one) can also exercise that code path, since between the return from the handler and the clearing of IRQ_PROGRESS, there is an opportunity for the next interrupt to happen. To detect a crazy device generating storms of edge interrupts, I guess note_interrupt() could be called during this "reentrant detection" if masking was made conditional. Loic P.S.: just a little more context: in all Myrinet hardware, enough of the interrupt functionality is implemented in firmware that we can avoid losing interrupts whenever MSI-enable is toggled, and we already started distributing a firmware-based software update for users running linux >= 2.6.21 and using MSI. So for Myrinet the problem is more or less already closed. The only motivation for starting the thread was that it seemed a possibility that other
Re: MSI problem since 2.6.21 for devices not providing a mask in their MSI capability
On 10/3/2007 5:49 PM, Eric W. Biederman wrote: > Loic Prylli <[EMAIL PROTECTED]> writes: > > >> Hi, >> >> We observe a problem with MSI since kernel 2.6.21 where interrupts would >> randomly stop working. We have tracked it down to the new >> msi_set_mask_bit definition in 2.6.21. In the MSI case with a device not >> providing a "native" MSI mask, it was a no-op before, and now it >> disables MSI in the MSI-ctl register, which according to the PCI spec is >> interpreted as reverting the device to legacy interrupts. If such a >> device tries to generate a new interrupt during the "masked" window, the >> device will try a legacy interrupt which is generally >> ignored/never-acked, causing interrupts to no longer work for the >> device/driver combination (even after the enable bit is restored). >> > > We should also be leaving the INTx irqs disabled. So no irq > should be generated. > Even if the INTx line is not raised, you cannot rely on the device to retain memory of an interrupt triggered while MSI is disabled, and expect it to fire it in MSI form later when MSI is reenabled. The PCI spec does not provide any implicit or explicit guarantee about the MSI enable flag that would allow it to be used for temporary masking without running the risk of losing such interrupts. Moreover, even if you eventually call the interrupt handler to recover a lost interrupt, having switched the device to INTx mode (whether or not the INTx line was forced down with the corresponding pci-command bit) without informing the driver can (and will in our case) break interrupt handshaking, because MSI and INTx interrupts are not acked in the same way (INTx requires an extra step that we don't do for MSI and that the device will still expect unless going through driver init again). > If you have a mask bit implemented you are required to be > able to refire it after the msi is enabled. Indeed the masking case is well-defined by the spec (including the operation of the pending bits). 
And my subject was definitely restricted to devices without that masking capability. > I don't recall > the requirements for when both intx and msi irqs are both > disabled. Intuitively I would expect no irq message to > be generated, and at most the card would need to be polled > manually to recognize a device event happened. > > Certainly firing an irq and having it get completely lost is > unfortunate, and a major pain if you are trying to use the > card. > > As for the previous no-op behavior that was a bug. > OK, the no-op was a bug, but using the enable-bit for temporary masking purposes still feels like a bug too. I am afraid the only safe solution might be to prohibit any operation that absolutely requires masking if real masking is not available. Maybe the set_affinity method should simply be disabled for devices not supporting masking (unless there is an option of doing it without masking, for instance by guaranteeing only one word of the MSI capability is changed). > >> Is there anything apart from irq migration that strongly requires >> masking? Is it possible to do the irq migration without masking? >> > > enable_irq/disable_irq. Although we can get away with a software > emulation there and those are only needed if the driver calls them. > I don't think there is a problem here; no sane driver would depend on receiving edge interrupts triggered while irqs were explicitly disabled. > The PCI spec requires disabling/masking the msi when reprogramming it. > So as a general rule we can not do better. Do you have a reference for that requirement? The spec only vaguely associates MSI programming with "configuration", but I haven't found any explicit indication that it should not work. > Further because we are > writing to multiple pci config registers the only way we can safely > reprogram the message is with the msi disabled/masked on the card in > some fashion. > That's indeed a show-stopper. 
> I suspect what needs to happen is a spec search to verify that the > current linux behavior is at least reasonable within the spec. > I don't see how you can disable MSI through the control bit (which is equivalent to switching the device to INTx whether or not the INTx disable bit is set in PCI_COMMAND) in the middle of operations, not tell the driver, and not risk losing interrupts (unless you rely on much more than the spec). > I don't want to break anyone's hardware, but at the same time I want us > to be careful and in spec for the default case. > > The interrupt while doing set_affinity masking would certainly cause a problem for the device we use (MSI-enable switch between INTx and MSI mode, and both interru
MSI problem since 2.6.21 for devices not providing a mask in their MSI capability
Hi, We observe a problem with MSI since kernel 2.6.21 where interrupts would randomly stop working. We have tracked it down to the new msi_set_mask_bit definition in 2.6.21. In the MSI case with a device not providing a "native" MSI mask, it was a no-op before, and now it disables MSI in the MSI-ctl register, which according to the PCI spec is interpreted as reverting the device to legacy interrupts. If such a device tries to generate a new interrupt during the "masked" window, the device will try a legacy interrupt which is generally ignored/never-acked, causing interrupts to no longer work for the device/driver combination (even after the enable bit is restored). Is there anything apart from irq migration that strongly requires masking? Is it possible to do the irq migration without masking? Loic
Re: MSI problem since 2.6.21 for devices not providing a mask in their MSI capability
On 10/3/2007 5:49 PM, Eric W. Biederman wrote:
> Loic Prylli <[EMAIL PROTECTED]> writes:
>
>> Hi, We observe a problem with MSI since kernel 2.6.21 where interrupts
>> would randomly stop working. We have tracked it down to the new
>> msi_set_mask_bit definition in 2.6.21. In the MSI case with a device
>> not providing a native MSI mask, it was a no-op before, and now it
>> disables MSI in the MSI-ctl register which according to the PCI spec
>> is interpreted as reverting the device to legacy interrupts. If such
>> a device tries to generate a new interrupt during the masked window,
>> the device will try a legacy interrupt which is generally
>> ignored/never acked and causes interrupts to no longer work for the
>> device/driver combination (even after the enable bit is restored).
>
> We should also be leaving the INTx irqs disabled. So no irq should be
> generated.

Even if the INTx line is not raised, you cannot rely on the device to retain memory of an interrupt triggered while MSI is disabled, and expect it to fire it in MSI form later when MSI is re-enabled. The PCI spec does not provide any implicit or explicit guarantee about the MSI enable flag that would allow it to be used for temporary masking without running the risk of losing such interrupts.

Moreover, even if you eventually call the interrupt handler to recover a lost interrupt, having switched the device to INTx mode (whether or not the INTx line was forced down with the corresponding pci-command bit) without informing the driver can (and will in our case) break interrupt handshaking, because MSI and INTx interrupts are not acked in the same way (INTx requires an extra step that we don't do for MSI and that the device will still expect unless going through driver init again).

> If you have a mask bit implemented you are required to be able to
> refire it after the msi is enabled.

Indeed, the masking case is well-defined by the spec (including the operation of the pending bits). And my subject was definitely restricted to devices without that masking capability.

> I don't recall the requirements for when both intx and msi irqs are
> both disabled. Intuitively I would expect no irq message to be
> generated, and at most the card would need to be polled manually to
> recognize a device event happened. Certainly firing an irq and having
> it get completely lost is unfortunate, and a major pain if you are
> trying to use the card. As for the previous no-op behavior that was a
> bug.

OK, the no-op was a bug, but using the enable bit for temporary masking purposes still feels like a bug. I am afraid the only safe solution might be to prohibit any operation that absolutely requires masking if real masking is not available. Maybe the set_affinity method should simply be disabled for devices not supporting masking (unless there is an option of doing it without masking, for instance by guaranteeing only one word of the MSI capability is changed).

>> Is there anything apart from irq migration that strongly requires
>> masking? Is it possible to do the irq migration without masking?
>
> enable_irq/disable_irq. Although we can get away with a software
> emulation there and those are only needed if the driver calls them.

I don't think there is a problem here, no sane driver would depend on receiving edge interrupts triggered while irqs were explicitly disabled.

> The PCI spec requires disabling/masking the msi when reprogramming it.
> So as a general rule we can not do better.

Do you have a reference for that requirement? The spec only vaguely associates MSI programming with configuration, but I haven't found any explicit indication that it should not work.

> Further because we are writing to multiple pci config registers the
> only way we can safely reprogram the message is with the msi
> disabled/masked on the card in some fashion.

That's indeed a show-stopper.

> I suspect what needs to happen is a spec search to verify that the
> current linux behavior is at least reasonable within the spec.

I don't see how you can disable MSI through the control bit (which is equivalent to switching the device to INTx, whether or not the INTx disable bit is set in PCI_COMMAND) in the middle of operations, not tell the driver, and not risk losing interrupts (unless you rely on much more than the spec).

> I don't want to break anyones hardware, but at the same time I want us
> to be careful and in spec for the default case.

The interrupt arriving while set_affinity does its masking would certainly cause a problem for the device we use (MSI-enable switches between INTx and MSI mode, and the two kinds of interrupts are not acked the same way, assuming they would even be delivered to the driver), but I got some new data: upon further examination, the lost interrupts we have seen seem in fact to be caused at a different time:
- the problem is the mask_ack_irq() done in handle_edge_irq() when a new interrupt arrives before the IRQ_PROGRESS bit is cleared
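The mask_ack_irq() scenario in that last point can be modeled as a tiny state machine. This is a sketch of the generic edge-flow handling with invented names, not the actual kernel code; for a mask-less MSI device the "mask" below is really the MSI enable bit, i.e. a switch back to INTx.

```c
#include <stdbool.h>

/* Simplified per-irq state, mirroring IRQ_INPROGRESS / IRQ_PENDING /
 * masked state in an edge flow handler. */
struct edge_irq {
    bool inprogress;
    bool pending;
    bool masked;
    int  handled;   /* number of handler invocations */
};

/* A CPU takes the interrupt.  Returns false if servicing is already in
 * progress, in which case the new edge is latched and the vector masked
 * (the mask_ack_irq() path the mail blames for lost interrupts). */
static bool edge_irq_enter(struct edge_irq *irq)
{
    if (irq->inprogress) {
        irq->pending = true;
        irq->masked  = true;
        return false;
    }
    irq->inprogress = true;
    irq->handled++;            /* run the driver handler */
    return true;
}

/* Servicing finishes: unmask and replay any latched edge. */
static void edge_irq_exit(struct edge_irq *irq)
{
    while (irq->pending) {
        irq->pending = false;
        irq->masked  = false;  /* unmask...                 */
        irq->handled++;        /* ...and rerun the handler  */
    }
    irq->inprogress = false;
}
```

In the model the replay makes the second edge safe; with a mask-less MSI device the "mask" step itself can lose the edge, which is the failure mode reported.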
Re: PAT support for i386 and x86_64
On 8/7/2007 4:30 AM, Andi Kleen wrote:
> On Mon, Aug 06, 2007 at 10:03:15PM -0400, Cédric Augonnet wrote:
>
>> Hi all,
>>
>> For quite a while now, there as been numerous attempt to introduce
>> support for Page Attribute Table (PAT) in the Linux kernel, whereas
>> most other OS already have some support for this feature. Such a
>> proposition popping up periodically, perhaps it would be an
>> opportunity to fix that lack for once.
>
> The trouble is you need to avoid conflicting attributes, otherwise
> you risk cache corruption. This means the direct mapping needs to be
> fixed up and the kernel needs to keep track of the ranges to prevent
> conflicts.

I don't see why we have to worry about cache corruption in the case at hand. Write-combining is needed to map io (typically pci-mem regions) which are never mapped cachable anywhere, including in the linear map. If somebody for some reason needs to play with special attributes on regular RAM for which inconsistent aliasing could be a problem:
- please explain why that consistency issue is mentioned in the context of write-combining/PAT: the problem already potentially exists through the use of the _PAGE_PCD attribute, and having an extra WC choice should not make the problem worse or better (note that with the initial patch that WC/PAT combination is only exploited in pci_mmap_page_range(), which rightfully doesn't seem to care about cachable attribute consistency anyway).

> Also when there is already a MTRR it might not work due to the
> complicated rules of MTRR<->PAT interaction.

Some PAT<->MTRR cases are messy, but getting a WC type through PAT seems to straightforwardly take precedence over any MTRR type, doesn't it?

> Then there are old CPU errata that need to be handled etc.

We mentioned that point in the introduction of the patch. We have looked at the documented PAT errata that exist for the Pentium-II, Pentium-III, Pentium-M, and some early Pentium-IV processors. While there are minor variations in the description of the bug depending on the processor, they all fit the following description: "under some circumstances the PAT bit might be ignored". The patch purposefully puts write-combining at PAT6, so if the conditions are there for the errata to trigger, PAT2 (UC-) will be selected by the processor and the corresponding accesses will simply be uncacheable instead of write-combining, which doesn't affect correctness. We would certainly appreciate having any other errata we missed mentioned here for reference.

> There are also some other issues.

Introducing something like ioremap_wc() or an "sfence" wmb_wc() was excluded from the initial patch on purpose. It would be the logical next step (but involves possible driver API extensions), so the proposed patch was limited to making use of the new WC attribute by really handling the write_combine argument of pci_mmap_page_range(). That seemed to generate enough objections to start with. There is at least one mostly cosmetic problem in the patch in pci_mmap_page_range(), where huge pages should not be a concern.

> You didn't solve all that at all. If it was as simple as your patch
> we would have long done it already.

I am sorry, but after this and other messages on the list, I still don't understand why a simple approach hasn't been made available already:
- the attribute consistency issue seems independent of using PAT to create WC mappings (in particular, the possibility of mixing by accident WC and UC aliases has always existed; having a broken driver make such aliases through a new PAT-based attribute combination hardly changes anything, same for mixing uncachable and cachable aliases, where the new WC/PAT combination doesn't change anything).

Sure, there would be some logical follow-ups after a simple patch (like providing a proper ioremap() interface, maybe a new kind of barrier, and handling the PAT bit properly in arch/x86/mm/pageattr.c), but those follow-ups were purposefully excluded, and none seems very complex either. So if there is any remaining issue with doing something "as simple as this patch", please clarify it.

Best regards,
Loic
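The errata argument above (WC at PAT6 degrading to UC- at PAT2) follows directly from how the page-table bits index the PAT MSR. A minimal sketch, with illustrative memory-type names rather than the MSR's actual byte encodings, and assuming the power-on default layout for entries 0-3:

```c
/* Symbolic x86 memory types (names ours, not the MSR encodings) */
enum mem_type { WB, WT, UC_MINUS, UC, WC };

/* PAT table as the patch proposes it: defaults kept, WC placed at
 * entry 6 (the power-on default for PAT4-7 mirrors PAT0-3). */
static const enum mem_type pat_table[8] = {
    WB, WT, UC_MINUS, UC,   /* entries 0-3: power-on defaults  */
    WB, WT, WC,       UC,   /* entries 4-7: WC swapped into 6  */
};

/* Page-table bits select a PAT entry: index = PAT<<2 | PCD<<1 | PWT.
 * The documented errata amount to the PAT bit being ignored, i.e. the
 * index losing its high bit. */
static enum mem_type effective_type(int pat, int pcd, int pwt, int errata)
{
    int idx = (pat << 2) | (pcd << 1) | pwt;
    if (errata)
        idx &= 3;           /* PAT bit ignored: falls back to 0-3 */
    return pat_table[idx];
}
```

So on an errata-afflicted CPU a would-be WC mapping (index 6) silently becomes UC- (index 2): slower, but never cache-incoherent, which is the correctness claim made above.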
Re: [PATCH] MTRR: Fix race causing set_mtrr to go into infinite loop
On 6/25/2007 6:34 PM, Andi Kleen wrote:
> On Tuesday 26 June 2007 00:05:17 Chuck Ebbert wrote:
>
>> On 06/25/2007 05:38 PM, Loic Prylli wrote:
>>
>> [cc: Andi]
>>
>>> Processors synchronization in set_mtrr requires the .gate field
>>> to be set after .count field is properly initialized. Without an
>>> explicit barrier, the compiler was reordering those memory stores.
>>> That was sometimes causing a processor (in ipi_handler) to see the
>>> .gate change and decrement .count before the latter is set by
>>> set_mtrr() (which then hangs in an infinite loop with irqs disabled).
>
> Hmm, perhaps we should just put the smp_wmb into atomic_set().
> Near all other atomic operations have memory barriers too. I think
> that would be the better fix.
>
> -Andi

In Documentation/atomic_ops.txt, atomic_set/atomic_read are described as nothing more than a type-safe assignment or read, without any extra semantics. For the other atomic operations, the rule is that any atomic operation that doesn't return a value doesn't come with a barrier (and any operation that returns the atomic value must have memory barriers). So I guess you are suggesting to change the doc and the implementation for all arches.

I should admit I did not know before that a number of atomic operations do not imply memory barriers. But the extra cost might not be completely negligible, especially if, for consistency with the other "barrier-implied" atomic operations, a new memory barrier is put both before and after. Are you suggesting changing just atomic_set(), or also the other barrier-free atomic operations: "atomic_dec", "atomic_inc", "atomic_add", "atomic_sub"?

Independently of what is done to atomic, what about not making the .gate field an atomic_t, but a simple "int" in the mtrr code, since the only operations done on it are atomic_read and atomic_set?

Loic
[PATCH] MTRR: Fix race causing set_mtrr to go into infinite loop
Processors synchronization in set_mtrr requires the .gate field to be set after the .count field is properly initialized. Without an explicit barrier, the compiler was reordering those memory stores. That was sometimes causing a processor (in ipi_handler) to see the .gate change and decrement .count before the latter is set by set_mtrr() (which then hangs in an infinite loop with irqs disabled).

Signed-off-by: Loic Prylli <[EMAIL PROTECTED]>
---
 arch/i386/kernel/cpu/mtrr/main.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/arch/i386/kernel/cpu/mtrr/main.c b/arch/i386/kernel/cpu/mtrr/main.c
index 55b0051..75dc6d5 100644
--- a/arch/i386/kernel/cpu/mtrr/main.c
+++ b/arch/i386/kernel/cpu/mtrr/main.c
@@ -229,6 +229,8 @@ static void set_mtrr(unsigned int reg, unsigned long base,
 	data.smp_size = size;
 	data.smp_type = type;
 	atomic_set(&data.count, num_booting_cpus() - 1);
+	/* make sure data.count is visible before unleashing other CPUs */
+	smp_wmb();
 	atomic_set(&data.gate, 0);

 	/* Start the ball rolling on other CPUs */
@@ -242,6 +244,7 @@ static void set_mtrr(unsigned int reg, unsigned long base,
 	/* ok, reset count and toggle gate */
 	atomic_set(&data.count, num_booting_cpus() - 1);
+	smp_wmb();
 	atomic_set(&data.gate, 1);

 	/* do our MTRR business */
@@ -260,6 +263,7 @@ static void set_mtrr(unsigned int reg, unsigned long base,
 		cpu_relax();
 	atomic_set(&data.count, num_booting_cpus() - 1);
+	smp_wmb();
 	atomic_set(&data.gate, 0);

 	/*
--
1.5.2.2
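The ordering the patch enforces can be reproduced in user space with C11 atomics. This is a sketch of the same gate/count rendezvous pattern, not kernel code: a release fence stands in for smp_wmb(), and the names are invented for the model.

```c
#include <stdatomic.h>

/* Model of the set_mtrr() rendezvous: .count must be initialized and
 * visible before .gate flips, which the added smp_wmb() (here a
 * release fence) guarantees. */
struct mtrr_sync {
    atomic_int count;
    atomic_int gate;
};

/* set_mtrr() side: arm the rendezvous for the other CPUs. */
static void master_open_gate(struct mtrr_sync *s, int other_cpus)
{
    atomic_store_explicit(&s->count, other_cpus, memory_order_relaxed);
    /* make sure count is visible before unleashing other CPUs */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&s->gate, 1, memory_order_relaxed);
}

/* ipi_handler() side: returns waiters remaining after this CPU checks
 * in, or -1 if the gate is still closed (the caller would spin). */
static int handler_check_in(struct mtrr_sync *s)
{
    if (atomic_load_explicit(&s->gate, memory_order_relaxed) == 0)
        return -1;
    /* pairs with the release fence: count is now guaranteed valid */
    atomic_thread_fence(memory_order_acquire);
    return atomic_fetch_sub(&s->count, 1) - 1;
}
```

Without the release fence, the count store may become visible after the gate store, and a handler can decrement a stale count — the exact hang the patch fixes.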
mask/unmasking while servicing MSI(-X) unnecessary?
While looking into using MSI-X for our myri10ge driver, we have seen that the msi(x) code (drivers/pci/msi.c) masks the MSI(-X) vector while servicing an interrupt. We are not sure why this masking is needed (for instance, no such thing is done for "edge IOAPIC" interrupts). There seem to be several mechanisms already, each individually protecting against "losing an interrupt" without masking:
- the "x86" architecture is able to queue 2 interrupt messages. That guarantees an interrupt handler will always start after the last MSI received (even in the case of a big burst of MSI messages).
- Even if there wasn't that interrupt queuing, ack_APIC_irq() could be moved into the ack() method. Then things would work without masking even on a hypothetical platform where a new interrupt is completely ignored (with no IRR-like register) while servicing the same vector (anyway, since this "msi" code is already tied to "x86" architecture specificities, that hypothetical platform might not be relevant).
- Almost every driver/device has its own way of acknowledging interrupts anyway, which also by itself makes the masking/unmasking unnecessary.
- The masking by itself is racy unless the driver interrupt handler starts by making sure the masking request has reached the device with some kind of MMIO read. Such an MMIO read is normally the kind of costly request you are happy to get rid of in the MSI model.

So if it is not useful, it might be better to avoid that systematic mask/unmask pair. This masking has a small but measurable performance impact for our device/driver combo. Would you agree to suppress that masking (sample patch following)? Or otherwise, is there a possibility of making it optional on a per-device basis?

Thanks for any comment!

Loic Prylli
Myricom, Inc.

[PATCH] Don't mask the interrupt vector while servicing an MSI interrupt.
Signed-off-by: Loic Prylli <[EMAIL PROTECTED]>
---
 drivers/pci/msi.c |   19 ++++++-------------
 1 files changed, 6 insertions(+), 13 deletions(-)

98e9a27091c3ccdc8a8a72cea9ab9086fb258af3
diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index a83c1f5..af446e3 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -200,19 +200,12 @@ static void shutdown_msi_irq(unsigned in
 	spin_unlock_irqrestore(&msi_lock, flags);
 }

-static void end_msi_irq_wo_maskbit(unsigned int vector)
+static void end_msi_irq(unsigned int vector)
 {
 	move_native_irq(vector);
 	ack_APIC_irq();
 }

-static void end_msi_irq_w_maskbit(unsigned int vector)
-{
-	move_native_irq(vector);
-	unmask_MSI_irq(vector);
-	ack_APIC_irq();
-}
-
 static void do_nothing(unsigned int vector)
 {
 }
@@ -227,8 +220,8 @@ static struct hw_interrupt_type msix_irq
 	.shutdown	= shutdown_msi_irq,
 	.enable		= unmask_MSI_irq,
 	.disable	= mask_MSI_irq,
-	.ack		= mask_MSI_irq,
-	.end		= end_msi_irq_w_maskbit,
+	.ack		= do_nothing,
+	.end		= end_msi_irq,
 	.set_affinity	= set_msi_affinity
 };
@@ -243,8 +236,8 @@ static struct hw_interrupt_type msi_irq_
 	.shutdown	= shutdown_msi_irq,
 	.enable		= unmask_MSI_irq,
 	.disable	= mask_MSI_irq,
-	.ack		= mask_MSI_irq,
-	.end		= end_msi_irq_w_maskbit,
+	.ack		= do_nothing,
+	.end		= end_msi_irq,
 	.set_affinity	= set_msi_affinity
 };
@@ -260,7 +253,7 @@ static struct hw_interrupt_type msi_irq_
 	.enable		= do_nothing,
 	.disable	= do_nothing,
 	.ack		= do_nothing,
-	.end		= end_msi_irq_wo_maskbit,
+	.end		= end_msi_irq,
 	.set_affinity	= set_msi_affinity
 };
--
1.3.2
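The first argument above — the local APIC effectively latching one extra message per vector while one is in service — can be sketched with ISR/IRR-style bits. This is a rough model with invented names; real APIC behavior is more involved.

```c
#include <stdbool.h>

struct msi_vector {
    bool in_service;  /* ISR-like bit: handler currently running  */
    bool pending;     /* IRR-like bit: one more edge latched      */
    int  runs;        /* handler invocations                      */
};

/* An MSI message arrives for the vector. */
static void msi_arrive(struct msi_vector *v)
{
    if (v->in_service) {
        v->pending = true;   /* latched; further edges coalesce */
    } else {
        v->in_service = true;
        v->runs++;           /* handler starts */
    }
}

/* The driver handler finishes and the CPU writes EOI. */
static void msi_eoi(struct msi_vector *v)
{
    v->in_service = false;
    if (v->pending) {        /* replay the latched edge */
        v->pending = false;
        v->in_service = true;
        v->runs++;
    }
}
```

However large the burst during service, the handler is guaranteed one more run after the last message, which is why the mail argues the mask/unmask pair around each interrupt buys nothing.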