Re: MMIO and gcc re-ordering issue
Nick Piggin writes: /* turn off LED */ val64 = readq(bar0-adapter_control); val64 = val64 (~ADAPTER_LED_ON); writeq(val64, bar0-adapter_control); s2io_link(nic, LINK_DOWN); } clear_bit(__S2IO_STATE_LINK_TASK, (nic-state)); Now I can't say definitively that this is going to be wrong on powerpc, because I don't know the code well enough. But I'd be 90% sure that the unlock is not supposed to be visible to other CPUs before the writeqs are queued to the card. On x86 it wouldn't be. Interestingly, there is also a store to cacheable memory (nic-device_enabled_once), but no smp_wmb or equivalent before the clear_bit. So there are other potential problems here besides the I/O related ones. Anyway, I have done some tests on a dual G5 here with putting a sync on both sides of the store in writel etc. (i.e. making readl/writel strongly ordered w.r.t. everything else), and as you predicted, there wasn't a noticeable performance degradation, at least not on the couple of things I tried. So I am now inclined to accept your suggestion that we should do that. I should probably do some similar checks on POWER6 and a few other machines first, though. Paul. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On Thursday 12 June 2008 22:14, Paul Mackerras wrote: Nick Piggin writes: /* turn off LED */ val64 = readq(bar0-adapter_control); val64 = val64 (~ADAPTER_LED_ON); writeq(val64, bar0-adapter_control); s2io_link(nic, LINK_DOWN); } clear_bit(__S2IO_STATE_LINK_TASK, (nic-state)); Now I can't say definitively that this is going to be wrong on powerpc, because I don't know the code well enough. But I'd be 90% sure that the unlock is not supposed to be visible to other CPUs before the writeqs are queued to the card. On x86 it wouldn't be. Interestingly, there is also a store to cacheable memory (nic-device_enabled_once), but no smp_wmb or equivalent before the clear_bit. So there are other potential problems here besides the I/O related ones. Yeah there sure is. That sucks too, but we go one step at a time ;) I think proposing a strong ordering between set_bit/clear_bit would actually be quite noticable slowdown in core kernel code at this point. Which reminds me, I have been meaning to do another pass of test and set bit / clear bit conversions to the _lock primitives... Anyway, I have done some tests on a dual G5 here with putting a sync on both sides of the store in writel etc. (i.e. making readl/writel strongly ordered w.r.t. everything else), and as you predicted, there wasn't a noticeable performance degradation, at least not on the couple of things I tried. So I am now inclined to accept your suggestion that we should do that. I should probably do some similar checks on POWER6 and a few other machines first, though. Oh good, thanks for looking into it. I guess it might be a little more noticable on bigger POWER systems. And I think we might even need to do a PCI read after every writel on sn2 systems in order to get the semantics I want. I can't say it won't be noticable. But if we consider the number of drivers (maybe one or two dozen well maintained ones), and number of sites in each driver (maybe one or two submission and completion fastpaths which should have a minimum of IO operations in each one) that will have to be converted in order to get performance as good or better than it is currently with relaxed accessors and weigh that against all the places in those and every other crappy obscure driver that we *won't* have to audit, I really think we end up with a net win even with some short term pain. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On Thu, Jun 05, 2008 at 06:43:53PM +1000, Benjamin Herrenschmidt wrote: Note that the powerpc implementation currently clears the flag on spin_lock and tests it on unlock. We are considering changing that to not touch the flag on spin_lock and just clear it whenever we do a sync (ie, on unlock, on explicit mmiowb, and possibly even on readl's where we happen to do sync's). Your current scheme sounds like it's broken for spin_lock(a) writel(); spin_lock(b); spin_unlock(b); spin_unlock(a); -- Intel are signing my paycheques ... these opinions are still mine Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On Thu, 2008-06-12 at 09:07 -0600, Matthew Wilcox wrote: On Thu, Jun 05, 2008 at 06:43:53PM +1000, Benjamin Herrenschmidt wrote: Note that the powerpc implementation currently clears the flag on spin_lock and tests it on unlock. We are considering changing that to not touch the flag on spin_lock and just clear it whenever we do a sync (ie, on unlock, on explicit mmiowb, and possibly even on readl's where we happen to do sync's). Your current scheme sounds like it's broken for spin_lock(a) writel(); spin_lock(b); spin_unlock(b); spin_unlock(a); Which is why we are considering changing it :-) But as Paulus said before, he did some measurement and we came to the conclusion that (pending more measurements on a wider range of HW) we may as well drop the whole scheme and make writel fully synchronous instead. Then, we can get some nice weakly ordered accessors and start adding them with appropriate explicit barriers to the hot path of perf. critical drivers we care about. Cheers, Ben. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On Wednesday 11 June 2008 15:35, Nick Piggin wrote: On Wednesday 11 June 2008 15:13, Paul Mackerras wrote: Nick Piggin writes: I just wish we had even one actual example of things going wrong with the current rules we have on powerpc to motivate changing to this model. ~/usr/src/linux-2.6 git grep test_and_set_bit drivers/ | wc -l 506 How sure are you that none of those forms part of a cobbled-together locking scheme that hopes to constrain some IO access? My comment was precisely about the fact that this sort of argument is actually FUD. I want one example that is demonstrably wrong, not just a how sure are you. But this is a case where you really should be scared, so FUD is completely appropriate. At least, I don't see how it is at all worse than FUD about how much slower everything will be with stronger ordering, and how nobody will be able to do anything about it. In reality, for most devices it will not be measurable, and for some like network or disk might use a bit of a tweak in one or two fastpaths. If the driver author honestly does not know the code well enough to say whether a particular ordering is allowed or not, then it should just not be allowed. Anyway, what do I win if I give you a concrete example? I found one in the first file I picked out of my grep: drivers/net/s2io.c if (test_and_set_bit(__S2IO_STATE_LINK_TASK, (nic-state))) { /* The card is being reset, no point doing anything */ goto out_unlock; } ... val64 = readq(bar0-adapter_control); val64 |= ADAPTER_LED_ON; writeq(val64, bar0-adapter_control); s2io_link(nic, LINK_UP); } else { if (CARDS_WITH_FAULTY_LINK_INDICATORS(nic-device_type, subid)) { val64 = readq(bar0-gpio_control); val64 = ~GPIO_CTRL_GPIO_0; writeq(val64, bar0-gpio_control); val64 = readq(bar0-gpio_control); } /* turn off LED */ val64 = readq(bar0-adapter_control); val64 = val64 (~ADAPTER_LED_ON); writeq(val64, bar0-adapter_control); s2io_link(nic, LINK_DOWN); } clear_bit(__S2IO_STATE_LINK_TASK, (nic-state)); Now I can't say definitively that this is going to be wrong on powerpc, because I don't know the code well enough. But I'd be 90% sure that the unlock is not supposed to be visible to other CPUs before the writeqs are queued to the card. On x86 it wouldn't be. And I hope you're reading this, sn2 guys :) The mmiowb() audit for spin_unlock was a nice idea, but this example shows that unless you *really* know the code and have actually audited everything, then you should not be messing with relaxing of ordering etc. Now obviously the example I quote is a slowpath, in which case it can very happily continue to use strongly ordered io accessors, and you will never ever notice the extra sync or PCI read or whatever after the writeq. Then you can easily relax the couple of fastpaths that matter, put barriers in there, comment them, and you're away: you now have a *working* driver (that doesn't randomly lose adapter_control bits twice a year), and that is also faster than it was when it was broken because you've specifically used the very weakly ordered IO accessors in the fast paths that matter. And when somebody plugs in some obscure hardware into your box that you haven't audited and put mmiowb etc etc into, then guess what? It is also not going to blow up in an undebuggable way 6 months later in the middle of insert something important here. The worst case is that the user notices it is running a bit slowly and sends a full profile output the next day. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
Nick Piggin writes: Now that doesn't leave waker ordering architectures lumped with slow old x86 semantics. Think of it as giving them the benefit of sharing x86 development and testing :) Worth something, but not perhaps as much as you think, given that many x86 driver writers still don't pay much attention to making their code endian-safe and 64-bit-clean. And there are still plenty of drivers that use virt_to_bus... Paul. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On Wed, 11 Jun 2008, Nick Piggin wrote: I can't actually find the definitive statement in the Intel manuals saying UC is strongly ordered also WRT WB. Linus? Definitive? Dunno. But look in the Architecture manual, volume 3A, 10.3 Methods of Caching Available, and then under the bullet about Write Combining (WC), it says the writes may be delayed until the next occurrence of a serializing event; such as, an SFENCE of MFENCE instruction, CPUID execution, a read or write to uncached memory, an interrupt occurrence, or a LOCK instruction execution. However, it's worth noting that - documentation can be wrong, or even if right, can be Intel-specific. - the above is expressly _only_ about the WC buffer, not about regular memory writes. Cached memory accesses are different from WC accesses. so in the end, the thing that matters is how things actually work. Linus ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On Tuesday, June 10, 2008 8:29 pm Nick Piggin wrote: On Wednesday 11 June 2008 05:19, Jesse Barnes wrote: On Tuesday, June 10, 2008 12:05 pm Roland Dreier wrote: me too. That's the whole basis for readX_relaxed() and its cohorts: we make our weirdest machines (like altix) conform to the x86 norm. Then where it really kills us we introduce additional semantics to selected drivers that enable us to recover I/O speed on the abnormal platforms. Except as I pointed out before, Altix doesn't conform to the norm and many (most?) drivers are missing mmiowb()s that are needed for Altix. Just no one has plugged most devices into an Altix (or haven't stressed the driver in a way that exposes problems of IO ordering between CPUs). It would be a great thing to use the powerpc trick of setting a flag that is tested by spin_unlock()/mutex_unlock() and automatically doing the mmiowb() if needed, and then killing off mmiowb() entirely. Yeah I think that's what Nick's guidelines would guarantee. And Jes is already working on the spin_unlock change you mentioned, so mmiowb() should be history soon (in name only, assuming Nick also introduces the I/O barriers he talked about for ordering the looser accessors it would still be there but would be called io_wmb or something). Exactly, yes. I guess everybody has had good intentions here, but as noticed, what is lacking is coordination and documentation. You mention strong ordering WRT spin_unlock, which suggests that you would prefer to take option #2 (the current powerpc one): io/io is ordered and io is contained inside spinlocks, but io/cacheable in general is not ordered. I was thinking it would be good for the weaker accessors, but now that I think about it you could just use the new io_* barrier functions. I didn't mean to imply that I wasn't in favor of the io/cacheable ordering as well. For any high performance drivers that are well maintained (ie. the ones where slowdown might be noticed), everyone should have a pretty good handle on memory ordering requirements, so it shouldn't take long to go through and convert them to relaxed accessors. Yep. Thanks for working on this, Nick, it's definitely a good thing that you're taking control of it. :) Thanks, Jesse ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On Wednesday 04 June 2008 05:07, Linus Torvalds wrote: On Tue, 3 Jun 2008, Trent Piepho wrote: On Tue, 3 Jun 2008, Linus Torvalds wrote: On Tue, 3 Jun 2008, Nick Piggin wrote: Linus: on x86, memory operations to wc and wc+ memory are not ordered with one another, or operations to other memory types (ie. load/load and store/store reordering is allowed). Also, as you know, store/load reordering is explicitly allowed as well, which covers all memory types. So perhaps it is not quite true to say readl/writel is strongly ordered by default even on x86. You would have to put in some mfence instructions in them to make it so. So on x86, these could be re-ordered? writel(START_OPERATION, CONTROL_REGISTER); status = readl(STATUS_REGISTER); With both registers in a WC+ area, yes. The write may be in the WC buffers until the WC buffers are flushed (short list: a fence, a serializing instruction, a read-write to uncached memory, or an interrupt. There are others, but those are the main ones). But if the status register is in uncached memory (which is the only *sane* thing to do), then it doesn't matter if the control register is in WC memory. Because the status register read is itself serializing with the WC buffer, it's actually fine. So this is used for putting things like ring queues in WC memory, and fill them up with writes, and get nice bursty write traffic with the CPU automatically buffering it up (think stdio.h on a really low level). And if you then have the command registers in UC memory or using IO port accesses, reading and writing to them will automatically serialize. OK, I'm sitll not quite sure where this has ended up. I guess you are happy with x86 semantics as they are now. That is, all IO accesses are strongly ordered WRT one another and WRT cacheable memory (which includes keeping them within spinlocks), *unless* one asks for WC memory, in which case that memory is quite weakly ordered (and is not even ordered by a regular IO readl, at least according to AMD spec). So for WC memory, one still needs to use mb/rmb/wmb. So that still doesn't tell us what *minimum* level of ordering we should provide in the cross platform readl/writel API. Some relatively sane suggestions would be: - as strong as x86. guaranteed not to break drivers that work on x86, but slower on some archs. To me, this is most pleasing. It is much much easier to notice something is going a little slower and to work out how to use weaker ordering there, than it is to debug some once-in-a-bluemoon breakage caused by just the right architecture, driver, etc. It totally frees up the driver writer from thinking about barriers, provided they get the locking right. - ordered WRT other IO accessors, constrained within spinlocks, but not cacheable memory. This is what powerpc does now. It's a little faster for them, and probably covers the vast majority of drivers, but there are real possibilities to get it wrong (trivial example: using bit locks or mutexes or any kind of open coded locking or lockless synchronisation can break). - (less sane) same as above, but not ordered WRT spinlocks. This is what ia64 (sn2) does. From a purist POV, it is a little less arbitrary than powerpc, but in practice, it will break a lot more drivers than powerpc. I was kind of joking about taking control of this issue :) But seriously, it needs a decision to be made. I vote for #1. My rationale: I'm still finding relatively major (well, found maybe 4 or 5 in the last couple of years) bugs in the mm subsystem due to memory ordering problems. This is apparently one of the most well reviewed and tested bit of code in the kernel by people who know all about memory ordering. Not to mention that mm/ does not have to worry about IO ordering at all. Then apparently driver are the least reviewed and tested. Connect dots. Now that doesn't leave waker ordering architectures lumped with slow old x86 semantics. Think of it as giving them the benefit of sharing x86 development and testing :) We can then formalise the relaxed __ accessors to be more complete (ie. +/- byteswapping). I'd also propose to add io_rmb/io_wmb/io_mb that order io/io access, to help architectures like sn2 where the io/cacheable barrier is pretty expensive. Any comments? ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On Monday, June 09, 2008 11:56 pm Nick Piggin wrote: So that still doesn't tell us what *minimum* level of ordering we should provide in the cross platform readl/writel API. Some relatively sane suggestions would be: - as strong as x86. guaranteed not to break drivers that work on x86, but slower on some archs. To me, this is most pleasing. It is much much easier to notice something is going a little slower and to work out how to use weaker ordering there, than it is to debug some once-in-a-bluemoon breakage caused by just the right architecture, driver, etc. It totally frees up the driver writer from thinking about barriers, provided they get the locking right. - ordered WRT other IO accessors, constrained within spinlocks, but not cacheable memory. This is what powerpc does now. It's a little faster for them, and probably covers the vast majority of drivers, but there are real possibilities to get it wrong (trivial example: using bit locks or mutexes or any kind of open coded locking or lockless synchronisation can break). - (less sane) same as above, but not ordered WRT spinlocks. This is what ia64 (sn2) does. From a purist POV, it is a little less arbitrary than powerpc, but in practice, it will break a lot more drivers than powerpc. I was kind of joking about taking control of this issue :) But seriously, it needs a decision to be made. I vote for #1. My rationale: I'm still finding relatively major (well, found maybe 4 or 5 in the last couple of years) bugs in the mm subsystem due to memory ordering problems. This is apparently one of the most well reviewed and tested bit of code in the kernel by people who know all about memory ordering. Not to mention that mm/ does not have to worry about IO ordering at all. Then apparently driver are the least reviewed and tested. Connect dots. Now that doesn't leave waker ordering architectures lumped with slow old x86 semantics. Think of it as giving them the benefit of sharing x86 development and testing :) We can then formalise the relaxed __ accessors to be more complete (ie. +/- byteswapping). I'd also propose to add io_rmb/io_wmb/io_mb that order io/io access, to help architectures like sn2 where the io/cacheable barrier is pretty expensive. Any comments? FWIW that approach sounds pretty good to me. Arches that suffer from performance penalties can still add lower level primitives and port selected drivers over, so really they won't be losing much. AFAICT though drivers will still have to worry about regular memory ordering issues; they'll just be safe from I/O related ones. :) Still, the simplification is probably worth it. Jesse ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On Tue, 2008-06-10 at 10:41 -0700, Jesse Barnes wrote: On Monday, June 09, 2008 11:56 pm Nick Piggin wrote: So that still doesn't tell us what *minimum* level of ordering we should provide in the cross platform readl/writel API. Some relatively sane suggestions would be: - as strong as x86. guaranteed not to break drivers that work on x86, but slower on some archs. To me, this is most pleasing. It is much much easier to notice something is going a little slower and to work out how to use weaker ordering there, than it is to debug some once-in-a-bluemoon breakage caused by just the right architecture, driver, etc. It totally frees up the driver writer from thinking about barriers, provided they get the locking right. - ordered WRT other IO accessors, constrained within spinlocks, but not cacheable memory. This is what powerpc does now. It's a little faster for them, and probably covers the vast majority of drivers, but there are real possibilities to get it wrong (trivial example: using bit locks or mutexes or any kind of open coded locking or lockless synchronisation can break). - (less sane) same as above, but not ordered WRT spinlocks. This is what ia64 (sn2) does. From a purist POV, it is a little less arbitrary than powerpc, but in practice, it will break a lot more drivers than powerpc. I was kind of joking about taking control of this issue :) But seriously, it needs a decision to be made. I vote for #1. My rationale: I'm still finding relatively major (well, found maybe 4 or 5 in the last couple of years) bugs in the mm subsystem due to memory ordering problems. This is apparently one of the most well reviewed and tested bit of code in the kernel by people who know all about memory ordering. Not to mention that mm/ does not have to worry about IO ordering at all. Then apparently driver are the least reviewed and tested. Connect dots. Now that doesn't leave waker ordering architectures lumped with slow old x86 semantics. Think of it as giving them the benefit of sharing x86 development and testing :) We can then formalise the relaxed __ accessors to be more complete (ie. +/- byteswapping). I'd also propose to add io_rmb/io_wmb/io_mb that order io/io access, to help architectures like sn2 where the io/cacheable barrier is pretty expensive. Any comments? FWIW that approach sounds pretty good to me. Arches that suffer from performance penalties can still add lower level primitives and port selected drivers over, so really they won't be losing much. AFAICT though drivers will still have to worry about regular memory ordering issues; they'll just be safe from I/O related ones. :) Still, the simplification is probably worth it. me too. That's the whole basis for readX_relaxed() and its cohorts: we make our weirdest machines (like altix) conform to the x86 norm. Then where it really kills us we introduce additional semantics to selected drivers that enable us to recover I/O speed on the abnormal platforms. About the only problem we've had is that architectures aren't very good at co-ordinating for their additional accessors so we tend to get a forest of strange ones growing up, which appear only in a few drivers (i.e. the ones that need the speed ups) and which have no well documented meaning. James ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
me too. That's the whole basis for readX_relaxed() and its cohorts: we make our weirdest machines (like altix) conform to the x86 norm. Then where it really kills us we introduce additional semantics to selected drivers that enable us to recover I/O speed on the abnormal platforms. Except as I pointed out before, Altix doesn't conform to the norm and many (most?) drivers are missing mmiowb()s that are needed for Altix. Just no one has plugged most devices into an Altix (or haven't stressed the driver in a way that exposes problems of IO ordering between CPUs). It would be a great thing to use the powerpc trick of setting a flag that is tested by spin_unlock()/mutex_unlock() and automatically doing the mmiowb() if needed, and then killing off mmiowb() entirely. - R. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On Tuesday, June 10, 2008 12:05 pm Roland Dreier wrote: me too. That's the whole basis for readX_relaxed() and its cohorts: we make our weirdest machines (like altix) conform to the x86 norm. Then where it really kills us we introduce additional semantics to selected drivers that enable us to recover I/O speed on the abnormal platforms. Except as I pointed out before, Altix doesn't conform to the norm and many (most?) drivers are missing mmiowb()s that are needed for Altix. Just no one has plugged most devices into an Altix (or haven't stressed the driver in a way that exposes problems of IO ordering between CPUs). It would be a great thing to use the powerpc trick of setting a flag that is tested by spin_unlock()/mutex_unlock() and automatically doing the mmiowb() if needed, and then killing off mmiowb() entirely. Yeah I think that's what Nick's guidelines would guarantee. And Jes is already working on the spin_unlock change you mentioned, so mmiowb() should be history soon (in name only, assuming Nick also introduces the I/O barriers he talked about for ordering the looser accessors it would still be there but would be called io_wmb or something). Jesse ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On Wednesday 11 June 2008 05:19, Jesse Barnes wrote: On Tuesday, June 10, 2008 12:05 pm Roland Dreier wrote: me too. That's the whole basis for readX_relaxed() and its cohorts: we make our weirdest machines (like altix) conform to the x86 norm. Then where it really kills us we introduce additional semantics to selected drivers that enable us to recover I/O speed on the abnormal platforms. Except as I pointed out before, Altix doesn't conform to the norm and many (most?) drivers are missing mmiowb()s that are needed for Altix. Just no one has plugged most devices into an Altix (or haven't stressed the driver in a way that exposes problems of IO ordering between CPUs). It would be a great thing to use the powerpc trick of setting a flag that is tested by spin_unlock()/mutex_unlock() and automatically doing the mmiowb() if needed, and then killing off mmiowb() entirely. Yeah I think that's what Nick's guidelines would guarantee. And Jes is already working on the spin_unlock change you mentioned, so mmiowb() should be history soon (in name only, assuming Nick also introduces the I/O barriers he talked about for ordering the looser accessors it would still be there but would be called io_wmb or something). Exactly, yes. I guess everybody has had good intentions here, but as noticed, what is lacking is coordination and documentation. You mention strong ordering WRT spin_unlock, which suggests that you would prefer to take option #2 (the current powerpc one): io/io is ordered and io is contained inside spinlocks, but io/cacheable in general is not ordered. I *would* prefer that io/cacheable actually is strongly ordered with the default accessors. Because if you have that, then the driver writer never has to care about memory ordering, provided they use correct locking for SMP issues. Same as x86. With option 2, there are still windows where you could possibly have issues. For any high performance drivers that are well maintained (ie. the ones where slowdown might be noticed), everyone should have a pretty good handle on memory ordering requirements, so it shouldn't take long to go through and convert them to relaxed accessors. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On Wed, 2008-06-11 at 13:29 +1000, Nick Piggin wrote: Exactly, yes. I guess everybody has had good intentions here, but as noticed, what is lacking is coordination and documentation. You mention strong ordering WRT spin_unlock, which suggests that you would prefer to take option #2 (the current powerpc one): io/io is ordered and io is contained inside spinlocks, but io/cacheable in general is not ordered. IO/cacheable -is- ordered on powepc in what we believe is the direction that matter: IO reads are fully ordered vs. anything and IO writes are ordered vs. previous cacheable stores. The only relaxed situation is IO writes followed by cacheable stores, which I believe shouldn't be a problem. (except for spinlocks for which we use the flag trick) I *would* prefer that io/cacheable actually is strongly ordered with the default accessors. Because if you have that, then the driver writer never has to care about memory ordering, provided they use correct locking for SMP issues. Same as x86. With option 2, there are still windows where you could possibly have issues. For any high performance drivers that are well maintained (ie. the ones where slowdown might be noticed), everyone should have a pretty good handle on memory ordering requirements, so it shouldn't take long to go through and convert them to relaxed accessors. Ben. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On Wednesday 11 June 2008 13:40, Benjamin Herrenschmidt wrote: On Wed, 2008-06-11 at 13:29 +1000, Nick Piggin wrote: Exactly, yes. I guess everybody has had good intentions here, but as noticed, what is lacking is coordination and documentation. You mention strong ordering WRT spin_unlock, which suggests that you would prefer to take option #2 (the current powerpc one): io/io is ordered and io is contained inside spinlocks, but io/cacheable in general is not ordered. IO/cacheable -is- ordered on powepc in what we believe is the direction that matter: IO reads are fully ordered vs. anything and IO writes are ordered vs. previous cacheable stores. The only relaxed situation is IO writes followed by cacheable stores, which I believe shouldn't be a problem. (except for spinlocks for which we use the flag trick) Spinlocks... mutexes, semaphores, rwsems, rwlocks, bit spinlocks, bit mutexes, open coded bit locks (of which there are a number floating around in drivers/). But even assuming you get all of that fixed up. I wonder what is such a big benefit to powerpc that you'll rather add the exception cacheable stores are not ordered with previous io stores than to say any driver which works on x86 will work on powerpc as far as memory ordering goes? (don't you also need something to order io reads with cacheable reads? as per my observation that your rmb is broken according to IBM docs) Obviously you already have a sync instruction in your writel, so 1) adding a second one doesn't slow it down by an order of mangnitude or anything, just some small factor; and 2) you obviously want to be converting high performance drivers to a more relaxed model anyway regardless of whether there is one sync or two in there. Has this ever been measured or thought carefully about? Beyond the extent of one sync good, two sync bad ;) ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
Nick Piggin writes: OK, I'm sitll not quite sure where this has ended up. I guess you are happy with x86 semantics as they are now. That is, all IO accesses are strongly ordered WRT one another and WRT cacheable memory (which includes keeping them within spinlocks), My understanding was that on x86, loads could pass stores in general, i.e. a later load could be performed before an earlier store. I guess that can't be true for uncached loads, but could a cacheable load be performed before an earlier uncached store? - as strong as x86. guaranteed not to break drivers that work on x86, but slower on some archs. To me, this is most pleasing. It is much much easier to notice something is going a little slower and to work out how to use weaker ordering there, than it is to debug some once-in-a-bluemoon breakage caused by just the right architecture, driver, etc. It totally frees up the driver writer from thinking about barriers, provided they get the locking right. I just wish we had even one actual example of things going wrong with the current rules we have on powerpc to motivate changing to this model. Now that doesn't leave waker ordering architectures lumped with slow old x86 semantics. Think of it as giving them the benefit of sharing x86 development and testing :) We can then formalise the relaxed __ accessors to be more complete (ie. +/- byteswapping). That leaves a gulf between the extremely strongly ordered writel etc. and the extremely weakly ordered __writel etc. The current powerpc scheme is fine for a lot of drivers but your proposal would leave us no way to deliver it to them. Paul. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On Wednesday 11 June 2008 14:18, Paul Mackerras wrote: Nick Piggin writes: OK, I'm sitll not quite sure where this has ended up. I guess you are happy with x86 semantics as they are now. That is, all IO accesses are strongly ordered WRT one another and WRT cacheable memory (which includes keeping them within spinlocks), My understanding was that on x86, loads could pass stores in general, i.e. a later load could be performed before an earlier store. Yes, this is the one reordering allowed by the ISA on cacheable memory. WC memory is weaker, which Linus wants to allow exception for because one must explicitly ask for it. UC memory (which presumably is what we're talking about as IO access) I think is stronger in that it does not allow any reordering between one another or any cacheable access: AMD says this: c — A store (wp,wt,wb,uc,wc,wc+) may not pass a previous load (wp,wt,wb,uc,wc,wc+). f — A load (uc) does not pass a previous store (wp,wt,wb,uc,wc,wc+). g — A store (wp,wt,wb,uc) does not pass a previous store (wp,wt,wb,uc). i — A load (wp,wt,wb,wc,wc+) does not pass a previous store (uc). AMD does allow WC/WC+ to be weakly ordered WRT WC as well as UC, which Intel seemingly does not. But we're alrady treating WC as special. I can't actually find the definitive statement in the Intel manuals saying UC is strongly ordered also WRT WB. Linus? I guess that can't be true for uncached loads, but could a cacheable load be performed before an earlier uncached store? - as strong as x86. guaranteed not to break drivers that work on x86, but slower on some archs. To me, this is most pleasing. It is much much easier to notice something is going a little slower and to work out how to use weaker ordering there, than it is to debug some once-in-a-bluemoon breakage caused by just the right architecture, driver, etc. It totally frees up the driver writer from thinking about barriers, provided they get the locking right. I just wish we had even one actual example of things going wrong with the current rules we have on powerpc to motivate changing to this model. ~/usr/src/linux-2.6 git grep test_and_set_bit drivers/ | wc -l 506 How sure are you that none of those forms part of a cobbled-together locking scheme that hopes to constrain some IO access? ~/usr/src/linux-2.6 git grep test_and_set_bit drivers/ | grep while | wc -l 29 How about those? ~/usr/src/linux-2.6 git grep mutex_lock drivers/ | wc -l 3138 How sure are you that none of them is hoping to constrain IO operations within the lock? Also grep for down, down_write, write_lock, and maybe some others I've forgotten. And then forget completely about locking and imagine some of the open coded things you see around the place (or parts where drivers try to get creative and open code their own locking or try lockless things). Now that doesn't leave waker ordering architectures lumped with slow old x86 semantics. Think of it as giving them the benefit of sharing x86 development and testing :) We can then formalise the relaxed __ accessors to be more complete (ie. +/- byteswapping). That leaves a gulf between the extremely strongly ordered writel etc. and the extremely weakly ordered __writel etc. The current powerpc scheme is fine for a lot of drivers but your proposal would leave us no way to deliver it to them. But surely you have to audit the drivers anyway to ensure they are OK with the current powerpc scheme. In which case, once you have audited them and know they are safe, you can easily convert them to the even _faster_ __readl/__writel, and just add the appropriate barriers. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
Nick Piggin writes: I just wish we had even one actual example of things going wrong with the current rules we have on powerpc to motivate changing to this model. ~/usr/src/linux-2.6 git grep test_and_set_bit drivers/ | wc -l 506 How sure are you that none of those forms part of a cobbled-together locking scheme that hopes to constrain some IO access? My comment was precisely about the fact that this sort of argument is actually FUD. I want one example that is demonstrably wrong, not just a how sure are you. But surely you have to audit the drivers anyway to ensure they are OK with the current powerpc scheme. In which case, once you have audited them and know they are safe, you can easily convert them to the even _faster_ __readl/__writel, and just add the appropriate barriers. The trouble is that as code gets maintained, using __writel + explicit barriers is more fragile because some people will change the code, or add new code, without understanding the barriers. So whenever a driver gets converted to using __writel + barriers, we will end up having to watch every change that goes into it forever. Whereas with the current scheme there's a much smaller set of gotchas to watch out for, and the gotchas are things that already raise red flags, such as open-coded locking and any sort of clever lockless scheme. Paul. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
Jesse Barnes wrote: Now, in hindsight, using a PIO write set test flag approach in writeX/spin_unlock (ala powerpc) might have been a better approach, but iirc that never came up in the discussion, probably because we were focused on PCI posting and not uncached vs. cached ordering. Hi Jesse, I am going to take a stab at implementing this so we can see how much of an impact it will have. Cheers, Jes ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On Thu, 2008-06-05 at 10:40 +0200, Jes Sorensen wrote: Jesse Barnes wrote: Now, in hindsight, using a PIO write set test flag approach in writeX/spin_unlock (ala powerpc) might have been a better approach, but iirc that never came up in the discussion, probably because we were focused on PCI posting and not uncached vs. cached ordering. Hi Jesse, I am going to take a stab at implementing this so we can see how much of an impact it will have. Note that the powerpc implementation currently clears the flag on spin_lock and tests it on unlock. We are considering changing that to not touch the flag on spin_lock and just clear it whenever we do a sync (ie, on unlock, on explicit mmiowb, and possibly even on readl's where we happen to do sync's). Ben. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On Mon, 2 Jun 2008, Haavard Skinnemoen wrote: So what happened to the old idea of putting the accessor function pointers in the device/bus structure? Don't know. I think it sounds like overkill to replace a simple load or store with an indirect function call. Indeed. *Especially* as the driver in practice tends to always know which one it is statically. Yes, sometimes the same RTL may end up being used behind two differnt buses, and with some broken byte-conversion hardware in the middle, but that's pretty damn rare in the end. It happens, but it happens mostly with the odd broken kind of embedded setups where somebody took a standard part and then did something _strange_ to it - and in the process they may well have introduced other issues as well. I suspect you find this kind of thing mostly with things like stupid integrated IDE controllers, and things like byte order tends to be the _least_ of the issues (specialized register accesses with odd offsets etc will happen). So most of the time the byte order is simply decided by the hardware itself (and LE is the common one, due to ISA/PCI). Sometimes you can set it dynamically (at which point it's irrelevant - just pick the one you want). So I definitely argue against complex IO accessor functions with things like dynamic support for byte order depending on bus. 99.9% of all hardware simply do not need them, and the small percentage that might need it will need special drivers anyway, so they might as well do the complexity on their own rather than make everybody else care. See for example the input driver for the i8042 chip: not only do those things sometimes need native order (probably exactly because it's integrated on a chip over some special native bus, or just wired up to be big-endian on a BE platform by some moron), but you'll also find that they just need odd accesses anyway exactly because it's oddly wired up (eg it's some special serial protocol that emulates ISA accesses). So having some byte-order correcting function still wouldn't make any sense, because it's not just about byte order. Will that mean that you might occasionally have a normal driver and an odd driver that actually drives what is basically the same HW block, just wired up differently? Sure. But that's still a smaller price to pay than it would be to try to make everything be generic when there is no point to it. Linus ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On Tue, Jun 03, 2008 at 06:19:05PM +1000, Nick Piggin wrote: On Tuesday 03 June 2008 18:15, Jeremy Higdon wrote: On Tue, Jun 03, 2008 at 02:33:11PM +1000, Nick Piggin wrote: On Monday 02 June 2008 19:56, Jes Sorensen wrote: Would we be able to use Ben's trick of setting a per cpu flag in writel() then and checking that in spin unlock issuing the mmiowb() there if needed? Yes you could, but your writels would still not be strongly ordered within (or outside) spinlock regions, which is what Linus wants (and I kind of agree with). Yes they would be. Writes from the same CPU are always ordered. Writes from different CPUs are not, but that's only a concern if you protect They are not strongly ordered WRT writes to cacheable memory. If they were, then they would not leak out of spinlocks. No posted writes are. As I recall, the outX functions are not supposed to be posted, so the sn2 versions issue a mmiowb in the outX. But writeX has always been posted, or at least postable. I thought that was generally accepted. Normally, the only way for a device to see cacheable memory is via a DMA read, and you are guaranteed on sn2 that in the following: store of of A to location X mmio write to device device DMA read from location X that the device will see A. In the past, on some other archs, you'd have to flush the Dcache for that to work. Granted, if the compiler reorders the store and the mmio write, then you have a problem. jeremy ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On Tuesday 03 June 2008 18:15, Jeremy Higdon wrote: On Tue, Jun 03, 2008 at 02:33:11PM +1000, Nick Piggin wrote: On Monday 02 June 2008 19:56, Jes Sorensen wrote: Would we be able to use Ben's trick of setting a per cpu flag in writel() then and checking that in spin unlock issuing the mmiowb() there if needed? Yes you could, but your writels would still not be strongly ordered within (or outside) spinlock regions, which is what Linus wants (and I kind of agree with). Yes they would be. Writes from the same CPU are always ordered. Writes from different CPUs are not, but that's only a concern if you protect They are not strongly ordered WRT writes to cacheable memory. If they were, then they would not leak out of spinlocks. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
Scott Wood [EMAIL PROTECTED] wrote: On Mon, Jun 02, 2008 at 10:11:02AM +0200, Haavard Skinnemoen wrote: Geert Uytterhoeven [EMAIL PROTECTED] wrote: On Fri, 30 May 2008, Haavard Skinnemoen wrote: Maybe we need another interface that does not do byteswapping but provides stronger ordering guarantees? The byte swapping depends on the device/bus. Of course. But isn't it reasonable to assume that a device integrated on the same silicon as the CPU is connected to a somewhat sane bus which doesn't require any byte swapping? No, unfortunately. :-( Ok, I guess I was being naive. See the end of drivers/dma/fsldma.h. Likewise with Freescale's PCI host bridges; for some reason the bus itself being little endian led to the host bridge control registers also being little endian. Right. But still, isn't it better to handle it on a case-by-case basis in the drivers? In some cases, it's best to explicitly use a certain byte order, in others it's best to use whatever is native to the CPU. Haavard ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On Tue, 2008-06-03 at 16:11 +1000, Nick Piggin wrote: - readl is synchronous (ie, makes the CPU think the data was actually used before executing subsequent instructions, thus waits for the data to come back, for example to ensure that a read used to push out post buffers followed by a delay will indeed happen with the right delay). So your readl can pass an earlier cacheable store or earlier writel? I forgot to mention that all MMIO are ordered vs. each other and I do prevent readl from passing earlier cacheable stores too in my current implementation but I'n not 100% we want to guarantee that, unless we have stupid devices that trigger DMA's on reads with side effects.. anyway, it is guaranteed in the current case. Ben. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On Tuesday 03 June 2008 16:53, Paul Mackerras wrote: Nick Piggin writes: So your readl can pass an earlier cacheable store or earlier writel? No. It's quite gross at the moment, it has a sync before the access (i.e. a full mb()) and a twi; isync sequence after the access that stalls execution until the data comes back. OK. We don't provide meaningless ones like writel + cacheable store for example. (PCI posting would defeat it anyway). What do you mean by meaningless? Ordering of writel followed by a cacheable store is meaningful eg. for retaining io operations within locks. OK, you explicitly have some extra code for spin_unlock, but not for bit locks, mutexes, etc. It would make sense to have the default operations _very_ strongly ordered IMO, and then move drivers to be more relaxed when they are verified. It's meaningless in the sense that nothing guarantees that the writel has actually hit the device, even if we put a full mb() barrier in between the writel and the cacheable write. That would guarantee that the writel had got to the PCI host bridge, and couldn't be reordered with other accesses to the device, but not that the device had actually seen it. OK, but I think fits OK with our SMP ordering model for cacheable stores: no amount of barriers on CPU0 will guarantee that CPU1 has seen the store, you actually have to observe a causual effect of the store before you can say that. I don't mind adding code to the mutex unlock to do the same as spin_unlock, but I really don't want to have *two* sync instructions for every MMIO write. One is bad enough. So you can't provide iostore/store ordering without another sync after the writel store? I guess the problem with providing exceptions is that it makes it hard for people who absolutely don't know or care about ordering. I don't like having to think about it hmm, we can allow this type of reordering... oh, unless some silly device does X If we come up with a sane set of weakly ordered accessors (including io_*mb()), it will make it easier to go through the important drivers and improve them. We don't have to enforce the the new semantics strictly until then if they'll slow you down too much in the meantime. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On Tue, Jun 03, 2008 at 02:33:11PM +1000, Nick Piggin wrote: On Monday 02 June 2008 19:56, Jes Sorensen wrote: Jeremy Higdon wrote: We don't actually have that problem on the Altix. All writes issued by CPU X will be ordered with respect to each other. But writes by CPU X and CPU Y will not be, unless an mmiowb() is done by the original CPU before the second CPU writes. I.e. CPU X writel CPU X writel CPU X mmiowb CPU Y writel ... Note that this implies some sort of locking. Also note that if in the above, CPU Y did the mmiowb, that would not work. Hmmm, Then it's less bad than I thought - my apologies for the confusion. Would we be able to use Ben's trick of setting a per cpu flag in writel() then and checking that in spin unlock issuing the mmiowb() there if needed? Yes you could, but your writels would still not be strongly ordered within (or outside) spinlock regions, which is what Linus wants (and I kind of agree with). Yes they would be. Writes from the same CPU are always ordered. Writes from different CPUs are not, but that's only a concern if you protect writing via some sort of lock. If the lock release forces a barrier, that should take care of the problem. jeremy ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
Nick Piggin writes: So your readl can pass an earlier cacheable store or earlier writel? No. It's quite gross at the moment, it has a sync before the access (i.e. a full mb()) and a twi; isync sequence after the access that stalls execution until the data comes back. We don't provide meaningless ones like writel + cacheable store for example. (PCI posting would defeat it anyway). What do you mean by meaningless? Ordering of writel followed by a cacheable store is meaningful eg. for retaining io operations within locks. OK, you explicitly have some extra code for spin_unlock, but not for bit locks, mutexes, etc. It would make sense to have the default operations _very_ strongly ordered IMO, and then move drivers to be more relaxed when they are verified. It's meaningless in the sense that nothing guarantees that the writel has actually hit the device, even if we put a full mb() barrier in between the writel and the cacheable write. That would guarantee that the writel had got to the PCI host bridge, and couldn't be reordered with other accesses to the device, but not that the device had actually seen it. I don't mind adding code to the mutex unlock to do the same as spin_unlock, but I really don't want to have *two* sync instructions for every MMIO write. One is bad enough. Paul. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On Tue, 3 Jun 2008, Nick Piggin wrote: Linus: on x86, memory operations to wc and wc+ memory are not ordered with one another, or operations to other memory types (ie. load/load and store/store reordering is allowed). Also, as you know, store/load reordering is explicitly allowed as well, which covers all memory types. So perhaps it is not quite true to say readl/writel is strongly ordered by default even on x86. You would have to put in some mfence instructions in them to make it so. Well, you have to ask for WC/WC+ anyway, so it's immaterial. A driver that does that needs to be aware of it. IOW, it's a non-issue, imnsho. Linus ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On Tue, 3 Jun 2008, Trent Piepho wrote: On Tue, 3 Jun 2008, Linus Torvalds wrote: On Tue, 3 Jun 2008, Nick Piggin wrote: Linus: on x86, memory operations to wc and wc+ memory are not ordered with one another, or operations to other memory types (ie. load/load and store/store reordering is allowed). Also, as you know, store/load reordering is explicitly allowed as well, which covers all memory types. So perhaps it is not quite true to say readl/writel is strongly ordered by default even on x86. You would have to put in some mfence instructions in them to make it so. So on x86, these could be re-ordered? writel(START_OPERATION, CONTROL_REGISTER); status = readl(STATUS_REGISTER); With both registers in a WC+ area, yes. The write may be in the WC buffers until the WC buffers are flushed (short list: a fence, a serializing instruction, a read-write to uncached memory, or an interrupt. There are others, but those are the main ones). But if the status register is in uncached memory (which is the only *sane* thing to do), then it doesn't matter if the control register is in WC memory. Because the status register read is itself serializing with the WC buffer, it's actually fine. So this is used for putting things like ring queues in WC memory, and fill them up with writes, and get nice bursty write traffic with the CPU automatically buffering it up (think stdio.h on a really low level). And if you then have the command registers in UC memory or using IO port accesses, reading and writing to them will automatically serialize. Well, you have to ask for WC/WC+ anyway, so it's immaterial. A driver that does that needs to be aware of it. IOW, it's a non-issue, imnsho. You need to ask for coherent DMA memory too. Not on x86. And I don't care what anybody else says - x86 is *so* totally dominant, that other architectures have to live with the fact that 99.9% of all drivers are written for and tested on x86. As a result, anything else is theory. Are some drivers good and are careful? Yes. Are most? No, and even if they were, people making changes would mostly test them on x86. Architectures should strive to basically act as closely as possible to x86 semantics in order to not get nasty surprises. And architectures that can't do that well enough *will* often find that they need to fix drivers, and only a small small subset of all kernel drivers will generally work out-of-the-box. If you're ready to do that, your architecture can do anything it damn well pleases ;) Linus ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On Tue, 3 Jun 2008, Linus Torvalds wrote: On Tue, 3 Jun 2008, Nick Piggin wrote: Linus: on x86, memory operations to wc and wc+ memory are not ordered with one another, or operations to other memory types (ie. load/load and store/store reordering is allowed). Also, as you know, store/load reordering is explicitly allowed as well, which covers all memory types. So perhaps it is not quite true to say readl/writel is strongly ordered by default even on x86. You would have to put in some mfence instructions in them to make it so. So on x86, these could be re-ordered? writel(START_OPERATION, CONTROL_REGISTER); status = readl(STATUS_REGISTER); Well, you have to ask for WC/WC+ anyway, so it's immaterial. A driver that does that needs to be aware of it. IOW, it's a non-issue, imnsho. You need to ask for coherent DMA memory too. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On Tue, Jun 03, 2008 at 11:47:00AM -0700, Trent Piepho wrote: On Tue, 3 Jun 2008, Linus Torvalds wrote: On Tue, 3 Jun 2008, Nick Piggin wrote: Linus: on x86, memory operations to wc and wc+ memory are not ordered with one another, or operations to other memory types (ie. load/load and store/store reordering is allowed). Also, as you know, store/load reordering is explicitly allowed as well, which covers all memory types. So perhaps it is not quite true to say readl/writel is strongly ordered by default even on x86. You would have to put in some mfence instructions in them to make it so. So on x86, these could be re-ordered? writel(START_OPERATION, CONTROL_REGISTER); status = readl(STATUS_REGISTER); You wouldn't ask for write-combining memory mapping for control or status registers. Well, you have to ask for WC/WC+ anyway, so it's immaterial. A driver that does that needs to be aware of it. IOW, it's a non-issue, imnsho. You need to ask for coherent DMA memory too. Different case. Coherent DMA memory is *host* memory that the *device* accesses. We're talking about *device* memory that the *cpu* accesses. -- Intel are signing my paycheques ... these opinions are still mine Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On Tue, 3 Jun 2008, Matthew Wilcox wrote: On Tue, Jun 03, 2008 at 11:47:00AM -0700, Trent Piepho wrote: On Tue, 3 Jun 2008, Linus Torvalds wrote: On Tue, 3 Jun 2008, Nick Piggin wrote: Linus: on x86, memory operations to wc and wc+ memory are not ordered with one another, or operations to other memory types (ie. load/load and store/store reordering is allowed). Also, as you know, store/load reordering is explicitly allowed as well, which covers all memory types. So perhaps it is not quite true to say readl/writel is strongly ordered by default even on x86. You would have to put in some mfence instructions in them to make it so. So on x86, these could be re-ordered? writel(START_OPERATION, CONTROL_REGISTER); status = readl(STATUS_REGISTER); You wouldn't ask for write-combining memory mapping for control or status registers. But Nick said, store/load reordering is explicitly allowed as well, which covers *all* memory types. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On Tue, 3 Jun 2008, Nick Piggin wrote: On Monday 02 June 2008 17:24, Russell King wrote: So, can the semantics of what's expected from these IO accessor functions be documented somewhere. Please? Before this thread gets lost in the depths of time? This whole thread also ties in with my posts about mmiowb (which IMO should go away). readl/writel: strongly ordered wrt one another and other stores to cacheable RAM, byteswapping __readl/__writel: not ordered (needs mb/rmb/wmb to order with other readl/writel and cacheable operations, or io_*mb to order with one another) raw_readl/raw_writel: strongly ordered, no byteswapping __raw_readl/__raw_writel: not ordered, no byteswapping Byte-swapping vs not byte-swapping is not usually what the programmer wants. Usually your device's registers are defined as being big-endian or little-endian and you want whatever is needed to give you that. I believe that on some archs that can be either byte order, some built-in devices will change their registers to match, and so you want native endian or no swapping for these. Though that's definitely in the minority. An accessors that always byte-swaps regardless of the endianness of the host is never something I've seen a driver want. IOW, there are four ways one can defined endianness/swapping: 1) Little-endian 2) Big-endian 3) Native-endian aka non-byte-swapping 4) Foreign-endian aka byte-swapping 1 and 2 are by far the most used. Some code wants 3. No one wants 4. Yet our API is providing 3 4, the two which are the least useful. Is it enough to provide only all or none for ordering strictness? For instance on powerpc, one can get a speedup by dropping strict ordering for IO vs cacheable memory, but still keeping ordering for IO vs IO and IO vs locks. This is much easier to program for than no ordering at all. In fact, if one doesn't use coherent DMA, it's basically the same as fully strict ordering. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On Tue, Jun 03, 2008 at 12:43:21PM -0700, Trent Piepho wrote: IOW, there are four ways one can defined endianness/swapping: 1) Little-endian 2) Big-endian 3) Native-endian aka non-byte-swapping 4) Foreign-endian aka byte-swapping 1 and 2 are by far the most used. Some code wants 3. No one wants 4. Yet our API is providing 3 4, the two which are the least useful. You've fundamentally misunderstood. readX/writeX and __readX/__writeX provide little-endian access. __raw_readX provide native-endian. If you want 2 or 4, define your own accessors. Some architectures define other accessors (eg gsc_readX on parisc is native (big) endian, and works on physical addresses that haven't been ioremapped. sbus_readX on sparc64 also seems to be native (big) endian). Is it enough to provide only all or none for ordering strictness? For instance on powerpc, one can get a speedup by dropping strict ordering for IO vs cacheable memory, but still keeping ordering for IO vs IO and IO vs locks. This is much easier to program for than no ordering at all. In fact, if one doesn't use coherent DMA, it's basically the same as fully strict ordering. I don't understand why you keep talking about DMA. Are you talking about ordering between readX() and DMA? PCI proides those guarantees. -- Intel are signing my paycheques ... these opinions are still mine Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On Tue, Jun 03, 2008 at 12:57:56PM -0700, Trent Piepho wrote: On Tue, 3 Jun 2008, Matthew Wilcox wrote: On Tue, Jun 03, 2008 at 11:47:00AM -0700, Trent Piepho wrote: On Tue, 3 Jun 2008, Linus Torvalds wrote: On Tue, 3 Jun 2008, Nick Piggin wrote: Linus: on x86, memory operations to wc and wc+ memory are not ordered with one another, or operations to other memory types (ie. load/load and store/store reordering is allowed). Also, as you know, store/load reordering is explicitly allowed as well, which covers all memory types. So perhaps it is not quite true to say readl/writel is strongly ordered by default even on x86. You would have to put in some mfence instructions in them to make it so. So on x86, these could be re-ordered? writel(START_OPERATION, CONTROL_REGISTER); status = readl(STATUS_REGISTER); You wouldn't ask for write-combining memory mapping for control or status registers. But Nick said, store/load reordering is explicitly allowed as well, which covers *all* memory types. Then Nick is confused. PCI only defines one way to flush posted writes to a device -- doing a read from it. There's no way that reads can be allowed to pass writes (unless you've asked for it, like with write combining). -- Intel are signing my paycheques ... these opinions are still mine Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On Tue, 3 Jun 2008, Matthew Wilcox wrote: On Tue, Jun 03, 2008 at 12:43:21PM -0700, Trent Piepho wrote: IOW, there are four ways one can defined endianness/swapping: 1) Little-endian 2) Big-endian 3) Native-endian aka non-byte-swapping 4) Foreign-endian aka byte-swapping 1 and 2 are by far the most used. Some code wants 3. No one wants 4. Yet our API is providing 3 4, the two which are the least useful. You've fundamentally misunderstood. readX/writeX and __readX/__writeX provide little-endian access. __raw_readX provide native-endian. If you want 2 or 4, define your own accessors. Some architectures define other accessors (eg gsc_readX on parisc is native (big) endian, and How about providing 1 and 2, and if you want 3 or 4 define your own accessors? Is it enough to provide only all or none for ordering strictness? For instance on powerpc, one can get a speedup by dropping strict ordering for IO vs cacheable memory, but still keeping ordering for IO vs IO and IO vs locks. This is much easier to program for than no ordering at all. In fact, if one doesn't use coherent DMA, it's basically the same as fully strict ordering. I don't understand why you keep talking about DMA. Are you talking about ordering between readX() and DMA? PCI proides those guarantees. I guess you haven't been reading the whole thread. The reason it started was because gcc can re-order powerpc (and everyone else's too) IO accesses vs accesses to cachable memory (but not spin-locks), which ends up only being a problem with coherent DMA. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On Tue, 3 Jun 2008, Matthew Wilcox wrote: On Tue, Jun 03, 2008 at 12:57:56PM -0700, Trent Piepho wrote: On Tue, 3 Jun 2008, Matthew Wilcox wrote: On Tue, Jun 03, 2008 at 11:47:00AM -0700, Trent Piepho wrote: On Tue, 3 Jun 2008, Linus Torvalds wrote: On Tue, 3 Jun 2008, Nick Piggin wrote: Linus: on x86, memory operations to wc and wc+ memory are not ordered with one another, or operations to other memory types (ie. load/load and store/store reordering is allowed). Also, as you know, store/load reordering is explicitly allowed as well, which covers all memory types. So perhaps it is not quite true to say readl/writel is strongly ordered by default even on x86. You would have to put in some mfence instructions in them to make it so. So on x86, these could be re-ordered? writel(START_OPERATION, CONTROL_REGISTER); status = readl(STATUS_REGISTER); You wouldn't ask for write-combining memory mapping for control or status registers. But Nick said, store/load reordering is explicitly allowed as well, which covers *all* memory types. Then Nick is confused. PCI only defines one way to flush posted writes to a device -- doing a read from it. There's no way that reads can be allowed to pass writes (unless you've asked for it, like with write combining). But that requirement is for the PCI bridge, isn't it? It doesn't matter if the bridge will flush all posted writes before allowing a read if the CPU decides to give the bridge the read before the write. A powerpc CPU will certainly do this if you don't take any steps like telling it the memory is uncachable and guarded. I didn't think it was allowed on x86 (except with WC), but Nick seemed to say it was. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On Tue, 2008-06-03 at 12:43 -0700, Trent Piepho wrote: Byte-swapping vs not byte-swapping is not usually what the programmer wants. Usually your device's registers are defined as being big-endian or little-endian and you want whatever is needed to give you that. Yes, which is why I (and some other archs) have writel_be/readl_be. The standard writel/readl being LE. However, the raw variants are defined to be native endian, which is of some use to -some- archs apparently where they have SoC device whose endianness follow the core. I believe that on some archs that can be either byte order, some built-in devices will change their registers to match, and so you want native endian or no swapping for these. Though that's definitely in the minority. An accessors that always byte-swaps regardless of the endianness of the host is never something I've seen a driver want. IOW, there are four ways one can defined endianness/swapping: 1) Little-endian 2) Big-endian 3) Native-endian aka non-byte-swapping 4) Foreign-endian aka byte-swapping 1 and 2 are by far the most used. Some code wants 3. No one wants 4. Yet our API is providing 3 4, the two which are the least useful. No, we don't provide 4, it was something unclear with nick. We provide 1. (writel/readl and __variants), some archs provide 2 (writel_be/readl_be, tho I don't have __variants, I suppose I could), and everybody provides 3. though in some cases (like us) only in the form of __variants (ie, non ordered, like __raw_readl/__raw_writel). Nick's proposal is to plug those gaps, though it's, I believe, missing the _be variants. Is it enough to provide only all or none for ordering strictness? For instance on powerpc, one can get a speedup by dropping strict ordering for IO vs cacheable memory, but still keeping ordering for IO vs IO and IO vs locks. This is much easier to program for than no ordering at all. In fact, if one doesn't use coherent DMA, it's basically the same as fully strict ordering. Ben. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On Wednesday 04 June 2008 07:58, Trent Piepho wrote: On Tue, 3 Jun 2008, Matthew Wilcox wrote: On Tue, Jun 03, 2008 at 12:57:56PM -0700, Trent Piepho wrote: On Tue, 3 Jun 2008, Matthew Wilcox wrote: On Tue, Jun 03, 2008 at 11:47:00AM -0700, Trent Piepho wrote: On Tue, 3 Jun 2008, Linus Torvalds wrote: On Tue, 3 Jun 2008, Nick Piggin wrote: Linus: on x86, memory operations to wc and wc+ memory are not ordered with one another, or operations to other memory types (ie. load/load and store/store reordering is allowed). Also, as you know, store/load reordering is explicitly allowed as well, which covers all memory types. So perhaps it is not quite true to say readl/writel is strongly ordered by default even on x86. You would have to put in some mfence instructions in them to make it so. So on x86, these could be re-ordered? writel(START_OPERATION, CONTROL_REGISTER); status = readl(STATUS_REGISTER); You wouldn't ask for write-combining memory mapping for control or status registers. But Nick said, store/load reordering is explicitly allowed as well, which covers *all* memory types. Then Nick is confused. PCI only defines one way to flush posted writes to a device -- doing a read from it. There's no way that reads can be allowed to pass writes (unless you've asked for it, like with write combining). But that requirement is for the PCI bridge, isn't it? It doesn't matter if the bridge will flush all posted writes before allowing a read if the CPU decides to give the bridge the read before the write. A powerpc CPU will certainly do this if you don't take any steps like telling it the memory is uncachable and guarded. I didn't think it was allowed on x86 (except with WC), but Nick seemed to say it was. Ah sorry, not UC, I was confused. UC I think actually is strongly ordered WRT other UC and also cacheable operations. WC is weakly ordered, anything goes. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On Wednesday 04 June 2008 05:07, Linus Torvalds wrote: On Tue, 3 Jun 2008, Trent Piepho wrote: On Tue, 3 Jun 2008, Linus Torvalds wrote: On Tue, 3 Jun 2008, Nick Piggin wrote: Linus: on x86, memory operations to wc and wc+ memory are not ordered with one another, or operations to other memory types (ie. load/load and store/store reordering is allowed). Also, as you know, store/load reordering is explicitly allowed as well, which covers all memory types. So perhaps it is not quite true to say readl/writel is strongly ordered by default even on x86. You would have to put in some mfence instructions in them to make it so. So on x86, these could be re-ordered? writel(START_OPERATION, CONTROL_REGISTER); status = readl(STATUS_REGISTER); With both registers in a WC+ area, yes. The write may be in the WC buffers until the WC buffers are flushed (short list: a fence, a serializing instruction, a read-write to uncached memory, or an interrupt. There are others, but those are the main ones). But if the status register is in uncached memory (which is the only *sane* thing to do), then it doesn't matter if the control register is in WC memory. Because the status register read is itself serializing with the WC buffer, it's actually fine. Actually, according to the document I am looking at (the AMD one), a UC store may pass a previous WC store. So you could have some code that writes to some WC memory on the card, and then stores to an UC control register to start up the operation on that memory, couldn't you? Those can go out of order. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On Wednesday 04 June 2008 00:47, Linus Torvalds wrote: On Tue, 3 Jun 2008, Nick Piggin wrote: Linus: on x86, memory operations to wc and wc+ memory are not ordered with one another, or operations to other memory types (ie. load/load and store/store reordering is allowed). Also, as you know, store/load reordering is explicitly allowed as well, which covers all memory types. So perhaps it is not quite true to say readl/writel is strongly ordered by default even on x86. You would have to put in some mfence instructions in them to make it so. Well, you have to ask for WC/WC+ anyway, so it's immaterial. A driver that does that needs to be aware of it. IOW, it's a non-issue, imnsho. Ah, yes UC is strongly ordered WRT all others *except* WC/WC+. But WC memory is not an x86 specific thing right, so do we need some accessors for WC memory? Or can we just throw that in the weakly ordered pile, and ensure mb/rmb/wmb does the right thing for them. And you want readl/writel to be strongly ordered like x86 on all architectures, no exceptions? This will slow some things down, but if we then also provide explicitly weakly ordered instructions (and add io_mb/io_rmb/io_wmb) then at least it gives the framework for drivers to be written to run on those architectures. The other thing we could do is mandate only that readl/writel will be ordered WRT one another, *and* with spinlocks, but otherwise not with cacheable RAM... ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On Wednesday 04 June 2008 07:44, Trent Piepho wrote: On Tue, 3 Jun 2008, Matthew Wilcox wrote: I don't understand why you keep talking about DMA. Are you talking about ordering between readX() and DMA? PCI proides those guarantees. I guess you haven't been reading the whole thread. The reason it started was because gcc can re-order powerpc (and everyone else's too) IO accesses vs accesses to cachable memory (but not spin-locks), which ends up only being a problem with coherent DMA. I don't think it is only a problem with coherent DMA. CPU0 CPU1 mutex_lock(mutex); writel(something, DATA_REG); writel(GO, CTRL_REG); started = 1; mutex_unlock(mutex); mutex_lock(mutex); if (started) /* oops, this can reach device before GO */ writel(STOP, CTRL_REG); ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On Wed, 4 Jun 2008, Nick Piggin wrote: Actually, according to the document I am looking at (the AMD one), a UC store may pass a previous WC store. Hmm. Intel arch manyal, Vol 3, 10.3 (page 10-7 in my version): If the WC bufer is partially filled, the writes may be delayed until the next ocurrence of a serializing event; such as, an SFENCE or MFENCE instruction, CPUID execution, a read or write to uncached memory, ... Any typos mine. Anyway, Intel certainly seems to document that WC memory is serialized by any access to UC memory. But yes, I can well imagine that AMD is different, and I also heartily would recommend rather being safe than sorry. Putting an explicit memory barrier in between those accesses when you know it might make a difference is just a good idea. But basically, as far as I know the thing was designed to be invisible to old software: that is the whole idea behind WC memory. So the design was certainly intended to be that you can generally mark a framebuffer-like structure WC without any software _ever_ caring, as long as you keep all control ports in UC memory. Of course, because burst writes from the WC buffer are iso/i much more efficient on the PCI bus than dribbling them out one write at a time, it didn't take long before all the graphics cards etc wanted to ialso/i mark their command queues as WC memory, so that you could burst out the commands to the ring buffers as fast as possible. So now you have both your frame buffer *and* your command buffers mapped WC, and now ordering really has to be ensured in software if you access both. [ And then there are the crazy people who mark *main memory* as WC, because they don't want to pollute the cache with all the data, and then you have the issue of cache coherency etc crap. Which only gets worse with SMP, especially if one processor thinks it has part of memory exclusively cached, and another one - or even the same one, through another aliasign address - ignores the cache protocol. And you now get unhappy CPU's that think that there is a bug in the cache protocol and they get machine check faults. So what started out as a we can do accesses to the frame buffer more efficiently without anybody ever even having to know or care has turned into a whole nightmare of people using it for other things, and then you very much _do_ have to care! ] And it doesn't surprise me if AMD then didn't get exactly the same rules. Oh, well. Linus ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On Tue, May 27, 2008 at 02:55:56PM -0700, Linus Torvalds wrote: On Wed, 28 May 2008, Benjamin Herrenschmidt wrote: A problem with __raw_ though is that they -also- don't do byteswap, Well, that's why there is __readl() and __raw_readl(), no? Neither does ordering, and __raw_readl() doesn't do byte-swap. This is where the lack of documentation causes arch maintainers a big problem. None of the semantics of __raw_readl vs __readl vs readl are documented _anywhere_. If you look at x86 as a template, there's no comments there about what the different variants are supposed to do or not do. So it's left up to arch maintainers to literally guess what should be done. That's precisely what I did when I implemented ARMs __raw_readl and friends. I guessed. And it was only after I read a few mails on lkml which suggested that readl and friends should always be LE that ARMs readl implementation started to use le32_to_cpu()... before that it had always been native endian. Again, lack of documentation... So, can the semantics of what's expected from these IO accessor functions be documented somewhere. Please? Before this thread gets lost in the depths of time? -- Russell King Linux kernel2.6 ARM Linux - http://www.arm.linux.org.uk/ maintainer of: ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
Geert Uytterhoeven [EMAIL PROTECTED] wrote: On Fri, 30 May 2008, Haavard Skinnemoen wrote: Maybe we need another interface that does not do byteswapping but provides stronger ordering guarantees? The byte swapping depends on the device/bus. Of course. But isn't it reasonable to assume that a device integrated on the same silicon as the CPU is connected to a somewhat sane bus which doesn't require any byte swapping? So what happened to the old idea of putting the accessor function pointers in the device/bus structure? Don't know. I think it sounds like overkill to replace a simple load or store with an indirect function call. Haavard ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
Jeremy Higdon wrote: We don't actually have that problem on the Altix. All writes issued by CPU X will be ordered with respect to each other. But writes by CPU X and CPU Y will not be, unless an mmiowb() is done by the original CPU before the second CPU writes. I.e. CPU X writel CPU X writel CPU X mmiowb CPU Y writel ... Note that this implies some sort of locking. Also note that if in the above, CPU Y did the mmiowb, that would not work. Hmmm, Then it's less bad than I thought - my apologies for the confusion. Would we be able to use Ben's trick of setting a per cpu flag in writel() then and checking that in spin unlock issuing the mmiowb() there if needed? Cheers, Jes ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
* Linus Torvalds [EMAIL PROTECTED] wrote: Here's a UNTESTED patch for x86 that may or may not compile and work, and which serializes (on a compiler level) the IO accesses against regular memory accesses. Ok, so it at least boots on x86-32. Thus probably on x86-64 too (since the code is now shared). I didn't look at whether it generates much bigger code due to the potential extra serialization, but some of the code generation I looked at looked fine. IOW, it doesn't at least create any _obviously_ worse code, and it should be arguably safer than assuming the compiler does volatile accesses the way we want it to. ok, to pursue this topic of making readl*/writel*() more robust i picked up your patch into -tip and created a new topic branch for it: tip/x86/mmio. The patch passed initial light testing in -tip (~30 successful random self-builds and bootups on various mixed 32-bit/64-bit boxes) but it's still v2.6.27 material IMO. Failures in this area are subtle so there's no good way to tell whether it works as intended - we need wider testing. I've also added the tip/x86/mmio topic to tip/auto-x86-next rules as well so these changes will be picked up by tomorrow's linux-next tree as well, and by the next -mm iteration. Ingo ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
Pavel Machek wrote: Still better than changing semantics of writel for _all_ drivers. If you are really sure driver does not depend on writel order, it is just a sed/// command, so I don't see any big code maintenance issues... This isn't changing the semantics for all drivers, it means it leaves it up to us to fix the drivers that hit the problem. Whereas the other way round we'd have to make some fairly big modifications to a large number of drivers, which the driver maintainers are likely to be reluctant to accept and maintain. It will be a constant chase to keep that in order. Cheers Jes ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On Mon, Jun 02, 2008 at 10:11:02AM +0200, Haavard Skinnemoen wrote: Geert Uytterhoeven [EMAIL PROTECTED] wrote: On Fri, 30 May 2008, Haavard Skinnemoen wrote: Maybe we need another interface that does not do byteswapping but provides stronger ordering guarantees? The byte swapping depends on the device/bus. Of course. But isn't it reasonable to assume that a device integrated on the same silicon as the CPU is connected to a somewhat sane bus which doesn't require any byte swapping? No, unfortunately. :-( See the end of drivers/dma/fsldma.h. Likewise with Freescale's PCI host bridges; for some reason the bus itself being little endian led to the host bridge control registers also being little endian. -Scott ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On Mon, Jun 02, 2008 at 11:56:39AM +0200, Jes Sorensen wrote: Jeremy Higdon wrote: We don't actually have that problem on the Altix. All writes issued by CPU X will be ordered with respect to each other. But writes by CPU X and CPU Y will not be, unless an mmiowb() is done by the original CPU before the second CPU writes. I.e. CPU X writel CPU X writel CPU X mmiowb CPU Y writel ... Note that this implies some sort of locking. Also note that if in the above, CPU Y did the mmiowb, that would not work. Hmmm, Then it's less bad than I thought - my apologies for the confusion. Would we be able to use Ben's trick of setting a per cpu flag in writel() then and checking that in spin unlock issuing the mmiowb() there if needed? Yes, that should work fine. You will get more mmiowb's than you need, since some drivers, such as Fusion, don't need them. On the Origins (older SGI MIPS-based Numa), the 'sync' instruction had the same effect as mmiowb() with respect to mmio write ordering, and it was issued unconditionally in the spin unlock. It was cheaper than mmiowb, however. If it matters, we could invent and use writel_relaxed() to get performance back in drivers we care about jeremy ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On Mon, 2008-06-02 at 12:36 +0200, Ingo Molnar wrote: The patch passed initial light testing in -tip (~30 successful random self-builds and bootups on various mixed 32-bit/64-bit boxes) but it's still v2.6.27 material IMO. I think adding the memory clobber should be .26 and even -stable. We know for sure newer gcc will re-order things, we know it for sure some drivers will break in subtle ways because of that, ad we know as well that sticking memory clobber in there will fix that breakage we know about ... So I don't see the point of waiting. Ben. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On Monday 02 June 2008 19:56, Jes Sorensen wrote: Jeremy Higdon wrote: We don't actually have that problem on the Altix. All writes issued by CPU X will be ordered with respect to each other. But writes by CPU X and CPU Y will not be, unless an mmiowb() is done by the original CPU before the second CPU writes. I.e. CPU X writel CPU X writel CPU X mmiowb CPU Y writel ... Note that this implies some sort of locking. Also note that if in the above, CPU Y did the mmiowb, that would not work. Hmmm, Then it's less bad than I thought - my apologies for the confusion. Would we be able to use Ben's trick of setting a per cpu flag in writel() then and checking that in spin unlock issuing the mmiowb() there if needed? Yes you could, but your writels would still not be strongly ordered within (or outside) spinlock regions, which is what Linus wants (and I kind of agree with). This comes back to my posting about mmiowb and io_*mb barriers etc. Despite what you say, what you've done really _does_ change the semantics of wmb() for all drivers. It is a really sad situation we've got ourselves into somehow, AFAIKS in the hope of trying to save ourselves a tiny bit of work upfront :( (this is not just the sgi folk with mmiowb I'm talking about, but the whole random undefinedness of ordering and io barriers). The right way to make any change is never to weaken the postcondition of an existing interface *unless* you are willing to audit the entire tree and fix it. Impossible for drivers, so the correct thing to do is introduce a new interface, and move things over at an easier pace. Not rocket science. The argument that Altix only uses a few drivers so this way we can just fix these up rather than make big modifications to large numbers of drivers is bogus. It is far worse even for Altix if you make incompatible changes, because you first *break* every driver on your platform, then you have to audit and fix them. If you make compatible changes, then you have to do exactly the same audits to move them over to the new API, but you go from slower-faster rather than broken-non broken. As a bonus, you haven't got random broken stuff all over the tree that you forgot to audit. I don't know how there is still so much debate about this :( I have a proposal: I am a neutral party here, not being an arch maintainer, so I'll take input and write up a backward compatible API specification and force everybody to conform to it ;) ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
Hi! Though it's my understanding that at least ia64 does require the explicit barriers anyway, so we are still in a dodgy situation here where it's not clear what drivers should do and we end up with possibly excessive barriers on powerpc where I end up with both the wmb/rmb/mb that were added for ia64 -and- the ones I have in readl/writel to make them look synchronous... Not nice. ia64 is a disaster with a slightly different ordering problem -- the mmiowb() issue. I know Ben knows far too much about this, but for big SGI boxes, you sometimes need mmiowb() to avoid problems with driver code that does totally sane stuff like spin_lock(mmio_lock); writel(val1, reg1); writel(val2, reg2); spin_unlock(mmio_lock); If that snippet is called on two CPUs at the same time, then the device might see a sequence like CPU1 -- write reg1 CPU2 -- write reg1 CPU1 -- write reg2 CPU2 -- write reg2 in spite of the fact that everything is totally ordered on the CPUs by the spin lock. The reason this is such a disaster is because the code looks right, makes sense, and works fine on 99.99% of all systems out there. So I would bet that 99% of our drivers have missing mmiowb() bugs -- no one has plugged the hardware into an Altix box and cared enough to stress test it. Yes, ia64 is a disaster, and needs its spinlock implementation fixed :-). Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
Hi! The only way to guarantee ordering in the above setup, is to either make writel() fully ordered or adding the mmiowb()'s inbetween the two writel's. On Altix you have to go and read from the PCI brige to ensure all writes to it have been flushed, which is also what mmiowb() is doing. If writel() was to guarantee this ordering, it would make every writel() call extremely expensive :-( Interesting. I've always been taught by ia64 people that mmiowb() was intended to be used solely between writel() and spin_unlock(). I think in the above case, you really should make writel() ordered. Anything else is asking for trouble, for the exact same reasons that I made it fully ordered on powerpc at least vs. previous stores. I only kept it relaxed vs. subsequent cacheable stores (ie, spin_unlock), for which I use the trick mentioned before. Hmmm I hope I didn't mess up the description of this and added to the confusion. The net result of that would be to kill performance completely, I really don't like that idea Having each writel() go out and poll the PCI bridge is going to make every driver used on Altix slow as a dog. In addition it's still not going to solve the problem for userland mapped stuff such as infinibug. Yes, this has some cost (can be fairly significant on powerpc too) but I think it's a very basic assumption from drivers that consecutive writel's, especially issued by the same CPU, will get to the device in order. In this case the cost is more than just significant, it's pretty crippling. If this is a performance problem, then provide relaxed variants and use them in selected drivers. We'd have to make major changes to drivers like e1000, tg3, mptsas, the qla2/3/4xxx and a bunch of others to get performance back. I really think the code maintenance issue there will get far worse than what we have today :( Still better than changing semantics of writel for _all_ drivers. If you are really sure driver does not depend on writel order, it is just a sed/// command, so I don't see any big code maintenance issues... Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On Thu, May 29, 2008 at 10:47:18AM -0400, Jes Sorensen wrote: Thats not going to solve the problem on Altix. On Altix the issue is that there can be multiple paths through the NUMA fabric from cpuX to PCI bridge Y. Consider this uber-cooltm ascii art - NR is my abbrevation for NUMA router: --- --- |cpu X| |cpu Y| --- --- | \ /| |\/ | |/\ | | / \| - -- |NR 1| |NR 2| -- -- \ / \ / --- | PCI | --- The problem is that your two writel's, despite being both issued on cpu X, due to the spin lock, in your example, can end up with the first one going through NR 1 and the second one going through NR 2. If there's contention on NR 1, the write going via NR 2 may hit the PCI bridge prior to the one going via NR 1. We don't actually have that problem on the Altix. All writes issued by CPU X will be ordered with respect to each other. But writes by CPU X and CPU Y will not be, unless an mmiowb() is done by the original CPU before the second CPU writes. I.e. CPU X writel CPU X writel CPU X mmiowb CPU Y writel ... Note that this implies some sort of locking. Also note that if in the above, CPU Y did the mmiowb, that would not work. jeremy ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On Fri, May 30, 2008 at 10:21:00AM -0700, Jesse Barnes wrote: On Friday, May 30, 2008 2:36 am Jes Sorensen wrote: James Bottomley wrote: The only way to guarantee ordering in the above setup, is to either make writel() fully ordered or adding the mmiowb()'s inbetween the two writel's. On Altix you have to go and read from the PCI brige to ensure all writes to it have been flushed, which is also what mmiowb() is doing. If writel() was to guarantee this ordering, it would make every writel() call extremely expensive :-( So if a read from the bridge achieves the same effect, can't we just put one after the writes within the spinlock (an unrelaxed one). That way A relaxed readX would be sufficient. It's the next lowest cost way (after mmiowb) of ensuring write ordering between CPUs. Regular readX is the most expensive method (well, we could probably come up with something worse, but we'd have to work at it :). this whole sequence will look like a well understood PCI posting flush rather than have to muck around with little understood (at least by most driver writers) io barriers? Hmmm, I think mmiowb() does some sort of status read from the bridge, I am not sure if it's enough to just do a regular readl(). I'm adding Jeremy to the list, he should know for sure. I think a read from the target host bridge is enough. What mmiowb() does though is to read a *local* host bridge register, which contains a count of the number of PIO ops still in flight on their way to their target bridge. When it reaches 0, all PIOs have arrived at the target host bridge (they still may be bufferd), so ordering is guaranteed. Note that is the main advantage over a read. There is no round trip across the NUMA fabric. jeremy ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On Fri, 30 May 2008 11:13:23 +1000 Benjamin Herrenschmidt [EMAIL PROTECTED] wrote: Currently, this is the only interface I know that can do native-endian accesses, so if you take it away, I'm gonna need an alternative interface that doesn't do byteswapping. Are you aware that these also don't provide any ordering guarantee ? Yes, but I am not aware of any alternative. I think the drivers I've written have the necessary barriers (or dma ops with implicit barriers) that they don't actually depend on any DMA vs. MMIO ordering guarantees. I hope MMIO vs. MMIO ordering is guaranteed though? Haavard ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On Fri, 2008-05-30 at 08:07 +0200, Haavard Skinnemoen wrote: On Fri, 30 May 2008 11:13:23 +1000 Benjamin Herrenschmidt [EMAIL PROTECTED] wrote: Currently, this is the only interface I know that can do native-endian accesses, so if you take it away, I'm gonna need an alternative interface that doesn't do byteswapping. Are you aware that these also don't provide any ordering guarantee ? Yes, but I am not aware of any alternative. I think the drivers I've written have the necessary barriers (or dma ops with implicit barriers) that they don't actually depend on any DMA vs. MMIO ordering guarantees. I hope MMIO vs. MMIO ordering is guaranteed though? Only to the same address I'd say. Ben. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On Fri, 30 May 2008 17:24:27 +1000 Benjamin Herrenschmidt [EMAIL PROTECTED] wrote: On Fri, 2008-05-30 at 08:07 +0200, Haavard Skinnemoen wrote: I think the drivers I've written have the necessary barriers (or dma ops with implicit barriers) that they don't actually depend on any DMA vs. MMIO ordering guarantees. I hope MMIO vs. MMIO ordering is guaranteed though? Only to the same address I'd say. Right, I sort of suspected that. Now, I'm pretty sure the architectures that can actually run those drivers (ARM9 and AVR32 AP7) provide stronger guarantees than that, and I suspect the same is true on most other embedded architectures that use __raw_* in their drivers. So I don't think adding barriers is the right thing to do because they won't do anything useful in practice, so it's hard to tell whether they are used correctly. And it will hurt performance at least on AVR32 since wmb() evaluates to a sync instruction, which flushes the write buffer to RAM. Since MMIO writes are unbuffered, that's pure overhead. Maybe we need another interface that does not do byteswapping but provides stronger ordering guarantees? Haavard ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On Fri, 30 May 2008, Haavard Skinnemoen wrote: Maybe we need another interface that does not do byteswapping but provides stronger ordering guarantees? The byte swapping depends on the device/bus. So what happened to the old idea of putting the accessor function pointers in the device/bus structure? Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [EMAIL PROTECTED] In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say programmer or something like that. -- Linus Torvalds ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
James Bottomley wrote: The only way to guarantee ordering in the above setup, is to either make writel() fully ordered or adding the mmiowb()'s inbetween the two writel's. On Altix you have to go and read from the PCI brige to ensure all writes to it have been flushed, which is also what mmiowb() is doing. If writel() was to guarantee this ordering, it would make every writel() call extremely expensive :-( So if a read from the bridge achieves the same effect, can't we just put one after the writes within the spinlock (an unrelaxed one). That way this whole sequence will look like a well understood PCI posting flush rather than have to muck around with little understood (at least by most driver writers) io barriers? Hmmm, I think mmiowb() does some sort of status read from the bridge, I am not sure if it's enough to just do a regular readl(). I'm adding Jeremy to the list, he should know for sure. Cheers, Jes ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
Jesse Barnes wrote: On Thursday, May 29, 2008 2:40 pm Benjamin Herrenschmidt wrote: On Thu, 2008-05-29 at 10:47 -0400, Jes Sorensen wrote: The only way to guarantee ordering in the above setup, is to either make writel() fully ordered or adding the mmiowb()'s inbetween the two writel's. On Altix you have to go and read from the PCI brige to ensure all writes to it have been flushed, which is also what mmiowb() is doing. If writel() was to guarantee this ordering, it would make every writel() call extremely expensive :-( Interesting. I've always been taught by ia64 people that mmiowb() was intended to be used solely between writel() and spin_unlock(). Well, that *was* true, afaik, but maybe these days multipath isn't just for fail-over. If that's true, then yeah making every single writeX ordered would be the only way to go... I could be getting bits wrong, but multi-path here is in the NUMA routing, not at the device level. If this is a performance problem, then provide relaxed variants and use them in selected drivers. Sounds reasonable. That way drivers just work and important drivers can be optimized. That would kill all levels of performance in all drivers, resulting in attempts to try and modify a fair bit of drivers to get the performance back. In reality this problem really only exists for devices where ordering of consecutive writel's is a big issue. In my experience it really isn't the case very frequently - and the number of mmiowb's that have put in shows that too :-) Cheers, Jes ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
Benjamin Herrenschmidt wrote: On Thu, 2008-05-29 at 10:47 -0400, Jes Sorensen wrote: The only way to guarantee ordering in the above setup, is to either make writel() fully ordered or adding the mmiowb()'s inbetween the two writel's. On Altix you have to go and read from the PCI brige to ensure all writes to it have been flushed, which is also what mmiowb() is doing. If writel() was to guarantee this ordering, it would make every writel() call extremely expensive :-( Interesting. I've always been taught by ia64 people that mmiowb() was intended to be used solely between writel() and spin_unlock(). I think in the above case, you really should make writel() ordered. Anything else is asking for trouble, for the exact same reasons that I made it fully ordered on powerpc at least vs. previous stores. I only kept it relaxed vs. subsequent cacheable stores (ie, spin_unlock), for which I use the trick mentioned before. Hmmm I hope I didn't mess up the description of this and added to the confusion. The net result of that would be to kill performance completely, I really don't like that idea Having each writel() go out and poll the PCI bridge is going to make every driver used on Altix slow as a dog. In addition it's still not going to solve the problem for userland mapped stuff such as infinibug. Yes, this has some cost (can be fairly significant on powerpc too) but I think it's a very basic assumption from drivers that consecutive writel's, especially issued by the same CPU, will get to the device in order. In this case the cost is more than just significant, it's pretty crippling. If this is a performance problem, then provide relaxed variants and use them in selected drivers. We'd have to make major changes to drivers like e1000, tg3, mptsas, the qla2/3/4xxx and a bunch of others to get performance back. I really think the code maintenance issue there will get far worse than what we have today :( Cheers, Jes ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On Friday, May 30, 2008 2:36 am Jes Sorensen wrote: James Bottomley wrote: The only way to guarantee ordering in the above setup, is to either make writel() fully ordered or adding the mmiowb()'s inbetween the two writel's. On Altix you have to go and read from the PCI brige to ensure all writes to it have been flushed, which is also what mmiowb() is doing. If writel() was to guarantee this ordering, it would make every writel() call extremely expensive :-( So if a read from the bridge achieves the same effect, can't we just put one after the writes within the spinlock (an unrelaxed one). That way this whole sequence will look like a well understood PCI posting flush rather than have to muck around with little understood (at least by most driver writers) io barriers? Hmmm, I think mmiowb() does some sort of status read from the bridge, I am not sure if it's enough to just do a regular readl(). I'm adding Jeremy to the list, he should know for sure. I think a read from the target host bridge is enough. What mmiowb() does though is to read a *local* host bridge register, which contains a count of the number of PIO ops still in flight on their way to their target bridge. When it reaches 0, all PIOs have arrived at the target host bridge (they still may be bufferd), so ordering is guaranteed. Jesse ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On Wednesday 28 May 2008, Benjamin Herrenschmidt wrote: On Tue, 2008-05-27 at 14:55 -0700, Linus Torvalds wrote: On Wed, 28 May 2008, Benjamin Herrenschmidt wrote: A problem with __raw_ though is that they -also- don't do byteswap, Well, that's why there is __readl() and __raw_readl(), no? As I replied to somebody else, __readl() is news to me :-) we dont' have those on powerpc. It's not exactly a well-established interface. Only five architectures define these functions, and there is not a single user in the kernel source outside of these architecture's io.h files. Arnd ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
It's not exactly a well-established interface. Only five architectures define these functions, and there is not a single user in the kernel source outside of these architecture's io.h files. That is because the drivers using them had them removed (eg I²O) - mostly because it didn't compile on power pc 8( Alan ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On 28 Μαϊ 2008, at 11:36 ΠΜ, Haavard Skinnemoen wrote: Benjamin Herrenschmidt [EMAIL PROTECTED] wrote: I'm happy to say that __raw is purely about ordering and make them byteswap on powerpc tho (ie, make them little endian like the non- raw counterpart). That would break a lot of drivers. How many actually use __raw_ * ? I do -- in all the drivers for on-chip peripherals that are shared between AT91 ARM (LE) and AVR32 (BE). Since everything goes on inside the chip, we must use LE accesses on ARM and BE accesses on AVR32. Currently, this is the only interface I know that can do native-endian accesses, so if you take it away, I'm gonna need an alternative interface that doesn't do byteswapping. Haavard I certainly do too. Quite a lot of SoC specific drivers use __raw for accessing their on-chip peripherals. Please don't change the __raw semantics, you'll end up breaking almost everything that's BE. -- Pantelis ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On Thu, 2008-05-29 at 10:47 -0400, Jes Sorensen wrote: Roland == Roland Dreier [EMAIL PROTECTED] writes: This is a different issue. We deal with it on powerpc by having writel set a per-cpu flag and spin_unlock() test it, and do the barrier if needed there. Roland Cool... I assume you do this for mutex_unlock() etc? Roland Is there any reason why ia64 can't do this too so we can kill Roland mmiowb and save everyone a lot of hassle? (mips, sh and frv Roland have non-empty mmiowb() definitions too but I'd guess that Roland these are all bugs based on misunderstandings of the mmiowb() Roland semantics...) Hi Roland, Thats not going to solve the problem on Altix. On Altix the issue is that there can be multiple paths through the NUMA fabric from cpuX to PCI bridge Y. Consider this uber-cooltm ascii art - NR is my abbrevation for NUMA router: --- --- |cpu X| |cpu Y| --- --- | \ /| |\/ | |/\ | | / \| - -- |NR 1| |NR 2| -- -- \ / \ / --- | PCI | --- The problem is that your two writel's, despite being both issued on cpu X, due to the spin lock, in your example, can end up with the first one going through NR 1 and the second one going through NR 2. If there's contention on NR 1, the write going via NR 2 may hit the PCI bridge prior to the one going via NR 1. Of course, the bigger the system, the worse the problem The only way to guarantee ordering in the above setup, is to either make writel() fully ordered or adding the mmiowb()'s inbetween the two writel's. On Altix you have to go and read from the PCI brige to ensure all writes to it have been flushed, which is also what mmiowb() is doing. If writel() was to guarantee this ordering, it would make every writel() call extremely expensive :-( So if a read from the bridge achieves the same effect, can't we just put one after the writes within the spinlock (an unrelaxed one). That way this whole sequence will look like a well understood PCI posting flush rather than have to muck around with little understood (at least by most driver writers) io barriers? James ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
Roland == Roland Dreier [EMAIL PROTECTED] writes: This is a different issue. We deal with it on powerpc by having writel set a per-cpu flag and spin_unlock() test it, and do the barrier if needed there. Roland Cool... I assume you do this for mutex_unlock() etc? Roland Is there any reason why ia64 can't do this too so we can kill Roland mmiowb and save everyone a lot of hassle? (mips, sh and frv Roland have non-empty mmiowb() definitions too but I'd guess that Roland these are all bugs based on misunderstandings of the mmiowb() Roland semantics...) Hi Roland, Thats not going to solve the problem on Altix. On Altix the issue is that there can be multiple paths through the NUMA fabric from cpuX to PCI bridge Y. Consider this uber-cooltm ascii art - NR is my abbrevation for NUMA router: --- --- |cpu X| |cpu Y| --- --- | \ /| |\/ | |/\ | | / \| - -- |NR 1| |NR 2| -- -- \ / \ / --- | PCI | --- The problem is that your two writel's, despite being both issued on cpu X, due to the spin lock, in your example, can end up with the first one going through NR 1 and the second one going through NR 2. If there's contention on NR 1, the write going via NR 2 may hit the PCI bridge prior to the one going via NR 1. Of course, the bigger the system, the worse the problem The only way to guarantee ordering in the above setup, is to either make writel() fully ordered or adding the mmiowb()'s inbetween the two writel's. On Altix you have to go and read from the PCI brige to ensure all writes to it have been flushed, which is also what mmiowb() is doing. If writel() was to guarantee this ordering, it would make every writel() call extremely expensive :-( Cheers, Jes ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On Thu, 2008-05-29 at 10:47 -0400, Jes Sorensen wrote: The only way to guarantee ordering in the above setup, is to either make writel() fully ordered or adding the mmiowb()'s inbetween the two writel's. On Altix you have to go and read from the PCI brige to ensure all writes to it have been flushed, which is also what mmiowb() is doing. If writel() was to guarantee this ordering, it would make every writel() call extremely expensive :-( Interesting. I've always been taught by ia64 people that mmiowb() was intended to be used solely between writel() and spin_unlock(). I think in the above case, you really should make writel() ordered. Anything else is asking for trouble, for the exact same reasons that I made it fully ordered on powerpc at least vs. previous stores. I only kept it relaxed vs. subsequent cacheable stores (ie, spin_unlock), for which I use the trick mentioned before. Yes, this has some cost (can be fairly significant on powerpc too) but I think it's a very basic assumption from drivers that consecutive writel's, especially issued by the same CPU, will get to the device in order. If this is a performance problem, then provide relaxed variants and use them in selected drivers. Cheers, Ben. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On Fri, 30 May 2008, Benjamin Herrenschmidt wrote: On Thu, 2008-05-29 at 10:47 -0400, Jes Sorensen wrote: Interesting. I've always been taught by ia64 people that mmiowb() was intended to be used solely between writel() and spin_unlock(). That's what I gathered too, based on what's written in memory-barriers.txt, which is the only kernel docs I could find that addressed this. Yes, this has some cost (can be fairly significant on powerpc too) but I think it's a very basic assumption from drivers that consecutive writel's, especially issued by the same CPU, will get to the device in order. It's also what memory-barriers.txt says they should do. If this is a performance problem, then provide relaxed variants and use them in selected drivers. I wrote a JTAG over gpio driver for the powerpc MPC8572DS platform. With the non-raw io accessors, the JTAG clock can run at almost ~9.5 MHz. Using raw versions (which I had to write since powerpc doesn't have any), the clock speed increases to about 28 MHz. So it can make a very significant different. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On Thursday, May 29, 2008 2:40 pm Benjamin Herrenschmidt wrote: On Thu, 2008-05-29 at 10:47 -0400, Jes Sorensen wrote: The only way to guarantee ordering in the above setup, is to either make writel() fully ordered or adding the mmiowb()'s inbetween the two writel's. On Altix you have to go and read from the PCI brige to ensure all writes to it have been flushed, which is also what mmiowb() is doing. If writel() was to guarantee this ordering, it would make every writel() call extremely expensive :-( Interesting. I've always been taught by ia64 people that mmiowb() was intended to be used solely between writel() and spin_unlock(). Well, that *was* true, afaik, but maybe these days multipath isn't just for fail-over. If that's true, then yeah making every single writeX ordered would be the only way to go... If this is a performance problem, then provide relaxed variants and use them in selected drivers. Sounds reasonable. That way drivers just work and important drivers can be optimized. Jesse ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
The problem is that your two writel's, despite being both issued on cpu X, due to the spin lock, in your example, can end up with the first one going through NR 1 and the second one going through NR 2. If there's contention on NR 1, the write going via NR 2 may hit the PCI bridge prior to the one going via NR 1. Really?? I can't see how you can expect any drivers to work reliably if simple code like writel(reg1); writel(reg2); might end up with the write to reg2 hitting the device before the write to reg1. Almost all MMIO stuff has ordering requirements and will totally break if you allow, say, the start IO write to be reordered before the set IO address write. - R. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On Thu, 2008-05-29 at 14:48 -0700, Trent Piepho wrote: I wrote a JTAG over gpio driver for the powerpc MPC8572DS platform. With the non-raw io accessors, the JTAG clock can run at almost ~9.5 MHz. Using raw versions (which I had to write since powerpc doesn't have any), the clock speed increases to about 28 MHz. So it can make a very significant different. Yes, sync's can hurt a lot. This is why I initially tried to get more relaxed semantics. We could implement something like __ variants and do something that would still have eieio's but not sync's for example (ie. MMIOs are still ordered vs. each other but not vs. coherent memory). Ben. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
I do -- in all the drivers for on-chip peripherals that are shared between AT91 ARM (LE) and AVR32 (BE). Since everything goes on inside the chip, we must use LE accesses on ARM and BE accesses on AVR32. Currently, this is the only interface I know that can do native-endian accesses, so if you take it away, I'm gonna need an alternative interface that doesn't do byteswapping. Are you aware that these also don't provide any ordering guarantee ? Ben. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On Fri, 30 May 2008, Benjamin Herrenschmidt wrote: On Thu, 2008-05-29 at 14:48 -0700, Trent Piepho wrote: I wrote a JTAG over gpio driver for the powerpc MPC8572DS platform. With the non-raw io accessors, the JTAG clock can run at almost ~9.5 MHz. Using raw versions (which I had to write since powerpc doesn't have any), the clock speed increases to about 28 MHz. So it can make a very significant different. Yes, sync's can hurt a lot. This is why I initially tried to get more relaxed semantics. We could implement something like __ variants and do something that would still have eieio's but not sync's for example (ie. MMIOs are still ordered vs. each other but not vs. coherent memory). The problem current with the raw variants is that not all archs have them. And for those that do, there is no defined semantics. Each arch is different as to what ordering they have (and endianness too). If you want to write a driver that is (or might be one day) multi-platform, there aren't any less ordered accessors one can use. A lot of drivers don't even use coherent DMA, and could use less strictly ordered semantics quite trivially. Except there aren't any. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
Trent Piepho writes: On Thu, 29 May 2008, Roland Dreier wrote: The problem is that your two writel's, despite being both issued on cpu X, due to the spin lock, in your example, can end up with the first one going through NR 1 and the second one going through NR 2. If there's contention on NR 1, the write going via NR 2 may hit the PCI bridge prior to the one going via NR 1. Really?? I can't see how you can expect any drivers to work reliably if simple code like writel(reg1); writel(reg2); might end up with the write to reg2 hitting the device before the write to reg1. Almost all MMIO stuff has ordering requirements and will This is how powerpc is natively (the linux accessors have extra ordering to not allow this of course), and there are non-Linux drivers that are written for this ordering model. In fact stores to non-cacheable locations have to be kept in order, according to the Power architecture. But you're correct in that non-cacheable loads can in principle be reordered w.r.t. each other and to stores. I think what makes altix so hard is that writes to the _same_ register may be be re-ordered. Re-ordering writes to the same address is much less common than allowing writes to different addresses to be re-ordered. Indeed. Paul. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
Benjamin Herrenschmidt [EMAIL PROTECTED] wrote: I'm happy to say that __raw is purely about ordering and make them byteswap on powerpc tho (ie, make them little endian like the non-raw counterpart). That would break a lot of drivers. How many actually use __raw_ * ? I do -- in all the drivers for on-chip peripherals that are shared between AT91 ARM (LE) and AVR32 (BE). Since everything goes on inside the chip, we must use LE accesses on ARM and BE accesses on AVR32. Currently, this is the only interface I know that can do native-endian accesses, so if you take it away, I'm gonna need an alternative interface that doesn't do byteswapping. Haavard ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On Tue, 27 May 2008, Linus Torvalds wrote: Here's a UNTESTED patch for x86 that may or may not compile and work, and which serializes (on a compiler level) the IO accesses against regular memory accesses. Ok, so it at least boots on x86-32. Thus probably on x86-64 too (since the code is now shared). I didn't look at whether it generates much bigger code due to the potential extra serialization, but some of the code generation I looked at looked fine. IOW, it doesn't at least create any _obviously_ worse code, and it should be arguably safer than assuming the compiler does volatile accesses the way we want it to. Linus ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On Tue, 2008-05-27 at 08:50 -0700, Roland Dreier wrote: Though it's my understanding that at least ia64 does require the explicit barriers anyway, so we are still in a dodgy situation here where it's not clear what drivers should do and we end up with possibly excessive barriers on powerpc where I end up with both the wmb/rmb/mb that were added for ia64 -and- the ones I have in readl/writel to make them look synchronous... Not nice. ia64 is a disaster with a slightly different ordering problem -- the mmiowb() issue. I know Ben knows far too much about this, but for big SGI boxes, you sometimes need mmiowb() to avoid problems with driver code that does totally sane stuff like spin_lock(mmio_lock); writel(val1, reg1); writel(val2, reg2); spin_unlock(mmio_lock); If that snippet is called on two CPUs at the same time, then the device might see a sequence like CPU1 -- write reg1 CPU2 -- write reg1 CPU1 -- write reg2 CPU2 -- write reg2 in spite of the fact that everything is totally ordered on the CPUs by the spin lock. The reason this is such a disaster is because the code looks right, makes sense, and works fine on 99.99% of all systems out there. So I would bet that 99% of our drivers have missing mmiowb() bugs -- no one has plugged the hardware into an Altix box and cared enough to stress test it. However for the issue at hand, my expectation as a driver writer is that readl()/writel() are ordered with respect to MMIO operations, but not necessarily with respect to normal writes to coherent CPU memory. And I've included explicit wmb()s in code I've written like drivers/infiniband/hw/mthca. Actually, this specifically should not be. The need for mmiowb on altix is because it explicitly violates some of the PCI rules that would otherwise impede performance. The compromise is that readX on altix contains the needed dma flush but there's a variant operator, readX_relaxed that doesn't (for drivers that know what they're doing). The altix critical drivers have all been converted to use the relaxed form for performance, and the unconverted ones should all operate just fine (albeit potentially more slowly). You can see all of this in include/asm-ia64/sn/io.h It is confusing to me that sn_dma_flush() and sn_mmiowb() have different implementations, but I think both fix the spinlock problem you allude to by ensuring the DMA operation is completed before the CPU instruction is executed. James ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
Actually, this specifically should not be. The need for mmiowb on altix is because it explicitly violates some of the PCI rules that would otherwise impede performance. The compromise is that readX on altix contains the needed dma flush but there's a variant operator, readX_relaxed that doesn't (for drivers that know what they're doing). The altix critical drivers have all been converted to use the relaxed form for performance, and the unconverted ones should all operate just fine (albeit potentially more slowly). Is this a recent change? Because as of October 2007, 76d7cc03 (IB/mthca: Use mmiowb() to avoid firmware commands getting jumbled up) was needed. But this was involving writel() (__raw_writel() actually, looking at the code), not readl(). But writel_relaxed() doesn't exist (and doesn't make sense). - R. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On Tue, 2008-05-27 at 10:38 -0700, Roland Dreier wrote: Actually, this specifically should not be. The need for mmiowb on altix is because it explicitly violates some of the PCI rules that would otherwise impede performance. The compromise is that readX on altix contains the needed dma flush but there's a variant operator, readX_relaxed that doesn't (for drivers that know what they're doing). The altix critical drivers have all been converted to use the relaxed form for performance, and the unconverted ones should all operate just fine (albeit potentially more slowly). Is this a recent change? Because as of October 2007, 76d7cc03 (IB/mthca: Use mmiowb() to avoid firmware commands getting jumbled up) was needed. But this was involving writel() (__raw_writel() actually, looking at the code), not readl(). But writel_relaxed() doesn't exist (and doesn't make sense). Um, OK, you've said write twice now ... I was assuming you meant read. Even on an x86, writes are posted, so there's no way a spin lock could serialise a write without an intervening read to flush the posting (that's why only reads have a relaxed version on altix). Or is there something else I'm missing? James ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
Um, OK, you've said write twice now ... I was assuming you meant read. Even on an x86, writes are posted, so there's no way a spin lock could serialise a write without an intervening read to flush the posting (that's why only reads have a relaxed version on altix). Or is there something else I'm missing? Writes are posted yes, but not reordered arbitrarily. If I have code like: spin_lock(mmio_lock); writel(val1, reg1); writel(val2, reg2); spin_unlock(mmio_lock); then I have a reasonable expectation that if two CPUs run this at the same time, their writes to reg1/reg2 won't be interleaved with each other (because the whole section is inside a spinlock). And Altix violates that expectation. - R. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
Writes are posted yes, but not reordered arbitrarily. on standard x86 I mean here... ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On Tue, 27 May 2008, Linus Torvalds wrote: On Tue, 27 May 2008, Benjamin Herrenschmidt wrote: Yes. As it is today, tg3 for example is potentially broken on all archs with newer gcc unless we either add memory clobber to readl/writel or stick some wmb's in there (just a random driver I picked). So Linus, what is your take on that matter ? Let's just serialize the damn things, and add a memory clobber to them. Expecting people to fix up all drivers is simply not going to happen. And serializing things shouldn't be *that* expensive. People who cannot take the expense can continue to use the magic __raw_writel() etc stuff. Is there an issue with anything _besides_ coherent DMA? Could one have a special version of the accessors for drivers that want to assume they are strictly ordered vs coherent DMA memory? That would be much easier to get right, without slowing _everything_ down. The problem with the raw versions is that on some arches they are much more raw than just not being ordered w.r.t normal memory. __raw_writel() No ordering beyond what the arch provides natively. The only thing you can assume is that reads and writes to the same location on the same CPU are ordered. writel() Strictly ordered with respect to any other IO, but not to normal memory. (this is what Documentation/memory-barriers.txt claims). Should probably be strictly ordered vs locks as well. Would be strictly ordered w.r.t. the streaming DMA sync calls of course. strict_writel() Strictly ordered with respect to normal (incl. coherent DMA) memory as well. One could even go as far as to allow a driver to #define WANT_STRICT_IO and then it would get the strict versions. Add that to any driver that uses DMA and then worry about vetting those drivers. It's also worth nothing that adding barriers to IO accessors doesn't mean you never have to worry about barriers with coherent DMA. Inherent with coherent DMA, and driving hardware in general, there are various sequence points where one must be sure one operation has finished before starting the next. Making sure you're doing with DMA memory before telling the hardware to do something with it a common example. Often the telling the hardware to do something is done via an IO accessor function, but not always. You can have buffer descriptors in DMA memory that describe the DMA buffers to the hardware. It would be critical to be finished reading a DMA buffer before updating the descriptor to indicate that it's empty. You could trigger a DMA operation not by a call to an IO accessor, but with an atomic operation to a variable shared with an ISR. When the ISR runs it sees the change and does the necessary IO. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
Trent Piepho wrote: Is there an issue with anything _besides_ coherent DMA? Could one have a special version of the accessors for drivers that want to assume they are strictly ordered vs coherent DMA memory? That would be much easier to get right, without slowing _everything_ down. It's better to be safe by default and then optimize the fast paths than to be relaxed by default and hang the machine in some piece of code that runs once a month. Premature optimization is the root of all evil, and what not. One could even go as far as to allow a driver to #define WANT_STRICT_IO and then it would get the strict versions. Add that to any driver that uses DMA and then worry about vetting those drivers. See above -- if you must have a #define, then it should be WANT_RELAXED_IO. -Scott ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On Tue, 2008-05-27 at 08:35 -0700, Linus Torvalds wrote: On Tue, 27 May 2008, Benjamin Herrenschmidt wrote: Yes. As it is today, tg3 for example is potentially broken on all archs with newer gcc unless we either add memory clobber to readl/writel or stick some wmb's in there (just a random driver I picked). So Linus, what is your take on that matter ? Let's just serialize the damn things, and add a memory clobber to them. Expecting people to fix up all drivers is simply not going to happen. And serializing things shouldn't be *that* expensive. People who cannot take the expense can continue to use the magic __raw_writel() etc stuff. Ok. Do we also remove wmb/rmb/... from drivers then ? :-) I think ia64 would need to be fixed to make their writel serializing... Regarding __raw_* their semantics are dodgy ... we might want to provide something better but it's a different subject. Cheers, Ben. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On Tue, 2008-05-27 at 08:50 -0700, Roland Dreier wrote: Though it's my understanding that at least ia64 does require the explicit barriers anyway, so we are still in a dodgy situation here where it's not clear what drivers should do and we end up with possibly excessive barriers on powerpc where I end up with both the wmb/rmb/mb that were added for ia64 -and- the ones I have in readl/writel to make them look synchronous... Not nice. ia64 is a disaster with a slightly different ordering problem -- the mmiowb() issue. I know Ben knows far too much about this, but for big SGI boxes, you sometimes need mmiowb() to avoid problems with driver code that does totally sane stuff like This is a different issue. We deal with it on powerpc by having writel set a per-cpu flag and spin_unlock() test it, and do the barrier if needed there. However, drivers such as e1000 -also- have a wmb() between filling the ring buffer and kicking the DMA with MMIO, with a comment about this being needed for ia64 relaxed ordering. Ben. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On Tue, 2008-05-27 at 09:47 -0700, Linus Torvalds wrote: __read[bwlq]()/__write[bwlq]() are not serialized with a :memory barrier, although since they still use asm volatile I suspect that i practice they are probably serial too. Did not look very closely at any generated code (only did a trivial test to see that the code looks *roughly* correct). Nah, asm volatile doesn't help, it does need the memory clobber too. Cheers, Ben. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
Roland Dreier wrote: Writes are posted yes, but not reordered arbitrarily. If I have code like: spin_lock(mmio_lock); writel(val1, reg1); writel(val2, reg2); spin_unlock(mmio_lock); then I have a reasonable expectation that if two CPUs run this at the same time, their writes to reg1/reg2 won't be interleaved with each other (because the whole section is inside a spinlock). And Altix violates that expectation. Does that necessarily follow? If you've got a large system with multiple pci bridges, could you end up with posted writes coming from different cpus taking a different amount of time to propagate to a device and thus colliding? Chris ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
Writes are posted yes, but not reordered arbitrarily. If I have code like: spin_lock(mmio_lock); writel(val1, reg1); writel(val2, reg2); spin_unlock(mmio_lock); then I have a reasonable expectation that if two CPUs run this at the same time, their writes to reg1/reg2 won't be interleaved with each other (because the whole section is inside a spinlock). And Altix violates that expectation. Does that necessarily follow? If you've got a large system with multiple pci bridges, could you end up with posted writes coming from different cpus taking a different amount of time to propagate to a device and thus colliding? Not on x86. And a given PCI device can only be reached from a single host bridge, so I don't see how it can happen. But on SGI altix systems, there is a routed fabric between the CPU and the PCI bus, so the reordering can happen there. Hence mmiowb() and the endless supply of driver bugs that it causes. - R. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On Wed, 28 May 2008, Benjamin Herrenschmidt wrote: On Tue, 2008-05-27 at 08:35 -0700, Linus Torvalds wrote: Expecting people to fix up all drivers is simply not going to happen. And serializing things shouldn't be *that* expensive. People who cannot take the expense can continue to use the magic __raw_writel() etc stuff. Ok. Do we also remove wmb/rmb/... from drivers then ? :-) I think ia64 would need to be fixed to make their writel serializing... Well.. There's really two different issues: (a) x86 and the fact that we have thousands of drivers which in turn conflicts with (b) non-x86 and the fact that other architectures tend to be absolute pieces of cr*p when it comes to ordering, _especially_ across IO. and the thing about (b) is that the number of drivers involved is a hell of a lot smaller. For example, ia64 and the big SGI machines probably really only care about roughly five drivers (number taken out of my nether regions). So practically speaking, I suspect that the right approach is to just say ok, x86 will continue to be pretty darn ordered, and the barriers won't really matter (*) but at the same time also saying we wish reality was different, and well-maintained drivers should probably try to work in the presense of re-ordering. In *practice*, that probably means that most architectures will be better off if they emulate x86 closely, just because that means that they won't rely on drivers always getting things right, but I think we can leave the door open for the odd machines. We should just realize that they will never get a lot of testing, but on the other hand, their usage scenarios will generally also be very limited (very specific loads, and _very_ specific hardware). And the patch I sent out actually made __[raw_]readl() different from readl() on x86 too, in that the assembly _allows_ a bit more re-ordering, even though I doubt it will be visible in practice. So if you use the __ versions, you'd better have barriers even on x86! Linus (*) With the possible but unlikely exception of some big machines with separate IO networks, but if they happen they will fall into the 'ia64' case of just having a few relevant drivers. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
This is a different issue. We deal with it on powerpc by having writel set a per-cpu flag and spin_unlock() test it, and do the barrier if needed there. Cool... I assume you do this for mutex_unlock() etc? Is there any reason why ia64 can't do this too so we can kill mmiowb and save everyone a lot of hassle? (mips, sh and frv have non-empty mmiowb() definitions too but I'd guess that these are all bugs based on misunderstandings of the mmiowb() semantics...) However, drivers such as e1000 -also- have a wmb() between filling the ring buffer and kicking the DMA with MMIO, with a comment about this being needed for ia64 relaxed ordering. I put these barriers into mthca, mlx4 etc, although it came from my possible misunderstanding of the memory ordering rules in the kernel more than any experience of problems (as opposed the the mmiowb()s, which all came from real world bugs). - R. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
So practically speaking, I suspect that the right approach is to just say ok, x86 will continue to be pretty darn ordered, and the barriers won't really matter (*) but at the same time also saying we wish reality was different, and well-maintained drivers should probably try to work in the presense of re-ordering. Ok, well, I'll slap memory clobbers onto powerpc accessors, we made them fully ordered a while ago anyway. The extra barriers in drivers like USB etc.. won't hurt us much, we can always fine tune drivers that really want high performances. A problem with __raw_ though is that they -also- don't do byteswap, which is a pain in the neck as people use them for either one reason (relaxed ordering) or the other (no byteswap) without always knowing the consequences of doing so... I'm happy to say that __raw is purely about ordering and make them byteswap on powerpc tho (ie, make them little endian like the non-raw counterpart). Some archs started providing writel_be etc... I added those to powerpc a little while ago, and I tend to prefer that approach for the byteswap issue. What do you think ? Ben. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
re-ordering, even though I doubt it will be visible in practice. So if you use the __ versions, you'd better have barriers even on x86! Are we also going to have __ioread*/__iowrite* ? Also is the sematics of __readl/__writel defined for all architectures - I used it ages ago in the i2o drivers for speed and it got removed because it didn't build on some platforms. Alan ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On Tue, May 27, 2008 at 10:38:22PM +0100, Alan Cox wrote: re-ordering, even though I doubt it will be visible in practice. So if you use the __ versions, you'd better have barriers even on x86! Are we also going to have __ioread*/__iowrite* ? Didn't we already define ioread*() to have loose semantics? -- Intel are signing my paycheques ... these opinions are still mine Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On Wed, 28 May 2008, Benjamin Herrenschmidt wrote: A problem with __raw_ though is that they -also- don't do byteswap, Well, that's why there is __readl() and __raw_readl(), no? Neither does ordering, and __raw_readl() doesn't do byte-swap. Of course, I'm not going to guarantee every architecture even has all those versions, nor am I going to guarantee they all work as advertised :) For x86, they have historially all been 100% identical. With the inline asm patch I posted, the __ version (whether raw or not) lack the memory barrier, so they allow a *little* bit more re-ordering. (They won't be re-ordered wrt spinlocks etc, unless gcc starts reordering volatile asm's against each other, which would be a bug). In practice, I doubt it matters. Whatever small compiler re-ordering it might affect won't have any real performance impack one way or the other, I think. Linus ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: MMIO and gcc re-ordering issue
On Tue, 27 May 2008, Alan Cox wrote: re-ordering, even though I doubt it will be visible in practice. So if you use the __ versions, you'd better have barriers even on x86! Are we also going to have __ioread*/__iowrite* ? I doubt there is any reason to. Let's just keep them very strictly ordered. Also is the sematics of __readl/__writel defined for all architectures - I used it ages ago in the i2o drivers for speed and it got removed because it didn't build on some platforms. Agreed - I'm not sure the __ versions are really worth it. We have them, but the semantics are subtle enough that most drivers will never care enough to really use them. I would suggest using them mainly for architecture-specific drivers, on architectures where it actually matters (which is not the case on x86). Linus ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev