Re: MMIO and gcc re-ordering issue

2008-06-12 Thread Paul Mackerras
Nick Piggin writes:

 /* turn off LED */
 val64 = readq(bar0-adapter_control);
 val64 = val64 (~ADAPTER_LED_ON);
 writeq(val64, bar0-adapter_control);
 s2io_link(nic, LINK_DOWN);
 }
 clear_bit(__S2IO_STATE_LINK_TASK, (nic-state));
 
 Now I can't say definitively that this is going to be wrong on
 powerpc, because I don't know the code well enough. But I'd be
 90% sure that the unlock is not supposed to be visible to
 other CPUs before the writeqs are queued to the card. On x86 it
 wouldn't be.

Interestingly, there is also a store to cacheable memory
(nic-device_enabled_once), but no smp_wmb or equivalent before the
clear_bit.  So there are other potential problems here besides the I/O
related ones.

Anyway, I have done some tests on a dual G5 here with putting a sync
on both sides of the store in writel etc. (i.e. making readl/writel
strongly ordered w.r.t. everything else), and as you predicted, there
wasn't a noticeable performance degradation, at least not on the
couple of things I tried.  So I am now inclined to accept your
suggestion that we should do that.  I should probably do some similar
checks on POWER6 and a few other machines first, though.

Paul.
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-06-12 Thread Nick Piggin
On Thursday 12 June 2008 22:14, Paul Mackerras wrote:
 Nick Piggin writes:
  /* turn off LED */
  val64 = readq(bar0-adapter_control);
  val64 = val64 (~ADAPTER_LED_ON);
  writeq(val64, bar0-adapter_control);
  s2io_link(nic, LINK_DOWN);
  }
  clear_bit(__S2IO_STATE_LINK_TASK, (nic-state));
 
  Now I can't say definitively that this is going to be wrong on
  powerpc, because I don't know the code well enough. But I'd be
  90% sure that the unlock is not supposed to be visible to
  other CPUs before the writeqs are queued to the card. On x86 it
  wouldn't be.

 Interestingly, there is also a store to cacheable memory
 (nic-device_enabled_once), but no smp_wmb or equivalent before the
 clear_bit.  So there are other potential problems here besides the I/O
 related ones.

Yeah there sure is. That sucks too, but we go one step at a time ;)
I think proposing a strong ordering between set_bit/clear_bit would
actually be quite noticable slowdown in core kernel code at this
point.

Which reminds me, I have been meaning to do another pass of test and
set bit / clear bit conversions to the _lock primitives...


 Anyway, I have done some tests on a dual G5 here with putting a sync
 on both sides of the store in writel etc. (i.e. making readl/writel
 strongly ordered w.r.t. everything else), and as you predicted, there
 wasn't a noticeable performance degradation, at least not on the
 couple of things I tried.  So I am now inclined to accept your
 suggestion that we should do that.  I should probably do some similar
 checks on POWER6 and a few other machines first, though.

Oh good, thanks for looking into it. I guess it might be a little more
noticable on bigger POWER systems. And I think we might even need to do
a PCI read after every writel on sn2 systems in order to get the
semantics I want.

I can't say it won't be noticable. But if we consider the number of
drivers (maybe one or two dozen well maintained ones), and number of
sites in each driver (maybe one or two submission and completion
fastpaths which should have a minimum of IO operations in each one)
that will have to be converted in order to get performance as good or
better than it is currently with relaxed accessors and weigh that
against all the places in those and every other crappy obscure
driver that we *won't* have to audit, I really think we end up with
a net win even with some short term pain.
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-06-12 Thread Matthew Wilcox
On Thu, Jun 05, 2008 at 06:43:53PM +1000, Benjamin Herrenschmidt wrote:
 Note that the powerpc implementation currently clears the flag
 on spin_lock and tests it on unlock. We are considering changing
 that to not touch the flag on spin_lock and just clear it whenever
 we do a sync (ie, on unlock, on explicit mmiowb, and possibly even
 on readl's where we happen to do sync's).

Your current scheme sounds like it's broken for

spin_lock(a)
writel();
spin_lock(b);
spin_unlock(b);
spin_unlock(a);

-- 
Intel are signing my paycheques ... these opinions are still mine
Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step.
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-06-12 Thread Benjamin Herrenschmidt
On Thu, 2008-06-12 at 09:07 -0600, Matthew Wilcox wrote:
 On Thu, Jun 05, 2008 at 06:43:53PM +1000, Benjamin Herrenschmidt wrote:
  Note that the powerpc implementation currently clears the flag
  on spin_lock and tests it on unlock. We are considering changing
  that to not touch the flag on spin_lock and just clear it whenever
  we do a sync (ie, on unlock, on explicit mmiowb, and possibly even
  on readl's where we happen to do sync's).
 
 Your current scheme sounds like it's broken for
 
 spin_lock(a)
 writel();
 spin_lock(b);
 spin_unlock(b);
 spin_unlock(a);

Which is why we are considering changing it :-)

But as Paulus said before, he did some measurement and we came to the
conclusion that (pending more measurements on a wider range of HW) we
may as well drop the whole scheme and make writel fully synchronous
instead.

Then, we can get some nice weakly ordered accessors and start adding
them with appropriate explicit barriers to the hot path of perf.
critical drivers we care about.

Cheers,
Ben.



___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-06-11 Thread Nick Piggin
On Wednesday 11 June 2008 15:35, Nick Piggin wrote:
 On Wednesday 11 June 2008 15:13, Paul Mackerras wrote:
  Nick Piggin writes:
I just wish we had even one actual example of things going wrong with
the current rules we have on powerpc to motivate changing to this
model.
  
   ~/usr/src/linux-2.6 git grep test_and_set_bit drivers/ | wc -l
   506
   How sure are you that none of those forms part of a cobbled-together
   locking scheme that hopes to constrain some IO access?
 
  My comment was precisely about the fact that this sort of argument is
  actually FUD.  I want one example that is demonstrably wrong, not just
  a how sure are you.

 But this is a case where you really should be scared, so FUD is
 completely appropriate.

 At least, I don't see how it is at all worse than FUD about how much
 slower everything will be with stronger ordering, and how nobody will
 be able to do anything about it. In reality, for most devices it will
 not be measurable, and for some like network or disk might use a bit
 of a tweak in one or two fastpaths.

 If the driver author honestly does not know the code well enough to
 say whether a particular ordering is allowed or not, then it should
 just not be allowed.


 Anyway, what do I win if I give you a concrete example?

 I found one in the first file I picked out of my grep: drivers/net/s2io.c

 if (test_and_set_bit(__S2IO_STATE_LINK_TASK, (nic-state))) {
 /* The card is being reset, no point doing anything */
 goto out_unlock;
 }

 ...
 val64 = readq(bar0-adapter_control);
 val64 |= ADAPTER_LED_ON;
 writeq(val64, bar0-adapter_control);
 s2io_link(nic, LINK_UP);
 } else {
 if (CARDS_WITH_FAULTY_LINK_INDICATORS(nic-device_type,
   subid)) {
 val64 = readq(bar0-gpio_control);
 val64 = ~GPIO_CTRL_GPIO_0;
 writeq(val64, bar0-gpio_control);
 val64 = readq(bar0-gpio_control);
 }
 /* turn off LED */
 val64 = readq(bar0-adapter_control);
 val64 = val64 (~ADAPTER_LED_ON);
 writeq(val64, bar0-adapter_control);
 s2io_link(nic, LINK_DOWN);
 }
 clear_bit(__S2IO_STATE_LINK_TASK, (nic-state));

 Now I can't say definitively that this is going to be wrong on
 powerpc, because I don't know the code well enough. But I'd be
 90% sure that the unlock is not supposed to be visible to
 other CPUs before the writeqs are queued to the card. On x86 it
 wouldn't be.

And I hope you're reading this, sn2 guys :)

The mmiowb() audit for spin_unlock was a nice idea, but this example
shows that unless you *really* know the code and have actually audited
everything, then you should not be messing with relaxing of ordering
etc.

Now obviously the example I quote is a slowpath, in which case it can
very happily continue to use strongly ordered io accessors, and you
will never ever notice the extra sync or PCI read or whatever after
the writeq. Then you can easily relax the couple of fastpaths that
matter, put barriers in there, comment them, and you're away: you now
have a *working* driver (that doesn't randomly lose adapter_control
bits twice a year), and that is also faster than it was when it was
broken because you've specifically used the very weakly ordered IO
accessors in the fast paths that matter.

And when somebody plugs in some obscure hardware into your box that
you haven't audited and put mmiowb etc etc into, then guess what? It
is also not going to blow up in an undebuggable way 6 months later in
the middle of insert something important here. The worst case is
that the user notices it is running a bit slowly and sends a full
profile output the next day.
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-06-11 Thread Paul Mackerras
Nick Piggin writes:

 Now that doesn't leave waker ordering architectures lumped with slow old
 x86 semantics. Think of it as giving them the benefit of sharing x86
 development and testing :)

Worth something, but not perhaps as much as you think, given that many
x86 driver writers still don't pay much attention to making their code
endian-safe and 64-bit-clean.  And there are still plenty of drivers
that use virt_to_bus...

Paul.
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-06-11 Thread Linus Torvalds


On Wed, 11 Jun 2008, Nick Piggin wrote:
 
 I can't actually find the definitive statement in the Intel manuals
 saying UC is strongly ordered also WRT WB. Linus?

Definitive? Dunno. But look in the Architecture manual, volume 3A, 10.3 
Methods of Caching Available, and then under the bullet about Write 
Combining (WC), it says

  the writes may be delayed until the next occurrence of a serializing 
  event; such as, an SFENCE of MFENCE instruction, CPUID execution, a read 
  or write to uncached memory, an interrupt occurrence, or a LOCK 
  instruction execution.

However, it's worth noting that

 - documentation can be wrong, or even if right, can be Intel-specific.

 - the above is expressly _only_ about the WC buffer, not about regular 
   memory writes. Cached memory accesses are different from WC accesses.

so in the end, the thing that matters is how things actually work.

Linus
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-06-11 Thread Jesse Barnes
On Tuesday, June 10, 2008 8:29 pm Nick Piggin wrote:
 On Wednesday 11 June 2008 05:19, Jesse Barnes wrote:
  On Tuesday, June 10, 2008 12:05 pm Roland Dreier wrote:
 me too.  That's the whole basis for readX_relaxed() and its cohorts:
 we make our weirdest machines (like altix) conform to the x86 norm.
 Then where it really kills us we introduce additional semantics to
 selected drivers that enable us to recover I/O speed on the abnormal
 platforms.
  
   Except as I pointed out before, Altix doesn't conform to the norm and
   many (most?) drivers are missing mmiowb()s that are needed for Altix.
   Just no one has plugged most devices into an Altix (or haven't stressed
   the driver in a way that exposes problems of IO ordering between CPUs).
  
   It would be a great thing to use the powerpc trick of setting a flag
   that is tested by spin_unlock()/mutex_unlock() and automatically doing
   the mmiowb() if needed, and then killing off mmiowb() entirely.
 
  Yeah I think that's what Nick's guidelines would guarantee.  And Jes is
  already working on the spin_unlock change you mentioned, so mmiowb()
  should be history soon (in name only, assuming Nick also introduces the
  I/O barriers he talked about for ordering the looser accessors it would
  still be there but would be called io_wmb or something).

 Exactly, yes. I guess everybody has had good intentions here, but
 as noticed, what is lacking is coordination and documentation.

 You mention strong ordering WRT spin_unlock, which suggests that
 you would prefer to take option #2 (the current powerpc one): io/io
 is ordered and io is contained inside spinlocks, but io/cacheable
 in general is not ordered.

I was thinking it would be good for the weaker accessors, but now that I think 
about it you could just use the new io_* barrier functions.

I didn't mean to imply that I wasn't in favor of the io/cacheable ordering as 
well.

 For any high performance drivers that are well maintained (ie. the
 ones where slowdown might be noticed), everyone should have a pretty
 good handle on memory ordering requirements, so it shouldn't take
 long to go through and convert them to relaxed accessors.

Yep.  Thanks for working on this, Nick, it's definitely a good thing that 
you're taking control of it. :)

Thanks,
Jesse
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-06-10 Thread Nick Piggin
On Wednesday 04 June 2008 05:07, Linus Torvalds wrote:
 On Tue, 3 Jun 2008, Trent Piepho wrote:
  On Tue, 3 Jun 2008, Linus Torvalds wrote:
   On Tue, 3 Jun 2008, Nick Piggin wrote:
Linus: on x86, memory operations to wc and wc+ memory are not ordered
with one another, or operations to other memory types (ie. load/load
and store/store reordering is allowed). Also, as you know, store/load
reordering is explicitly allowed as well, which covers all memory
types. So perhaps it is not quite true to say readl/writel is
strongly ordered by default even on x86. You would have to put in
some mfence instructions in them to make it so.
 
  So on x86, these could be re-ordered?
 
  writel(START_OPERATION, CONTROL_REGISTER);
  status = readl(STATUS_REGISTER);

 With both registers in a WC+ area, yes. The write may be in the WC buffers
 until the WC buffers are flushed (short list: a fence, a serializing
 instruction, a read-write to uncached memory, or an interrupt. There are
 others, but those are the main ones).

 But if the status register is in uncached memory (which is the only *sane*
 thing to do), then it doesn't matter if the control register is in WC
 memory. Because the status register read is itself serializing with the WC
 buffer, it's actually fine.

 So this is used for putting things like ring queues in WC memory, and fill
 them up with writes, and get nice bursty write traffic with the CPU
 automatically buffering it up (think stdio.h on a really low level). And
 if you then have the command registers in UC memory or using IO port
 accesses, reading and writing to them will automatically serialize.

OK, I'm sitll not quite sure where this has ended up. I guess you are happy
with x86 semantics as they are now. That is, all IO accesses are strongly
ordered WRT one another and WRT cacheable memory (which includes keeping
them within spinlocks), *unless* one asks for WC memory, in which case that
memory is quite weakly ordered (and is not even ordered by a regular IO
readl, at least according to AMD spec). So for WC memory, one still needs
to use mb/rmb/wmb.

So that still doesn't tell us what *minimum* level of ordering we should
provide in the cross platform readl/writel API. Some relatively sane
suggestions would be:

- as strong as x86. guaranteed not to break drivers that work on x86,
  but slower on some archs. To me, this is most pleasing. It is much
  much easier to notice something is going a little slower and to work
  out how to use weaker ordering there, than it is to debug some
  once-in-a-bluemoon breakage caused by just the right architecture,
  driver, etc. It totally frees up the driver writer from thinking
  about barriers, provided they get the locking right.

- ordered WRT other IO accessors, constrained within spinlocks, but not
  cacheable memory. This is what powerpc does now. It's a little faster
  for them, and probably covers the vast majority of drivers, but there
  are real possibilities to get it wrong (trivial example: using bit
  locks or mutexes or any kind of open coded locking or lockless
  synchronisation can break).

- (less sane) same as above, but not ordered WRT spinlocks. This is what
  ia64 (sn2) does. From a purist POV, it is a little less arbitrary than
  powerpc, but in practice, it will break a lot more drivers than powerpc.

I was kind of joking about taking control of this issue :) But seriously,
it needs a decision to be made. I vote for #1. My rationale: I'm still
finding relatively major (well, found maybe 4 or 5 in the last couple of
years) bugs in the mm subsystem due to memory ordering problems. This is
apparently one of the most well reviewed and tested bit of code in the
kernel by people who know all about memory ordering. Not to mention that
mm/ does not have to worry about IO ordering at all. Then apparently
driver are the least reviewed and tested. Connect dots.

Now that doesn't leave waker ordering architectures lumped with slow old
x86 semantics. Think of it as giving them the benefit of sharing x86
development and testing :) We can then formalise the relaxed __ accessors
to be more complete (ie. +/- byteswapping). I'd also propose to add
io_rmb/io_wmb/io_mb that order io/io access, to help architectures like
sn2 where the io/cacheable barrier is pretty expensive.

Any comments?
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-06-10 Thread Jesse Barnes
On Monday, June 09, 2008 11:56 pm Nick Piggin wrote:
 So that still doesn't tell us what *minimum* level of ordering we should
 provide in the cross platform readl/writel API. Some relatively sane
 suggestions would be:

 - as strong as x86. guaranteed not to break drivers that work on x86,
   but slower on some archs. To me, this is most pleasing. It is much
   much easier to notice something is going a little slower and to work
   out how to use weaker ordering there, than it is to debug some
   once-in-a-bluemoon breakage caused by just the right architecture,
   driver, etc. It totally frees up the driver writer from thinking
   about barriers, provided they get the locking right.

 - ordered WRT other IO accessors, constrained within spinlocks, but not
   cacheable memory. This is what powerpc does now. It's a little faster
   for them, and probably covers the vast majority of drivers, but there
   are real possibilities to get it wrong (trivial example: using bit
   locks or mutexes or any kind of open coded locking or lockless
   synchronisation can break).

 - (less sane) same as above, but not ordered WRT spinlocks. This is what
   ia64 (sn2) does. From a purist POV, it is a little less arbitrary than
   powerpc, but in practice, it will break a lot more drivers than powerpc.

 I was kind of joking about taking control of this issue :) But seriously,
 it needs a decision to be made. I vote for #1. My rationale: I'm still
 finding relatively major (well, found maybe 4 or 5 in the last couple of
 years) bugs in the mm subsystem due to memory ordering problems. This is
 apparently one of the most well reviewed and tested bit of code in the
 kernel by people who know all about memory ordering. Not to mention that
 mm/ does not have to worry about IO ordering at all. Then apparently
 driver are the least reviewed and tested. Connect dots.

 Now that doesn't leave waker ordering architectures lumped with slow old
 x86 semantics. Think of it as giving them the benefit of sharing x86
 development and testing :) We can then formalise the relaxed __ accessors
 to be more complete (ie. +/- byteswapping). I'd also propose to add
 io_rmb/io_wmb/io_mb that order io/io access, to help architectures like
 sn2 where the io/cacheable barrier is pretty expensive.

 Any comments?

FWIW that approach sounds pretty good to me.  Arches that suffer from 
performance penalties can still add lower level primitives and port selected 
drivers over, so really they won't be losing much.  AFAICT though drivers 
will still have to worry about regular memory ordering issues; they'll just 
be safe from I/O related ones. :)  Still, the simplification is probably 
worth it.

Jesse

___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-06-10 Thread James Bottomley
On Tue, 2008-06-10 at 10:41 -0700, Jesse Barnes wrote:
 On Monday, June 09, 2008 11:56 pm Nick Piggin wrote:
  So that still doesn't tell us what *minimum* level of ordering we should
  provide in the cross platform readl/writel API. Some relatively sane
  suggestions would be:
 
  - as strong as x86. guaranteed not to break drivers that work on x86,
but slower on some archs. To me, this is most pleasing. It is much
much easier to notice something is going a little slower and to work
out how to use weaker ordering there, than it is to debug some
once-in-a-bluemoon breakage caused by just the right architecture,
driver, etc. It totally frees up the driver writer from thinking
about barriers, provided they get the locking right.
 
  - ordered WRT other IO accessors, constrained within spinlocks, but not
cacheable memory. This is what powerpc does now. It's a little faster
for them, and probably covers the vast majority of drivers, but there
are real possibilities to get it wrong (trivial example: using bit
locks or mutexes or any kind of open coded locking or lockless
synchronisation can break).
 
  - (less sane) same as above, but not ordered WRT spinlocks. This is what
ia64 (sn2) does. From a purist POV, it is a little less arbitrary than
powerpc, but in practice, it will break a lot more drivers than powerpc.
 
  I was kind of joking about taking control of this issue :) But seriously,
  it needs a decision to be made. I vote for #1. My rationale: I'm still
  finding relatively major (well, found maybe 4 or 5 in the last couple of
  years) bugs in the mm subsystem due to memory ordering problems. This is
  apparently one of the most well reviewed and tested bit of code in the
  kernel by people who know all about memory ordering. Not to mention that
  mm/ does not have to worry about IO ordering at all. Then apparently
  driver are the least reviewed and tested. Connect dots.
 
  Now that doesn't leave waker ordering architectures lumped with slow old
  x86 semantics. Think of it as giving them the benefit of sharing x86
  development and testing :) We can then formalise the relaxed __ accessors
  to be more complete (ie. +/- byteswapping). I'd also propose to add
  io_rmb/io_wmb/io_mb that order io/io access, to help architectures like
  sn2 where the io/cacheable barrier is pretty expensive.
 
  Any comments?
 
 FWIW that approach sounds pretty good to me.  Arches that suffer from 
 performance penalties can still add lower level primitives and port selected 
 drivers over, so really they won't be losing much.  AFAICT though drivers 
 will still have to worry about regular memory ordering issues; they'll just 
 be safe from I/O related ones. :)  Still, the simplification is probably 
 worth it.

me too.  That's the whole basis for readX_relaxed() and its cohorts: we
make our weirdest machines (like altix) conform to the x86 norm.  Then
where it really kills us we introduce additional semantics to selected
drivers that enable us to recover I/O speed on the abnormal platforms.

About the only problem we've had is that architectures aren't very good
at co-ordinating for their additional accessors so we tend to get a
forest of strange ones growing up, which appear only in a few drivers
(i.e. the ones that need the speed ups) and which have no well
documented meaning.

James


___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-06-10 Thread Roland Dreier
  me too.  That's the whole basis for readX_relaxed() and its cohorts: we
  make our weirdest machines (like altix) conform to the x86 norm.  Then
  where it really kills us we introduce additional semantics to selected
  drivers that enable us to recover I/O speed on the abnormal platforms.

Except as I pointed out before, Altix doesn't conform to the norm and
many (most?) drivers are missing mmiowb()s that are needed for Altix.
Just no one has plugged most devices into an Altix (or haven't stressed
the driver in a way that exposes problems of IO ordering between CPUs).

It would be a great thing to use the powerpc trick of setting a flag
that is tested by spin_unlock()/mutex_unlock() and automatically doing
the mmiowb() if needed, and then killing off mmiowb() entirely.

 - R.
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-06-10 Thread Jesse Barnes
On Tuesday, June 10, 2008 12:05 pm Roland Dreier wrote:
   me too.  That's the whole basis for readX_relaxed() and its cohorts: we
   make our weirdest machines (like altix) conform to the x86 norm.  Then
   where it really kills us we introduce additional semantics to selected
   drivers that enable us to recover I/O speed on the abnormal platforms.

 Except as I pointed out before, Altix doesn't conform to the norm and
 many (most?) drivers are missing mmiowb()s that are needed for Altix.
 Just no one has plugged most devices into an Altix (or haven't stressed
 the driver in a way that exposes problems of IO ordering between CPUs).

 It would be a great thing to use the powerpc trick of setting a flag
 that is tested by spin_unlock()/mutex_unlock() and automatically doing
 the mmiowb() if needed, and then killing off mmiowb() entirely.

Yeah I think that's what Nick's guidelines would guarantee.  And Jes is 
already working on the spin_unlock change you mentioned, so mmiowb() should 
be history soon (in name only, assuming Nick also introduces the I/O barriers 
he talked about for ordering the looser accessors it would still be there but 
would be called io_wmb or something).

Jesse

___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-06-10 Thread Nick Piggin
On Wednesday 11 June 2008 05:19, Jesse Barnes wrote:
 On Tuesday, June 10, 2008 12:05 pm Roland Dreier wrote:
me too.  That's the whole basis for readX_relaxed() and its cohorts:
we make our weirdest machines (like altix) conform to the x86 norm. 
Then where it really kills us we introduce additional semantics to
selected drivers that enable us to recover I/O speed on the abnormal
platforms.
 
  Except as I pointed out before, Altix doesn't conform to the norm and
  many (most?) drivers are missing mmiowb()s that are needed for Altix.
  Just no one has plugged most devices into an Altix (or haven't stressed
  the driver in a way that exposes problems of IO ordering between CPUs).
 
  It would be a great thing to use the powerpc trick of setting a flag
  that is tested by spin_unlock()/mutex_unlock() and automatically doing
  the mmiowb() if needed, and then killing off mmiowb() entirely.

 Yeah I think that's what Nick's guidelines would guarantee.  And Jes is
 already working on the spin_unlock change you mentioned, so mmiowb() should
 be history soon (in name only, assuming Nick also introduces the I/O
 barriers he talked about for ordering the looser accessors it would still
 be there but would be called io_wmb or something).

Exactly, yes. I guess everybody has had good intentions here, but
as noticed, what is lacking is coordination and documentation.

You mention strong ordering WRT spin_unlock, which suggests that
you would prefer to take option #2 (the current powerpc one): io/io
is ordered and io is contained inside spinlocks, but io/cacheable
in general is not ordered.

I *would* prefer that io/cacheable actually is strongly ordered with
the default accessors. Because if you have that, then the driver
writer never has to care about memory ordering, provided they use
correct locking for SMP issues. Same as x86. With option 2, there
are still windows where you could possibly have issues.

For any high performance drivers that are well maintained (ie. the
ones where slowdown might be noticed), everyone should have a pretty
good handle on memory ordering requirements, so it shouldn't take
long to go through and convert them to relaxed accessors.
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-06-10 Thread Benjamin Herrenschmidt
On Wed, 2008-06-11 at 13:29 +1000, Nick Piggin wrote:
 
 Exactly, yes. I guess everybody has had good intentions here, but
 as noticed, what is lacking is coordination and documentation.
 
 You mention strong ordering WRT spin_unlock, which suggests that
 you would prefer to take option #2 (the current powerpc one): io/io
 is ordered and io is contained inside spinlocks, but io/cacheable
 in general is not ordered.

IO/cacheable -is- ordered on powepc in what we believe is the direction
that matter: IO reads are fully ordered vs. anything and IO writes are
ordered vs. previous cacheable stores. The only relaxed situation is
IO writes followed by cacheable stores, which I believe shouldn't be
a problem. (except for spinlocks for which we use the flag trick)

 I *would* prefer that io/cacheable actually is strongly ordered with
 the default accessors. Because if you have that, then the driver
 writer never has to care about memory ordering, provided they use
 correct locking for SMP issues. Same as x86. With option 2, there
 are still windows where you could possibly have issues.
 
 For any high performance drivers that are well maintained (ie. the
 ones where slowdown might be noticed), everyone should have a pretty
 good handle on memory ordering requirements, so it shouldn't take
 long to go through and convert them to relaxed accessors.

Ben.


___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-06-10 Thread Nick Piggin
On Wednesday 11 June 2008 13:40, Benjamin Herrenschmidt wrote:
 On Wed, 2008-06-11 at 13:29 +1000, Nick Piggin wrote:
  Exactly, yes. I guess everybody has had good intentions here, but
  as noticed, what is lacking is coordination and documentation.
 
  You mention strong ordering WRT spin_unlock, which suggests that
  you would prefer to take option #2 (the current powerpc one): io/io
  is ordered and io is contained inside spinlocks, but io/cacheable
  in general is not ordered.

 IO/cacheable -is- ordered on powepc in what we believe is the direction
 that matter: IO reads are fully ordered vs. anything and IO writes are
 ordered vs. previous cacheable stores. The only relaxed situation is
 IO writes followed by cacheable stores, which I believe shouldn't be
 a problem. (except for spinlocks for which we use the flag trick)

Spinlocks... mutexes, semaphores, rwsems, rwlocks, bit spinlocks, bit
mutexes, open coded bit locks (of which there are a number floating
around in drivers/).

But even assuming you get all of that fixed up. I wonder what is such
a big benefit to powerpc that you'll rather add the exception cacheable
stores are not ordered with previous io stores than to say any driver
which works on x86 will work on powerpc as far as memory ordering goes?
(don't you also need something to order io reads with cacheable reads?
as per my observation that your rmb is broken according to IBM docs)

Obviously you already have a sync instruction in your writel, so 1)
adding a second one doesn't slow it down by an order of mangnitude or
anything, just some small factor; and 2) you obviously want to be
converting high performance drivers to a more relaxed model anyway
regardless of whether there is one sync or two in there.

Has this ever been measured or thought carefully about? Beyond the
extent of one sync good, two sync bad ;)
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-06-10 Thread Paul Mackerras
Nick Piggin writes:

 OK, I'm sitll not quite sure where this has ended up. I guess you are happy
 with x86 semantics as they are now. That is, all IO accesses are strongly
 ordered WRT one another and WRT cacheable memory (which includes keeping
 them within spinlocks),

My understanding was that on x86, loads could pass stores in general,
i.e. a later load could be performed before an earlier store.  I guess
that can't be true for uncached loads, but could a cacheable load be
performed before an earlier uncached store?

 - as strong as x86. guaranteed not to break drivers that work on x86,
   but slower on some archs. To me, this is most pleasing. It is much
   much easier to notice something is going a little slower and to work
   out how to use weaker ordering there, than it is to debug some
   once-in-a-bluemoon breakage caused by just the right architecture,
   driver, etc. It totally frees up the driver writer from thinking
   about barriers, provided they get the locking right.

I just wish we had even one actual example of things going wrong with
the current rules we have on powerpc to motivate changing to this
model.

 Now that doesn't leave waker ordering architectures lumped with slow old
 x86 semantics. Think of it as giving them the benefit of sharing x86
 development and testing :) We can then formalise the relaxed __ accessors
 to be more complete (ie. +/- byteswapping).

That leaves a gulf between the extremely strongly ordered writel
etc. and the extremely weakly ordered __writel etc.  The current
powerpc scheme is fine for a lot of drivers but your proposal would
leave us no way to deliver it to them.

Paul.
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-06-10 Thread Nick Piggin
On Wednesday 11 June 2008 14:18, Paul Mackerras wrote:
 Nick Piggin writes:
  OK, I'm sitll not quite sure where this has ended up. I guess you are
  happy with x86 semantics as they are now. That is, all IO accesses are
  strongly ordered WRT one another and WRT cacheable memory (which includes
  keeping them within spinlocks),

 My understanding was that on x86, loads could pass stores in general,
 i.e. a later load could be performed before an earlier store.

Yes, this is the one reordering allowed by the ISA on cacheable memory.
WC memory is weaker, which Linus wants to allow exception for because
one must explicitly ask for it. UC memory (which presumably is what
we're talking about as IO access) I think is stronger in that it does
not allow any reordering between one another or any cacheable access:

AMD says this:

c — A store (wp,wt,wb,uc,wc,wc+) may not pass a previous load 
(wp,wt,wb,uc,wc,wc+).
f — A load (uc) does not pass a previous store (wp,wt,wb,uc,wc,wc+).
g — A store (wp,wt,wb,uc) does not pass a previous store (wp,wt,wb,uc).
i — A load (wp,wt,wb,wc,wc+) does not pass a previous store (uc).

AMD does allow WC/WC+ to be weakly ordered WRT WC as well as UC, which
Intel seemingly does not. But we're alrady treating WC as special.

I can't actually find the definitive statement in the Intel manuals
saying UC is strongly ordered also WRT WB. Linus?


 I guess 
 that can't be true for uncached loads, but could a cacheable load be
 performed before an earlier uncached store?
  - as strong as x86. guaranteed not to break drivers that work on x86,
but slower on some archs. To me, this is most pleasing. It is much
much easier to notice something is going a little slower and to work
out how to use weaker ordering there, than it is to debug some
once-in-a-bluemoon breakage caused by just the right architecture,
driver, etc. It totally frees up the driver writer from thinking
about barriers, provided they get the locking right.

 I just wish we had even one actual example of things going wrong with
 the current rules we have on powerpc to motivate changing to this
 model.

~/usr/src/linux-2.6 git grep test_and_set_bit drivers/ | wc -l
506
How sure are you that none of those forms part of a cobbled-together
locking scheme that hopes to constrain some IO access?

~/usr/src/linux-2.6 git grep test_and_set_bit drivers/ | grep while | wc -l
29
How about those?

~/usr/src/linux-2.6 git grep mutex_lock drivers/ | wc -l
3138
How sure are you that none of them is hoping to constrain IO operations
within the lock?

Also grep for down, down_write, write_lock, and maybe some others I've
forgotten. And then forget completely about locking and imagine some
of the open coded things you see around the place (or parts where drivers
try to get creative and open code their own locking or try lockless
things).


  Now that doesn't leave waker ordering architectures lumped with slow old
  x86 semantics. Think of it as giving them the benefit of sharing x86
  development and testing :) We can then formalise the relaxed __ accessors
  to be more complete (ie. +/- byteswapping).

 That leaves a gulf between the extremely strongly ordered writel
 etc. and the extremely weakly ordered __writel etc.  The current
 powerpc scheme is fine for a lot of drivers but your proposal would
 leave us no way to deliver it to them.

But surely you have to audit the drivers anyway to ensure they are OK
with the current powerpc scheme. In which case, once you have audited
them and know they are safe, you can easily convert them to the even
_faster_ __readl/__writel, and just add the appropriate barriers.

___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev

Re: MMIO and gcc re-ordering issue

2008-06-10 Thread Paul Mackerras
Nick Piggin writes:

  I just wish we had even one actual example of things going wrong with
  the current rules we have on powerpc to motivate changing to this
  model.
 
 ~/usr/src/linux-2.6 git grep test_and_set_bit drivers/ | wc -l
 506
 How sure are you that none of those forms part of a cobbled-together
 locking scheme that hopes to constrain some IO access?

My comment was precisely about the fact that this sort of argument is
actually FUD.  I want one example that is demonstrably wrong, not just
a how sure are you.

 But surely you have to audit the drivers anyway to ensure they are OK
 with the current powerpc scheme. In which case, once you have audited
 them and know they are safe, you can easily convert them to the even
 _faster_ __readl/__writel, and just add the appropriate barriers.

The trouble is that as code gets maintained, using __writel + explicit
barriers is more fragile because some people will change the code, or
add new code, without understanding the barriers.  So whenever a
driver gets converted to using __writel + barriers, we will end up
having to watch every change that goes into it forever.  Whereas with
the current scheme there's a much smaller set of gotchas to watch out
for, and the gotchas are things that already raise red flags, such as
open-coded locking and any sort of clever lockless scheme.

Paul.
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-06-05 Thread Jes Sorensen

Jesse Barnes wrote:
Now, in hindsight, using a PIO write set  test flag approach in 
writeX/spin_unlock (ala powerpc) might have been a better approach, but iirc 
that never came up in the discussion, probably because we were focused on PCI 
posting and not uncached vs. cached ordering.


Hi Jesse,

I am going to take a stab at implementing this so we can see how much
of an impact it will have.

Cheers,
Jes
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-06-05 Thread Benjamin Herrenschmidt
On Thu, 2008-06-05 at 10:40 +0200, Jes Sorensen wrote:
 Jesse Barnes wrote:
  Now, in hindsight, using a PIO write set  test flag approach in 
  writeX/spin_unlock (ala powerpc) might have been a better approach, but 
  iirc 
  that never came up in the discussion, probably because we were focused on 
  PCI 
  posting and not uncached vs. cached ordering.
 
 Hi Jesse,
 
 I am going to take a stab at implementing this so we can see how much
 of an impact it will have.

Note that the powerpc implementation currently clears the flag
on spin_lock and tests it on unlock. We are considering changing
that to not touch the flag on spin_lock and just clear it whenever
we do a sync (ie, on unlock, on explicit mmiowb, and possibly even
on readl's where we happen to do sync's).

Ben.


___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-06-04 Thread Linus Torvalds


On Mon, 2 Jun 2008, Haavard Skinnemoen wrote:
 
  So what happened to the old idea of putting the accessor function pointers
  in the device/bus structure?
 
 Don't know. I think it sounds like overkill to replace a simple load or
 store with an indirect function call.

Indeed. *Especially* as the driver in practice tends to always know which 
one it is statically.

Yes, sometimes the same RTL may end up being used behind two differnt 
buses, and with some broken byte-conversion hardware in the middle, but 
that's pretty damn rare in the end. It happens, but it happens mostly with 
the odd broken kind of embedded setups where somebody took a standard part 
and then did something _strange_ to it - and in the process they may well 
have introduced other issues as well.

I suspect you find this kind of thing mostly with things like stupid 
integrated IDE controllers, and things like byte order tends to be the 
_least_ of the issues (specialized register accesses with odd offsets etc 
will happen).

So most of the time the byte order is simply decided by the hardware 
itself (and LE is the common one, due to ISA/PCI). Sometimes you can set 
it dynamically (at which point it's irrelevant - just pick the one you 
want).

So I definitely argue against complex IO accessor functions with things 
like dynamic support for byte order depending on bus. 99.9% of all 
hardware simply do not need them, and the small percentage that might need 
it will need special drivers anyway, so they might as well do the 
complexity on their own rather than make everybody else care.

See for example the input driver for the i8042 chip: not only do those 
things sometimes need native order (probably exactly because it's 
integrated on a chip over some special native bus, or just wired up to 
be big-endian on a BE platform by some moron), but you'll also find that 
they just need odd accesses anyway exactly because it's oddly wired up 
(eg it's some special serial protocol that emulates ISA accesses).

So having some byte-order correcting function still wouldn't make any 
sense, because it's not just about byte order.

Will that mean that you might occasionally have a normal driver and an 
odd driver that actually drives what is basically the same HW block, 
just wired up differently? Sure. But that's still a smaller price to pay 
than it would be to try to make everything be generic when there is no 
point to it.

Linus
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-06-03 Thread Jeremy Higdon
On Tue, Jun 03, 2008 at 06:19:05PM +1000, Nick Piggin wrote:
 On Tuesday 03 June 2008 18:15, Jeremy Higdon wrote:
  On Tue, Jun 03, 2008 at 02:33:11PM +1000, Nick Piggin wrote:
   On Monday 02 June 2008 19:56, Jes Sorensen wrote:
Would we be able to use Ben's trick of setting a per cpu flag in
writel() then and checking that in spin unlock issuing the mmiowb()
there if needed?
  
   Yes you could, but your writels would still not be strongly ordered
   within (or outside) spinlock regions, which is what Linus wants (and
   I kind of agree with).
 
  Yes they would be.  Writes from the same CPU are always ordered.  Writes
  from different CPUs are not, but that's only a concern if you protect
 
 They are not strongly ordered WRT writes to cacheable memory. If they
 were, then they would not leak out of spinlocks.

No posted writes are.
As I recall, the outX functions are not supposed to be posted, so the sn2
versions issue a mmiowb in the outX.  But writeX has always been posted,
or at least postable.  I thought that was generally accepted.

Normally, the only way for a device to see cacheable memory is via a DMA
read, and you are guaranteed on sn2 that in the following:

store of of A to location X
mmio write to device
device DMA read from location X

that the device will see A.  In the past, on some other archs, you'd have
to flush the Dcache for that to work.

Granted, if the compiler reorders the store and the mmio write, then you
have a problem.

jeremy
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-06-03 Thread Nick Piggin
On Tuesday 03 June 2008 18:15, Jeremy Higdon wrote:
 On Tue, Jun 03, 2008 at 02:33:11PM +1000, Nick Piggin wrote:
  On Monday 02 June 2008 19:56, Jes Sorensen wrote:

   Would we be able to use Ben's trick of setting a per cpu flag in
   writel() then and checking that in spin unlock issuing the mmiowb()
   there if needed?
 
  Yes you could, but your writels would still not be strongly ordered
  within (or outside) spinlock regions, which is what Linus wants (and
  I kind of agree with).

 Yes they would be.  Writes from the same CPU are always ordered.  Writes
 from different CPUs are not, but that's only a concern if you protect

They are not strongly ordered WRT writes to cacheable memory. If they
were, then they would not leak out of spinlocks.

___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-06-03 Thread Haavard Skinnemoen
Scott Wood [EMAIL PROTECTED] wrote:
 On Mon, Jun 02, 2008 at 10:11:02AM +0200, Haavard Skinnemoen wrote:
  Geert Uytterhoeven [EMAIL PROTECTED] wrote:
   On Fri, 30 May 2008, Haavard Skinnemoen wrote:
Maybe we need another interface that does not do byteswapping but
provides stronger ordering guarantees?
   
   The byte swapping depends on the device/bus.
  
  Of course. But isn't it reasonable to assume that a device integrated
  on the same silicon as the CPU is connected to a somewhat sane bus
  which doesn't require any byte swapping?
 
 No, unfortunately. :-(

Ok, I guess I was being naive.

 See the end of drivers/dma/fsldma.h.  Likewise with Freescale's PCI host
 bridges; for some reason the bus itself being little endian led to the host
 bridge control registers also being little endian.

Right. But still, isn't it better to handle it on a case-by-case basis
in the drivers? In some cases, it's best to explicitly use a certain
byte order, in others it's best to use whatever is native to the CPU.

Haavard
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-06-03 Thread Benjamin Herrenschmidt
On Tue, 2008-06-03 at 16:11 +1000, Nick Piggin wrote:
- readl is synchronous (ie, makes the CPU think the
data was actually used before executing subsequent
instructions, thus waits for the data to come back,
for example to ensure that a read used to push out
post buffers followed by a delay will indeed happen
with the right delay).
 
 So your readl can pass an earlier cacheable store or earlier writel?

I forgot to mention that all MMIO are ordered vs. each other and I
do prevent readl from passing earlier cacheable stores too in my
current implementation but I'n not 100% we want to guarantee that,
unless we have stupid devices that trigger DMA's on reads with side
effects.. anyway, it is guaranteed in the current case.

Ben.


___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-06-03 Thread Nick Piggin
On Tuesday 03 June 2008 16:53, Paul Mackerras wrote:
 Nick Piggin writes:
  So your readl can pass an earlier cacheable store or earlier writel?

 No.  It's quite gross at the moment, it has a sync before the access
 (i.e. a full mb()) and a twi; isync sequence after the access that
 stalls execution until the data comes back.

OK.


   We don't provide meaningless ones like writel + cacheable store for
   example. (PCI posting would defeat it anyway).
 
  What do you mean by meaningless? Ordering of writel followed by a
  cacheable store  is meaningful eg. for retaining io operations within
  locks. OK, you explicitly have some extra code for spin_unlock, but
  not for bit locks, mutexes, etc. It would make sense to have the
  default operations _very_ strongly ordered IMO, and then move drivers
  to be more relaxed when they are verified.

 It's meaningless in the sense that nothing guarantees that the writel
 has actually hit the device, even if we put a full mb() barrier in
 between the writel and the cacheable write.  That would guarantee that
 the writel had got to the PCI host bridge, and couldn't be reordered
 with other accesses to the device, but not that the device had
 actually seen it.

OK, but I think fits OK with our SMP ordering model for cacheable
stores: no amount of barriers on CPU0 will guarantee that CPU1 has
seen the store, you actually have to observe a causual effect
of the store before you can say that.


 I don't mind adding code to the mutex unlock to do the same as
 spin_unlock, but I really don't want to have *two* sync instructions
 for every MMIO write.  One is bad enough.

So you can't provide iostore/store ordering without another sync
after the writel store?

I guess the problem with providing exceptions is that it makes it
hard for people who absolutely don't know or care about ordering.
I don't like having to think about it hmm, we can allow this type
of reordering... oh, unless some silly device does X

If we come up with a sane set of weakly ordered accessors
(including io_*mb()), it will make it easier to go through the
important drivers and improve them. We don't have to enforce the
the new semantics strictly until then if they'll slow you down
too much in the meantime.
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-06-03 Thread Jeremy Higdon
On Tue, Jun 03, 2008 at 02:33:11PM +1000, Nick Piggin wrote:
 On Monday 02 June 2008 19:56, Jes Sorensen wrote:
  Jeremy Higdon wrote:
   We don't actually have that problem on the Altix.  All writes issued
   by CPU X will be ordered with respect to each other.  But writes by
   CPU X and CPU Y will not be, unless an mmiowb() is done by the
   original CPU before the second CPU writes.  I.e.
  
 CPU X   writel
 CPU X   writel
 CPU X   mmiowb
  
 CPU Y   writel
 ...
  
   Note that this implies some sort of locking.  Also note that if in
   the above, CPU Y did the mmiowb, that would not work.
 
  Hmmm,
 
  Then it's less bad than I thought - my apologies for the confusion.
 
  Would we be able to use Ben's trick of setting a per cpu flag in
  writel() then and checking that in spin unlock issuing the mmiowb()
  there if needed?
 
 Yes you could, but your writels would still not be strongly ordered
 within (or outside) spinlock regions, which is what Linus wants (and
 I kind of agree with).

Yes they would be.  Writes from the same CPU are always ordered.  Writes
from different CPUs are not, but that's only a concern if you protect
writing via some sort of lock.  If the lock release forces a barrier,
that should take care of the problem.

jeremy
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-06-03 Thread Paul Mackerras
Nick Piggin writes:

 So your readl can pass an earlier cacheable store or earlier writel?

No.  It's quite gross at the moment, it has a sync before the access
(i.e. a full mb()) and a twi; isync sequence after the access that
stalls execution until the data comes back.

  We don't provide meaningless ones like writel + cacheable store for
  example. (PCI posting would defeat it anyway).
 
 What do you mean by meaningless? Ordering of writel followed by a
 cacheable store  is meaningful eg. for retaining io operations within
 locks. OK, you explicitly have some extra code for spin_unlock, but
 not for bit locks, mutexes, etc. It would make sense to have the
 default operations _very_ strongly ordered IMO, and then move drivers
 to be more relaxed when they are verified.

It's meaningless in the sense that nothing guarantees that the writel
has actually hit the device, even if we put a full mb() barrier in
between the writel and the cacheable write.  That would guarantee that
the writel had got to the PCI host bridge, and couldn't be reordered
with other accesses to the device, but not that the device had
actually seen it.

I don't mind adding code to the mutex unlock to do the same as
spin_unlock, but I really don't want to have *two* sync instructions
for every MMIO write.  One is bad enough.

Paul.
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-06-03 Thread Linus Torvalds


On Tue, 3 Jun 2008, Nick Piggin wrote:
 
 Linus: on x86, memory operations to wc and wc+ memory are not ordered
 with one another, or operations to other memory types (ie. load/load
 and store/store reordering is allowed). Also, as you know, store/load
 reordering is explicitly allowed as well, which covers all memory
 types. So perhaps it is not quite true to say readl/writel is strongly
 ordered by default even on x86. You would have to put in some
 mfence instructions in them to make it so.

Well, you have to ask for WC/WC+ anyway, so it's immaterial. A driver that 
does that needs to be aware of it. IOW, it's a non-issue, imnsho.

Linus
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-06-03 Thread Linus Torvalds


On Tue, 3 Jun 2008, Trent Piepho wrote:

 On Tue, 3 Jun 2008, Linus Torvalds wrote:
  On Tue, 3 Jun 2008, Nick Piggin wrote:
   
   Linus: on x86, memory operations to wc and wc+ memory are not ordered
   with one another, or operations to other memory types (ie. load/load
   and store/store reordering is allowed). Also, as you know, store/load
   reordering is explicitly allowed as well, which covers all memory
   types. So perhaps it is not quite true to say readl/writel is strongly
   ordered by default even on x86. You would have to put in some
   mfence instructions in them to make it so.
 
 So on x86, these could be re-ordered?
 
 writel(START_OPERATION, CONTROL_REGISTER);
 status = readl(STATUS_REGISTER);

With both registers in a WC+ area, yes. The write may be in the WC buffers 
until the WC buffers are flushed (short list: a fence, a serializing 
instruction, a read-write to uncached memory, or an interrupt. There are 
others, but those are the main ones).

But if the status register is in uncached memory (which is the only *sane* 
thing to do), then it doesn't matter if the control register is in WC 
memory. Because the status register read is itself serializing with the WC 
buffer, it's actually fine.

So this is used for putting things like ring queues in WC memory, and fill 
them up with writes, and get nice bursty write traffic with the CPU 
automatically buffering it up (think stdio.h on a really low level). And 
if you then have the command registers in UC memory or using IO port 
accesses, reading and writing to them will automatically serialize.

  Well, you have to ask for WC/WC+ anyway, so it's immaterial. A driver that
  does that needs to be aware of it. IOW, it's a non-issue, imnsho.
 
 You need to ask for coherent DMA memory too.

Not on x86.

And I don't care what anybody else says - x86 is *so* totally dominant, 
that other architectures have to live with the fact that 99.9% of all 
drivers are written for and tested on x86. As a result, anything else is 
theory.

Are some drivers good and are careful? Yes. Are most? No, and even if they 
were, people making changes would mostly test them on x86. Architectures 
should strive to basically act as closely as possible to x86 semantics in 
order to not get nasty surprises.

And architectures that can't do that well enough *will* often find that 
they need to fix drivers, and only a small small subset of all kernel 
drivers will generally work out-of-the-box.

If you're ready to do that, your architecture can do anything it damn well 
pleases ;)

Linus
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-06-03 Thread Trent Piepho

On Tue, 3 Jun 2008, Linus Torvalds wrote:

On Tue, 3 Jun 2008, Nick Piggin wrote:


Linus: on x86, memory operations to wc and wc+ memory are not ordered
with one another, or operations to other memory types (ie. load/load
and store/store reordering is allowed). Also, as you know, store/load
reordering is explicitly allowed as well, which covers all memory
types. So perhaps it is not quite true to say readl/writel is strongly
ordered by default even on x86. You would have to put in some
mfence instructions in them to make it so.


So on x86, these could be re-ordered?

writel(START_OPERATION, CONTROL_REGISTER);
status = readl(STATUS_REGISTER);


Well, you have to ask for WC/WC+ anyway, so it's immaterial. A driver that
does that needs to be aware of it. IOW, it's a non-issue, imnsho.


You need to ask for coherent DMA memory too.
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-06-03 Thread Matthew Wilcox
On Tue, Jun 03, 2008 at 11:47:00AM -0700, Trent Piepho wrote:
 On Tue, 3 Jun 2008, Linus Torvalds wrote:
 On Tue, 3 Jun 2008, Nick Piggin wrote:
 
 Linus: on x86, memory operations to wc and wc+ memory are not ordered
 with one another, or operations to other memory types (ie. load/load
 and store/store reordering is allowed). Also, as you know, store/load
 reordering is explicitly allowed as well, which covers all memory
 types. So perhaps it is not quite true to say readl/writel is strongly
 ordered by default even on x86. You would have to put in some
 mfence instructions in them to make it so.
 
 So on x86, these could be re-ordered?
 
 writel(START_OPERATION, CONTROL_REGISTER);
 status = readl(STATUS_REGISTER);

You wouldn't ask for write-combining memory mapping for control or
status registers.

 Well, you have to ask for WC/WC+ anyway, so it's immaterial. A driver that
 does that needs to be aware of it. IOW, it's a non-issue, imnsho.
 
 You need to ask for coherent DMA memory too.

Different case.  Coherent DMA memory is *host* memory that the *device*
accesses.  We're talking about *device* memory that the *cpu* accesses.

-- 
Intel are signing my paycheques ... these opinions are still mine
Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step.
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-06-03 Thread Trent Piepho

On Tue, 3 Jun 2008, Matthew Wilcox wrote:

On Tue, Jun 03, 2008 at 11:47:00AM -0700, Trent Piepho wrote:

On Tue, 3 Jun 2008, Linus Torvalds wrote:

On Tue, 3 Jun 2008, Nick Piggin wrote:


Linus: on x86, memory operations to wc and wc+ memory are not ordered
with one another, or operations to other memory types (ie. load/load
and store/store reordering is allowed). Also, as you know, store/load
reordering is explicitly allowed as well, which covers all memory
types. So perhaps it is not quite true to say readl/writel is strongly
ordered by default even on x86. You would have to put in some
mfence instructions in them to make it so.


So on x86, these could be re-ordered?

writel(START_OPERATION, CONTROL_REGISTER);
status = readl(STATUS_REGISTER);


You wouldn't ask for write-combining memory mapping for control or
status registers.


But Nick said, store/load reordering is explicitly allowed as well, which
covers *all* memory types.
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-06-03 Thread Trent Piepho

On Tue, 3 Jun 2008, Nick Piggin wrote:

On Monday 02 June 2008 17:24, Russell King wrote:

So, can the semantics of what's expected from these IO accessor
functions be documented somewhere.  Please?  Before this thread gets
lost in the depths of time?


This whole thread also ties in with my posts about mmiowb (which IMO
should go away).

readl/writel:  strongly ordered wrt one another and other stores
  to cacheable RAM, byteswapping
__readl/__writel:  not ordered (needs mb/rmb/wmb to order with
  other readl/writel and cacheable operations, or
  io_*mb to order with one another)
raw_readl/raw_writel:  strongly ordered, no byteswapping
__raw_readl/__raw_writel:  not ordered, no byteswapping


Byte-swapping vs not byte-swapping is not usually what the programmer wants. 
Usually your device's registers are defined as being big-endian or

little-endian and you want whatever is needed to give you that.

I believe that on some archs that can be either byte order, some built-in
devices will change their registers to match, and so you want native endian
or no swapping for these.  Though that's definitely in the minority.

An accessors that always byte-swaps regardless of the endianness of the host
is never something I've seen a driver want.

IOW, there are four ways one can defined endianness/swapping:
1) Little-endian
2) Big-endian
3) Native-endian aka non-byte-swapping
4) Foreign-endian aka byte-swapping

1 and 2 are by far the most used.  Some code wants 3.  No one wants 4.  Yet
our API is providing 3  4, the two which are the least useful.

Is it enough to provide only all or none for ordering strictness?  For
instance on powerpc, one can get a speedup by dropping strict ordering for IO
vs cacheable memory, but still keeping ordering for IO vs IO and IO vs locks. 
This is much easier to program for than no ordering at all.  In fact, if one

doesn't use coherent DMA, it's basically the same as fully strict ordering.
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-06-03 Thread Matthew Wilcox
On Tue, Jun 03, 2008 at 12:43:21PM -0700, Trent Piepho wrote:
 IOW, there are four ways one can defined endianness/swapping:
 1) Little-endian
 2) Big-endian
 3) Native-endian aka non-byte-swapping
 4) Foreign-endian aka byte-swapping
 
 1 and 2 are by far the most used.  Some code wants 3.  No one wants 4.  Yet
 our API is providing 3  4, the two which are the least useful.

You've fundamentally misunderstood.

readX/writeX and __readX/__writeX provide little-endian access.
__raw_readX provide native-endian.

If you want 2 or 4, define your own accessors.  Some architectures define
other accessors (eg gsc_readX on parisc is native (big) endian, and
works on physical addresses that haven't been ioremapped.  sbus_readX on
sparc64 also seems to be native (big) endian).

 Is it enough to provide only all or none for ordering strictness?  For
 instance on powerpc, one can get a speedup by dropping strict ordering for 
 IO
 vs cacheable memory, but still keeping ordering for IO vs IO and IO vs 
 locks. This is much easier to program for than no ordering at all.  In 
 fact, if one
 doesn't use coherent DMA, it's basically the same as fully strict ordering.

I don't understand why you keep talking about DMA.  Are you talking
about ordering between readX() and DMA?  PCI proides those guarantees.

-- 
Intel are signing my paycheques ... these opinions are still mine
Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step.
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-06-03 Thread Matthew Wilcox
On Tue, Jun 03, 2008 at 12:57:56PM -0700, Trent Piepho wrote:
 On Tue, 3 Jun 2008, Matthew Wilcox wrote:
 On Tue, Jun 03, 2008 at 11:47:00AM -0700, Trent Piepho wrote:
 On Tue, 3 Jun 2008, Linus Torvalds wrote:
 On Tue, 3 Jun 2008, Nick Piggin wrote:
 
 Linus: on x86, memory operations to wc and wc+ memory are not ordered
 with one another, or operations to other memory types (ie. load/load
 and store/store reordering is allowed). Also, as you know, store/load
 reordering is explicitly allowed as well, which covers all memory
 types. So perhaps it is not quite true to say readl/writel is strongly
 ordered by default even on x86. You would have to put in some
 mfence instructions in them to make it so.
 
 So on x86, these could be re-ordered?
 
 writel(START_OPERATION, CONTROL_REGISTER);
 status = readl(STATUS_REGISTER);
 
 You wouldn't ask for write-combining memory mapping for control or
 status registers.
 
 But Nick said, store/load reordering is explicitly allowed as well, which
 covers *all* memory types.

Then Nick is confused.  PCI only defines one way to flush posted writes
to a device -- doing a read from it.  There's no way that reads can
be allowed to pass writes (unless you've asked for it, like with write
combining).

-- 
Intel are signing my paycheques ... these opinions are still mine
Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step.
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-06-03 Thread Trent Piepho

On Tue, 3 Jun 2008, Matthew Wilcox wrote:

On Tue, Jun 03, 2008 at 12:43:21PM -0700, Trent Piepho wrote:

IOW, there are four ways one can defined endianness/swapping:
1) Little-endian
2) Big-endian
3) Native-endian aka non-byte-swapping
4) Foreign-endian aka byte-swapping

1 and 2 are by far the most used.  Some code wants 3.  No one wants 4.  Yet
our API is providing 3  4, the two which are the least useful.


You've fundamentally misunderstood.

readX/writeX and __readX/__writeX provide little-endian access.
__raw_readX provide native-endian.

If you want 2 or 4, define your own accessors.  Some architectures define
other accessors (eg gsc_readX on parisc is native (big) endian, and


How about providing 1 and 2, and if you want 3 or 4 define your own accessors?


Is it enough to provide only all or none for ordering strictness?  For
instance on powerpc, one can get a speedup by dropping strict ordering for
IO
vs cacheable memory, but still keeping ordering for IO vs IO and IO vs
locks. This is much easier to program for than no ordering at all.  In
fact, if one
doesn't use coherent DMA, it's basically the same as fully strict ordering.


I don't understand why you keep talking about DMA.  Are you talking
about ordering between readX() and DMA?  PCI proides those guarantees.


I guess you haven't been reading the whole thread.  The reason it started was
because gcc can re-order powerpc (and everyone else's too) IO accesses vs
accesses to cachable memory (but not spin-locks), which ends up only being a
problem with coherent DMA.
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-06-03 Thread Trent Piepho

On Tue, 3 Jun 2008, Matthew Wilcox wrote:

On Tue, Jun 03, 2008 at 12:57:56PM -0700, Trent Piepho wrote:

On Tue, 3 Jun 2008, Matthew Wilcox wrote:

On Tue, Jun 03, 2008 at 11:47:00AM -0700, Trent Piepho wrote:

On Tue, 3 Jun 2008, Linus Torvalds wrote:

On Tue, 3 Jun 2008, Nick Piggin wrote:


Linus: on x86, memory operations to wc and wc+ memory are not ordered
with one another, or operations to other memory types (ie. load/load
and store/store reordering is allowed). Also, as you know, store/load
reordering is explicitly allowed as well, which covers all memory
types. So perhaps it is not quite true to say readl/writel is strongly
ordered by default even on x86. You would have to put in some
mfence instructions in them to make it so.


So on x86, these could be re-ordered?

writel(START_OPERATION, CONTROL_REGISTER);
status = readl(STATUS_REGISTER);


You wouldn't ask for write-combining memory mapping for control or
status registers.


But Nick said, store/load reordering is explicitly allowed as well, which
covers *all* memory types.


Then Nick is confused.  PCI only defines one way to flush posted writes
to a device -- doing a read from it.  There's no way that reads can
be allowed to pass writes (unless you've asked for it, like with write
combining).


But that requirement is for the PCI bridge, isn't it?  It doesn't matter if
the bridge will flush all posted writes before allowing a read if the CPU
decides to give the bridge the read before the write.  A powerpc CPU will
certainly do this if you don't take any steps like telling it the memory is
uncachable and guarded.  I didn't think it was allowed on x86 (except with
WC), but Nick seemed to say it was.
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-06-03 Thread Benjamin Herrenschmidt
On Tue, 2008-06-03 at 12:43 -0700, Trent Piepho wrote:
 
 Byte-swapping vs not byte-swapping is not usually what the programmer wants. 
 Usually your device's registers are defined as being big-endian or
 little-endian and you want whatever is needed to give you that.

Yes, which is why I (and some other archs) have writel_be/readl_be.

The standard writel/readl being LE.

However, the raw variants are defined to be native endian, which is of
some use to -some- archs apparently where they have SoC device whose
endianness follow the core.

 I believe that on some archs that can be either byte order, some built-in
 devices will change their registers to match, and so you want native endian
 or no swapping for these.  Though that's definitely in the minority.
 
 An accessors that always byte-swaps regardless of the endianness of the host
 is never something I've seen a driver want.
 
 IOW, there are four ways one can defined endianness/swapping:
 1) Little-endian
 2) Big-endian
 3) Native-endian aka non-byte-swapping
 4) Foreign-endian aka byte-swapping
 
 1 and 2 are by far the most used.  Some code wants 3.  No one wants 4.  Yet
 our API is providing 3  4, the two which are the least useful.

No, we don't provide 4, it was something unclear with nick.

We provide 1. (writel/readl and __variants), some archs provide 2
(writel_be/readl_be, tho I don't have __variants, I suppose I could),
and everybody provides 3. though in some cases (like us) only in the
form of __variants (ie, non ordered, like __raw_readl/__raw_writel).

Nick's proposal is to plug those gaps, though it's, I believe, missing
the _be variants.

 Is it enough to provide only all or none for ordering strictness?  For
 instance on powerpc, one can get a speedup by dropping strict ordering for IO
 vs cacheable memory, but still keeping ordering for IO vs IO and IO vs locks. 
 This is much easier to program for than no ordering at all.  In fact, if one
 doesn't use coherent DMA, it's basically the same as fully strict ordering.

Ben.


___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-06-03 Thread Nick Piggin
On Wednesday 04 June 2008 07:58, Trent Piepho wrote:
 On Tue, 3 Jun 2008, Matthew Wilcox wrote:
  On Tue, Jun 03, 2008 at 12:57:56PM -0700, Trent Piepho wrote:
  On Tue, 3 Jun 2008, Matthew Wilcox wrote:
  On Tue, Jun 03, 2008 at 11:47:00AM -0700, Trent Piepho wrote:
  On Tue, 3 Jun 2008, Linus Torvalds wrote:
  On Tue, 3 Jun 2008, Nick Piggin wrote:
  Linus: on x86, memory operations to wc and wc+ memory are not
  ordered with one another, or operations to other memory types (ie.
  load/load and store/store reordering is allowed). Also, as you know,
  store/load reordering is explicitly allowed as well, which covers
  all memory types. So perhaps it is not quite true to say
  readl/writel is strongly ordered by default even on x86. You would
  have to put in some mfence instructions in them to make it so.
 
  So on x86, these could be re-ordered?
 
  writel(START_OPERATION, CONTROL_REGISTER);
  status = readl(STATUS_REGISTER);
 
  You wouldn't ask for write-combining memory mapping for control or
  status registers.
 
  But Nick said, store/load reordering is explicitly allowed as well,
  which covers *all* memory types.
 
  Then Nick is confused.  PCI only defines one way to flush posted writes
  to a device -- doing a read from it.  There's no way that reads can
  be allowed to pass writes (unless you've asked for it, like with write
  combining).

 But that requirement is for the PCI bridge, isn't it?  It doesn't matter if
 the bridge will flush all posted writes before allowing a read if the CPU
 decides to give the bridge the read before the write.  A powerpc CPU will
 certainly do this if you don't take any steps like telling it the memory is
 uncachable and guarded.  I didn't think it was allowed on x86 (except with
 WC), but Nick seemed to say it was.

Ah sorry, not UC, I was confused. UC I think actually is strongly ordered
WRT other UC and also cacheable operations.

WC is weakly ordered, anything goes.
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-06-03 Thread Nick Piggin
On Wednesday 04 June 2008 05:07, Linus Torvalds wrote:
 On Tue, 3 Jun 2008, Trent Piepho wrote:
  On Tue, 3 Jun 2008, Linus Torvalds wrote:
   On Tue, 3 Jun 2008, Nick Piggin wrote:
Linus: on x86, memory operations to wc and wc+ memory are not ordered
with one another, or operations to other memory types (ie. load/load
and store/store reordering is allowed). Also, as you know, store/load
reordering is explicitly allowed as well, which covers all memory
types. So perhaps it is not quite true to say readl/writel is
strongly ordered by default even on x86. You would have to put in
some mfence instructions in them to make it so.
 
  So on x86, these could be re-ordered?
 
  writel(START_OPERATION, CONTROL_REGISTER);
  status = readl(STATUS_REGISTER);

 With both registers in a WC+ area, yes. The write may be in the WC buffers
 until the WC buffers are flushed (short list: a fence, a serializing
 instruction, a read-write to uncached memory, or an interrupt. There are
 others, but those are the main ones).

 But if the status register is in uncached memory (which is the only *sane*
 thing to do), then it doesn't matter if the control register is in WC
 memory. Because the status register read is itself serializing with the WC
 buffer, it's actually fine.

Actually, according to the document I am looking at (the AMD one), a UC
store may pass a previous WC store.

So you could have some code that writes to some WC memory on the card,
and then stores to an UC control register to start up the operation on
that memory, couldn't you? Those can go out of order.
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-06-03 Thread Nick Piggin
On Wednesday 04 June 2008 00:47, Linus Torvalds wrote:
 On Tue, 3 Jun 2008, Nick Piggin wrote:
  Linus: on x86, memory operations to wc and wc+ memory are not ordered
  with one another, or operations to other memory types (ie. load/load
  and store/store reordering is allowed). Also, as you know, store/load
  reordering is explicitly allowed as well, which covers all memory
  types. So perhaps it is not quite true to say readl/writel is strongly
  ordered by default even on x86. You would have to put in some
  mfence instructions in them to make it so.

 Well, you have to ask for WC/WC+ anyway, so it's immaterial. A driver that
 does that needs to be aware of it. IOW, it's a non-issue, imnsho.

Ah, yes UC is strongly ordered WRT all others *except* WC/WC+.

But WC memory is not an x86 specific thing right, so do we need
some accessors for WC memory? Or can we just throw that in the
weakly ordered pile, and ensure mb/rmb/wmb does the right thing
for them.

And you want readl/writel to be strongly ordered like x86 on all
architectures, no exceptions? This will slow some things down,
but if we then also provide explicitly weakly ordered instructions
(and add io_mb/io_rmb/io_wmb) then at least it gives the framework
for drivers to be written to run on those architectures.

The other thing we could do is mandate only that readl/writel will
be ordered WRT one another, *and* with spinlocks, but otherwise not
with cacheable RAM...
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-06-03 Thread Nick Piggin
On Wednesday 04 June 2008 07:44, Trent Piepho wrote:
 On Tue, 3 Jun 2008, Matthew Wilcox wrote:

  I don't understand why you keep talking about DMA.  Are you talking
  about ordering between readX() and DMA?  PCI proides those guarantees.

 I guess you haven't been reading the whole thread.  The reason it started
 was because gcc can re-order powerpc (and everyone else's too) IO accesses
 vs accesses to cachable memory (but not spin-locks), which ends up only
 being a problem with coherent DMA.

I don't think it is only a problem with coherent DMA.

CPU0 CPU1
mutex_lock(mutex);
writel(something, DATA_REG);
writel(GO, CTRL_REG);
started = 1;
mutex_unlock(mutex);
 mutex_lock(mutex);
 if (started)
   /* oops, this can reach device before GO */
   writel(STOP, CTRL_REG);
   
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-06-03 Thread Linus Torvalds


On Wed, 4 Jun 2008, Nick Piggin wrote:
 
 Actually, according to the document I am looking at (the AMD one), a UC
 store may pass a previous WC store.

Hmm. Intel arch manyal, Vol 3, 10.3 (page 10-7 in my version):

  If the WC bufer is partially filled, the writes may be delayed until 
   the next ocurrence of a serializing event; such as, an SFENCE or MFENCE 
   instruction, CPUID execution, a read or write to uncached memory, ...

Any typos mine.

Anyway, Intel certainly seems to document that WC memory is serialized by 
any access to UC memory.

But yes, I can well imagine that AMD is different, and I also heartily 
would recommend rather being safe than sorry. Putting an explicit memory 
barrier in between those accesses when you know it might make a difference 
is just a good idea. 

But basically, as far as I know the thing was designed to be invisible to 
old software: that is the whole idea behind WC memory. So the design was 
certainly intended to be that you can generally mark a framebuffer-like 
structure WC without any software _ever_ caring, as long as you keep all 
control ports in UC memory.

Of course, because burst writes from the WC buffer are iso/i much more 
efficient on the PCI bus than dribbling them out one write at a time, it 
didn't take long before all the graphics cards etc wanted to ialso/i 
mark their command queues as WC memory, so that you could burst out the 
commands to the ring buffers as fast as possible. So now you have both 
your frame buffer *and* your command buffers mapped WC, and now ordering 
really has to be ensured in software if you access both.

[ And then there are the crazy people who mark *main memory* as WC, 
  because they don't want to pollute the cache with all the data, and then 
  you have the issue of cache coherency etc crap. Which only gets worse 
  with SMP, especially if one processor thinks it has part of memory 
  exclusively cached, and another one - or even the same one, 
  through another aliasign address - ignores the cache protocol.

  And you now get unhappy CPU's that think that there is a bug in the 
  cache protocol and they get machine check faults.

  So what started out as a we can do accesses to the frame buffer more 
  efficiently without anybody ever even having to know or care has 
  turned into a whole nightmare of people using it for other things, and 
  then you very much _do_ have to care! ]

And it doesn't surprise me if AMD then didn't get exactly the same 
rules. 

Oh, well.

Linus
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-06-02 Thread Russell King
On Tue, May 27, 2008 at 02:55:56PM -0700, Linus Torvalds wrote:
 
 
 On Wed, 28 May 2008, Benjamin Herrenschmidt wrote:
  
  A problem with __raw_ though is that they -also- don't do byteswap,
 
 Well, that's why there is __readl() and __raw_readl(), no?
 
 Neither does ordering, and __raw_readl() doesn't do byte-swap.

This is where the lack of documentation causes arch maintainers a big
problem.  None of the semantics of __raw_readl vs __readl vs readl are
documented _anywhere_.  If you look at x86 as a template, there's no
comments there about what the different variants are supposed to do
or not do.

So it's left up to arch maintainers to literally guess what should be
done.  That's precisely what I did when I implemented ARMs __raw_readl
and friends.  I guessed.

And it was only after I read a few mails on lkml which suggested that
readl and friends should always be LE that ARMs readl implementation
started to use le32_to_cpu()... before that it had always been native
endian.  Again, lack of documentation...

So, can the semantics of what's expected from these IO accessor
functions be documented somewhere.  Please?  Before this thread gets
lost in the depths of time?

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-06-02 Thread Haavard Skinnemoen
Geert Uytterhoeven [EMAIL PROTECTED] wrote:
 On Fri, 30 May 2008, Haavard Skinnemoen wrote:
  Maybe we need another interface that does not do byteswapping but
  provides stronger ordering guarantees?
 
 The byte swapping depends on the device/bus.

Of course. But isn't it reasonable to assume that a device integrated
on the same silicon as the CPU is connected to a somewhat sane bus
which doesn't require any byte swapping?

 So what happened to the old idea of putting the accessor function pointers
 in the device/bus structure?

Don't know. I think it sounds like overkill to replace a simple load or
store with an indirect function call.

Haavard
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-06-02 Thread Jes Sorensen

Jeremy Higdon wrote:

We don't actually have that problem on the Altix.  All writes issued
by CPU X will be ordered with respect to each other.  But writes by
CPU X and CPU Y will not be, unless an mmiowb() is done by the
original CPU before the second CPU writes.  I.e.

CPU X   writel
CPU X   writel
CPU X   mmiowb

CPU Y   writel
...

Note that this implies some sort of locking.  Also note that if in
the above, CPU Y did the mmiowb, that would not work.


Hmmm,

Then it's less bad than I thought - my apologies for the confusion.

Would we be able to use Ben's trick of setting a per cpu flag in
writel() then and checking that in spin unlock issuing the mmiowb()
there if needed?


Cheers,
Jes
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-06-02 Thread Ingo Molnar

* Linus Torvalds [EMAIL PROTECTED] wrote:

  Here's a UNTESTED patch for x86 that may or may not compile and 
  work, and which serializes (on a compiler level) the IO accesses 
  against regular memory accesses.
 
 Ok, so it at least boots on x86-32. Thus probably on x86-64 too (since 
 the code is now shared). I didn't look at whether it generates much 
 bigger code due to the potential extra serialization, but some of the 
 code generation I looked at looked fine.
 
 IOW, it doesn't at least create any _obviously_ worse code, and it 
 should be arguably safer than assuming the compiler does volatile 
 accesses the way we want it to.

ok, to pursue this topic of making readl*/writel*() more robust i picked 
up your patch into -tip and created a new topic branch for it: 
tip/x86/mmio.

The patch passed initial light testing in -tip (~30 successful random 
self-builds and bootups on various mixed 32-bit/64-bit boxes) but it's 
still v2.6.27 material IMO.

Failures in this area are subtle so there's no good way to tell whether 
it works as intended - we need wider testing. I've also added the 
tip/x86/mmio topic to tip/auto-x86-next rules as well so these changes 
will be picked up by tomorrow's linux-next tree as well, and by the next 
-mm iteration.

Ingo
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-06-02 Thread Jes Sorensen

Pavel Machek wrote:

Still better than changing semantics of writel for _all_ drivers.

If you are really sure driver does not depend on writel order, it is
just a sed/// command, so I don't see any big code maintenance
issues...


This isn't changing the semantics for all drivers, it means it leaves
it up to us to fix the drivers that hit the problem. Whereas the other
way round we'd have to make some fairly big modifications to a large
number of drivers, which the driver maintainers are likely to be
reluctant to accept and maintain. It will be a constant chase to keep
that in order.

Cheers
Jes


___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-06-02 Thread Scott Wood
On Mon, Jun 02, 2008 at 10:11:02AM +0200, Haavard Skinnemoen wrote:
 Geert Uytterhoeven [EMAIL PROTECTED] wrote:
  On Fri, 30 May 2008, Haavard Skinnemoen wrote:
   Maybe we need another interface that does not do byteswapping but
   provides stronger ordering guarantees?
  
  The byte swapping depends on the device/bus.
 
 Of course. But isn't it reasonable to assume that a device integrated
 on the same silicon as the CPU is connected to a somewhat sane bus
 which doesn't require any byte swapping?

No, unfortunately. :-(

See the end of drivers/dma/fsldma.h.  Likewise with Freescale's PCI host
bridges; for some reason the bus itself being little endian led to the host
bridge control registers also being little endian.

-Scott
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-06-02 Thread Jeremy Higdon
On Mon, Jun 02, 2008 at 11:56:39AM +0200, Jes Sorensen wrote:
 Jeremy Higdon wrote:
 We don't actually have that problem on the Altix.  All writes issued
 by CPU X will be ordered with respect to each other.  But writes by
 CPU X and CPU Y will not be, unless an mmiowb() is done by the
 original CPU before the second CPU writes.  I.e.
 
  CPU X   writel
  CPU X   writel
  CPU X   mmiowb
 
  CPU Y   writel
  ...
 
 Note that this implies some sort of locking.  Also note that if in
 the above, CPU Y did the mmiowb, that would not work.
 
 Hmmm,
 
 Then it's less bad than I thought - my apologies for the confusion.
 
 Would we be able to use Ben's trick of setting a per cpu flag in
 writel() then and checking that in spin unlock issuing the mmiowb()
 there if needed?


Yes, that should work fine.  You will get more mmiowb's than you
need, since some drivers, such as Fusion, don't need them.  On the
Origins (older SGI MIPS-based Numa), the 'sync' instruction had
the same effect as mmiowb() with respect to mmio write ordering,
and it was issued unconditionally in the spin unlock.  It was
cheaper than mmiowb, however.

If it matters, we could invent and use writel_relaxed() to get
performance back in drivers we care about

jeremy
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-06-02 Thread Benjamin Herrenschmidt
On Mon, 2008-06-02 at 12:36 +0200, Ingo Molnar wrote:
 The patch passed initial light testing in -tip (~30 successful random 
 self-builds and bootups on various mixed 32-bit/64-bit boxes) but
 it's 
 still v2.6.27 material IMO.

I think adding the memory clobber should be .26 and even -stable. We
know for sure newer gcc will re-order things, we know it for sure some
drivers will break in subtle ways because of that, ad we know as well
that sticking memory clobber in there will fix that breakage we know
about ... So I don't see the point of waiting.

Ben.


___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-06-02 Thread Nick Piggin
On Monday 02 June 2008 19:56, Jes Sorensen wrote:
 Jeremy Higdon wrote:
  We don't actually have that problem on the Altix.  All writes issued
  by CPU X will be ordered with respect to each other.  But writes by
  CPU X and CPU Y will not be, unless an mmiowb() is done by the
  original CPU before the second CPU writes.  I.e.
 
  CPU X   writel
  CPU X   writel
  CPU X   mmiowb
 
  CPU Y   writel
  ...
 
  Note that this implies some sort of locking.  Also note that if in
  the above, CPU Y did the mmiowb, that would not work.

 Hmmm,

 Then it's less bad than I thought - my apologies for the confusion.

 Would we be able to use Ben's trick of setting a per cpu flag in
 writel() then and checking that in spin unlock issuing the mmiowb()
 there if needed?

Yes you could, but your writels would still not be strongly ordered
within (or outside) spinlock regions, which is what Linus wants (and
I kind of agree with).

This comes back to my posting about mmiowb and io_*mb barriers etc.

Despite what you say, what you've done really _does_ change the semantics
of wmb() for all drivers. It is a really sad situation we've got ourselves
into somehow, AFAIKS in the hope of trying to save ourselves a tiny bit
of work upfront :( (this is not just the sgi folk with mmiowb I'm talking
about, but the whole random undefinedness of ordering and io barriers).

The right way to make any change is never to weaken the postcondition of
an existing interface *unless* you are willing to audit the entire tree
and fix it. Impossible for drivers, so the correct thing to do is introduce
a new interface, and move things over at an easier pace. Not rocket science.

The argument that Altix only uses a few drivers so this way we can just
fix these up rather than make big modifications to large numbers of drivers
is bogus. It is far worse even for Altix if you make incompatible changes,
because you first *break* every driver on your platform, then you have to
audit and fix them. If you make compatible changes, then you have to do
exactly the same audits to move them over to the new API, but you go from
slower-faster rather than broken-non broken. As a bonus, you haven't
got random broken stuff all over the tree that you forgot to audit.

I don't know how there is still so much debate about this :(

I have a proposal: I am a neutral party here, not being an arch maintainer,
so I'll take input and write up a backward compatible API specification
and force everybody to conform to it ;)
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-06-01 Thread Pavel Machek
Hi!

   Though it's my understanding that at least ia64 does require the
   explicit barriers anyway, so we are still in a dodgy situation here
   where it's not clear what drivers should do and we end up with
   possibly excessive barriers on powerpc where I end up with both
   the wmb/rmb/mb that were added for ia64 -and- the ones I have in
   readl/writel to make them look synchronous... Not nice.
 
 ia64 is a disaster with a slightly different ordering problem -- the
 mmiowb() issue.  I know Ben knows far too much about this, but for big
 SGI boxes, you sometimes need mmiowb() to avoid problems with driver
 code that does totally sane stuff like
 
   spin_lock(mmio_lock);
   writel(val1, reg1);
   writel(val2, reg2);
   spin_unlock(mmio_lock);
 
 If that snippet is called on two CPUs at the same time, then the device
 might see a sequence like
 
   CPU1 -- write reg1
   CPU2 -- write reg1
   CPU1 -- write reg2
   CPU2 -- write reg2
 
 in spite of the fact that everything is totally ordered on the CPUs by
 the spin lock.
 
 The reason this is such a disaster is because the code looks right,
 makes sense, and works fine on 99.99% of all systems out there.  So I
 would bet that 99% of our drivers have missing mmiowb() bugs -- no one
 has plugged the hardware into an Altix box and cared enough to stress
 test it.

Yes, ia64 is a disaster, and needs its spinlock implementation fixed
:-).
Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-06-01 Thread Pavel Machek
Hi!

 The only way to guarantee ordering in the above setup, 
 is to either
 make writel() fully ordered or adding the mmiowb()'s 
 inbetween the two
 writel's. On Altix you have to go and read from the 
 PCI brige to
 ensure all writes to it have been flushed, which is 
 also what mmiowb()
 is doing. If writel() was to guarantee this ordering, 
 it would make
 every writel() call extremely expensive :-(
 
 Interesting. I've always been taught by ia64 people 
 that mmiowb() was
 intended to be used solely between writel() and 
 spin_unlock().
 
 I think in the above case, you really should make 
 writel() ordered.
 Anything else is asking for trouble, for the exact same 
 reasons that I
 made it fully ordered on powerpc at least vs. previous 
 stores. I only
 kept it relaxed vs. subsequent cacheable stores (ie, 
 spin_unlock), for
 which I use the trick mentioned before.
 
 Hmmm I hope I didn't mess up the description of this and 
 added to the
 confusion.
 
 The net result of that would be to kill performance 
 completely, I really
 don't like that idea Having each writel() go out and 
 poll the PCI
 bridge is going to make every driver used on Altix slow 
 as a dog.
 
 In addition it's still not going to solve the problem 
 for userland
 mapped stuff such as infinibug.
 
 Yes, this has some cost (can be fairly significant on 
 powerpc too) but
 I think it's a very basic assumption from drivers that 
 consecutive
 writel's, especially issued by the same CPU, will get 
 to the device
 in order.
 
 In this case the cost is more than just significant, 
 it's pretty
 crippling.
 
 If this is a performance problem, then provide relaxed 
 variants and
 use them in selected drivers.
 
 We'd have to make major changes to drivers like e1000, 
 tg3, mptsas, the
 qla2/3/4xxx and a bunch of others to get performance 
 back. I really
 think the code maintenance issue there will get far 
 worse than what we
 have today :(

Still better than changing semantics of writel for _all_ drivers.

If you are really sure driver does not depend on writel order, it is
just a sed/// command, so I don't see any big code maintenance
issues...
Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-05-31 Thread Jeremy Higdon
On Thu, May 29, 2008 at 10:47:18AM -0400, Jes Sorensen wrote:
 Thats not going to solve the problem on Altix. On Altix the issue is
 that there can be multiple paths through the NUMA fabric from cpuX to
 PCI bridge Y. 
 
 Consider this uber-cooltm ascii art - NR is my abbrevation for NUMA
 router:
 
 --- ---
 |cpu X| |cpu Y|
 --- ---
  |   \  /|
  |\/ |
  |/\ |
  |   /  \|
  -  --
  |NR 1| |NR 2|
  -- --
   \ /
\   /
 ---
 | PCI |
 ---
 
 The problem is that your two writel's, despite being both issued on
 cpu X, due to the spin lock, in your example, can end up with the
 first one going through NR 1 and the second one going through NR 2. If
 there's contention on NR 1, the write going via NR 2 may hit the PCI
 bridge prior to the one going via NR 1.

We don't actually have that problem on the Altix.  All writes issued
by CPU X will be ordered with respect to each other.  But writes by
CPU X and CPU Y will not be, unless an mmiowb() is done by the
original CPU before the second CPU writes.  I.e.

CPU X   writel
CPU X   writel
CPU X   mmiowb

CPU Y   writel
...

Note that this implies some sort of locking.  Also note that if in
the above, CPU Y did the mmiowb, that would not work.

jeremy
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-05-31 Thread Jeremy Higdon
On Fri, May 30, 2008 at 10:21:00AM -0700, Jesse Barnes wrote:
 On Friday, May 30, 2008 2:36 am Jes Sorensen wrote:
  James Bottomley wrote:
   The only way to guarantee ordering in the above setup, is to either
   make writel() fully ordered or adding the mmiowb()'s inbetween the two
   writel's. On Altix you have to go and read from the PCI brige to
   ensure all writes to it have been flushed, which is also what mmiowb()
   is doing. If writel() was to guarantee this ordering, it would make
   every writel() call extremely expensive :-(
  
   So if a read from the bridge achieves the same effect, can't we just put
   one after the writes within the spinlock (an unrelaxed one).  That way

A relaxed readX would be sufficient.  It's the next lowest cost way (after
mmiowb) of ensuring write ordering between CPUs.  Regular readX is the
most expensive method (well, we could probably come up with something worse,
but we'd have to work at it  :).

   this whole sequence will look like a well understood PCI posting flush
   rather than have to muck around with little understood (at least by most
   driver writers) io barriers?
 
  Hmmm,
 
  I think mmiowb() does some sort of status read from the bridge, I am not
  sure if it's enough to just do a regular readl().
 
  I'm adding Jeremy to the list, he should know for sure.
 
 I think a read from the target host bridge is enough.  What mmiowb() does 
 though is to read a *local* host bridge register, which contains a count of 
 the number of PIO ops still in flight on their way to their target bridge.  
 When it reaches 0, all PIOs have arrived at the target host bridge (they 
 still may be bufferd), so ordering is guaranteed.


Note that is the main advantage over a read.  There is no round trip across
the NUMA fabric.

jeremy
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-05-30 Thread Haavard Skinnemoen
On Fri, 30 May 2008 11:13:23 +1000
Benjamin Herrenschmidt [EMAIL PROTECTED] wrote:

  Currently, this is the only interface I know that can do native-endian
  accesses, so if you take it away, I'm gonna need an alternative
  interface that doesn't do byteswapping.  
 
 Are you aware that these also don't provide any ordering guarantee ?

Yes, but I am not aware of any alternative.

I think the drivers I've written have the necessary barriers (or dma
ops with implicit barriers) that they don't actually depend on any
DMA vs. MMIO ordering guarantees. I hope MMIO vs. MMIO ordering is
guaranteed though?

Haavard
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-05-30 Thread Benjamin Herrenschmidt
On Fri, 2008-05-30 at 08:07 +0200, Haavard Skinnemoen wrote:
 On Fri, 30 May 2008 11:13:23 +1000
 Benjamin Herrenschmidt [EMAIL PROTECTED] wrote:
 
   Currently, this is the only interface I know that can do native-endian
   accesses, so if you take it away, I'm gonna need an alternative
   interface that doesn't do byteswapping.  
  
  Are you aware that these also don't provide any ordering guarantee ?
 
 Yes, but I am not aware of any alternative.
 
 I think the drivers I've written have the necessary barriers (or dma
 ops with implicit barriers) that they don't actually depend on any
 DMA vs. MMIO ordering guarantees. I hope MMIO vs. MMIO ordering is
 guaranteed though?

Only to the same address I'd say.

Ben.


___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-05-30 Thread Haavard Skinnemoen
On Fri, 30 May 2008 17:24:27 +1000
Benjamin Herrenschmidt [EMAIL PROTECTED] wrote:
 On Fri, 2008-05-30 at 08:07 +0200, Haavard Skinnemoen wrote:
  I think the drivers I've written have the necessary barriers (or dma
  ops with implicit barriers) that they don't actually depend on any
  DMA vs. MMIO ordering guarantees. I hope MMIO vs. MMIO ordering is
  guaranteed though?
 
 Only to the same address I'd say.

Right, I sort of suspected that.

Now, I'm pretty sure the architectures that can actually run those
drivers (ARM9 and AVR32 AP7) provide stronger guarantees than that, and
I suspect the same is true on most other embedded architectures that
use __raw_* in their drivers. So I don't think adding barriers is the
right thing to do because they won't do anything useful in practice, so
it's hard to tell whether they are used correctly. And it will hurt
performance at least on AVR32 since wmb() evaluates to a sync
instruction, which flushes the write buffer to RAM. Since MMIO writes
are unbuffered, that's pure overhead.

Maybe we need another interface that does not do byteswapping but
provides stronger ordering guarantees?

Haavard
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-05-30 Thread Geert Uytterhoeven
On Fri, 30 May 2008, Haavard Skinnemoen wrote:
 Maybe we need another interface that does not do byteswapping but
 provides stronger ordering guarantees?

The byte swapping depends on the device/bus.

So what happened to the old idea of putting the accessor function pointers
in the device/bus structure?

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [EMAIL PROTECTED]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say programmer or something like that.
-- Linus Torvalds
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-05-30 Thread Jes Sorensen

James Bottomley wrote:

The only way to guarantee ordering in the above setup, is to either
make writel() fully ordered or adding the mmiowb()'s inbetween the two
writel's. On Altix you have to go and read from the PCI brige to
ensure all writes to it have been flushed, which is also what mmiowb()
is doing. If writel() was to guarantee this ordering, it would make
every writel() call extremely expensive :-(


So if a read from the bridge achieves the same effect, can't we just put
one after the writes within the spinlock (an unrelaxed one).  That way
this whole sequence will look like a well understood PCI posting flush
rather than have to muck around with little understood (at least by most
driver writers) io barriers?


Hmmm,

I think mmiowb() does some sort of status read from the bridge, I am not
sure if it's enough to just do a regular readl().

I'm adding Jeremy to the list, he should know for sure.

Cheers,
Jes

___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-05-30 Thread Jes Sorensen

Jesse Barnes wrote:

On Thursday, May 29, 2008 2:40 pm Benjamin Herrenschmidt wrote:

On Thu, 2008-05-29 at 10:47 -0400, Jes Sorensen wrote:

The only way to guarantee ordering in the above setup, is to either
make writel() fully ordered or adding the mmiowb()'s inbetween the two
writel's. On Altix you have to go and read from the PCI brige to
ensure all writes to it have been flushed, which is also what mmiowb()
is doing. If writel() was to guarantee this ordering, it would make
every writel() call extremely expensive :-(

Interesting. I've always been taught by ia64 people that mmiowb() was
intended to be used solely between writel() and spin_unlock().


Well, that *was* true, afaik, but maybe these days multipath isn't just for 
fail-over.  If that's true, then yeah making every single writeX ordered 
would be the only way to go...


I could be getting bits wrong, but multi-path here is in the NUMA
routing, not at the device level.


If this is a performance problem, then provide relaxed variants and
use them in selected drivers.


Sounds reasonable.  That way drivers just work and important drivers can be 
optimized.


That would kill all levels of performance in all drivers, resulting in
attempts to try and modify a fair bit of drivers to get the performance
back.

In reality this problem really only exists for devices where ordering of
consecutive writel's is a big issue. In my experience it really isn't
the case very frequently - and the number of mmiowb's that have put in
shows that too :-)

Cheers,
Jes
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-05-30 Thread Jes Sorensen

Benjamin Herrenschmidt wrote:

On Thu, 2008-05-29 at 10:47 -0400, Jes Sorensen wrote:

The only way to guarantee ordering in the above setup, is to either
make writel() fully ordered or adding the mmiowb()'s inbetween the two
writel's. On Altix you have to go and read from the PCI brige to
ensure all writes to it have been flushed, which is also what mmiowb()
is doing. If writel() was to guarantee this ordering, it would make
every writel() call extremely expensive :-(


Interesting. I've always been taught by ia64 people that mmiowb() was
intended to be used solely between writel() and spin_unlock().

I think in the above case, you really should make writel() ordered.
Anything else is asking for trouble, for the exact same reasons that I
made it fully ordered on powerpc at least vs. previous stores. I only
kept it relaxed vs. subsequent cacheable stores (ie, spin_unlock), for
which I use the trick mentioned before.


Hmmm I hope I didn't mess up the description of this and added to the
confusion.

The net result of that would be to kill performance completely, I really
don't like that idea Having each writel() go out and poll the PCI
bridge is going to make every driver used on Altix slow as a dog.

In addition it's still not going to solve the problem for userland
mapped stuff such as infinibug.


Yes, this has some cost (can be fairly significant on powerpc too) but
I think it's a very basic assumption from drivers that consecutive
writel's, especially issued by the same CPU, will get to the device
in order.


In this case the cost is more than just significant, it's pretty
crippling.


If this is a performance problem, then provide relaxed variants and
use them in selected drivers.


We'd have to make major changes to drivers like e1000, tg3, mptsas, the
qla2/3/4xxx and a bunch of others to get performance back. I really
think the code maintenance issue there will get far worse than what we
have today :(

Cheers,
Jes
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-05-30 Thread Jesse Barnes
On Friday, May 30, 2008 2:36 am Jes Sorensen wrote:
 James Bottomley wrote:
  The only way to guarantee ordering in the above setup, is to either
  make writel() fully ordered or adding the mmiowb()'s inbetween the two
  writel's. On Altix you have to go and read from the PCI brige to
  ensure all writes to it have been flushed, which is also what mmiowb()
  is doing. If writel() was to guarantee this ordering, it would make
  every writel() call extremely expensive :-(
 
  So if a read from the bridge achieves the same effect, can't we just put
  one after the writes within the spinlock (an unrelaxed one).  That way
  this whole sequence will look like a well understood PCI posting flush
  rather than have to muck around with little understood (at least by most
  driver writers) io barriers?

 Hmmm,

 I think mmiowb() does some sort of status read from the bridge, I am not
 sure if it's enough to just do a regular readl().

 I'm adding Jeremy to the list, he should know for sure.

I think a read from the target host bridge is enough.  What mmiowb() does 
though is to read a *local* host bridge register, which contains a count of 
the number of PIO ops still in flight on their way to their target bridge.  
When it reaches 0, all PIOs have arrived at the target host bridge (they 
still may be bufferd), so ordering is guaranteed.

Jesse
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-05-29 Thread Arnd Bergmann
On Wednesday 28 May 2008, Benjamin Herrenschmidt wrote:
 On Tue, 2008-05-27 at 14:55 -0700, Linus Torvalds wrote:
  
  On Wed, 28 May 2008, Benjamin Herrenschmidt wrote:
   
   A problem with __raw_ though is that they -also- don't do byteswap,
  
  Well, that's why there is __readl() and __raw_readl(), no?
 
 As I replied to somebody else, __readl() is news to me :-) we dont' have
 those on powerpc.
 

It's not exactly a well-established interface. Only five architectures
define these functions, and there is not a single user in the kernel
source outside of these architecture's io.h files.

Arnd 
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-05-29 Thread Alan Cox
 It's not exactly a well-established interface. Only five architectures
 define these functions, and there is not a single user in the kernel
 source outside of these architecture's io.h files.

That is because the drivers using them had them removed (eg I²O) - mostly
because it didn't compile on power pc 8(


Alan
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev

Re: MMIO and gcc re-ordering issue

2008-05-29 Thread Pantelis Antoniou


On 28 Μαϊ 2008, at 11:36 ΠΜ, Haavard Skinnemoen wrote:


Benjamin Herrenschmidt [EMAIL PROTECTED] wrote:

I'm happy to say that __raw is purely about ordering and make them
byteswap on powerpc tho (ie, make them little endian like the non- 
raw

counterpart).


That would break a lot of drivers.


How many actually use __raw_ * ?


I do -- in all the drivers for on-chip peripherals that are shared
between AT91 ARM (LE) and AVR32 (BE). Since everything goes on inside
the chip, we must use LE accesses on ARM and BE accesses on AVR32.

Currently, this is the only interface I know that can do native-endian
accesses, so if you take it away, I'm gonna need an alternative
interface that doesn't do byteswapping.

Haavard


I certainly do too.

Quite a lot of SoC specific drivers use __raw for accessing their
on-chip peripherals.

Please don't change the __raw semantics, you'll end up breaking almost
everything that's BE.

-- Pantelis

___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev

Re: MMIO and gcc re-ordering issue

2008-05-29 Thread James Bottomley
On Thu, 2008-05-29 at 10:47 -0400, Jes Sorensen wrote:
  Roland == Roland Dreier [EMAIL PROTECTED] writes:
 
  This is a different issue. We deal with it on powerpc by having
  writel set a per-cpu flag and spin_unlock() test it, and do the
  barrier if needed there.
 
 Roland Cool... I assume you do this for mutex_unlock() etc?
 
 Roland Is there any reason why ia64 can't do this too so we can kill
 Roland mmiowb and save everyone a lot of hassle?  (mips, sh and frv
 Roland have non-empty mmiowb() definitions too but I'd guess that
 Roland these are all bugs based on misunderstandings of the mmiowb()
 Roland semantics...)
 
 Hi Roland,
 
 Thats not going to solve the problem on Altix. On Altix the issue is
 that there can be multiple paths through the NUMA fabric from cpuX to
 PCI bridge Y. 
 
 Consider this uber-cooltm ascii art - NR is my abbrevation for NUMA
 router:
 
 --- ---
 |cpu X| |cpu Y|
 --- ---
  |   \  /|
  |\/ |
  |/\ |
  |   /  \|
  -  --
  |NR 1| |NR 2|
  -- --
   \ /
\   /
 ---
 | PCI |
 ---
 
 The problem is that your two writel's, despite being both issued on
 cpu X, due to the spin lock, in your example, can end up with the
 first one going through NR 1 and the second one going through NR 2. If
 there's contention on NR 1, the write going via NR 2 may hit the PCI
 bridge prior to the one going via NR 1.
 
 Of course, the bigger the system, the worse the problem
 
 The only way to guarantee ordering in the above setup, is to either
 make writel() fully ordered or adding the mmiowb()'s inbetween the two
 writel's. On Altix you have to go and read from the PCI brige to
 ensure all writes to it have been flushed, which is also what mmiowb()
 is doing. If writel() was to guarantee this ordering, it would make
 every writel() call extremely expensive :-(

So if a read from the bridge achieves the same effect, can't we just put
one after the writes within the spinlock (an unrelaxed one).  That way
this whole sequence will look like a well understood PCI posting flush
rather than have to muck around with little understood (at least by most
driver writers) io barriers?

James


___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-05-29 Thread Jes Sorensen
 Roland == Roland Dreier [EMAIL PROTECTED] writes:

 This is a different issue. We deal with it on powerpc by having
 writel set a per-cpu flag and spin_unlock() test it, and do the
 barrier if needed there.

Roland Cool... I assume you do this for mutex_unlock() etc?

Roland Is there any reason why ia64 can't do this too so we can kill
Roland mmiowb and save everyone a lot of hassle?  (mips, sh and frv
Roland have non-empty mmiowb() definitions too but I'd guess that
Roland these are all bugs based on misunderstandings of the mmiowb()
Roland semantics...)

Hi Roland,

Thats not going to solve the problem on Altix. On Altix the issue is
that there can be multiple paths through the NUMA fabric from cpuX to
PCI bridge Y. 

Consider this uber-cooltm ascii art - NR is my abbrevation for NUMA
router:

--- ---
|cpu X| |cpu Y|
--- ---
 |   \  /|
 |\/ |
 |/\ |
 |   /  \|
 -  --
 |NR 1| |NR 2|
 -- --
  \ /
   \   /
---
| PCI |
---

The problem is that your two writel's, despite being both issued on
cpu X, due to the spin lock, in your example, can end up with the
first one going through NR 1 and the second one going through NR 2. If
there's contention on NR 1, the write going via NR 2 may hit the PCI
bridge prior to the one going via NR 1.

Of course, the bigger the system, the worse the problem

The only way to guarantee ordering in the above setup, is to either
make writel() fully ordered or adding the mmiowb()'s inbetween the two
writel's. On Altix you have to go and read from the PCI brige to
ensure all writes to it have been flushed, which is also what mmiowb()
is doing. If writel() was to guarantee this ordering, it would make
every writel() call extremely expensive :-(

Cheers,
Jes
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-05-29 Thread Benjamin Herrenschmidt
On Thu, 2008-05-29 at 10:47 -0400, Jes Sorensen wrote:
 
 The only way to guarantee ordering in the above setup, is to either
 make writel() fully ordered or adding the mmiowb()'s inbetween the two
 writel's. On Altix you have to go and read from the PCI brige to
 ensure all writes to it have been flushed, which is also what mmiowb()
 is doing. If writel() was to guarantee this ordering, it would make
 every writel() call extremely expensive :-(

Interesting. I've always been taught by ia64 people that mmiowb() was
intended to be used solely between writel() and spin_unlock().

I think in the above case, you really should make writel() ordered.
Anything else is asking for trouble, for the exact same reasons that I
made it fully ordered on powerpc at least vs. previous stores. I only
kept it relaxed vs. subsequent cacheable stores (ie, spin_unlock), for
which I use the trick mentioned before.

Yes, this has some cost (can be fairly significant on powerpc too) but
I think it's a very basic assumption from drivers that consecutive
writel's, especially issued by the same CPU, will get to the device
in order.

If this is a performance problem, then provide relaxed variants and
use them in selected drivers.

Cheers,
Ben.


___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-05-29 Thread Trent Piepho

On Fri, 30 May 2008, Benjamin Herrenschmidt wrote:

On Thu, 2008-05-29 at 10:47 -0400, Jes Sorensen wrote:
Interesting. I've always been taught by ia64 people that mmiowb() was
intended to be used solely between writel() and spin_unlock().


That's what I gathered too, based on what's written in memory-barriers.txt,
which is the only kernel docs I could find that addressed this.


Yes, this has some cost (can be fairly significant on powerpc too) but
I think it's a very basic assumption from drivers that consecutive
writel's, especially issued by the same CPU, will get to the device
in order.


It's also what memory-barriers.txt says they should do.


If this is a performance problem, then provide relaxed variants and
use them in selected drivers.


I wrote a JTAG over gpio driver for the powerpc MPC8572DS platform.  With the
non-raw io accessors, the JTAG clock can run at almost ~9.5 MHz.  Using raw
versions (which I had to write since powerpc doesn't have any), the clock
speed increases to about 28 MHz.  So it can make a very significant different.
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-05-29 Thread Jesse Barnes
On Thursday, May 29, 2008 2:40 pm Benjamin Herrenschmidt wrote:
 On Thu, 2008-05-29 at 10:47 -0400, Jes Sorensen wrote:
  The only way to guarantee ordering in the above setup, is to either
  make writel() fully ordered or adding the mmiowb()'s inbetween the two
  writel's. On Altix you have to go and read from the PCI brige to
  ensure all writes to it have been flushed, which is also what mmiowb()
  is doing. If writel() was to guarantee this ordering, it would make
  every writel() call extremely expensive :-(

 Interesting. I've always been taught by ia64 people that mmiowb() was
 intended to be used solely between writel() and spin_unlock().

Well, that *was* true, afaik, but maybe these days multipath isn't just for 
fail-over.  If that's true, then yeah making every single writeX ordered 
would be the only way to go...

 If this is a performance problem, then provide relaxed variants and
 use them in selected drivers.

Sounds reasonable.  That way drivers just work and important drivers can be 
optimized.

Jesse
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-05-29 Thread Roland Dreier
  The problem is that your two writel's, despite being both issued on
  cpu X, due to the spin lock, in your example, can end up with the
  first one going through NR 1 and the second one going through NR 2. If
  there's contention on NR 1, the write going via NR 2 may hit the PCI
  bridge prior to the one going via NR 1.

Really??  I can't see how you can expect any drivers to work reliably if
simple code like

writel(reg1);
writel(reg2);

might end up with the write to reg2 hitting the device before the write
to reg1.  Almost all MMIO stuff has ordering requirements and will
totally break if you allow, say, the start IO write to be reordered
before the set IO address write.

 - R.
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-05-29 Thread Benjamin Herrenschmidt
On Thu, 2008-05-29 at 14:48 -0700, Trent Piepho wrote:

 I wrote a JTAG over gpio driver for the powerpc MPC8572DS platform.  With the
 non-raw io accessors, the JTAG clock can run at almost ~9.5 MHz.  Using raw
 versions (which I had to write since powerpc doesn't have any), the clock
 speed increases to about 28 MHz.  So it can make a very significant different.

Yes, sync's can hurt a lot. This is why I initially tried to get more
relaxed semantics.

We could implement something like __ variants and do something that
would still have eieio's but not sync's for example (ie. MMIOs are still
ordered vs. each other but not vs. coherent memory).

Ben.


___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-05-29 Thread Benjamin Herrenschmidt

 I do -- in all the drivers for on-chip peripherals that are shared
 between AT91 ARM (LE) and AVR32 (BE). Since everything goes on inside
 the chip, we must use LE accesses on ARM and BE accesses on AVR32.
 
 Currently, this is the only interface I know that can do native-endian
 accesses, so if you take it away, I'm gonna need an alternative
 interface that doesn't do byteswapping.

Are you aware that these also don't provide any ordering guarantee ?

Ben.


___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-05-29 Thread Trent Piepho

On Fri, 30 May 2008, Benjamin Herrenschmidt wrote:

On Thu, 2008-05-29 at 14:48 -0700, Trent Piepho wrote:


I wrote a JTAG over gpio driver for the powerpc MPC8572DS platform.  With the
non-raw io accessors, the JTAG clock can run at almost ~9.5 MHz.  Using raw
versions (which I had to write since powerpc doesn't have any), the clock
speed increases to about 28 MHz.  So it can make a very significant different.


Yes, sync's can hurt a lot. This is why I initially tried to get more
relaxed semantics.

We could implement something like __ variants and do something that
would still have eieio's but not sync's for example (ie. MMIOs are still
ordered vs. each other but not vs. coherent memory).


The problem current with the raw variants is that not all archs have them. 
And for those that do, there is no defined semantics.  Each arch is different

as to what ordering they have (and endianness too).

If you want to write a driver that is (or might be one day) multi-platform,
there aren't any less ordered accessors one can use.  A lot of drivers don't
even use coherent DMA, and could use less strictly ordered semantics quite
trivially.  Except there aren't any.
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-05-29 Thread Paul Mackerras
Trent Piepho writes:

 On Thu, 29 May 2008, Roland Dreier wrote:
   The problem is that your two writel's, despite being both issued on
   cpu X, due to the spin lock, in your example, can end up with the
   first one going through NR 1 and the second one going through NR 2. If
   there's contention on NR 1, the write going via NR 2 may hit the PCI
   bridge prior to the one going via NR 1.
 
  Really??  I can't see how you can expect any drivers to work reliably if
  simple code like
 
  writel(reg1);
  writel(reg2);
 
  might end up with the write to reg2 hitting the device before the write
  to reg1.  Almost all MMIO stuff has ordering requirements and will
 
 This is how powerpc is natively (the linux accessors have extra ordering to
 not allow this of course), and there are non-Linux drivers that are written
 for this ordering model.

In fact stores to non-cacheable locations have to be kept in order,
according to the Power architecture.  But you're correct in that
non-cacheable loads can in principle be reordered w.r.t. each other
and to stores.

 I think what makes altix so hard is that writes to the _same_ register may be
 be re-ordered.  Re-ordering writes to the same address is much less common
 than allowing writes to different addresses to be re-ordered.

Indeed.

Paul.
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-05-28 Thread Haavard Skinnemoen
Benjamin Herrenschmidt [EMAIL PROTECTED] wrote:
   I'm happy to say that __raw is purely about ordering and make them
   byteswap on powerpc tho (ie, make them little endian like the non-raw
   counterpart).  
  
  That would break a lot of drivers.  
 
 How many actually use __raw_ * ?

I do -- in all the drivers for on-chip peripherals that are shared
between AT91 ARM (LE) and AVR32 (BE). Since everything goes on inside
the chip, we must use LE accesses on ARM and BE accesses on AVR32.

Currently, this is the only interface I know that can do native-endian
accesses, so if you take it away, I'm gonna need an alternative
interface that doesn't do byteswapping.

Haavard
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-05-27 Thread Linus Torvalds


On Tue, 27 May 2008, Linus Torvalds wrote:
 
 Here's a UNTESTED patch for x86 that may or may not compile and work, and 
 which serializes (on a compiler level) the IO accesses against regular 
 memory accesses.

Ok, so it at least boots on x86-32. Thus probably on x86-64 too (since the 
code is now shared). I didn't look at whether it generates much bigger 
code due to the potential extra serialization, but some of the code 
generation I looked at looked fine.

IOW, it doesn't at least create any _obviously_ worse code, and it should 
be arguably safer than assuming the compiler does volatile accesses the 
way we want it to.

Linus
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-05-27 Thread James Bottomley
On Tue, 2008-05-27 at 08:50 -0700, Roland Dreier wrote:
  Though it's my understanding that at least ia64 does require the
   explicit barriers anyway, so we are still in a dodgy situation here
   where it's not clear what drivers should do and we end up with
   possibly excessive barriers on powerpc where I end up with both
   the wmb/rmb/mb that were added for ia64 -and- the ones I have in
   readl/writel to make them look synchronous... Not nice.
 
 ia64 is a disaster with a slightly different ordering problem -- the
 mmiowb() issue.  I know Ben knows far too much about this, but for big
 SGI boxes, you sometimes need mmiowb() to avoid problems with driver
 code that does totally sane stuff like
 
   spin_lock(mmio_lock);
   writel(val1, reg1);
   writel(val2, reg2);
   spin_unlock(mmio_lock);
 
 If that snippet is called on two CPUs at the same time, then the device
 might see a sequence like
 
   CPU1 -- write reg1
   CPU2 -- write reg1
   CPU1 -- write reg2
   CPU2 -- write reg2
 
 in spite of the fact that everything is totally ordered on the CPUs by
 the spin lock.
 
 The reason this is such a disaster is because the code looks right,
 makes sense, and works fine on 99.99% of all systems out there.  So I
 would bet that 99% of our drivers have missing mmiowb() bugs -- no one
 has plugged the hardware into an Altix box and cared enough to stress
 test it.
 
 However for the issue at hand, my expectation as a driver writer is that
 readl()/writel() are ordered with respect to MMIO operations, but not
 necessarily with respect to normal writes to coherent CPU memory.  And
 I've included explicit wmb()s in code I've written like
 drivers/infiniband/hw/mthca.

Actually, this specifically should not be.  The need for mmiowb on altix
is because it explicitly violates some of the PCI rules that would
otherwise impede performance.   The compromise is that readX on altix
contains the needed dma flush but there's a variant operator,
readX_relaxed that doesn't (for drivers that know what they're doing).
The altix critical drivers have all been converted to use the relaxed
form for performance, and the unconverted ones should all operate just
fine (albeit potentially more slowly).

You can see all of this in include/asm-ia64/sn/io.h

It is confusing to me that sn_dma_flush() and sn_mmiowb() have different
implementations, but I think both fix the spinlock problem you allude to
by ensuring the DMA operation is completed before the CPU instruction is
executed.

James


___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-05-27 Thread Roland Dreier
  Actually, this specifically should not be.  The need for mmiowb on altix
  is because it explicitly violates some of the PCI rules that would
  otherwise impede performance.   The compromise is that readX on altix
  contains the needed dma flush but there's a variant operator,
  readX_relaxed that doesn't (for drivers that know what they're doing).
  The altix critical drivers have all been converted to use the relaxed
  form for performance, and the unconverted ones should all operate just
  fine (albeit potentially more slowly).

Is this a recent change?  Because as of October 2007, 76d7cc03
(IB/mthca: Use mmiowb() to avoid firmware commands getting jumbled up)
was needed.  But this was involving writel() (__raw_writel() actually,
looking at the code), not readl().  But writel_relaxed() doesn't exist
(and doesn't make sense).

 - R.
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-05-27 Thread James Bottomley
On Tue, 2008-05-27 at 10:38 -0700, Roland Dreier wrote:
  Actually, this specifically should not be.  The need for mmiowb on altix
   is because it explicitly violates some of the PCI rules that would
   otherwise impede performance.   The compromise is that readX on altix
   contains the needed dma flush but there's a variant operator,
   readX_relaxed that doesn't (for drivers that know what they're doing).
   The altix critical drivers have all been converted to use the relaxed
   form for performance, and the unconverted ones should all operate just
   fine (albeit potentially more slowly).
 
 Is this a recent change?  Because as of October 2007, 76d7cc03
 (IB/mthca: Use mmiowb() to avoid firmware commands getting jumbled up)
 was needed.  But this was involving writel() (__raw_writel() actually,
 looking at the code), not readl().  But writel_relaxed() doesn't exist
 (and doesn't make sense).

Um, OK, you've said write twice now ... I was assuming you meant read.
Even on an x86, writes are posted, so there's no way a spin lock could
serialise a write without an intervening read to flush the posting
(that's why only reads have a relaxed version on altix).  Or is there
something else I'm missing?

James


___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-05-27 Thread Roland Dreier
  Um, OK, you've said write twice now ... I was assuming you meant read.
  Even on an x86, writes are posted, so there's no way a spin lock could
  serialise a write without an intervening read to flush the posting
  (that's why only reads have a relaxed version on altix).  Or is there
  something else I'm missing?

Writes are posted yes, but not reordered arbitrarily.  If I have code like:

spin_lock(mmio_lock);
writel(val1, reg1);
writel(val2, reg2);
spin_unlock(mmio_lock);

then I have a reasonable expectation that if two CPUs run this at the
same time, their writes to reg1/reg2 won't be interleaved with each
other (because the whole section is inside a spinlock).  And Altix
violates that expectation.

 - R.
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-05-27 Thread Roland Dreier
  Writes are posted yes, but not reordered arbitrarily.

on standard x86 I mean here...
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-05-27 Thread Trent Piepho

On Tue, 27 May 2008, Linus Torvalds wrote:

On Tue, 27 May 2008, Benjamin Herrenschmidt wrote:


Yes. As it is today, tg3 for example is potentially broken on all archs
with newer gcc unless we either add memory clobber to readl/writel or
stick some wmb's in there (just a random driver I picked).

So Linus, what is your take on that matter ?


Let's just serialize the damn things, and add a memory clobber to them.

Expecting people to fix up all drivers is simply not going to happen. And
serializing things shouldn't be *that* expensive. People who cannot take
the expense can continue to use the magic __raw_writel() etc stuff.


Is there an issue with anything _besides_ coherent DMA?

Could one have a special version of the accessors for drivers that want to
assume they are strictly ordered vs coherent DMA memory?  That would be much
easier to get right, without slowing _everything_ down.  The problem with the
raw versions is that on some arches they are much more raw than just not being
ordered w.r.t normal memory.

__raw_writel()
No ordering beyond what the arch provides natively.  The only thing you can
assume is that reads and writes to the same location on the same CPU are
ordered.

writel()
Strictly ordered with respect to any other IO, but not to normal memory. 
(this is what Documentation/memory-barriers.txt claims).  Should probably be

strictly ordered vs locks as well.  Would be strictly ordered w.r.t. the
streaming DMA sync calls of course.

strict_writel()
Strictly ordered with respect to normal (incl. coherent DMA) memory as well.

One could even go as far as to allow a driver to #define WANT_STRICT_IO and
then it would get the strict versions.  Add that to any driver that uses DMA
and then worry about vetting those drivers.

It's also worth nothing that adding barriers to IO accessors doesn't mean you
never have to worry about barriers with coherent DMA.  Inherent with coherent
DMA, and driving hardware in general, there are various sequence points where
one must be sure one operation has finished before starting the next.  Making
sure you're doing with DMA memory before telling the hardware to do something
with it a common example.  Often the telling the hardware to do something is
done via an IO accessor function, but not always.

You can have buffer descriptors in DMA memory that describe the DMA buffers to
the hardware.  It would be critical to be finished reading a DMA buffer before
updating the descriptor to indicate that it's empty.  You could trigger a DMA
operation not by a call to an IO accessor, but with an atomic operation to a
variable shared with an ISR.  When the ISR runs it sees the change and does
the necessary IO.
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-05-27 Thread Scott Wood

Trent Piepho wrote:

Is there an issue with anything _besides_ coherent DMA?

Could one have a special version of the accessors for drivers that
want to assume they are strictly ordered vs coherent DMA memory?
That would be much easier to get right, without slowing _everything_
down.


It's better to be safe by default and then optimize the fast paths than 
to be relaxed by default and hang the machine in some piece of code that 
runs once a month.  Premature optimization is the root of all evil, 
and what not.


One could even go as far as to allow a driver to #define 
WANT_STRICT_IO and then it would get the strict versions.  Add that

to any driver that uses DMA and then worry about vetting those
drivers.


See above -- if you must have a #define, then it should be WANT_RELAXED_IO.

-Scott

___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-05-27 Thread Benjamin Herrenschmidt

On Tue, 2008-05-27 at 08:35 -0700, Linus Torvalds wrote:
 
 On Tue, 27 May 2008, Benjamin Herrenschmidt wrote:
  
  Yes. As it is today, tg3 for example is potentially broken on all archs
  with newer gcc unless we either add memory clobber to readl/writel or
  stick some wmb's in there (just a random driver I picked).
  
  So Linus, what is your take on that matter ?
 
 Let's just serialize the damn things, and add a memory clobber to them.
 
 Expecting people to fix up all drivers is simply not going to happen. And 
 serializing things shouldn't be *that* expensive. People who cannot take 
 the expense can continue to use the magic __raw_writel() etc stuff.

Ok.

Do we also remove wmb/rmb/... from drivers then ? :-) I think ia64 would
need to be fixed to make their writel serializing...

Regarding __raw_* their semantics are dodgy ... we might want to provide
something better but it's a different subject.

Cheers,
Ben.


___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-05-27 Thread Benjamin Herrenschmidt

On Tue, 2008-05-27 at 08:50 -0700, Roland Dreier wrote:
  Though it's my understanding that at least ia64 does require the
   explicit barriers anyway, so we are still in a dodgy situation here
   where it's not clear what drivers should do and we end up with
   possibly excessive barriers on powerpc where I end up with both
   the wmb/rmb/mb that were added for ia64 -and- the ones I have in
   readl/writel to make them look synchronous... Not nice.
 
 ia64 is a disaster with a slightly different ordering problem -- the
 mmiowb() issue.  I know Ben knows far too much about this, but for big
 SGI boxes, you sometimes need mmiowb() to avoid problems with driver
 code that does totally sane stuff like

This is a different issue. We deal with it on powerpc by having writel
set a per-cpu flag and spin_unlock() test it, and do the barrier if
needed there.

However, drivers such as e1000 -also- have a wmb() between filling the
ring buffer and kicking the DMA with MMIO, with a comment about this
being needed for ia64 relaxed ordering.

Ben.


___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-05-27 Thread Benjamin Herrenschmidt

On Tue, 2008-05-27 at 09:47 -0700, Linus Torvalds wrote:
 
 __read[bwlq]()/__write[bwlq]() are not serialized with a :memory 
 barrier, although since they still use asm volatile I suspect that
 i 
 practice they are probably serial too. Did not look very closely at
 any 
 generated code (only did a trivial test to see that the code looks 
 *roughly* correct).

Nah, asm volatile doesn't help, it does need the memory clobber too.

Cheers,
Ben.


___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-05-27 Thread Chris Friesen

Roland Dreier wrote:


Writes are posted yes, but not reordered arbitrarily.  If I have code like:

spin_lock(mmio_lock);
writel(val1, reg1);
writel(val2, reg2);
spin_unlock(mmio_lock);

then I have a reasonable expectation that if two CPUs run this at the
same time, their writes to reg1/reg2 won't be interleaved with each
other (because the whole section is inside a spinlock).  And Altix
violates that expectation.


Does that necessarily follow?

If you've got a large system with multiple pci bridges, could you end up 
with posted writes coming from different cpus taking a different amount 
of time to propagate to a device and thus colliding?


Chris
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-05-27 Thread Roland Dreier
   Writes are posted yes, but not reordered arbitrarily.  If I have code like:
  
  spin_lock(mmio_lock);
  writel(val1, reg1);
  writel(val2, reg2);
  spin_unlock(mmio_lock);
  
   then I have a reasonable expectation that if two CPUs run this at the
   same time, their writes to reg1/reg2 won't be interleaved with each
   other (because the whole section is inside a spinlock).  And Altix
   violates that expectation.
  
  Does that necessarily follow?
  
  If you've got a large system with multiple pci bridges, could you end
  up with posted writes coming from different cpus taking a different
  amount of time to propagate to a device and thus colliding?

Not on x86.  And a given PCI device can only be reached from a single
host bridge, so I don't see how it can happen.  But on SGI altix
systems, there is a routed fabric between the CPU and the PCI bus, so
the reordering can happen there.  Hence mmiowb() and the endless supply
of driver bugs that it causes.

 - R.
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-05-27 Thread Linus Torvalds


On Wed, 28 May 2008, Benjamin Herrenschmidt wrote:
 
 On Tue, 2008-05-27 at 08:35 -0700, Linus Torvalds wrote:
  
  Expecting people to fix up all drivers is simply not going to happen. And 
  serializing things shouldn't be *that* expensive. People who cannot take 
  the expense can continue to use the magic __raw_writel() etc stuff.
 
 Ok.
 
 Do we also remove wmb/rmb/... from drivers then ? :-) I think ia64 would
 need to be fixed to make their writel serializing...

Well..

There's really two different issues:

 (a) x86 and the fact that we have thousands of drivers

which in turn conflicts with

 (b) non-x86 and the fact that other architectures tend to be absolute 
 pieces of cr*p when it comes to ordering, _especially_ across IO.

and the thing about (b) is that the number of drivers involved is a hell 
of a lot smaller. For example, ia64 and the big SGI machines probably 
really only care about roughly five drivers (number taken out of my nether 
regions). 

So practically speaking, I suspect that the right approach is to just say 
ok, x86 will continue to be pretty darn ordered, and the barriers won't 
really matter (*) but at the same time also saying we wish reality was 
different, and well-maintained drivers should probably try to work in the 
presense of re-ordering.

In *practice*, that probably means that most architectures will be better 
off if they emulate x86 closely, just because that means that they won't 
rely on drivers always getting things right, but I think we can leave the 
door open for the odd machines. We should just realize that they will 
never get a lot of testing, but on the other hand, their usage scenarios 
will generally also be very limited (very specific loads, and _very_ 
specific hardware).

And the patch I sent out actually made __[raw_]readl() different from 
readl() on x86 too, in that the assembly _allows_ a bit more 
re-ordering, even though I doubt it will be visible in practice. So if you 
use the __ versions, you'd better have barriers even on x86!

Linus

(*) With the possible but unlikely exception of some big machines with 
separate IO networks, but if they happen they will fall into the 'ia64' 
case of just having a few relevant drivers.
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-05-27 Thread Roland Dreier
  This is a different issue. We deal with it on powerpc by having writel
  set a per-cpu flag and spin_unlock() test it, and do the barrier if
  needed there.

Cool... I assume you do this for mutex_unlock() etc?

Is there any reason why ia64 can't do this too so we can kill mmiowb and
save everyone a lot of hassle?  (mips, sh and frv have non-empty
mmiowb() definitions too but I'd guess that these are all bugs based on
misunderstandings of the mmiowb() semantics...)

  However, drivers such as e1000 -also- have a wmb() between filling the
  ring buffer and kicking the DMA with MMIO, with a comment about this
  being needed for ia64 relaxed ordering.

I put these barriers into mthca, mlx4 etc, although it came from my
possible misunderstanding of the memory ordering rules in the kernel
more than any experience of problems (as opposed the the mmiowb()s,
which all came from real world bugs).

 - R.
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-05-27 Thread Benjamin Herrenschmidt

 So practically speaking, I suspect that the right approach is to just say 
 ok, x86 will continue to be pretty darn ordered, and the barriers won't 
 really matter (*) but at the same time also saying we wish reality was 
 different, and well-maintained drivers should probably try to work in the 
 presense of re-ordering.

Ok, well, I'll slap memory clobbers onto powerpc accessors, we made
them fully ordered a while ago anyway. The extra barriers in drivers
like USB etc.. won't hurt us much, we can always fine tune drivers that
really want high performances.

A problem with __raw_ though is that they -also- don't do byteswap,
which is a pain in the neck as people use them for either one reason
(relaxed ordering) or the other (no byteswap) without always knowing the
consequences of doing so...

I'm happy to say that __raw is purely about ordering and make them
byteswap on powerpc tho (ie, make them little endian like the non-raw
counterpart).

Some archs started providing writel_be etc... I added those to powerpc a
little while ago, and I tend to prefer that approach for the byteswap
issue.

What do you think ?

Ben.


___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-05-27 Thread Alan Cox
 re-ordering, even though I doubt it will be visible in practice. So if you 
 use the __ versions, you'd better have barriers even on x86!

Are we also going to have __ioread*/__iowrite* ?

Also is the sematics of __readl/__writel defined for all architectures -
I used it ages ago in the i2o drivers for speed and it got removed
because it didn't build on some platforms.

Alan

___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-05-27 Thread Matthew Wilcox
On Tue, May 27, 2008 at 10:38:22PM +0100, Alan Cox wrote:
  re-ordering, even though I doubt it will be visible in practice. So if you 
  use the __ versions, you'd better have barriers even on x86!
 
 Are we also going to have __ioread*/__iowrite* ?

Didn't we already define ioread*() to have loose semantics?

-- 
Intel are signing my paycheques ... these opinions are still mine
Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step.
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-05-27 Thread Linus Torvalds


On Wed, 28 May 2008, Benjamin Herrenschmidt wrote:
 
 A problem with __raw_ though is that they -also- don't do byteswap,

Well, that's why there is __readl() and __raw_readl(), no?

Neither does ordering, and __raw_readl() doesn't do byte-swap.

Of course, I'm not going to guarantee every architecture even has all 
those versions, nor am I going to guarantee they all work as advertised :)

For x86, they have historially all been 100% identical. With the inline 
asm patch I posted, the __ version (whether raw or not) lack the 
memory barrier, so they allow a *little* bit more re-ordering.

(They won't be re-ordered wrt spinlocks etc, unless gcc starts reordering 
volatile asm's against each other, which would be a bug).

In practice, I doubt it matters. Whatever small compiler re-ordering it 
might affect won't have any real performance impack one way or the other, 
I think.

Linus
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: MMIO and gcc re-ordering issue

2008-05-27 Thread Linus Torvalds


On Tue, 27 May 2008, Alan Cox wrote:

  re-ordering, even though I doubt it will be visible in practice. So if you 
  use the __ versions, you'd better have barriers even on x86!
 
 Are we also going to have __ioread*/__iowrite* ?

I doubt there is any reason to. Let's just keep them very strictly 
ordered. 

 Also is the sematics of __readl/__writel defined for all architectures -
 I used it ages ago in the i2o drivers for speed and it got removed
 because it didn't build on some platforms.

Agreed - I'm not sure the __ versions are really worth it. We have them, 
but the semantics are subtle enough that most drivers will never care 
enough to really use them.

I would suggest using them mainly for architecture-specific drivers, on 
architectures where it actually matters (which is not the case on x86).

Linus
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


  1   2   >