Re: [PATCH] add delay between port write and port read

2019-03-01 Thread Maciej W. Rozycki
On Fri, 1 Mar 2019, Linus Torvalds wrote:

> > Without that, using __raw_xyz() to copy between RAM and
> > buffers in PCI memory space is broken, as you said, but the
> > assumption would be broken on certain older machines that
> > do a hardware endian swap by swizzling the address lines rather
> > than swapping bytes on the data bus.
> 
> Honestly, there's no way we can ever fix that.

 Well, (: the good news is we actually have it covered in the MIPS port 
already, with the interfaces I described in the other e-mail.  We have had 
it for some 15 years now, ever since we got the ATA PIO stuff covered.

 So all that we need is to agree upon names so that everyone does not come 
up with their own ones, write them down somewhere, and let individual port 
maintainers sort it out as they deem necessary.  This way we have a 
solution available, but don't have to do anything.  Win-win.

> Because no new CPU's will ever be designed where the byte order isn't
> little-endian, and nobody will ever make those broken "we'll do random
> things in hardware to worfk around the fact that we're doing crazy
> things" machines.

 Even better then: the trasformation will be trivial, i.e. pass-through, 
so all the new CPU stuff will merely alias all the accessor variants to 
each other.  Which means we can put all that as a bunch of #defines in 
include/asm-generic/io.h and forget about it.

  Maciej


Re: [PATCH] add delay between port write and port read

2019-03-01 Thread Linus Torvalds
On Fri, Mar 1, 2019 at 1:52 PM Arnd Bergmann  wrote:
>
> Without that, using __raw_xyz() to copy between RAM and
> buffers in PCI memory space is broken, as you said, but the
> assumption would be broken on certain older machines that
> do a hardware endian swap by swizzling the address lines rather
> than swapping bytes on the data bus.

Honestly, there's no way we can ever fix that.

The fact is, __raw_readl/writel simply isn't portable. If a driver
uses them, the driver HAS TO KNOW what the situation is on the
hardware it runs on.

Because the "byte order" of the end result will basically be random on
some machines.

And you know what? We absolutely do not care. Not one whit.

Because no new CPU's will ever be designed where the byte order isn't
little-endian, and nobody will ever make those broken "we'll do random
things in hardware to worfk around the fact that we're doing crazy
things" machines.

So it's all legacy anyway, and it's absolutely not worth
over-designing or over-thinking this issue.

If you have a random SoC that does insane things, that random SoC may

 (a) need its own drivers that know exactly how it works

 (b) not be able to use some drivers for IP that the SoC has, but the
SoC designers screwed up and did some random crazy byte lane swizzling

and that's all perfectly fine.  We don't care. That SoC gets to do its
own drivers.

And 99% of truly generic drivers won't have any of these issues,
because they don't use __raw_readl/writel to begin with.

So don't make this any stranger than it is. Just accept that by
design, and by definition, __raw_readl/writel is a bit non-portable,
and won't work for everything.

The corollary of that is that __raw_readl/writel() may also then be
used _for_ that mis-designed piece of garbage SoC to implement the
special driver needed for just that SoC.

But as mentioned, this simply isn't a problem in practice. Sure,
people can do insane things if they really want to. Maybe some "stable
genius" decides to do a big-endian RISC-V core. We don't care, and we
don't make everybody else have to worry about that kind of insanity.

   Linus


Re: [PATCH] add delay between port write and port read

2019-03-01 Thread Maciej W. Rozycki
On Fri, 1 Mar 2019, Arnd Bergmann wrote:

> > I think the whole point of __raw_xyz() is that it's the lowest level
> > model. It gives you relaxed ordering (together with the ioremap
> > model), and it gives you straight-through behavior.
> >
> > And yes, any driver using them needs to be aware of the byte ordering,
> > which may or may not be the same as regular memory, and may or may not
> > be the same as other devices.
> >
> > So __raw_xyz() is very much for low-level drivers that know what they
> > are doing. Caveat user.
> >
> > "If it breaks, you get to keep both pieces"
> 
> I agree in principle, but I think we already have a lot of precedence
> for __raw_xyz() being relied on having a specific behavior in
> architecture independent drivers, and I think it makes sense for
> architectures to provide that.
> 
> Specifically, I think we need __raw_xyz() to do the same as xyz()
> on all little-endian kernels regarding byte ordering (not barriers), and
> I would expect it to provide the same ordering and addressing
> as swabX(xyz()) on big-endian kernels.
> 
> Without that, using __raw_xyz() to copy between RAM and
> buffers in PCI memory space is broken, as you said, but the
> assumption would be broken on certain older machines that
> do a hardware endian swap by swizzling the address lines rather
> than swapping bytes on the data bus.

 Ah, swizzling!  Thanks for reminding me of that.  The swizzling machines 
will be those that do bit (rather than byte) lane matching in hardware and 
consequently produce a numeric value of data that is consistent between 
the CPU and a device when accessed in bus-width quantities.  These do 
indeed require address adjustment if you access a narrower quantity.

> The best idea I have for working around this is to never rely
> on __raw_xyz() to not do byte swapping in platform specific
> drivers with CPU-endian MMIO space, but to have a platform
> specific set of wrappers around the normal I/O functions, and
> make __raw_xyz() just do whatever we expect them to do on
> PCI devices.

 For the record in the MIPS port we have the following accessors:

1. `readX', `writeX', `inX', `outX': data passed in the CPU endianness, 
   ordered WRT MMIO and (for reads) WRT memory; suitable for device CSR 
   accesses,

2. `__mem_readX', `__mem_writeX', `__mem_inX', `__mem_outX': data passed 
   in the memory endianness, ordered WRT MMIO and (for reads) WRT memory, 
   suitable for data transfers between memory and a device, e.g. PIO ATA,

3. `__relaxed_readX', `__relaxed_writeX': data passed in the CPU 
   endianness, ordered WRT MMIO only; aliased to `*_relaxed',

4. `__raw_readX', `__raw_writeX': data passed straight-through with no 
   transformation, ordered WRT MMIO and (for reads) WRT memory.

For correct operation of generic drivers we need at least #1 and #2, and 
we need #4 for platform drivers (which will know they're on a local bus); 
#3 is an optimisation only that is nice having, but drivers would do with 
#1 instead.  Note however that depending on the wiring of a particular 
big-endian machine we may have to swizzle addresses for #1, #2, #3, and 
then byte-swap either #1 and #3, or #2.

 See arch/mips/include/asm/*/mangle-port.h for all the mess.

 It actually took quite a while back in early 2000s to get PIO ATA to work 
correctly on big-endian MIPS machines, because people did not understand 
the subtlety between the CPU and the memory endianness, and consequently 
that different accessors are required for the 16-bit ATA data register; 
you just can't get 16-bit ATA data transfers and 16-bit CSR accesses made 
elsewhere right with the use of a single accessor only, we need two.

 I haven't looked at what other ports did in this area, but we do need at 
least #1 and #2, strongly-ordered, kernel-wide, and then we can discuss 
what the semantics of `__raw_xyz' is supposed to be, but at least some 
ports will have to retain the semantics of #4, strongly-ordered, under 
whatever name.  I think using a uniform one would be advantageous though, 
for obvious reasons.

 NB the old IDE driver had special local provisions for PIO, which wired 
#2 accessors on MIPS by including a platform specific header that provided 
IDE-specific names and I suppose libata carried that over or it wouldn't 
work.  But I think that instead of having such driver-specific hacks that 
require port maintainers' attention on a case-by-case basis we really 
ought to provide #2 semantics kernel-wide.

  Maciej


Re: [PATCH] add delay between port write and port read

2019-03-01 Thread Arnd Bergmann
On Fri, Mar 1, 2019 at 8:19 PM Linus Torvalds
 wrote:
>
> On Fri, Mar 1, 2019 at 11:13 AM Maciej W. Rozycki  
> wrote:
> >
> >  What do we do WRT straight-through vs byte-swapping properties of these
> > accessors?
>
> I think the whole point of __raw_xyz() is that it's the lowest level
> model. It gives you relaxed ordering (together with the ioremap
> model), and it gives you straight-through behavior.
>
> And yes, any driver using them needs to be aware of the byte ordering,
> which may or may not be the same as regular memory, and may or may not
> be the same as other devices.
>
> So __raw_xyz() is very much for low-level drivers that know what they
> are doing. Caveat user.
>
> "If it breaks, you get to keep both pieces"

I agree in principle, but I think we already have a lot of precedence
for __raw_xyz() being relied on having a specific behavior in
architecture independent drivers, and I think it makes sense for
architectures to provide that.

Specifically, I think we need __raw_xyz() to do the same as xyz()
on all little-endian kernels regarding byte ordering (not barriers), and
I would expect it to provide the same ordering and addressing
as swabX(xyz()) on big-endian kernels.

Without that, using __raw_xyz() to copy between RAM and
buffers in PCI memory space is broken, as you said, but the
assumption would be broken on certain older machines that
do a hardware endian swap by swizzling the address lines rather
than swapping bytes on the data bus.

The best idea I have for working around this is to never rely
on __raw_xyz() to not do byte swapping in platform specific
drivers with CPU-endian MMIO space, but to have a platform
specific set of wrappers around the normal I/O functions, and
make __raw_xyz() just do whatever we expect them to do on
PCI devices.

   Arnd


Re: [PATCH] add delay between port write and port read

2019-03-01 Thread Maciej W. Rozycki
On Fri, 1 Mar 2019, Linus Torvalds wrote:

> Because quite often you don't want any extra byte ordering because
> you've moving things around anyway (ie you're copying from the device
> to memory or similar, and switching to little-endian in between would
> just mean that you have to switch back for the memory write anyway).

 Good point!

  Maciej


Re: [PATCH] add delay between port write and port read

2019-03-01 Thread Linus Torvalds
On Fri, Mar 1, 2019 at 11:13 AM Maciej W. Rozycki  wrote:
>
>  What do we do WRT straight-through vs byte-swapping properties of these
> accessors?

I think the whole point of __raw_xyz() is that it's the lowest level
model. It gives you relaxed ordering (together with the ioremap
model), and it gives you straight-through behavior.

And yes, any driver using them needs to be aware of the byte ordering,
which may or may not be the same as regular memory, and may or may not
be the same as other devices.

So __raw_xyz() is very much for low-level drivers that know what they
are doing. Caveat user.

"If it breaks, you get to keep both pieces"

 Linus


Re: [PATCH] add delay between port write and port read

2019-03-01 Thread Maciej W. Rozycki
On Fri, 1 Mar 2019, Linus Torvalds wrote:

> So that would seem what an architecture implementation should _aim_
> for: having various "ioremap_xyz()" for setting the
> PCIe/system/whatever controller level ordering, and then using the
> "__raw_xyz()" accessors for unordered CPU accesses.

 What do we do WRT straight-through vs byte-swapping properties of these 
accessors?

 For the record: we do need to have straight-through accessors for 
endianness-agnostic peripherals.  This is the case for example with SOC 
devices that have plenty of I/O onchip that are always accessed natively 
regardless of the endianness of the external bus.

 E.g. with the Broadcom SiByte BCM1250 SOC we have a whole lot of devices 
onchip (including a pair of MIPS64 CPU cores), and then PCI and HT 
endpoints, and the SOC's endianness is configurable at reset.  The 
endianness selected only applies to the external bus, affecting external 
data lanes: in the big-endian case there's a choice between matching byte 
lanes or bit lanes by using one of the two MMIO physical address spaces 
that apply either of these policies; in the little-endian case access is 
pass-through of course.  Consequently we have to use straight-through 
accessors for onchip devices and byte-swapped ones for PCI/HT devices.

  Maciej


Re: [PATCH] add delay between port write and port read

2019-03-01 Thread Linus Torvalds
On Fri, Mar 1, 2019 at 10:03 AM Maciej W. Rozycki  wrote:
>
>  Well, `__raw_*' accessors are never byte-swapped, not at least with the
> MIPS port, making them a tad cumbersome for a driver that has no interest
> in paying attention to any endianness mismatch between the CPU bus and the
> device's peripheral bus.

Well, the people who want ultimate performance and not worry about
access ordering almost always _also_ want to handle byte ordering
manually.

Because quite often you don't want any extra byte ordering because
you've moving things around anyway (ie you're copying from the device
to memory or similar, and switching to little-endian in between would
just mean that you have to switch back for the memory write anyway).

Does it make things more complicated for a driver when it has to care
about byte ordering and not just say "PCI is always little-endian"?
Yes. But if you want simple, you shouldn't be doing the unordered
accesses.

   Linus


Re: [PATCH] add delay between port write and port read

2019-03-01 Thread Maciej W. Rozycki
Hi Will,

> >  FAOD, I think this assumption/requirement only applies to the plain 
> > accessors (`inX', `readX', `ioreadX', etc.).
> 
> It's also a requirement for the *_relaxed accessors, and there are drivers
> that rely on this being the case.

 Well, from the reading of memory-barriers.txt, be it as it stands, or 
with your rewording applied, I take it the `*_relaxed' accessors do not 
guarantee ordering WRT locking or DMA and hence a trailing barrier in 
`readX_relaxed' is not necessary, and so we don't need a trailing `mb' 
with the Alpha port either (but we do need a leading `mb' there, as well 
as with `writeX_relaxed').

> >  For performance reasons we may decide sometime to opt in for accessors 
> > that do not suffer from the requirement to be strongly ordered WRT each 
> > other, for the benefit to architectures that are not strongly ordered with 
> > MMIO and that suffer a lot from serialising accesses that do not really 
> > care, e.g. where you need to load a bunch of device registers or maybe 
> > even device RAM in any order before making a serialised final request to 
> > accept the values loaded.
> 
> I'd expect accesses to device RAM to use something like ioremap_wc() if
> possible. In that case, the ordering of accesses is weakened by the
> underlying memory type in the page tables, but we're not yet at the point
> where we've figured out the portable semantics in this case. I plan to
> look at that once we've nailed normal ioremap()!

 Some CPU hardware and certainly many if not most MIPS implementations do 
not give such a finegrained control over mapping.  I think all the MIPS 
CPUs I dealt with only gave the choice between a cached (coherent or not) 
and an uncached mapping, with the ordering being weak in all cases.  So 
the only control over ordering was with explicit barrier instructions.

> >  That piece of hardware is however rather peculiar and not an example of 
> > the most common design seen nowadays and I am not sure if the extra 
> > maintenance burden across all the ports for any additional accessors would 
> > be outweighed by the benefit for the weakly ordered MMIO architectures 
> > (where an execution stall can indeed count in hundreds of clock cycles per 
> > barrier inserted) combined with the appreciation (i.e. actual use) level 
> > from driver writers who do not necessarily grok all that weak ordering 
> > business.
> 
> If you use ioremap_wc() for device RAM and __raw_readX() for bunching up
> register accesses, then I don't think we need to add anything new, do we?

 Well, `__raw_*' accessors are never byte-swapped, not at least with the 
MIPS port, making them a tad cumbersome for a driver that has no interest 
in paying attention to any endianness mismatch between the CPU bus and the 
device's peripheral bus.

 Granted this does not matter for this driver as it only has a chance to 
be used with DECstation (MIPS), DEC 3000 AXP (Alpha) and VAXstation 4000 
(VAX) hardware, all of which are little-endian throughout (and not all of 
which we have support for upstream).  Still I'd consider it a hack as 
obviously we do have drivers for options living on say PCI, which is 
little-endian, that we want to use with processors configured for the big 
endianness with their FSB.

  Maciej


Re: [PATCH] add delay between port write and port read

2019-03-01 Thread Linus Torvalds
On Fri, Mar 1, 2019 at 9:33 AM Will Deacon  wrote:
>
> I'd expect accesses to device RAM to use something like ioremap_wc() if
> possible. In that case, the ordering of accesses is weakened by the
> underlying memory type in the page tables, but we're not yet at the point
> where we've figured out the portable semantics in this case. I plan to
> look at that once we've nailed normal ioremap()!

The case that matters most from a performance standpoint is
traditionally stupid framebuffer accesses, and the fb layer has
basically standardized this:

  #define fb_writel __raw_writel

together with various fb drivers then using "wmb()" etc for ordering
for the non-framebuffer effects.

So that would seem what an architecture implementation should _aim_
for: having various "ioremap_xyz()" for setting the
PCIe/system/whatever controller level ordering, and then using the
"__raw_xyz()" accessors for unordered CPU accesses.

Linus


Re: [PATCH] add delay between port write and port read

2019-03-01 Thread Will Deacon
Hi Maciej,

On Wed, Feb 27, 2019 at 06:49:57PM +, Maciej W. Rozycki wrote:
> On Tue, 26 Feb 2019, Will Deacon wrote:
> 
> > > If they are the same device (just different data ports), I'd
> > > *definitely* expect them to be ordered.
> > > 
> > > We have tons of code that depends on that. Almost every driver out
> > > there, in fact.
> > > 
> > > So we need the mb() on alpha to guarantee the access ordering on the
> > > CPU side, and then PCI itself ends up guaranteeing that accesses to
> > > the same device will remain ordered outside the CPU.
> > > 
> > > Agreed?
> > 
> > Yup, agreed. I'd consider all those ports to be the same endpoint, so we're
> > good.
> 
>  FAOD, I think this assumption/requirement only applies to the plain 
> accessors (`inX', `readX', `ioreadX', etc.).

It's also a requirement for the *_relaxed accessors, and there are drivers
that rely on this being the case.

>  For performance reasons we may decide sometime to opt in for accessors 
> that do not suffer from the requirement to be strongly ordered WRT each 
> other, for the benefit to architectures that are not strongly ordered with 
> MMIO and that suffer a lot from serialising accesses that do not really 
> care, e.g. where you need to load a bunch of device registers or maybe 
> even device RAM in any order before making a serialised final request to 
> accept the values loaded.

I'd expect accesses to device RAM to use something like ioremap_wc() if
possible. In that case, the ordering of accesses is weakened by the
underlying memory type in the page tables, but we're not yet at the point
where we've figured out the portable semantics in this case. I plan to
look at that once we've nailed normal ioremap()!

>  I made provisions for that with a driver I recently added with commit 
> 61414f5ec983 ("FDDI: defza: Add support for DEC FDDIcontroller 700 
> TURBOchannel adapter"), where locally defined accessor macros suffixed 
> with `_o' and `_u' denote accesses that have to be strongly ordered and 
> can be weakly ordered respectively WRT each other.
> 
>  Right now they all expand to the respective `_relaxed' accessors (with a 
> lone `dma_rmb' inserted appropriately; yes, the device does DMA one way 
> only, and the other one is PIO with a lot of MMIO traffic to board RAM 
> that would benefit from omitting barriers), however they can be replaced 
> with references to truly unordered accessors if we ever have them.
> 
>  That piece of hardware is however rather peculiar and not an example of 
> the most common design seen nowadays and I am not sure if the extra 
> maintenance burden across all the ports for any additional accessors would 
> be outweighed by the benefit for the weakly ordered MMIO architectures 
> (where an execution stall can indeed count in hundreds of clock cycles per 
> barrier inserted) combined with the appreciation (i.e. actual use) level 
> from driver writers who do not necessarily grok all that weak ordering 
> business.

If you use ioremap_wc() for device RAM and __raw_readX() for bunching up
register accesses, then I don't think we need to add anything new, do we?

Will


Re: [PATCH] add delay between port write and port read

2019-02-27 Thread Linus Torvalds
On Wed, Feb 27, 2019 at 10:54 AM Maciej W. Rozycki  wrote:
>
>  For that reason though we don't have the trailing barrier in the
> `readX_relaxed' accessors in the MIPS port.

Makes sense. And alpha should probably follow suit.

 Linus


Re: [PATCH] add delay between port write and port read

2019-02-27 Thread Maciej W. Rozycki
On Wed, 27 Feb 2019, Linus Torvalds wrote:

> > Should "writeb_relaxed" on Alpha also use the barrier?
> 
> I think they should.  Only the double-underscore (__raw_xyz()) ones
> are entirely unordered, the relaxed ones are just unordered wrt
> regular memory and DMA.

 For that reason though we don't have the trailing barrier in the 
`readX_relaxed' accessors in the MIPS port.

  Maciej


Re: [PATCH] add delay between port write and port read

2019-02-27 Thread Maciej W. Rozycki
On Tue, 26 Feb 2019, Will Deacon wrote:

> > If they are the same device (just different data ports), I'd
> > *definitely* expect them to be ordered.
> > 
> > We have tons of code that depends on that. Almost every driver out
> > there, in fact.
> > 
> > So we need the mb() on alpha to guarantee the access ordering on the
> > CPU side, and then PCI itself ends up guaranteeing that accesses to
> > the same device will remain ordered outside the CPU.
> > 
> > Agreed?
> 
> Yup, agreed. I'd consider all those ports to be the same endpoint, so we're
> good.

 FAOD, I think this assumption/requirement only applies to the plain 
accessors (`inX', `readX', `ioreadX', etc.).

 For performance reasons we may decide sometime to opt in for accessors 
that do not suffer from the requirement to be strongly ordered WRT each 
other, for the benefit to architectures that are not strongly ordered with 
MMIO and that suffer a lot from serialising accesses that do not really 
care, e.g. where you need to load a bunch of device registers or maybe 
even device RAM in any order before making a serialised final request to 
accept the values loaded.

 I made provisions for that with a driver I recently added with commit 
61414f5ec983 ("FDDI: defza: Add support for DEC FDDIcontroller 700 
TURBOchannel adapter"), where locally defined accessor macros suffixed 
with `_o' and `_u' denote accesses that have to be strongly ordered and 
can be weakly ordered respectively WRT each other.

 Right now they all expand to the respective `_relaxed' accessors (with a 
lone `dma_rmb' inserted appropriately; yes, the device does DMA one way 
only, and the other one is PIO with a lot of MMIO traffic to board RAM 
that would benefit from omitting barriers), however they can be replaced 
with references to truly unordered accessors if we ever have them.

 That piece of hardware is however rather peculiar and not an example of 
the most common design seen nowadays and I am not sure if the extra 
maintenance burden across all the ports for any additional accessors would 
be outweighed by the benefit for the weakly ordered MMIO architectures 
(where an execution stall can indeed count in hundreds of clock cycles per 
barrier inserted) combined with the appreciation (i.e. actual use) level 
from driver writers who do not necessarily grok all that weak ordering 
business.

  Maciej


Re: [PATCH] add delay between port write and port read

2019-02-27 Thread Maciej W. Rozycki
On Wed, 27 Feb 2019, Linus Torvalds wrote:

> > I suppose you might need the mb() before *and* after the I/O access in the
> > read case. The idea with readX()/ioreadX() is that you should be able to
> > do something like:
> 
> Yeah, that sounds reasonable.
> 
> You might relax the barrier after the readX() to just a rmb(), which
> might make the performance impact slightly less noticeable and might
> be sufficient in practice. But I guess once you do IO, it's not like
> the CPU barrier will be the limiting case.

 FWIW in the MIPS port we've had it as `rmb' since Sinan's commit 
a1cc7034e33d ("MIPS: io: Add barrier after register read in readX()"), but 
then we expand this macro to a hardware SYNC instruction anyway (rather 
than SYNC_RMB or at most SYNC_MB), which is stronger these days even as it 
has been at one point redefined in the architecture as a completion rather 
than an ordering barrier.

 NB MIPS also has SYNC_ACQUIRE and SYNC_RELEASE too, which are assymetric 
load <= load/store and load/store <= store ordering barriers respectively, 
obviously meant for locking.

> So maybe just a full mb() and see if anybody notices. Better to have
> working code than random failures.

 For Alpha it doesn't matter anyway as it doesn't have a separate read 
barrier.  Although I'd prefer to have the expected semantics recorded even 
if the underlying implementation is the same, as otherwise it gets even 
more confusing to people than it already is.

  Maciej


Re: [PATCH] add delay between port write and port read

2019-02-27 Thread Maciej W. Rozycki
On Wed, 27 Feb 2019, Sinan Kaya wrote:

> What we missed is the fact that alpha reorders accesses across two
> register accesses. This is guaranteed in other architectures.

 Not for MIPS either.

  Maciej


Re: [PATCH] add delay between port write and port read

2019-02-27 Thread Linus Torvalds
On Wed, Feb 27, 2019 at 9:26 AM Will Deacon  wrote:
>
> I suppose you might need the mb() before *and* after the I/O access in the
> read case. The idea with readX()/ioreadX() is that you should be able to
> do something like:

Yeah, that sounds reasonable.

You might relax the barrier after the readX() to just a rmb(), which
might make the performance impact slightly less noticeable and might
be sufficient in practice. But I guess once you do IO, it's not like
the CPU barrier will be the limiting case.

That said, excessively weakly ordered CPU's are horrible for a reason,
and the barrier model was broken and made them even more so.  And
alpha is a posted boy for "don't let hw people design a memory odering
just to make it easy for them".

So maybe just a full mb() and see if anybody notices. Better to have
working code than random failures.

  Linus


Re: [PATCH] add delay between port write and port read

2019-02-27 Thread Linus Torvalds
On Wed, Feb 27, 2019 at 9:05 AM Mikulas Patocka  wrote:
>
> Should "writeb_relaxed" on Alpha also use the barrier?

I think they should.  Only the double-underscore (__raw_xyz()) ones
are entirely unordered, the relaxed ones are just unordered wrt
regular memory and DMA.

That said, I doubt anything much cares, and I'm not sure alpha has
ever done that.

 Linus


Re: [PATCH] add delay between port write and port read

2019-02-27 Thread Will Deacon
On Wed, Feb 27, 2019 at 12:12:51PM -0500, Mikulas Patocka wrote:
> 
> 
> On Tue, 26 Feb 2019, Linus Torvalds wrote:
> 
> > Does anybody see any worries with the "just move the mb() earlier in
> > the ioread functions" model?
> > 
> > Linus
> 
> It used to be like that and it worked.
> 
> Then, commits cd0e00c106722eca40b38ebf11cf134c01901086 and 
> 92d7223a74235054f2aa7227d207d9c57f84dca0 came.
> 
> These commits claim that they changed the code to be consistent with the 
> specification (now we have barrier after ioread and before iowrite). But 
> they broke serial port and RTC on my Alpha machine.

I suppose you might need the mb() before *and* after the I/O access in the
read case. The idea with readX()/ioreadX() is that you should be able to
do something like:

status = ioread8(STATUS_REG);
if (status & RX_DMA_COMPLETE)
memcpy(mem_buf, dma_buf, size);

and rely on the DMA'd data being read out of the buffer.

Will


Re: [PATCH] add delay between port write and port read

2019-02-27 Thread Sinan Kaya

On 2/27/2019 12:12 PM, Mikulas Patocka wrote:

It used to be like that and it worked.

Then, commits cd0e00c106722eca40b38ebf11cf134c01901086 and
92d7223a74235054f2aa7227d207d9c57f84dca0 came.

These commits claim that they changed the code to be consistent with the
specification (now we have barrier after ioread and before iowrite). But
they broke serial port and RTC on my Alpha machine.


A barrier before write is needed to ensure that memory operations done
are visible to the hardware before you send a write command.

Barrier after read is needed to ensure that if you read a memory
location after register read, memory contents are coherent.

What we missed is the fact that alpha reorders accesses across two
register accesses. This is guaranteed in other architectures.

To satisfy this, you need something like

barrier
register write
barrier
register read
barrier



Re: [PATCH] add delay between port write and port read

2019-02-27 Thread Mikulas Patocka



On Tue, 26 Feb 2019, Linus Torvalds wrote:

> Does anybody see any worries with the "just move the mb() earlier in
> the ioread functions" model?
> 
> Linus

It used to be like that and it worked.

Then, commits cd0e00c106722eca40b38ebf11cf134c01901086 and 
92d7223a74235054f2aa7227d207d9c57f84dca0 came.

These commits claim that they changed the code to be consistent with the 
specification (now we have barrier after ioread and before iowrite). But 
they broke serial port and RTC on my Alpha machine.

Mikulas


Re: [PATCH] add delay between port write and port read

2019-02-27 Thread Mikulas Patocka



On Tue, 26 Feb 2019, Linus Torvalds wrote:

> On Tue, Feb 26, 2019 at 10:38 AM Will Deacon  wrote:
> >
> > That makes sense to me for this Alpha-specific case, but in general I
> > don't think we require that I/O accesses to different endpoints are
> > ordered with respect to each other, which was implied by one of Maciej's
> > examples. e.g.
> >
> > writeb(0x42, _mmio_reg);
> > readb(_mmio_reg);
> 
> If they are the same device (just different data ports), I'd
> *definitely* expect them to be ordered.
> 
> We have tons of code that depends on that. Almost every driver out
> there, in fact.
> 
> So we need the mb() on alpha to guarantee the access ordering on the
> CPU side, and then PCI itself ends up guaranteeing that accesses to
> the same device will remain ordered outside the CPU.
> 
> Agreed?
> 
> Linus

Should "writeb_relaxed" on Alpha also use the barrier?

The understanding is that readX_relaxed and writeX_relaxed are ordered 
with other accesses to the same PCI device. If Alpha doesn't gaurantee 
that, they should use barriers too.

Mikulas


Re: [PATCH] add delay between port write and port read

2019-02-26 Thread Will Deacon
On Tue, Feb 26, 2019 at 10:43:24AM -0800, Linus Torvalds wrote:
> On Tue, Feb 26, 2019 at 10:38 AM Will Deacon  wrote:
> >
> > That makes sense to me for this Alpha-specific case, but in general I
> > don't think we require that I/O accesses to different endpoints are
> > ordered with respect to each other, which was implied by one of Maciej's
> > examples. e.g.
> >
> > writeb(0x42, _mmio_reg);
> > readb(_mmio_reg);
> 
> If they are the same device (just different data ports), I'd
> *definitely* expect them to be ordered.
> 
> We have tons of code that depends on that. Almost every driver out
> there, in fact.
> 
> So we need the mb() on alpha to guarantee the access ordering on the
> CPU side, and then PCI itself ends up guaranteeing that accesses to
> the same device will remain ordered outside the CPU.
> 
> Agreed?

Yup, agreed. I'd consider all those ports to be the same endpoint, so we're
good.

Will


Re: [PATCH] add delay between port write and port read

2019-02-26 Thread Linus Torvalds
On Tue, Feb 26, 2019 at 10:38 AM Will Deacon  wrote:
>
> That makes sense to me for this Alpha-specific case, but in general I
> don't think we require that I/O accesses to different endpoints are
> ordered with respect to each other, which was implied by one of Maciej's
> examples. e.g.
>
> writeb(0x42, _mmio_reg);
> readb(_mmio_reg);

If they are the same device (just different data ports), I'd
*definitely* expect them to be ordered.

We have tons of code that depends on that. Almost every driver out
there, in fact.

So we need the mb() on alpha to guarantee the access ordering on the
CPU side, and then PCI itself ends up guaranteeing that accesses to
the same device will remain ordered outside the CPU.

Agreed?

Linus


Re: [PATCH] add delay between port write and port read

2019-02-26 Thread Will Deacon
On Tue, Feb 26, 2019 at 10:12:03AM -0800, Linus Torvalds wrote:
> On Tue, Feb 26, 2019 at 9:52 AM Maciej W. Rozycki  
> wrote:
> >
> >  The thing is taking for example `ioreadX' and `iowriteX' we have `mb'
> > before a write and `mb' after a read.  So if we do say (addresses are
> > arbitrary for illustration only):
> >
> > iowrite8(0, 1);
> > ioread8(2);
> >
> > then these effectively expand to chipset-specific:
> >
> > mb();
> > foo_iowrite8(0, 1);
> > foo_ioread8(2);
> > mb();
> 
> Yeah, looking at the current alpha io.c file, I'd suggest that all IO
> operations just do the "mb()" _before_ doing the op.
> 
> That way all IO ops are at least ordered wrt each other, any any IO
> ops are ordered wrt any memory writes before it (in particular the
> "people wrote to memory, and then did a IO write to start DMA" case).
> 
> Also, since "spin_unlock()" on alpha has a memory barrier before
> releasing the lock, it means that IO ends up being ordered wrt locks
> too.
> 
> Does it mean that a "ioread()" can then be delayed until after later
> non-IO memory operations? Yes. But that seems harmless
> 
> That said, I didn't spend a _ton_ of time thinking about this and
> looking at the code, but it does seem like the simplest model really
> is: "just always do a mb() before any IO ops". I think the "order IO
> wrt each other, and wrt any _preceding_ memory traffic (and all
> locking) is the important part, and that would seem to guarantee it.
> 
> Obviously there can then be other re-ordering at the IO layer level
> (ie posted writes and IO being re-ordered across different devices
> etc), but that's universal and not alpha-specific.
> 
> Does anybody see any worries with the "just move the mb() earlier in
> the ioread functions" model?

That makes sense to me for this Alpha-specific case, but in general I
don't think we require that I/O accesses to different endpoints are
ordered with respect to each other, which was implied by one of Maciej's
examples. e.g.

writeb(0x42, _mmio_reg);
readb(_mmio_reg);

are not ordered with respect to each other (and I think if you wanted
this ordering in general you're going to need more than just fences).

I've been trying to write this up in memory-barriers.txt, so if anybody
has comments or concerns I'm keen to hear them, especially if there's
a feeling that we should be providing stronger guarantees for port I/O:

http://lkml.kernel.org/r/20190222175525.1198-1-will.dea...@arm.com

Cheers,

Will


Re: [PATCH] add delay between port write and port read

2019-02-26 Thread Linus Torvalds
On Tue, Feb 26, 2019 at 9:52 AM Maciej W. Rozycki  wrote:
>
>  The thing is taking for example `ioreadX' and `iowriteX' we have `mb'
> before a write and `mb' after a read.  So if we do say (addresses are
> arbitrary for illustration only):
>
> iowrite8(0, 1);
> ioread8(2);
>
> then these effectively expand to chipset-specific:
>
> mb();
> foo_iowrite8(0, 1);
> foo_ioread8(2);
> mb();

Yeah, looking at the current alpha io.c file, I'd suggest that all IO
operations just do the "mb()" _before_ doing the op.

That way all IO ops are at least ordered wrt each other, any any IO
ops are ordered wrt any memory writes before it (in particular the
"people wrote to memory, and then did a IO write to start DMA" case).

Also, since "spin_unlock()" on alpha has a memory barrier before
releasing the lock, it means that IO ends up being ordered wrt locks
too.

Does it mean that a "ioread()" can then be delayed until after later
non-IO memory operations? Yes. But that seems harmless

That said, I didn't spend a _ton_ of time thinking about this and
looking at the code, but it does seem like the simplest model really
is: "just always do a mb() before any IO ops". I think the "order IO
wrt each other, and wrt any _preceding_ memory traffic (and all
locking) is the important part, and that would seem to guarantee it.

Obviously there can then be other re-ordering at the IO layer level
(ie posted writes and IO being re-ordered across different devices
etc), but that's universal and not alpha-specific.

Does anybody see any worries with the "just move the mb() earlier in
the ioread functions" model?

Linus


Re: [PATCH] add delay between port write and port read

2019-02-26 Thread Maciej W. Rozycki
Linus,

 I have added you to this discussion in hope you can clear any uncertainty 
there might be about MMIO ordering properties of the Alpha; a question for 
you is towards the end of this e-mail.

On Thu, 21 Feb 2019, Mikulas Patocka wrote:

> Do you think that we should fix this by identifying hardware that needs 
> the delays and adding the delays there?

 First of all we need to correctly identify what is really going on there.  

 And I've had a look at our arch/alpha/kernel/io.c and 
arch/alpha/include/asm/io.h as they stand now and AFAICT what they have is 
completely broken and if it works anywhere, then by pure luck only, e.g. 
because external bus accesses complete quickly enough for code execution 
not to progress to a subsequent external bus access.

 The thing is taking for example `ioreadX' and `iowriteX' we have `mb' 
before a write and `mb' after a read.  So if we do say (addresses are 
arbitrary for illustration only):

iowrite8(0, 1);
ioread8(2);

then these effectively expand to chipset-specific:

mb();
foo_iowrite8(0, 1);
foo_ioread8(2);
mb();

Which means `foo_iowrite8' and `foo_ioread8' are unordered WRT each other 
as there is no barrier between them.  Worse yet for:

iowrite8(0, 1);
ioread8(1);

the `ioread8' operation may not even appear externally, as the CPU may 
peek the value in the CPU's write buffer while the write is still pending.  
A similar consideration can be done for `readX' and `writeX', as well as a 
mixture between the two kinds of these accesses.

 We do need to add a preceding `mb' to all the `*readX' accessors to 
satisfy the strong ordering requirement WRT any preceding `*writeX' 
access.  We need that leading `mb' in `*_relaxed' accessors as well.  And 
we need to alias `mmiowb' to `wmb' (unless it's removed as I've seen 
with Will's recent patches).

 NB, the 82378ZB SIO adds a minimum of 5 ISA clocks between back-to-back 
accesses from PCI to ISA, which AFAIU should be enough for the PC87332 
Super I/O.  More can be configured if required, but I doubt that is 
needed, because it wasn't for 20+ years.

> In my opinion, adding mb() to the port accessing functions is safer - it 
> is 6 line patch.

 The amount of code does not matter as long as it is wrong.  We need to 
get this right.

> Reading all the hardware manuals is time consuming and hardly anyone will 
> do it for 25 years old hardware.

 I do this regularly and I'm fine doing it this time too, also for the 
value of the education gained.  I'll see if I can experiment with my 
hardware too, but regrettably that'll have to wait a couple of months at 
the very least (due to Brexit :( -- I have left my Alpha in the UK and the 
very last weekend I literally headed off to stay away from all of the 
upcoming mess).

 Meanwhile here is what I have tracked down in the architecture manual[1]:

"Access to local I/O space does not cause any implicit read/write 
ordering; explicit barrier instructions must be used to ensure any desired 
ordering.  Barrier instructions must be used:

* After updating a memory-resident data structure and before writing a 
  local I/O space location to notify the device of the updates.

* Between multiple consecutive direct accesses to local I/O space, e.g.  
  device control registers, if those accesses are expected to be ordered
  at the device.

Again, note that implementations may cache not only memory-resident data
structures, but also local I/O space locations."

-- which in my opinion makes it clear that we need these barriers that 
were removed with commit 92d7223a7423 ("alpha: io: reorder barriers to 
guarantee writeX() and iowriteX() ordering #2"), in addition to ones added 
there (the `*_relaxed' accessors don't need the latter ones though).

 NB for some reason the whole chapter on I/O has been reduced in later 
revisions of the architecure manual[2][3] (as well as the shortened 
version of the book called the architecture handbook) to a single-page 
overview lacking any details.  I think that does not make the architecture 
ordered any more strongly WRT MMIO though; instead it makes it even less 
specified.

 Linus, can you confirm or contradict my conclusions taken from the 
manual in the context of their later removal?

 You did the Alpha port back in mid-1990s (BTW, Jon 'maddog' spoke very 
fondly about that effort at FOSDEM earlier this month) and certainly had 
to know these details to complete it and you worked with people directly 
involved with the design of the Alpha at the time it was being developed, 
and given the somewhat terse formal specification you could therefore be 
the only person around I know of who can give a definite answer.  I hope 
you still remember the bits.

 Questions, thoughts?

References:

[1] Richard L. Sites, ed., "Alpha Architecture Reference Manual", 
Digital Press, 1992, Chapter 8, "Input/Output (I)", Section 8.2.1 
"Read/Write Ordering", 


Re: [PATCH] add delay between port write and port read

2019-02-22 Thread Will Deacon
On Wed, Feb 20, 2019 at 05:52:18PM +0100, Arnd Bergmann wrote:
> On Wed, Feb 20, 2019 at 4:38 PM Maciej W. Rozycki  
> wrote:
> > On Wed, 20 Feb 2019, Mikulas Patocka wrote:
> > > >  Well, actually `iowriteX' is generic and for MMIO you have `writeX'.
> > > > See lib/iomap.c, the comment at the top and e.g. `iowrite8' there for an
> > > > actual proof.  Alpha may have an oddball implementation, but there you 
> > > > go.
> > > > Drivers will assume they can do `iowriteX' to any kind of I/O resource,
> > > > and ordering must be respected as per Documentation/memory-barriers.txt.
> > >
> > > So, do you think that the barrier whould be added to iowriteX and slow
> > > down every MMIO access?
> >
> >  We need it either for `outX' and `iowriteX' calls operating on port I/O
> > resources, or for neither of them, both at a time, to ensure the required
> > consistency between the two interfaces.  If that badly affects MMIO (and
> > is not required there; please remind me what the exact justification to
> > use `mb' here is, as it's not entirely clear to me from the commit
> > message; `mb' is a barrier and not means for a delay), then we need to
> > find a away for `iowriteX' to tell port I/O and MMIO accesses apart and
> > only apply the barrier for the former kind.
> 
> Will Deacon is in the process of sanitizing our documentation for this,
> and this part is still under discussion.

This thread seems to be confusing barriers with delays in places, but fwiw,
I agree with everything Maciej has said. I'm about to post a new version of
my memory-barriers.txt updates, so I'll cc people on that patch and we can
discuss the ordering side of things over there.

Will


Re: [PATCH] add delay between port write and port read

2019-02-21 Thread Mikulas Patocka
Do you think that we should fix this by identifying hardware that needs 
the delays and adding the delays there?

In my opinion, adding mb() to the port accessing functions is safer - it 
is 6 line patch.

Reading all the hardware manuals is time consuming and hardly anyone will 
do it for 25 years old hardware.

Mikulas



On Wed, 20 Feb 2019, Maciej W. Rozycki wrote:

> On Wed, 20 Feb 2019, Mikulas Patocka wrote:
> 
> > > Will Deacon is in the process of sanitizing our documentation for this,
> > > and this part is still under discussion.
> > > 
> > > I personally don't see a problem with having different barrier
> > > semantics between outb() and iowrite8(), the reasoning being that
> > > any caller of iowriteX() would already have to be careful about
> > > posted writes when iowriteX is backed by ioremapped memory
> > > space rather than port I/O.
> > > 
> > > I think the more important question we have to figure out here
> > > however is why exactly the barrier is needed for alpha, as I still
> > > don't fully understand the issue:
> > 
> > Port delays were common on MS-DOS with the ISA bus. On the ISA bus, the 
> > device cannot reject a transaction, so the programmer has to wait until 
> > the device is ready before talking to it.
> > 
> > My hypothesis is that the engineers who built this Alpha Avanti machine 
> > simply bought some crappy serial line and real-time clock logic from some 
> > ISA bus vendor. And so it requires delays. The PCI stuff on the same 
> > machine doesn't require any extra delays.
> 
>  What you call crap was in the day, i.e. back in 1994, a state-of-the art 
> ISA super I/O chip (the NS87332) and a standard RTC chip (the BQ3287).  
> There was no LPC in existence yet when the Avanti line was built, the 
> south bridge used was the i82378 SIO, additionally wiring two actual ISA 
> slots.
> 
>  Datasheets fo all these pieces should be available; I'm fairly sure I 
> have at least some of them myself.  Also full engineering documentation is 
> available for the Avanti.  So it can all be verified quite easily.
> 
>  If any of these devices need actual ISA access delays, then `outX_p' and 
> `inX_p' accessors should be used by the respective drivers, and we need to 
> implement these accordingly; that would be no different to a standard x86 
> PC with say an ISA serial port (I still have such hardware BTW).
> 
>  BTW, I have recently got an Avanti system myself, an AS/250 I believe, 
> although for an unknown reason placed in an AS/300 enclosure.  The main 
> board of the two systems is identical and the only difference is the 
> maximum amount of RAM supported anyway.  I have't wired it properly yet 
> though, so I can't do any verification ATM, but I will try sometime later.
> 
> > > a) maybe the barrier here is only needed to provide the
> > >non-posted PCI write behavior required by outb(), in that
> > >case I would suggest adding the barriers to outX() but
> > >perhaps not iowriteX()
> > 
> > Except for serial line and real-time-clock, the machine works OK. So 
> > there's no need to slow down regural PCI accesses that use iowriteX.
> 
>  If actual delays are required rather than strong ordering only, then the 
> respective drivers need to be fixed, however platform code still has to 
> enforce the ordering semantics we put into the respective internal 
> interfaces.
> 
> > > b) or the barriers are in fact needed for /any/ I/O operation
> > >on alpha, to ensure that a store is ordered with regard to
> > >a following load. If this is the case, we need the barriers
> > >in all three families: outX(), writeX() and iowriteX().
> > 
> > My understanding is that multiple accesses to the same PCI space are 
> > ordered.
> 
>  I believe they are not; there's not such thing as "a PCI space" from the 
> CPU's point of view, its just another MMIO location, and it's the CPU that 
> is weakly ordered, not the chipset.
> 
>  I'll come back with the relevant citations from the architecture manual, 
> ISTR I quoted them before already.
> 
>   Maciej
> 


Re: [PATCH] add delay between port write and port read

2019-02-20 Thread Maciej W. Rozycki
On Wed, 20 Feb 2019, Mikulas Patocka wrote:

> > Will Deacon is in the process of sanitizing our documentation for this,
> > and this part is still under discussion.
> > 
> > I personally don't see a problem with having different barrier
> > semantics between outb() and iowrite8(), the reasoning being that
> > any caller of iowriteX() would already have to be careful about
> > posted writes when iowriteX is backed by ioremapped memory
> > space rather than port I/O.
> > 
> > I think the more important question we have to figure out here
> > however is why exactly the barrier is needed for alpha, as I still
> > don't fully understand the issue:
> 
> Port delays were common on MS-DOS with the ISA bus. On the ISA bus, the 
> device cannot reject a transaction, so the programmer has to wait until 
> the device is ready before talking to it.
> 
> My hypothesis is that the engineers who built this Alpha Avanti machine 
> simply bought some crappy serial line and real-time clock logic from some 
> ISA bus vendor. And so it requires delays. The PCI stuff on the same 
> machine doesn't require any extra delays.

 What you call crap was in the day, i.e. back in 1994, a state-of-the art 
ISA super I/O chip (the NS87332) and a standard RTC chip (the BQ3287).  
There was no LPC in existence yet when the Avanti line was built, the 
south bridge used was the i82378 SIO, additionally wiring two actual ISA 
slots.

 Datasheets fo all these pieces should be available; I'm fairly sure I 
have at least some of them myself.  Also full engineering documentation is 
available for the Avanti.  So it can all be verified quite easily.

 If any of these devices need actual ISA access delays, then `outX_p' and 
`inX_p' accessors should be used by the respective drivers, and we need to 
implement these accordingly; that would be no different to a standard x86 
PC with say an ISA serial port (I still have such hardware BTW).

 BTW, I have recently got an Avanti system myself, an AS/250 I believe, 
although for an unknown reason placed in an AS/300 enclosure.  The main 
board of the two systems is identical and the only difference is the 
maximum amount of RAM supported anyway.  I have't wired it properly yet 
though, so I can't do any verification ATM, but I will try sometime later.

> > a) maybe the barrier here is only needed to provide the
> >non-posted PCI write behavior required by outb(), in that
> >case I would suggest adding the barriers to outX() but
> >perhaps not iowriteX()
> 
> Except for serial line and real-time-clock, the machine works OK. So 
> there's no need to slow down regural PCI accesses that use iowriteX.

 If actual delays are required rather than strong ordering only, then the 
respective drivers need to be fixed, however platform code still has to 
enforce the ordering semantics we put into the respective internal 
interfaces.

> > b) or the barriers are in fact needed for /any/ I/O operation
> >on alpha, to ensure that a store is ordered with regard to
> >a following load. If this is the case, we need the barriers
> >in all three families: outX(), writeX() and iowriteX().
> 
> My understanding is that multiple accesses to the same PCI space are 
> ordered.

 I believe they are not; there's not such thing as "a PCI space" from the 
CPU's point of view, its just another MMIO location, and it's the CPU that 
is weakly ordered, not the chipset.

 I'll come back with the relevant citations from the architecture manual, 
ISTR I quoted them before already.

  Maciej


Re: [PATCH] add delay between port write and port read

2019-02-20 Thread Mikulas Patocka



On Wed, 20 Feb 2019, Arnd Bergmann wrote:

> On Wed, Feb 20, 2019 at 4:38 PM Maciej W. Rozycki  
> wrote:
> > On Wed, 20 Feb 2019, Mikulas Patocka wrote:
> > > >  Well, actually `iowriteX' is generic and for MMIO you have `writeX'.
> > > > See lib/iomap.c, the comment at the top and e.g. `iowrite8' there for an
> > > > actual proof.  Alpha may have an oddball implementation, but there you 
> > > > go.
> > > > Drivers will assume they can do `iowriteX' to any kind of I/O resource,
> > > > and ordering must be respected as per Documentation/memory-barriers.txt.
> > >
> > > So, do you think that the barrier whould be added to iowriteX and slow
> > > down every MMIO access?
> >
> >  We need it either for `outX' and `iowriteX' calls operating on port I/O
> > resources, or for neither of them, both at a time, to ensure the required
> > consistency between the two interfaces.  If that badly affects MMIO (and
> > is not required there; please remind me what the exact justification to
> > use `mb' here is, as it's not entirely clear to me from the commit
> > message; `mb' is a barrier and not means for a delay), then we need to
> > find a away for `iowriteX' to tell port I/O and MMIO accesses apart and
> > only apply the barrier for the former kind.
> 
> Will Deacon is in the process of sanitizing our documentation for this,
> and this part is still under discussion.
> 
> I personally don't see a problem with having different barrier
> semantics between outb() and iowrite8(), the reasoning being that
> any caller of iowriteX() would already have to be careful about
> posted writes when iowriteX is backed by ioremapped memory
> space rather than port I/O.
> 
> I think the more important question we have to figure out here
> however is why exactly the barrier is needed for alpha, as I still
> don't fully understand the issue:

Port delays were common on MS-DOS with the ISA bus. On the ISA bus, the 
device cannot reject a transaction, so the programmer has to wait until 
the device is ready before talking to it.

My hypothesis is that the engineers who built this Alpha Avanti machine 
simply bought some crappy serial line and real-time clock logic from some 
ISA bus vendor. And so it requires delays. The PCI stuff on the same 
machine doesn't require any extra delays.

> a) maybe the barrier here is only needed to provide the
>non-posted PCI write behavior required by outb(), in that
>case I would suggest adding the barriers to outX() but
>perhaps not iowriteX()

Except for serial line and real-time-clock, the machine works OK. So 
there's no need to slow down regural PCI accesses that use iowriteX.

> b) or the barriers are in fact needed for /any/ I/O operation
>on alpha, to ensure that a store is ordered with regard to
>a following load. If this is the case, we need the barriers
>in all three families: outX(), writeX() and iowriteX().
> 
>   Arnd

My understanding is that multiple accesses to the same PCI space are 
ordered.

Mikulas


Re: [PATCH] add delay between port write and port read

2019-02-20 Thread Maciej W. Rozycki
On Wed, 20 Feb 2019, Arnd Bergmann wrote:

> I personally don't see a problem with having different barrier
> semantics between outb() and iowrite8(), the reasoning being that
> any caller of iowriteX() would already have to be careful about
> posted writes when iowriteX is backed by ioremapped memory
> space rather than port I/O.

 The problem is architectures (such as certainly MIPS) may imply posted 
writes for accesses that ultimately end up as port I/O (most architectures 
obviously have no real port I/O, and it all boils down to accessing an 
MMIO address range used to issue I/O access bus cycles to ISA or PCI), due 
to write buffering within the CPU itself.  This has to be dealt with by 
explicits barrier instructions inserted in port-specific I/O accessors so 
that generic driver code has a consistent view regardless of the 
peculiarities of the underlying CPU architecture.

> b) or the barriers are in fact needed for /any/ I/O operation
>on alpha, to ensure that a store is ordered with regard to
>a following load. If this is the case, we need the barriers
>in all three families: outX(), writeX() and iowriteX().

 I do believe Alpha is weakly ordered, at least architecturally, when it 
comes to any I/O (uncached) operations; I'll double-check with the 
architecture manual, as I recall the wording there was peculiar.

 NB I did a bunch of fixes for the MIPS port somewhat recently:

8b656253a7a4 MIPS: Provide actually relaxed MMIO accessors
3d474dacae72 MIPS: Enforce strong ordering for MMIO accessors
a711d43cbbaa MIPS: Correct `mmiowb' barrier for `wbflush' platforms
4ae0452bddca MIPS: Define MMIO ordering barriers

as it was similarly broken in this regard.

  Maciej


Re: [PATCH] add delay between port write and port read

2019-02-20 Thread Mikulas Patocka



On Tue, 19 Feb 2019, Maciej W. Rozycki wrote:

> On Tue, 19 Feb 2019, Mikulas Patocka wrote:
> 
> > >  As I say, shouldn't the barrier be in `iowriteX' instead?  I don't 
> > > suppose these are allowed to be weakly ordered.
> > > 
> > >   Maciej
> > 
> > iowriteX is for memory-mapped I/O, out[bwl] is for I/O ports.
> 
>  Well, actually `iowriteX' is generic and for MMIO you have `writeX'.  
> See lib/iomap.c, the comment at the top and e.g. `iowrite8' there for an 
> actual proof.  Alpha may have an oddball implementation, but there you go. 
> Drivers will assume they can do `iowriteX' to any kind of I/O resource, 
> and ordering must be respected as per Documentation/memory-barriers.txt.
> 
>   Maciej

So, do you think that the barrier whould be added to iowriteX and slow 
down every MMIO access?

Mikulas


Re: [PATCH] add delay between port write and port read

2019-02-19 Thread Maciej W. Rozycki
On Tue, 19 Feb 2019, Mikulas Patocka wrote:

> >  As I say, shouldn't the barrier be in `iowriteX' instead?  I don't 
> > suppose these are allowed to be weakly ordered.
> > 
> >   Maciej
> 
> iowriteX is for memory-mapped I/O, out[bwl] is for I/O ports.

 Well, actually `iowriteX' is generic and for MMIO you have `writeX'.  
See lib/iomap.c, the comment at the top and e.g. `iowrite8' there for an 
actual proof.  Alpha may have an oddball implementation, but there you go. 
Drivers will assume they can do `iowriteX' to any kind of I/O resource, 
and ordering must be respected as per Documentation/memory-barriers.txt.

  Maciej


Re: [PATCH] add delay between port write and port read

2019-02-19 Thread Mikulas Patocka



On Tue, 19 Feb 2019, Arnd Bergmann wrote:

> On Tue, Feb 19, 2019 at 2:44 PM Mikulas Patocka  wrote:
> > On Tue, 19 Feb 2019, Mikulas Patocka wrote:
> >
> > > The patches cd0e00c106722eca40b38ebf11cf134c01901086 and
> > > 92d7223a74235054f2aa7227d207d9c57f84dca0 fix a theoretical issue where the
> > > code didn't follow the specification. Unfortunatelly, they also reduce
> > > timing when port write is followed by a port read.
> > >
> > > These reduced timing cause hang on boot on the Avanti platform when
> > > probing serial ports. This patch adds memory barrier after the outb, outw,
> > > outl functions, so that there is delay between port write and subsequent
> > > port read - just like before.
> > >
> > > Fixes: cd0e00c10672 ("alpha: io: reorder barriers to guarantee writeX() 
> > > and iowriteX() ordering")
> > > Cc: sta...@vger.kernel.org# v4.17+
> >
> > you can also add:
> >
> > Tested-by: Mikulas Patocka 
> 
> Acked-by: Arnd Bergmann 
> 
> but I notice you are missing Signed-off-by.

I forgot about it. You made that patch, so there should be your signed-off 
too.

Signed-off-by: Mikulas Patocka 

> We clearly need this patch, but I assumed the alpha maintainers would pick
> it up, not me. I merged the original changes since they were 
> cross-architecture,
> but I don't normally take patches for a particular architecture through the
> asm-generic tree (or the soc tree for that matter).
> 
>   Arnd
> 


Re: [PATCH] add delay between port write and port read

2019-02-19 Thread Mikulas Patocka



On Tue, 19 Feb 2019, Maciej W. Rozycki wrote:

> On Tue, 19 Feb 2019, Arnd Bergmann wrote:
> 
> > We clearly need this patch, but I assumed the alpha maintainers would pick
> > it up, not me. I merged the original changes since they were 
> > cross-architecture,
> > but I don't normally take patches for a particular architecture through the
> > asm-generic tree (or the soc tree for that matter).
> 
>  As I say, shouldn't the barrier be in `iowriteX' instead?  I don't 
> suppose these are allowed to be weakly ordered.
> 
>   Maciej

iowriteX is for memory-mapped I/O, out[bwl] is for I/O ports.

If someone finds a problem with memory-mapped device, he can add the 
barier there.

Mikulas


Re: [PATCH] add delay between port write and port read

2019-02-19 Thread Maciej W. Rozycki
On Tue, 19 Feb 2019, Arnd Bergmann wrote:

> We clearly need this patch, but I assumed the alpha maintainers would pick
> it up, not me. I merged the original changes since they were 
> cross-architecture,
> but I don't normally take patches for a particular architecture through the
> asm-generic tree (or the soc tree for that matter).

 As I say, shouldn't the barrier be in `iowriteX' instead?  I don't 
suppose these are allowed to be weakly ordered.

  Maciej


Re: [PATCH] add delay between port write and port read

2019-02-19 Thread Arnd Bergmann
On Tue, Feb 19, 2019 at 2:44 PM Mikulas Patocka  wrote:
> On Tue, 19 Feb 2019, Mikulas Patocka wrote:
>
> > The patches cd0e00c106722eca40b38ebf11cf134c01901086 and
> > 92d7223a74235054f2aa7227d207d9c57f84dca0 fix a theoretical issue where the
> > code didn't follow the specification. Unfortunatelly, they also reduce
> > timing when port write is followed by a port read.
> >
> > These reduced timing cause hang on boot on the Avanti platform when
> > probing serial ports. This patch adds memory barrier after the outb, outw,
> > outl functions, so that there is delay between port write and subsequent
> > port read - just like before.
> >
> > Fixes: cd0e00c10672 ("alpha: io: reorder barriers to guarantee writeX() and 
> > iowriteX() ordering")
> > Cc: sta...@vger.kernel.org# v4.17+
>
> you can also add:
>
> Tested-by: Mikulas Patocka 

Acked-by: Arnd Bergmann 

but I notice you are missing Signed-off-by.

We clearly need this patch, but I assumed the alpha maintainers would pick
it up, not me. I merged the original changes since they were cross-architecture,
but I don't normally take patches for a particular architecture through the
asm-generic tree (or the soc tree for that matter).

  Arnd


[PATCH] add delay between port write and port read

2019-02-19 Thread Mikulas Patocka
The patches cd0e00c106722eca40b38ebf11cf134c01901086 and
92d7223a74235054f2aa7227d207d9c57f84dca0 fix a theoretical issue where the
code didn't follow the specification. Unfortunatelly, they also reduce
timing when port write is followed by a port read.

These reduced timing cause hang on boot on the Avanti platform when
probing serial ports. This patch adds memory barrier after the outb, outw,
outl functions, so that there is delay between port write and subsequent
port read - just like before.

Fixes: cd0e00c10672 ("alpha: io: reorder barriers to guarantee writeX() and 
iowriteX() ordering")
Cc: sta...@vger.kernel.org  # v4.17+

---
 arch/alpha/kernel/io.c |6 ++
 1 file changed, 6 insertions(+)

Index: linux-stable/arch/alpha/kernel/io.c
===
--- linux-stable.orig/arch/alpha/kernel/io.c2019-02-19 14:06:03.0 
+0100
+++ linux-stable/arch/alpha/kernel/io.c 2019-02-19 14:12:29.0 +0100
@@ -78,16 +78,19 @@ u32 inl(unsigned long port)
 void outb(u8 b, unsigned long port)
 {
iowrite8(b, ioport_map(port, 1));
+   mb();
 }
 
 void outw(u16 b, unsigned long port)
 {
iowrite16(b, ioport_map(port, 2));
+   mb();
 }
 
 void outl(u32 b, unsigned long port)
 {
iowrite32(b, ioport_map(port, 4));
+   mb();
 }
 
 EXPORT_SYMBOL(inb);
@@ -336,6 +339,7 @@ void iowrite8_rep(void __iomem *port, co
 void outsb(unsigned long port, const void *src, unsigned long count)
 {
iowrite8_rep(ioport_map(port, 1), src, count);
+   mb();
 }
 
 EXPORT_SYMBOL(iowrite8_rep);
@@ -376,6 +380,7 @@ void iowrite16_rep(void __iomem *port, c
 void outsw(unsigned long port, const void *src, unsigned long count)
 {
iowrite16_rep(ioport_map(port, 2), src, count);
+   mb();
 }
 
 EXPORT_SYMBOL(iowrite16_rep);
@@ -408,6 +413,7 @@ void iowrite32_rep(void __iomem *port, c
 void outsl(unsigned long port, const void *src, unsigned long count)
 {
iowrite32_rep(ioport_map(port, 4), src, count);
+   mb();
 }
 
 EXPORT_SYMBOL(iowrite32_rep);