Re: RFC on writel and writel_relaxed

2018-04-02 Thread Sinan Kaya
On 3/29/2018 9:40 PM, Benjamin Herrenschmidt wrote:
> On Thu, 2018-03-29 at 09:56 -0400, Sinan Kaya wrote:
>> On 3/28/2018 11:55 AM, David Miller wrote:
>>> From: Benjamin Herrenschmidt 
>>> Date: Thu, 29 Mar 2018 02:13:16 +1100
>>>
 Let's fix all archs, it's way easier than fixing all drivers. Half of
 the archs are unused or dead anyway.
>>>
>>> Agreed.
>>>
>>
>> I pinged most of the maintainers yesterday.
>> Which arches do we care about these days?
>> I have not been paying attention any other architecture besides arm64.
> 
> Thanks for going through that exercise !
> 
> Once sparc, s390, microblaze and mips reply, I think we'll have a good
> coverage, maybe riscv is to put in that lot too.


I posted the following two patches for supporting microblaze and unicore32. 

[PATCH v2 1/2] io: prevent compiler reordering on the default writeX() 
implementation
[PATCH v2 2/2] io: prevent compiler reordering on the default readX() 
implementation

The rest of the arches except mips and alpha seem OK.
I sent a question email on Friday to mips and alpha mailing lists. I'll follow 
up with
an actual patch today.


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: RFC on writel and writel_relaxed

2018-03-29 Thread Benjamin Herrenschmidt
On Thu, 2018-03-29 at 09:56 -0400, Sinan Kaya wrote:
> On 3/28/2018 11:55 AM, David Miller wrote:
> > From: Benjamin Herrenschmidt 
> > Date: Thu, 29 Mar 2018 02:13:16 +1100
> > 
> > > Let's fix all archs, it's way easier than fixing all drivers. Half of
> > > the archs are unused or dead anyway.
> > 
> > Agreed.
> > 
> 
> I pinged most of the maintainers yesterday.
> Which arches do we care about these days?
> I have not been paying attention any other architecture besides arm64.

Thanks for going through that exercise !

Once sparc, s390, microblaze and mips reply, I think we'll have a good
coverage, maybe riscv is to put in that lot too.
 
Cheers,
Ben.

> 
> arch  status  detail
> ---   
> 
> alpha question sent
> arc   question sent   ys...@users.sourceforge.jp will fix it.
> arm   no issues
> arm64 no issues
> blackfin  question sent   about to be removed
> c6x   question sent
> cris  question sent
> frv
> h8300 question sent
> hexagon   question sent
> ia64  no issues   confirmed by Tony Luck
> m32r
> m68k  question sent
> metag
> microblazequestion sent
> mips  question sent
> mn10300   question sent
> nios2 question sent
> openrisc  no issues   sho...@gmail.com says should no issues
> pariscno issues   grantgrund...@gmail.com says 
> most probably no problem but still looking
> powerpc   no issues
> riscv question sent
> s390  question sent
> score question sent
> shquestion sent
> sparc question sent
> tile  question sent
> unicore32 question sent
> x86   no issues
> xtensaquestion sent
> 
> 


Re: RFC on writel and writel_relaxed

2018-03-29 Thread Sinan Kaya
On 3/29/2018 12:29 PM, Arnd Bergmann wrote:
> On Thu, Mar 29, 2018 at 3:56 PM, Sinan Kaya  wrote:
>> On 3/28/2018 11:55 AM, David Miller wrote:
>>> From: Benjamin Herrenschmidt 
>>> Date: Thu, 29 Mar 2018 02:13:16 +1100
>>>
 Let's fix all archs, it's way easier than fixing all drivers. Half of
 the archs are unused or dead anyway.
>>>
>>> Agreed.
>>>
>>
>> I pinged most of the maintainers yesterday.
>> Which arches do we care about these days?
>> I have not been paying attention any other architecture besides arm64.
>>
>> archstatus  detail
>> --  -   
>> alpha   question sent

Thanks for the detailed analysis.

> 
> I'm guessing alpha has problems
> 
> extern inline u32 readl(const volatile void __iomem *addr)
> {
> u32 ret = __raw_readl(addr);
> mb();
> return ret;
> }
> extern inline void writel(u32 b, volatile void __iomem *addr)
> {
> __raw_writel(b, addr);
> mb();
> }

Looks like a problem to me too. I'll start a thread with the alpha
people and CC you.


> 
> There is a barrier in writel /after/ the acess but not before.
> 

This is the consolidated list. I also heart back from m68k and corrected
contacts for arc and h8300.

archstatus  detail
--  -   
alpha   question sent   Arnd: alpha has problems
arc question sent   vineet.gup...@synopsys.com says he'll 
get to this
in the next few days
arm no issues
arm64   no issues
c6x no issues   no PCI
h8300   no issues   no PCI: ys...@users.sourceforge.jp will 
fix it.
hexagon no issues   no PCI
ia64no issues   confirmed by Tony Luck
m68kno issues   ge...@linux-m68k.org says no problem
metag   no issues   arnd: removed
microblaze  question sent   arnd: some mips platforms have problems
mipsquestion sent   arnd: some mips platforms have problems
nds32   question sent
nios2   no issues   no PCI
openriscno issues   sho...@gmail.com says should no issues
parisc  no issues   grantgrund...@gmail.com says most 
probably no problem
but still looking
powerpc no issues
riscv   no issues   arnd: riscv should be fine
s390no issues   arnd: Pretty sure this is also fine
sh  question sent
sparc   no issues   da...@davemloft.net says always 
strongly ordered
unicore32   question sent   resent to g...@pku.edu.cn
x86 no issues
x86_64  no issues



> 
>  Arnd
> 


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: RFC on writel and writel_relaxed

2018-03-29 Thread Jason Gunthorpe
On Thu, Mar 29, 2018 at 02:58:34PM +, David Laight wrote:
> From: Jason Gunthorpe
> > Sent: 29 March 2018 15:45
> ...
> > > > When talking about ordering between the devices, the relevant question
> > > > is what happens if the writel(DEVICE_BAR) triggers DEVICE_BAR to DMA
> > > > from the DEVICE_FOO. 'ordered' means that in this case
> > > > writel(DEVICE_FOO) must be presented to FOO before anything generated
> > > > by BAR.
> > >
> > > Yes, and that isn't the case for arm because the writes can still be
> > > buffered.
> > 
> > The statement is not about buffering, or temporal completion order, or
> > the order of acks returning to the CPU. It is about pure transaction
> > ordering inside the interconnect.
> > 
> > Can write BAR -> FOO pass write CPU -> FOO?
> 
> Almost certainly.
> The first cpu write can almost certainly be 'stalled' at the shared PCIe 
> bridge.
> The second cpu write then completes (to a different target).
> That target then issues a peer to peer transfer that reaches the shared 
> bridge.
> I doubt the order of the transactions is guaranteed when it becomes 
> 'un-stalled'.

The PCI spec has very strong wording on ordering that covers this
case. Stalled bridges have to follow the ordering rules, and posted
writes cannot pass other posted writes.

Since in PCI all three transactions:
 CPU -> FOO
 CPU -> BAR
 BAR -> FOO

Must traverse a shared bus segment, they must be placed on that bus in
the above order, and the bridge(s) toward FOO must preserve this
order.

ARM's AXI has similar rules, I just can't recall the tiny details
right now :)

> Of course, these are peer to peer transfers, and strange ones at that.
> Normally you'd not be doing peer to peer transfers that access 'memory'
> the cpu has just written to.

It is the best situation I can think of where order of completion to
different devices would matter to a generic Linux driver..

.. And there are patches circulating right now for NVMe that enable
exactly this kind of transfer, and rely on these kind of semantics, so
it is a relevant detail :)

Jason


Re: RFC on writel and writel_relaxed

2018-03-29 Thread Arnd Bergmann
On Thu, Mar 29, 2018 at 3:56 PM, Sinan Kaya  wrote:
> On 3/28/2018 11:55 AM, David Miller wrote:
>> From: Benjamin Herrenschmidt 
>> Date: Thu, 29 Mar 2018 02:13:16 +1100
>>
>>> Let's fix all archs, it's way easier than fixing all drivers. Half of
>>> the archs are unused or dead anyway.
>>
>> Agreed.
>>
>
> I pinged most of the maintainers yesterday.
> Which arches do we care about these days?
> I have not been paying attention any other architecture besides arm64.
>
> archstatus  detail
> --  -   
> alpha   question sent

I'm guessing alpha has problems

extern inline u32 readl(const volatile void __iomem *addr)
{
u32 ret = __raw_readl(addr);
mb();
return ret;
}
extern inline void writel(u32 b, volatile void __iomem *addr)
{
__raw_writel(b, addr);
mb();
}

There is a barrier in writel /after/ the acess but not before.

> arc question sent   ys...@users.sourceforge.jp will fix 
> it.
> arm no issues
> arm64   no issues
> blackfinquestion sent   about to be removed
> c6x question sent

no PCI, so it might not matter that much -- all drivers
are platform specific in the end.

> crisquestion sent
> frv

cris and frv are getting removed

> h8300   question sent

no PCI

> hexagon question sent

no PCI

> ia64no issues   confirmed by Tony Luck
> m32r

removed

> m68kquestion sent

> metag

removed

> microblaze  question sent
> mipsquestion sent

I'm guessing that some mips platforms have problems, but others don't.

> mn10300 question sent

removed

> nios2   question sent

no PCI

> openriscno issues   sho...@gmail.com says should no issues
> parisc  no issues   grantgrund...@gmail.com says most 
> probably no problem but still looking
> powerpc no issues
> riscv   question sent

riscv should be fine

> s390question sent

Pretty sure this is also fine

> score   question sent

removed

> sh  question sent
> sparc   question sent

> tilequestion sent

removed

> unicore32   question sent

Note the maintainer's new email address in linux-next.

> x86 no issues

> xtensa  question sent

removed.

 Arnd


RE: RFC on writel and writel_relaxed

2018-03-29 Thread David Laight
From: Jason Gunthorpe
> Sent: 29 March 2018 15:45
...
> > > When talking about ordering between the devices, the relevant question
> > > is what happens if the writel(DEVICE_BAR) triggers DEVICE_BAR to DMA
> > > from the DEVICE_FOO. 'ordered' means that in this case
> > > writel(DEVICE_FOO) must be presented to FOO before anything generated
> > > by BAR.
> >
> > Yes, and that isn't the case for arm because the writes can still be
> > buffered.
> 
> The statement is not about buffering, or temporal completion order, or
> the order of acks returning to the CPU. It is about pure transaction
> ordering inside the interconnect.
> 
> Can write BAR -> FOO pass write CPU -> FOO?

Almost certainly.
The first cpu write can almost certainly be 'stalled' at the shared PCIe bridge.
The second cpu write then completes (to a different target).
That target then issues a peer to peer transfer that reaches the shared bridge.
I doubt the order of the transactions is guaranteed when it becomes 
'un-stalled'.

Of course, these are peer to peer transfers, and strange ones at that.
Normally you'd not be doing peer to peer transfers that access 'memory'
the cpu has just written to.

Requiring extra barriers in this case, or different functions for WC accesses
shouldn't really be an issue.

Even requiring a barrier between a write to dma coherent memory and a write
that starts dma isn't really onerous.
Even if it is a nop on all current architectures it is a good comment in the 
code.
It could even have a 'dev' argument.

David



Re: RFC on writel and writel_relaxed

2018-03-29 Thread Jason Gunthorpe
On Thu, Mar 29, 2018 at 10:19:41AM +0100, Will Deacon wrote:
> On Wed, Mar 28, 2018 at 10:57:32AM -0600, Jason Gunthorpe wrote:
> > On Wed, Mar 28, 2018 at 11:13:45AM +0100, Will Deacon wrote:
> > > On Wed, Mar 28, 2018 at 09:01:27PM +1100, Benjamin Herrenschmidt wrote:
> > > > On Wed, 2018-03-28 at 11:55 +0200, Arnd Bergmann wrote:
> > > > > > powerpc and ARM can't quite make them synchronous I think, but at 
> > > > > > least
> > > > > > they should have the same semantics as writel.
> > > > > 
> > > > > One thing that ARM does IIRC is that it only guarantees to order 
> > > > > writel() within
> > > > > one device, and the memory mapped PCI I/O space window almost 
> > > > > certainly
> > > > > counts as a separate device to the CPU.
> > > > 
> > > > That sounds bogus.
> > > 
> > > To elaborate, if you do the following on arm:
> > > 
> > >   writel(DEVICE_FOO);
> > >   writel(DEVICE_BAR);
> > > 
> > > we generally cannot guarantee in which order those accesses will hit the
> > > devices even if we add every barrier under the sun. You'd need something
> > > in between, specific to DEVICE_FOO (probably a read-back) to really push
> > > the first write out. This doesn't sound like it would be that uncommon to
> > > me.
> > 
> > The PCI posted write does not require the above to execute 'in order'
> > only that any bus segment shared by the two devices have the writes
> > issued in CPU order. ie at a shared PCI root port for instance.
> > 
> > If I recall this is very similar to the ordering that ARM's on-chip
> > AXI interconnect is supposed to provide.. So I'd be very surprised if
> > a modern ARM64 has an meaningful difference from x86 here.
> 
> From the architectural perspective, writes to different "peripherals" are
> not ordered with respect to each other. The first writel will complete once
> it gets its write acknowledgement, but this may not necessarily come from
> the endpoint -- it could come from an intermediate buffer past the point of
> serialisation (i.e. the write will then be ordered with respect to other
> accesses to that same endpoint). The PCI root port would look like one
> peripheral here.

That is basically the same as PCI - PCI has no write ACK, so all
writes are buffered by the PCI interconnect and complete in some
undefined temporal order when multiple end points are involved.

This does not seem very different from what happens in x86..

> > When talking about ordering between the devices, the relevant question
> > is what happens if the writel(DEVICE_BAR) triggers DEVICE_BAR to DMA
> > from the DEVICE_FOO. 'ordered' means that in this case
> > writel(DEVICE_FOO) must be presented to FOO before anything generated
> > by BAR.
> 
> Yes, and that isn't the case for arm because the writes can still be
> buffered.

The statement is not about buffering, or temporal completion order, or
the order of acks returning to the CPU. It is about pure transaction
ordering inside the interconnect.

Can write BAR -> FOO pass write CPU -> FOO?

Jason


Re: RFC on writel and writel_relaxed

2018-03-29 Thread David Miller
From: Sinan Kaya 
Date: Thu, 29 Mar 2018 09:56:01 -0400

> sparc question sent

Sparc never lets physical memory accesses pass MMIO, and vice versa.

They are always strongly ordered amongst eachother.

Therefore no explicit barrier instructions are necessary.


Re: RFC on writel and writel_relaxed

2018-03-29 Thread Sinan Kaya
On 3/28/2018 11:55 AM, David Miller wrote:
> From: Benjamin Herrenschmidt 
> Date: Thu, 29 Mar 2018 02:13:16 +1100
> 
>> Let's fix all archs, it's way easier than fixing all drivers. Half of
>> the archs are unused or dead anyway.
> 
> Agreed.
> 

I pinged most of the maintainers yesterday.
Which arches do we care about these days?
I have not been paying attention any other architecture besides arm64.

archstatus  detail
--  -   
alpha   question sent
arc question sent   ys...@users.sourceforge.jp will fix it.
arm no issues
arm64   no issues
blackfinquestion sent   about to be removed
c6x question sent
crisquestion sent
frv
h8300   question sent
hexagon question sent
ia64no issues   confirmed by Tony Luck
m32r
m68kquestion sent
metag
microblaze  question sent
mipsquestion sent
mn10300 question sent
nios2   question sent
openriscno issues   sho...@gmail.com says should no issues
parisc  no issues   grantgrund...@gmail.com says most 
probably no problem but still looking
powerpc no issues
riscv   question sent
s390question sent
score   question sent
sh  question sent
sparc   question sent
tilequestion sent
unicore32   question sent
x86 no issues
xtensa  question sent


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: RFC on writel and writel_relaxed

2018-03-29 Thread Will Deacon
On Thu, Mar 29, 2018 at 08:31:32AM +1100, Benjamin Herrenschmidt wrote:
> On Thu, 2018-03-29 at 02:23 +1000, Nicholas Piggin wrote:
> >   This is a variation on the mandatory write barrier that causes writes to 
> > weakly
> >   ordered I/O regions to be partially ordered.  Its effects may go beyond 
> > the
> >   CPU->Hardware interface and actually affect the hardware at some level.
> > 
> > How can a driver writer possibly get that right?
> > 
> > IIRC it was added for some big ia64 system that was really expensive
> > to implement the proper wmb() semantics on. So wmb() semantics were
> > quietly downgraded, then the subsequently broken drivers they cared
> > about were fixed by adding the stronger mmiowb().
> > 
> > What should have happened was wmb and writel remained correct, sane, and
> > expensive, and they add an mmio_wmb() to order MMIO stores made by the
> > writel_relaxed accessors, then use that to speed up the few drivers they
> > care about.
> > 
> > Now that ia64 doesn't matter too much, can we deprecate mmiowb and just
> > make wmb ordering talk about stores to the device, not to some
> > intermediate stage of the interconnect where it can be subsequently
> > reordered wrt the device? Drivers can be converted back to using wmb
> > or writel gradually.
> 
> I was under the impression that mmiowb was specifically about ordering
> writel's with a subsequent spin_unlock, without it, MMIOs from
> different CPUs (within the same lock) would still arrive OO.
> 
> If that's indeed the case, I would suggest ia64 switches to a similar
> per-cpu flag trick powerpc uses.

... or we could remove ia64.

/me runs for cover

Will


Re: RFC on writel and writel_relaxed

2018-03-29 Thread Will Deacon
On Wed, Mar 28, 2018 at 10:57:32AM -0600, Jason Gunthorpe wrote:
> On Wed, Mar 28, 2018 at 11:13:45AM +0100, Will Deacon wrote:
> > On Wed, Mar 28, 2018 at 09:01:27PM +1100, Benjamin Herrenschmidt wrote:
> > > On Wed, 2018-03-28 at 11:55 +0200, Arnd Bergmann wrote:
> > > > > powerpc and ARM can't quite make them synchronous I think, but at 
> > > > > least
> > > > > they should have the same semantics as writel.
> > > > 
> > > > One thing that ARM does IIRC is that it only guarantees to order 
> > > > writel() within
> > > > one device, and the memory mapped PCI I/O space window almost certainly
> > > > counts as a separate device to the CPU.
> > > 
> > > That sounds bogus.
> > 
> > To elaborate, if you do the following on arm:
> > 
> > writel(DEVICE_FOO);
> > writel(DEVICE_BAR);
> > 
> > we generally cannot guarantee in which order those accesses will hit the
> > devices even if we add every barrier under the sun. You'd need something
> > in between, specific to DEVICE_FOO (probably a read-back) to really push
> > the first write out. This doesn't sound like it would be that uncommon to
> > me.
> 
> The PCI posted write does not require the above to execute 'in order'
> only that any bus segment shared by the two devices have the writes
> issued in CPU order. ie at a shared PCI root port for instance.
> 
> If I recall this is very similar to the ordering that ARM's on-chip
> AXI interconnect is supposed to provide.. So I'd be very surprised if
> a modern ARM64 has an meaningful difference from x86 here.

>From the architectural perspective, writes to different "peripherals" are
not ordered with respect to each other. The first writel will complete once
it gets its write acknowledgement, but this may not necessarily come from
the endpoint -- it could come from an intermediate buffer past the point of
serialisation (i.e. the write will then be ordered with respect to other
accesses to that same endpoint). The PCI root port would look like one
peripheral here.

> When talking about ordering between the devices, the relevant question
> is what happens if the writel(DEVICE_BAR) triggers DEVICE_BAR to DMA
> from the DEVICE_FOO. 'ordered' means that in this case
> writel(DEVICE_FOO) must be presented to FOO before anything generated
> by BAR.

Yes, and that isn't the case for arm because the writes can still be
buffered.

Will


Re: RFC on writel and writel_relaxed

2018-03-28 Thread Nicholas Piggin
On Thu, 29 Mar 2018 08:31:32 +1100
Benjamin Herrenschmidt  wrote:

> On Thu, 2018-03-29 at 02:23 +1000, Nicholas Piggin wrote:
> > On Wed, 28 Mar 2018 11:55:09 -0400 (EDT)
> > David Miller  wrote:
> >   
> > > From: Benjamin Herrenschmidt 
> > > Date: Thu, 29 Mar 2018 02:13:16 +1100
> > >   
> > > > Let's fix all archs, it's way easier than fixing all drivers. Half of
> > > > the archs are unused or dead anyway.
> > > 
> > > Agreed.  
> > 
> > While we're making decrees here, can we do something about mmiowb?
> > The semantics are basically indecipherable.  
> 
> I was going to tackle that next :-)
> 
> >   This is a variation on the mandatory write barrier that causes writes to 
> > weakly
> >   ordered I/O regions to be partially ordered.  Its effects may go beyond 
> > the
> >   CPU->Hardware interface and actually affect the hardware at some level.
> > 
> > How can a driver writer possibly get that right?
> > 
> > IIRC it was added for some big ia64 system that was really expensive
> > to implement the proper wmb() semantics on. So wmb() semantics were
> > quietly downgraded, then the subsequently broken drivers they cared
> > about were fixed by adding the stronger mmiowb().
> > 
> > What should have happened was wmb and writel remained correct, sane, and
> > expensive, and they add an mmio_wmb() to order MMIO stores made by the
> > writel_relaxed accessors, then use that to speed up the few drivers they
> > care about.
> > 
> > Now that ia64 doesn't matter too much, can we deprecate mmiowb and just
> > make wmb ordering talk about stores to the device, not to some
> > intermediate stage of the interconnect where it can be subsequently
> > reordered wrt the device? Drivers can be converted back to using wmb
> > or writel gradually.  
> 
> I was under the impression that mmiowb was specifically about ordering
> writel's with a subsequent spin_unlock, without it, MMIOs from
> different CPUs (within the same lock) would still arrive OO.

Yes more or less, and I think that until mmiowb was introduced, wmb
or writel was sufficient for this.

Thanks,
Nick


Re: RFC on writel and writel_relaxed

2018-03-28 Thread Benjamin Herrenschmidt
On Thu, 2018-03-29 at 02:23 +1000, Nicholas Piggin wrote:
> On Wed, 28 Mar 2018 11:55:09 -0400 (EDT)
> David Miller  wrote:
> 
> > From: Benjamin Herrenschmidt 
> > Date: Thu, 29 Mar 2018 02:13:16 +1100
> > 
> > > Let's fix all archs, it's way easier than fixing all drivers. Half of
> > > the archs are unused or dead anyway.  
> > 
> > Agreed.
> 
> While we're making decrees here, can we do something about mmiowb?
> The semantics are basically indecipherable.

I was going to tackle that next :-)

>   This is a variation on the mandatory write barrier that causes writes to 
> weakly
>   ordered I/O regions to be partially ordered.  Its effects may go beyond the
>   CPU->Hardware interface and actually affect the hardware at some level.
> 
> How can a driver writer possibly get that right?
> 
> IIRC it was added for some big ia64 system that was really expensive
> to implement the proper wmb() semantics on. So wmb() semantics were
> quietly downgraded, then the subsequently broken drivers they cared
> about were fixed by adding the stronger mmiowb().
> 
> What should have happened was wmb and writel remained correct, sane, and
> expensive, and they add an mmio_wmb() to order MMIO stores made by the
> writel_relaxed accessors, then use that to speed up the few drivers they
> care about.
> 
> Now that ia64 doesn't matter too much, can we deprecate mmiowb and just
> make wmb ordering talk about stores to the device, not to some
> intermediate stage of the interconnect where it can be subsequently
> reordered wrt the device? Drivers can be converted back to using wmb
> or writel gradually.

I was under the impression that mmiowb was specifically about ordering
writel's with a subsequent spin_unlock, without it, MMIOs from
different CPUs (within the same lock) would still arrive OO.

If that's indeed the case, I would suggest ia64 switches to a similar
per-cpu flag trick powerpc uses.

Cheers,
Ben.

> Thanks,
> Nick


Re: RFC on writel and writel_relaxed

2018-03-28 Thread Jason Gunthorpe
On Wed, Mar 28, 2018 at 11:13:45AM +0100, Will Deacon wrote:
> On Wed, Mar 28, 2018 at 09:01:27PM +1100, Benjamin Herrenschmidt wrote:
> > On Wed, 2018-03-28 at 11:55 +0200, Arnd Bergmann wrote:
> > > > powerpc and ARM can't quite make them synchronous I think, but at least
> > > > they should have the same semantics as writel.
> > > 
> > > One thing that ARM does IIRC is that it only guarantees to order writel() 
> > > within
> > > one device, and the memory mapped PCI I/O space window almost certainly
> > > counts as a separate device to the CPU.
> > 
> > That sounds bogus.
> 
> To elaborate, if you do the following on arm:
> 
>   writel(DEVICE_FOO);
>   writel(DEVICE_BAR);
> 
> we generally cannot guarantee in which order those accesses will hit the
> devices even if we add every barrier under the sun. You'd need something
> in between, specific to DEVICE_FOO (probably a read-back) to really push
> the first write out. This doesn't sound like it would be that uncommon to
> me.

The PCI posted write does not require the above to execute 'in order'
only that any bus segment shared by the two devices have the writes
issued in CPU order. ie at a shared PCI root port for instance.

If I recall this is very similar to the ordering that ARM's on-chip
AXI interconnect is supposed to provide.. So I'd be very surprised if
a modern ARM64 has an meaningful difference from x86 here.

When talking about ordering between the devices, the relevant question
is what happens if the writel(DEVICE_BAR) triggers DEVICE_BAR to DMA
from the DEVICE_FOO. 'ordered' means that in this case
writel(DEVICE_FOO) must be presented to FOO before anything generated
by BAR.

Jason


Re: RFC on writel and writel_relaxed

2018-03-28 Thread Nicholas Piggin
On Wed, 28 Mar 2018 11:55:09 -0400 (EDT)
David Miller  wrote:

> From: Benjamin Herrenschmidt 
> Date: Thu, 29 Mar 2018 02:13:16 +1100
> 
> > Let's fix all archs, it's way easier than fixing all drivers. Half of
> > the archs are unused or dead anyway.  
> 
> Agreed.

While we're making decrees here, can we do something about mmiowb?
The semantics are basically indecipherable.

  This is a variation on the mandatory write barrier that causes writes to 
weakly
  ordered I/O regions to be partially ordered.  Its effects may go beyond the
  CPU->Hardware interface and actually affect the hardware at some level.

How can a driver writer possibly get that right?

IIRC it was added for some big ia64 system that was really expensive
to implement the proper wmb() semantics on. So wmb() semantics were
quietly downgraded, then the subsequently broken drivers they cared
about were fixed by adding the stronger mmiowb().

What should have happened was wmb and writel remained correct, sane, and
expensive, and they add an mmio_wmb() to order MMIO stores made by the
writel_relaxed accessors, then use that to speed up the few drivers they
care about.

Now that ia64 doesn't matter too much, can we deprecate mmiowb and just
make wmb ordering talk about stores to the device, not to some
intermediate stage of the interconnect where it can be subsequently
reordered wrt the device? Drivers can be converted back to using wmb
or writel gradually.

Thanks,
Nick


RE: RFC on writel and writel_relaxed

2018-03-28 Thread David Laight
From: Benjamin Herrenschmidt
> Sent: 28 March 2018 16:13
...
> > I've always wondered exactly what the twi;isync were for - always seemed
> > very heavy handed for most mmio reads.
> > Particularly if you are doing mmio reads from a data fifo.
> 
> If you do that you should use the "s" version of the accessors. Those
> will only do the above trick at the end of the access series. Also a
> FIFO needs special care about endianness anyway, so you should use
> those accessors regardless. (Hint: you never endian swap a FIFO even on
> BE on a LE device, unless something's been wired very badly in HW).

That was actually a 64 bit wide fifo connected to a 16bit wide PIO interface.
Reading the high address 'clocked' the fifo.
So the first 3 reads could happen in any order, but the 4th had to be last.
This is a small ppc and we shovel a lot of data through that fifo.

Whether it needed byteswapping depended completely on how our hardware people
had built the pcb (not made easy by some docs using the ibm bit numbering).
In fact it didn't

While that driver only had to run on a very specific small ppc, generic drivers
might have similar issues.

I suspect that writel() is always (or should always be):
barrier_before_writel()
writel_relaxed()
barrier_after_writel()
So if a driver needs to do multiple writes (without strong ordering)
it should be able to repeat the writel_relaxed() with only one set
of barriers.
Similarly for readl().
In addition a lesser barrier is probably enough between a readl_relaxed()
and a writel_relaxed() that is conditional on the read value.

David



Re: RFC on writel and writel_relaxed

2018-03-28 Thread David Miller
From: Benjamin Herrenschmidt 
Date: Thu, 29 Mar 2018 02:13:16 +1100

> Let's fix all archs, it's way easier than fixing all drivers. Half of
> the archs are unused or dead anyway.

Agreed.


Re: RFC on writel and writel_relaxed

2018-03-28 Thread Benjamin Herrenschmidt
On Wed, 2018-03-28 at 07:41 -0400, ok...@codeaurora.org wrote:
> Yes, we want to get there indeed. It is because of some arch not 
> implementing writel properly. Maintainers want to play safe.
> 
> That is why I asked if IA64 and other well known archs follow the 
> strongly ordered rule at this moment like PPC and ARM.
> 
> Or should we go and inform every arch about this before yanking wmb()?
> 
> Maintainers are afraid of introducing a regression.

Let's fix all archs, it's way easier than fixing all drivers. Half of
the archs are unused or dead anyway.

> > 
> > The above code makes no sense, and just looks stupid to me. It also
> > generates pointlessly bad code on x86, so it's bad there too.
> > 
> > Linus


Re: RFC on writel and writel_relaxed

2018-03-28 Thread Benjamin Herrenschmidt
On Wed, 2018-03-28 at 11:30 +, David Laight wrote:
> From: Benjamin Herrenschmidt
> > Sent: 28 March 2018 10:56
> 
> ...
> > For example, let's say I have a device with a reset bit and the spec
> > says the reset bit needs to be set for at least 10us.
> > 
> > This is wrong:
> > 
> > writel(1, RESET_REG);
> > usleep(10);
> > writel(0, RESET_REG);
> > 
> > Because of write posting, the first write might arrive to the device
> > right before the second one.
> > 
> > The typical "fix" is to turn that into:
> > 
> > writel(1, RESET_REG);
> > readl(RESET_REG); /* Flush posted writes */
> 
> Would a writel(1, RESET_REG) here provide enough synchronsiation?

Probably yes. It's one of those things where you try to deal with the
fact that 90% of driver writers barely understand the basic stuff and
so you need the "default" accessors to be hardened as much as possible.

We still need to get a reasonably definition of the semantics of the
relaxed ones vs. WC memory but let's get through that exercise first
and hopefully for the last time.
 
> > usleep(10);
> > writel(0, RESET_REG);
> > 
> > *However* the issue here, at least on power, is that the CPU can issue
> > that readl but doesn't necessarily wait for it to complete (ie, the
> > data to return), before proceeding to the usleep. Now a usleep contains
> > a bunch of loads and stores and is probably fine, but a udelay which
> > just loops on the timebase may not be.
> > 
> > Thus we may still violate the timing requirement.
> 
> I've seem that sort of code (with udelay() and no read back) quite often.
> How many were in linux I don't know.
> 
> For small delays I usually fix it by repeated writes (of the same value)
> to the device register. That can guarantee very short intervals.

As long as you know the bus frequency...

> The only time I've actually seen buffered writes break timing was
> between a 286 and an 8859 interrupt controller.

:-)

The problem for me is not so much what I've seen, I realize that most
of the issues we are talking about are the kind that will hit once in a
thousand times or less.

But we *can* reason about them in a way that can effectively prevent
the problem completely and when your cluster has 1 machine, 1/1000
starts becoming significant.

These days the vast majority of IO devices either are 99% DMA driven so
that a bit of overhead on MMIOs is irrelevant, or have one fast path
(low latency IB etc...) that needs some finer control, and the rest is
all setup which can be paranoid at will.

So I think we should aim for the "default" accessors most people use to
be as hadened as we can think of. I favor correctness over performance
in all cases. But then we also define a reasonable semantic for the
relaxed ones (well, we sort-of do have one, we might have to make it a
bit more precise in some areas) that allows the few MMIO fast path that
care to be optimized.

> If you wrote to the mask then enabled interrupts the first IACK cycle
> could be too close to write and break the cycle recovery time.
> That clobbered most of the interrupt controller registers.
> That probably affected every 286 board ever built!
> Not sure how much software added the required extra bus cycle.
> 
> > What we did inside readl, with the twi;isync sequence (which basically
> > means, trap on return value with "trap never" as a condition, followed
> > by isync that ensures all excpetion conditions are resolved), is force
> > the CPU to "consume" the data from the read before moving on.
> > 
> > This effectively makes readl fully synchronous (we would probably avoid
> > that if we were to implement a readl_relaxed).
> 
> I've always wondered exactly what the twi;isync were for - always seemed
> very heavy handed for most mmio reads.
> Particularly if you are doing mmio reads from a data fifo.

If you do that you should use the "s" version of the accessors. Those
will only do the above trick at the end of the access series. Also a
FIFO needs special care about endianness anyway, so you should use
those accessors regardless. (Hint: you never endian swap a FIFO even on
BE on a LE device, unless something's been wired very badly in HW).

> Perhaps there should be a writel_wait() that is allowed to do a read back
> for such code paths?

I think what we have is fine, we just define that the standard
writel/readl as fairly simple and hardened, and we look at providing a
somewhat reasonable set of relaxed variants for optimizing fast path.
We pretty much already are there, we just need to be better at defining
the semantics.

And for the super high perf case, which thankfully is either seldom
(server high perf network stuff) or very arch specific (ARM SoC stuff),
then arch specific driver hacks will always remain the norm.

Cheers,
Ben. 

>   David
> 


Re: RFC on writel and writel_relaxed

2018-03-28 Thread okaya

On 2018-03-28 02:14, Linus Torvalds wrote:
On Tue, Mar 27, 2018 at 5:24 PM, Sinan Kaya  
wrote:


Basically changing it to

dma_buffer->foo = 1;/* WB */
wmb()
writel_relaxed(KICK, DMA_KICK_REGISTER);/* UC */
mmiowb()


Why?

Why not  just remove the wmb(), and keep the barrier in the writel()?


Yes, we want to get there indeed. It is because of some arch not 
implementing writel properly. Maintainers want to play safe.


That is why I asked if IA64 and other well known archs follow the 
strongly ordered rule at this moment like PPC and ARM.


Or should we go and inform every arch about this before yanking wmb()?

Maintainers are afraid of introducing a regression.



The above code makes no sense, and just looks stupid to me. It also
generates pointlessly bad code on x86, so it's bad there too.

   Linus


RE: RFC on writel and writel_relaxed

2018-03-28 Thread David Laight
From: Benjamin Herrenschmidt
> Sent: 28 March 2018 10:56
...
> For example, let's say I have a device with a reset bit and the spec
> says the reset bit needs to be set for at least 10us.
> 
> This is wrong:
> 
>   writel(1, RESET_REG);
>   usleep(10);
>   writel(0, RESET_REG);
> 
> Because of write posting, the first write might arrive to the device
> right before the second one.
> 
> The typical "fix" is to turn that into:
> 
>   writel(1, RESET_REG);
>   readl(RESET_REG); /* Flush posted writes */

Would a writel(1, RESET_REG) here provide enough synchronsiation?

>   usleep(10);
>   writel(0, RESET_REG);
> 
> *However* the issue here, at least on power, is that the CPU can issue
> that readl but doesn't necessarily wait for it to complete (ie, the
> data to return), before proceeding to the usleep. Now a usleep contains
> a bunch of loads and stores and is probably fine, but a udelay which
> just loops on the timebase may not be.
> 
> Thus we may still violate the timing requirement.

I've seem that sort of code (with udelay() and no read back) quite often.
How many were in linux I don't know.

For small delays I usually fix it by repeated writes (of the same value)
to the device register. That can guarantee very short intervals.

The only time I've actually seen buffered writes break timing was
between a 286 and an 8859 interrupt controller.
If you wrote to the mask then enabled interrupts the first IACK cycle
could be too close to write and break the cycle recovery time.
That clobbered most of the interrupt controller registers.
That probably affected every 286 board ever built!
Not sure how much software added the required extra bus cycle.

> What we did inside readl, with the twi;isync sequence (which basically
> means, trap on return value with "trap never" as a condition, followed
> by isync that ensures all excpetion conditions are resolved), is force
> the CPU to "consume" the data from the read before moving on.
> 
> This effectively makes readl fully synchronous (we would probably avoid
> that if we were to implement a readl_relaxed).

I've always wondered exactly what the twi;isync were for - always seemed
very heavy handed for most mmio reads.
Particularly if you are doing mmio reads from a data fifo.

Perhaps there should be a writel_wait() that is allowed to do a read back
for such code paths?

David



Re: Aw: Re: RFC on writel and writel_relaxed

2018-03-28 Thread Benjamin Herrenschmidt
On Wed, 2018-03-28 at 12:13 +0200, Lino Sanfilippo wrote:
> Hi,
> 
> 
> > 
> > Yeah so that other trick I'm talking about is also used for timing
> > accuracy.
> > 
> > For example, let's say I have a device with a reset bit and the spec
> > says the reset bit needs to be set for at least 10us.
> > 
> > This is wrong:
> > 
> > writel(1, RESET_REG);
> > usleep(10);
> > writel(0, RESET_REG);
> > 
> > Because of write posting, the first write might arrive to the device
> > right before the second one.
> > 
> 
> Does not write posting only concern PCI? This seems to be a different topic. 
> Furthermore
> write posting should not include write reordering...

Nobody's talking about re-ordering and no, write posting is rather
common practice on a whole lot of different busses, not just PCI(e).

Cheers,
Ben.



Aw: Re: RFC on writel and writel_relaxed

2018-03-28 Thread Lino Sanfilippo
Hi,


> 
> Yeah so that other trick I'm talking about is also used for timing
> accuracy.
> 
> For example, let's say I have a device with a reset bit and the spec
> says the reset bit needs to be set for at least 10us.
> 
> This is wrong:
> 
>   writel(1, RESET_REG);
>   usleep(10);
>   writel(0, RESET_REG);
> 
> Because of write posting, the first write might arrive to the device
> right before the second one.
> 

Does not write posting only concern PCI? This seems to be a different topic. 
Furthermore
write posting should not include write reordering...

Regards,
Lino


Re: RFC on writel and writel_relaxed

2018-03-28 Thread Will Deacon
On Wed, Mar 28, 2018 at 09:01:27PM +1100, Benjamin Herrenschmidt wrote:
> On Wed, 2018-03-28 at 11:55 +0200, Arnd Bergmann wrote:
> > > powerpc and ARM can't quite make them synchronous I think, but at least
> > > they should have the same semantics as writel.
> > 
> > One thing that ARM does IIRC is that it only guarantees to order writel() 
> > within
> > one device, and the memory mapped PCI I/O space window almost certainly
> > counts as a separate device to the CPU.
> 
> That sounds bogus.

To elaborate, if you do the following on arm:

writel(DEVICE_FOO);
writel(DEVICE_BAR);

we generally cannot guarantee in which order those accesses will hit the
devices even if we add every barrier under the sun. You'd need something
in between, specific to DEVICE_FOO (probably a read-back) to really push
the first write out. This doesn't sound like it would be that uncommon to
me.

On the other hand:

writel(DEVICE_FOO);
writel(DEVICE_FOO);

is obviously ordered and also things like:

writel(DEVICE_FOO_IN_PCI_MEM_SPACE);
writel(DEVICE_BAR_IN_SAME_PCI_MEM_SPACE);

are ordered up to the PCI host bridge, because that's really the "device"
here.

Will


Re: RFC on writel and writel_relaxed

2018-03-28 Thread Benjamin Herrenschmidt
On Wed, 2018-03-28 at 11:55 +0200, Arnd Bergmann wrote:
> > powerpc and ARM can't quite make them synchronous I think, but at least
> > they should have the same semantics as writel.
> 
> One thing that ARM does IIRC is that it only guarantees to order writel() 
> within
> one device, and the memory mapped PCI I/O space window almost certainly
> counts as a separate device to the CPU.

That sounds bogus.

> In the absence of an enforced global synchronization during an I/O port
> access, that means writel() and outb() can be reordered before they arrive
> at a device in theory. Again, this rarely matters in practice, but I think it
> makes sense to document the less strict behavior here, given that we have
> common hardware that can't provide x86 compatible semantics.

Can't you put some kind of super heavy handed barrier in inX/outX ?
These things are never going to be performance sensitive anyway...

Cheers,
Ben.



Re: RFC on writel and writel_relaxed

2018-03-28 Thread Benjamin Herrenschmidt
On Wed, 2018-03-28 at 10:09 +0100, Will Deacon wrote:
> On Wed, Mar 28, 2018 at 09:00:01AM +, David Laight wrote:
> > From: Will Deacon
> > > Sent: 28 March 2018 09:54
> > 
> > ...
> > > > > I don't think so. My reading of memory-barriers.txt says that writeX 
> > > > > might
> > > > > expand to outX, and outX is not ordered with respect to other types of
> > > > > memory.
> > > > 
> > > > Ugh ?
> > > > 
> > > > My understanding of HW at least is the exact opposite. outX is *more*
> > > > ordered if anything, than any other accessors. IO space is completely
> > > > synchronous, non posted and ordered afaik.
> > > 
> > > I'm just going by memory-barriers.txt:
> > > 
> > > 
> > >  (*) inX(), outX():
> > > 
> > >  [...]
> > > 
> > >  They are guaranteed to be fully ordered with respect to each other.
> > > 
> > >  They are not guaranteed to be fully ordered with respect to other 
> > > types of
> > >  memory and I/O operation.
> > 
> > A long time ago there was a document from Intel that said that inb/outb 
> > weren't
> > necessarily synchronised wrt memory accesses.
> > (Might be P-pro era).
> > However no processors actually behaved that way and more recent docs
> > say that inb/outb are fully ordered.
> 
> Thank you, David! I'll write another patch fixing this up and hopefully
> we'll soon have one making writeX/readX much clearer.

Thanks for doing the grunt work Will ! :-)

Cheers,
Ben.



Re: RFC on writel and writel_relaxed

2018-03-28 Thread Benjamin Herrenschmidt
On Wed, 2018-03-28 at 10:07 +0100, Will Deacon wrote:
> 
> For arm/arm64 we guarantee ordering for (1) but not for (2) -- you'd need to
> add an mb() to make it work.
> 
> Do both of these work on power? 

Yes. There's even another quirk, see further down ;-)

> If so, I guess I can make readl even more
> expensive :/ Feels a bit like the tail wagging the dog, though.

Maybe, but then readl is always horribly slow anyway so you may not
necessarily be losing that much.

> Another thing I just realised is that we restrict the barriers we use in
> readl/writel on arm64 so that they don't necessary apply to both loads and
> stores. To be specific:
> 
>writel is ordered against prior writes to memory, but not reads

That could be tricky... You may end up with something that reads before
triggering a DMA and ends up with the post-DMA value ... ugh.

>readl is ordered against subsequent reads of memory, but not writes (but
>note that in example (1) above, the control dependency ensures that).
> 
> If necessary, I could move the barrier in our readl implementation to be
> before the read, then play the control-dependency + instruction-sync (ISB)
> trick that you do on power.

Yeah so that other trick I'm talking about is also used for timing
accuracy.

For example, let's say I have a device with a reset bit and the spec
says the reset bit needs to be set for at least 10us.

This is wrong:

writel(1, RESET_REG);
usleep(10);
writel(0, RESET_REG);

Because of write posting, the first write might arrive to the device
right before the second one.

The typical "fix" is to turn that into:

writel(1, RESET_REG);
readl(RESET_REG); /* Flush posted writes */
usleep(10);
writel(0, RESET_REG);

*However* the issue here, at least on power, is that the CPU can issue
that readl but doesn't necessarily wait for it to complete (ie, the
data to return), before proceeding to the usleep. Now a usleep contains
a bunch of loads and stores and is probably fine, but a udelay which
just loops on the timebase may not be.

Thus we may still violate the timing requirement.

What we did inside readl, with the twi;isync sequence (which basically
means, trap on return value with "trap never" as a condition, followed
by isync that ensures all excpetion conditions are resolved), is force
the CPU to "consume" the data from the read before moving on.

This effectively makes readl fully synchronous (we would probably avoid
that if we were to implement a readl_relaxed).

Cheers,
Ben.



Re: RFC on writel and writel_relaxed

2018-03-28 Thread Arnd Bergmann
On Wed, Mar 28, 2018 at 11:50 AM, Benjamin Herrenschmidt
 wrote:
> On Wed, 2018-03-28 at 09:53 +0100, Will Deacon wrote:
>> For arm/arm64 these end up behaving exactly the same as readX/writeX, but
>> I'm nervous about changing the documentation without understanding why it's
>> like it is currently. Maybe another ia64 thing?.
>
> I doubt it ... the Intel ancestry here would make me think they are
> completely ordered there too.
>
> powerpc and ARM can't quite make them synchronous I think, but at least
> they should have the same semantics as writel.

One thing that ARM does IIRC is that it only guarantees to order writel() within
one device, and the memory mapped PCI I/O space window almost certainly
counts as a separate device to the CPU.

In the absence of an enforced global synchronization during an I/O port
access, that means writel() and outb() can be reordered before they arrive
at a device in theory. Again, this rarely matters in practice, but I think it
makes sense to document the less strict behavior here, given that we have
common hardware that can't provide x86 compatible semantics.

   Arnd


Re: RFC on writel and writel_relaxed

2018-03-28 Thread Benjamin Herrenschmidt
On Wed, 2018-03-28 at 09:53 +0100, Will Deacon wrote:
> For arm/arm64 these end up behaving exactly the same as readX/writeX, but
> I'm nervous about changing the documentation without understanding why it's
> like it is currently. Maybe another ia64 thing?.

I doubt it ... the Intel ancestry here would make me think they are
completely ordered there too.

powerpc and ARM can't quite make them synchronous I think, but at least
they should have the same semantics as writel.

Cheers,
Ben.



Re: RFC on writel and writel_relaxed

2018-03-28 Thread Will Deacon
On Wed, Mar 28, 2018 at 09:00:01AM +, David Laight wrote:
> From: Will Deacon
> > Sent: 28 March 2018 09:54
> ...
> > > > I don't think so. My reading of memory-barriers.txt says that writeX 
> > > > might
> > > > expand to outX, and outX is not ordered with respect to other types of
> > > > memory.
> > >
> > > Ugh ?
> > >
> > > My understanding of HW at least is the exact opposite. outX is *more*
> > > ordered if anything, than any other accessors. IO space is completely
> > > synchronous, non posted and ordered afaik.
> > 
> > I'm just going by memory-barriers.txt:
> > 
> > 
> >  (*) inX(), outX():
> > 
> >  [...]
> > 
> >  They are guaranteed to be fully ordered with respect to each other.
> > 
> >  They are not guaranteed to be fully ordered with respect to other 
> > types of
> >  memory and I/O operation.
> 
> A long time ago there was a document from Intel that said that inb/outb 
> weren't
> necessarily synchronised wrt memory accesses.
> (Might be P-pro era).
> However no processors actually behaved that way and more recent docs
> say that inb/outb are fully ordered.

Thank you, David! I'll write another patch fixing this up and hopefully
we'll soon have one making writeX/readX much clearer.

Will


Re: RFC on writel and writel_relaxed

2018-03-28 Thread Will Deacon
On Wed, Mar 28, 2018 at 05:42:56PM +1100, Benjamin Herrenschmidt wrote:
> On Tue, 2018-03-27 at 20:26 -1000, Linus Torvalds wrote:
> > On Tue, Mar 27, 2018 at 6:33 PM, Benjamin Herrenschmidt
> >  wrote:
> > > 
> > > This is why, I want (with your agreement) to define clearly and once
> > > and for all, that the Linux semantics of writel are that it is ordered
> > > with previous writes to coherent memory (*)
> > 
> > Honestly, I think those are the sane semantics. In fact, make it
> > "ordered with previous writes" full stop, since it's not only ordered
> > wrt previous writes to memory, but also previous writel's.
> 
> Of course. It was somewhat a given that it's ordered vs. any previous
> MMIO actually, but it doesn't hurt to spell it out once more.

Good. So I think this confirms our understanding so far.

> 
> > > Also, can I assume the above ordering with writel() equally applies to
> > > readl() or not ?
> > > 
> > > IE:
> > > dma_buf->foo = 1;
> > > readl(STUPID_DEVICE_DMA_KICK_ON_READ);
> > 
> > If that KICK_ON_READ is UC, then that's definitely the case. And
> > honestly, status registers like that really should always be UC.
> > 
> > But if somebody sets the area WC (which is crazy), then I think it
> > might be at least debatable. x86 semantics does allow reads to be done
> > before previous writes (or, put another way, writes to be buffered -
> > the buffers are ordered so writes don't get re-ordered, but reads can
> > happen during the buffering).
> 
> Right, for now I worry about UC semantics. Once we have nailed that, we
> can look at WC, which is a lot more tricky as archs differs more
> widely, but one thing at a time.
> 
> > But UC accesses are always  done entirely ordered, and honestly, any
> > status register that starts a DMA would not make sense any other way.
> > 
> > Of course, you'd have to be pretty odd to want to start a DMA with a
> > read anyway - partly exactly because it's bad for performance since
> > reads will be synchronous and not buffered like a write).
> 
> I have bad memories of old adaptec controllers ...
> 
> That said, I think the above might not be right on ARM if we want to
> make it the rule, Will, what do you reckon ?

So there are two cases to consider:

1.
if (readl(DEVICE_DMA_STATUS) == DMA_DONE)
mydata = *dma_bufp;



2.
*dma_bufp = 42;
readl(DEVICE_DMA_KICK_ON_READ);


For arm/arm64 we guarantee ordering for (1) but not for (2) -- you'd need to
add an mb() to make it work.

Do both of these work on power? If so, I guess I can make readl even more
expensive :/ Feels a bit like the tail wagging the dog, though.

Another thing I just realised is that we restrict the barriers we use in
readl/writel on arm64 so that they don't necessary apply to both loads and
stores. To be specific:

   writel is ordered against prior writes to memory, but not reads

   readl is ordered against subsequent reads of memory, but not writes (but
   note that in example (1) above, the control dependency ensures that).

If necessary, I could move the barrier in our readl implementation to be
before the read, then play the control-dependency + instruction-sync (ISB)
trick that you do on power.

Will


RE: RFC on writel and writel_relaxed

2018-03-28 Thread David Laight
From: Will Deacon
> Sent: 28 March 2018 09:54
...
> > > I don't think so. My reading of memory-barriers.txt says that writeX might
> > > expand to outX, and outX is not ordered with respect to other types of
> > > memory.
> >
> > Ugh ?
> >
> > My understanding of HW at least is the exact opposite. outX is *more*
> > ordered if anything, than any other accessors. IO space is completely
> > synchronous, non posted and ordered afaik.
> 
> I'm just going by memory-barriers.txt:
> 
> 
>  (*) inX(), outX():
> 
>  [...]
> 
>  They are guaranteed to be fully ordered with respect to each other.
> 
>  They are not guaranteed to be fully ordered with respect to other types 
> of
>  memory and I/O operation.

A long time ago there was a document from Intel that said that inb/outb weren't
necessarily synchronised wrt memory accesses.
(Might be P-pro era).
However no processors actually behaved that way and more recent docs
say that inb/outb are fully ordered.

David



Re: RFC on writel and writel_relaxed

2018-03-28 Thread Will Deacon
On Wed, Mar 28, 2018 at 08:29:45AM +1100, Benjamin Herrenschmidt wrote:
> On Tue, 2018-03-27 at 15:36 +0100, Will Deacon wrote:
> > > Can we say the same thing for iowrite32() and iowrite32be(). I also see 
> > > wmb()
> > > in front of these.
> > 
> > I don't think so. My reading of memory-barriers.txt says that writeX might
> > expand to outX, and outX is not ordered with respect to other types of
> > memory.
> 
> Ugh ?
> 
> My understanding of HW at least is the exact opposite. outX is *more*
> ordered if anything, than any other accessors. IO space is completely
> synchronous, non posted and ordered afaik.

I'm just going by memory-barriers.txt:


 (*) inX(), outX():

 [...]

 They are guaranteed to be fully ordered with respect to each other.

 They are not guaranteed to be fully ordered with respect to other types of
 memory and I/O operation.


For arm/arm64 these end up behaving exactly the same as readX/writeX, but
I'm nervous about changing the documentation without understanding why it's
like it is currently. Maybe another ia64 thing?.

Will


Re: RFC on writel and writel_relaxed

2018-03-28 Thread Linus Torvalds
On Tue, Mar 27, 2018, 20:43 Benjamin Herrenschmidt 
wrote:

> >
> > Of course, you'd have to be pretty odd to want to start a DMA with a
> > read anyway - partly exactly because it's bad for performance since
> > reads will be synchronous and not buffered like a write).
>
> I have bad memories of old adaptec controllers ...
>

*Old* adaptec controllers were likely to use the in/out instructions for
status and command data.

Those are actually even more ordered than UC reads and writes: the in/out
instructions are not just fully ordered, but are fully *synchronous* on
x86.

So not just doing accesses in order, but actually waiting for everything to
drain before they start executing, but they also wait for the operation
itself to complete (ie "out" will not just queue the write, it will then
wait for the queue to empty and the write data to hit the line).

That's why in/out were *so* slow, and why nobody uses them any more (well,
the address size limitations and the lack of any remapping of the address
obviously also are a reason).

Linus

>


Re: RFC on writel and writel_relaxed

2018-03-28 Thread Benjamin Herrenschmidt
On Wed, 2018-03-28 at 09:11 +0200, Arnd Bergmann wrote:
> On Wed, Mar 28, 2018 at 8:56 AM, Benjamin Herrenschmidt
>  wrote:
> > On Wed, 2018-03-28 at 06:53 +, Linus Torvalds wrote:
> > > On Tue, Mar 27, 2018, 20:43 Benjamin Herrenschmidt 
> > >  wrote:
> > > That's why in/out were *so* slow, and why nobody uses them any more
> > > (well, the address size limitations and the lack of any remapping of
> > > the address obviously also are a reason).
> > 
> > All true indeed, though a lot of other archs never quite made them
> > fully synchronous, which was another can of worms ... oh well.
> 
> Many architectures have no way of providing PCI compliant semantics
> for outb, as their instruction set and/or bus interconnect lacks a
> method of waiting for completion of an outb.

Yup, that includes powerpc. Note that since POWER8 we don't even
genetate IO space anymore :-)

> In practice, it doesn't seem to matter for any of the devices one would
> encounter these days: very few use I/O space, and those that do don't
> actually rely on the strict ordering. Some architectures (in particular
> s390, but I remember seeing the same thing elsewhere) explicitly
> disallow I/O space access on PCI because of this. On ARM, the typical
> PCI implementations have other problems that are worse than this
> one, so most drivers are fine with the almost-working semantics.

/me cries...

Ben.



Re: RFC on writel and writel_relaxed

2018-03-28 Thread Arnd Bergmann
On Wed, Mar 28, 2018 at 8:56 AM, Benjamin Herrenschmidt
 wrote:
> On Wed, 2018-03-28 at 06:53 +, Linus Torvalds wrote:
>> On Tue, Mar 27, 2018, 20:43 Benjamin Herrenschmidt 
>>  wrote:
>> That's why in/out were *so* slow, and why nobody uses them any more
>> (well, the address size limitations and the lack of any remapping of
>> the address obviously also are a reason).
>
> All true indeed, though a lot of other archs never quite made them
> fully synchronous, which was another can of worms ... oh well.

Many architectures have no way of providing PCI compliant semantics
for outb, as their instruction set and/or bus interconnect lacks a
method of waiting for completion of an outb.

In practice, it doesn't seem to matter for any of the devices one would
encounter these days: very few use I/O space, and those that do don't
actually rely on the strict ordering. Some architectures (in particular
s390, but I remember seeing the same thing elsewhere) explicitly
disallow I/O space access on PCI because of this. On ARM, the typical
PCI implementations have other problems that are worse than this
one, so most drivers are fine with the almost-working semantics.

Arnd


Re: RFC on writel and writel_relaxed

2018-03-28 Thread Benjamin Herrenschmidt
On Wed, 2018-03-28 at 06:53 +, Linus Torvalds wrote:
> 
> 
> On Tue, Mar 27, 2018, 20:43 Benjamin Herrenschmidt  ing.org> wrote:
> > >
> > > Of course, you'd have to be pretty odd to want to start a DMA
> > with a
> > > read anyway - partly exactly because it's bad for performance
> > since
> > > reads will be synchronous and not buffered like a write).
> > 
> > I have bad memories of old adaptec controllers ...
> 
> *Old* adaptec controllers were likely to use the in/out instructions
> for status and command data.
> 
> Those are actually even more ordered than UC reads and writes: the
> in/out instructions are not just fully ordered, but are fully
> *synchronous* on x86. 
> 
> So not just doing accesses in order, but actually waiting for
> everything to drain before they start executing, but they also wait
> for the operation itself to complete (ie "out" will not just queue
> the write, it will then wait for the queue to empty and the write
> data to hit the line).
> 
> That's why in/out were *so* slow, and why nobody uses them any more
> (well, the address size limitations and the lack of any remapping of
> the address obviously also are a reason).

All true indeed, though a lot of other archs never quite made them
fully synchronous, which was another can of worms ... oh well.

As for Adaptec, you might be right, I do remember having cases of old
stuff triggering DMA on reads, it might have been "Mac" variants of
Adaptec using MMIO or something...

Cheers,
Ben.



Re: RFC on writel and writel_relaxed

2018-03-28 Thread Benjamin Herrenschmidt
On Tue, 2018-03-27 at 20:26 -1000, Linus Torvalds wrote:
> On Tue, Mar 27, 2018 at 6:33 PM, Benjamin Herrenschmidt
>  wrote:
> > 
> > This is why, I want (with your agreement) to define clearly and once
> > and for all, that the Linux semantics of writel are that it is ordered
> > with previous writes to coherent memory (*)
> 
> Honestly, I think those are the sane semantics. In fact, make it
> "ordered with previous writes" full stop, since it's not only ordered
> wrt previous writes to memory, but also previous writel's.

Of course. It was somewhat a given that it's ordered vs. any previous
MMIO actually, but it doesn't hurt to spell it out once more.

> > Also, can I assume the above ordering with writel() equally applies to
> > readl() or not ?
> > 
> > IE:
> > dma_buf->foo = 1;
> > readl(STUPID_DEVICE_DMA_KICK_ON_READ);
> 
> If that KICK_ON_READ is UC, then that's definitely the case. And
> honestly, status registers like that really should always be UC.
> 
> But if somebody sets the area WC (which is crazy), then I think it
> might be at least debatable. x86 semantics does allow reads to be done
> before previous writes (or, put another way, writes to be buffered -
> the buffers are ordered so writes don't get re-ordered, but reads can
> happen during the buffering).

Right, for now I worry about UC semantics. Once we have nailed that, we
can look at WC, which is a lot more tricky as archs differs more
widely, but one thing at a time.

> But UC accesses are always  done entirely ordered, and honestly, any
> status register that starts a DMA would not make sense any other way.
> 
> Of course, you'd have to be pretty odd to want to start a DMA with a
> read anyway - partly exactly because it's bad for performance since
> reads will be synchronous and not buffered like a write).

I have bad memories of old adaptec controllers ...

That said, I think the above might not be right on ARM if we want to
make it the rule, Will, what do you reckon ?

Cheers,
Ben.



Re: RFC on writel and writel_relaxed

2018-03-28 Thread Linus Torvalds
On Tue, Mar 27, 2018 at 6:33 PM, Benjamin Herrenschmidt
 wrote:
>
> This is why, I want (with your agreement) to define clearly and once
> and for all, that the Linux semantics of writel are that it is ordered
> with previous writes to coherent memory (*)

Honestly, I think those are the sane semantics. In fact, make it
"ordered with previous writes" full stop, since it's not only ordered
wrt previous writes to memory, but also previous writel's.

> Also, can I assume the above ordering with writel() equally applies to
> readl() or not ?
>
> IE:
> dma_buf->foo = 1;
> readl(STUPID_DEVICE_DMA_KICK_ON_READ);

If that KICK_ON_READ is UC, then that's definitely the case. And
honestly, status registers like that really should always be UC.

But if somebody sets the area WC (which is crazy), then I think it
might be at least debatable. x86 semantics does allow reads to be done
before previous writes (or, put another way, writes to be buffered -
the buffers are ordered so writes don't get re-ordered, but reads can
happen during the buffering).

But UC accesses are always  done entirely ordered, and honestly, any
status register that starts a DMA would not make sense any other way.

Of course, you'd have to be pretty odd to want to start a DMA with a
read anyway - partly exactly because it's bad for performance since
reads will be synchronous and not buffered like a write).

   Linus


Re: RFC on writel and writel_relaxed

2018-03-28 Thread Linus Torvalds
On Tue, Mar 27, 2018 at 5:24 PM, Sinan Kaya  wrote:
>
> Basically changing it to
>
> dma_buffer->foo = 1;/* WB */
> wmb()
> writel_relaxed(KICK, DMA_KICK_REGISTER);/* UC */
> mmiowb()

Why?

Why not  just remove the wmb(), and keep the barrier in the writel()?

The above code makes no sense, and just looks stupid to me. It also
generates pointlessly bad code on x86, so it's bad there too.

   Linus


Re: RFC on writel and writel_relaxed

2018-03-27 Thread Benjamin Herrenschmidt
On Tue, 2018-03-27 at 23:24 -0400, Sinan Kaya wrote:
> On 3/27/2018 10:51 PM, Linus Torvalds wrote:
> > > The discussion at hand is about
> > > 
> > > dma_buffer->foo = 1;/* WB */
> > > writel(KICK, DMA_KICK_REGISTER);/* UC */
> > 
> > Yes. That certainly is ordered on x86. In fact, afaik it's ordered
> > even if that writel() might be of type WC, because that only delays
> > writes, it doesn't move them earlier.
> 
> Now that we clarified x86 myth, Is this guaranteed on all architectures?

If not we need to fix it. It's guaranteed on the "main" ones (arm,
arm64, powerpc, i386, x86_64). We might need to check with other arch
maintainers for the rest.

We really want Linux to provide well defined "sane" semantics for the
basic writel accessors.

Note: We still have open questions about how readl() relates to
surrounding memory accesses. It looks like ARM and powerpc do different
things here.

> We keep getting IA64 exception example. Maybe, this is also corrected since
> then.

I would think ia64 fixed it back when it was all discussed. I was under
the impression all ia64 had "special" was the writel vs. spin_unlock
which requires mmiowb, but maybe that was never completely fixed ?

> Jose Abreu says "I don't know about x86 but arc architecture doesn't
> have a wmb() in the writel() function (in some configs)".

Well, it probably should then.

> As long as we have these exceptions, these wmb() in common drivers is not
> going anywhere and relaxed-arches will continue paying performance penalty.

Well, let's fix them or leave them broken, at this point, it doesn't
matter. We can give all arch maintainers a wakeup call and start making
drivers work based on the documented assumptions.

> I see 15% performance loss on ARM64 servers using Intel i40e network
> drivers and an XL710 adapter due to CPU keeping itself busy doing barriers
> most of the time rather than real work because of sequences like this all over
> the place.
> 
>  dma_buffer->foo = 1;/* WB */
>wmb()
>  writel(KICK, DMA_KICK_REGISTER);/* UC */
>
> I posted several patches last week to remove duplicate barriers on ARM while
> trying to make the code friendly with other architectures.
> 
> Basically changing it to
> 
> dma_buffer->foo = 1;/* WB */
> wmb()
> writel_relaxed(KICK, DMA_KICK_REGISTER);/* UC */
> mmiowb()
> 
> This is a small step in the performance direction until we remove all 
> exceptions.
> 
> https://www.spinics.net/lists/netdev/msg491842.html
> https://www.spinics.net/lists/linux-rdma/msg62434.html
> https://www.spinics.net/lists/arm-kernel/msg642336.html
> 
> Discussion started to move around the need for relaxed API on PPC and then
> why wmb() question came up.

I'm working on the problem of relaxed APIs for powerpc, but we can keep
that orthogonal. As is, today, a wmb() + writel() and a wmb() +
writel_relaxed() on powerpc are identical. So changing them will not
break us.

But I don't see the point of doing that transformation if we can just
get the straying archs fixed. It's not like any of them has a
significant market presence these days anyway.

Cheers,
Ben.

> Sinan
> 


Re: RFC on writel and writel_relaxed

2018-03-27 Thread Benjamin Herrenschmidt
On Tue, 2018-03-27 at 16:51 -1000, Linus Torvalds wrote:
> On Tue, Mar 27, 2018 at 3:03 PM, Benjamin Herrenschmidt
>  wrote:
> > 
> > The discussion at hand is about
> > 
> > dma_buffer->foo = 1;/* WB */
> > writel(KICK, DMA_KICK_REGISTER);/* UC */
> 
> Yes. That certainly is ordered on x86. In fact, afaik it's ordered
> even if that writel() might be of type WC, because that only delays
> writes, it doesn't move them earlier.

Ok so this is our answer ...

 ... snip ... (thanks for the background info !)

> Oh, the above UC case is absoutely guaranteed.

Good.

Then

> The only issue really is that 99.9% of all testing gets done on x86
> unless you look at specific SoC drivers.
> 
> On ARM, for example, there is likely little reason to care about x86
> memory ordering, because there is almost zero driver overlap between
> x86 and ARM.
> 
> *Historically*, the reason for following the x86 IO ordering was
> simply that a lot of architectures used the drivers that were
> developed on x86. The alpha and powerpc workstations were *designed*
> with the x86 IO bus (PCI, then PCIe) and to work with the devices that
> came with it.
> 
> ARM? PCIe is almost irrelevant. For ARM servers, if they ever take
> off, sure. But 99.99% of ARM is about their own SoC's, and so "x86
> test coverage" is simply not an issue.
> 
> How much of an issue is it for Power? Maybe you decide it's not a big deal.
> 
> Then all the above is almost irrelevant.

So the overlap may not be that NIL in practice :-) But even then that
doesn't matter as ARM has been happily implementing the same semantic
you describe above for years, as do we powerpc.

This is why, I want (with your agreement) to define clearly and once
and for all, that the Linux semantics of writel are that it is ordered
with previous writes to coherent memory (*)

This is already what ARM and powerpc provide, from what you say, what
x86 provides, I don't see any reason to keep that badly documented and
have drivers randomly growing useless wmb()'s because they don't think
it works on x86 without them !

Once that's sorted, let's tackle the problem of mmiowb vs. spin_unlock
and the problem of writel_relaxed semantics but as separate issues :-)

Also, can I assume the above ordering with writel() equally applies to
readl() or not ?

IE:
dma_buf->foo = 1;
readl(STUPID_DEVICE_DMA_KICK_ON_READ);

Also works on x86 ? (It does on power, maybe not on ARM).

Cheers,
Ben.

(*) From an Linux API perspective, all of this is only valid if the
memory was allocated by dma_alloc_coherent(). Anything obtained by
dma_map_something() might have been bounced bufferred or might require
extra cache flushes on some architectures, and thus needs
dma_sync_for_{cpu,device} calls.

Cheers,
Ben.



Re: RFC on writel and writel_relaxed

2018-03-27 Thread Sinan Kaya
On 3/27/2018 10:51 PM, Linus Torvalds wrote:
>> The discussion at hand is about
>>
>> dma_buffer->foo = 1;/* WB */
>> writel(KICK, DMA_KICK_REGISTER);/* UC */
> Yes. That certainly is ordered on x86. In fact, afaik it's ordered
> even if that writel() might be of type WC, because that only delays
> writes, it doesn't move them earlier.

Now that we clarified x86 myth, Is this guaranteed on all architectures?
We keep getting IA64 exception example. Maybe, this is also corrected since
then.

Jose Abreu says "I don't know about x86 but arc architecture doesn't
have a wmb() in the writel() function (in some configs)".

As long as we have these exceptions, these wmb() in common drivers is not
going anywhere and relaxed-arches will continue paying performance penalty.

I see 15% performance loss on ARM64 servers using Intel i40e network
drivers and an XL710 adapter due to CPU keeping itself busy doing barriers
most of the time rather than real work because of sequences like this all over
the place.

 dma_buffer->foo = 1;/* WB */
 wmb()
 writel(KICK, DMA_KICK_REGISTER);/* UC */

I posted several patches last week to remove duplicate barriers on ARM while
trying to make the code friendly with other architectures.

Basically changing it to

dma_buffer->foo = 1;/* WB */
wmb()
writel_relaxed(KICK, DMA_KICK_REGISTER);/* UC */
mmiowb()

This is a small step in the performance direction until we remove all 
exceptions.

https://www.spinics.net/lists/netdev/msg491842.html
https://www.spinics.net/lists/linux-rdma/msg62434.html
https://www.spinics.net/lists/arm-kernel/msg642336.html

Discussion started to move around the need for relaxed API on PPC and then
why wmb() question came up.

Sinan

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: RFC on writel and writel_relaxed

2018-03-27 Thread Linus Torvalds
On Tue, Mar 27, 2018 at 3:03 PM, Benjamin Herrenschmidt
 wrote:
>
> The discussion at hand is about
>
> dma_buffer->foo = 1;/* WB */
> writel(KICK, DMA_KICK_REGISTER);/* UC */

Yes. That certainly is ordered on x86. In fact, afaik it's ordered
even if that writel() might be of type WC, because that only delays
writes, it doesn't move them earlier.

Whether people *do* that or not, I don't know. But I wouldn't be
surprised if they do.

So if it's a DMA buffer, it's "cached". And even cached accesses are
ordered wrt MMIO.

Basically, to get unordered writes on x86, you need to either use
explicitly nontemporal stores, or have a writecombining region with
back-to-back writes that actually combine.

And nobody really does that nontemporal store thing any more because
the hardware that cared pretty much doesn't exist any more. It was too
much pain. People use DMA and maybe an UC store for starting the DMA
(or possibly a WC buffer that gets multiple  stores in ascending order
as a stream of commands).

Things like UC will force everything to be entirely ordered, but even
without UC, loads won't pass loads, and stores won't pass stores.

> Now it appears that this wasn't fully understood back then, and some
> people are now saying that x86 might not even provide that semantic
> always.

Oh, the above UC case is absoutely guaranteed.

And I think even if it's WC, the write to kick off the DMA is ordered
wrt the cached write.

On x86, I think you need barriers only if you do things like

 - do two non-temporal stores and require them to be ordered: put a
sfence or mfence in between them.

 - do two WC stores, and make sure they do not combine: put a sfence
or mfence between them.

 - do a store, and a subsequent from a different address, and neither
of them is UC: put a mfence between them. But note that this is
literally just "load after store". A "store after load" doesn't need
one.

I think that's pretty much it.

For example, the "lfence" instruction is almost entirely pointless on
x86 - it was designed back in the time when people *thought* they
might re-order loads. But loads don't get re-ordered. At least Intel
seems to document that only non-temporal *stores* can get re-ordered
wrt each other.

End result: lfence is a historical oddity that can now be used to
guarantee that a previous  load has finished, and that in turn meant
that it is  now used in the Spectre mitigations. But it basically has
no real memory ordering meaning since nothing passes an earlier load
anyway, it's more of a pipeline thing.

But in the end, one question is just "how much do drivers actually
_rely_ on the x86 strong ordering?"

We so support "smp_wmb()" even though x86 has strong enough ordering
that just a barrier suffices. Somebody might just say "screw the x86
memory ordering, we're relaxed, and we'll fix up the drivers we care
about".

The only issue really is that 99.9% of all testing gets done on x86
unless you look at specific SoC drivers.

On ARM, for example, there is likely little reason to care about x86
memory ordering, because there is almost zero driver overlap between
x86 and ARM.

*Historically*, the reason for following the x86 IO ordering was
simply that a lot of architectures used the drivers that were
developed on x86. The alpha and powerpc workstations were *designed*
with the x86 IO bus (PCI, then PCIe) and to work with the devices that
came with it.

ARM? PCIe is almost irrelevant. For ARM servers, if they ever take
off, sure. But 99.99% of ARM is about their own SoC's, and so "x86
test coverage" is simply not an issue.

How much of an issue is it for Power? Maybe you decide it's not a big deal.

Then all the above is almost irrelevant.

   Linus


Re: RFC on writel and writel_relaxed

2018-03-27 Thread Benjamin Herrenschmidt
On Tue, 2018-03-27 at 16:10 +0100, Will Deacon wrote:
> To clarify: are you saying that on x86 you need a wmb() prior to a writel
> if you want that writel to be ordered after prior writes to memory? Is this
> specific to WC memory or some other non-standard attribute?
> 
> The only reason we have wmb() inside writel() on arm, arm64 and power is for
> parity with x86 because Linus (CC'd) wanted architectures to order I/O vs
> memory by default so that it was easier to write portable drivers. The
> performance impact of that implicit barrier is non-trivial, but we want the
> driver portability and I went as far as adding generic _relaxed versions for
> the cases where ordering isn't required. You seem to be suggesting that none
> of this is necessary and drivers would already run into problems on x86 if
> they didn't use wmb() explicitly in conjunction with writel, which I find
> hard to believe and is in direct contradiction with the current Linux I/O
> memory model (modulo the broken example in the dma_*mb section of
> memory-barriers.txt).

Another clarification while we are at it 

All of this only applies to concurrent access by the CPU and the device
to memory allocate with dma_alloc_coherent().

For memory "mapped" into the DMA domain via dma_map_* then an extra
dma_sync_for_* is needed.

In most useful server cases etc... these latter are NOPs, but
architecture without full DMA cache coherency or using swiotlb,
dma_map_* might maintain bounce buffers or play additional cache
flushing tricks.

Cheers,
Ben.



Re: RFC on writel and writel_relaxed

2018-03-27 Thread Benjamin Herrenschmidt
On Tue, 2018-03-27 at 14:39 -1000, Linus Torvalds wrote:
> On Tue, Mar 27, 2018 at 11:33 AM, Benjamin Herrenschmidt
>  wrote:
> > 
> > Well, we need to clarify that once and for all, because as I wrote
> > earlier, it was decreed by Linus more than a decade ago that writel
> > would be fully ordered by itself vs. previous memory stores (at least
> > on UC memory).
> 
> Yes.
> 
> So "writel()" needs to be ordered with respect to other writel() uses
> on the same thread. Anything else *will* break drivers. Obviously, the
> drivers may then do magic to say "do write combining etc", but that
> magic will be architecture-specific.
> 
> The other issue is that "writel()" needs to be ordered wrt other CPU's
> doing "writel()" if those writel's are in a spinlocked region.

 .../...

The discussion at hand is about

dma_buffer->foo = 1;/* WB */
writel(KICK, DMA_KICK_REGISTER);/* UC */

(The WC case is something else, let's not mix things up just yet)

IE, a store to normal WB cache memory followed by a writel to a device
which will then DMA from that buffer.

Back in the days, we did require on powerpc a wmb() between these, but
you made the point that x86 didn't and driver writers would never get
that right.

We decided to go conservative, added the necessary barrier inside
writel, so did ARM and it became the norm that writel is also fully
ordered vs. previous stores to memory *by the same CPU* of course (or
protected by the same spinlock).

Now it appears that this wasn't fully understood back then, and some
people are now saying that x86 might not even provide that semantic
always.

So a number (fairly large) of drivers have been adding wmb() in those
case, while others haven't, and it's a mess.

The mess is compounded by the fact that if writel is now defined to
*not* provide that ordering guarantee, then writel_relaxed() is
pointless since all it is defined to relax is precisely the above
ordering guarantee.

So I want to get to the bottom of this once and for all so we can have
well defined and documented semantics and stop having drivers do random
things that may or may not work on some or all architectures (including
x86 !).

Quite note about the spinlock case... In fact this is the only case you
did allow back then to be relaxed. In theory a writel followed by a
spin_unlock requires an mmiowb (which is the only point of that barrier
in fact). This was done because an arch (I think ia64) had a hard time
getting MMIOs from multiple CPUs get in order vs. a lock and required
an expensive access to the PCI host bridge to do so.

Back then, on powerpc, we chose not to allow that relaxing and instead
added code to our writel to set a per-cpu flag which would cause the
next spin_unlock to use a stronger barrier than usual.

We do need to clarify this as well, but let's start with the most basic
one first, there is enough confusion already.

Cheers,
Ben.


Re: RFC on writel and writel_relaxed

2018-03-27 Thread Linus Torvalds
On Tue, Mar 27, 2018 at 11:33 AM, Benjamin Herrenschmidt
 wrote:
>
> Well, we need to clarify that once and for all, because as I wrote
> earlier, it was decreed by Linus more than a decade ago that writel
> would be fully ordered by itself vs. previous memory stores (at least
> on UC memory).

Yes.

So "writel()" needs to be ordered with respect to other writel() uses
on the same thread. Anything else *will* break drivers. Obviously, the
drivers may then do magic to say "do write combining etc", but that
magic will be architecture-specific.

The other issue is that "writel()" needs to be ordered wrt other CPU's
doing "writel()" if those writel's are in a spinlocked region.

So it's not  that "writel()" needs to be ordered wrt the spinlock
itself, but you *do* need to honor ordering if you have something like
this:

   spin_lock();
   writel(a);
   writel(b);
   spin_unlock();

and if two CPU's run the above code "at the same time", then the
*ONLY* acceptable sequence is abab.

You cannot, and must not, ever see "aabb" at the device, for example,
because of how the writel would basically leak out of the spinlock.

That sounds "obvious", but dammit, a lot of architectures got that
wrong, afaik.

Linus


Re: RFC on writel and writel_relaxed

2018-03-27 Thread Benjamin Herrenschmidt
On Tue, 2018-03-27 at 14:54 -0700, Alexander Duyck wrote:
> On Tue, Mar 27, 2018 at 2:35 PM, Benjamin Herrenschmidt
>  wrote:
> > On Tue, 2018-03-27 at 10:46 -0400, Sinan Kaya wrote:
> > >  combined buffers.
> > > 
> > > Alex:
> > > "Don't bother. I can tell you right now that for x86 you have to have a
> > > wmb() before the writel().
> > 
> > No, this isn't the semantics of writel. You shouldn't need it unless
> > something changed and we need to revisit our complete understanding of
> > *all* MMIO accessor semantics.
> 
> The issue seems to be that there have been two different ways of
> dealing with this. There has historically been a number of different
> drivers that have been carrying this wmb() workaround since something
> like 2002. I get that the semantics for writel might have changed
> since then, but those of us who already have the wmb() in our drivers
> will be very wary of anyone wanting to go through and remove them
> since writel is supposed to be "good enough". I would much rather err
> on the side of caution here.
> 
> I view the wmb() + writel_relaxed() as more of a driver owning and
> handling this itself. Besides in the Intel Ethernet driver case it is
> better performance as our wmb() placement for us also provides a
> secondary barrier so we don't need to add a separate smp_wmb() to deal
> with a potential race we have with the Tx cleanup.
> 
> > At least for UC space, it has always been accepted (and enforced) that
> > writel would not require any other barrier to order vs. previous stores
> > to memory.
> 
> So the one thing I would question here is if this is UC vs UC or if
> this extends to other types as well? So for x86 we could find
> references to Write Combining being flushed by a write to UC memory,
> however I have yet to find a clear explanation of what a write to UC
> does to WB. 

Well, this is the standard write memory + trigger DMA case, the one
specific case for which Linus was adamant we don't need another barrier
 back then ...

> My personal inclination would be to err on the side of
> caution.

Which means writel_relaxed is now pointless ?

We need clear semantics here. In this case the "side of caution" means
we are randomly doing things not understanding what really happens and
that makes me *more* nervous.

>  I just don't want us going through and removing the wmb()
> calls because it "should" work. I would want to know for certain it
> will work.

We need to know for certain anyway. Otherwise, all the drivers that do
not have wmb's are potentially broken.

So I dont agree with the status quo.

We need to establish precisely what x86 does, decide what we want the
semantic of writel to be, and implement things accordingly.

Ben.



Re: RFC on writel and writel_relaxed

2018-03-27 Thread Sinan Kaya
On 3/27/2018 5:54 PM, Alexander Duyck wrote:
> I view the wmb() + writel_relaxed() as more of a driver owning and
> handling this itself. Besides in the Intel Ethernet driver case it is
> better performance as our wmb() placement for us also provides a
> secondary barrier so we don't need to add a separate smp_wmb() to deal
> with a potential race we have with the Tx cleanup.

Thanks for the reminder. 

I forgot about the double barrier optimization. wmb() + writel_relaxed()
seems to be the best option for Intel network drivers at this moment. 
Otherwise, we'll have to remove wmb() and throw in smp barriers there
like you mentioned.

I'll leave the changes in the Intel drivers alone.

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: RFC on writel and writel_relaxed

2018-03-27 Thread Alexander Duyck
On Tue, Mar 27, 2018 at 2:35 PM, Benjamin Herrenschmidt
 wrote:
> On Tue, 2018-03-27 at 10:46 -0400, Sinan Kaya wrote:
>>  combined buffers.
>>
>> Alex:
>> "Don't bother. I can tell you right now that for x86 you have to have a
>> wmb() before the writel().
>
> No, this isn't the semantics of writel. You shouldn't need it unless
> something changed and we need to revisit our complete understanding of
> *all* MMIO accessor semantics.

The issue seems to be that there have been two different ways of
dealing with this. There has historically been a number of different
drivers that have been carrying this wmb() workaround since something
like 2002. I get that the semantics for writel might have changed
since then, but those of us who already have the wmb() in our drivers
will be very wary of anyone wanting to go through and remove them
since writel is supposed to be "good enough". I would much rather err
on the side of caution here.

I view the wmb() + writel_relaxed() as more of a driver owning and
handling this itself. Besides in the Intel Ethernet driver case it is
better performance as our wmb() placement for us also provides a
secondary barrier so we don't need to add a separate smp_wmb() to deal
with a potential race we have with the Tx cleanup.

> At least for UC space, it has always been accepted (and enforced) that
> writel would not require any other barrier to order vs. previous stores
> to memory.

So the one thing I would question here is if this is UC vs UC or if
this extends to other types as well? So for x86 we could find
references to Write Combining being flushed by a write to UC memory,
however I have yet to find a clear explanation of what a write to UC
does to WB. My personal inclination would be to err on the side of
caution. I just don't want us going through and removing the wmb()
calls because it "should" work. I would want to know for certain it
will work.

- Alex


Re: RFC on writel and writel_relaxed

2018-03-27 Thread Benjamin Herrenschmidt
On Tue, 2018-03-27 at 10:46 -0400, Sinan Kaya wrote:
>  combined buffers.
> 
> Alex:
> "Don't bother. I can tell you right now that for x86 you have to have a
> wmb() before the writel().

No, this isn't the semantics of writel. You shouldn't need it unless
something changed and we need to revisit our complete understanding of
*all* MMIO accessor semantics.

At least for UC space, it has always been accepted (and enforced) that
writel would not require any other barrier to order vs. previous stores
to memory.

> Based on the comment in
> (https://www.spinics.net/lists/linux-rdma/msg62666.html):
> Replacing wmb() + writel() with wmb() + writel_relaxed() will work on
> PPC, it will just not give you a benefit today.
> 
> I say the patch set stays. This gives benefit on ARM, and has no
> effect on x86 and PowerPC. If you want to look at trying to optimize
> things further on PowerPC and such then go for it in terms of trying
> to implement the writel_relaxed(). Otherwise I say we call the ARM
> goodness a win and don't get ourselves too wrapped up in trying to fix
> this for all architectures."


Re: RFC on writel and writel_relaxed

2018-03-27 Thread Benjamin Herrenschmidt
On Tue, 2018-03-27 at 11:54 -0700, Alexander Duyck wrote:
> As far as I know the code has been this way for a while, something
> like 2002, when the barrier was already present in e1000. However
> there it was calling out weakly ordered models "such as IA-64". Since
> then pretty much all the hardware based network drivers at this point
> have similar code floating around with wmb() in place to prevent
> issues on weak ordered memory systems.
> 
> So in any case we still need to be careful as there are architectures
> that are depending on this even if they might not be x86. :-/

Well, we need to clarify that once and for all, because as I wrote
earlier, it was decreed by Linus more than a decade ago that writel
would be fully ordered by itself vs. previous memory stores (at least
on UC memory).

This is why we added sync's to writel on powerpc and later ARM added
similar barriers to theirs.

This is also why writel_relaxed was added (though much later), since
what writel_relaxed does is to life that specific requirement.

IE. If what you say is true and wmb() is needed on x86, then
writel_relaxed is now completely useless...

Ben.


Re: RFC on writel and writel_relaxed

2018-03-27 Thread Benjamin Herrenschmidt
On Tue, 2018-03-27 at 15:36 +0100, Will Deacon wrote:
> > Can we say the same thing for iowrite32() and iowrite32be(). I also see 
> > wmb()
> > in front of these.
> 
> I don't think so. My reading of memory-barriers.txt says that writeX might
> expand to outX, and outX is not ordered with respect to other types of
> memory.

Ugh ?

My understanding of HW at least is the exact opposite. outX is *more*
ordered if anything, than any other accessors. IO space is completely
synchronous, non posted and ordered afaik.

Cheers,
Ben.



Re: RFC on writel and writel_relaxed

2018-03-27 Thread Benjamin Herrenschmidt
On Tue, 2018-03-27 at 08:12 -0600, Jason Gunthorpe wrote:
> > I have been converting wmb+writel to wmb+writel_relaxed. (About 30 patches)
> > 
> > I will have to just remove the wmb and keep writel, then repost.
> 
> Okay, but before you do that, can we get a statement how this works
> for WC?
> 
> Some of these writels are to WC memory, do they need the wmb()?!?

This is an issue as we don't have well defined semantics for WC.

At this point, I would suggest staying away from that (ie, not changing
them). We need to look into it.

I know for example that on powerpc I cannot give you any weaker
semantic on WC for writel (I have to put a full sync in there), but I
am trying to see if I can make writel_relaxed both work with the
existing semantics and provide combining. But it's not yet a given (our
weaker IO barrier, eieio, isn't architecturally defined to do anything
on G=0 space, looking with the HW guys at what the HW actually does).

Cheers,
Ben.



Re: RFC on writel and writel_relaxed

2018-03-27 Thread Benjamin Herrenschmidt
On Tue, 2018-03-27 at 09:46 -0400, Sinan Kaya wrote:
> On 3/27/2018 7:02 AM, Will Deacon wrote:
> > - See Documentation/DMA-API.txt for more information on consistent 
> > memory.
> > + can see it now has ownership.  Note that, when using writel(), a prior
> > + wmb() is not needed to guarantee that the cache coherent memory writes
> > + have completed before writing to the MMIO region.  The cheaper
> > + writel_relaxed() does not provide this guarantee and must not be used
> > + here.
> 
> Can we say the same thing for iowrite32() and iowrite32be(). I also see wmb()
> in front of these.

Yes, they should have the same semantics as writel

Cheers,
Ben.


Re: RFC on writel and writel_relaxed

2018-03-27 Thread Arnd Bergmann
On Tue, Mar 27, 2018 at 9:54 PM, Arnd Bergmann  wrote:
> On Tue, Mar 27, 2018 at 8:54 PM, Alexander Duyck
>  wrote:
>> On Tue, Mar 27, 2018 at 8:10 AM, Will Deacon  wrote:
>
> 11.10 STORE BUFFER
> Intel 64 and IA-32 processors temporarily store each write (store) to
> memory in a store buffer. The store buffer
> improves processor performance by allowing the processor to continue
> executing instructions without having to
> wait until a write to memory and/or to a cache is complete. It also
> allows writes to be delayed for more efficient use
> of memory-access bus cycles.
> In general, the existence of the store buffer is transparent to
> software, even in systems that use multiple processors.
> The processor ensures that write operations are always carried out in
> program order. It also insures that the
> contents of the store buffer are always drained to memory in the
> following situations:
> • When an exception or interrupt is generated.
> • (P6 and more recent processor families only) When a serializing
> instruction is executed.
> • When an I/O instruction is executed.

I guess I/O instruction is still ambiguous on x86, it may just refer
to 'inb'/'outb' style instructions rather than 'mov' on a device MMIO
area.

Here's a link to a reply from Linus that I found on this topic:

http://yarchive.net/comp/linux/write_combining.html

  Arnd


Re: RFC on writel and writel_relaxed

2018-03-27 Thread Arnd Bergmann
On Tue, Mar 27, 2018 at 8:54 PM, Alexander Duyck
 wrote:
> On Tue, Mar 27, 2018 at 8:10 AM, Will Deacon  wrote:

>>>
>>> Sinan
>>> "We are being told that if you use writel(), then you don't need a wmb() on
>>> all architectures."
>>>
>>> Alex:
>>> "I'm not sure who told you that but that is incorrect, at least for
>>> x86. If you attempt to use writel() without the wmb() we will have to
>>> NAK the patches. We will accept the wmb() with writel_releaxed() since
>>> that solves things for ARM."
>>>
>>> > Jason is seeking behavior clarification for write combined buffers.
>>>
>>> Alex:
>>> "Don't bother. I can tell you right now that for x86 you have to have a
>>> wmb() before the writel().
>>
>> To clarify: are you saying that on x86 you need a wmb() prior to a writel
>> if you want that writel to be ordered after prior writes to memory? Is this
>> specific to WC memory or some other non-standard attribute?
>
> Note, I am not a CPU guy so this is just my interpretation. It is my
> understanding that the wmb(), aka sfence, is needed on x86 to sort out
> writes between Write-back(WB) system memory and Strong Uncacheable
> (UC) MMIO accesses.
>
> I was hoping to be able to cite something in the software developers
> manual 
> (https://software.intel.com/sites/default/files/managed/39/c5/325462-sdm-vol-1-2abcd-3abcd.pdf),
> but that tends to be pretty vague. I have re-read section 22.34
> (volume 3B) several times and I am still not clear on if it says we
> need the sfence or not. It is a matter of figuring out what the impact
> of store buffers and caching are for WB versus UC memory.

Here is what I found regarding the store buffer in that document:

11.10 STORE BUFFER
Intel 64 and IA-32 processors temporarily store each write (store) to
memory in a store buffer. The store buffer
improves processor performance by allowing the processor to continue
executing instructions without having to
wait until a write to memory and/or to a cache is complete. It also
allows writes to be delayed for more efficient use
of memory-access bus cycles.
In general, the existence of the store buffer is transparent to
software, even in systems that use multiple processors.
The processor ensures that write operations are always carried out in
program order. It also insures that the
contents of the store buffer are always drained to memory in the
following situations:
• When an exception or interrupt is generated.
• (P6 and more recent processor families only) When a serializing
instruction is executed.
• When an I/O instruction is executed.
• When a LOCK operation is performed.
• (P6 and more recent processor families only) When a BINIT operation
is performed.
• (Pentium III, and more recent processor families only) When using an
SFENCE instruction to order stores.
• (Pentium 4 and more recent processor families only) When using an
MFENCE instruction to order stores.
The discussion of write ordering in Section 8.2, “Memory Ordering,”
gives a detailed description of the operation of
the store buffer.

   Arnd


Re: RFC on writel and writel_relaxed

2018-03-27 Thread Alexander Duyck
On Tue, Mar 27, 2018 at 8:10 AM, Will Deacon  wrote:
> Hi Alex,
>
> On Tue, Mar 27, 2018 at 10:46:58AM -0400, Sinan Kaya wrote:
>> +netdev, +Alex
>>
>> On 3/26/2018 6:00 PM, Benjamin Herrenschmidt wrote:
>> > On Mon, 2018-03-26 at 23:30 +0200, Arnd Bergmann wrote:
>> >>  Most of the drivers have a unwound loop with writeq() or something to
>> >>> do it.
>> >>
>> >> But isn't the writeq() barrier much more expensive than anything you'd
>> >> do in function calls?
>> >
>> > It is for us, and will break any write combining.
>> >
>> > The same document says that _relaxed() does not give that guarentee.
>> >
>> > The lwn articule on this went into some depth on the interaction with
>> > spinlocks.
>> >
>> > As far as I can see, containment in a spinlock seems to be the only
>> > different between writel and writel_relaxed..
>> 
>>  I was always puzzled by this: The intention of _relaxed() on ARM
>>  (where it originates) was to skip the barrier that serializes DMA
>>  with MMIO, not to skip the serialization between MMIO and locks.
>> >>>
>> >>> But that was never a requirement of writel(),
>> >>> Documentation/memory-barriers.txt gives an explicit example demanding
>> >>> the wmb() before writel() for ordering system memory against writel.
>> >
>> > This is a bug in the documentation.
>> >
>> >> Indeed, but it's in an example for when to use dma_wmb(), not wmb().
>> >> Adding Alexander Duyck to Cc, he added that section as part of
>> >> 1077fa36f23e ("arch: Add lightweight memory barriers dma_rmb() and
>> >> dma_wmb()"). Also adding the other people that were involved with that.
>> >
>> > Linus himself made it very clear years ago. readl and writel have to
>> > order vs memory accesses.
>> >
>> >>> I actually have no idea why ARM had that barrier, I always assumed it
>> >>> was to give program ordering to the accesses and that _relaxed allowed
>> >>> re-ordering (the usual meaning of relaxed)..
>> >>>
>> >>> But the barrier document makes it pretty clear that the only
>> >>> difference between the two is spinlock containment, and WillD wrote
>> >>> this text, so I belive it is accurate for ARM.
>> >>>
>> >>> Very confusing.
>> >>
>> >> It does mention serialization with both DMA and locks in the
>> >> section about  readX_relaxed()/writeX_relaxed(). The part
>> >> about DMA is very clear here, and I must have just forgotten
>> >> the exact semantics with regards to spinlocks. I'm still not
>> >> sure what prevents a writel() from leaking out the end of a
>> >> spinlock section that doesn't happen with writel_relaxed(), since
>> >> the barrier in writel() comes before the access, and the
>> >> spin_unlock() shouldn't affect the external buses.
>> >
>> > So...
>> >
>> > Historically, what happened is that we (we means whoever participated
>> > in the discussion on the list with Linus calling the shots really)
>> > decided that there was no sane way for drivers to understand a world
>> > where readl/writel didn't fully order things vs. memory accesses (ie,
>> > DMA).
>> >
>> > So it should always be correct to do:
>> >
>> > - Write to some in-memory buffer
>> > - writel() to kick the DMA read of that buffer
>> >
>> > without any extra barrier.
>> >
>> > The spinlock situation however got murky. Mostly that came up because
>> > on architecture (I forgot who, might have been ia64) has a hard time
>> > providing that consistency without making writel insanely expensive.
>> >
>> > Thus they created mmiowb whose main purpose was precisely to order
>> > writel with a following spin_unlock.
>> >
>> > I decided not to go down that path on power because getting all drivers
>> > "fixed" to do the right thing was going to be a losing battle, and
>> > instead added per-cpu tracking of writel in order to "escalate" to a
>> > heavier barrier in spin_unlock itself when necessary.
>> >
>> > Now, all this happened more than a decade ago and it's possible that
>> > the understanding or expectations "shifted" over time...
>>
>> Alex is raising concerns on the netdev list.
>>
>> Sinan
>> "We are being told that if you use writel(), then you don't need a wmb() on
>> all architectures."
>>
>> Alex:
>> "I'm not sure who told you that but that is incorrect, at least for
>> x86. If you attempt to use writel() without the wmb() we will have to
>> NAK the patches. We will accept the wmb() with writel_releaxed() since
>> that solves things for ARM."
>>
>> > Jason is seeking behavior clarification for write combined buffers.
>>
>> Alex:
>> "Don't bother. I can tell you right now that for x86 you have to have a
>> wmb() before the writel().
>
> To clarify: are you saying that on x86 you need a wmb() prior to a writel
> if you want that writel to be ordered after prior writes to memory? Is this
> specific to WC memory or some other non-standard attribute?

Note, I am not a CPU guy so this is just my interpretation. It is my
understanding that the wmb(), aka sfence, is 

Re: RFC on writel and writel_relaxed

2018-03-27 Thread Jose Abreu
Hi,

On 27-03-2018 15:46, Sinan Kaya wrote:
>
> Sinan
> "We are being told that if you use writel(), then you don't need a wmb() on
> all architectures."
>
> Alex:
> "I'm not sure who told you that but that is incorrect, at least for
> x86. If you attempt to use writel() without the wmb() we will have to
> NAK the patches. We will accept the wmb() with writel_releaxed() since
> that solves things for ARM."
>

So this means we should always use writel() + wmb() in *all*
accesses? I don't know about x86 but arc architecture doesn't
have a wmb() in the writel() function (in some configs).

I see the point in net drivers while you have dma + io accesses
but for most drivers this shouldn't be needed, right?

What about ordering of writes? Is it guaranteed that one write
will happen before the next one ?

Best Regards,
Jose Miguel Abreu



Re: RFC on writel and writel_relaxed

2018-03-27 Thread Will Deacon
Hi Alex,

On Tue, Mar 27, 2018 at 10:46:58AM -0400, Sinan Kaya wrote:
> +netdev, +Alex
> 
> On 3/26/2018 6:00 PM, Benjamin Herrenschmidt wrote:
> > On Mon, 2018-03-26 at 23:30 +0200, Arnd Bergmann wrote:
> >>  Most of the drivers have a unwound loop with writeq() or something to
> >>> do it.
> >>
> >> But isn't the writeq() barrier much more expensive than anything you'd
> >> do in function calls?
> > 
> > It is for us, and will break any write combining.
> > 
> > The same document says that _relaxed() does not give that guarentee.
> >
> > The lwn articule on this went into some depth on the interaction with
> > spinlocks.
> >
> > As far as I can see, containment in a spinlock seems to be the only
> > different between writel and writel_relaxed..
> 
>  I was always puzzled by this: The intention of _relaxed() on ARM
>  (where it originates) was to skip the barrier that serializes DMA
>  with MMIO, not to skip the serialization between MMIO and locks.
> >>>
> >>> But that was never a requirement of writel(),
> >>> Documentation/memory-barriers.txt gives an explicit example demanding
> >>> the wmb() before writel() for ordering system memory against writel.
> > 
> > This is a bug in the documentation.
> > 
> >> Indeed, but it's in an example for when to use dma_wmb(), not wmb().
> >> Adding Alexander Duyck to Cc, he added that section as part of
> >> 1077fa36f23e ("arch: Add lightweight memory barriers dma_rmb() and
> >> dma_wmb()"). Also adding the other people that were involved with that.
> > 
> > Linus himself made it very clear years ago. readl and writel have to
> > order vs memory accesses.
> > 
> >>> I actually have no idea why ARM had that barrier, I always assumed it
> >>> was to give program ordering to the accesses and that _relaxed allowed
> >>> re-ordering (the usual meaning of relaxed)..
> >>>
> >>> But the barrier document makes it pretty clear that the only
> >>> difference between the two is spinlock containment, and WillD wrote
> >>> this text, so I belive it is accurate for ARM.
> >>>
> >>> Very confusing.
> >>
> >> It does mention serialization with both DMA and locks in the
> >> section about  readX_relaxed()/writeX_relaxed(). The part
> >> about DMA is very clear here, and I must have just forgotten
> >> the exact semantics with regards to spinlocks. I'm still not
> >> sure what prevents a writel() from leaking out the end of a
> >> spinlock section that doesn't happen with writel_relaxed(), since
> >> the barrier in writel() comes before the access, and the
> >> spin_unlock() shouldn't affect the external buses.
> > 
> > So...
> > 
> > Historically, what happened is that we (we means whoever participated
> > in the discussion on the list with Linus calling the shots really)
> > decided that there was no sane way for drivers to understand a world
> > where readl/writel didn't fully order things vs. memory accesses (ie,
> > DMA).
> > 
> > So it should always be correct to do:
> > 
> > - Write to some in-memory buffer
> > - writel() to kick the DMA read of that buffer
> > 
> > without any extra barrier.
> > 
> > The spinlock situation however got murky. Mostly that came up because
> > on architecture (I forgot who, might have been ia64) has a hard time
> > providing that consistency without making writel insanely expensive.
> > 
> > Thus they created mmiowb whose main purpose was precisely to order
> > writel with a following spin_unlock.
> > 
> > I decided not to go down that path on power because getting all drivers
> > "fixed" to do the right thing was going to be a losing battle, and
> > instead added per-cpu tracking of writel in order to "escalate" to a
> > heavier barrier in spin_unlock itself when necessary.
> > 
> > Now, all this happened more than a decade ago and it's possible that
> > the understanding or expectations "shifted" over time...
> 
> Alex is raising concerns on the netdev list.
> 
> Sinan
> "We are being told that if you use writel(), then you don't need a wmb() on
> all architectures."
> 
> Alex:
> "I'm not sure who told you that but that is incorrect, at least for
> x86. If you attempt to use writel() without the wmb() we will have to
> NAK the patches. We will accept the wmb() with writel_releaxed() since
> that solves things for ARM."
> 
> > Jason is seeking behavior clarification for write combined buffers.
> 
> Alex:
> "Don't bother. I can tell you right now that for x86 you have to have a
> wmb() before the writel().

To clarify: are you saying that on x86 you need a wmb() prior to a writel
if you want that writel to be ordered after prior writes to memory? Is this
specific to WC memory or some other non-standard attribute?

The only reason we have wmb() inside writel() on arm, arm64 and power is for
parity with x86 because Linus (CC'd) wanted architectures to order I/O vs
memory by default so that it was easier to write portable drivers. The
performance impact of that implicit barrier is non-trivial, but 

Re: RFC on writel and writel_relaxed

2018-03-27 Thread Sinan Kaya
+netdev, +Alex

On 3/26/2018 6:00 PM, Benjamin Herrenschmidt wrote:
> On Mon, 2018-03-26 at 23:30 +0200, Arnd Bergmann wrote:
>>  Most of the drivers have a unwound loop with writeq() or something to
>>> do it.
>>
>> But isn't the writeq() barrier much more expensive than anything you'd
>> do in function calls?
> 
> It is for us, and will break any write combining.
> 
> The same document says that _relaxed() does not give that guarentee.
>
> The lwn articule on this went into some depth on the interaction with
> spinlocks.
>
> As far as I can see, containment in a spinlock seems to be the only
> different between writel and writel_relaxed..

 I was always puzzled by this: The intention of _relaxed() on ARM
 (where it originates) was to skip the barrier that serializes DMA
 with MMIO, not to skip the serialization between MMIO and locks.
>>>
>>> But that was never a requirement of writel(),
>>> Documentation/memory-barriers.txt gives an explicit example demanding
>>> the wmb() before writel() for ordering system memory against writel.
> 
> This is a bug in the documentation.
> 
>> Indeed, but it's in an example for when to use dma_wmb(), not wmb().
>> Adding Alexander Duyck to Cc, he added that section as part of
>> 1077fa36f23e ("arch: Add lightweight memory barriers dma_rmb() and
>> dma_wmb()"). Also adding the other people that were involved with that.
> 
> Linus himself made it very clear years ago. readl and writel have to
> order vs memory accesses.
> 
>>> I actually have no idea why ARM had that barrier, I always assumed it
>>> was to give program ordering to the accesses and that _relaxed allowed
>>> re-ordering (the usual meaning of relaxed)..
>>>
>>> But the barrier document makes it pretty clear that the only
>>> difference between the two is spinlock containment, and WillD wrote
>>> this text, so I belive it is accurate for ARM.
>>>
>>> Very confusing.
>>
>> It does mention serialization with both DMA and locks in the
>> section about  readX_relaxed()/writeX_relaxed(). The part
>> about DMA is very clear here, and I must have just forgotten
>> the exact semantics with regards to spinlocks. I'm still not
>> sure what prevents a writel() from leaking out the end of a
>> spinlock section that doesn't happen with writel_relaxed(), since
>> the barrier in writel() comes before the access, and the
>> spin_unlock() shouldn't affect the external buses.
> 
> So...
> 
> Historically, what happened is that we (we means whoever participated
> in the discussion on the list with Linus calling the shots really)
> decided that there was no sane way for drivers to understand a world
> where readl/writel didn't fully order things vs. memory accesses (ie,
> DMA).
> 
> So it should always be correct to do:
> 
>   - Write to some in-memory buffer
>   - writel() to kick the DMA read of that buffer
> 
> without any extra barrier.
> 
> The spinlock situation however got murky. Mostly that came up because
> on architecture (I forgot who, might have been ia64) has a hard time
> providing that consistency without making writel insanely expensive.
> 
> Thus they created mmiowb whose main purpose was precisely to order
> writel with a following spin_unlock.
> 
> I decided not to go down that path on power because getting all drivers
> "fixed" to do the right thing was going to be a losing battle, and
> instead added per-cpu tracking of writel in order to "escalate" to a
> heavier barrier in spin_unlock itself when necessary.
> 
> Now, all this happened more than a decade ago and it's possible that
> the understanding or expectations "shifted" over time...

Alex is raising concerns on the netdev list.

Sinan
"We are being told that if you use writel(), then you don't need a wmb() on
all architectures."

Alex:
"I'm not sure who told you that but that is incorrect, at least for
x86. If you attempt to use writel() without the wmb() we will have to
NAK the patches. We will accept the wmb() with writel_releaxed() since
that solves things for ARM."

> Jason is seeking behavior clarification for write combined buffers.

Alex:
"Don't bother. I can tell you right now that for x86 you have to have a
wmb() before the writel().

Based on the comment in
(https://www.spinics.net/lists/linux-rdma/msg62666.html):
Replacing wmb() + writel() with wmb() + writel_relaxed() will work on
PPC, it will just not give you a benefit today.

I say the patch set stays. This gives benefit on ARM, and has no
effect on x86 and PowerPC. If you want to look at trying to optimize
things further on PowerPC and such then go for it in terms of trying
to implement the writel_relaxed(). Otherwise I say we call the ARM
goodness a win and don't get ourselves too wrapped up in trying to fix
this for all architectures."



> 
> Cheers,
> Ben.
>  
> 


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code 

Re: RFC on writel and writel_relaxed

2018-03-27 Thread Will Deacon
On Tue, Mar 27, 2018 at 09:46:51AM -0400, Sinan Kaya wrote:
> On 3/27/2018 7:02 AM, Will Deacon wrote:
> > - See Documentation/DMA-API.txt for more information on consistent 
> > memory.
> > + can see it now has ownership.  Note that, when using writel(), a prior
> > + wmb() is not needed to guarantee that the cache coherent memory writes
> > + have completed before writing to the MMIO region.  The cheaper
> > + writel_relaxed() does not provide this guarantee and must not be used
> > + here.
> 
> Can we say the same thing for iowrite32() and iowrite32be(). I also see wmb()
> in front of these.

I don't think so. My reading of memory-barriers.txt says that writeX might
expand to outX, and outX is not ordered with respect to other types of
memory.

Will


Re: RFC on writel and writel_relaxed

2018-03-27 Thread Jason Gunthorpe
On Tue, Mar 27, 2018 at 10:42:00AM +0100, Will Deacon wrote:
> On Tue, Mar 27, 2018 at 09:56:47AM +0200, Arnd Bergmann wrote:
> > On Tue, Mar 27, 2018 at 12:27 AM, Jason Gunthorpe  wrote:
> > > On Tue, Mar 27, 2018 at 09:01:57AM +1100, Benjamin Herrenschmidt wrote:
> > >> On Mon, 2018-03-26 at 17:46 -0400, Sinan Kaya wrote:
> > >
> > > I even see patches adding wmb() based on actual observed memory
> > > corruption during testing on Intel:
> > >
> > > https://patchwork.kernel.org/patch/10177207/
> > >
> > > So you think all of this is unnecessary and writel is totally strongly
> > > ordered, even on multi-socket Intel?
> > 
> > This example adds a wmb() between two writes to a coherent DMA
> > area, it is definitely required there. I'm pretty sure I've never seen
> > any bug reports pointing to a missing wmb() between memory
> > and MMIO write accesses, but if you remember seeing them in the
> > list, maybe you can look again for some evidence of something going
> > wrong on x86 without it?
> 
> If this is just about ordering accesses to coherent DMA, then using
> dma_wmb() instead will be much better performance on arm/arm64.

dma_wmb() is a NOP on x86, it was tested anyhow and didn't help
this case.. Confusing, but probably not relevant to this discussion.

Jason


Re: RFC on writel and writel_relaxed

2018-03-27 Thread Jason Gunthorpe
On Tue, Mar 27, 2018 at 09:56:47AM +0200, Arnd Bergmann wrote:
> On Tue, Mar 27, 2018 at 12:27 AM, Jason Gunthorpe  wrote:
> > On Tue, Mar 27, 2018 at 09:01:57AM +1100, Benjamin Herrenschmidt wrote:
> >> On Mon, 2018-03-26 at 17:46 -0400, Sinan Kaya wrote:
> >
> > I even see patches adding wmb() based on actual observed memory
> > corruption during testing on Intel:
> >
> > https://patchwork.kernel.org/patch/10177207/
> >
> > So you think all of this is unnecessary and writel is totally strongly
> > ordered, even on multi-socket Intel?
> 
> This example adds a wmb() between two writes to a coherent DMA
> area, it is definitely required there.

Ah! So it is, too many things called 'db' in that driver.. One of the
'db' is also WC BAR memory..

Jason


Re: RFC on writel and writel_relaxed

2018-03-27 Thread Jason Gunthorpe
On Tue, Mar 27, 2018 at 08:22:55AM -0400, ok...@codeaurora.org wrote:
> On 2018-03-27 07:23, Benjamin Herrenschmidt wrote:
> >On Tue, 2018-03-27 at 11:44 +0200, Arnd Bergmann wrote:
> >>> The interesting thing is that we do seem to have a whole LOT of these
> >>> spurrious wmb before writel all over the tree, I suspect because of
> >>> that incorrect recommendation in memory-barriers.txt.
> >>>
> >>> We should fix that.
> >>
> >>Maybe the problem is just that it's so counter-intuitive that we don't
> >>need that barrier in Linux, when the hardware does need one on some
> >>architectures.
> >>
> >>How about we define a barrier type instruction specifically for this
> >>purpose, something like wmb_before_mmio() and have all architectures
> >>define that to an empty macro?
> >
> >This is exactly what wmb() is about and exactly what Linux rejected
> >back in the day (and in hindsight I agree with him).
> >
> >>That way, having correct code using wmb_before_mmio() will not
> >>trigger an incorrect review comment that leads to extra wmb(). ;-)
> >
> >Ah, you mean have an empty macro that will always be empty on all
> >architectures just to fool people ? :-)
> >
> >Not sure that will fly ... I think we just need to be documenting that
> >stuff better and not have incorrect examples. Also a sweep to remove
> >some useless ones like the one in e1000e would help.
> 
> I have been converting wmb+writel to wmb+writel_relaxed. (About 30 patches)
>
> I will have to just remove the wmb and keep writel, then repost.

Okay, but before you do that, can we get a statement how this works
for WC?

Some of these writels are to WC memory, do they need the wmb()?!?

Jason


Re: RFC on writel and writel_relaxed

2018-03-27 Thread Sinan Kaya
On 3/27/2018 7:02 AM, Will Deacon wrote:
> - See Documentation/DMA-API.txt for more information on consistent memory.
> + can see it now has ownership.  Note that, when using writel(), a prior
> + wmb() is not needed to guarantee that the cache coherent memory writes
> + have completed before writing to the MMIO region.  The cheaper
> + writel_relaxed() does not provide this guarantee and must not be used
> + here.

Can we say the same thing for iowrite32() and iowrite32be(). I also see wmb()
in front of these.

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


RE: RFC on writel and writel_relaxed

2018-03-27 Thread David Laight
> Fair enough. I'd rather people used _relaxed by default, but I have to admit
> that it will probably just result in them getting things wrong...

Certainly requiring the driver writes use explicit barriers should make
them understand when and why they are needed - and then put in the correct ones.

The problem I've had is that I have a good idea which barriers are needed
but find that readl/writel seem to contain a lot of extra ones.
Maybe the are required in some places, but the extra synchronising
instructions could easily have measureable performance effects on
hot paths.

Drivers are likely to contain sequences like:
read_io
if (...) return
write_mem
...
write_mem
barrier
write_mem
barrier
write_io
for things like ring updates.
Where the 'mem' might actually be in io space.
In such sequences not all the synchronising instructions are needed.
I'm not at all sure it is easy to get the right set.

David






Re: RFC on writel and writel_relaxed

2018-03-27 Thread okaya

On 2018-03-27 07:23, Benjamin Herrenschmidt wrote:

On Tue, 2018-03-27 at 11:44 +0200, Arnd Bergmann wrote:

> The interesting thing is that we do seem to have a whole LOT of these
> spurrious wmb before writel all over the tree, I suspect because of
> that incorrect recommendation in memory-barriers.txt.
>
> We should fix that.

Maybe the problem is just that it's so counter-intuitive that we don't
need that barrier in Linux, when the hardware does need one on some
architectures.

How about we define a barrier type instruction specifically for this
purpose, something like wmb_before_mmio() and have all architectures
define that to an empty macro?


This is exactly what wmb() is about and exactly what Linux rejected
back in the day (and in hindsight I agree with him).


That way, having correct code using wmb_before_mmio() will not
trigger an incorrect review comment that leads to extra wmb(). ;-)


Ah, you mean have an empty macro that will always be empty on all
architectures just to fool people ? :-)

Not sure that will fly ... I think we just need to be documenting that
stuff better and not have incorrect examples. Also a sweep to remove
some useless ones like the one in e1000e would help.


I have been converting wmb+writel to wmb+writel_relaxed. (About 30 
patches)


I will have to just remove the wmb and keep writel, then repost.

Some of these got applied. It will cause some churn for the maintainers.



Cheers,
Ben.


Re: RFC on writel and writel_relaxed

2018-03-27 Thread Benjamin Herrenschmidt
On Tue, 2018-03-27 at 10:57 +0100, Will Deacon wrote:
> > 
> > The interesting thing is that we do seem to have a whole LOT of these
> > spurrious wmb before writel all over the tree, I suspect because of
> > that incorrect recommendation in memory-barriers.txt.
> > 
> > We should fix that.
> 
> Patch below. Thoughts?

Looks good, we should probably also have a clearer (explicit)
definition of that ordering in the driver-api doco.

Now, to remove all those useless wmb's and find what other bugs they
were papering over ... ;-)

Cheers,
Ben.



Re: RFC on writel and writel_relaxed

2018-03-27 Thread Benjamin Herrenschmidt
On Tue, 2018-03-27 at 12:02 +0100, Will Deacon wrote:
>   can see it now has ownership.  Note that, when using writel(), a prior
> >   wmb() is not needed to guarantee that the cache coherent memory writes
> >   have completed before writing to the cache incoherent MMIO region.
> >   The cheaper writel_relaxed() does not guarantee the DMA to be visible
> >   to the device and must not be used here.
> 
> Fair enough. I'd rather people used _relaxed by default, but I have to admit
> that it will probably just result in them getting things wrong. Just a tiny
> bit of wordsmithing brings this to:

I prefer people using writel() by default for the simple reason that
99% of writels out there are configuration stuff for which the
performance difference doesn't matter, and people will just get it
wrong.

Let's focus on the rare fast path for optimisation.
> 
> diff --git a/Documentation/memory-barriers.txt 
> b/Documentation/memory-barriers.txt
> index a863009849a3..3247547d1c36 100644
> --- a/Documentation/memory-barriers.txt
> +++ b/Documentation/memory-barriers.txt
> @@ -1909,9 +1909,6 @@ There are some more advanced barrier functions:
> /* assign ownership */
> desc->status = DEVICE_OWN;
>  
> -   /* force memory to sync before notifying device via MMIO */
> -   wmb();
> -
> /* notify device of new descriptors */
> writel(DESC_NOTIFY, doorbell);
> }
> @@ -1919,11 +1916,15 @@ There are some more advanced barrier functions:
>   The dma_rmb() allows us guarantee the device has released ownership
>   before we read the data from the descriptor, and the dma_wmb() allows
>   us to guarantee the data is written to the descriptor before the device
> - can see it now has ownership.  The wmb() is needed to guarantee that the
> - cache coherent memory writes have completed before attempting a write to
> - the cache incoherent MMIO region.
> -
> - See Documentation/DMA-API.txt for more information on consistent memory.
> + can see it now has ownership.  Note that, when using writel(), a prior
> + wmb() is not needed to guarantee that the cache coherent memory writes
> + have completed before writing to the MMIO region.  The cheaper
> + writel_relaxed() does not provide this guarantee and must not be used
> + here.
> +
> + See the subsection "Kernel I/O barrier effects" for more information on
> + relaxed I/O accessors and the Documentation/DMA-API.txt file for more
> + information on consistent memory.
>  
>  
>  MMIO WRITE BARRIER
> 
> 
> If you're happy with that, I'll send it as a proper patch.
> 
> Cheers,
> 
> Will


Re: RFC on writel and writel_relaxed

2018-03-27 Thread Will Deacon
On Tue, Mar 27, 2018 at 10:20:02PM +1100, Benjamin Herrenschmidt wrote:
> On Tue, 2018-03-27 at 10:42 +0100, Will Deacon wrote:
> > > 
> > > This example adds a wmb() between two writes to a coherent DMA
> > > area, it is definitely required there. I'm pretty sure I've never seen
> > > any bug reports pointing to a missing wmb() between memory
> > > and MMIO write accesses, but if you remember seeing them in the
> > > list, maybe you can look again for some evidence of something going
> > > wrong on x86 without it?
> > 
> > If this is just about ordering accesses to coherent DMA, then using
> > dma_wmb() instead will be much better performance on arm/arm64.
> 
> Ah, something we should look into for powerpc as well, as we could use
> an lwsync for that which is also cheaper than a full sync wmb does.
> 
> dma_wmb() is basically the same as smp_wmb() without the CONFIG_SMP
> conditional right ?

Almost -- the slight change we have on arm64 is to say that it's
"outer-shareable", which means it also orders non-cacheable accesses
in the case that dma_alloc_coherent is used to allocate a consistent
buffer for a non-coherent device.

Will


Re: RFC on writel and writel_relaxed

2018-03-27 Thread Benjamin Herrenschmidt
On Tue, 2018-03-27 at 11:44 +0200, Arnd Bergmann wrote:
> > The interesting thing is that we do seem to have a whole LOT of these
> > spurrious wmb before writel all over the tree, I suspect because of
> > that incorrect recommendation in memory-barriers.txt.
> > 
> > We should fix that.
> 
> Maybe the problem is just that it's so counter-intuitive that we don't
> need that barrier in Linux, when the hardware does need one on some
> architectures.
> 
> How about we define a barrier type instruction specifically for this
> purpose, something like wmb_before_mmio() and have all architectures
> define that to an empty macro?

This is exactly what wmb() is about and exactly what Linux rejected
back in the day (and in hindsight I agree with him).

> That way, having correct code using wmb_before_mmio() will not
> trigger an incorrect review comment that leads to extra wmb(). ;-)

Ah, you mean have an empty macro that will always be empty on all
architectures just to fool people ? :-)

Not sure that will fly ... I think we just need to be documenting that
stuff better and not have incorrect examples. Also a sweep to remove
some useless ones like the one in e1000e would help.

Cheers,
Ben.



Re: RFC on writel and writel_relaxed

2018-03-27 Thread Benjamin Herrenschmidt
On Tue, 2018-03-27 at 10:42 +0100, Will Deacon wrote:
> > 
> > This example adds a wmb() between two writes to a coherent DMA
> > area, it is definitely required there. I'm pretty sure I've never seen
> > any bug reports pointing to a missing wmb() between memory
> > and MMIO write accesses, but if you remember seeing them in the
> > list, maybe you can look again for some evidence of something going
> > wrong on x86 without it?
> 
> If this is just about ordering accesses to coherent DMA, then using
> dma_wmb() instead will be much better performance on arm/arm64.

Ah, something we should look into for powerpc as well, as we could use
an lwsync for that which is also cheaper than a full sync wmb does.

dma_wmb() is basically the same as smp_wmb() without the CONFIG_SMP
conditional right ?

Cheers,
Ben.



Re: RFC on writel and writel_relaxed

2018-03-27 Thread Arnd Bergmann
On Tue, Mar 27, 2018 at 1:02 PM, Will Deacon  wrote:
> On Tue, Mar 27, 2018 at 12:53:49PM +0200, Arnd Bergmann wrote:
>> On Tue, Mar 27, 2018 at 12:09 PM, Will Deacon  wrote:
>> > On Tue, Mar 27, 2018 at 12:05:06PM +0200, Arnd Bergmann wrote:

> diff --git a/Documentation/memory-barriers.txt 
> b/Documentation/memory-barriers.txt
> index a863009849a3..3247547d1c36 100644
> --- a/Documentation/memory-barriers.txt
> +++ b/Documentation/memory-barriers.txt
> @@ -1909,9 +1909,6 @@ There are some more advanced barrier functions:
> /* assign ownership */
> desc->status = DEVICE_OWN;
>
> -   /* force memory to sync before notifying device via MMIO */
> -   wmb();
> -
> /* notify device of new descriptors */
> writel(DESC_NOTIFY, doorbell);
> }
> @@ -1919,11 +1916,15 @@ There are some more advanced barrier functions:
>   The dma_rmb() allows us guarantee the device has released ownership
>   before we read the data from the descriptor, and the dma_wmb() allows
>   us to guarantee the data is written to the descriptor before the device
> - can see it now has ownership.  The wmb() is needed to guarantee that the
> - cache coherent memory writes have completed before attempting a write to
> - the cache incoherent MMIO region.
> -
> - See Documentation/DMA-API.txt for more information on consistent memory.
> + can see it now has ownership.  Note that, when using writel(), a prior
> + wmb() is not needed to guarantee that the cache coherent memory writes
> + have completed before writing to the MMIO region.  The cheaper
> + writel_relaxed() does not provide this guarantee and must not be used
> + here.
> +
> + See the subsection "Kernel I/O barrier effects" for more information on
> + relaxed I/O accessors and the Documentation/DMA-API.txt file for more
> + information on consistent memory.
>
>
>  MMIO WRITE BARRIER
>
>
> If you're happy with that, I'll send it as a proper patch.

Looks good to me, thanks!

   Arnd


Re: RFC on writel and writel_relaxed

2018-03-27 Thread Will Deacon
On Tue, Mar 27, 2018 at 12:53:49PM +0200, Arnd Bergmann wrote:
> On Tue, Mar 27, 2018 at 12:09 PM, Will Deacon  wrote:
> > On Tue, Mar 27, 2018 at 12:05:06PM +0200, Arnd Bergmann wrote:
> >> > -
> >> > - See Documentation/DMA-API.txt for more information on consistent 
> >> > memory.
> >> > + can see it now has ownership.  Note that, when using writel(), a 
> >> > prior
> >> > + wmb() is not needed to guarantee that the cache coherent memory 
> >> > writes
> >> > + have completed before writing to the cache incoherent MMIO region.
> >> > + If this ordering between incoherent MMIO and coherent memory 
> >> > regions
> 
> One more thing: I think the term "incoherent MMIO" is a bit confusing, I'd
> prefer just "MMIO" here. At least I don't have the faintest clue what the
> difference between "coherent MMIO" and "incoherent MMIO" would be ;-)

Yes, you're right. I was just following the terminology that's already used
here, but actually that seems not be used anywhere else in the document!
I'll kill it.

> >> > + is not required, writel_relaxed() can be used instead and is 
> >> > significantly
> >> > + cheaper on some weakly-ordered architectures.
> >>
> >> I think that's a great improvement, but I'm a bit worried about 
> >> recommending
> >> writel_relaxed() too much: I've seen a lot of drivers that just always use
> >> writel_relaxed() over write(), and some of them get that wrong when they
> >> don't understand the difference but end up using DMA without explicit
> >> barriers anyway.
> >>
> >> Also, having an architecture-independent driver use wmb()+writel_relaxed()
> >> ends up being more expensive than just using write(). Not sure how to
> >> best phrase it though.
> >
> > Perhaps I add reword that with a simple example to say:
> >
> >   If this ordering between incoherent MMIO and coherent memory regions
> >   is not required (e.g. in a sequence of accesses all to the MMIO region)
> >   [...]
> >
> > since that seems to be the usual case where the _relaxed accessors help.
> 
> That still doesn't quite capture what I'd like driver writes to do: in essence
> I would recommend them to use writel() all the time, except in performance
> critical code that has been shown to be correct and has a comment to explain
> why _relaxed() is ok in that particular function.
> 
> Maybe it can just be rephrased to warn against the use of writel_relaxed()
> here, and explain the difference that way:
> 
>  can see it now has ownership.  Note that, when using writel(), a prior
>  wmb() is not needed to guarantee that the cache coherent memory writes
>  have completed before writing to the cache incoherent MMIO region.
>  The cheaper writel_relaxed() does not guarantee the DMA to be visible
>  to the device and must not be used here.

Fair enough. I'd rather people used _relaxed by default, but I have to admit
that it will probably just result in them getting things wrong. Just a tiny
bit of wordsmithing brings this to:


diff --git a/Documentation/memory-barriers.txt 
b/Documentation/memory-barriers.txt
index a863009849a3..3247547d1c36 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -1909,9 +1909,6 @@ There are some more advanced barrier functions:
/* assign ownership */
desc->status = DEVICE_OWN;
 
-   /* force memory to sync before notifying device via MMIO */
-   wmb();
-
/* notify device of new descriptors */
writel(DESC_NOTIFY, doorbell);
}
@@ -1919,11 +1916,15 @@ There are some more advanced barrier functions:
  The dma_rmb() allows us guarantee the device has released ownership
  before we read the data from the descriptor, and the dma_wmb() allows
  us to guarantee the data is written to the descriptor before the device
- can see it now has ownership.  The wmb() is needed to guarantee that the
- cache coherent memory writes have completed before attempting a write to
- the cache incoherent MMIO region.
-
- See Documentation/DMA-API.txt for more information on consistent memory.
+ can see it now has ownership.  Note that, when using writel(), a prior
+ wmb() is not needed to guarantee that the cache coherent memory writes
+ have completed before writing to the MMIO region.  The cheaper
+ writel_relaxed() does not provide this guarantee and must not be used
+ here.
+
+ See the subsection "Kernel I/O barrier effects" for more information on
+ relaxed I/O accessors and the Documentation/DMA-API.txt file for more
+ information on consistent memory.
 
 
 MMIO WRITE BARRIER


If you're happy with that, I'll send it as a proper patch.

Cheers,

Will


Re: RFC on writel and writel_relaxed

2018-03-27 Thread Arnd Bergmann
On Tue, Mar 27, 2018 at 12:09 PM, Will Deacon  wrote:
> On Tue, Mar 27, 2018 at 12:05:06PM +0200, Arnd Bergmann wrote:
>> > -
>> > - See Documentation/DMA-API.txt for more information on consistent 
>> > memory.
>> > + can see it now has ownership.  Note that, when using writel(), a 
>> > prior
>> > + wmb() is not needed to guarantee that the cache coherent memory 
>> > writes
>> > + have completed before writing to the cache incoherent MMIO region.
>> > + If this ordering between incoherent MMIO and coherent memory regions

One more thing: I think the term "incoherent MMIO" is a bit confusing, I'd
prefer just "MMIO" here. At least I don't have the faintest clue what the
difference between "coherent MMIO" and "incoherent MMIO" would be ;-)

>> > + is not required, writel_relaxed() can be used instead and is 
>> > significantly
>> > + cheaper on some weakly-ordered architectures.
>>
>> I think that's a great improvement, but I'm a bit worried about recommending
>> writel_relaxed() too much: I've seen a lot of drivers that just always use
>> writel_relaxed() over write(), and some of them get that wrong when they
>> don't understand the difference but end up using DMA without explicit
>> barriers anyway.
>>
>> Also, having an architecture-independent driver use wmb()+writel_relaxed()
>> ends up being more expensive than just using write(). Not sure how to
>> best phrase it though.
>
> Perhaps I add reword that with a simple example to say:
>
>   If this ordering between incoherent MMIO and coherent memory regions
>   is not required (e.g. in a sequence of accesses all to the MMIO region)
>   [...]
>
> since that seems to be the usual case where the _relaxed accessors help.

That still doesn't quite capture what I'd like driver writes to do: in essence
I would recommend them to use writel() all the time, except in performance
critical code that has been shown to be correct and has a comment to explain
why _relaxed() is ok in that particular function.

Maybe it can just be rephrased to warn against the use of writel_relaxed()
here, and explain the difference that way:

 can see it now has ownership.  Note that, when using writel(), a prior
 wmb() is not needed to guarantee that the cache coherent memory writes
 have completed before writing to the cache incoherent MMIO region.
 The cheaper writel_relaxed() does not guarantee the DMA to be visible
 to the device and must not be used here.

   Arnd


Re: RFC on writel and writel_relaxed

2018-03-27 Thread Will Deacon
On Tue, Mar 27, 2018 at 12:05:06PM +0200, Arnd Bergmann wrote:
> On Tue, Mar 27, 2018 at 11:57 AM, Will Deacon  wrote:
> 
> >
> > From db0daeaf94f0f6232f8206fc07a74211324b11d9 Mon Sep 17 00:00:00 2001
> > From: Will Deacon 
> > Date: Tue, 27 Mar 2018 10:49:58 +0100
> > Subject: [PATCH] docs/memory-barriers.txt: Fix broken DMA vs MMIO ordering
> >  example
> >
> > The section of memory-barriers.txt that describes the dma_Xmb() barriers
> > has an incorrect example claiming that a wmb() is required after writing
> > to coherent memory in order for those writes to be visible to a device
> > before a subsequent MMIO access using writel() can reach the device.
> >
> > In fact, this ordering guarantee is provided (at significant cost on some
> > architectures such as arm and power) by writel, so the wmb() is not
> > necessary. writel_relaxed exists for cases where this ordering is not
> > required.
> >
> > Fix the example and update the text to make this clearer.
> >
> > Cc: Benjamin Herrenschmidt 
> > Cc: Arnd Bergmann 
> > Cc: Jason Gunthorpe 
> > Cc: "Paul E. McKenney" 
> > Cc: Peter Zijlstra 
> > Cc: Ingo Molnar 
> > Cc: Jonathan Corbet 
> > Reported-by: Sinan Kaya 
> > Signed-off-by: Will Deacon 
> > ---
> >  Documentation/memory-barriers.txt | 18 ++
> >  1 file changed, 10 insertions(+), 8 deletions(-)
> >
> > diff --git a/Documentation/memory-barriers.txt 
> > b/Documentation/memory-barriers.txt
> > index a863009849a3..2556b4b0e6f9 100644
> > --- a/Documentation/memory-barriers.txt
> > +++ b/Documentation/memory-barriers.txt
> > @@ -1909,9 +1909,6 @@ There are some more advanced barrier functions:
> > /* assign ownership */
> > desc->status = DEVICE_OWN;
> >
> > -   /* force memory to sync before notifying device via MMIO */
> > -   wmb();
> > -
> > /* notify device of new descriptors */
> > writel(DESC_NOTIFY, doorbell);
> > }
> > @@ -1919,11 +1916,16 @@ There are some more advanced barrier functions:
> >   The dma_rmb() allows us guarantee the device has released ownership
> >   before we read the data from the descriptor, and the dma_wmb() allows
> >   us to guarantee the data is written to the descriptor before the 
> > device
> > - can see it now has ownership.  The wmb() is needed to guarantee that 
> > the
> > - cache coherent memory writes have completed before attempting a write 
> > to
> > - the cache incoherent MMIO region.
> > -
> > - See Documentation/DMA-API.txt for more information on consistent 
> > memory.
> > + can see it now has ownership.  Note that, when using writel(), a prior
> > + wmb() is not needed to guarantee that the cache coherent memory writes
> > + have completed before writing to the cache incoherent MMIO region.
> > + If this ordering between incoherent MMIO and coherent memory regions
> > + is not required, writel_relaxed() can be used instead and is 
> > significantly
> > + cheaper on some weakly-ordered architectures.
> 
> I think that's a great improvement, but I'm a bit worried about recommending
> writel_relaxed() too much: I've seen a lot of drivers that just always use
> writel_relaxed() over write(), and some of them get that wrong when they
> don't understand the difference but end up using DMA without explicit
> barriers anyway.
> 
> Also, having an architecture-independent driver use wmb()+writel_relaxed()
> ends up being more expensive than just using write(). Not sure how to
> best phrase it though.

Perhaps I add reword that with a simple example to say:

  If this ordering between incoherent MMIO and coherent memory regions
  is not required (e.g. in a sequence of accesses all to the MMIO region)
  [...]

since that seems to be the usual case where the _relaxed accessors help.

Will


Re: RFC on writel and writel_relaxed

2018-03-27 Thread Arnd Bergmann
On Tue, Mar 27, 2018 at 11:57 AM, Will Deacon  wrote:

>
> From db0daeaf94f0f6232f8206fc07a74211324b11d9 Mon Sep 17 00:00:00 2001
> From: Will Deacon 
> Date: Tue, 27 Mar 2018 10:49:58 +0100
> Subject: [PATCH] docs/memory-barriers.txt: Fix broken DMA vs MMIO ordering
>  example
>
> The section of memory-barriers.txt that describes the dma_Xmb() barriers
> has an incorrect example claiming that a wmb() is required after writing
> to coherent memory in order for those writes to be visible to a device
> before a subsequent MMIO access using writel() can reach the device.
>
> In fact, this ordering guarantee is provided (at significant cost on some
> architectures such as arm and power) by writel, so the wmb() is not
> necessary. writel_relaxed exists for cases where this ordering is not
> required.
>
> Fix the example and update the text to make this clearer.
>
> Cc: Benjamin Herrenschmidt 
> Cc: Arnd Bergmann 
> Cc: Jason Gunthorpe 
> Cc: "Paul E. McKenney" 
> Cc: Peter Zijlstra 
> Cc: Ingo Molnar 
> Cc: Jonathan Corbet 
> Reported-by: Sinan Kaya 
> Signed-off-by: Will Deacon 
> ---
>  Documentation/memory-barriers.txt | 18 ++
>  1 file changed, 10 insertions(+), 8 deletions(-)
>
> diff --git a/Documentation/memory-barriers.txt 
> b/Documentation/memory-barriers.txt
> index a863009849a3..2556b4b0e6f9 100644
> --- a/Documentation/memory-barriers.txt
> +++ b/Documentation/memory-barriers.txt
> @@ -1909,9 +1909,6 @@ There are some more advanced barrier functions:
> /* assign ownership */
> desc->status = DEVICE_OWN;
>
> -   /* force memory to sync before notifying device via MMIO */
> -   wmb();
> -
> /* notify device of new descriptors */
> writel(DESC_NOTIFY, doorbell);
> }
> @@ -1919,11 +1916,16 @@ There are some more advanced barrier functions:
>   The dma_rmb() allows us guarantee the device has released ownership
>   before we read the data from the descriptor, and the dma_wmb() allows
>   us to guarantee the data is written to the descriptor before the device
> - can see it now has ownership.  The wmb() is needed to guarantee that the
> - cache coherent memory writes have completed before attempting a write to
> - the cache incoherent MMIO region.
> -
> - See Documentation/DMA-API.txt for more information on consistent memory.
> + can see it now has ownership.  Note that, when using writel(), a prior
> + wmb() is not needed to guarantee that the cache coherent memory writes
> + have completed before writing to the cache incoherent MMIO region.
> + If this ordering between incoherent MMIO and coherent memory regions
> + is not required, writel_relaxed() can be used instead and is 
> significantly
> + cheaper on some weakly-ordered architectures.

I think that's a great improvement, but I'm a bit worried about recommending
writel_relaxed() too much: I've seen a lot of drivers that just always use
writel_relaxed() over write(), and some of them get that wrong when they
don't understand the difference but end up using DMA without explicit
barriers anyway.

Also, having an architecture-independent driver use wmb()+writel_relaxed()
ends up being more expensive than just using write(). Not sure how to
best phrase it though.

  Arnd


Re: RFC on writel and writel_relaxed

2018-03-27 Thread Will Deacon
On Tue, Mar 27, 2018 at 11:44:22AM +0200, Arnd Bergmann wrote:
> On Tue, Mar 27, 2018 at 10:56 AM, Benjamin Herrenschmidt
>  wrote:
> > On Tue, 2018-03-27 at 09:56 +0200, Arnd Bergmann wrote:
> >> On Tue, Mar 27, 2018 at 12:27 AM, Jason Gunthorpe  wrote:
> >>
> >> I'm pretty sure I've never seen
> >> any bug reports pointing to a missing wmb() between memory
> >> and MMIO write accesses, but if you remember seeing them in the
> >> list, maybe you can look again for some evidence of something going
> >> wrong on x86 without it?
> >
> > The interesting thing is that we do seem to have a whole LOT of these
> > spurrious wmb before writel all over the tree, I suspect because of
> > that incorrect recommendation in memory-barriers.txt.
> >
> > We should fix that.
> 
> Maybe the problem is just that it's so counter-intuitive that we don't
> need that barrier in Linux, when the hardware does need one on some
> architectures.
> 
> How about we define a barrier type instruction specifically for this
> purpose, something like wmb_before_mmio() and have all architectures
> define that to an empty macro?
> 
> That way, having correct code using wmb_before_mmio() will not
> trigger an incorrect review comment that leads to extra wmb(). ;-)

Please don't add more barriers :)! I think that will make it even more
difficult to understand how to use the ones we already have -- the problem
here seems to be that the documentation that was added for the dma_*
barriers got this wrong, but it was at least in contradiction with the
section elsewhere in memory-barriers.txt that describes the relaxed I/O
accessors.

I guess somebody could hack checkpatch to look for back-to-back wmb/writel
sequences? I suspect we could do something with coccinelle too.

Will


Re: RFC on writel and writel_relaxed

2018-03-27 Thread Will Deacon
[+ locking/ordering/docs people]

On Tue, Mar 27, 2018 at 07:56:59PM +1100, Benjamin Herrenschmidt wrote:
> On Tue, 2018-03-27 at 09:56 +0200, Arnd Bergmann wrote:
> > On Tue, Mar 27, 2018 at 12:27 AM, Jason Gunthorpe  wrote:
> > > On Tue, Mar 27, 2018 at 09:01:57AM +1100, Benjamin Herrenschmidt wrote:
> > > > On Mon, 2018-03-26 at 17:46 -0400, Sinan Kaya wrote:
> > > 
> > > I even see patches adding wmb() based on actual observed memory
> > > corruption during testing on Intel:
> > > 
> > > https://patchwork.kernel.org/patch/10177207/
> > > 
> > > So you think all of this is unnecessary and writel is totally strongly
> > > ordered, even on multi-socket Intel?
> > 
> > This example adds a wmb() between two writes to a coherent DMA
> > area, it is definitely required there. 
> 
> Ah you are right, I incorrectly assumed that the "prod_db" function was
> an MMIO. So we do NOT have a counter example where wmb is needed on
> x86, pfiew ! :-)
> 
> > I'm pretty sure I've never seen
> > any bug reports pointing to a missing wmb() between memory
> > and MMIO write accesses, but if you remember seeing them in the
> > list, maybe you can look again for some evidence of something going
> > wrong on x86 without it?
> 
> The interesting thing is that we do seem to have a whole LOT of these
> spurrious wmb before writel all over the tree, I suspect because of
> that incorrect recommendation in memory-barriers.txt.
> 
> We should fix that.

Patch below. Thoughts?

Will

--->8

>From db0daeaf94f0f6232f8206fc07a74211324b11d9 Mon Sep 17 00:00:00 2001
From: Will Deacon 
Date: Tue, 27 Mar 2018 10:49:58 +0100
Subject: [PATCH] docs/memory-barriers.txt: Fix broken DMA vs MMIO ordering
 example

The section of memory-barriers.txt that describes the dma_Xmb() barriers
has an incorrect example claiming that a wmb() is required after writing
to coherent memory in order for those writes to be visible to a device
before a subsequent MMIO access using writel() can reach the device.

In fact, this ordering guarantee is provided (at significant cost on some
architectures such as arm and power) by writel, so the wmb() is not
necessary. writel_relaxed exists for cases where this ordering is not
required.

Fix the example and update the text to make this clearer.

Cc: Benjamin Herrenschmidt 
Cc: Arnd Bergmann 
Cc: Jason Gunthorpe 
Cc: "Paul E. McKenney" 
Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: Jonathan Corbet 
Reported-by: Sinan Kaya 
Signed-off-by: Will Deacon 
---
 Documentation/memory-barriers.txt | 18 ++
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/Documentation/memory-barriers.txt 
b/Documentation/memory-barriers.txt
index a863009849a3..2556b4b0e6f9 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -1909,9 +1909,6 @@ There are some more advanced barrier functions:
/* assign ownership */
desc->status = DEVICE_OWN;
 
-   /* force memory to sync before notifying device via MMIO */
-   wmb();
-
/* notify device of new descriptors */
writel(DESC_NOTIFY, doorbell);
}
@@ -1919,11 +1916,16 @@ There are some more advanced barrier functions:
  The dma_rmb() allows us guarantee the device has released ownership
  before we read the data from the descriptor, and the dma_wmb() allows
  us to guarantee the data is written to the descriptor before the device
- can see it now has ownership.  The wmb() is needed to guarantee that the
- cache coherent memory writes have completed before attempting a write to
- the cache incoherent MMIO region.
-
- See Documentation/DMA-API.txt for more information on consistent memory.
+ can see it now has ownership.  Note that, when using writel(), a prior
+ wmb() is not needed to guarantee that the cache coherent memory writes
+ have completed before writing to the cache incoherent MMIO region.
+ If this ordering between incoherent MMIO and coherent memory regions
+ is not required, writel_relaxed() can be used instead and is significantly
+ cheaper on some weakly-ordered architectures.
+
+ See the subsection "Kernel I/O barrier effects" for more information on
+ relaxed I/O accessors and the Documentation/DMA-API.txt file for more
+ information on consistent memory.
 
 
 MMIO WRITE BARRIER
-- 
2.1.4



Re: RFC on writel and writel_relaxed

2018-03-27 Thread Arnd Bergmann
On Tue, Mar 27, 2018 at 10:56 AM, Benjamin Herrenschmidt
 wrote:
> On Tue, 2018-03-27 at 09:56 +0200, Arnd Bergmann wrote:
>> On Tue, Mar 27, 2018 at 12:27 AM, Jason Gunthorpe  wrote:
>>
>> I'm pretty sure I've never seen
>> any bug reports pointing to a missing wmb() between memory
>> and MMIO write accesses, but if you remember seeing them in the
>> list, maybe you can look again for some evidence of something going
>> wrong on x86 without it?
>
> The interesting thing is that we do seem to have a whole LOT of these
> spurrious wmb before writel all over the tree, I suspect because of
> that incorrect recommendation in memory-barriers.txt.
>
> We should fix that.

Maybe the problem is just that it's so counter-intuitive that we don't
need that barrier in Linux, when the hardware does need one on some
architectures.

How about we define a barrier type instruction specifically for this
purpose, something like wmb_before_mmio() and have all architectures
define that to an empty macro?

That way, having correct code using wmb_before_mmio() will not
trigger an incorrect review comment that leads to extra wmb(). ;-)

   Arnd


Re: RFC on writel and writel_relaxed

2018-03-27 Thread Will Deacon
On Tue, Mar 27, 2018 at 09:56:47AM +0200, Arnd Bergmann wrote:
> On Tue, Mar 27, 2018 at 12:27 AM, Jason Gunthorpe  wrote:
> > On Tue, Mar 27, 2018 at 09:01:57AM +1100, Benjamin Herrenschmidt wrote:
> >> On Mon, 2018-03-26 at 17:46 -0400, Sinan Kaya wrote:
> >
> > I even see patches adding wmb() based on actual observed memory
> > corruption during testing on Intel:
> >
> > https://patchwork.kernel.org/patch/10177207/
> >
> > So you think all of this is unnecessary and writel is totally strongly
> > ordered, even on multi-socket Intel?
> 
> This example adds a wmb() between two writes to a coherent DMA
> area, it is definitely required there. I'm pretty sure I've never seen
> any bug reports pointing to a missing wmb() between memory
> and MMIO write accesses, but if you remember seeing them in the
> list, maybe you can look again for some evidence of something going
> wrong on x86 without it?

If this is just about ordering accesses to coherent DMA, then using
dma_wmb() instead will be much better performance on arm/arm64.

Will


Re: RFC on writel and writel_relaxed

2018-03-27 Thread Benjamin Herrenschmidt
On Tue, 2018-03-27 at 09:56 +0200, Arnd Bergmann wrote:
> On Tue, Mar 27, 2018 at 12:27 AM, Jason Gunthorpe  wrote:
> > On Tue, Mar 27, 2018 at 09:01:57AM +1100, Benjamin Herrenschmidt wrote:
> > > On Mon, 2018-03-26 at 17:46 -0400, Sinan Kaya wrote:
> > 
> > I even see patches adding wmb() based on actual observed memory
> > corruption during testing on Intel:
> > 
> > https://patchwork.kernel.org/patch/10177207/
> > 
> > So you think all of this is unnecessary and writel is totally strongly
> > ordered, even on multi-socket Intel?
> 
> This example adds a wmb() between two writes to a coherent DMA
> area, it is definitely required there. 

Ah you are right, I incorrectly assumed that the "prod_db" function was
an MMIO. So we do NOT have a counter example where wmb is needed on
x86, pfiew ! :-)

> I'm pretty sure I've never seen
> any bug reports pointing to a missing wmb() between memory
> and MMIO write accesses, but if you remember seeing them in the
> list, maybe you can look again for some evidence of something going
> wrong on x86 without it?

The interesting thing is that we do seem to have a whole LOT of these
spurrious wmb before writel all over the tree, I suspect because of
that incorrect recommendation in memory-barriers.txt.

We should fix that.

Cheers,
Ben.



Re: RFC on writel and writel_relaxed

2018-03-27 Thread Arnd Bergmann
On Tue, Mar 27, 2018 at 12:27 AM, Jason Gunthorpe  wrote:
> On Tue, Mar 27, 2018 at 09:01:57AM +1100, Benjamin Herrenschmidt wrote:
>> On Mon, 2018-03-26 at 17:46 -0400, Sinan Kaya wrote:
>
> I even see patches adding wmb() based on actual observed memory
> corruption during testing on Intel:
>
> https://patchwork.kernel.org/patch/10177207/
>
> So you think all of this is unnecessary and writel is totally strongly
> ordered, even on multi-socket Intel?

This example adds a wmb() between two writes to a coherent DMA
area, it is definitely required there. I'm pretty sure I've never seen
any bug reports pointing to a missing wmb() between memory
and MMIO write accesses, but if you remember seeing them in the
list, maybe you can look again for some evidence of something going
wrong on x86 without it?

   Arnd


Re: RFC on writel and writel_relaxed

2018-03-26 Thread Jason Gunthorpe
On Tue, Mar 27, 2018 at 10:59:40AM +1100, Benjamin Herrenschmidt wrote:
> On Mon, 2018-03-26 at 16:50 -0600, Jason Gunthorpe wrote:
> > On Tue, Mar 27, 2018 at 09:36:11AM +1100, Benjamin Herrenschmidt wrote:
> > > On Mon, 2018-03-26 at 16:27 -0600, Jason Gunthorpe wrote:
> > > > > Otherwise almost all drivers out there are broken which I very much
> > > > > doubt :-)
> > > > 
> > > > But.. Sinan is right, you look anywhere in the driver tree and you
> > > > find stuff like this:
> > > > 
> > > > drivers/net/ethernet/intel/i40e/i40e_txrx.c
> > > > 
> > > > /* Force memory writes to complete before letting h/w
> > > >  * know there are new descriptors to fetch.
> > > >  */
> > > > wmb();
> > > > 
> > > > 
> > > > It is *systemic*
> > > 
> > > Yes, because they all copied e1000e :-) If you look at the comment in
> > > there, it does say it's only for weakly ordered archs such as ia64, and
> > > even then, probably predates Linus strong statement on the matter.
> > 
> > Hahah, sure I'll buy that..
> > 
> > But still, if this really is the case, a *strong* statement in
> > barriers.txt to that effect (and not an example demanding the wmb()!)
> > would be very helpful for those of us that have to review driver code!
> 
> I agree, and that Mellanox bug you pointed me to seems to indicate that
> this may not even be true on x86 anymore ...

However, with bugs like that it is hard to know what is going on.. It
could be a CPU bug instead.

> I think we might need to revisit this properly...

I would love to hear a definitive statement from Intel on what wmb();
writel(); does on x86..

Sinan's patches are backwards if writel is ordered, instead of
using writel_relaxed, they should be eliminating the wmb().

But there is no way patches like that could go ahead until
barriers.txt is updated..

Jason


Re: RFC on writel and writel_relaxed

2018-03-26 Thread Benjamin Herrenschmidt
On Mon, 2018-03-26 at 16:50 -0600, Jason Gunthorpe wrote:
> On Tue, Mar 27, 2018 at 09:36:11AM +1100, Benjamin Herrenschmidt wrote:
> > On Mon, 2018-03-26 at 16:27 -0600, Jason Gunthorpe wrote:
> > > > Otherwise almost all drivers out there are broken which I very much
> > > > doubt :-)
> > > 
> > > But.. Sinan is right, you look anywhere in the driver tree and you
> > > find stuff like this:
> > > 
> > > drivers/net/ethernet/intel/i40e/i40e_txrx.c
> > > 
> > > /* Force memory writes to complete before letting h/w
> > >  * know there are new descriptors to fetch.
> > >  */
> > > wmb();
> > > 
> > > 
> > > It is *systemic*
> > 
> > Yes, because they all copied e1000e :-) If you look at the comment in
> > there, it does say it's only for weakly ordered archs such as ia64, and
> > even then, probably predates Linus strong statement on the matter.
> 
> Hahah, sure I'll buy that..
> 
> But still, if this really is the case, a *strong* statement in
> barriers.txt to that effect (and not an example demanding the wmb()!)
> would be very helpful for those of us that have to review driver code!

I agree, and that Mellanox bug you pointed me to seems to indicate that
this may not even be true on x86 anymore ...

I think we might need to revisit this properly...

Cheers,
Ben.


Re: RFC on writel and writel_relaxed

2018-03-26 Thread Jason Gunthorpe
On Tue, Mar 27, 2018 at 09:36:11AM +1100, Benjamin Herrenschmidt wrote:
> On Mon, 2018-03-26 at 16:27 -0600, Jason Gunthorpe wrote:
> > > Otherwise almost all drivers out there are broken which I very much
> > > doubt :-)
> > 
> > But.. Sinan is right, you look anywhere in the driver tree and you
> > find stuff like this:
> > 
> > drivers/net/ethernet/intel/i40e/i40e_txrx.c
> > 
> > /* Force memory writes to complete before letting h/w
> >  * know there are new descriptors to fetch.
> >  */
> > wmb();
> > 
> > 
> > It is *systemic*
> 
> Yes, because they all copied e1000e :-) If you look at the comment in
> there, it does say it's only for weakly ordered archs such as ia64, and
> even then, probably predates Linus strong statement on the matter.

Hahah, sure I'll buy that..

But still, if this really is the case, a *strong* statement in
barriers.txt to that effect (and not an example demanding the wmb()!)
would be very helpful for those of us that have to review driver code!

Jason


Re: RFC on writel and writel_relaxed

2018-03-26 Thread Benjamin Herrenschmidt
On Tue, 2018-03-27 at 09:36 +1100, Benjamin Herrenschmidt wrote:
> I don't kow, it used to be the case, at least that's what drove us to
> define things the way we did.
> 
> Maybe things changed, but if that's the case, nobody knows for sure,
> and we probably want to get Linus POV on the matter.
> 
> I know I still write drivers that do not add a wmb in that case because
> I expect things to work without it.
> 
> If that has changed, we probably can relax some of the barriers in our
> implementations of writel on a number of architectures, but not before
> auditing a bunch more drivers to make sure they have the write wmb()'s

Note also that this was the entire point behind the definition of
the _relaxed() accessors, to lift that specific ordering guarantee.

If you now says that memory + writel requires a wmb() in between
then you made writel be identical to writel_relaxed.

You might notice that Documentation/driver-api/device-io.rst makes no
mention of wmb() at all.

Cheers,
Ben.



Re: RFC on writel and writel_relaxed

2018-03-26 Thread Benjamin Herrenschmidt
On Mon, 2018-03-26 at 16:27 -0600, Jason Gunthorpe wrote:
> > Otherwise almost all drivers out there are broken which I very much
> > doubt :-)
> 
> But.. Sinan is right, you look anywhere in the driver tree and you
> find stuff like this:
> 
> drivers/net/ethernet/intel/i40e/i40e_txrx.c
> 
> /* Force memory writes to complete before letting h/w
>  * know there are new descriptors to fetch.
>  */
> wmb();
> 
> 
> It is *systemic*

Yes, because they all copied e1000e :-) If you look at the comment in
there, it does say it's only for weakly ordered archs such as ia64, and
even then, probably predates Linus strong statement on the matter.

> I even see patches adding wmb() based on actual observed memory
> corruption during testing on Intel:
> 
> https://patchwork.kernel.org/patch/10177207/
> 
> So you think all of this is unnecessary and writel is totally strongly
> ordered, even on multi-socket Intel?

I don't kow, it used to be the case, at least that's what drove us to
define things the way we did.

Maybe things changed, but if that's the case, nobody knows for sure,
and we probably want to get Linus POV on the matter.

I know I still write drivers that do not add a wmb in that case because
I expect things to work without it.

If that has changed, we probably can relax some of the barriers in our
implementations of writel on a number of architectures, but not before
auditing a bunch more drivers to make sure they have the write wmb()'s

Cheers,
Ben.



Re: RFC on writel and writel_relaxed

2018-03-26 Thread Benjamin Herrenschmidt
On Mon, 2018-03-26 at 18:08 -0400, Sinan Kaya wrote:
> On 3/26/2018 6:01 PM, Benjamin Herrenschmidt wrote:
> > On Mon, 2018-03-26 at 17:46 -0400, Sinan Kaya wrote:
> > > On 3/26/2018 5:30 PM, Arnd Bergmann wrote:
> > > > > But that was never a requirement of writel(),
> > > > > Documentation/memory-barriers.txt gives an explicit example demanding
> > > > > the wmb() before writel() for ordering system memory against writel.
> > > > 
> > > > Indeed, but it's in an example for when to use dma_wmb(), not wmb().
> > > > Adding Alexander Duyck to Cc, he added that section as part of
> > > > 1077fa36f23e ("arch: Add lightweight memory barriers dma_rmb() and
> > > > dma_wmb()"). Also adding the other people that were involved with that.
> > > > 
> > > 
> > > ARM developers can get away with not including wmb() in their code and use
> > > writel() to observe memory writes due to implicit barriers.
> > > 
> > > However, same code will not work on Intel.
> > 
> > Wrong. It will.
> > 
> > You do NOT need wmb between writes to memory and writel.
> 
> If writel() provides such a guarantee, why do I see code sequences like
> 
> wmb()
> writel() 
> 
> all over the place.

Because it was badly documented and people didn't know what to do, or
maybe the underlying mapping is WC ?

I don't know for sure but I can tell you Linus opinion on the matter
back in the days was very clear and that's why we implemented writel
the way we did on powerpc.

> > 
> > > writel() has a compiler barrier in it for x86.
> > > wmb() has a sync operation in it for x86. 
> > > 
> > > Unless wmb() is called, PCIe device won't observe memory updates from the 
> > > CPU.
> > 
> > This is completely wrong. They will. Intel provides the necessary
> > ordering guarantees without an explicit wmb.
> > 
> 
> I'm still reserving my doubts here. I was told about an explicit
> wmb() requirement last week.

By whome ?

> > Otherwise almost all drivers out there are broken which I very much
> > doubt :-)
> > 
> > Cheers,
> > Ben.
> > 
> > 
> > 
> 
> 


Re: RFC on writel and writel_relaxed

2018-03-26 Thread Jason Gunthorpe
On Tue, Mar 27, 2018 at 09:01:57AM +1100, Benjamin Herrenschmidt wrote:
> On Mon, 2018-03-26 at 17:46 -0400, Sinan Kaya wrote:
> > On 3/26/2018 5:30 PM, Arnd Bergmann wrote:
> > > > But that was never a requirement of writel(),
> > > > Documentation/memory-barriers.txt gives an explicit example demanding
> > > > the wmb() before writel() for ordering system memory against writel.
> > > 
> > > Indeed, but it's in an example for when to use dma_wmb(), not wmb().
> > > Adding Alexander Duyck to Cc, he added that section as part of
> > > 1077fa36f23e ("arch: Add lightweight memory barriers dma_rmb() and
> > > dma_wmb()"). Also adding the other people that were involved with that.
> > > 
> > 
> > ARM developers can get away with not including wmb() in their code and use
> > writel() to observe memory writes due to implicit barriers.
> > 
> > However, same code will not work on Intel.
> 
> Wrong. It will.
> 
> You do NOT need wmb between writes to memory and writel.
> 
> > writel() has a compiler barrier in it for x86.
> > wmb() has a sync operation in it for x86. 
> > 
> > Unless wmb() is called, PCIe device won't observe memory updates from the 
> > CPU.
> 
> This is completely wrong. They will. Intel provides the necessary
> ordering guarantees without an explicit wmb.
> 
> Otherwise almost all drivers out there are broken which I very much
> doubt :-)

But.. Sinan is right, you look anywhere in the driver tree and you
find stuff like this:

drivers/net/ethernet/intel/i40e/i40e_txrx.c

/* Force memory writes to complete before letting h/w
 * know there are new descriptors to fetch.
 */
wmb();


It is *systemic*

I even see patches adding wmb() based on actual observed memory
corruption during testing on Intel:

https://patchwork.kernel.org/patch/10177207/

So you think all of this is unnecessary and writel is totally strongly
ordered, even on multi-socket Intel?

Jason


Re: RFC on writel and writel_relaxed

2018-03-26 Thread Sinan Kaya
On 3/26/2018 6:01 PM, Benjamin Herrenschmidt wrote:
> On Mon, 2018-03-26 at 17:46 -0400, Sinan Kaya wrote:
>> On 3/26/2018 5:30 PM, Arnd Bergmann wrote:
 But that was never a requirement of writel(),
 Documentation/memory-barriers.txt gives an explicit example demanding
 the wmb() before writel() for ordering system memory against writel.
>>>
>>> Indeed, but it's in an example for when to use dma_wmb(), not wmb().
>>> Adding Alexander Duyck to Cc, he added that section as part of
>>> 1077fa36f23e ("arch: Add lightweight memory barriers dma_rmb() and
>>> dma_wmb()"). Also adding the other people that were involved with that.
>>>
>>
>> ARM developers can get away with not including wmb() in their code and use
>> writel() to observe memory writes due to implicit barriers.
>>
>> However, same code will not work on Intel.
> 
> Wrong. It will.
> 
> You do NOT need wmb between writes to memory and writel.

If writel() provides such a guarantee, why do I see code sequences like

wmb()
writel() 

all over the place.

> 
>> writel() has a compiler barrier in it for x86.
>> wmb() has a sync operation in it for x86. 
>>
>> Unless wmb() is called, PCIe device won't observe memory updates from the 
>> CPU.
> 
> This is completely wrong. They will. Intel provides the necessary
> ordering guarantees without an explicit wmb.
> 

I'm still reserving my doubts here. I was told about an explicit
wmb() requirement last week.

> Otherwise almost all drivers out there are broken which I very much
> doubt :-)
> 
> Cheers,
> Ben.
> 
> 
> 


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: RFC on writel and writel_relaxed

2018-03-26 Thread Benjamin Herrenschmidt
On Mon, 2018-03-26 at 17:46 -0400, Sinan Kaya wrote:
> On 3/26/2018 5:30 PM, Arnd Bergmann wrote:
> > > But that was never a requirement of writel(),
> > > Documentation/memory-barriers.txt gives an explicit example demanding
> > > the wmb() before writel() for ordering system memory against writel.
> > 
> > Indeed, but it's in an example for when to use dma_wmb(), not wmb().
> > Adding Alexander Duyck to Cc, he added that section as part of
> > 1077fa36f23e ("arch: Add lightweight memory barriers dma_rmb() and
> > dma_wmb()"). Also adding the other people that were involved with that.
> > 
> 
> ARM developers can get away with not including wmb() in their code and use
> writel() to observe memory writes due to implicit barriers.
> 
> However, same code will not work on Intel.

Wrong. It will.

You do NOT need wmb between writes to memory and writel.

> writel() has a compiler barrier in it for x86.
> wmb() has a sync operation in it for x86. 
> 
> Unless wmb() is called, PCIe device won't observe memory updates from the CPU.

This is completely wrong. They will. Intel provides the necessary
ordering guarantees without an explicit wmb.

Otherwise almost all drivers out there are broken which I very much
doubt :-)

Cheers,
Ben.




Re: RFC on writel and writel_relaxed

2018-03-26 Thread Benjamin Herrenschmidt
On Mon, 2018-03-26 at 23:30 +0200, Arnd Bergmann wrote:
>  Most of the drivers have a unwound loop with writeq() or something to
> > do it.
> 
> But isn't the writeq() barrier much more expensive than anything you'd
> do in function calls?

It is for us, and will break any write combining.

> > > > The same document says that _relaxed() does not give that guarentee.
> > > > 
> > > > The lwn articule on this went into some depth on the interaction with
> > > > spinlocks.
> > > > 
> > > > As far as I can see, containment in a spinlock seems to be the only
> > > > different between writel and writel_relaxed..
> > > 
> > > I was always puzzled by this: The intention of _relaxed() on ARM
> > > (where it originates) was to skip the barrier that serializes DMA
> > > with MMIO, not to skip the serialization between MMIO and locks.
> > 
> > But that was never a requirement of writel(),
> > Documentation/memory-barriers.txt gives an explicit example demanding
> > the wmb() before writel() for ordering system memory against writel.

This is a bug in the documentation.

> Indeed, but it's in an example for when to use dma_wmb(), not wmb().
> Adding Alexander Duyck to Cc, he added that section as part of
> 1077fa36f23e ("arch: Add lightweight memory barriers dma_rmb() and
> dma_wmb()"). Also adding the other people that were involved with that.

Linus himself made it very clear years ago. readl and writel have to
order vs memory accesses.

> > I actually have no idea why ARM had that barrier, I always assumed it
> > was to give program ordering to the accesses and that _relaxed allowed
> > re-ordering (the usual meaning of relaxed)..
> > 
> > But the barrier document makes it pretty clear that the only
> > difference between the two is spinlock containment, and WillD wrote
> > this text, so I belive it is accurate for ARM.
> > 
> > Very confusing.
> 
> It does mention serialization with both DMA and locks in the
> section about  readX_relaxed()/writeX_relaxed(). The part
> about DMA is very clear here, and I must have just forgotten
> the exact semantics with regards to spinlocks. I'm still not
> sure what prevents a writel() from leaking out the end of a
> spinlock section that doesn't happen with writel_relaxed(), since
> the barrier in writel() comes before the access, and the
> spin_unlock() shouldn't affect the external buses.

So...

Historically, what happened is that we (we means whoever participated
in the discussion on the list with Linus calling the shots really)
decided that there was no sane way for drivers to understand a world
where readl/writel didn't fully order things vs. memory accesses (ie,
DMA).

So it should always be correct to do:

- Write to some in-memory buffer
- writel() to kick the DMA read of that buffer

without any extra barrier.

The spinlock situation however got murky. Mostly that came up because
on architecture (I forgot who, might have been ia64) has a hard time
providing that consistency without making writel insanely expensive.

Thus they created mmiowb whose main purpose was precisely to order
writel with a following spin_unlock.

I decided not to go down that path on power because getting all drivers
"fixed" to do the right thing was going to be a losing battle, and
instead added per-cpu tracking of writel in order to "escalate" to a
heavier barrier in spin_unlock itself when necessary.

Now, all this happened more than a decade ago and it's possible that
the understanding or expectations "shifted" over time...

Cheers,
Ben.
 


Re: RFC on writel and writel_relaxed

2018-03-26 Thread Sinan Kaya
On 3/26/2018 5:30 PM, Arnd Bergmann wrote:
>> But that was never a requirement of writel(),
>> Documentation/memory-barriers.txt gives an explicit example demanding
>> the wmb() before writel() for ordering system memory against writel.
> Indeed, but it's in an example for when to use dma_wmb(), not wmb().
> Adding Alexander Duyck to Cc, he added that section as part of
> 1077fa36f23e ("arch: Add lightweight memory barriers dma_rmb() and
> dma_wmb()"). Also adding the other people that were involved with that.
> 

ARM developers can get away with not including wmb() in their code and use
writel() to observe memory writes due to implicit barriers.

However, same code will not work on Intel.

writel() has a compiler barrier in it for x86.
wmb() has a sync operation in it for x86. 

Unless wmb() is called, PCIe device won't observe memory updates from the CPU.

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: RFC on writel and writel_relaxed

2018-03-26 Thread Benjamin Herrenschmidt
On Mon, 2018-03-26 at 10:54 -0600, Jason Gunthorpe wrote:
> On Mon, Mar 26, 2018 at 11:08:45AM +, David Laight wrote:
> > > > This is a super performance critical operation for most drivers and
> > > > directly impacts network performance.
> > 
> > Perhaps there ought to be writel_nobarrier() (etc) that never contain
> > any barriers at all.
> > This might mean that they are always just the memory operation,
> > but it would make it more obvious what the driver was doing.
> 
> I think that is what writel_relaxed is supposed to be.
> 
> The only restriction it has is that the writes to a single device
> using UC memory must be kept in program order..

Which requires barriers on some architectures :-)

Also we don't have a clear definition of what happens on WC memory.

Cheers,
Ben.



Re: RFC on writel and writel_relaxed

2018-03-26 Thread Arnd Bergmann
On Mon, Mar 26, 2018 at 11:09 PM, Jason Gunthorpe  wrote:
> On Mon, Mar 26, 2018 at 10:43:43PM +0200, Arnd Bergmann wrote:
>> On Mon, Mar 26, 2018 at 10:25 PM, Jason Gunthorpe  wrote:
>> > On Mon, Mar 26, 2018 at 09:44:15PM +0200, Arnd Bergmann wrote:
>> >> On Mon, Mar 26, 2018 at 6:54 PM, Jason Gunthorpe  wrote:
>> >> > On Mon, Mar 26, 2018 at 11:08:45AM +, David Laight wrote:
>> >> >> > > This is a super performance critical operation for most drivers and
>> >> >> > > directly impacts network performance.
>> >> >>
>> >> >> Perhaps there ought to be writel_nobarrier() (etc) that never contain
>> >> >> any barriers at all.
>> >> >> This might mean that they are always just the memory operation,
>> >> >> but it would make it more obvious what the driver was doing.
>> >> >
>> >> > I think that is what writel_relaxed is supposed to be.
>> >> >
>> >> > The only restriction it has is that the writes to a single device
>> >> > using UC memory must be kept in program order..
>> >>
>> >> Not sure about whether we have ever defined what happens to
>> >> writel_relaxed() on WC memory though: On ARM, we disallow
>> >> the compiler to combine writes, but the CPU still might.
>> >
>> > If the driver uses WC memory then I think it should not expect
>> > anything in terms of how writes map to TLPs other than nothing
>> > combines across mmiowb() and mmiowb() is fully globally ordered when
>> > enclosed in a spinlock.
>> >
>> > The entire point of using WC memory is usually to get combining :) If
>> > the driver doesn't want that then it should map UC..
>>
>> Usually, WC memory is used with memcpy_toio() though, which
>> by definition doesn't have any barriers between accesses, and
>> is required to get the correct byte ordering on writes to memory buffers.
>
> memcpy_toio is too expensive to actually use for anything performance
> though. It is too pessimistic. What the drivers usually want is a
> unwound block of 4 or 8 8-byte copies. No function calls, no
> branching. Everything is already known to be aligned.
>
> Most of the drivers have a unwound loop with writeq() or something to
> do it.

But isn't the writeq() barrier much more expensive than anything you'd
do in function calls?

>> > The same document says that _relaxed() does not give that guarentee.
>> >
>> > The lwn articule on this went into some depth on the interaction with
>> > spinlocks.
>> >
>> > As far as I can see, containment in a spinlock seems to be the only
>> > different between writel and writel_relaxed..
>>
>> I was always puzzled by this: The intention of _relaxed() on ARM
>> (where it originates) was to skip the barrier that serializes DMA
>> with MMIO, not to skip the serialization between MMIO and locks.
>
> But that was never a requirement of writel(),
> Documentation/memory-barriers.txt gives an explicit example demanding
> the wmb() before writel() for ordering system memory against writel.

Indeed, but it's in an example for when to use dma_wmb(), not wmb().
Adding Alexander Duyck to Cc, he added that section as part of
1077fa36f23e ("arch: Add lightweight memory barriers dma_rmb() and
dma_wmb()"). Also adding the other people that were involved with that.

> I actually have no idea why ARM had that barrier, I always assumed it
> was to give program ordering to the accesses and that _relaxed allowed
> re-ordering (the usual meaning of relaxed)..
>
> But the barrier document makes it pretty clear that the only
> difference between the two is spinlock containment, and WillD wrote
> this text, so I belive it is accurate for ARM.
>
> Very confusing.

It does mention serialization with both DMA and locks in the
section about  readX_relaxed()/writeX_relaxed(). The part
about DMA is very clear here, and I must have just forgotten
the exact semantics with regards to spinlocks. I'm still not
sure what prevents a writel() from leaking out the end of a
spinlock section that doesn't happen with writel_relaxed(), since
the barrier in writel() comes before the access, and the
spin_unlock() shouldn't affect the external buses.

  Arnd


Re: RFC on writel and writel_relaxed

2018-03-26 Thread Jason Gunthorpe
On Mon, Mar 26, 2018 at 10:43:43PM +0200, Arnd Bergmann wrote:
> On Mon, Mar 26, 2018 at 10:25 PM, Jason Gunthorpe  wrote:
> > On Mon, Mar 26, 2018 at 09:44:15PM +0200, Arnd Bergmann wrote:
> >> On Mon, Mar 26, 2018 at 6:54 PM, Jason Gunthorpe  wrote:
> >> > On Mon, Mar 26, 2018 at 11:08:45AM +, David Laight wrote:
> >> >> > > This is a super performance critical operation for most drivers and
> >> >> > > directly impacts network performance.
> >> >>
> >> >> Perhaps there ought to be writel_nobarrier() (etc) that never contain
> >> >> any barriers at all.
> >> >> This might mean that they are always just the memory operation,
> >> >> but it would make it more obvious what the driver was doing.
> >> >
> >> > I think that is what writel_relaxed is supposed to be.
> >> >
> >> > The only restriction it has is that the writes to a single device
> >> > using UC memory must be kept in program order..
> >>
> >> Not sure about whether we have ever defined what happens to
> >> writel_relaxed() on WC memory though: On ARM, we disallow
> >> the compiler to combine writes, but the CPU still might.
> >
> > If the driver uses WC memory then I think it should not expect
> > anything in terms of how writes map to TLPs other than nothing
> > combines across mmiowb() and mmiowb() is fully globally ordered when
> > enclosed in a spinlock.
> >
> > The entire point of using WC memory is usually to get combining :) If
> > the driver doesn't want that then it should map UC..
> 
> Usually, WC memory is used with memcpy_toio() though, which
> by definition doesn't have any barriers between accesses, and
> is required to get the correct byte ordering on writes to memory buffers.

memcpy_toio is too expensive to actually use for anything performance
though. It is too pessimistic. What the drivers usually want is a
unwound block of 4 or 8 8-byte copies. No function calls, no
branching. Everything is already known to be aligned.

Most of the drivers have a unwound loop with writeq() or something to
do it.

> > The same document says that _relaxed() does not give that guarentee.
> >
> > The lwn articule on this went into some depth on the interaction with
> > spinlocks.
> >
> > As far as I can see, containment in a spinlock seems to be the only
> > different between writel and writel_relaxed..
> 
> I was always puzzled by this: The intention of _relaxed() on ARM
> (where it originates) was to skip the barrier that serializes DMA
> with MMIO, not to skip the serialization between MMIO and locks.

But that was never a requirement of writel(),
Documentation/memory-barriers.txt gives an explicit example demanding
the wmb() before writel() for ordering system memory against writel.

I actually have no idea why ARM had that barrier, I always assumed it
was to give program ordering to the accesses and that _relaxed allowed
re-ordering (the usual meaning of relaxed)..

But the barrier document makes it pretty clear that the only
difference between the two is spinlock containment, and WillD wrote
this text, so I belive it is accurate for ARM.

Very confusing.

> I never fully understood the part about the locks, but from what
> I remember, ARM is still serialized without the barrier here, but
> dropping the barrier on powerpc writel_relaxed() would not
> serialize against locks or DMA.

WC is usually the problem here.. I've been told it is necessary on ARM
as well..

Jason


Re: RFC on writel and writel_relaxed

2018-03-26 Thread Arnd Bergmann
On Mon, Mar 26, 2018 at 10:25 PM, Jason Gunthorpe  wrote:
> On Mon, Mar 26, 2018 at 09:44:15PM +0200, Arnd Bergmann wrote:
>> On Mon, Mar 26, 2018 at 6:54 PM, Jason Gunthorpe  wrote:
>> > On Mon, Mar 26, 2018 at 11:08:45AM +, David Laight wrote:
>> >> > > This is a super performance critical operation for most drivers and
>> >> > > directly impacts network performance.
>> >>
>> >> Perhaps there ought to be writel_nobarrier() (etc) that never contain
>> >> any barriers at all.
>> >> This might mean that they are always just the memory operation,
>> >> but it would make it more obvious what the driver was doing.
>> >
>> > I think that is what writel_relaxed is supposed to be.
>> >
>> > The only restriction it has is that the writes to a single device
>> > using UC memory must be kept in program order..
>>
>> Not sure about whether we have ever defined what happens to
>> writel_relaxed() on WC memory though: On ARM, we disallow
>> the compiler to combine writes, but the CPU still might.
>
> If the driver uses WC memory then I think it should not expect
> anything in terms of how writes map to TLPs other than nothing
> combines across mmiowb() and mmiowb() is fully globally ordered when
> enclosed in a spinlock.
>
> The entire point of using WC memory is usually to get combining :) If
> the driver doesn't want that then it should map UC..

Usually, WC memory is used with memcpy_toio() though, which
by definition doesn't have any barriers between accesses, and
is required to get the correct byte ordering on writes to memory buffers.

>> It's also not entirely clear to me what we want writel() inside a
>> spinlock to mean: should the spinlock guarantee that two writel()
>> calls on different CPUs that are protected by spinlocks are
>> serialized by those locks, or not?
>
> Yes for writel, I think that is already defined by the barriers
> document

Sorry, I meant writel_relaxed(), not writel()

> The same document says that _relaxed() does not give that guarentee.
>
> The lwn articule on this went into some depth on the interaction with
> spinlocks.
>
> As far as I can see, containment in a spinlock seems to be the only
> different between writel and writel_relaxed..

I was always puzzled by this: The intention of _relaxed() on ARM
(where it originates) was to skip the barrier that serializes DMA
with MMIO, not to skip the serialization between MMIO and locks.

I never fully understood the part about the locks, but from what
I remember, ARM is still serialized without the barrier here, but
dropping the barrier on powerpc writel_relaxed() would not
serialize against locks or DMA.

 Arnd


Re: RFC on writel and writel_relaxed

2018-03-26 Thread Jason Gunthorpe
On Mon, Mar 26, 2018 at 09:44:15PM +0200, Arnd Bergmann wrote:
> On Mon, Mar 26, 2018 at 6:54 PM, Jason Gunthorpe  wrote:
> > On Mon, Mar 26, 2018 at 11:08:45AM +, David Laight wrote:
> >> > > This is a super performance critical operation for most drivers and
> >> > > directly impacts network performance.
> >>
> >> Perhaps there ought to be writel_nobarrier() (etc) that never contain
> >> any barriers at all.
> >> This might mean that they are always just the memory operation,
> >> but it would make it more obvious what the driver was doing.
> >
> > I think that is what writel_relaxed is supposed to be.
> >
> > The only restriction it has is that the writes to a single device
> > using UC memory must be kept in program order..
> 
> Not sure about whether we have ever defined what happens to
> writel_relaxed() on WC memory though: On ARM, we disallow
> the compiler to combine writes, but the CPU still might.

If the driver uses WC memory then I think it should not expect
anything in terms of how writes map to TLPs other than nothing
combines across mmiowb() and mmiowb() is fully globally ordered when
enclosed in a spinlock.

The entire point of using WC memory is usually to get combining :) If
the driver doesn't want that then it should map UC..

> It's also not entirely clear to me what we want writel() inside a
> spinlock to mean: should the spinlock guarantee that two writel()
> calls on different CPUs that are protected by spinlocks are
> serialized by those locks, or not?

Yes for writel, I think that is already defined by the barriers
document

The same document says that _relaxed() does not give that guarentee.

The lwn articule on this went into some depth on the interaction with
spinlocks.

As far as I can see, containment in a spinlock seems to be the only
different between writel and writel_relaxed..

Jason


  1   2   >