Re: RFC on writel and writel_relaxed

2018-04-02 Thread Sinan Kaya
On 3/29/2018 9:40 PM, Benjamin Herrenschmidt wrote:
> On Thu, 2018-03-29 at 09:56 -0400, Sinan Kaya wrote:
>> On 3/28/2018 11:55 AM, David Miller wrote:
>>> From: Benjamin Herrenschmidt 
>>> Date: Thu, 29 Mar 2018 02:13:16 +1100
>>>
 Let's fix all archs, it's way easier than fixing all drivers. Half of
 the archs are unused or dead anyway.
>>>
>>> Agreed.
>>>
>>
>> I pinged most of the maintainers yesterday.
>> Which arches do we care about these days?
>> I have not been paying attention any other architecture besides arm64.
> 
> Thanks for going through that exercise !
> 
> Once sparc, s390, microblaze and mips reply, I think we'll have a good
> coverage, maybe riscv is to put in that lot too.


I posted the following two patches for supporting microblaze and unicore32. 

[PATCH v2 1/2] io: prevent compiler reordering on the default writeX() 
implementation
[PATCH v2 2/2] io: prevent compiler reordering on the default readX() 
implementation

The rest of the arches except mips and alpha seem OK.
I sent a question email on Friday to mips and alpha mailing lists. I'll follow 
up with
an actual patch today.


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: RFC on writel and writel_relaxed

2018-03-29 Thread Benjamin Herrenschmidt
On Thu, 2018-03-29 at 09:56 -0400, Sinan Kaya wrote:
> On 3/28/2018 11:55 AM, David Miller wrote:
> > From: Benjamin Herrenschmidt 
> > Date: Thu, 29 Mar 2018 02:13:16 +1100
> > 
> > > Let's fix all archs, it's way easier than fixing all drivers. Half of
> > > the archs are unused or dead anyway.
> > 
> > Agreed.
> > 
> 
> I pinged most of the maintainers yesterday.
> Which arches do we care about these days?
> I have not been paying attention any other architecture besides arm64.

Thanks for going through that exercise !

Once sparc, s390, microblaze and mips reply, I think we'll have a good
coverage, maybe riscv is to put in that lot too.
 
Cheers,
Ben.

> 
> arch  status  detail
> ---   
> 
> alpha question sent
> arc   question sent   ys...@users.sourceforge.jp will fix it.
> arm   no issues
> arm64 no issues
> blackfin  question sent   about to be removed
> c6x   question sent
> cris  question sent
> frv
> h8300 question sent
> hexagon   question sent
> ia64  no issues   confirmed by Tony Luck
> m32r
> m68k  question sent
> metag
> microblazequestion sent
> mips  question sent
> mn10300   question sent
> nios2 question sent
> openrisc  no issues   sho...@gmail.com says should no issues
> pariscno issues   grantgrund...@gmail.com says 
> most probably no problem but still looking
> powerpc   no issues
> riscv question sent
> s390  question sent
> score question sent
> shquestion sent
> sparc question sent
> tile  question sent
> unicore32 question sent
> x86   no issues
> xtensaquestion sent
> 
> 


Re: RFC on writel and writel_relaxed

2018-03-29 Thread Sinan Kaya
On 3/29/2018 12:29 PM, Arnd Bergmann wrote:
> On Thu, Mar 29, 2018 at 3:56 PM, Sinan Kaya  wrote:
>> On 3/28/2018 11:55 AM, David Miller wrote:
>>> From: Benjamin Herrenschmidt 
>>> Date: Thu, 29 Mar 2018 02:13:16 +1100
>>>
 Let's fix all archs, it's way easier than fixing all drivers. Half of
 the archs are unused or dead anyway.
>>>
>>> Agreed.
>>>
>>
>> I pinged most of the maintainers yesterday.
>> Which arches do we care about these days?
>> I have not been paying attention any other architecture besides arm64.
>>
>> archstatus  detail
>> --  -   
>> alpha   question sent

Thanks for the detailed analysis.

> 
> I'm guessing alpha has problems
> 
> extern inline u32 readl(const volatile void __iomem *addr)
> {
> u32 ret = __raw_readl(addr);
> mb();
> return ret;
> }
> extern inline void writel(u32 b, volatile void __iomem *addr)
> {
> __raw_writel(b, addr);
> mb();
> }

Looks like a problem to me too. I'll start a thread with the alpha
people and CC you.


> 
> There is a barrier in writel /after/ the acess but not before.
> 

This is the consolidated list. I also heart back from m68k and corrected
contacts for arc and h8300.

archstatus  detail
--  -   
alpha   question sent   Arnd: alpha has problems
arc question sent   vineet.gup...@synopsys.com says he'll 
get to this
in the next few days
arm no issues
arm64   no issues
c6x no issues   no PCI
h8300   no issues   no PCI: ys...@users.sourceforge.jp will 
fix it.
hexagon no issues   no PCI
ia64no issues   confirmed by Tony Luck
m68kno issues   ge...@linux-m68k.org says no problem
metag   no issues   arnd: removed
microblaze  question sent   arnd: some mips platforms have problems
mipsquestion sent   arnd: some mips platforms have problems
nds32   question sent
nios2   no issues   no PCI
openriscno issues   sho...@gmail.com says should no issues
parisc  no issues   grantgrund...@gmail.com says most 
probably no problem
but still looking
powerpc no issues
riscv   no issues   arnd: riscv should be fine
s390no issues   arnd: Pretty sure this is also fine
sh  question sent
sparc   no issues   da...@davemloft.net says always 
strongly ordered
unicore32   question sent   resent to g...@pku.edu.cn
x86 no issues
x86_64  no issues



> 
>  Arnd
> 


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: RFC on writel and writel_relaxed

2018-03-29 Thread Arnd Bergmann
On Thu, Mar 29, 2018 at 3:56 PM, Sinan Kaya  wrote:
> On 3/28/2018 11:55 AM, David Miller wrote:
>> From: Benjamin Herrenschmidt 
>> Date: Thu, 29 Mar 2018 02:13:16 +1100
>>
>>> Let's fix all archs, it's way easier than fixing all drivers. Half of
>>> the archs are unused or dead anyway.
>>
>> Agreed.
>>
>
> I pinged most of the maintainers yesterday.
> Which arches do we care about these days?
> I have not been paying attention any other architecture besides arm64.
>
> archstatus  detail
> --  -   
> alpha   question sent

I'm guessing alpha has problems

extern inline u32 readl(const volatile void __iomem *addr)
{
u32 ret = __raw_readl(addr);
mb();
return ret;
}
extern inline void writel(u32 b, volatile void __iomem *addr)
{
__raw_writel(b, addr);
mb();
}

There is a barrier in writel /after/ the acess but not before.

> arc question sent   ys...@users.sourceforge.jp will fix 
> it.
> arm no issues
> arm64   no issues
> blackfinquestion sent   about to be removed
> c6x question sent

no PCI, so it might not matter that much -- all drivers
are platform specific in the end.

> crisquestion sent
> frv

cris and frv are getting removed

> h8300   question sent

no PCI

> hexagon question sent

no PCI

> ia64no issues   confirmed by Tony Luck
> m32r

removed

> m68kquestion sent

> metag

removed

> microblaze  question sent
> mipsquestion sent

I'm guessing that some mips platforms have problems, but others don't.

> mn10300 question sent

removed

> nios2   question sent

no PCI

> openriscno issues   sho...@gmail.com says should no issues
> parisc  no issues   grantgrund...@gmail.com says most 
> probably no problem but still looking
> powerpc no issues
> riscv   question sent

riscv should be fine

> s390question sent

Pretty sure this is also fine

> score   question sent

removed

> sh  question sent
> sparc   question sent

> tilequestion sent

removed

> unicore32   question sent

Note the maintainer's new email address in linux-next.

> x86 no issues

> xtensa  question sent

removed.

 Arnd


Re: RFC on writel and writel_relaxed

2018-03-29 Thread David Miller
From: Sinan Kaya 
Date: Thu, 29 Mar 2018 09:56:01 -0400

> sparc question sent

Sparc never lets physical memory accesses pass MMIO, and vice versa.

They are always strongly ordered amongst eachother.

Therefore no explicit barrier instructions are necessary.


Re: RFC on writel and writel_relaxed

2018-03-29 Thread Sinan Kaya
On 3/28/2018 11:55 AM, David Miller wrote:
> From: Benjamin Herrenschmidt 
> Date: Thu, 29 Mar 2018 02:13:16 +1100
> 
>> Let's fix all archs, it's way easier than fixing all drivers. Half of
>> the archs are unused or dead anyway.
> 
> Agreed.
> 

I pinged most of the maintainers yesterday.
Which arches do we care about these days?
I have not been paying attention any other architecture besides arm64.

archstatus  detail
--  -   
alpha   question sent
arc question sent   ys...@users.sourceforge.jp will fix it.
arm no issues
arm64   no issues
blackfinquestion sent   about to be removed
c6x question sent
crisquestion sent
frv
h8300   question sent
hexagon question sent
ia64no issues   confirmed by Tony Luck
m32r
m68kquestion sent
metag
microblaze  question sent
mipsquestion sent
mn10300 question sent
nios2   question sent
openriscno issues   sho...@gmail.com says should no issues
parisc  no issues   grantgrund...@gmail.com says most 
probably no problem but still looking
powerpc no issues
riscv   question sent
s390question sent
score   question sent
sh  question sent
sparc   question sent
tilequestion sent
unicore32   question sent
x86 no issues
xtensa  question sent


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: RFC on writel and writel_relaxed

2018-03-29 Thread Will Deacon
On Thu, Mar 29, 2018 at 08:31:32AM +1100, Benjamin Herrenschmidt wrote:
> On Thu, 2018-03-29 at 02:23 +1000, Nicholas Piggin wrote:
> >   This is a variation on the mandatory write barrier that causes writes to 
> > weakly
> >   ordered I/O regions to be partially ordered.  Its effects may go beyond 
> > the
> >   CPU->Hardware interface and actually affect the hardware at some level.
> > 
> > How can a driver writer possibly get that right?
> > 
> > IIRC it was added for some big ia64 system that was really expensive
> > to implement the proper wmb() semantics on. So wmb() semantics were
> > quietly downgraded, then the subsequently broken drivers they cared
> > about were fixed by adding the stronger mmiowb().
> > 
> > What should have happened was wmb and writel remained correct, sane, and
> > expensive, and they add an mmio_wmb() to order MMIO stores made by the
> > writel_relaxed accessors, then use that to speed up the few drivers they
> > care about.
> > 
> > Now that ia64 doesn't matter too much, can we deprecate mmiowb and just
> > make wmb ordering talk about stores to the device, not to some
> > intermediate stage of the interconnect where it can be subsequently
> > reordered wrt the device? Drivers can be converted back to using wmb
> > or writel gradually.
> 
> I was under the impression that mmiowb was specifically about ordering
> writel's with a subsequent spin_unlock, without it, MMIOs from
> different CPUs (within the same lock) would still arrive OO.
> 
> If that's indeed the case, I would suggest ia64 switches to a similar
> per-cpu flag trick powerpc uses.

... or we could remove ia64.

/me runs for cover

Will


Re: RFC on writel and writel_relaxed

2018-03-28 Thread Nicholas Piggin
On Thu, 29 Mar 2018 08:31:32 +1100
Benjamin Herrenschmidt  wrote:

> On Thu, 2018-03-29 at 02:23 +1000, Nicholas Piggin wrote:
> > On Wed, 28 Mar 2018 11:55:09 -0400 (EDT)
> > David Miller  wrote:
> >   
> > > From: Benjamin Herrenschmidt 
> > > Date: Thu, 29 Mar 2018 02:13:16 +1100
> > >   
> > > > Let's fix all archs, it's way easier than fixing all drivers. Half of
> > > > the archs are unused or dead anyway.
> > > 
> > > Agreed.  
> > 
> > While we're making decrees here, can we do something about mmiowb?
> > The semantics are basically indecipherable.  
> 
> I was going to tackle that next :-)
> 
> >   This is a variation on the mandatory write barrier that causes writes to 
> > weakly
> >   ordered I/O regions to be partially ordered.  Its effects may go beyond 
> > the
> >   CPU->Hardware interface and actually affect the hardware at some level.
> > 
> > How can a driver writer possibly get that right?
> > 
> > IIRC it was added for some big ia64 system that was really expensive
> > to implement the proper wmb() semantics on. So wmb() semantics were
> > quietly downgraded, then the subsequently broken drivers they cared
> > about were fixed by adding the stronger mmiowb().
> > 
> > What should have happened was wmb and writel remained correct, sane, and
> > expensive, and they add an mmio_wmb() to order MMIO stores made by the
> > writel_relaxed accessors, then use that to speed up the few drivers they
> > care about.
> > 
> > Now that ia64 doesn't matter too much, can we deprecate mmiowb and just
> > make wmb ordering talk about stores to the device, not to some
> > intermediate stage of the interconnect where it can be subsequently
> > reordered wrt the device? Drivers can be converted back to using wmb
> > or writel gradually.  
> 
> I was under the impression that mmiowb was specifically about ordering
> writel's with a subsequent spin_unlock, without it, MMIOs from
> different CPUs (within the same lock) would still arrive OO.

Yes more or less, and I think that until mmiowb was introduced, wmb
or writel was sufficient for this.

Thanks,
Nick


Re: RFC on writel and writel_relaxed

2018-03-28 Thread Benjamin Herrenschmidt
On Thu, 2018-03-29 at 02:23 +1000, Nicholas Piggin wrote:
> On Wed, 28 Mar 2018 11:55:09 -0400 (EDT)
> David Miller  wrote:
> 
> > From: Benjamin Herrenschmidt 
> > Date: Thu, 29 Mar 2018 02:13:16 +1100
> > 
> > > Let's fix all archs, it's way easier than fixing all drivers. Half of
> > > the archs are unused or dead anyway.  
> > 
> > Agreed.
> 
> While we're making decrees here, can we do something about mmiowb?
> The semantics are basically indecipherable.

I was going to tackle that next :-)

>   This is a variation on the mandatory write barrier that causes writes to 
> weakly
>   ordered I/O regions to be partially ordered.  Its effects may go beyond the
>   CPU->Hardware interface and actually affect the hardware at some level.
> 
> How can a driver writer possibly get that right?
> 
> IIRC it was added for some big ia64 system that was really expensive
> to implement the proper wmb() semantics on. So wmb() semantics were
> quietly downgraded, then the subsequently broken drivers they cared
> about were fixed by adding the stronger mmiowb().
> 
> What should have happened was wmb and writel remained correct, sane, and
> expensive, and they add an mmio_wmb() to order MMIO stores made by the
> writel_relaxed accessors, then use that to speed up the few drivers they
> care about.
> 
> Now that ia64 doesn't matter too much, can we deprecate mmiowb and just
> make wmb ordering talk about stores to the device, not to some
> intermediate stage of the interconnect where it can be subsequently
> reordered wrt the device? Drivers can be converted back to using wmb
> or writel gradually.

I was under the impression that mmiowb was specifically about ordering
writel's with a subsequent spin_unlock, without it, MMIOs from
different CPUs (within the same lock) would still arrive OO.

If that's indeed the case, I would suggest ia64 switches to a similar
per-cpu flag trick powerpc uses.

Cheers,
Ben.

> Thanks,
> Nick


Re: RFC on writel and writel_relaxed

2018-03-28 Thread Nicholas Piggin
On Wed, 28 Mar 2018 11:55:09 -0400 (EDT)
David Miller  wrote:

> From: Benjamin Herrenschmidt 
> Date: Thu, 29 Mar 2018 02:13:16 +1100
> 
> > Let's fix all archs, it's way easier than fixing all drivers. Half of
> > the archs are unused or dead anyway.  
> 
> Agreed.

While we're making decrees here, can we do something about mmiowb?
The semantics are basically indecipherable.

  This is a variation on the mandatory write barrier that causes writes to 
weakly
  ordered I/O regions to be partially ordered.  Its effects may go beyond the
  CPU->Hardware interface and actually affect the hardware at some level.

How can a driver writer possibly get that right?

IIRC it was added for some big ia64 system that was really expensive
to implement the proper wmb() semantics on. So wmb() semantics were
quietly downgraded, then the subsequently broken drivers they cared
about were fixed by adding the stronger mmiowb().

What should have happened was wmb and writel remained correct, sane, and
expensive, and they add an mmio_wmb() to order MMIO stores made by the
writel_relaxed accessors, then use that to speed up the few drivers they
care about.

Now that ia64 doesn't matter too much, can we deprecate mmiowb and just
make wmb ordering talk about stores to the device, not to some
intermediate stage of the interconnect where it can be subsequently
reordered wrt the device? Drivers can be converted back to using wmb
or writel gradually.

Thanks,
Nick


RE: RFC on writel and writel_relaxed

2018-03-28 Thread David Laight
From: Benjamin Herrenschmidt
> Sent: 28 March 2018 16:13
...
> > I've always wondered exactly what the twi;isync were for - always seemed
> > very heavy handed for most mmio reads.
> > Particularly if you are doing mmio reads from a data fifo.
> 
> If you do that you should use the "s" version of the accessors. Those
> will only do the above trick at the end of the access series. Also a
> FIFO needs special care about endianness anyway, so you should use
> those accessors regardless. (Hint: you never endian swap a FIFO even on
> BE on a LE device, unless something's been wired very badly in HW).

That was actually a 64 bit wide fifo connected to a 16bit wide PIO interface.
Reading the high address 'clocked' the fifo.
So the first 3 reads could happen in any order, but the 4th had to be last.
This is a small ppc and we shovel a lot of data through that fifo.

Whether it needed byteswapping depended completely on how our hardware people
had built the pcb (not made easy by some docs using the ibm bit numbering).
In fact it didn't

While that driver only had to run on a very specific small ppc, generic drivers
might have similar issues.

I suspect that writel() is always (or should always be):
barrier_before_writel()
writel_relaxed()
barrier_after_writel()
So if a driver needs to do multiple writes (without strong ordering)
it should be able to repeat the writel_relaxed() with only one set
of barriers.
Similarly for readl().
In addition a lesser barrier is probably enough between a readl_relaxed()
and a writel_relaxed() that is conditional on the read value.

David



Re: RFC on writel and writel_relaxed

2018-03-28 Thread David Miller
From: Benjamin Herrenschmidt 
Date: Thu, 29 Mar 2018 02:13:16 +1100

> Let's fix all archs, it's way easier than fixing all drivers. Half of
> the archs are unused or dead anyway.

Agreed.


Re: RFC on writel and writel_relaxed

2018-03-28 Thread Benjamin Herrenschmidt
On Wed, 2018-03-28 at 07:41 -0400, ok...@codeaurora.org wrote:
> Yes, we want to get there indeed. It is because of some arch not 
> implementing writel properly. Maintainers want to play safe.
> 
> That is why I asked if IA64 and other well known archs follow the 
> strongly ordered rule at this moment like PPC and ARM.
> 
> Or should we go and inform every arch about this before yanking wmb()?
> 
> Maintainers are afraid of introducing a regression.

Let's fix all archs, it's way easier than fixing all drivers. Half of
the archs are unused or dead anyway.

> > 
> > The above code makes no sense, and just looks stupid to me. It also
> > generates pointlessly bad code on x86, so it's bad there too.
> > 
> > Linus


Re: RFC on writel and writel_relaxed

2018-03-28 Thread Benjamin Herrenschmidt
On Wed, 2018-03-28 at 11:30 +, David Laight wrote:
> From: Benjamin Herrenschmidt
> > Sent: 28 March 2018 10:56
> 
> ...
> > For example, let's say I have a device with a reset bit and the spec
> > says the reset bit needs to be set for at least 10us.
> > 
> > This is wrong:
> > 
> > writel(1, RESET_REG);
> > usleep(10);
> > writel(0, RESET_REG);
> > 
> > Because of write posting, the first write might arrive to the device
> > right before the second one.
> > 
> > The typical "fix" is to turn that into:
> > 
> > writel(1, RESET_REG);
> > readl(RESET_REG); /* Flush posted writes */
> 
> Would a writel(1, RESET_REG) here provide enough synchronsiation?

Probably yes. It's one of those things where you try to deal with the
fact that 90% of driver writers barely understand the basic stuff and
so you need the "default" accessors to be hardened as much as possible.

We still need to get a reasonably definition of the semantics of the
relaxed ones vs. WC memory but let's get through that exercise first
and hopefully for the last time.
 
> > usleep(10);
> > writel(0, RESET_REG);
> > 
> > *However* the issue here, at least on power, is that the CPU can issue
> > that readl but doesn't necessarily wait for it to complete (ie, the
> > data to return), before proceeding to the usleep. Now a usleep contains
> > a bunch of loads and stores and is probably fine, but a udelay which
> > just loops on the timebase may not be.
> > 
> > Thus we may still violate the timing requirement.
> 
> I've seem that sort of code (with udelay() and no read back) quite often.
> How many were in linux I don't know.
> 
> For small delays I usually fix it by repeated writes (of the same value)
> to the device register. That can guarantee very short intervals.

As long as you know the bus frequency...

> The only time I've actually seen buffered writes break timing was
> between a 286 and an 8859 interrupt controller.

:-)

The problem for me is not so much what I've seen, I realize that most
of the issues we are talking about are the kind that will hit once in a
thousand times or less.

But we *can* reason about them in a way that can effectively prevent
the problem completely and when your cluster has 1 machine, 1/1000
starts becoming significant.

These days the vast majority of IO devices either are 99% DMA driven so
that a bit of overhead on MMIOs is irrelevant, or have one fast path
(low latency IB etc...) that needs some finer control, and the rest is
all setup which can be paranoid at will.

So I think we should aim for the "default" accessors most people use to
be as hadened as we can think of. I favor correctness over performance
in all cases. But then we also define a reasonable semantic for the
relaxed ones (well, we sort-of do have one, we might have to make it a
bit more precise in some areas) that allows the few MMIO fast path that
care to be optimized.

> If you wrote to the mask then enabled interrupts the first IACK cycle
> could be too close to write and break the cycle recovery time.
> That clobbered most of the interrupt controller registers.
> That probably affected every 286 board ever built!
> Not sure how much software added the required extra bus cycle.
> 
> > What we did inside readl, with the twi;isync sequence (which basically
> > means, trap on return value with "trap never" as a condition, followed
> > by isync that ensures all excpetion conditions are resolved), is force
> > the CPU to "consume" the data from the read before moving on.
> > 
> > This effectively makes readl fully synchronous (we would probably avoid
> > that if we were to implement a readl_relaxed).
> 
> I've always wondered exactly what the twi;isync were for - always seemed
> very heavy handed for most mmio reads.
> Particularly if you are doing mmio reads from a data fifo.

If you do that you should use the "s" version of the accessors. Those
will only do the above trick at the end of the access series. Also a
FIFO needs special care about endianness anyway, so you should use
those accessors regardless. (Hint: you never endian swap a FIFO even on
BE on a LE device, unless something's been wired very badly in HW).

> Perhaps there should be a writel_wait() that is allowed to do a read back
> for such code paths?

I think what we have is fine, we just define that the standard
writel/readl as fairly simple and hardened, and we look at providing a
somewhat reasonable set of relaxed variants for optimizing fast path.
We pretty much already are there, we just need to be better at defining
the semantics.

And for the super high perf case, which thankfully is either seldom
(server high perf network stuff) or very arch specific (ARM SoC stuff),
then arch specific driver hacks will always remain the norm.

Cheers,
Ben. 

>   David
> 


Re: RFC on writel and writel_relaxed

2018-03-28 Thread okaya

On 2018-03-28 02:14, Linus Torvalds wrote:
On Tue, Mar 27, 2018 at 5:24 PM, Sinan Kaya  
wrote:


Basically changing it to

dma_buffer->foo = 1;/* WB */
wmb()
writel_relaxed(KICK, DMA_KICK_REGISTER);/* UC */
mmiowb()


Why?

Why not  just remove the wmb(), and keep the barrier in the writel()?


Yes, we want to get there indeed. It is because of some arch not 
implementing writel properly. Maintainers want to play safe.


That is why I asked if IA64 and other well known archs follow the 
strongly ordered rule at this moment like PPC and ARM.


Or should we go and inform every arch about this before yanking wmb()?

Maintainers are afraid of introducing a regression.



The above code makes no sense, and just looks stupid to me. It also
generates pointlessly bad code on x86, so it's bad there too.

   Linus


RE: RFC on writel and writel_relaxed

2018-03-28 Thread David Laight
From: Benjamin Herrenschmidt
> Sent: 28 March 2018 10:56
...
> For example, let's say I have a device with a reset bit and the spec
> says the reset bit needs to be set for at least 10us.
> 
> This is wrong:
> 
>   writel(1, RESET_REG);
>   usleep(10);
>   writel(0, RESET_REG);
> 
> Because of write posting, the first write might arrive to the device
> right before the second one.
> 
> The typical "fix" is to turn that into:
> 
>   writel(1, RESET_REG);
>   readl(RESET_REG); /* Flush posted writes */

Would a writel(1, RESET_REG) here provide enough synchronsiation?

>   usleep(10);
>   writel(0, RESET_REG);
> 
> *However* the issue here, at least on power, is that the CPU can issue
> that readl but doesn't necessarily wait for it to complete (ie, the
> data to return), before proceeding to the usleep. Now a usleep contains
> a bunch of loads and stores and is probably fine, but a udelay which
> just loops on the timebase may not be.
> 
> Thus we may still violate the timing requirement.

I've seem that sort of code (with udelay() and no read back) quite often.
How many were in linux I don't know.

For small delays I usually fix it by repeated writes (of the same value)
to the device register. That can guarantee very short intervals.

The only time I've actually seen buffered writes break timing was
between a 286 and an 8859 interrupt controller.
If you wrote to the mask then enabled interrupts the first IACK cycle
could be too close to write and break the cycle recovery time.
That clobbered most of the interrupt controller registers.
That probably affected every 286 board ever built!
Not sure how much software added the required extra bus cycle.

> What we did inside readl, with the twi;isync sequence (which basically
> means, trap on return value with "trap never" as a condition, followed
> by isync that ensures all excpetion conditions are resolved), is force
> the CPU to "consume" the data from the read before moving on.
> 
> This effectively makes readl fully synchronous (we would probably avoid
> that if we were to implement a readl_relaxed).

I've always wondered exactly what the twi;isync were for - always seemed
very heavy handed for most mmio reads.
Particularly if you are doing mmio reads from a data fifo.

Perhaps there should be a writel_wait() that is allowed to do a read back
for such code paths?

David



Re: Aw: Re: RFC on writel and writel_relaxed

2018-03-28 Thread Benjamin Herrenschmidt
On Wed, 2018-03-28 at 12:13 +0200, Lino Sanfilippo wrote:
> Hi,
> 
> 
> > 
> > Yeah so that other trick I'm talking about is also used for timing
> > accuracy.
> > 
> > For example, let's say I have a device with a reset bit and the spec
> > says the reset bit needs to be set for at least 10us.
> > 
> > This is wrong:
> > 
> > writel(1, RESET_REG);
> > usleep(10);
> > writel(0, RESET_REG);
> > 
> > Because of write posting, the first write might arrive to the device
> > right before the second one.
> > 
> 
> Does not write posting only concern PCI? This seems to be a different topic. 
> Furthermore
> write posting should not include write reordering...

Nobody's talking about re-ordering and no, write posting is rather
common practice on a whole lot of different busses, not just PCI(e).

Cheers,
Ben.



Aw: Re: RFC on writel and writel_relaxed

2018-03-28 Thread Lino Sanfilippo
Hi,


> 
> Yeah so that other trick I'm talking about is also used for timing
> accuracy.
> 
> For example, let's say I have a device with a reset bit and the spec
> says the reset bit needs to be set for at least 10us.
> 
> This is wrong:
> 
>   writel(1, RESET_REG);
>   usleep(10);
>   writel(0, RESET_REG);
> 
> Because of write posting, the first write might arrive to the device
> right before the second one.
> 

Does not write posting only concern PCI? This seems to be a different topic. 
Furthermore
write posting should not include write reordering...

Regards,
Lino


Re: RFC on writel and writel_relaxed

2018-03-28 Thread Benjamin Herrenschmidt
On Wed, 2018-03-28 at 10:07 +0100, Will Deacon wrote:
> 
> For arm/arm64 we guarantee ordering for (1) but not for (2) -- you'd need to
> add an mb() to make it work.
> 
> Do both of these work on power? 

Yes. There's even another quirk, see further down ;-)

> If so, I guess I can make readl even more
> expensive :/ Feels a bit like the tail wagging the dog, though.

Maybe, but then readl is always horribly slow anyway so you may not
necessarily be losing that much.

> Another thing I just realised is that we restrict the barriers we use in
> readl/writel on arm64 so that they don't necessary apply to both loads and
> stores. To be specific:
> 
>writel is ordered against prior writes to memory, but not reads

That could be tricky... You may end up with something that reads before
triggering a DMA and ends up with the post-DMA value ... ugh.

>readl is ordered against subsequent reads of memory, but not writes (but
>note that in example (1) above, the control dependency ensures that).
> 
> If necessary, I could move the barrier in our readl implementation to be
> before the read, then play the control-dependency + instruction-sync (ISB)
> trick that you do on power.

Yeah so that other trick I'm talking about is also used for timing
accuracy.

For example, let's say I have a device with a reset bit and the spec
says the reset bit needs to be set for at least 10us.

This is wrong:

writel(1, RESET_REG);
usleep(10);
writel(0, RESET_REG);

Because of write posting, the first write might arrive to the device
right before the second one.

The typical "fix" is to turn that into:

writel(1, RESET_REG);
readl(RESET_REG); /* Flush posted writes */
usleep(10);
writel(0, RESET_REG);

*However* the issue here, at least on power, is that the CPU can issue
that readl but doesn't necessarily wait for it to complete (ie, the
data to return), before proceeding to the usleep. Now a usleep contains
a bunch of loads and stores and is probably fine, but a udelay which
just loops on the timebase may not be.

Thus we may still violate the timing requirement.

What we did inside readl, with the twi;isync sequence (which basically
means, trap on return value with "trap never" as a condition, followed
by isync that ensures all excpetion conditions are resolved), is force
the CPU to "consume" the data from the read before moving on.

This effectively makes readl fully synchronous (we would probably avoid
that if we were to implement a readl_relaxed).

Cheers,
Ben.



Re: RFC on writel and writel_relaxed

2018-03-28 Thread Will Deacon
On Wed, Mar 28, 2018 at 05:42:56PM +1100, Benjamin Herrenschmidt wrote:
> On Tue, 2018-03-27 at 20:26 -1000, Linus Torvalds wrote:
> > On Tue, Mar 27, 2018 at 6:33 PM, Benjamin Herrenschmidt
> >  wrote:
> > > 
> > > This is why, I want (with your agreement) to define clearly and once
> > > and for all, that the Linux semantics of writel are that it is ordered
> > > with previous writes to coherent memory (*)
> > 
> > Honestly, I think those are the sane semantics. In fact, make it
> > "ordered with previous writes" full stop, since it's not only ordered
> > wrt previous writes to memory, but also previous writel's.
> 
> Of course. It was somewhat a given that it's ordered vs. any previous
> MMIO actually, but it doesn't hurt to spell it out once more.

Good. So I think this confirms our understanding so far.

> 
> > > Also, can I assume the above ordering with writel() equally applies to
> > > readl() or not ?
> > > 
> > > IE:
> > > dma_buf->foo = 1;
> > > readl(STUPID_DEVICE_DMA_KICK_ON_READ);
> > 
> > If that KICK_ON_READ is UC, then that's definitely the case. And
> > honestly, status registers like that really should always be UC.
> > 
> > But if somebody sets the area WC (which is crazy), then I think it
> > might be at least debatable. x86 semantics does allow reads to be done
> > before previous writes (or, put another way, writes to be buffered -
> > the buffers are ordered so writes don't get re-ordered, but reads can
> > happen during the buffering).
> 
> Right, for now I worry about UC semantics. Once we have nailed that, we
> can look at WC, which is a lot more tricky as archs differs more
> widely, but one thing at a time.
> 
> > But UC accesses are always  done entirely ordered, and honestly, any
> > status register that starts a DMA would not make sense any other way.
> > 
> > Of course, you'd have to be pretty odd to want to start a DMA with a
> > read anyway - partly exactly because it's bad for performance since
> > reads will be synchronous and not buffered like a write).
> 
> I have bad memories of old adaptec controllers ...
> 
> That said, I think the above might not be right on ARM if we want to
> make it the rule, Will, what do you reckon ?

So there are two cases to consider:

1.
if (readl(DEVICE_DMA_STATUS) == DMA_DONE)
mydata = *dma_bufp;



2.
*dma_bufp = 42;
readl(DEVICE_DMA_KICK_ON_READ);


For arm/arm64 we guarantee ordering for (1) but not for (2) -- you'd need to
add an mb() to make it work.

Do both of these work on power? If so, I guess I can make readl even more
expensive :/ Feels a bit like the tail wagging the dog, though.

Another thing I just realised is that we restrict the barriers we use in
readl/writel on arm64 so that they don't necessary apply to both loads and
stores. To be specific:

   writel is ordered against prior writes to memory, but not reads

   readl is ordered against subsequent reads of memory, but not writes (but
   note that in example (1) above, the control dependency ensures that).

If necessary, I could move the barrier in our readl implementation to be
before the read, then play the control-dependency + instruction-sync (ISB)
trick that you do on power.

Will


Re: RFC on writel and writel_relaxed

2018-03-28 Thread Benjamin Herrenschmidt
On Wed, 2018-03-28 at 09:11 +0200, Arnd Bergmann wrote:
> On Wed, Mar 28, 2018 at 8:56 AM, Benjamin Herrenschmidt
>  wrote:
> > On Wed, 2018-03-28 at 06:53 +, Linus Torvalds wrote:
> > > On Tue, Mar 27, 2018, 20:43 Benjamin Herrenschmidt 
> > >  wrote:
> > > That's why in/out were *so* slow, and why nobody uses them any more
> > > (well, the address size limitations and the lack of any remapping of
> > > the address obviously also are a reason).
> > 
> > All true indeed, though a lot of other archs never quite made them
> > fully synchronous, which was another can of worms ... oh well.
> 
> Many architectures have no way of providing PCI compliant semantics
> for outb, as their instruction set and/or bus interconnect lacks a
> method of waiting for completion of an outb.

Yup, that includes powerpc. Note that since POWER8 we don't even
genetate IO space anymore :-)

> In practice, it doesn't seem to matter for any of the devices one would
> encounter these days: very few use I/O space, and those that do don't
> actually rely on the strict ordering. Some architectures (in particular
> s390, but I remember seeing the same thing elsewhere) explicitly
> disallow I/O space access on PCI because of this. On ARM, the typical
> PCI implementations have other problems that are worse than this
> one, so most drivers are fine with the almost-working semantics.

/me cries...

Ben.



Re: RFC on writel and writel_relaxed

2018-03-28 Thread Arnd Bergmann
On Wed, Mar 28, 2018 at 8:56 AM, Benjamin Herrenschmidt
 wrote:
> On Wed, 2018-03-28 at 06:53 +, Linus Torvalds wrote:
>> On Tue, Mar 27, 2018, 20:43 Benjamin Herrenschmidt 
>>  wrote:
>> That's why in/out were *so* slow, and why nobody uses them any more
>> (well, the address size limitations and the lack of any remapping of
>> the address obviously also are a reason).
>
> All true indeed, though a lot of other archs never quite made them
> fully synchronous, which was another can of worms ... oh well.

Many architectures have no way of providing PCI compliant semantics
for outb, as their instruction set and/or bus interconnect lacks a
method of waiting for completion of an outb.

In practice, it doesn't seem to matter for any of the devices one would
encounter these days: very few use I/O space, and those that do don't
actually rely on the strict ordering. Some architectures (in particular
s390, but I remember seeing the same thing elsewhere) explicitly
disallow I/O space access on PCI because of this. On ARM, the typical
PCI implementations have other problems that are worse than this
one, so most drivers are fine with the almost-working semantics.

Arnd


Re: RFC on writel and writel_relaxed

2018-03-28 Thread Benjamin Herrenschmidt
On Wed, 2018-03-28 at 06:53 +, Linus Torvalds wrote:
> 
> 
> On Tue, Mar 27, 2018, 20:43 Benjamin Herrenschmidt  ing.org> wrote:
> > >
> > > Of course, you'd have to be pretty odd to want to start a DMA
> > with a
> > > read anyway - partly exactly because it's bad for performance
> > since
> > > reads will be synchronous and not buffered like a write).
> > 
> > I have bad memories of old adaptec controllers ...
> 
> *Old* adaptec controllers were likely to use the in/out instructions
> for status and command data.
> 
> Those are actually even more ordered than UC reads and writes: the
> in/out instructions are not just fully ordered, but are fully
> *synchronous* on x86. 
> 
> So not just doing accesses in order, but actually waiting for
> everything to drain before they start executing, but they also wait
> for the operation itself to complete (ie "out" will not just queue
> the write, it will then wait for the queue to empty and the write
> data to hit the line).
> 
> That's why in/out were *so* slow, and why nobody uses them any more
> (well, the address size limitations and the lack of any remapping of
> the address obviously also are a reason).

All true indeed, though a lot of other archs never quite made them
fully synchronous, which was another can of worms ... oh well.

As for Adaptec, you might be right, I do remember having cases of old
stuff triggering DMA on reads, it might have been "Mac" variants of
Adaptec using MMIO or something...

Cheers,
Ben.



Re: RFC on writel and writel_relaxed

2018-03-28 Thread Benjamin Herrenschmidt
On Tue, 2018-03-27 at 20:26 -1000, Linus Torvalds wrote:
> On Tue, Mar 27, 2018 at 6:33 PM, Benjamin Herrenschmidt
>  wrote:
> > 
> > This is why, I want (with your agreement) to define clearly and once
> > and for all, that the Linux semantics of writel are that it is ordered
> > with previous writes to coherent memory (*)
> 
> Honestly, I think those are the sane semantics. In fact, make it
> "ordered with previous writes" full stop, since it's not only ordered
> wrt previous writes to memory, but also previous writel's.

Of course. It was somewhat a given that it's ordered vs. any previous
MMIO actually, but it doesn't hurt to spell it out once more.

> > Also, can I assume the above ordering with writel() equally applies to
> > readl() or not ?
> > 
> > IE:
> > dma_buf->foo = 1;
> > readl(STUPID_DEVICE_DMA_KICK_ON_READ);
> 
> If that KICK_ON_READ is UC, then that's definitely the case. And
> honestly, status registers like that really should always be UC.
> 
> But if somebody sets the area WC (which is crazy), then I think it
> might be at least debatable. x86 semantics does allow reads to be done
> before previous writes (or, put another way, writes to be buffered -
> the buffers are ordered so writes don't get re-ordered, but reads can
> happen during the buffering).

Right, for now I worry about UC semantics. Once we have nailed that, we
can look at WC, which is a lot more tricky as archs differs more
widely, but one thing at a time.

> But UC accesses are always  done entirely ordered, and honestly, any
> status register that starts a DMA would not make sense any other way.
> 
> Of course, you'd have to be pretty odd to want to start a DMA with a
> read anyway - partly exactly because it's bad for performance since
> reads will be synchronous and not buffered like a write).

I have bad memories of old adaptec controllers ...

That said, I think the above might not be right on ARM if we want to
make it the rule, Will, what do you reckon ?

Cheers,
Ben.



Re: RFC on writel and writel_relaxed

2018-03-28 Thread Linus Torvalds
On Tue, Mar 27, 2018 at 6:33 PM, Benjamin Herrenschmidt
 wrote:
>
> This is why, I want (with your agreement) to define clearly and once
> and for all, that the Linux semantics of writel are that it is ordered
> with previous writes to coherent memory (*)

Honestly, I think those are the sane semantics. In fact, make it
"ordered with previous writes" full stop, since it's not only ordered
wrt previous writes to memory, but also previous writel's.

> Also, can I assume the above ordering with writel() equally applies to
> readl() or not ?
>
> IE:
> dma_buf->foo = 1;
> readl(STUPID_DEVICE_DMA_KICK_ON_READ);

If that KICK_ON_READ is UC, then that's definitely the case. And
honestly, status registers like that really should always be UC.

But if somebody sets the area WC (which is crazy), then I think it
might be at least debatable. x86 semantics does allow reads to be done
before previous writes (or, put another way, writes to be buffered -
the buffers are ordered so writes don't get re-ordered, but reads can
happen during the buffering).

But UC accesses are always  done entirely ordered, and honestly, any
status register that starts a DMA would not make sense any other way.

Of course, you'd have to be pretty odd to want to start a DMA with a
read anyway - partly exactly because it's bad for performance since
reads will be synchronous and not buffered like a write).

   Linus


Re: RFC on writel and writel_relaxed

2018-03-28 Thread Linus Torvalds
On Tue, Mar 27, 2018 at 5:24 PM, Sinan Kaya  wrote:
>
> Basically changing it to
>
> dma_buffer->foo = 1;/* WB */
> wmb()
> writel_relaxed(KICK, DMA_KICK_REGISTER);/* UC */
> mmiowb()

Why?

Why not  just remove the wmb(), and keep the barrier in the writel()?

The above code makes no sense, and just looks stupid to me. It also
generates pointlessly bad code on x86, so it's bad there too.

   Linus


Re: RFC on writel and writel_relaxed

2018-03-27 Thread Benjamin Herrenschmidt
On Tue, 2018-03-27 at 23:24 -0400, Sinan Kaya wrote:
> On 3/27/2018 10:51 PM, Linus Torvalds wrote:
> > > The discussion at hand is about
> > > 
> > > dma_buffer->foo = 1;/* WB */
> > > writel(KICK, DMA_KICK_REGISTER);/* UC */
> > 
> > Yes. That certainly is ordered on x86. In fact, afaik it's ordered
> > even if that writel() might be of type WC, because that only delays
> > writes, it doesn't move them earlier.
> 
> Now that we clarified x86 myth, Is this guaranteed on all architectures?

If not we need to fix it. It's guaranteed on the "main" ones (arm,
arm64, powerpc, i386, x86_64). We might need to check with other arch
maintainers for the rest.

We really want Linux to provide well defined "sane" semantics for the
basic writel accessors.

Note: We still have open questions about how readl() relates to
surrounding memory accesses. It looks like ARM and powerpc do different
things here.

> We keep getting IA64 exception example. Maybe, this is also corrected since
> then.

I would think ia64 fixed it back when it was all discussed. I was under
the impression all ia64 had "special" was the writel vs. spin_unlock
which requires mmiowb, but maybe that was never completely fixed ?

> Jose Abreu says "I don't know about x86 but arc architecture doesn't
> have a wmb() in the writel() function (in some configs)".

Well, it probably should then.

> As long as we have these exceptions, these wmb() in common drivers is not
> going anywhere and relaxed-arches will continue paying performance penalty.

Well, let's fix them or leave them broken, at this point, it doesn't
matter. We can give all arch maintainers a wakeup call and start making
drivers work based on the documented assumptions.

> I see 15% performance loss on ARM64 servers using Intel i40e network
> drivers and an XL710 adapter due to CPU keeping itself busy doing barriers
> most of the time rather than real work because of sequences like this all over
> the place.
> 
>  dma_buffer->foo = 1;/* WB */
>wmb()
>  writel(KICK, DMA_KICK_REGISTER);/* UC */
>
> I posted several patches last week to remove duplicate barriers on ARM while
> trying to make the code friendly with other architectures.
> 
> Basically changing it to
> 
> dma_buffer->foo = 1;/* WB */
> wmb()
> writel_relaxed(KICK, DMA_KICK_REGISTER);/* UC */
> mmiowb()
> 
> This is a small step in the performance direction until we remove all 
> exceptions.
> 
> https://www.spinics.net/lists/netdev/msg491842.html
> https://www.spinics.net/lists/linux-rdma/msg62434.html
> https://www.spinics.net/lists/arm-kernel/msg642336.html
> 
> Discussion started to move around the need for relaxed API on PPC and then
> why wmb() question came up.

I'm working on the problem of relaxed APIs for powerpc, but we can keep
that orthogonal. As is, today, a wmb() + writel() and a wmb() +
writel_relaxed() on powerpc are identical. So changing them will not
break us.

But I don't see the point of doing that transformation if we can just
get the straying archs fixed. It's not like any of them has a
significant market presence these days anyway.

Cheers,
Ben.

> Sinan
> 


Re: RFC on writel and writel_relaxed

2018-03-27 Thread Benjamin Herrenschmidt
On Tue, 2018-03-27 at 16:51 -1000, Linus Torvalds wrote:
> On Tue, Mar 27, 2018 at 3:03 PM, Benjamin Herrenschmidt
>  wrote:
> > 
> > The discussion at hand is about
> > 
> > dma_buffer->foo = 1;/* WB */
> > writel(KICK, DMA_KICK_REGISTER);/* UC */
> 
> Yes. That certainly is ordered on x86. In fact, afaik it's ordered
> even if that writel() might be of type WC, because that only delays
> writes, it doesn't move them earlier.

Ok so this is our answer ...

 ... snip ... (thanks for the background info !)

> Oh, the above UC case is absoutely guaranteed.

Good.

Then

> The only issue really is that 99.9% of all testing gets done on x86
> unless you look at specific SoC drivers.
> 
> On ARM, for example, there is likely little reason to care about x86
> memory ordering, because there is almost zero driver overlap between
> x86 and ARM.
> 
> *Historically*, the reason for following the x86 IO ordering was
> simply that a lot of architectures used the drivers that were
> developed on x86. The alpha and powerpc workstations were *designed*
> with the x86 IO bus (PCI, then PCIe) and to work with the devices that
> came with it.
> 
> ARM? PCIe is almost irrelevant. For ARM servers, if they ever take
> off, sure. But 99.99% of ARM is about their own SoC's, and so "x86
> test coverage" is simply not an issue.
> 
> How much of an issue is it for Power? Maybe you decide it's not a big deal.
> 
> Then all the above is almost irrelevant.

So the overlap may not be that NIL in practice :-) But even then that
doesn't matter as ARM has been happily implementing the same semantic
you describe above for years, as do we powerpc.

This is why, I want (with your agreement) to define clearly and once
and for all, that the Linux semantics of writel are that it is ordered
with previous writes to coherent memory (*)

This is already what ARM and powerpc provide, from what you say, what
x86 provides, I don't see any reason to keep that badly documented and
have drivers randomly growing useless wmb()'s because they don't think
it works on x86 without them !

Once that's sorted, let's tackle the problem of mmiowb vs. spin_unlock
and the problem of writel_relaxed semantics but as separate issues :-)

Also, can I assume the above ordering with writel() equally applies to
readl() or not ?

IE:
dma_buf->foo = 1;
readl(STUPID_DEVICE_DMA_KICK_ON_READ);

Also works on x86 ? (It does on power, maybe not on ARM).

Cheers,
Ben.

(*) From an Linux API perspective, all of this is only valid if the
memory was allocated by dma_alloc_coherent(). Anything obtained by
dma_map_something() might have been bounced bufferred or might require
extra cache flushes on some architectures, and thus needs
dma_sync_for_{cpu,device} calls.

Cheers,
Ben.



Re: RFC on writel and writel_relaxed

2018-03-27 Thread Sinan Kaya
On 3/27/2018 10:51 PM, Linus Torvalds wrote:
>> The discussion at hand is about
>>
>> dma_buffer->foo = 1;/* WB */
>> writel(KICK, DMA_KICK_REGISTER);/* UC */
> Yes. That certainly is ordered on x86. In fact, afaik it's ordered
> even if that writel() might be of type WC, because that only delays
> writes, it doesn't move them earlier.

Now that we clarified x86 myth, Is this guaranteed on all architectures?
We keep getting IA64 exception example. Maybe, this is also corrected since
then.

Jose Abreu says "I don't know about x86 but arc architecture doesn't
have a wmb() in the writel() function (in some configs)".

As long as we have these exceptions, these wmb() in common drivers is not
going anywhere and relaxed-arches will continue paying performance penalty.

I see 15% performance loss on ARM64 servers using Intel i40e network
drivers and an XL710 adapter due to CPU keeping itself busy doing barriers
most of the time rather than real work because of sequences like this all over
the place.

 dma_buffer->foo = 1;/* WB */
 wmb()
 writel(KICK, DMA_KICK_REGISTER);/* UC */

I posted several patches last week to remove duplicate barriers on ARM while
trying to make the code friendly with other architectures.

Basically changing it to

dma_buffer->foo = 1;/* WB */
wmb()
writel_relaxed(KICK, DMA_KICK_REGISTER);/* UC */
mmiowb()

This is a small step in the performance direction until we remove all 
exceptions.

https://www.spinics.net/lists/netdev/msg491842.html
https://www.spinics.net/lists/linux-rdma/msg62434.html
https://www.spinics.net/lists/arm-kernel/msg642336.html

Discussion started to move around the need for relaxed API on PPC and then
why wmb() question came up.

Sinan

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: RFC on writel and writel_relaxed

2018-03-27 Thread Linus Torvalds
On Tue, Mar 27, 2018 at 3:03 PM, Benjamin Herrenschmidt
 wrote:
>
> The discussion at hand is about
>
> dma_buffer->foo = 1;/* WB */
> writel(KICK, DMA_KICK_REGISTER);/* UC */

Yes. That certainly is ordered on x86. In fact, afaik it's ordered
even if that writel() might be of type WC, because that only delays
writes, it doesn't move them earlier.

Whether people *do* that or not, I don't know. But I wouldn't be
surprised if they do.

So if it's a DMA buffer, it's "cached". And even cached accesses are
ordered wrt MMIO.

Basically, to get unordered writes on x86, you need to either use
explicitly nontemporal stores, or have a writecombining region with
back-to-back writes that actually combine.

And nobody really does that nontemporal store thing any more because
the hardware that cared pretty much doesn't exist any more. It was too
much pain. People use DMA and maybe an UC store for starting the DMA
(or possibly a WC buffer that gets multiple  stores in ascending order
as a stream of commands).

Things like UC will force everything to be entirely ordered, but even
without UC, loads won't pass loads, and stores won't pass stores.

> Now it appears that this wasn't fully understood back then, and some
> people are now saying that x86 might not even provide that semantic
> always.

Oh, the above UC case is absoutely guaranteed.

And I think even if it's WC, the write to kick off the DMA is ordered
wrt the cached write.

On x86, I think you need barriers only if you do things like

 - do two non-temporal stores and require them to be ordered: put a
sfence or mfence in between them.

 - do two WC stores, and make sure they do not combine: put a sfence
or mfence between them.

 - do a store, and a subsequent from a different address, and neither
of them is UC: put a mfence between them. But note that this is
literally just "load after store". A "store after load" doesn't need
one.

I think that's pretty much it.

For example, the "lfence" instruction is almost entirely pointless on
x86 - it was designed back in the time when people *thought* they
might re-order loads. But loads don't get re-ordered. At least Intel
seems to document that only non-temporal *stores* can get re-ordered
wrt each other.

End result: lfence is a historical oddity that can now be used to
guarantee that a previous  load has finished, and that in turn meant
that it is  now used in the Spectre mitigations. But it basically has
no real memory ordering meaning since nothing passes an earlier load
anyway, it's more of a pipeline thing.

But in the end, one question is just "how much do drivers actually
_rely_ on the x86 strong ordering?"

We so support "smp_wmb()" even though x86 has strong enough ordering
that just a barrier suffices. Somebody might just say "screw the x86
memory ordering, we're relaxed, and we'll fix up the drivers we care
about".

The only issue really is that 99.9% of all testing gets done on x86
unless you look at specific SoC drivers.

On ARM, for example, there is likely little reason to care about x86
memory ordering, because there is almost zero driver overlap between
x86 and ARM.

*Historically*, the reason for following the x86 IO ordering was
simply that a lot of architectures used the drivers that were
developed on x86. The alpha and powerpc workstations were *designed*
with the x86 IO bus (PCI, then PCIe) and to work with the devices that
came with it.

ARM? PCIe is almost irrelevant. For ARM servers, if they ever take
off, sure. But 99.99% of ARM is about their own SoC's, and so "x86
test coverage" is simply not an issue.

How much of an issue is it for Power? Maybe you decide it's not a big deal.

Then all the above is almost irrelevant.

   Linus


Re: RFC on writel and writel_relaxed

2018-03-27 Thread Benjamin Herrenschmidt
On Tue, 2018-03-27 at 16:10 +0100, Will Deacon wrote:
> To clarify: are you saying that on x86 you need a wmb() prior to a writel
> if you want that writel to be ordered after prior writes to memory? Is this
> specific to WC memory or some other non-standard attribute?
> 
> The only reason we have wmb() inside writel() on arm, arm64 and power is for
> parity with x86 because Linus (CC'd) wanted architectures to order I/O vs
> memory by default so that it was easier to write portable drivers. The
> performance impact of that implicit barrier is non-trivial, but we want the
> driver portability and I went as far as adding generic _relaxed versions for
> the cases where ordering isn't required. You seem to be suggesting that none
> of this is necessary and drivers would already run into problems on x86 if
> they didn't use wmb() explicitly in conjunction with writel, which I find
> hard to believe and is in direct contradiction with the current Linux I/O
> memory model (modulo the broken example in the dma_*mb section of
> memory-barriers.txt).

Another clarification while we are at it 

All of this only applies to concurrent access by the CPU and the device
to memory allocate with dma_alloc_coherent().

For memory "mapped" into the DMA domain via dma_map_* then an extra
dma_sync_for_* is needed.

In most useful server cases etc... these latter are NOPs, but
architecture without full DMA cache coherency or using swiotlb,
dma_map_* might maintain bounce buffers or play additional cache
flushing tricks.

Cheers,
Ben.



Re: RFC on writel and writel_relaxed

2018-03-27 Thread Benjamin Herrenschmidt
On Tue, 2018-03-27 at 14:39 -1000, Linus Torvalds wrote:
> On Tue, Mar 27, 2018 at 11:33 AM, Benjamin Herrenschmidt
>  wrote:
> > 
> > Well, we need to clarify that once and for all, because as I wrote
> > earlier, it was decreed by Linus more than a decade ago that writel
> > would be fully ordered by itself vs. previous memory stores (at least
> > on UC memory).
> 
> Yes.
> 
> So "writel()" needs to be ordered with respect to other writel() uses
> on the same thread. Anything else *will* break drivers. Obviously, the
> drivers may then do magic to say "do write combining etc", but that
> magic will be architecture-specific.
> 
> The other issue is that "writel()" needs to be ordered wrt other CPU's
> doing "writel()" if those writel's are in a spinlocked region.

 .../...

The discussion at hand is about

dma_buffer->foo = 1;/* WB */
writel(KICK, DMA_KICK_REGISTER);/* UC */

(The WC case is something else, let's not mix things up just yet)

IE, a store to normal WB cache memory followed by a writel to a device
which will then DMA from that buffer.

Back in the days, we did require on powerpc a wmb() between these, but
you made the point that x86 didn't and driver writers would never get
that right.

We decided to go conservative, added the necessary barrier inside
writel, so did ARM and it became the norm that writel is also fully
ordered vs. previous stores to memory *by the same CPU* of course (or
protected by the same spinlock).

Now it appears that this wasn't fully understood back then, and some
people are now saying that x86 might not even provide that semantic
always.

So a number (fairly large) of drivers have been adding wmb() in those
case, while others haven't, and it's a mess.

The mess is compounded by the fact that if writel is now defined to
*not* provide that ordering guarantee, then writel_relaxed() is
pointless since all it is defined to relax is precisely the above
ordering guarantee.

So I want to get to the bottom of this once and for all so we can have
well defined and documented semantics and stop having drivers do random
things that may or may not work on some or all architectures (including
x86 !).

Quite note about the spinlock case... In fact this is the only case you
did allow back then to be relaxed. In theory a writel followed by a
spin_unlock requires an mmiowb (which is the only point of that barrier
in fact). This was done because an arch (I think ia64) had a hard time
getting MMIOs from multiple CPUs get in order vs. a lock and required
an expensive access to the PCI host bridge to do so.

Back then, on powerpc, we chose not to allow that relaxing and instead
added code to our writel to set a per-cpu flag which would cause the
next spin_unlock to use a stronger barrier than usual.

We do need to clarify this as well, but let's start with the most basic
one first, there is enough confusion already.

Cheers,
Ben.


Re: RFC on writel and writel_relaxed

2018-03-27 Thread Linus Torvalds
On Tue, Mar 27, 2018 at 11:33 AM, Benjamin Herrenschmidt
 wrote:
>
> Well, we need to clarify that once and for all, because as I wrote
> earlier, it was decreed by Linus more than a decade ago that writel
> would be fully ordered by itself vs. previous memory stores (at least
> on UC memory).

Yes.

So "writel()" needs to be ordered with respect to other writel() uses
on the same thread. Anything else *will* break drivers. Obviously, the
drivers may then do magic to say "do write combining etc", but that
magic will be architecture-specific.

The other issue is that "writel()" needs to be ordered wrt other CPU's
doing "writel()" if those writel's are in a spinlocked region.

So it's not  that "writel()" needs to be ordered wrt the spinlock
itself, but you *do* need to honor ordering if you have something like
this:

   spin_lock();
   writel(a);
   writel(b);
   spin_unlock();

and if two CPU's run the above code "at the same time", then the
*ONLY* acceptable sequence is abab.

You cannot, and must not, ever see "aabb" at the device, for example,
because of how the writel would basically leak out of the spinlock.

That sounds "obvious", but dammit, a lot of architectures got that
wrong, afaik.

Linus


Re: RFC on writel and writel_relaxed

2018-03-27 Thread Benjamin Herrenschmidt
On Tue, 2018-03-27 at 14:54 -0700, Alexander Duyck wrote:
> On Tue, Mar 27, 2018 at 2:35 PM, Benjamin Herrenschmidt
>  wrote:
> > On Tue, 2018-03-27 at 10:46 -0400, Sinan Kaya wrote:
> > >  combined buffers.
> > > 
> > > Alex:
> > > "Don't bother. I can tell you right now that for x86 you have to have a
> > > wmb() before the writel().
> > 
> > No, this isn't the semantics of writel. You shouldn't need it unless
> > something changed and we need to revisit our complete understanding of
> > *all* MMIO accessor semantics.
> 
> The issue seems to be that there have been two different ways of
> dealing with this. There has historically been a number of different
> drivers that have been carrying this wmb() workaround since something
> like 2002. I get that the semantics for writel might have changed
> since then, but those of us who already have the wmb() in our drivers
> will be very wary of anyone wanting to go through and remove them
> since writel is supposed to be "good enough". I would much rather err
> on the side of caution here.
> 
> I view the wmb() + writel_relaxed() as more of a driver owning and
> handling this itself. Besides in the Intel Ethernet driver case it is
> better performance as our wmb() placement for us also provides a
> secondary barrier so we don't need to add a separate smp_wmb() to deal
> with a potential race we have with the Tx cleanup.
> 
> > At least for UC space, it has always been accepted (and enforced) that
> > writel would not require any other barrier to order vs. previous stores
> > to memory.
> 
> So the one thing I would question here is if this is UC vs UC or if
> this extends to other types as well? So for x86 we could find
> references to Write Combining being flushed by a write to UC memory,
> however I have yet to find a clear explanation of what a write to UC
> does to WB. 

Well, this is the standard write memory + trigger DMA case, the one
specific case for which Linus was adamant we don't need another barrier
 back then ...

> My personal inclination would be to err on the side of
> caution.

Which means writel_relaxed is now pointless ?

We need clear semantics here. In this case the "side of caution" means
we are randomly doing things not understanding what really happens and
that makes me *more* nervous.

>  I just don't want us going through and removing the wmb()
> calls because it "should" work. I would want to know for certain it
> will work.

We need to know for certain anyway. Otherwise, all the drivers that do
not have wmb's are potentially broken.

So I dont agree with the status quo.

We need to establish precisely what x86 does, decide what we want the
semantic of writel to be, and implement things accordingly.

Ben.



Re: RFC on writel and writel_relaxed

2018-03-27 Thread Sinan Kaya
On 3/27/2018 5:54 PM, Alexander Duyck wrote:
> I view the wmb() + writel_relaxed() as more of a driver owning and
> handling this itself. Besides in the Intel Ethernet driver case it is
> better performance as our wmb() placement for us also provides a
> secondary barrier so we don't need to add a separate smp_wmb() to deal
> with a potential race we have with the Tx cleanup.

Thanks for the reminder. 

I forgot about the double barrier optimization. wmb() + writel_relaxed()
seems to be the best option for Intel network drivers at this moment. 
Otherwise, we'll have to remove wmb() and throw in smp barriers there
like you mentioned.

I'll leave the changes in the Intel drivers alone.

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: RFC on writel and writel_relaxed

2018-03-27 Thread Alexander Duyck
On Tue, Mar 27, 2018 at 2:35 PM, Benjamin Herrenschmidt
 wrote:
> On Tue, 2018-03-27 at 10:46 -0400, Sinan Kaya wrote:
>>  combined buffers.
>>
>> Alex:
>> "Don't bother. I can tell you right now that for x86 you have to have a
>> wmb() before the writel().
>
> No, this isn't the semantics of writel. You shouldn't need it unless
> something changed and we need to revisit our complete understanding of
> *all* MMIO accessor semantics.

The issue seems to be that there have been two different ways of
dealing with this. There has historically been a number of different
drivers that have been carrying this wmb() workaround since something
like 2002. I get that the semantics for writel might have changed
since then, but those of us who already have the wmb() in our drivers
will be very wary of anyone wanting to go through and remove them
since writel is supposed to be "good enough". I would much rather err
on the side of caution here.

I view the wmb() + writel_relaxed() as more of a driver owning and
handling this itself. Besides in the Intel Ethernet driver case it is
better performance as our wmb() placement for us also provides a
secondary barrier so we don't need to add a separate smp_wmb() to deal
with a potential race we have with the Tx cleanup.

> At least for UC space, it has always been accepted (and enforced) that
> writel would not require any other barrier to order vs. previous stores
> to memory.

So the one thing I would question here is if this is UC vs UC or if
this extends to other types as well? So for x86 we could find
references to Write Combining being flushed by a write to UC memory,
however I have yet to find a clear explanation of what a write to UC
does to WB. My personal inclination would be to err on the side of
caution. I just don't want us going through and removing the wmb()
calls because it "should" work. I would want to know for certain it
will work.

- Alex


Re: RFC on writel and writel_relaxed

2018-03-27 Thread Benjamin Herrenschmidt
On Tue, 2018-03-27 at 10:46 -0400, Sinan Kaya wrote:
>  combined buffers.
> 
> Alex:
> "Don't bother. I can tell you right now that for x86 you have to have a
> wmb() before the writel().

No, this isn't the semantics of writel. You shouldn't need it unless
something changed and we need to revisit our complete understanding of
*all* MMIO accessor semantics.

At least for UC space, it has always been accepted (and enforced) that
writel would not require any other barrier to order vs. previous stores
to memory.

> Based on the comment in
> (https://www.spinics.net/lists/linux-rdma/msg62666.html):
> Replacing wmb() + writel() with wmb() + writel_relaxed() will work on
> PPC, it will just not give you a benefit today.
> 
> I say the patch set stays. This gives benefit on ARM, and has no
> effect on x86 and PowerPC. If you want to look at trying to optimize
> things further on PowerPC and such then go for it in terms of trying
> to implement the writel_relaxed(). Otherwise I say we call the ARM
> goodness a win and don't get ourselves too wrapped up in trying to fix
> this for all architectures."


Re: RFC on writel and writel_relaxed

2018-03-27 Thread Benjamin Herrenschmidt
On Tue, 2018-03-27 at 11:54 -0700, Alexander Duyck wrote:
> As far as I know the code has been this way for a while, something
> like 2002, when the barrier was already present in e1000. However
> there it was calling out weakly ordered models "such as IA-64". Since
> then pretty much all the hardware based network drivers at this point
> have similar code floating around with wmb() in place to prevent
> issues on weak ordered memory systems.
> 
> So in any case we still need to be careful as there are architectures
> that are depending on this even if they might not be x86. :-/

Well, we need to clarify that once and for all, because as I wrote
earlier, it was decreed by Linus more than a decade ago that writel
would be fully ordered by itself vs. previous memory stores (at least
on UC memory).

This is why we added sync's to writel on powerpc and later ARM added
similar barriers to theirs.

This is also why writel_relaxed was added (though much later), since
what writel_relaxed does is to life that specific requirement.

IE. If what you say is true and wmb() is needed on x86, then
writel_relaxed is now completely useless...

Ben.


Re: RFC on writel and writel_relaxed

2018-03-27 Thread Arnd Bergmann
On Tue, Mar 27, 2018 at 9:54 PM, Arnd Bergmann  wrote:
> On Tue, Mar 27, 2018 at 8:54 PM, Alexander Duyck
>  wrote:
>> On Tue, Mar 27, 2018 at 8:10 AM, Will Deacon  wrote:
>
> 11.10 STORE BUFFER
> Intel 64 and IA-32 processors temporarily store each write (store) to
> memory in a store buffer. The store buffer
> improves processor performance by allowing the processor to continue
> executing instructions without having to
> wait until a write to memory and/or to a cache is complete. It also
> allows writes to be delayed for more efficient use
> of memory-access bus cycles.
> In general, the existence of the store buffer is transparent to
> software, even in systems that use multiple processors.
> The processor ensures that write operations are always carried out in
> program order. It also insures that the
> contents of the store buffer are always drained to memory in the
> following situations:
> • When an exception or interrupt is generated.
> • (P6 and more recent processor families only) When a serializing
> instruction is executed.
> • When an I/O instruction is executed.

I guess I/O instruction is still ambiguous on x86, it may just refer
to 'inb'/'outb' style instructions rather than 'mov' on a device MMIO
area.

Here's a link to a reply from Linus that I found on this topic:

http://yarchive.net/comp/linux/write_combining.html

  Arnd


Re: RFC on writel and writel_relaxed

2018-03-27 Thread Arnd Bergmann
On Tue, Mar 27, 2018 at 8:54 PM, Alexander Duyck
 wrote:
> On Tue, Mar 27, 2018 at 8:10 AM, Will Deacon  wrote:

>>>
>>> Sinan
>>> "We are being told that if you use writel(), then you don't need a wmb() on
>>> all architectures."
>>>
>>> Alex:
>>> "I'm not sure who told you that but that is incorrect, at least for
>>> x86. If you attempt to use writel() without the wmb() we will have to
>>> NAK the patches. We will accept the wmb() with writel_releaxed() since
>>> that solves things for ARM."
>>>
>>> > Jason is seeking behavior clarification for write combined buffers.
>>>
>>> Alex:
>>> "Don't bother. I can tell you right now that for x86 you have to have a
>>> wmb() before the writel().
>>
>> To clarify: are you saying that on x86 you need a wmb() prior to a writel
>> if you want that writel to be ordered after prior writes to memory? Is this
>> specific to WC memory or some other non-standard attribute?
>
> Note, I am not a CPU guy so this is just my interpretation. It is my
> understanding that the wmb(), aka sfence, is needed on x86 to sort out
> writes between Write-back(WB) system memory and Strong Uncacheable
> (UC) MMIO accesses.
>
> I was hoping to be able to cite something in the software developers
> manual 
> (https://software.intel.com/sites/default/files/managed/39/c5/325462-sdm-vol-1-2abcd-3abcd.pdf),
> but that tends to be pretty vague. I have re-read section 22.34
> (volume 3B) several times and I am still not clear on if it says we
> need the sfence or not. It is a matter of figuring out what the impact
> of store buffers and caching are for WB versus UC memory.

Here is what I found regarding the store buffer in that document:

11.10 STORE BUFFER
Intel 64 and IA-32 processors temporarily store each write (store) to
memory in a store buffer. The store buffer
improves processor performance by allowing the processor to continue
executing instructions without having to
wait until a write to memory and/or to a cache is complete. It also
allows writes to be delayed for more efficient use
of memory-access bus cycles.
In general, the existence of the store buffer is transparent to
software, even in systems that use multiple processors.
The processor ensures that write operations are always carried out in
program order. It also insures that the
contents of the store buffer are always drained to memory in the
following situations:
• When an exception or interrupt is generated.
• (P6 and more recent processor families only) When a serializing
instruction is executed.
• When an I/O instruction is executed.
• When a LOCK operation is performed.
• (P6 and more recent processor families only) When a BINIT operation
is performed.
• (Pentium III, and more recent processor families only) When using an
SFENCE instruction to order stores.
• (Pentium 4 and more recent processor families only) When using an
MFENCE instruction to order stores.
The discussion of write ordering in Section 8.2, “Memory Ordering,”
gives a detailed description of the operation of
the store buffer.

   Arnd


Re: RFC on writel and writel_relaxed

2018-03-27 Thread Alexander Duyck
On Tue, Mar 27, 2018 at 8:10 AM, Will Deacon  wrote:
> Hi Alex,
>
> On Tue, Mar 27, 2018 at 10:46:58AM -0400, Sinan Kaya wrote:
>> +netdev, +Alex
>>
>> On 3/26/2018 6:00 PM, Benjamin Herrenschmidt wrote:
>> > On Mon, 2018-03-26 at 23:30 +0200, Arnd Bergmann wrote:
>> >>  Most of the drivers have a unwound loop with writeq() or something to
>> >>> do it.
>> >>
>> >> But isn't the writeq() barrier much more expensive than anything you'd
>> >> do in function calls?
>> >
>> > It is for us, and will break any write combining.
>> >
>> > The same document says that _relaxed() does not give that guarentee.
>> >
>> > The lwn articule on this went into some depth on the interaction with
>> > spinlocks.
>> >
>> > As far as I can see, containment in a spinlock seems to be the only
>> > different between writel and writel_relaxed..
>> 
>>  I was always puzzled by this: The intention of _relaxed() on ARM
>>  (where it originates) was to skip the barrier that serializes DMA
>>  with MMIO, not to skip the serialization between MMIO and locks.
>> >>>
>> >>> But that was never a requirement of writel(),
>> >>> Documentation/memory-barriers.txt gives an explicit example demanding
>> >>> the wmb() before writel() for ordering system memory against writel.
>> >
>> > This is a bug in the documentation.
>> >
>> >> Indeed, but it's in an example for when to use dma_wmb(), not wmb().
>> >> Adding Alexander Duyck to Cc, he added that section as part of
>> >> 1077fa36f23e ("arch: Add lightweight memory barriers dma_rmb() and
>> >> dma_wmb()"). Also adding the other people that were involved with that.
>> >
>> > Linus himself made it very clear years ago. readl and writel have to
>> > order vs memory accesses.
>> >
>> >>> I actually have no idea why ARM had that barrier, I always assumed it
>> >>> was to give program ordering to the accesses and that _relaxed allowed
>> >>> re-ordering (the usual meaning of relaxed)..
>> >>>
>> >>> But the barrier document makes it pretty clear that the only
>> >>> difference between the two is spinlock containment, and WillD wrote
>> >>> this text, so I belive it is accurate for ARM.
>> >>>
>> >>> Very confusing.
>> >>
>> >> It does mention serialization with both DMA and locks in the
>> >> section about  readX_relaxed()/writeX_relaxed(). The part
>> >> about DMA is very clear here, and I must have just forgotten
>> >> the exact semantics with regards to spinlocks. I'm still not
>> >> sure what prevents a writel() from leaking out the end of a
>> >> spinlock section that doesn't happen with writel_relaxed(), since
>> >> the barrier in writel() comes before the access, and the
>> >> spin_unlock() shouldn't affect the external buses.
>> >
>> > So...
>> >
>> > Historically, what happened is that we (we means whoever participated
>> > in the discussion on the list with Linus calling the shots really)
>> > decided that there was no sane way for drivers to understand a world
>> > where readl/writel didn't fully order things vs. memory accesses (ie,
>> > DMA).
>> >
>> > So it should always be correct to do:
>> >
>> > - Write to some in-memory buffer
>> > - writel() to kick the DMA read of that buffer
>> >
>> > without any extra barrier.
>> >
>> > The spinlock situation however got murky. Mostly that came up because
>> > on architecture (I forgot who, might have been ia64) has a hard time
>> > providing that consistency without making writel insanely expensive.
>> >
>> > Thus they created mmiowb whose main purpose was precisely to order
>> > writel with a following spin_unlock.
>> >
>> > I decided not to go down that path on power because getting all drivers
>> > "fixed" to do the right thing was going to be a losing battle, and
>> > instead added per-cpu tracking of writel in order to "escalate" to a
>> > heavier barrier in spin_unlock itself when necessary.
>> >
>> > Now, all this happened more than a decade ago and it's possible that
>> > the understanding or expectations "shifted" over time...
>>
>> Alex is raising concerns on the netdev list.
>>
>> Sinan
>> "We are being told that if you use writel(), then you don't need a wmb() on
>> all architectures."
>>
>> Alex:
>> "I'm not sure who told you that but that is incorrect, at least for
>> x86. If you attempt to use writel() without the wmb() we will have to
>> NAK the patches. We will accept the wmb() with writel_releaxed() since
>> that solves things for ARM."
>>
>> > Jason is seeking behavior clarification for write combined buffers.
>>
>> Alex:
>> "Don't bother. I can tell you right now that for x86 you have to have a
>> wmb() before the writel().
>
> To clarify: are you saying that on x86 you need a wmb() prior to a writel
> if you want that writel to be ordered after prior writes to memory? Is this
> specific to WC memory or some other non-standard attribute?

Note, I am not a CPU guy so this is just my interpretation. It is my
understanding that the wmb(), aka sfence, is 

Re: RFC on writel and writel_relaxed

2018-03-27 Thread Will Deacon
Hi Alex,

On Tue, Mar 27, 2018 at 10:46:58AM -0400, Sinan Kaya wrote:
> +netdev, +Alex
> 
> On 3/26/2018 6:00 PM, Benjamin Herrenschmidt wrote:
> > On Mon, 2018-03-26 at 23:30 +0200, Arnd Bergmann wrote:
> >>  Most of the drivers have a unwound loop with writeq() or something to
> >>> do it.
> >>
> >> But isn't the writeq() barrier much more expensive than anything you'd
> >> do in function calls?
> > 
> > It is for us, and will break any write combining.
> > 
> > The same document says that _relaxed() does not give that guarentee.
> >
> > The lwn articule on this went into some depth on the interaction with
> > spinlocks.
> >
> > As far as I can see, containment in a spinlock seems to be the only
> > different between writel and writel_relaxed..
> 
>  I was always puzzled by this: The intention of _relaxed() on ARM
>  (where it originates) was to skip the barrier that serializes DMA
>  with MMIO, not to skip the serialization between MMIO and locks.
> >>>
> >>> But that was never a requirement of writel(),
> >>> Documentation/memory-barriers.txt gives an explicit example demanding
> >>> the wmb() before writel() for ordering system memory against writel.
> > 
> > This is a bug in the documentation.
> > 
> >> Indeed, but it's in an example for when to use dma_wmb(), not wmb().
> >> Adding Alexander Duyck to Cc, he added that section as part of
> >> 1077fa36f23e ("arch: Add lightweight memory barriers dma_rmb() and
> >> dma_wmb()"). Also adding the other people that were involved with that.
> > 
> > Linus himself made it very clear years ago. readl and writel have to
> > order vs memory accesses.
> > 
> >>> I actually have no idea why ARM had that barrier, I always assumed it
> >>> was to give program ordering to the accesses and that _relaxed allowed
> >>> re-ordering (the usual meaning of relaxed)..
> >>>
> >>> But the barrier document makes it pretty clear that the only
> >>> difference between the two is spinlock containment, and WillD wrote
> >>> this text, so I belive it is accurate for ARM.
> >>>
> >>> Very confusing.
> >>
> >> It does mention serialization with both DMA and locks in the
> >> section about  readX_relaxed()/writeX_relaxed(). The part
> >> about DMA is very clear here, and I must have just forgotten
> >> the exact semantics with regards to spinlocks. I'm still not
> >> sure what prevents a writel() from leaking out the end of a
> >> spinlock section that doesn't happen with writel_relaxed(), since
> >> the barrier in writel() comes before the access, and the
> >> spin_unlock() shouldn't affect the external buses.
> > 
> > So...
> > 
> > Historically, what happened is that we (we means whoever participated
> > in the discussion on the list with Linus calling the shots really)
> > decided that there was no sane way for drivers to understand a world
> > where readl/writel didn't fully order things vs. memory accesses (ie,
> > DMA).
> > 
> > So it should always be correct to do:
> > 
> > - Write to some in-memory buffer
> > - writel() to kick the DMA read of that buffer
> > 
> > without any extra barrier.
> > 
> > The spinlock situation however got murky. Mostly that came up because
> > on architecture (I forgot who, might have been ia64) has a hard time
> > providing that consistency without making writel insanely expensive.
> > 
> > Thus they created mmiowb whose main purpose was precisely to order
> > writel with a following spin_unlock.
> > 
> > I decided not to go down that path on power because getting all drivers
> > "fixed" to do the right thing was going to be a losing battle, and
> > instead added per-cpu tracking of writel in order to "escalate" to a
> > heavier barrier in spin_unlock itself when necessary.
> > 
> > Now, all this happened more than a decade ago and it's possible that
> > the understanding or expectations "shifted" over time...
> 
> Alex is raising concerns on the netdev list.
> 
> Sinan
> "We are being told that if you use writel(), then you don't need a wmb() on
> all architectures."
> 
> Alex:
> "I'm not sure who told you that but that is incorrect, at least for
> x86. If you attempt to use writel() without the wmb() we will have to
> NAK the patches. We will accept the wmb() with writel_releaxed() since
> that solves things for ARM."
> 
> > Jason is seeking behavior clarification for write combined buffers.
> 
> Alex:
> "Don't bother. I can tell you right now that for x86 you have to have a
> wmb() before the writel().

To clarify: are you saying that on x86 you need a wmb() prior to a writel
if you want that writel to be ordered after prior writes to memory? Is this
specific to WC memory or some other non-standard attribute?

The only reason we have wmb() inside writel() on arm, arm64 and power is for
parity with x86 because Linus (CC'd) wanted architectures to order I/O vs
memory by default so that it was easier to write portable drivers. The
performance impact of that implicit barrier is non-trivial, but 

Re: RFC on writel and writel_relaxed

2018-03-27 Thread Jose Abreu
Hi,

On 27-03-2018 15:46, Sinan Kaya wrote:
>
> Sinan
> "We are being told that if you use writel(), then you don't need a wmb() on
> all architectures."
>
> Alex:
> "I'm not sure who told you that but that is incorrect, at least for
> x86. If you attempt to use writel() without the wmb() we will have to
> NAK the patches. We will accept the wmb() with writel_releaxed() since
> that solves things for ARM."
>

So this means we should always use writel() + wmb() in *all*
accesses? I don't know about x86 but arc architecture doesn't
have a wmb() in the writel() function (in some configs).

I see the point in net drivers while you have dma + io accesses
but for most drivers this shouldn't be needed, right?

What about ordering of writes? Is it guaranteed that one write
will happen before the next one ?

Best Regards,
Jose Miguel Abreu



Re: RFC on writel and writel_relaxed

2018-03-27 Thread Sinan Kaya
+netdev, +Alex

On 3/26/2018 6:00 PM, Benjamin Herrenschmidt wrote:
> On Mon, 2018-03-26 at 23:30 +0200, Arnd Bergmann wrote:
>>  Most of the drivers have a unwound loop with writeq() or something to
>>> do it.
>>
>> But isn't the writeq() barrier much more expensive than anything you'd
>> do in function calls?
> 
> It is for us, and will break any write combining.
> 
> The same document says that _relaxed() does not give that guarentee.
>
> The lwn articule on this went into some depth on the interaction with
> spinlocks.
>
> As far as I can see, containment in a spinlock seems to be the only
> different between writel and writel_relaxed..

 I was always puzzled by this: The intention of _relaxed() on ARM
 (where it originates) was to skip the barrier that serializes DMA
 with MMIO, not to skip the serialization between MMIO and locks.
>>>
>>> But that was never a requirement of writel(),
>>> Documentation/memory-barriers.txt gives an explicit example demanding
>>> the wmb() before writel() for ordering system memory against writel.
> 
> This is a bug in the documentation.
> 
>> Indeed, but it's in an example for when to use dma_wmb(), not wmb().
>> Adding Alexander Duyck to Cc, he added that section as part of
>> 1077fa36f23e ("arch: Add lightweight memory barriers dma_rmb() and
>> dma_wmb()"). Also adding the other people that were involved with that.
> 
> Linus himself made it very clear years ago. readl and writel have to
> order vs memory accesses.
> 
>>> I actually have no idea why ARM had that barrier, I always assumed it
>>> was to give program ordering to the accesses and that _relaxed allowed
>>> re-ordering (the usual meaning of relaxed)..
>>>
>>> But the barrier document makes it pretty clear that the only
>>> difference between the two is spinlock containment, and WillD wrote
>>> this text, so I belive it is accurate for ARM.
>>>
>>> Very confusing.
>>
>> It does mention serialization with both DMA and locks in the
>> section about  readX_relaxed()/writeX_relaxed(). The part
>> about DMA is very clear here, and I must have just forgotten
>> the exact semantics with regards to spinlocks. I'm still not
>> sure what prevents a writel() from leaking out the end of a
>> spinlock section that doesn't happen with writel_relaxed(), since
>> the barrier in writel() comes before the access, and the
>> spin_unlock() shouldn't affect the external buses.
> 
> So...
> 
> Historically, what happened is that we (we means whoever participated
> in the discussion on the list with Linus calling the shots really)
> decided that there was no sane way for drivers to understand a world
> where readl/writel didn't fully order things vs. memory accesses (ie,
> DMA).
> 
> So it should always be correct to do:
> 
>   - Write to some in-memory buffer
>   - writel() to kick the DMA read of that buffer
> 
> without any extra barrier.
> 
> The spinlock situation however got murky. Mostly that came up because
> on architecture (I forgot who, might have been ia64) has a hard time
> providing that consistency without making writel insanely expensive.
> 
> Thus they created mmiowb whose main purpose was precisely to order
> writel with a following spin_unlock.
> 
> I decided not to go down that path on power because getting all drivers
> "fixed" to do the right thing was going to be a losing battle, and
> instead added per-cpu tracking of writel in order to "escalate" to a
> heavier barrier in spin_unlock itself when necessary.
> 
> Now, all this happened more than a decade ago and it's possible that
> the understanding or expectations "shifted" over time...

Alex is raising concerns on the netdev list.

Sinan
"We are being told that if you use writel(), then you don't need a wmb() on
all architectures."

Alex:
"I'm not sure who told you that but that is incorrect, at least for
x86. If you attempt to use writel() without the wmb() we will have to
NAK the patches. We will accept the wmb() with writel_releaxed() since
that solves things for ARM."

> Jason is seeking behavior clarification for write combined buffers.

Alex:
"Don't bother. I can tell you right now that for x86 you have to have a
wmb() before the writel().

Based on the comment in
(https://www.spinics.net/lists/linux-rdma/msg62666.html):
Replacing wmb() + writel() with wmb() + writel_relaxed() will work on
PPC, it will just not give you a benefit today.

I say the patch set stays. This gives benefit on ARM, and has no
effect on x86 and PowerPC. If you want to look at trying to optimize
things further on PowerPC and such then go for it in terms of trying
to implement the writel_relaxed(). Otherwise I say we call the ARM
goodness a win and don't get ourselves too wrapped up in trying to fix
this for all architectures."



> 
> Cheers,
> Ben.
>  
> 


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code