Re: [PATCH v8 0/7] netdev: intel: Eliminate duplicate barriers on weakly-ordered archs

2018-07-30 Thread Sinan Kaya

+netdev,

On 7/30/2018 9:45 AM, Alexander Duyck wrote:

I haven't had a chance to work on this much myself. My understanding
is that igb has had the barriers updated, but I don't think any of the
other drivers have been worked over yet.


Unfortunately, I have recently changed jobs and I no longer have the
hardware to test my changes. I thought that you wanted to handle this
yourself.

I haven't seen any follow ups. I wanted to double check.

I worked with several architecture leads on 4.17. All architectures
support the updated barrier semantics now. It is time to clean up the
network drivers.

92d7223a7423 alpha: io: reorder barriers to guarantee writeX() and 
iowriteX() ordering #2

a1cc7034e33d MIPS: io: Add barrier after register read in readX()
f6b7aeee8f16 MIPS: io: Prevent compiler reordering writeX()
a71e7c44ffb7 io: change writeX_relaxed() to remove barriers
8875c5543761 io: change readX_relaxed() to remove barriers
cd0e00c10672 alpha: io: reorder barriers to guarantee writeX() and 
iowriteX() ordering

87fe2d543f81 io: change inX() to have their own IO barrier overrides
a7851aa54c0c io: change outX() to have their own IO barrier overrides
755bd04aaf4b io: define stronger ordering for the default writeX() 
implementation
032d59e1cde9 io: define stronger ordering for the default readX() 
implementation
64e2c6738b4d io: define several IO & PIO barrier types for the 
asm-generic version







Re: [PATCH v1 0/4] PCI: Remove unnecessary includes of

2018-07-25 Thread Sinan Kaya

On 7/25/2018 12:52 PM, Bjorn Helgaas wrote:

emove includes of  from files that don't need
it.  I'll apply all these via the PCI tree unless there's objection.

---

Bjorn Helgaas (4):
   igb: Remove unnecessary include of 
   ath9k: Remove unnecessary include of 
   iwlwifi: Remove unnecessary include of 
   PCI: Remove unnecessary include of 


Thanks.

Reviewed-by: Sinan Kaya 

Is it possible to kill that file altogether? I haven't looked who is
using outside of pci directory.


Re: [PATCH v8 0/7] netdev: intel: Eliminate duplicate barriers on weakly-ordered archs

2018-04-03 Thread Sinan Kaya
On 4/3/2018 1:47 PM, Alexander Duyck wrote:
> On Mon, Apr 2, 2018 at 7:59 PM, Sinan Kaya <ok...@codeaurora.org> wrote:
>> Alex,
>>
>> On 4/2/2018 3:06 PM, Sinan Kaya wrote:
>>> Code includes wmb() followed by writel() in multiple places. writel()
>>> already has a barrier on some architectures like arm64.
>>>
>>> This ends up CPU observing two barriers back to back before executing the
>>> register write.
>>>
>>> Since code already has an explicit barrier call, changing writel() to
>>> writel_relaxed().
>>>
>>> I did a regex search for wmb() followed by writel() in each drivers
>>> directory. I scrubbed the ones I care about in this series.
>>>
>>> I considered "ease of change", "popular usage" and "performance critical
>>> path" as the determining criteria for my filtering.
>>>
>>> We used relaxed API heavily on ARM for a long time but
>>> it did not exist on other architectures. For this reason, relaxed
>>> architectures have been paying double penalty in order to use the common
>>> drivers.
>>>
>>> Now that relaxed API is present on all architectures, we can go and scrub
>>> all drivers to see what needs to change and what can remain.
>>>
>>> We start with mostly used ones and hope to increase the coverage over time.
>>> It will take a while to cover all drivers.
>>>
>>> Feel free to apply patches individually.
>>>
>>> Change since 7:
>>>
>>> * API clarification with Linus and several architecture folks regarding
>>> writel()
>>>
>>> https://www.mail-archive.com/netdev@vger.kernel.org/msg225806.html
>>>
>>> * removed wmb() in front of writel() at several places.
>>> * remove old IA64 comments regarding ordering.
>>>
>>
>> What do you think about this version? Did I miss any SMP barriers?
> 
> I would say we should probably just drop the whole set and start over.
> If we don't need the wmb() we are going to need to go through and
> clean up all of these paths and consider the barriers when updating
> the layout of the code.
> 
> For example I have been thinking about it and in the case of x86 we
> are probably better off not bothering with the wmb() and
> writel_relaxed() and instead switch over to the smp_wmb() and writel()
> since in the case of a strongly ordered system like x86 or sparc this
> ends up translating out to a couple of compile barriers.
> 
> I will also need time to reevaluate the Rx paths since dropping the
> wmb() may have other effects which I need to verify.

Sounds good, I'll let you work on it.

@Jeff Kirsher: can you drop this version from your development branch until
Alex posts the next version?

> 
> Thanks.
> 
> - Alex
> 


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: [PATCH v8 0/7] netdev: intel: Eliminate duplicate barriers on weakly-ordered archs

2018-04-02 Thread Sinan Kaya
Alex,

On 4/2/2018 3:06 PM, Sinan Kaya wrote:
> Code includes wmb() followed by writel() in multiple places. writel()
> already has a barrier on some architectures like arm64.
> 
> This ends up CPU observing two barriers back to back before executing the
> register write.
> 
> Since code already has an explicit barrier call, changing writel() to
> writel_relaxed().
> 
> I did a regex search for wmb() followed by writel() in each drivers
> directory. I scrubbed the ones I care about in this series.
> 
> I considered "ease of change", "popular usage" and "performance critical
> path" as the determining criteria for my filtering.
> 
> We used relaxed API heavily on ARM for a long time but
> it did not exist on other architectures. For this reason, relaxed
> architectures have been paying double penalty in order to use the common
> drivers.
> 
> Now that relaxed API is present on all architectures, we can go and scrub
> all drivers to see what needs to change and what can remain.
> 
> We start with mostly used ones and hope to increase the coverage over time.
> It will take a while to cover all drivers.
> 
> Feel free to apply patches individually.
> 
> Change since 7:
> 
> * API clarification with Linus and several architecture folks regarding
> writel()
> 
> https://www.mail-archive.com/netdev@vger.kernel.org/msg225806.html
> 
> * removed wmb() in front of writel() at several places.
> * remove old IA64 comments regarding ordering.
> 

What do you think about this version? Did I miss any SMP barriers?


> Sinan Kaya (7):
>   i40e/i40evf: Eliminate duplicate barriers on weakly-ordered archs
>   ixgbe: eliminate duplicate barriers on weakly-ordered archs
>   igbvf: eliminate duplicate barriers on weakly-ordered archs
>   igb: eliminate duplicate barriers on weakly-ordered archs
>   fm10k: Eliminate duplicate barriers on weakly-ordered archs
>   ixgbevf: keep writel() closer to wmb()
>   ixgbevf: eliminate duplicate barriers on weakly-ordered archs
> 
>  drivers/net/ethernet/intel/fm10k/fm10k_main.c | 15 ++---
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 22 --
>  drivers/net/ethernet/intel/i40evf/i40e_txrx.c |  8 +--
>  drivers/net/ethernet/intel/igb/igb_main.c | 14 ++--
>  drivers/net/ethernet/intel/igbvf/netdev.c | 16 +-
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 23 ++-
>  drivers/net/ethernet/intel/ixgbevf/ixgbevf.h  |  5 -
>  drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 27 
> ---
>  8 files changed, 30 insertions(+), 100 deletions(-)
> 

Sinan

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


[PATCH v8 2/7] ixgbe: eliminate duplicate barriers on weakly-ordered archs

2018-04-02 Thread Sinan Kaya
memory-barriers.txt has been updated as follows:

"When using writel(), a prior wmb() is not needed to guarantee that the
cache coherent memory writes have completed before writing to the MMIO
region."

Remove old IA-64 comments in the code along with unneeded wmb() in front
of writel().

There are places in the code where wmb() has been used as a double barrier
for CPU and IO in place of smp_wmb() and wmb() as an optimization. For
such places, keep the wmb() but replace the following writel() with
writel_relaxed() to have a sequence as

wmb()
writel_relaxed()
mmio_wb()

Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 23 ++-
 1 file changed, 2 insertions(+), 21 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index afadba9..c17924b 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1696,12 +1696,6 @@ void ixgbe_alloc_rx_buffers(struct ixgbe_ring *rx_ring, 
u16 cleaned_count)
/* update next to alloc since we have filled the ring */
rx_ring->next_to_alloc = i;
 
-   /* Force memory writes to complete before letting h/w
-* know there are new descriptors to fetch.  (Only
-* applicable for weak-ordered memory model archs,
-* such as IA-64).
-*/
-   wmb();
writel(i, rx_ring->tail);
}
 }
@@ -2467,10 +2461,6 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector 
*q_vector,
if (xdp_xmit) {
struct ixgbe_ring *ring = adapter->xdp_ring[smp_processor_id()];
 
-   /* Force memory writes to complete before letting h/w
-* know there are new descriptors to fetch.
-*/
-   wmb();
writel(ring->next_to_use, ring->tail);
 
xdp_do_flush_map();
@@ -8080,12 +8070,7 @@ static int ixgbe_tx_map(struct ixgbe_ring *tx_ring,
/* set the timestamp */
first->time_stamp = jiffies;
 
-   /*
-* Force memory writes to complete before letting h/w know there
-* are new descriptors to fetch.  (Only applicable for weak-ordered
-* memory model archs, such as IA-64).
-*
-* We also need this memory barrier to make certain all of the
+   /* We need this memory barrier to make certain all of the
 * status bits have been updated before next_to_watch is written.
 */
wmb();
@@ -8102,7 +8087,7 @@ static int ixgbe_tx_map(struct ixgbe_ring *tx_ring,
ixgbe_maybe_stop_tx(tx_ring, DESC_NEEDED);
 
if (netif_xmit_stopped(txring_txq(tx_ring)) || !skb->xmit_more) {
-   writel(i, tx_ring->tail);
+   writel_relaxed(i, tx_ring->tail);
 
/* we need this if more than one processor can write to our tail
 * at a time, it synchronizes IO on IA64/Altix systems
@@ -10034,10 +10019,6 @@ static void ixgbe_xdp_flush(struct net_device *dev)
if (unlikely(!ring))
return;
 
-   /* Force memory writes to complete before letting h/w know there
-* are new descriptors to fetch.
-*/
-   wmb();
writel(ring->next_to_use, ring->tail);
 
return;
-- 
2.7.4



[PATCH v8 1/7] i40e/i40evf: Eliminate duplicate barriers on weakly-ordered archs

2018-04-02 Thread Sinan Kaya
memory-barriers.txt has been updated as follows:

"When using writel(), a prior wmb() is not needed to guarantee that the
cache coherent memory writes have completed before writing to the MMIO
region."

Remove old IA-64 comments in the code along with unneeded wmb() in front
of writel().

There are places in the code where wmb() has been used as a double barrier
for CPU and IO in place of smp_wmb() and wmb() as an optimization. For
such places, keep the wmb() but replace the following writel() with
writel_relaxed() to have a sequence as

wmb()
writel_relaxed()
mmio_wb()

Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 22 +-
 drivers/net/ethernet/intel/i40evf/i40e_txrx.c |  8 +---
 2 files changed, 10 insertions(+), 20 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c 
b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index f174c72..1b9fa7a 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -186,7 +186,13 @@ static int i40e_program_fdir_filter(struct 
i40e_fdir_filter *fdir_data,
/* Mark the data descriptor to be watched */
first->next_to_watch = tx_desc;
 
-   writel(tx_ring->next_to_use, tx_ring->tail);
+   writel_relaxed(tx_ring->next_to_use, tx_ring->tail);
+
+   /* We need this if more than one processor can write to our tail
+* at a time, it synchronizes IO on IA64/Altix systems
+*/
+   mmiowb();
+
return 0;
 
 dma_fail:
@@ -1523,12 +1529,6 @@ static inline void i40e_release_rx_desc(struct i40e_ring 
*rx_ring, u32 val)
/* update next to alloc since we have filled the ring */
rx_ring->next_to_alloc = val;
 
-   /* Force memory writes to complete before letting h/w
-* know there are new descriptors to fetch.  (Only
-* applicable for weak-ordered memory model archs,
-* such as IA-64).
-*/
-   wmb();
writel(val, rx_ring->tail);
 }
 
@@ -2274,11 +2274,7 @@ static void i40e_rx_buffer_flip(struct i40e_ring 
*rx_ring,
 
 static inline void i40e_xdp_ring_update_tail(struct i40e_ring *xdp_ring)
 {
-   /* Force memory writes to complete before letting h/w
-* know there are new descriptors to fetch.
-*/
-   wmb();
-   writel_relaxed(xdp_ring->next_to_use, xdp_ring->tail);
+   writel(xdp_ring->next_to_use, xdp_ring->tail);
 }
 
 /**
@@ -3444,7 +3440,7 @@ static inline int i40e_tx_map(struct i40e_ring *tx_ring, 
struct sk_buff *skb,
 
/* notify HW of packet */
if (netif_xmit_stopped(txring_txq(tx_ring)) || !skb->xmit_more) {
-   writel(i, tx_ring->tail);
+   writel_relaxed(i, tx_ring->tail);
 
/* we need this if more than one processor can write to our tail
 * at a time, it synchronizes IO on IA64/Altix systems
diff --git a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c 
b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
index 12bd937..eb5556e 100644
--- a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
@@ -804,12 +804,6 @@ static inline void i40e_release_rx_desc(struct i40e_ring 
*rx_ring, u32 val)
/* update next to alloc since we have filled the ring */
rx_ring->next_to_alloc = val;
 
-   /* Force memory writes to complete before letting h/w
-* know there are new descriptors to fetch.  (Only
-* applicable for weak-ordered memory model archs,
-* such as IA-64).
-*/
-   wmb();
writel(val, rx_ring->tail);
 }
 
@@ -2379,7 +2373,7 @@ static inline void i40evf_tx_map(struct i40e_ring 
*tx_ring, struct sk_buff *skb,
 
/* notify HW of packet */
if (netif_xmit_stopped(txring_txq(tx_ring)) || !skb->xmit_more) {
-   writel(i, tx_ring->tail);
+   writel_relaxed(i, tx_ring->tail);
 
/* we need this if more than one processor can write to our tail
 * at a time, it synchronizes IO on IA64/Altix systems
-- 
2.7.4



[PATCH v8 5/7] fm10k: Eliminate duplicate barriers on weakly-ordered archs

2018-04-02 Thread Sinan Kaya
memory-barriers.txt has been updated as follows:

"When using writel(), a prior wmb() is not needed to guarantee that the
cache coherent memory writes have completed before writing to the MMIO
region."

Remove old IA-64 comments in the code along with unneeded wmb() in front
of writel().

There are places in the code where wmb() has been used as a double barrier
for CPU and IO in place of smp_wmb() and wmb() as an optimization. For
such places, keep the wmb() but replace the following writel() with
writel_relaxed() to have a sequence as

wmb()
writel_relaxed()
mmio_wb()

Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
---
 drivers/net/ethernet/intel/fm10k/fm10k_main.c | 15 ++-
 1 file changed, 2 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_main.c 
b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
index df86070..41e3aaa 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_main.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
@@ -172,13 +172,6 @@ void fm10k_alloc_rx_buffers(struct fm10k_ring *rx_ring, 
u16 cleaned_count)
/* update next to alloc since we have filled the ring */
rx_ring->next_to_alloc = i;
 
-   /* Force memory writes to complete before letting h/w
-* know there are new descriptors to fetch.  (Only
-* applicable for weak-ordered memory model archs,
-* such as IA-64).
-*/
-   wmb();
-
/* notify hardware of new descriptors */
writel(i, rx_ring->tail);
}
@@ -1036,11 +1029,7 @@ static void fm10k_tx_map(struct fm10k_ring *tx_ring,
/* record SW timestamp if HW timestamp is not available */
skb_tx_timestamp(first->skb);
 
-   /* Force memory writes to complete before letting h/w know there
-* are new descriptors to fetch.  (Only applicable for weak-ordered
-* memory model archs, such as IA-64).
-*
-* We also need this memory barrier to make certain all of the
+   /* We need this memory barrier to make certain all of the
 * status bits have been updated before next_to_watch is written.
 */
wmb();
@@ -1055,7 +1044,7 @@ static void fm10k_tx_map(struct fm10k_ring *tx_ring,
 
/* notify HW of packet */
if (netif_xmit_stopped(txring_txq(tx_ring)) || !skb->xmit_more) {
-   writel(i, tx_ring->tail);
+   writel_relaxed(i, tx_ring->tail);
 
/* we need this if more than one processor can write to our tail
 * at a time, it synchronizes IO on IA64/Altix systems
-- 
2.7.4



[PATCH v8 4/7] igb: eliminate duplicate barriers on weakly-ordered archs

2018-04-02 Thread Sinan Kaya
memory-barriers.txt has been updated as follows:

"When using writel(), a prior wmb() is not needed to guarantee that the
cache coherent memory writes have completed before writing to the MMIO
region."

Remove old IA-64 comments in the code along with unneeded wmb() in front
of writel().

There are places in the code where wmb() has been used as a double barrier
for CPU and IO in place of smp_wmb() and wmb() as an optimization. For
such places, keep the wmb() but replace the following writel() with
writel_relaxed() to have a sequence as

wmb()
writel_relaxed()
mmio_wb()

Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
---
 drivers/net/ethernet/intel/igb/igb_main.c | 14 ++
 1 file changed, 2 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c 
b/drivers/net/ethernet/intel/igb/igb_main.c
index c1c0bc3..c3f7130 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -5652,11 +5652,7 @@ static int igb_tx_map(struct igb_ring *tx_ring,
/* set the timestamp */
first->time_stamp = jiffies;
 
-   /* Force memory writes to complete before letting h/w know there
-* are new descriptors to fetch.  (Only applicable for weak-ordered
-* memory model archs, such as IA-64).
-*
-* We also need this memory barrier to make certain all of the
+   /* We need this memory barrier to make certain all of the
 * status bits have been updated before next_to_watch is written.
 */
wmb();
@@ -5674,7 +5670,7 @@ static int igb_tx_map(struct igb_ring *tx_ring,
igb_maybe_stop_tx(tx_ring, DESC_NEEDED);
 
if (netif_xmit_stopped(txring_txq(tx_ring)) || !skb->xmit_more) {
-   writel(i, tx_ring->tail);
+   writel_relaxed(i, tx_ring->tail);
 
/* we need this if more than one processor can write to our tail
 * at a time, it synchronizes IO on IA64/Altix systems
@@ -8073,12 +8069,6 @@ void igb_alloc_rx_buffers(struct igb_ring *rx_ring, u16 
cleaned_count)
/* update next to alloc since we have filled the ring */
rx_ring->next_to_alloc = i;
 
-   /* Force memory writes to complete before letting h/w
-* know there are new descriptors to fetch.  (Only
-* applicable for weak-ordered memory model archs,
-* such as IA-64).
-*/
-   wmb();
writel(i, rx_ring->tail);
}
 }
-- 
2.7.4



[PATCH v8 6/7] ixgbevf: keep writel() closer to wmb()

2018-04-02 Thread Sinan Kaya
Remove ixgbevf_write_tail() in favor of moving writel() close to
wmb().

Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
Reviewed-by: Alexander Duyck <alexander.h.du...@intel.com>
---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf.h  | 5 -
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 6 +++---
 2 files changed, 3 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h 
b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
index 447ce1d..c75ea1f 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
@@ -312,11 +312,6 @@ static inline u16 ixgbevf_desc_unused(struct ixgbevf_ring 
*ring)
return ((ntc > ntu) ? 0 : ring->count) + ntc - ntu - 1;
 }
 
-static inline void ixgbevf_write_tail(struct ixgbevf_ring *ring, u32 value)
-{
-   writel(value, ring->tail);
-}
-
 #define IXGBEVF_RX_DESC(R, i)  \
(&(((union ixgbe_adv_rx_desc *)((R)->desc))[i]))
 #define IXGBEVF_TX_DESC(R, i)  \
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c 
b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index e3d04f2..757dac6 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -725,7 +725,7 @@ static void ixgbevf_alloc_rx_buffers(struct ixgbevf_ring 
*rx_ring,
 * such as IA-64).
 */
wmb();
-   ixgbevf_write_tail(rx_ring, i);
+   writel(i, rx_ring->tail);
}
 }
 
@@ -1232,7 +1232,7 @@ static int ixgbevf_clean_rx_irq(struct ixgbevf_q_vector 
*q_vector,
 * know there are new descriptors to fetch.
 */
wmb();
-   ixgbevf_write_tail(xdp_ring, xdp_ring->next_to_use);
+   writel(xdp_ring->next_to_use, xdp_ring->tail);
}
 
u64_stats_update_begin(_ring->syncp);
@@ -4004,7 +4004,7 @@ static void ixgbevf_tx_map(struct ixgbevf_ring *tx_ring,
tx_ring->next_to_use = i;
 
/* notify HW of packet */
-   ixgbevf_write_tail(tx_ring, i);
+   writel(i, tx_ring->tail);
 
return;
 dma_error:
-- 
2.7.4



[PATCH v8 7/7] ixgbevf: eliminate duplicate barriers on weakly-ordered archs

2018-04-02 Thread Sinan Kaya
memory-barriers.txt has been updated as follows:

"When using writel(), a prior wmb() is not needed to guarantee that the
cache coherent memory writes have completed before writing to the MMIO
region."

Remove old IA-64 comments in the code along with unneeded wmb() in front
of writel().

There are places in the code where wmb() has been used as a double barrier
for CPU and IO in place of smp_wmb() and wmb() as an optimization. For
such places, keep the wmb() but replace the following writel() with
writel_relaxed() to have a sequence as

wmb()
writel_relaxed()
mmio_wb()

Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 23 +++
 1 file changed, 7 insertions(+), 16 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c 
b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 757dac6..29b71a7 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -719,12 +719,6 @@ static void ixgbevf_alloc_rx_buffers(struct ixgbevf_ring 
*rx_ring,
/* update next to alloc since we have filled the ring */
rx_ring->next_to_alloc = i;
 
-   /* Force memory writes to complete before letting h/w
-* know there are new descriptors to fetch.  (Only
-* applicable for weak-ordered memory model archs,
-* such as IA-64).
-*/
-   wmb();
writel(i, rx_ring->tail);
}
 }
@@ -1228,10 +1222,6 @@ static int ixgbevf_clean_rx_irq(struct ixgbevf_q_vector 
*q_vector,
struct ixgbevf_ring *xdp_ring =
adapter->xdp_ring[rx_ring->queue_index];
 
-   /* Force memory writes to complete before letting h/w
-* know there are new descriptors to fetch.
-*/
-   wmb();
writel(xdp_ring->next_to_use, xdp_ring->tail);
}
 
@@ -3985,11 +3975,7 @@ static void ixgbevf_tx_map(struct ixgbevf_ring *tx_ring,
/* set the timestamp */
first->time_stamp = jiffies;
 
-   /* Force memory writes to complete before letting h/w know there
-* are new descriptors to fetch.  (Only applicable for weak-ordered
-* memory model archs, such as IA-64).
-*
-* We also need this memory barrier (wmb) to make certain all of the
+   /* We also need this memory barrier (wmb) to make certain all of the
 * status bits have been updated before next_to_watch is written.
 */
wmb();
@@ -4004,7 +3990,12 @@ static void ixgbevf_tx_map(struct ixgbevf_ring *tx_ring,
tx_ring->next_to_use = i;
 
/* notify HW of packet */
-   writel(i, tx_ring->tail);
+   writel_relaxed(i, tx_ring->tail);
+
+   /* We need this if more than one processor can write to our tail
+* at a time, it synchronizes IO on IA64/Altix systems
+*/
+   mmiowb();
 
return;
 dma_error:
-- 
2.7.4



[PATCH v8 3/7] igbvf: eliminate duplicate barriers on weakly-ordered archs

2018-04-02 Thread Sinan Kaya
memory-barriers.txt has been updated as follows:

"When using writel(), a prior wmb() is not needed to guarantee that the
cache coherent memory writes have completed before writing to the MMIO
region."

Remove old IA-64 comments in the code along with unneeded wmb() in front
of writel().

There are places in the code where wmb() has been used as a double barrier
for CPU and IO in place of smp_wmb() and wmb() as an optimization. For
such places, keep the wmb() but replace the following writel() with
writel_relaxed() to have a sequence as

wmb()
writel_relaxed()
mmio_wb()

Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
---
 drivers/net/ethernet/intel/igbvf/netdev.c | 16 +---
 1 file changed, 5 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/intel/igbvf/netdev.c 
b/drivers/net/ethernet/intel/igbvf/netdev.c
index e2b7502..d9f186a 100644
--- a/drivers/net/ethernet/intel/igbvf/netdev.c
+++ b/drivers/net/ethernet/intel/igbvf/netdev.c
@@ -246,12 +246,6 @@ static void igbvf_alloc_rx_buffers(struct igbvf_ring 
*rx_ring,
else
i--;
 
-   /* Force memory writes to complete before letting h/w
-* know there are new descriptors to fetch.  (Only
-* applicable for weak-ordered memory model archs,
-* such as IA-64).
-   */
-   wmb();
writel(i, adapter->hw.hw_addr + rx_ring->tail);
}
 }
@@ -2289,16 +2283,16 @@ static inline void igbvf_tx_queue_adv(struct 
igbvf_adapter *adapter,
}
 
tx_desc->read.cmd_type_len |= cpu_to_le32(adapter->txd_cmd);
-   /* Force memory writes to complete before letting h/w
-* know there are new descriptors to fetch.  (Only
-* applicable for weak-ordered memory model archs,
-* such as IA-64).
+
+   /* We use this memory barrier to make certain all of the
+* status bits have been updated before next_to_watch is
+* written.
 */
wmb();
 
tx_ring->buffer_info[first].next_to_watch = tx_desc;
tx_ring->next_to_use = i;
-   writel(i, adapter->hw.hw_addr + tx_ring->tail);
+   writel_relaxed(i, adapter->hw.hw_addr + tx_ring->tail);
/* we need this if more than one processor can write to our tail
 * at a time, it synchronizes IO on IA64/Altix systems
 */
-- 
2.7.4



[PATCH v8 0/7] netdev: intel: Eliminate duplicate barriers on weakly-ordered archs

2018-04-02 Thread Sinan Kaya
Code includes wmb() followed by writel() in multiple places. writel()
already has a barrier on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

I did a regex search for wmb() followed by writel() in each drivers
directory. I scrubbed the ones I care about in this series.

I considered "ease of change", "popular usage" and "performance critical
path" as the determining criteria for my filtering.

We used relaxed API heavily on ARM for a long time but
it did not exist on other architectures. For this reason, relaxed
architectures have been paying double penalty in order to use the common
drivers.

Now that relaxed API is present on all architectures, we can go and scrub
all drivers to see what needs to change and what can remain.

We start with mostly used ones and hope to increase the coverage over time.
It will take a while to cover all drivers.

Feel free to apply patches individually.

Change since 7:

* API clarification with Linus and several architecture folks regarding
writel()

https://www.mail-archive.com/netdev@vger.kernel.org/msg225806.html

* removed wmb() in front of writel() at several places.
* remove old IA64 comments regarding ordering.

Sinan Kaya (7):
  i40e/i40evf: Eliminate duplicate barriers on weakly-ordered archs
  ixgbe: eliminate duplicate barriers on weakly-ordered archs
  igbvf: eliminate duplicate barriers on weakly-ordered archs
  igb: eliminate duplicate barriers on weakly-ordered archs
  fm10k: Eliminate duplicate barriers on weakly-ordered archs
  ixgbevf: keep writel() closer to wmb()
  ixgbevf: eliminate duplicate barriers on weakly-ordered archs

 drivers/net/ethernet/intel/fm10k/fm10k_main.c | 15 ++---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 22 --
 drivers/net/ethernet/intel/i40evf/i40e_txrx.c |  8 +--
 drivers/net/ethernet/intel/igb/igb_main.c | 14 ++--
 drivers/net/ethernet/intel/igbvf/netdev.c | 16 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 23 ++-
 drivers/net/ethernet/intel/ixgbevf/ixgbevf.h  |  5 -
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 27 ---
 8 files changed, 30 insertions(+), 100 deletions(-)

-- 
2.7.4



Re: RFC on writel and writel_relaxed

2018-04-02 Thread Sinan Kaya
On 3/29/2018 9:40 PM, Benjamin Herrenschmidt wrote:
> On Thu, 2018-03-29 at 09:56 -0400, Sinan Kaya wrote:
>> On 3/28/2018 11:55 AM, David Miller wrote:
>>> From: Benjamin Herrenschmidt <b...@kernel.crashing.org>
>>> Date: Thu, 29 Mar 2018 02:13:16 +1100
>>>
>>>> Let's fix all archs, it's way easier than fixing all drivers. Half of
>>>> the archs are unused or dead anyway.
>>>
>>> Agreed.
>>>
>>
>> I pinged most of the maintainers yesterday.
>> Which arches do we care about these days?
>> I have not been paying attention any other architecture besides arm64.
> 
> Thanks for going through that exercise !
> 
> Once sparc, s390, microblaze and mips reply, I think we'll have a good
> coverage, maybe riscv is to put in that lot too.


I posted the following two patches for supporting microblaze and unicore32. 

[PATCH v2 1/2] io: prevent compiler reordering on the default writeX() 
implementation
[PATCH v2 2/2] io: prevent compiler reordering on the default readX() 
implementation

The rest of the arches except mips and alpha seem OK.
I sent a question email on Friday to mips and alpha mailing lists. I'll follow 
up with
an actual patch today.


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: RFC on writel and writel_relaxed

2018-03-29 Thread Sinan Kaya
On 3/29/2018 12:29 PM, Arnd Bergmann wrote:
> On Thu, Mar 29, 2018 at 3:56 PM, Sinan Kaya <ok...@codeaurora.org> wrote:
>> On 3/28/2018 11:55 AM, David Miller wrote:
>>> From: Benjamin Herrenschmidt <b...@kernel.crashing.org>
>>> Date: Thu, 29 Mar 2018 02:13:16 +1100
>>>
>>>> Let's fix all archs, it's way easier than fixing all drivers. Half of
>>>> the archs are unused or dead anyway.
>>>
>>> Agreed.
>>>
>>
>> I pinged most of the maintainers yesterday.
>> Which arches do we care about these days?
>> I have not been paying attention any other architecture besides arm64.
>>
>> archstatus  detail
>> --  -   
>> alpha   question sent

Thanks for the detailed analysis.

> 
> I'm guessing alpha has problems
> 
> extern inline u32 readl(const volatile void __iomem *addr)
> {
> u32 ret = __raw_readl(addr);
> mb();
> return ret;
> }
> extern inline void writel(u32 b, volatile void __iomem *addr)
> {
> __raw_writel(b, addr);
> mb();
> }

Looks like a problem to me too. I'll start a thread with the alpha
people and CC you.


> 
> There is a barrier in writel /after/ the acess but not before.
> 

This is the consolidated list. I also heart back from m68k and corrected
contacts for arc and h8300.

archstatus  detail
--  -   
alpha   question sent   Arnd: alpha has problems
arc question sent   vineet.gup...@synopsys.com says he'll 
get to this
in the next few days
arm no issues
arm64   no issues
c6x no issues   no PCI
h8300   no issues   no PCI: ys...@users.sourceforge.jp will 
fix it.
hexagon no issues   no PCI
ia64no issues   confirmed by Tony Luck
m68kno issues   ge...@linux-m68k.org says no problem
metag   no issues   arnd: removed
microblaze  question sent   arnd: some mips platforms have problems
mipsquestion sent   arnd: some mips platforms have problems
nds32   question sent
nios2   no issues   no PCI
openriscno issues   sho...@gmail.com says should no issues
parisc  no issues   grantgrund...@gmail.com says most 
probably no problem
but still looking
powerpc no issues
riscv   no issues   arnd: riscv should be fine
s390no issues   arnd: Pretty sure this is also fine
sh  question sent
sparc   no issues   da...@davemloft.net says always 
strongly ordered
unicore32   question sent   resent to g...@pku.edu.cn
x86 no issues
x86_64  no issues



> 
>  Arnd
> 


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: RFC on writel and writel_relaxed

2018-03-29 Thread Sinan Kaya
On 3/28/2018 11:55 AM, David Miller wrote:
> From: Benjamin Herrenschmidt <b...@kernel.crashing.org>
> Date: Thu, 29 Mar 2018 02:13:16 +1100
> 
>> Let's fix all archs, it's way easier than fixing all drivers. Half of
>> the archs are unused or dead anyway.
> 
> Agreed.
> 

I pinged most of the maintainers yesterday.
Which arches do we care about these days?
I have not been paying attention any other architecture besides arm64.

archstatus  detail
--  -   
alpha   question sent
arc question sent   ys...@users.sourceforge.jp will fix it.
arm no issues
arm64   no issues
blackfinquestion sent   about to be removed
c6x question sent
crisquestion sent
frv
h8300   question sent
hexagon question sent
ia64no issues   confirmed by Tony Luck
m32r
m68kquestion sent
metag
microblaze  question sent
mipsquestion sent
mn10300 question sent
nios2   question sent
openriscno issues   sho...@gmail.com says should no issues
parisc  no issues   grantgrund...@gmail.com says most 
probably no problem but still looking
powerpc no issues
riscv   question sent
s390question sent
score   question sent
sh  question sent
sparc   question sent
tilequestion sent
unicore32   question sent
x86 no issues
xtensa  question sent


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: [PATCH v7 3/7] bnx2x: Replace doorbell barrier() with wmb()

2018-03-29 Thread Sinan Kaya
Hi Ariel,

On 3/29/2018 5:17 AM, Elior, Ariel wrote:
>> Subject: [PATCH v7 3/7] bnx2x: Replace doorbell barrier() with wmb()
>>
>> barrier() doesn't guarantee memory writes to be observed by the hardware on
>> all architectures. barrier() only tells compiler not to move this code
>> with respect to other read/writes.
>>
>> If memory write needs to be observed by the HW, wmb() is the right choice.
> The wmb() is there (a couple of lines above). Your modification adds an
> unnecessary fence which would hurt high pps scenarios. The memory
> writes which the HW needs to observe are the buffer descriptors, not the
> producer update message. The producer is written to the HW, and exists
> on the stack. The barrier() is there to prevent the compiler from mixing the
> order of the prod update message preparation and writing it to the host.
> A possible alternative would be to move the existing wmb() to where
> the barrier() is, achieving both goals, although in the existing design each
> barrier has a distinct purpose. The comment location is misleading, though.

I was told that barrier() is there to guarantee that HW is observing the memory
write before writel(). 

I reacted to this and changed barrier() to wmb() following the old directions. 
You are saying that this not true. 

Since then, Linus gave us direction not to have wmb() in front of writel() as
writel() already has memory-IO guarantee.

https://www.mail-archive.com/netdev@vger.kernel.org/msg225806.html

I'll be doing one more pass to remove wmb() before writel() soon. Please review
that carefully. 

Intel drivers use wmb() as a substitute for smp_wmb(). So, we can't always 
assume
that you can remove all wmb() in front of writel() as the write barrier seems to
serve dual purpose.

Please help me getting this right on the next version.

Sinan


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: RFC on writel and writel_relaxed

2018-03-27 Thread Sinan Kaya
On 3/27/2018 10:51 PM, Linus Torvalds wrote:
>> The discussion at hand is about
>>
>> dma_buffer->foo = 1;/* WB */
>> writel(KICK, DMA_KICK_REGISTER);/* UC */
> Yes. That certainly is ordered on x86. In fact, afaik it's ordered
> even if that writel() might be of type WC, because that only delays
> writes, it doesn't move them earlier.

Now that we clarified x86 myth, Is this guaranteed on all architectures?
We keep getting IA64 exception example. Maybe, this is also corrected since
then.

Jose Abreu says "I don't know about x86 but arc architecture doesn't
have a wmb() in the writel() function (in some configs)".

As long as we have these exceptions, these wmb() in common drivers is not
going anywhere and relaxed-arches will continue paying performance penalty.

I see 15% performance loss on ARM64 servers using Intel i40e network
drivers and an XL710 adapter due to CPU keeping itself busy doing barriers
most of the time rather than real work because of sequences like this all over
the place.

 dma_buffer->foo = 1;/* WB */
 wmb()
 writel(KICK, DMA_KICK_REGISTER);/* UC */

I posted several patches last week to remove duplicate barriers on ARM while
trying to make the code friendly with other architectures.

Basically changing it to

dma_buffer->foo = 1;/* WB */
wmb()
writel_relaxed(KICK, DMA_KICK_REGISTER);/* UC */
mmiowb()

This is a small step in the performance direction until we remove all 
exceptions.

https://www.spinics.net/lists/netdev/msg491842.html
https://www.spinics.net/lists/linux-rdma/msg62434.html
https://www.spinics.net/lists/arm-kernel/msg642336.html

Discussion started to move around the need for relaxed API on PPC and then
why wmb() question came up.

Sinan

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: RFC on writel and writel_relaxed

2018-03-27 Thread Sinan Kaya
On 3/27/2018 5:54 PM, Alexander Duyck wrote:
> I view the wmb() + writel_relaxed() as more of a driver owning and
> handling this itself. Besides in the Intel Ethernet driver case it is
> better performance as our wmb() placement for us also provides a
> secondary barrier so we don't need to add a separate smp_wmb() to deal
> with a potential race we have with the Tx cleanup.

Thanks for the reminder. 

I forgot about the double barrier optimization. wmb() + writel_relaxed()
seems to be the best option for Intel network drivers at this moment. 
Otherwise, we'll have to remove wmb() and throw in smp barriers there
like you mentioned.

I'll leave the changes in the Intel drivers alone.

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: [PATCH v7 0/7] netdev: intel: Eliminate duplicate barriers on weakly-ordered archs

2018-03-27 Thread Sinan Kaya
On 3/27/2018 12:54 PM, Jeff Kirsher wrote:
> On Tue, 2018-03-27 at 08:42 -0400, Sinan Kaya wrote:
>> On 3/23/2018 10:34 PM, ok...@codeaurora.org wrote:
>>> On 2018-03-23 19:58, Jeff Kirsher wrote:
>>>> On Fri, 2018-03-23 at 14:53 -0700, Alexander Duyck wrote:
>>>>> On Fri, Mar 23, 2018 at 11:52 AM, Sinan Kaya <okaya@codeaurora.
>>>>> org>
>>>>> wrote:
>>>>>> Code includes wmb() followed by writel() in multiple places.
>>>>>> writel()
>>>>>> already has a barrier on some architectures like arm64.
>>>>>>
>>>>>> This ends up CPU observing two barriers back to back before
>>>>>> executing
>>>>>> the
>>>>>> register write.
>>>>>>
>>>>>> Since code already has an explicit barrier call, changing
>>>>>> writel() to
>>>>>> writel_relaxed().
>>>>>>
>>>>>> I did a regex search for wmb() followed by writel() in each
>>>>>> drivers
>>>>>> directory.
>>>>>> I scrubbed the ones I care about in this series.
>>>>>>
>>>>>> I considered "ease of change", "popular usage" and
>>>>>> "performance
>>>>>> critical
>>>>>> path" as the determining criteria for my filtering.
>>>>>>
>>>>>> We used relaxed API heavily on ARM for a long time but
>>>>>> it did not exist on other architectures. For this reason,
>>>>>> relaxed
>>>>>> architectures have been paying double penalty in order to use
>>>>>> the
>>>>>> common
>>>>>> drivers.
>>>>>>
>>>>>> Now that relaxed API is present on all architectures, we can
>>>>>> go and
>>>>>> scrub
>>>>>> all drivers to see what needs to change and what can remain.
>>>>>>
>>>>>> We start with mostly used ones and hope to increase the
>>>>>> coverage over
>>>>>> time.
>>>>>> It will take a while to cover all drivers.
>>>>>>
>>>>>> Feel free to apply patches individually.
>>>>>
>>>>> I looked over the set and they seem good.
>>>>>
>>>>> Reviewed-by: Alexander Duyck <alexander.h.du...@intel.com>
>>>>
>>>> Grrr, patch 1 does not apply cleanly to my next-queue tree (dev-
>>>> queue
>>>> branch).  I will deal with this series in a day or two, after I
>>>> have dealt
>>>> with my driver pull requests.
>>>
>>> Sorry, you will have to replace the ones you took from me.
>>
>> Double sorry now.
>>
>> I don't know if you have been following "RFC on writel and
>> writel_relaxed" thread
>> or not but there are some new developments about wmb() requirement. 
>>
>> Basically, wmb() should never be used before writel() as writel()
>> seem to
>> provide coherency and observability guarantee.
>>
>> wmb()+writel_relaxed() is slower on some architectures than plain
>> writel()
>>
>> I'll have to rework these patches to have writel() only. 
>>
>> Are you able to drop the applied ones so that I can post V8 or is it
>> too late?
> 
> Currently I do not have any of your patches applied to my next-queue
> tree (dev-queue branch).  So feel free to do any revisions you need to
> do and to re-submit to intel-wired-...@lists.osuosl.org (IWL) mailing
> list.
> 

Thanks, Good to know. 

I'm waiting for the discussion to settle. I'll update as soon as I get
a clear direction.

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: [PATCH v7 0/7] netdev: intel: Eliminate duplicate barriers on weakly-ordered archs

2018-03-27 Thread Sinan Kaya
On 3/27/2018 10:38 AM, Alexander Duyck wrote:
>> We are being told that if you use writel(), then you don't need a wmb() on
>> all architectures.
> I'm not sure who told you that but that is incorrect, at least for
> x86. If you attempt to use writel() without the wmb() we will have to
> NAK the patches. We will accept the wmb() with writel_releaxed() since
> that solves things for ARM.
> 

I added netdev and you to the RFC on writel and writel_relaxed list. 
Feel free to raise your concerns.

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: RFC on writel and writel_relaxed

2018-03-27 Thread Sinan Kaya
+netdev, +Alex

On 3/26/2018 6:00 PM, Benjamin Herrenschmidt wrote:
> On Mon, 2018-03-26 at 23:30 +0200, Arnd Bergmann wrote:
>>  Most of the drivers have a unwound loop with writeq() or something to
>>> do it.
>>
>> But isn't the writeq() barrier much more expensive than anything you'd
>> do in function calls?
> 
> It is for us, and will break any write combining.
> 
>>>>> The same document says that _relaxed() does not give that guarentee.
>>>>>
>>>>> The lwn articule on this went into some depth on the interaction with
>>>>> spinlocks.
>>>>>
>>>>> As far as I can see, containment in a spinlock seems to be the only
>>>>> different between writel and writel_relaxed..
>>>>
>>>> I was always puzzled by this: The intention of _relaxed() on ARM
>>>> (where it originates) was to skip the barrier that serializes DMA
>>>> with MMIO, not to skip the serialization between MMIO and locks.
>>>
>>> But that was never a requirement of writel(),
>>> Documentation/memory-barriers.txt gives an explicit example demanding
>>> the wmb() before writel() for ordering system memory against writel.
> 
> This is a bug in the documentation.
> 
>> Indeed, but it's in an example for when to use dma_wmb(), not wmb().
>> Adding Alexander Duyck to Cc, he added that section as part of
>> 1077fa36f23e ("arch: Add lightweight memory barriers dma_rmb() and
>> dma_wmb()"). Also adding the other people that were involved with that.
> 
> Linus himself made it very clear years ago. readl and writel have to
> order vs memory accesses.
> 
>>> I actually have no idea why ARM had that barrier, I always assumed it
>>> was to give program ordering to the accesses and that _relaxed allowed
>>> re-ordering (the usual meaning of relaxed)..
>>>
>>> But the barrier document makes it pretty clear that the only
>>> difference between the two is spinlock containment, and WillD wrote
>>> this text, so I belive it is accurate for ARM.
>>>
>>> Very confusing.
>>
>> It does mention serialization with both DMA and locks in the
>> section about  readX_relaxed()/writeX_relaxed(). The part
>> about DMA is very clear here, and I must have just forgotten
>> the exact semantics with regards to spinlocks. I'm still not
>> sure what prevents a writel() from leaking out the end of a
>> spinlock section that doesn't happen with writel_relaxed(), since
>> the barrier in writel() comes before the access, and the
>> spin_unlock() shouldn't affect the external buses.
> 
> So...
> 
> Historically, what happened is that we (we means whoever participated
> in the discussion on the list with Linus calling the shots really)
> decided that there was no sane way for drivers to understand a world
> where readl/writel didn't fully order things vs. memory accesses (ie,
> DMA).
> 
> So it should always be correct to do:
> 
>   - Write to some in-memory buffer
>   - writel() to kick the DMA read of that buffer
> 
> without any extra barrier.
> 
> The spinlock situation however got murky. Mostly that came up because
> on architecture (I forgot who, might have been ia64) has a hard time
> providing that consistency without making writel insanely expensive.
> 
> Thus they created mmiowb whose main purpose was precisely to order
> writel with a following spin_unlock.
> 
> I decided not to go down that path on power because getting all drivers
> "fixed" to do the right thing was going to be a losing battle, and
> instead added per-cpu tracking of writel in order to "escalate" to a
> heavier barrier in spin_unlock itself when necessary.
> 
> Now, all this happened more than a decade ago and it's possible that
> the understanding or expectations "shifted" over time...

Alex is raising concerns on the netdev list.

Sinan
"We are being told that if you use writel(), then you don't need a wmb() on
all architectures."

Alex:
"I'm not sure who told you that but that is incorrect, at least for
x86. If you attempt to use writel() without the wmb() we will have to
NAK the patches. We will accept the wmb() with writel_releaxed() since
that solves things for ARM."

> Jason is seeking behavior clarification for write combined buffers.

Alex:
"Don't bother. I can tell you right now that for x86 you have to have a
wmb() before the writel().

Based on the comment in
(https://www.spinics.net/lists/linux-rdma/msg62666.html):
Replacing wmb() + writel() with wmb() + writel_relaxed() will work on
PPC, it will just not give you a

Re: [PATCH v7 0/7] netdev: intel: Eliminate duplicate barriers on weakly-ordered archs

2018-03-27 Thread Sinan Kaya
On 3/27/2018 10:04 AM, Lino Sanfilippo wrote:
> Hi,
> 
>> Double sorry now.
>>
>> I don't know if you have been following "RFC on writel and writel_relaxed" 
>> thread
>> or not but there are some new developments about wmb() requirement. 
> 
> Just out of interest: Where can this thread be found?

https://www.spinics.net/lists/linux-rdma/msg62570.html

https://patchwork.kernel.org/patch/10309913/


> 
>>
>> Basically, wmb() should never be used before writel() as writel() seem to
>> provide coherency and observability guarantee.
>>
> 
> AFAIU memory-barriers.txt writel() only guarantees correct order of accesses 
> to
> IO-memory not RAM vs. IO-memory (this may be the case for some architectures 
> where the writel() implementation contains a wmb() but not for all).
> For the RAM vs. IO-memory case at least a a wmb()/rmb() has to be used. 
> Is this not correct? 

We are being told that if you use writel(), then you don't need a wmb() on
all architectures.

Jason is seeking behavior clarification for write combined buffers.

> 
> Regards,
> Lino
> 


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: [PATCH v7 0/7] netdev: Eliminate duplicate barriers on weakly-ordered archs

2018-03-27 Thread Sinan Kaya
On 3/27/2018 10:00 AM, David Miller wrote:
> From: Sinan Kaya <ok...@codeaurora.org>
> Date: Tue, 27 Mar 2018 08:40:41 -0400
> 
>> Are you able to drop the applied ones so that I can post V7 or is it
>> too late?
> 
> I cannot "drop" changes from my tree since my tree is used by thousands
> of people and therefore immutable.
> 
> You must therefore send me relative fixes or reverts.
> 

Thanks, I'll send fixes. Just wanted to see whether it got merged or if it was
sitting on a branch.

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: [PATCH v7 0/7] netdev: intel: Eliminate duplicate barriers on weakly-ordered archs

2018-03-27 Thread Sinan Kaya
Jeff,

On 3/23/2018 10:34 PM, ok...@codeaurora.org wrote:
> On 2018-03-23 19:58, Jeff Kirsher wrote:
>> On Fri, 2018-03-23 at 14:53 -0700, Alexander Duyck wrote:
>>> On Fri, Mar 23, 2018 at 11:52 AM, Sinan Kaya <ok...@codeaurora.org>
>>> wrote:
>>> > Code includes wmb() followed by writel() in multiple places. writel()
>>> > already has a barrier on some architectures like arm64.
>>> >
>>> > This ends up CPU observing two barriers back to back before executing
>>> > the
>>> > register write.
>>> >
>>> > Since code already has an explicit barrier call, changing writel() to
>>> > writel_relaxed().
>>> >
>>> > I did a regex search for wmb() followed by writel() in each drivers
>>> > directory.
>>> > I scrubbed the ones I care about in this series.
>>> >
>>> > I considered "ease of change", "popular usage" and "performance
>>> > critical
>>> > path" as the determining criteria for my filtering.
>>> >
>>> > We used relaxed API heavily on ARM for a long time but
>>> > it did not exist on other architectures. For this reason, relaxed
>>> > architectures have been paying double penalty in order to use the
>>> > common
>>> > drivers.
>>> >
>>> > Now that relaxed API is present on all architectures, we can go and
>>> > scrub
>>> > all drivers to see what needs to change and what can remain.
>>> >
>>> > We start with mostly used ones and hope to increase the coverage over
>>> > time.
>>> > It will take a while to cover all drivers.
>>> >
>>> > Feel free to apply patches individually.
>>>
>>> I looked over the set and they seem good.
>>>
>>> Reviewed-by: Alexander Duyck <alexander.h.du...@intel.com>
>>
>> Grrr, patch 1 does not apply cleanly to my next-queue tree (dev-queue
>> branch).  I will deal with this series in a day or two, after I have dealt
>> with my driver pull requests.
> 
> Sorry, you will have to replace the ones you took from me.

Double sorry now.

I don't know if you have been following "RFC on writel and writel_relaxed" 
thread
or not but there are some new developments about wmb() requirement. 

Basically, wmb() should never be used before writel() as writel() seem to
provide coherency and observability guarantee.

wmb()+writel_relaxed() is slower on some architectures than plain writel()

I'll have to rework these patches to have writel() only. 

Are you able to drop the applied ones so that I can post V8 or is it too late?


Sinan

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: [PATCH v7 0/7] netdev: Eliminate duplicate barriers on weakly-ordered archs

2018-03-27 Thread Sinan Kaya
Dave,

On 3/26/2018 12:48 PM, David Miller wrote:
> From: Sinan Kaya <ok...@codeaurora.org>
> Date: Sun, 25 Mar 2018 10:39:14 -0400
> 
>> Code includes wmb() followed by writel() in multiple places. writel()
>> already has a barrier on some architectures like arm64.
>>
>> This ends up CPU observing two barriers back to back before executing the
>> register write.
>>
>> Since code already has an explicit barrier call, changing writel() to
>> writel_relaxed().
>>
>> I did a regex search for wmb() followed by writel() in each drivers
>> directory.
>> I scrubbed the ones I care about in this series.
>>
>> I considered "ease of change", "popular usage" and "performance critical
>> path" as the determining criteria for my filtering.
>>
>> We used relaxed API heavily on ARM for a long time but
>> it did not exist on other architectures. For this reason, relaxed
>> architectures have been paying double penalty in order to use the common
>> drivers.
>>
>> Now that relaxed API is present on all architectures, we can go and scrub
>> all drivers to see what needs to change and what can remain.
>>
>> We start with mostly used ones and hope to increase the coverage over time.
>> It will take a while to cover all drivers.
>>
>> Feel free to apply patches individually.
>>
>> Changes since v6:
>> - bring back amazon ena and add mmiowb, remove
>>   ena_com_write_sq_doorbell_rel(). 
>> - remove extra mmiowb in bnx2x
>> - correct spelling mistake in  bnx2x: Replace doorbell barrier() with wmb()
> 
> Series applied, thank you.
> 

I don't know if you have been following "RFC on writel and writel_relaxed" 
thread
or not but there are some new developments about wmb() requirement. 

Basically, wmb() should never be used before writel() as writel() seem to
provide coherency and observability guarantee.

wmb()+writel_relaxed() is slower on some architectures than plain writel()

I'll have to rework these patches to have writel() only. 

Are you able to drop the applied ones so that I can post V7 or is it too late?

Sinan

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


[PATCH v7 5/7] net: qlge: Eliminate duplicate barriers on weakly-ordered archs

2018-03-25 Thread Sinan Kaya
Code includes wmb() followed by writel(). writel() already has a barrier on
some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Create a new wrapper function with relaxed write operator. Use the new
wrapper when a write is following a wmb().

Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
---
 drivers/net/ethernet/qlogic/qlge/qlge.h  | 16 
 drivers/net/ethernet/qlogic/qlge/qlge_main.c |  3 ++-
 2 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/qlogic/qlge/qlge.h 
b/drivers/net/ethernet/qlogic/qlge/qlge.h
index 84ac50f..3e71b65 100644
--- a/drivers/net/ethernet/qlogic/qlge/qlge.h
+++ b/drivers/net/ethernet/qlogic/qlge/qlge.h
@@ -2185,6 +2185,22 @@ static inline void ql_write_db_reg(u32 val, void __iomem 
*addr)
 }
 
 /*
+ * Doorbell Registers:
+ * Doorbell registers are virtual registers in the PCI memory space.
+ * The space is allocated by the chip during PCI initialization.  The
+ * device driver finds the doorbell address in BAR 3 in PCI config space.
+ * The registers are used to control outbound and inbound queues. For
+ * example, the producer index for an outbound queue.  Each queue uses
+ * 1 4k chunk of memory.  The lower half of the space is for outbound
+ * queues. The upper half is for inbound queues.
+ * Caller has to guarantee ordering.
+ */
+static inline void ql_write_db_reg_relaxed(u32 val, void __iomem *addr)
+{
+   writel_relaxed(val, addr);
+}
+
+/*
  * Shadow Registers:
  * Outbound queues have a consumer index that is maintained by the chip.
  * Inbound queues have a producer index that is maintained by the chip.
diff --git a/drivers/net/ethernet/qlogic/qlge/qlge_main.c 
b/drivers/net/ethernet/qlogic/qlge/qlge_main.c
index 50038d9..8293c202 100644
--- a/drivers/net/ethernet/qlogic/qlge/qlge_main.c
+++ b/drivers/net/ethernet/qlogic/qlge/qlge_main.c
@@ -2700,7 +2700,8 @@ static netdev_tx_t qlge_send(struct sk_buff *skb, struct 
net_device *ndev)
tx_ring->prod_idx = 0;
wmb();
 
-   ql_write_db_reg(tx_ring->prod_idx, tx_ring->prod_idx_db_reg);
+   ql_write_db_reg_relaxed(tx_ring->prod_idx, tx_ring->prod_idx_db_reg);
+   mmiowb();
netif_printk(qdev, tx_queued, KERN_DEBUG, qdev->ndev,
 "tx queued, slot %d, len %d\n",
 tx_ring->prod_idx, skb->len);
-- 
2.7.4



[PATCH v7 6/7] bnxt_en: Eliminate duplicate barriers on weakly-ordered archs

2018-03-25 Thread Sinan Kaya
Code includes wmb() followed by writel(). writel() already has a barrier on
some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Create a new wrapper function with relaxed write operator. Use the new
wrapper when a write is following a wmb().

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Also add mmiowb() so that write code doesn't move outside of scope.

Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 2 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt.h | 9 +
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 1500243..befb538 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -1922,7 +1922,7 @@ static int bnxt_poll_work(struct bnxt *bp, struct 
bnxt_napi *bnapi, int budget)
/* Sync BD data before updating doorbell */
wmb();
 
-   bnxt_db_write(bp, db, DB_KEY_TX | prod);
+   bnxt_db_write_relaxed(bp, db, DB_KEY_TX | prod);
}
 
cpr->cp_raw_cons = raw_cons;
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
index 1989c47..5e453b9 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
@@ -1401,6 +1401,15 @@ static inline u32 bnxt_tx_avail(struct bnxt *bp, struct 
bnxt_tx_ring_info *txr)
((txr->tx_prod - txr->tx_cons) & bp->tx_ring_mask);
 }
 
+/* For TX and RX ring doorbells with no ordering guarantee*/
+static inline void bnxt_db_write_relaxed(struct bnxt *bp, void __iomem *db,
+u32 val)
+{
+   writel_relaxed(val, db);
+   if (bp->flags & BNXT_FLAG_DOUBLE_DB)
+   writel_relaxed(val, db);
+}
+
 /* For TX and RX ring doorbells */
 static inline void bnxt_db_write(struct bnxt *bp, void __iomem *db, u32 val)
 {
-- 
2.7.4



[PATCH v7 1/7] net: qla3xxx: Eliminate duplicate barriers on weakly-ordered archs

2018-03-25 Thread Sinan Kaya
Code includes wmb() followed by writel(). writel() already has a
barrier on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing
the register write.

Since code already has an explicit barrier call, changing code to

wmb()
writel_relaxed()
mmiowb()

for multi-arch support.

Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
---
 drivers/net/ethernet/qlogic/qla3xxx.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qla3xxx.c 
b/drivers/net/ethernet/qlogic/qla3xxx.c
index 9e5264d..b48f761 100644
--- a/drivers/net/ethernet/qlogic/qla3xxx.c
+++ b/drivers/net/ethernet/qlogic/qla3xxx.c
@@ -1858,8 +1858,9 @@ static void ql_update_small_bufq_prod_index(struct 
ql3_adapter *qdev)
qdev->small_buf_release_cnt -= 8;
}
wmb();
-   writel(qdev->small_buf_q_producer_index,
-   _regs->CommonRegs.rxSmallQProducerIndex);
+   writel_relaxed(qdev->small_buf_q_producer_index,
+  _regs->CommonRegs.rxSmallQProducerIndex);
+   mmiowb();
}
 }
 
-- 
2.7.4



[PATCH v7 2/7] qlcnic: Eliminate duplicate barriers on weakly-ordered archs

2018-03-25 Thread Sinan Kaya
Code includes wmb() followed by writel(). writel() already has a
barrier on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing
the register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
Acked-by: Manish Chopra <manish.cho...@cavium.com>
---
 drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c 
b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c
index 46b0372..97c146e7 100644
--- a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c
+++ b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c
@@ -478,7 +478,7 @@ irqreturn_t qlcnic_83xx_clear_legacy_intr(struct 
qlcnic_adapter *adapter)
wmb();
 
/* clear the interrupt trigger control register */
-   writel(0, adapter->isr_int_vec);
+   writel_relaxed(0, adapter->isr_int_vec);
intr_val = readl(adapter->isr_int_vec);
do {
intr_val = readl(adapter->tgt_status_reg);
-- 
2.7.4



[PATCH v7 7/7] net: ena: Eliminate duplicate barriers on weakly-ordered archs

2018-03-25 Thread Sinan Kaya
Code includes barrier() followed by writel(). writel() already has a
barrier on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Create a new wrapper function with relaxed write operator. Use the new
wrapper when a write is following a barrier().

Since code already has an explicit barrier call, changing writel() to
writel_relaxed() and adding mmiowb() for ordering protection.

Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
---
 drivers/net/ethernet/amazon/ena/ena_com.c | 8 ++--
 drivers/net/ethernet/amazon/ena/ena_eth_com.h | 8 ++--
 drivers/net/ethernet/amazon/ena/ena_netdev.c  | 5 +++--
 3 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/amazon/ena/ena_com.c 
b/drivers/net/ethernet/amazon/ena/ena_com.c
index bf2de52..1b9d3130 100644
--- a/drivers/net/ethernet/amazon/ena/ena_com.c
+++ b/drivers/net/ethernet/amazon/ena/ena_com.c
@@ -631,8 +631,10 @@ static u32 ena_com_reg_bar_read32(struct ena_com_dev 
*ena_dev, u16 offset)
 */
wmb();
 
-   writel(mmio_read_reg, ena_dev->reg_bar + ENA_REGS_MMIO_REG_READ_OFF);
+   writel_relaxed(mmio_read_reg,
+  ena_dev->reg_bar + ENA_REGS_MMIO_REG_READ_OFF);
 
+   mmiowb();
for (i = 0; i < timeout; i++) {
if (read_resp->req_id == mmio_read->seq_num)
break;
@@ -1826,7 +1828,9 @@ void ena_com_aenq_intr_handler(struct ena_com_dev *dev, 
void *data)
 
/* write the aenq doorbell after all AENQ descriptors were read */
mb();
-   writel((u32)aenq->head, dev->reg_bar + ENA_REGS_AENQ_HEAD_DB_OFF);
+   writel_relaxed((u32)aenq->head,
+  dev->reg_bar + ENA_REGS_AENQ_HEAD_DB_OFF);
+   mmiowb();
 }
 
 int ena_com_dev_reset(struct ena_com_dev *ena_dev,
diff --git a/drivers/net/ethernet/amazon/ena/ena_eth_com.h 
b/drivers/net/ethernet/amazon/ena/ena_eth_com.h
index 2f76572..6fdc753 100644
--- a/drivers/net/ethernet/amazon/ena/ena_eth_com.h
+++ b/drivers/net/ethernet/amazon/ena/ena_eth_com.h
@@ -107,7 +107,8 @@ static inline int ena_com_sq_empty_space(struct 
ena_com_io_sq *io_sq)
return io_sq->q_depth - 1 - cnt;
 }
 
-static inline int ena_com_write_sq_doorbell(struct ena_com_io_sq *io_sq)
+static inline int ena_com_write_sq_doorbell(struct ena_com_io_sq *io_sq,
+   bool relaxed)
 {
u16 tail;
 
@@ -116,7 +117,10 @@ static inline int ena_com_write_sq_doorbell(struct 
ena_com_io_sq *io_sq)
pr_debug("write submission queue doorbell for queue: %d tail: %d\n",
 io_sq->qid, tail);
 
-   writel(tail, io_sq->db_addr);
+   if (relaxed)
+   writel_relaxed(tail, io_sq->db_addr);
+   else
+   writel(tail, io_sq->db_addr);
 
return 0;
 }
diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c 
b/drivers/net/ethernet/amazon/ena/ena_netdev.c
index 6975150..a822e70 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
@@ -556,7 +556,8 @@ static int ena_refill_rx_bufs(struct ena_ring *rx_ring, u32 
num)
 * issue a doorbell
 */
wmb();
-   ena_com_write_sq_doorbell(rx_ring->ena_com_io_sq);
+   ena_com_write_sq_doorbell(rx_ring->ena_com_io_sq, true);
+   mmiowb();
}
 
rx_ring->next_to_use = next_to_use;
@@ -2151,7 +2152,7 @@ static netdev_tx_t ena_start_xmit(struct sk_buff *skb, 
struct net_device *dev)
 
if (netif_xmit_stopped(txq) || !skb->xmit_more) {
/* trigger the dma engine */
-   ena_com_write_sq_doorbell(tx_ring->ena_com_io_sq);
+   ena_com_write_sq_doorbell(tx_ring->ena_com_io_sq, false);
u64_stats_update_begin(_ring->syncp);
tx_ring->tx_stats.doorbells++;
u64_stats_update_end(_ring->syncp);
-- 
2.7.4



[PATCH v7 4/7] bnx2x: Eliminate duplicate barriers on weakly-ordered archs

2018-03-25 Thread Sinan Kaya
Code includes wmb() followed by writel(). writel() already has a
barrier on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing
the register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x.h | 12 
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c |  2 +-
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h |  4 ++--
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c |  2 +-
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c|  4 ++--
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c|  4 +++-
 6 files changed, 17 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
index 352beff..d847e1b 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
@@ -166,6 +166,12 @@ do {   \
 #define REG_RD8(bp, offset)readb(REG_ADDR(bp, offset))
 #define REG_RD16(bp, offset)   readw(REG_ADDR(bp, offset))
 
+#define REG_WR_RELAXED(bp, offset, val)\
+   writel_relaxed((u32)val, REG_ADDR(bp, offset))
+
+#define REG_WR16_RELAXED(bp, offset, val) \
+   writew_relaxed((u16)val, REG_ADDR(bp, offset))
+
 #define REG_WR(bp, offset, val)writel((u32)val, REG_ADDR(bp, 
offset))
 #define REG_WR8(bp, offset, val)   writeb((u8)val, REG_ADDR(bp, offset))
 #define REG_WR16(bp, offset, val)  writew((u16)val, REG_ADDR(bp, offset))
@@ -758,10 +764,8 @@ struct bnx2x_fastpath {
 #if (BNX2X_DB_SHIFT < BNX2X_DB_MIN_SHIFT)
 #error "Min DB doorbell stride is 8"
 #endif
-#define DOORBELL(bp, cid, val) \
-   do { \
-   writel((u32)(val), bp->doorbells + (bp->db_size * (cid))); \
-   } while (0)
+#define DOORBELL_RELAXED(bp, cid, val) \
+   writel_relaxed((u32)(val), (bp)->doorbells + ((bp)->db_size * (cid)))
 
 /* TX CSUM helpers */
 #define SKB_CS_OFF(skb)(offsetof(struct tcphdr, check) - \
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
index 0f86f18..95871576 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
@@ -4156,7 +4156,7 @@ netdev_tx_t bnx2x_start_xmit(struct sk_buff *skb, struct 
net_device *dev)
/* make sure descriptor update is observed by HW */
wmb();
 
-   DOORBELL(bp, txdata->cid, txdata->tx_db.raw);
+   DOORBELL_RELAXED(bp, txdata->cid, txdata->tx_db.raw);
 
mmiowb();
 
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h
index a5265e1..a8ce5c5 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h
@@ -522,8 +522,8 @@ static inline void bnx2x_update_rx_prod(struct bnx2x *bp,
wmb();
 
for (i = 0; i < sizeof(rx_prods)/4; i++)
-   REG_WR(bp, fp->ustorm_rx_prods_offset + i*4,
-  ((u32 *)_prods)[i]);
+   REG_WR_RELAXED(bp, fp->ustorm_rx_prods_offset + i * 4,
+  ((u32 *)_prods)[i]);
 
mmiowb(); /* keep prod updates ordered */
 
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c
index 39af4f8..da18aa2 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c
@@ -2593,7 +2593,7 @@ static int bnx2x_run_loopback(struct bnx2x *bp, int 
loopback_mode)
txdata->tx_db.data.prod += 2;
/* make sure descriptor update is observed by the HW */
wmb();
-   DOORBELL(bp, txdata->cid, txdata->tx_db.raw);
+   DOORBELL_RELAXED(bp, txdata->cid, txdata->tx_db.raw);
 
mmiowb();
barrier();
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
index 74fc9af..146c40d 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
@@ -3817,8 +3817,8 @@ static void bnx2x_sp_prod_update(struct bnx2x *bp)
 */
mb();
 
-   REG_WR16(bp, BAR_XSTRORM_INTMEM + XSTORM_SPQ_PROD_OFFSET(func),
-bp->spq_prod_idx);
+   REG_WR16_RELAXED(bp, BAR_XSTRORM_INTMEM + XSTORM_SPQ_PROD_OFFSET(func),
+bp->spq_prod_idx);
mmiowb();
 }
 
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c
index 76a4668..8e0a317 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c
@@ -170,7 +170,

[PATCH v7 3/7] bnx2x: Replace doorbell barrier() with wmb()

2018-03-25 Thread Sinan Kaya
barrier() doesn't guarantee memory writes to be observed by the hardware on
all architectures. barrier() only tells compiler not to move this code
with respect to other read/writes.

If memory write needs to be observed by the HW, wmb() is the right choice.

Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c | 3 ++-
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c | 3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
index d7c98e8..0f86f18 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
@@ -4153,7 +4153,8 @@ netdev_tx_t bnx2x_start_xmit(struct sk_buff *skb, struct 
net_device *dev)
wmb();
 
txdata->tx_db.data.prod += nbd;
-   barrier();
+   /* make sure descriptor update is observed by HW */
+   wmb();
 
DOORBELL(bp, txdata->cid, txdata->tx_db.raw);
 
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c
index 1e33abd..39af4f8 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c
@@ -2591,7 +2591,8 @@ static int bnx2x_run_loopback(struct bnx2x *bp, int 
loopback_mode)
wmb();
 
txdata->tx_db.data.prod += 2;
-   barrier();
+   /* make sure descriptor update is observed by the HW */
+   wmb();
DOORBELL(bp, txdata->cid, txdata->tx_db.raw);
 
mmiowb();
-- 
2.7.4



[PATCH v7 0/7] netdev: Eliminate duplicate barriers on weakly-ordered archs

2018-03-25 Thread Sinan Kaya
Code includes wmb() followed by writel() in multiple places. writel()
already has a barrier on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

I did a regex search for wmb() followed by writel() in each drivers
directory.
I scrubbed the ones I care about in this series.

I considered "ease of change", "popular usage" and "performance critical
path" as the determining criteria for my filtering.

We used relaxed API heavily on ARM for a long time but
it did not exist on other architectures. For this reason, relaxed
architectures have been paying double penalty in order to use the common
drivers.

Now that relaxed API is present on all architectures, we can go and scrub
all drivers to see what needs to change and what can remain.

We start with mostly used ones and hope to increase the coverage over time.
It will take a while to cover all drivers.

Feel free to apply patches individually.

Changes since v6:
- bring back amazon ena and add mmiowb, remove
  ena_com_write_sq_doorbell_rel(). 
- remove extra mmiowb in bnx2x
- correct spelling mistake in  bnx2x: Replace doorbell barrier() with wmb()


Sinan Kaya (7):
  net: qla3xxx: Eliminate duplicate barriers on weakly-ordered archs
  qlcnic: Eliminate duplicate barriers on weakly-ordered archs
  bnx2x: Replace doorbell barrier() with wmb()
  bnx2x: Eliminate duplicate barriers on weakly-ordered archs
  net: qlge: Eliminate duplicate barriers on weakly-ordered archs
  bnxt_en: Eliminate duplicate barriers on weakly-ordered archs
  net: ena: Eliminate duplicate barriers on weakly-ordered archs

 drivers/net/ethernet/amazon/ena/ena_com.c   |  8 ++--
 drivers/net/ethernet/amazon/ena/ena_eth_com.h   |  8 ++--
 drivers/net/ethernet/amazon/ena/ena_netdev.c|  5 +++--
 drivers/net/ethernet/broadcom/bnx2x/bnx2x.h | 12 
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c |  5 +++--
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h |  4 ++--
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c |  5 +++--
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c|  4 ++--
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c|  4 +++-
 drivers/net/ethernet/broadcom/bnxt/bnxt.c   |  2 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt.h   |  9 +
 drivers/net/ethernet/qlogic/qla3xxx.c   |  5 +++--
 drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c |  2 +-
 drivers/net/ethernet/qlogic/qlge/qlge.h | 16 
 drivers/net/ethernet/qlogic/qlge/qlge_main.c|  3 ++-
 15 files changed, 68 insertions(+), 24 deletions(-)

-- 
2.7.4



Re: [PATCH v5 3/5] bnx2x: Eliminate duplicate barriers on weakly-ordered archs

2018-03-24 Thread Sinan Kaya
On 3/24/2018 10:30 AM, Chopra, Manish wrote:
>> -Original Message-
>> From: Sinan Kaya [mailto:ok...@codeaurora.org]
>> Sent: Friday, March 23, 2018 10:44 PM
>> To: David Miller <da...@davemloft.net>
>> Cc: netdev@vger.kernel.org; ti...@codeaurora.org; sulr...@codeaurora.org;
>> linux-arm-...@vger.kernel.org; linux-arm-ker...@lists.infradead.org; Elior,
>> Ariel <ariel.el...@cavium.com>; Dept-Eng Everest Linux L2 > engeverestlinu...@cavium.com>; linux-ker...@vger.kernel.org
>> Subject: Re: [PATCH v5 3/5] bnx2x: Eliminate duplicate barriers on weakly-
>> ordered archs
>>
>> On 3/23/2018 1:04 PM, David Miller wrote:
>>> From: Sinan Kaya <ok...@codeaurora.org>
>>> Date: Fri, 23 Mar 2018 12:51:47 -0400
>>>
>>>> It could if txdata->tx_db was not a union. There is a data dependency
>>>> between txdata->tx_db.data.prod and txdata->tx_db.raw.
>>>>
>>>> So, no reordering.
>>>
>>> I don't see it that way, the code requires that:
>>>
>>> txdata->tx_db.data.prod += nbd;
>>>
>>> is visible before the doorbell update.>
>>> barrier() doesn't provide that.
>>>
>>> Neither does writel_relaxed().  However plain writel() does.
>>
>> Correct for some architectures including ARM but not correct universally.
>>
>> writel() just guarantees register read/writes before and after to be ordered
>> when HW observes it.
>>
>> writel() doesn't guarantee that the memory update is visible to the HW on all
>> architectures.
>>
>> If you need memory update visibility, that barrier() should have been a
>> wmb()
>>
>> A correct multi-arch pattern is
>>
>> wmb()
>> writel_relaxed()
>> mmiowb()
>>
> 
> Sinan,  Since you have mentioned the use of mmiowb() here after 
> writel_relaxed().
> I believe this is not always correct for all types of IO mapped memory 
> [Specially if IO memory is mapped using write combined (for ex. 
> Ioremap_wc())].
> We have a current issue on our NIC (qede) driver on x86 for which the patch 
> is already been sent more than a week ago [Still awaiting to hear from David 
> on that].
> where mmiowb() seems to be useless since we use write combined mapped 
> doorbell and mmiowb() just seems to be a compiler barrier() there.
> So in order to flush write combined buffer we really need writel_relaxed() 
> followed by a wmb() to synchronize writes among CPU cores.
> I think  the correct pattern in such cases (for write combined IO) would have 
> been like below - 
> 
> wmb();
> writel_relaxed();
> wmb(); -> To flush the writes actually.

You actually have good points. It is the same problem with barrier() 
description above.

The answer really depends on what you are doing/expecting after mmiowb(). If 
you expect
that some memory content to be observed by HW, you definitely need a wmb() like
you mentioned. 

If you just want writes to be flushed but you don't expect any memory content 
to be
updated, you need a mmiowb(). 

https://lwn.net/Articles/198988/
"There is mmiowb(), but its real purpose is to enforce ordering between MMIO 
operations only."

> 
> Thanks.
> 
> 
> 
> 
> 


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: [PATCH v6 6/6] bnxt_en: Eliminate duplicate barriers on weakly-ordered archs

2018-03-23 Thread Sinan Kaya
On 3/23/2018 6:36 PM, Michael Chan wrote:
>> +   mmiowb();
> Sorry for the late review.  mmiowb() is not required here because we
> are in NAPI context, so only one CPU will be updating this doorbell.
> 
> Other than that, it looks good.
> 

OK, I'll fix this on the next version.

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


[PATCH v6 3/6] bnx2x: Replace doorbell barrier() with wmb()

2018-03-23 Thread Sinan Kaya
barrier() doesn't guarantee memory writes to be observed by the hardware on
all architectures. barrier() only tells compiler not to move this code
with respect to other read/writes.

If memory write needs to be observed by the HW, wmb() is the right choice.

Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c | 3 ++-
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c | 3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
index d7c98e8..b97820f 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
@@ -4153,7 +4153,8 @@ netdev_tx_t bnx2x_start_xmit(struct sk_buff *skb, struct 
net_device *dev)
wmb();
 
txdata->tx_db.data.prod += nbd;
-   barrier();
+   /* make sur edescriptor update is observed by HW */
+   wmb();
 
DOORBELL(bp, txdata->cid, txdata->tx_db.raw);
 
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c
index 1e33abd..39af4f8 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c
@@ -2591,7 +2591,8 @@ static int bnx2x_run_loopback(struct bnx2x *bp, int 
loopback_mode)
wmb();
 
txdata->tx_db.data.prod += 2;
-   barrier();
+   /* make sure descriptor update is observed by the HW */
+   wmb();
DOORBELL(bp, txdata->cid, txdata->tx_db.raw);
 
mmiowb();
-- 
2.7.4



[PATCH v6 2/6] qlcnic: Eliminate duplicate barriers on weakly-ordered archs

2018-03-23 Thread Sinan Kaya
Code includes wmb() followed by writel(). writel() already has a
barrier on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing
the register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
Acked-by: Manish Chopra <manish.cho...@cavium.com>
---
 drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c 
b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c
index 46b0372..97c146e7 100644
--- a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c
+++ b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c
@@ -478,7 +478,7 @@ irqreturn_t qlcnic_83xx_clear_legacy_intr(struct 
qlcnic_adapter *adapter)
wmb();
 
/* clear the interrupt trigger control register */
-   writel(0, adapter->isr_int_vec);
+   writel_relaxed(0, adapter->isr_int_vec);
intr_val = readl(adapter->isr_int_vec);
do {
intr_val = readl(adapter->tgt_status_reg);
-- 
2.7.4



[PATCH v6 1/6] net: qla3xxx: Eliminate duplicate barriers on weakly-ordered archs

2018-03-23 Thread Sinan Kaya
Code includes wmb() followed by writel(). writel() already has a
barrier on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing
the register write.

Since code already has an explicit barrier call, changing code to

wmb()
writel_relaxed()
mmiowb()

for multi-arch support.

Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
---
 drivers/net/ethernet/qlogic/qla3xxx.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qla3xxx.c 
b/drivers/net/ethernet/qlogic/qla3xxx.c
index 9e5264d..b48f761 100644
--- a/drivers/net/ethernet/qlogic/qla3xxx.c
+++ b/drivers/net/ethernet/qlogic/qla3xxx.c
@@ -1858,8 +1858,9 @@ static void ql_update_small_bufq_prod_index(struct 
ql3_adapter *qdev)
qdev->small_buf_release_cnt -= 8;
}
wmb();
-   writel(qdev->small_buf_q_producer_index,
-   _regs->CommonRegs.rxSmallQProducerIndex);
+   writel_relaxed(qdev->small_buf_q_producer_index,
+  _regs->CommonRegs.rxSmallQProducerIndex);
+   mmiowb();
}
 }
 
-- 
2.7.4



[PATCH v6 4/6] bnx2x: Eliminate duplicate barriers on weakly-ordered archs

2018-03-23 Thread Sinan Kaya
Code includes wmb() followed by writel(). writel() already has a
barrier on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing
the register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x.h | 12 
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c |  2 +-
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h |  4 ++--
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c |  2 +-
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c|  4 ++--
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c|  4 +++-
 6 files changed, 17 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
index 352beff..d847e1b 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
@@ -166,6 +166,12 @@ do {   \
 #define REG_RD8(bp, offset)readb(REG_ADDR(bp, offset))
 #define REG_RD16(bp, offset)   readw(REG_ADDR(bp, offset))
 
+#define REG_WR_RELAXED(bp, offset, val)\
+   writel_relaxed((u32)val, REG_ADDR(bp, offset))
+
+#define REG_WR16_RELAXED(bp, offset, val) \
+   writew_relaxed((u16)val, REG_ADDR(bp, offset))
+
 #define REG_WR(bp, offset, val)writel((u32)val, REG_ADDR(bp, 
offset))
 #define REG_WR8(bp, offset, val)   writeb((u8)val, REG_ADDR(bp, offset))
 #define REG_WR16(bp, offset, val)  writew((u16)val, REG_ADDR(bp, offset))
@@ -758,10 +764,8 @@ struct bnx2x_fastpath {
 #if (BNX2X_DB_SHIFT < BNX2X_DB_MIN_SHIFT)
 #error "Min DB doorbell stride is 8"
 #endif
-#define DOORBELL(bp, cid, val) \
-   do { \
-   writel((u32)(val), bp->doorbells + (bp->db_size * (cid))); \
-   } while (0)
+#define DOORBELL_RELAXED(bp, cid, val) \
+   writel_relaxed((u32)(val), (bp)->doorbells + ((bp)->db_size * (cid)))
 
 /* TX CSUM helpers */
 #define SKB_CS_OFF(skb)(offsetof(struct tcphdr, check) - \
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
index b97820f..91d2de6 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
@@ -4156,7 +4156,7 @@ netdev_tx_t bnx2x_start_xmit(struct sk_buff *skb, struct 
net_device *dev)
/* make sur edescriptor update is observed by HW */
wmb();
 
-   DOORBELL(bp, txdata->cid, txdata->tx_db.raw);
+   DOORBELL_RELAXED(bp, txdata->cid, txdata->tx_db.raw);
 
mmiowb();
 
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h
index a5265e1..a8ce5c5 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h
@@ -522,8 +522,8 @@ static inline void bnx2x_update_rx_prod(struct bnx2x *bp,
wmb();
 
for (i = 0; i < sizeof(rx_prods)/4; i++)
-   REG_WR(bp, fp->ustorm_rx_prods_offset + i*4,
-  ((u32 *)_prods)[i]);
+   REG_WR_RELAXED(bp, fp->ustorm_rx_prods_offset + i * 4,
+  ((u32 *)_prods)[i]);
 
mmiowb(); /* keep prod updates ordered */
 
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c
index 39af4f8..da18aa2 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c
@@ -2593,7 +2593,7 @@ static int bnx2x_run_loopback(struct bnx2x *bp, int 
loopback_mode)
txdata->tx_db.data.prod += 2;
/* make sure descriptor update is observed by the HW */
wmb();
-   DOORBELL(bp, txdata->cid, txdata->tx_db.raw);
+   DOORBELL_RELAXED(bp, txdata->cid, txdata->tx_db.raw);
 
mmiowb();
barrier();
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
index 74fc9af..146c40d 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
@@ -3817,8 +3817,8 @@ static void bnx2x_sp_prod_update(struct bnx2x *bp)
 */
mb();
 
-   REG_WR16(bp, BAR_XSTRORM_INTMEM + XSTORM_SPQ_PROD_OFFSET(func),
-bp->spq_prod_idx);
+   REG_WR16_RELAXED(bp, BAR_XSTRORM_INTMEM + XSTORM_SPQ_PROD_OFFSET(func),
+bp->spq_prod_idx);
mmiowb();
 }
 
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c
index 76a4668..8e0a317 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c
@@ -170,7 +170,

[PATCH v6 6/6] bnxt_en: Eliminate duplicate barriers on weakly-ordered archs

2018-03-23 Thread Sinan Kaya
Code includes wmb() followed by writel(). writel() already has a barrier on
some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Create a new wrapper function with relaxed write operator. Use the new
wrapper when a write is following a wmb().

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Also add mmiowb() so that write code doesn't move outside of scope.

Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 3 ++-
 drivers/net/ethernet/broadcom/bnxt/bnxt.h | 9 +
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 1500243..fc8ea0d 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -1922,7 +1922,8 @@ static int bnxt_poll_work(struct bnxt *bp, struct 
bnxt_napi *bnapi, int budget)
/* Sync BD data before updating doorbell */
wmb();
 
-   bnxt_db_write(bp, db, DB_KEY_TX | prod);
+   bnxt_db_write_relaxed(bp, db, DB_KEY_TX | prod);
+   mmiowb();
}
 
cpr->cp_raw_cons = raw_cons;
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
index 1989c47..5e453b9 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
@@ -1401,6 +1401,15 @@ static inline u32 bnxt_tx_avail(struct bnxt *bp, struct 
bnxt_tx_ring_info *txr)
((txr->tx_prod - txr->tx_cons) & bp->tx_ring_mask);
 }
 
+/* For TX and RX ring doorbells with no ordering guarantee*/
+static inline void bnxt_db_write_relaxed(struct bnxt *bp, void __iomem *db,
+u32 val)
+{
+   writel_relaxed(val, db);
+   if (bp->flags & BNXT_FLAG_DOUBLE_DB)
+   writel_relaxed(val, db);
+}
+
 /* For TX and RX ring doorbells */
 static inline void bnxt_db_write(struct bnxt *bp, void __iomem *db, u32 val)
 {
-- 
2.7.4



[PATCH v6 5/6] net: qlge: Eliminate duplicate barriers on weakly-ordered archs

2018-03-23 Thread Sinan Kaya
Code includes wmb() followed by writel(). writel() already has a barrier on
some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Create a new wrapper function with relaxed write operator. Use the new
wrapper when a write is following a wmb().

Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
---
 drivers/net/ethernet/qlogic/qlge/qlge.h  | 16 
 drivers/net/ethernet/qlogic/qlge/qlge_main.c |  3 ++-
 2 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/qlogic/qlge/qlge.h 
b/drivers/net/ethernet/qlogic/qlge/qlge.h
index 84ac50f..3e71b65 100644
--- a/drivers/net/ethernet/qlogic/qlge/qlge.h
+++ b/drivers/net/ethernet/qlogic/qlge/qlge.h
@@ -2185,6 +2185,22 @@ static inline void ql_write_db_reg(u32 val, void __iomem 
*addr)
 }
 
 /*
+ * Doorbell Registers:
+ * Doorbell registers are virtual registers in the PCI memory space.
+ * The space is allocated by the chip during PCI initialization.  The
+ * device driver finds the doorbell address in BAR 3 in PCI config space.
+ * The registers are used to control outbound and inbound queues. For
+ * example, the producer index for an outbound queue.  Each queue uses
+ * 1 4k chunk of memory.  The lower half of the space is for outbound
+ * queues. The upper half is for inbound queues.
+ * Caller has to guarantee ordering.
+ */
+static inline void ql_write_db_reg_relaxed(u32 val, void __iomem *addr)
+{
+   writel_relaxed(val, addr);
+}
+
+/*
  * Shadow Registers:
  * Outbound queues have a consumer index that is maintained by the chip.
  * Inbound queues have a producer index that is maintained by the chip.
diff --git a/drivers/net/ethernet/qlogic/qlge/qlge_main.c 
b/drivers/net/ethernet/qlogic/qlge/qlge_main.c
index 50038d9..8293c202 100644
--- a/drivers/net/ethernet/qlogic/qlge/qlge_main.c
+++ b/drivers/net/ethernet/qlogic/qlge/qlge_main.c
@@ -2700,7 +2700,8 @@ static netdev_tx_t qlge_send(struct sk_buff *skb, struct 
net_device *ndev)
tx_ring->prod_idx = 0;
wmb();
 
-   ql_write_db_reg(tx_ring->prod_idx, tx_ring->prod_idx_db_reg);
+   ql_write_db_reg_relaxed(tx_ring->prod_idx, tx_ring->prod_idx_db_reg);
+   mmiowb();
netif_printk(qdev, tx_queued, KERN_DEBUG, qdev->ndev,
 "tx queued, slot %d, len %d\n",
 tx_ring->prod_idx, skb->len);
-- 
2.7.4



[PATCH v6 0/6] netdev: Eliminate duplicate barriers on weakly-ordered archs

2018-03-23 Thread Sinan Kaya
Code includes wmb() followed by writel() in multiple places. writel()
already has a barrier on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

I did a regex search for wmb() followed by writel() in each drivers
directory.
I scrubbed the ones I care about in this series.

I considered "ease of change", "popular usage" and "performance critical
path" as the determining criteria for my filtering.

We used relaxed API heavily on ARM for a long time but
it did not exist on other architectures. For this reason, relaxed
architectures have been paying double penalty in order to use the common
drivers.

Now that relaxed API is present on all architectures, we can go and scrub
all drivers to see what needs to change and what can remain.

We start with mostly used ones and hope to increase the coverage over time.
It will take a while to cover all drivers.

Feel free to apply patches individually.

Changes since v5:
- add mmiowb() for PPC architecture
- collect reviewed/acked bys
- bn2x: change doorbell barrier to wmb for observability guarantee across
all architectures.

Sinan Kaya (6):
  net: qla3xxx: Eliminate duplicate barriers on weakly-ordered archs
  qlcnic: Eliminate duplicate barriers on weakly-ordered archs
  bnx2x: Replace doorbell barrier() with wmb()
  bnx2x: Eliminate duplicate barriers on weakly-ordered archs
  net: qlge: Eliminate duplicate barriers on weakly-ordered archs
  bnxt_en: Eliminate duplicate barriers on weakly-ordered archs

 drivers/net/ethernet/broadcom/bnx2x/bnx2x.h | 12 
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c |  5 +++--
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h |  4 ++--
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c |  5 +++--
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c|  4 ++--
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c|  4 +++-
 drivers/net/ethernet/broadcom/bnxt/bnxt.c   |  3 ++-
 drivers/net/ethernet/broadcom/bnxt/bnxt.h   |  9 +
 drivers/net/ethernet/qlogic/qla3xxx.c   |  5 +++--
 drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c |  2 +-
 drivers/net/ethernet/qlogic/qlge/qlge.h | 16 
 drivers/net/ethernet/qlogic/qlge/qlge_main.c|  3 ++-
 12 files changed, 54 insertions(+), 18 deletions(-)

-- 
2.7.4



[PATCH v7 2/7] ixgbe: eliminate duplicate barriers on weakly-ordered archs

2018-03-23 Thread Sinan Kaya
Code includes wmb() followed by writel() in multiple places. writel()
already has a barrier on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index e3b32ea..fb80edb 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1701,7 +1701,7 @@ void ixgbe_alloc_rx_buffers(struct ixgbe_ring *rx_ring, 
u16 cleaned_count)
 * such as IA-64).
 */
wmb();
-   writel(i, rx_ring->tail);
+   writel_relaxed(i, rx_ring->tail);
}
 }
 
@@ -2470,7 +2470,7 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector 
*q_vector,
 * know there are new descriptors to fetch.
 */
wmb();
-   writel(ring->next_to_use, ring->tail);
+   writel_relaxed(ring->next_to_use, ring->tail);
 
xdp_do_flush_map();
}
@@ -8101,7 +8101,7 @@ static int ixgbe_tx_map(struct ixgbe_ring *tx_ring,
ixgbe_maybe_stop_tx(tx_ring, DESC_NEEDED);
 
if (netif_xmit_stopped(txring_txq(tx_ring)) || !skb->xmit_more) {
-   writel(i, tx_ring->tail);
+   writel_relaxed(i, tx_ring->tail);
 
/* we need this if more than one processor can write to our tail
 * at a time, it synchronizes IO on IA64/Altix systems
@@ -10038,7 +10038,7 @@ static void ixgbe_xdp_flush(struct net_device *dev)
 * are new descriptors to fetch.
 */
wmb();
-   writel(ring->next_to_use, ring->tail);
+   writel_relaxed(ring->next_to_use, ring->tail);
 
return;
 }
-- 
2.7.4



[PATCH v7 3/7] igbvf: eliminate duplicate barriers on weakly-ordered archs

2018-03-23 Thread Sinan Kaya
Code includes wmb() followed by writel(). writel() already has a barrier
on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
---
 drivers/net/ethernet/intel/igbvf/netdev.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/igbvf/netdev.c 
b/drivers/net/ethernet/intel/igbvf/netdev.c
index fa07876..df2283b 100644
--- a/drivers/net/ethernet/intel/igbvf/netdev.c
+++ b/drivers/net/ethernet/intel/igbvf/netdev.c
@@ -252,7 +252,7 @@ static void igbvf_alloc_rx_buffers(struct igbvf_ring 
*rx_ring,
 * such as IA-64).
*/
wmb();
-   writel(i, adapter->hw.hw_addr + rx_ring->tail);
+   writel_relaxed(i, adapter->hw.hw_addr + rx_ring->tail);
}
 }
 
@@ -2298,7 +2298,7 @@ static inline void igbvf_tx_queue_adv(struct 
igbvf_adapter *adapter,
 
tx_ring->buffer_info[first].next_to_watch = tx_desc;
tx_ring->next_to_use = i;
-   writel(i, adapter->hw.hw_addr + tx_ring->tail);
+   writel_relaxed(i, adapter->hw.hw_addr + tx_ring->tail);
/* we need this if more than one processor can write to our tail
 * at a time, it synchronizes IO on IA64/Altix systems
 */
-- 
2.7.4



[PATCH v7 1/7] i40e/i40evf: Eliminate duplicate barriers on weakly-ordered archs

2018-03-23 Thread Sinan Kaya
Code includes wmb() followed by writel(). writel() already has a barrier
on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 14 ++
 drivers/net/ethernet/intel/i40evf/i40e_txrx.c |  4 ++--
 2 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c 
b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index c6972bd..d970350 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -186,7 +186,13 @@ static int i40e_program_fdir_filter(struct 
i40e_fdir_filter *fdir_data,
/* Mark the data descriptor to be watched */
first->next_to_watch = tx_desc;
 
-   writel(tx_ring->next_to_use, tx_ring->tail);
+   writel_relaxed(tx_ring->next_to_use, tx_ring->tail);
+
+   /* We need this if more than one processor can write to our tail
+* at a time, it synchronizes IO on IA64/Altix systems
+*/
+   mmiowb();
+
return 0;
 
 dma_fail:
@@ -1529,7 +1535,7 @@ static inline void i40e_release_rx_desc(struct i40e_ring 
*rx_ring, u32 val)
 * such as IA-64).
 */
wmb();
-   writel(val, rx_ring->tail);
+   writel_relaxed(val, rx_ring->tail);
 }
 
 /**
@@ -2412,7 +2418,7 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, 
int budget)
 */
wmb();
 
-   writel(xdp_ring->next_to_use, xdp_ring->tail);
+   writel_relaxed(xdp_ring->next_to_use, xdp_ring->tail);
}
 
rx_ring->skb = skb;
@@ -3437,7 +3443,7 @@ static inline int i40e_tx_map(struct i40e_ring *tx_ring, 
struct sk_buff *skb,
 
/* notify HW of packet */
if (netif_xmit_stopped(txring_txq(tx_ring)) || !skb->xmit_more) {
-   writel(i, tx_ring->tail);
+   writel_relaxed(i, tx_ring->tail);
 
/* we need this if more than one processor can write to our tail
 * at a time, it synchronizes IO on IA64/Altix systems
diff --git a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c 
b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
index 1ae112f..5db5c50c 100644
--- a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
@@ -810,7 +810,7 @@ static inline void i40e_release_rx_desc(struct i40e_ring 
*rx_ring, u32 val)
 * such as IA-64).
 */
wmb();
-   writel(val, rx_ring->tail);
+   writel_relaxed(val, rx_ring->tail);
 }
 
 /**
@@ -2379,7 +2379,7 @@ static inline void i40evf_tx_map(struct i40e_ring 
*tx_ring, struct sk_buff *skb,
 
/* notify HW of packet */
if (netif_xmit_stopped(txring_txq(tx_ring)) || !skb->xmit_more) {
-   writel(i, tx_ring->tail);
+   writel_relaxed(i, tx_ring->tail);
 
/* we need this if more than one processor can write to our tail
 * at a time, it synchronizes IO on IA64/Altix systems
-- 
2.7.4



[PATCH v7 4/7] igb: eliminate duplicate barriers on weakly-ordered archs

2018-03-23 Thread Sinan Kaya
Code includes wmb() followed by writel(). writel() already has a barrier
on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
---
 drivers/net/ethernet/intel/igb/igb_main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c 
b/drivers/net/ethernet/intel/igb/igb_main.c
index 3d4ff3c..c501378 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -5674,7 +5674,7 @@ static int igb_tx_map(struct igb_ring *tx_ring,
igb_maybe_stop_tx(tx_ring, DESC_NEEDED);
 
if (netif_xmit_stopped(txring_txq(tx_ring)) || !skb->xmit_more) {
-   writel(i, tx_ring->tail);
+   writel_relaxed(i, tx_ring->tail);
 
/* we need this if more than one processor can write to our tail
 * at a time, it synchronizes IO on IA64/Altix systems
@@ -8079,7 +8079,7 @@ void igb_alloc_rx_buffers(struct igb_ring *rx_ring, u16 
cleaned_count)
 * such as IA-64).
 */
wmb();
-   writel(i, rx_ring->tail);
+   writel_relaxed(i, rx_ring->tail);
}
 }
 
-- 
2.7.4



[PATCH v7 5/7] fm10k: Eliminate duplicate barriers on weakly-ordered archs

2018-03-23 Thread Sinan Kaya
Code includes wmb() followed by writel(). writel() already has a
barrier on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing
the register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
---
 drivers/net/ethernet/intel/fm10k/fm10k_main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_main.c 
b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
index 409554d..bb94f04 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_main.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
@@ -180,7 +180,7 @@ void fm10k_alloc_rx_buffers(struct fm10k_ring *rx_ring, u16 
cleaned_count)
wmb();
 
/* notify hardware of new descriptors */
-   writel(i, rx_ring->tail);
+   writel_relaxed(i, rx_ring->tail);
}
 }
 
@@ -1055,7 +1055,7 @@ static void fm10k_tx_map(struct fm10k_ring *tx_ring,
 
/* notify HW of packet */
if (netif_xmit_stopped(txring_txq(tx_ring)) || !skb->xmit_more) {
-   writel(i, tx_ring->tail);
+   writel_relaxed(i, tx_ring->tail);
 
/* we need this if more than one processor can write to our tail
 * at a time, it synchronizes IO on IA64/Altix systems
-- 
2.7.4



[PATCH v7 6/7] ixgbevf: keep writel() closer to wmb()

2018-03-23 Thread Sinan Kaya
Remove ixgbevf_write_tail() in favor of moving writel() close to
wmb().

Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf.h  | 5 -
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 6 +++---
 2 files changed, 3 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h 
b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
index f712646..f97091d 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
@@ -312,11 +312,6 @@ static inline u16 ixgbevf_desc_unused(struct ixgbevf_ring 
*ring)
return ((ntc > ntu) ? 0 : ring->count) + ntc - ntu - 1;
 }
 
-static inline void ixgbevf_write_tail(struct ixgbevf_ring *ring, u32 value)
-{
-   writel(value, ring->tail);
-}
-
 #define IXGBEVF_RX_DESC(R, i)  \
(&(((union ixgbe_adv_rx_desc *)((R)->desc))[i]))
 #define IXGBEVF_TX_DESC(R, i)  \
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c 
b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 3d9033f..815cb1a 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -725,7 +725,7 @@ static void ixgbevf_alloc_rx_buffers(struct ixgbevf_ring 
*rx_ring,
 * such as IA-64).
 */
wmb();
-   ixgbevf_write_tail(rx_ring, i);
+   writel(i, rx_ring->tail);
}
 }
 
@@ -1232,7 +1232,7 @@ static int ixgbevf_clean_rx_irq(struct ixgbevf_q_vector 
*q_vector,
 * know there are new descriptors to fetch.
 */
wmb();
-   ixgbevf_write_tail(xdp_ring, xdp_ring->next_to_use);
+   writel(xdp_ring->next_to_use, xdp_ring->tail);
}
 
u64_stats_update_begin(_ring->syncp);
@@ -4004,7 +4004,7 @@ static void ixgbevf_tx_map(struct ixgbevf_ring *tx_ring,
tx_ring->next_to_use = i;
 
/* notify HW of packet */
-   ixgbevf_write_tail(tx_ring, i);
+   writel(i, tx_ring->tail);
 
return;
 dma_error:
-- 
2.7.4



[PATCH v7 7/7] ixgbevf: eliminate duplicate barriers on weakly-ordered archs

2018-03-23 Thread Sinan Kaya
Code includes wmb() followed by writel() in multiple places. writel()
already has a barrier on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c 
b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 815cb1a..7277647 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -725,7 +725,7 @@ static void ixgbevf_alloc_rx_buffers(struct ixgbevf_ring 
*rx_ring,
 * such as IA-64).
 */
wmb();
-   writel(i, rx_ring->tail);
+   writel_relaxed(i, rx_ring->tail);
}
 }
 
@@ -1232,7 +1232,7 @@ static int ixgbevf_clean_rx_irq(struct ixgbevf_q_vector 
*q_vector,
 * know there are new descriptors to fetch.
 */
wmb();
-   writel(xdp_ring->next_to_use, xdp_ring->tail);
+   writel_relaxed(xdp_ring->next_to_use, xdp_ring->tail);
}
 
u64_stats_update_begin(_ring->syncp);
@@ -4004,7 +4004,12 @@ static void ixgbevf_tx_map(struct ixgbevf_ring *tx_ring,
tx_ring->next_to_use = i;
 
/* notify HW of packet */
-   writel(i, tx_ring->tail);
+   writel_relaxed(i, tx_ring->tail);
+
+   /* We need this if more than one processor can write to our tail
+* at a time, it synchronizes IO on IA64/Altix systems
+*/
+   mmiowb();
 
return;
 dma_error:
-- 
2.7.4



[PATCH v7 0/7] netdev: intel: Eliminate duplicate barriers on weakly-ordered archs

2018-03-23 Thread Sinan Kaya
Code includes wmb() followed by writel() in multiple places. writel()
already has a barrier on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

I did a regex search for wmb() followed by writel() in each drivers
directory.
I scrubbed the ones I care about in this series.

I considered "ease of change", "popular usage" and "performance critical
path" as the determining criteria for my filtering.

We used relaxed API heavily on ARM for a long time but
it did not exist on other architectures. For this reason, relaxed
architectures have been paying double penalty in order to use the common
drivers.

Now that relaxed API is present on all architectures, we can go and scrub
all drivers to see what needs to change and what can remain.

We start with mostly used ones and hope to increase the coverage over time.
It will take a while to cover all drivers.

Feel free to apply patches individually.

Changes since v6:
clean up between 2..6 and then make your Alex's changes on 1 and 7
The mmiowb shouldn't be needed for Rx. Only one CPU will be running
NAPI for the queue and we will synchronize this with a full writel
anyway when we re-enable the interrupts.

Sinan Kaya (7):
  i40e/i40evf: Eliminate duplicate barriers on weakly-ordered archs
  ixgbe: eliminate duplicate barriers on weakly-ordered archs
  igbvf: eliminate duplicate barriers on weakly-ordered archs
  igb: eliminate duplicate barriers on weakly-ordered archs
  fm10k: Eliminate duplicate barriers on weakly-ordered archs
  ixgbevf: keep writel() closer to wmb()
  ixgbevf: eliminate duplicate barriers on weakly-ordered archs

 drivers/net/ethernet/intel/fm10k/fm10k_main.c |  4 ++--
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 14 ++
 drivers/net/ethernet/intel/i40evf/i40e_txrx.c |  4 ++--
 drivers/net/ethernet/intel/igb/igb_main.c |  4 ++--
 drivers/net/ethernet/intel/igbvf/netdev.c |  4 ++--
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |  8 
 drivers/net/ethernet/intel/ixgbevf/ixgbevf.h  |  5 -
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 11 ---
 8 files changed, 30 insertions(+), 24 deletions(-)

-- 
2.7.4



Re: [Intel-wired-lan] [PATCH v6 7/7] ixgbevf: eliminate duplicate barriers on weakly-ordered archs

2018-03-23 Thread Sinan Kaya
On 3/23/2018 2:31 PM, Alexander Duyck wrote:
>> Please point me to the redundant ones.
> So from what I can tell only this file and i40e needed any additional
> mmiowb calls added. The rest are not needed.


Thanks, I'll clean up between 2..6 and then make your suggested changes
on 1 and 7.


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: [Intel-wired-lan] [PATCH v6 7/7] ixgbevf: eliminate duplicate barriers on weakly-ordered archs

2018-03-23 Thread Sinan Kaya
On 3/23/2018 2:25 PM, Alexander Duyck wrote:
>> +   /* We need this if more than one processor can write to our 
>> tail
>> +* at a time, it synchronizes IO on IA64/Altix systems
>> +*/
>> +   mmiowb();
>> }
> The mmiowb shouldn't be needed for Rx. Only one CPU will be running
> NAPI for the queue and we will synchronize this with a full writel
> anyway when we re-enable the interrupts.
> 

OK. I can fix this on the next version. I did a blanket search and replace for
my writel_relaxed() changes as I don't know the code well enough. 

Please point me to the redundant ones.

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


[PATCH v6 3/7] igbvf: eliminate duplicate barriers on weakly-ordered archs

2018-03-23 Thread Sinan Kaya
Code includes wmb() followed by writel(). writel() already has a barrier
on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
---
 drivers/net/ethernet/intel/igbvf/netdev.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/igbvf/netdev.c 
b/drivers/net/ethernet/intel/igbvf/netdev.c
index fa07876..6dfd3dc 100644
--- a/drivers/net/ethernet/intel/igbvf/netdev.c
+++ b/drivers/net/ethernet/intel/igbvf/netdev.c
@@ -252,7 +252,12 @@ static void igbvf_alloc_rx_buffers(struct igbvf_ring 
*rx_ring,
 * such as IA-64).
*/
wmb();
-   writel(i, adapter->hw.hw_addr + rx_ring->tail);
+   writel_relaxed(i, adapter->hw.hw_addr + rx_ring->tail);
+
+   /* We need this if more than one processor can write to our tail
+* at a time, it synchronizes IO on IA64/Altix systems
+*/
+   mmiowb();
}
 }
 
@@ -2298,7 +2303,7 @@ static inline void igbvf_tx_queue_adv(struct 
igbvf_adapter *adapter,
 
tx_ring->buffer_info[first].next_to_watch = tx_desc;
tx_ring->next_to_use = i;
-   writel(i, adapter->hw.hw_addr + tx_ring->tail);
+   writel_relaxed(i, adapter->hw.hw_addr + tx_ring->tail);
/* we need this if more than one processor can write to our tail
 * at a time, it synchronizes IO on IA64/Altix systems
 */
-- 
2.7.4



[PATCH v6 5/7] fm10k: Eliminate duplicate barriers on weakly-ordered archs

2018-03-23 Thread Sinan Kaya
Code includes wmb() followed by writel(). writel() already has a
barrier on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing
the register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
---
 drivers/net/ethernet/intel/fm10k/fm10k_main.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_main.c 
b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
index 409554d..360ff9b 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_main.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
@@ -180,7 +180,12 @@ void fm10k_alloc_rx_buffers(struct fm10k_ring *rx_ring, 
u16 cleaned_count)
wmb();
 
/* notify hardware of new descriptors */
-   writel(i, rx_ring->tail);
+   writel_relaxed(i, rx_ring->tail);
+
+   /* We need this if more than one processor can write to our tail
+* at a time, it synchronizes IO on IA64/Altix systems
+*/
+   mmiowb();
}
 }
 
@@ -1055,7 +1060,7 @@ static void fm10k_tx_map(struct fm10k_ring *tx_ring,
 
/* notify HW of packet */
if (netif_xmit_stopped(txring_txq(tx_ring)) || !skb->xmit_more) {
-   writel(i, tx_ring->tail);
+   writel_relaxed(i, tx_ring->tail);
 
/* we need this if more than one processor can write to our tail
 * at a time, it synchronizes IO on IA64/Altix systems
-- 
2.7.4



[PATCH v6 4/7] igb: eliminate duplicate barriers on weakly-ordered archs

2018-03-23 Thread Sinan Kaya
Code includes wmb() followed by writel(). writel() already has a barrier
on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
---
 drivers/net/ethernet/intel/igb/igb_main.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c 
b/drivers/net/ethernet/intel/igb/igb_main.c
index 3d4ff3c..570af25 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -5674,7 +5674,7 @@ static int igb_tx_map(struct igb_ring *tx_ring,
igb_maybe_stop_tx(tx_ring, DESC_NEEDED);
 
if (netif_xmit_stopped(txring_txq(tx_ring)) || !skb->xmit_more) {
-   writel(i, tx_ring->tail);
+   writel_relaxed(i, tx_ring->tail);
 
/* we need this if more than one processor can write to our tail
 * at a time, it synchronizes IO on IA64/Altix systems
@@ -8079,7 +8079,12 @@ void igb_alloc_rx_buffers(struct igb_ring *rx_ring, u16 
cleaned_count)
 * such as IA-64).
 */
wmb();
-   writel(i, rx_ring->tail);
+   writel_relaxed(i, rx_ring->tail);
+
+   /* We need this if more than one processor can write to our tail
+* at a time, it synchronizes IO on IA64/Altix systems
+*/
+   mmiowb();
}
 }
 
-- 
2.7.4



[PATCH v6 1/7] i40e/i40evf: Eliminate duplicate barriers on weakly-ordered archs

2018-03-23 Thread Sinan Kaya
Code includes wmb() followed by writel(). writel() already has a barrier
on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 24 
 drivers/net/ethernet/intel/i40evf/i40e_txrx.c |  9 +++--
 2 files changed, 27 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c 
b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index c6972bd..fc10cc0 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -186,7 +186,13 @@ static int i40e_program_fdir_filter(struct 
i40e_fdir_filter *fdir_data,
/* Mark the data descriptor to be watched */
first->next_to_watch = tx_desc;
 
-   writel(tx_ring->next_to_use, tx_ring->tail);
+   writel_relaxed(tx_ring->next_to_use, tx_ring->tail);
+
+   /* We need this if more than one processor can write to our tail
+* at a time, it synchronizes IO on IA64/Altix systems
+*/
+   mmiowb();
+
return 0;
 
 dma_fail:
@@ -1529,7 +1535,12 @@ static inline void i40e_release_rx_desc(struct i40e_ring 
*rx_ring, u32 val)
 * such as IA-64).
 */
wmb();
-   writel(val, rx_ring->tail);
+   writel_relaxed(val, rx_ring->tail);
+
+   /* We need this if more than one processor can write to our tail
+* at a time, it synchronizes IO on IA64/Altix systems
+*/
+   mmiowb();
 }
 
 /**
@@ -2412,7 +2423,12 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, 
int budget)
 */
wmb();
 
-   writel(xdp_ring->next_to_use, xdp_ring->tail);
+   writel_relaxed(xdp_ring->next_to_use, xdp_ring->tail);
+
+   /* We need this if more than one processor can write to our tail
+* at a time, it synchronizes IO on IA64/Altix systems
+*/
+   mmiowb();
}
 
rx_ring->skb = skb;
@@ -3437,7 +3453,7 @@ static inline int i40e_tx_map(struct i40e_ring *tx_ring, 
struct sk_buff *skb,
 
/* notify HW of packet */
if (netif_xmit_stopped(txring_txq(tx_ring)) || !skb->xmit_more) {
-   writel(i, tx_ring->tail);
+   writel_relaxed(i, tx_ring->tail);
 
/* we need this if more than one processor can write to our tail
 * at a time, it synchronizes IO on IA64/Altix systems
diff --git a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c 
b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
index 1ae112f..ca02762 100644
--- a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
@@ -810,7 +810,12 @@ static inline void i40e_release_rx_desc(struct i40e_ring 
*rx_ring, u32 val)
 * such as IA-64).
 */
wmb();
-   writel(val, rx_ring->tail);
+   writel_relaxed(val, rx_ring->tail);
+
+   /* We need this if more than one processor can write to our tail
+* at a time, it synchronizes IO on IA64/Altix systems
+*/
+   mmiowb();
 }
 
 /**
@@ -2379,7 +2384,7 @@ static inline void i40evf_tx_map(struct i40e_ring 
*tx_ring, struct sk_buff *skb,
 
/* notify HW of packet */
if (netif_xmit_stopped(txring_txq(tx_ring)) || !skb->xmit_more) {
-   writel(i, tx_ring->tail);
+   writel_relaxed(i, tx_ring->tail);
 
/* we need this if more than one processor can write to our tail
 * at a time, it synchronizes IO on IA64/Altix systems
-- 
2.7.4



[PATCH v6 6/7] ixgbevf: keep writel() closer to wmb()

2018-03-23 Thread Sinan Kaya
Remove ixgbevf_write_tail() in favor of moving writel() close to
wmb().

Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf.h  | 5 -
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 6 +++---
 2 files changed, 3 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h 
b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
index f712646..f97091d 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
@@ -312,11 +312,6 @@ static inline u16 ixgbevf_desc_unused(struct ixgbevf_ring 
*ring)
return ((ntc > ntu) ? 0 : ring->count) + ntc - ntu - 1;
 }
 
-static inline void ixgbevf_write_tail(struct ixgbevf_ring *ring, u32 value)
-{
-   writel(value, ring->tail);
-}
-
 #define IXGBEVF_RX_DESC(R, i)  \
(&(((union ixgbe_adv_rx_desc *)((R)->desc))[i]))
 #define IXGBEVF_TX_DESC(R, i)  \
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c 
b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 3d9033f..815cb1a 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -725,7 +725,7 @@ static void ixgbevf_alloc_rx_buffers(struct ixgbevf_ring 
*rx_ring,
 * such as IA-64).
 */
wmb();
-   ixgbevf_write_tail(rx_ring, i);
+   writel(i, rx_ring->tail);
}
 }
 
@@ -1232,7 +1232,7 @@ static int ixgbevf_clean_rx_irq(struct ixgbevf_q_vector 
*q_vector,
 * know there are new descriptors to fetch.
 */
wmb();
-   ixgbevf_write_tail(xdp_ring, xdp_ring->next_to_use);
+   writel(xdp_ring->next_to_use, xdp_ring->tail);
}
 
u64_stats_update_begin(_ring->syncp);
@@ -4004,7 +4004,7 @@ static void ixgbevf_tx_map(struct ixgbevf_ring *tx_ring,
tx_ring->next_to_use = i;
 
/* notify HW of packet */
-   ixgbevf_write_tail(tx_ring, i);
+   writel(i, tx_ring->tail);
 
return;
 dma_error:
-- 
2.7.4



[PATCH v6 7/7] ixgbevf: eliminate duplicate barriers on weakly-ordered archs

2018-03-23 Thread Sinan Kaya
Code includes wmb() followed by writel() in multiple places. writel()
already has a barrier on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 21 ++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c 
b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 815cb1a..9e684b1 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -725,7 +725,12 @@ static void ixgbevf_alloc_rx_buffers(struct ixgbevf_ring 
*rx_ring,
 * such as IA-64).
 */
wmb();
-   writel(i, rx_ring->tail);
+   writel_relaxed(i, rx_ring->tail);
+
+   /* We need this if more than one processor can write to our tail
+* at a time, it synchronizes IO on IA64/Altix systems
+*/
+   mmiowb();
}
 }
 
@@ -1232,7 +1237,12 @@ static int ixgbevf_clean_rx_irq(struct ixgbevf_q_vector 
*q_vector,
 * know there are new descriptors to fetch.
 */
wmb();
-   writel(xdp_ring->next_to_use, xdp_ring->tail);
+   writel_relaxed(xdp_ring->next_to_use, xdp_ring->tail);
+
+   /* We need this if more than one processor can write to our tail
+* at a time, it synchronizes IO on IA64/Altix systems
+*/
+   mmiowb();
}
 
u64_stats_update_begin(_ring->syncp);
@@ -4004,7 +4014,12 @@ static void ixgbevf_tx_map(struct ixgbevf_ring *tx_ring,
tx_ring->next_to_use = i;
 
/* notify HW of packet */
-   writel(i, tx_ring->tail);
+   writel_relaxed(i, tx_ring->tail);
+
+   /* We need this if more than one processor can write to our tail
+* at a time, it synchronizes IO on IA64/Altix systems
+*/
+   mmiowb();
 
return;
 dma_error:
-- 
2.7.4



[PATCH v6 0/7] netdev: intel: Eliminate duplicate barriers on weakly-ordered archs

2018-03-23 Thread Sinan Kaya
Code includes wmb() followed by writel() in multiple places. writel()
already has a barrier on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

I did a regex search for wmb() followed by writel() in each drivers
directory.
I scrubbed the ones I care about in this series.

I considered "ease of change", "popular usage" and "performance critical
path" as the determining criteria for my filtering.

We used relaxed API heavily on ARM for a long time but
it did not exist on other architectures. For this reason, relaxed
architectures have been paying double penalty in order to use the common
drivers.

Now that relaxed API is present on all architectures, we can go and scrub
all drivers to see what needs to change and what can remain.

We start with mostly used ones and hope to increase the coverage over time.
It will take a while to cover all drivers.

Feel free to apply patches individually.

Changes since v5:
add mmiowb to missing places in order not to break PPC

Changes since v4:
posted ixgbevf: keep writel() closer to wmb() to jkircher
posted ixgbevf: eliminate duplicate barriers on weakly-ordered archs to

Sinan Kaya (7):
  i40e/i40evf: Eliminate duplicate barriers on weakly-ordered archs
  ixgbe: eliminate duplicate barriers on weakly-ordered archs
  igbvf: eliminate duplicate barriers on weakly-ordered archs
  igb: eliminate duplicate barriers on weakly-ordered archs
  fm10k: Eliminate duplicate barriers on weakly-ordered archs
  ixgbevf: keep writel() closer to wmb()
  ixgbevf: eliminate duplicate barriers on weakly-ordered archs

 drivers/net/ethernet/intel/fm10k/fm10k_main.c |  9 +++--
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 24 +++
 drivers/net/ethernet/intel/i40evf/i40e_txrx.c |  9 +++--
 drivers/net/ethernet/intel/igb/igb_main.c |  9 +++--
 drivers/net/ethernet/intel/igbvf/netdev.c |  9 +++--
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 23 ++
 drivers/net/ethernet/intel/ixgbevf/ixgbevf.h  |  5 -
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 21 +---
 8 files changed, 85 insertions(+), 24 deletions(-)

-- 
2.7.4



[PATCH v6 2/7] ixgbe: eliminate duplicate barriers on weakly-ordered archs

2018-03-23 Thread Sinan Kaya
Code includes wmb() followed by writel() in multiple places. writel()
already has a barrier on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 23 +++
 1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index e3b32ea..1ecc2f5 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1701,7 +1701,12 @@ void ixgbe_alloc_rx_buffers(struct ixgbe_ring *rx_ring, 
u16 cleaned_count)
 * such as IA-64).
 */
wmb();
-   writel(i, rx_ring->tail);
+   writel_relaxed(i, rx_ring->tail);
+
+   /* We need this if more than one processor can write to our tail
+* at a time, it synchronizes IO on IA64/Altix systems
+*/
+   mmiowb();
}
 }
 
@@ -2470,7 +2475,12 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector 
*q_vector,
 * know there are new descriptors to fetch.
 */
wmb();
-   writel(ring->next_to_use, ring->tail);
+   writel_relaxed(ring->next_to_use, ring->tail);
+
+   /* We need this if more than one processor can write to our tail
+* at a time, it synchronizes IO on IA64/Altix systems
+*/
+   mmiowb();
 
xdp_do_flush_map();
}
@@ -8101,7 +8111,7 @@ static int ixgbe_tx_map(struct ixgbe_ring *tx_ring,
ixgbe_maybe_stop_tx(tx_ring, DESC_NEEDED);
 
if (netif_xmit_stopped(txring_txq(tx_ring)) || !skb->xmit_more) {
-   writel(i, tx_ring->tail);
+   writel_relaxed(i, tx_ring->tail);
 
/* we need this if more than one processor can write to our tail
 * at a time, it synchronizes IO on IA64/Altix systems
@@ -10038,7 +10048,12 @@ static void ixgbe_xdp_flush(struct net_device *dev)
 * are new descriptors to fetch.
 */
wmb();
-   writel(ring->next_to_use, ring->tail);
+   writel_relaxed(ring->next_to_use, ring->tail);
+
+   /* We need this if more than one processor can write to our tail
+* at a time, it synchronizes IO on IA64/Altix systems
+*/
+   mmiowb();
 
return;
 }
-- 
2.7.4



Re: [PATCH v5 3/5] bnx2x: Eliminate duplicate barriers on weakly-ordered archs

2018-03-23 Thread Sinan Kaya
On 3/23/2018 1:04 PM, David Miller wrote:
> From: Sinan Kaya <ok...@codeaurora.org>
> Date: Fri, 23 Mar 2018 12:51:47 -0400
> 
>> It could if txdata->tx_db was not a union. There is a data dependency
>> between txdata->tx_db.data.prod and txdata->tx_db.raw. 
>>
>> So, no reordering.
> 
> I don't see it that way, the code requires that:
> 
>   txdata->tx_db.data.prod += nbd;
> 
> is visible before the doorbell update.> 
> barrier() doesn't provide that.
> 
> Neither does writel_relaxed().  However plain writel() does.

Correct for some architectures including ARM but not correct universally.

writel() just guarantees register read/writes before and after to be
ordered when HW observes it. 

writel() doesn't guarantee that the memory update is visible to the HW
on all architectures.

If you need memory update visibility, that barrier() should have been a
wmb()

A correct multi-arch pattern is

wmb()
writel_relaxed()
mmiowb()

We have decided to drop a similar patch on Infiniband due to incorrect
usage of barrier(). If you feel strongly about it, I can remove the
DOORBELL change.

> 
> Therefore the code is only correct as-is, and your change potentially
> adds a reordering problem.
> 


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: [PATCH v5 3/5] bnx2x: Eliminate duplicate barriers on weakly-ordered archs

2018-03-23 Thread Sinan Kaya
On 3/23/2018 12:43 PM, David Miller wrote:
> From: Sinan Kaya <ok...@codeaurora.org>
> Date: Fri, 23 Mar 2018 12:31:12 -0400
> 
>> Sorry, you got me confused now.
>>
>> If you look at the code closer, you'll see this.
>>
>>  wmb();
>>
>>  txdata->tx_db.data.prod += nbd;
>>  barrier();
>>
>>  DOORBELL(bp, txdata->cid, txdata->tx_db.raw);
>>
>> and you also asked me to rename DOORBELL to DOORBELL_RELAXED() to make
>> it obvious that we have a relaxed operator inside the macro.
> 
> This still doesn't match the stated pattern.

I can certainly update the commit text for this or spin into its own
patch to make it obvious.

> 
>   wmb();
>   /* no other memory or I/O or IOMEM operation */
>   writel();
> 
> There is a write to a producer index there and then no non-compiler
> barrier or any kind before the writel().
> 
> So, in fact, it might really need that implicit writel() barrier here!
> 

It could if txdata->tx_db was not a union. There is a data dependency
between txdata->tx_db.data.prod and txdata->tx_db.raw. 

So, no reordering.

I can argue that barrier() here is useless in fact.

Anyhow, I'll spin this piece out of this patch so that we pay special
attention with a better description.

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: [PATCH v5 3/5] bnx2x: Eliminate duplicate barriers on weakly-ordered archs

2018-03-23 Thread Sinan Kaya
On 3/23/2018 12:20 PM, David Miller wrote:
> From: Sinan Kaya <ok...@codeaurora.org>
> Date: Thu, 22 Mar 2018 13:10:00 -0400
> 
>> Code includes wmb() followed by writel(). writel() already has a
>> barrier on some architectures like arm64.
>  ...
>> @@ -4155,7 +4155,7 @@ netdev_tx_t bnx2x_start_xmit(struct sk_buff *skb, 
>> struct net_device *dev)
>>  txdata->tx_db.data.prod += nbd;
>>  barrier();
>>  
>> -DOORBELL(bp, txdata->cid, txdata->tx_db.raw);
>> +DOORBELL_RELAXED(bp, txdata->cid, txdata->tx_db.raw);
>>  
>>  mmiowb();
>  ...
>> @@ -2592,7 +2592,7 @@ static int bnx2x_run_loopback(struct bnx2x *bp, int 
>> loopback_mode)
>>  
>>  txdata->tx_db.data.prod += 2;
>>  barrier();
>> -DOORBELL(bp, txdata->cid, txdata->tx_db.raw);
>> +DOORBELL_RELAXED(bp, txdata->cid, txdata->tx_db.raw);
> 
> These are compiler barriers being used here, not wmb().
> 
> Look, if I can't see a clear:
> 
>   wmb()
>   writel()
> 
> sequence in the patch hunks, I am going to keep pushing back on
> these changes.

Sorry, you got me confused now.

If you look at the code closer, you'll see this.

wmb();

txdata->tx_db.data.prod += nbd;
barrier();

DOORBELL(bp, txdata->cid, txdata->tx_db.raw);

and you also asked me to rename DOORBELL to DOORBELL_RELAXED() to make
it obvious that we have a relaxed operator inside the macro.

Did I miss something?

of course, treating barrier() universally as a write barrier is wrong.

> 
> Thank you.
> 


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


[PATCH v5 3/5] bnx2x: Eliminate duplicate barriers on weakly-ordered archs

2018-03-22 Thread Sinan Kaya
Code includes wmb() followed by writel(). writel() already has a
barrier on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing
the register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x.h | 12 
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c |  2 +-
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h |  4 ++--
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c |  2 +-
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c|  4 ++--
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c   |  2 +-
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c|  2 +-
 7 files changed, 16 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
index 352beff..d847e1b 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
@@ -166,6 +166,12 @@ do {   \
 #define REG_RD8(bp, offset)readb(REG_ADDR(bp, offset))
 #define REG_RD16(bp, offset)   readw(REG_ADDR(bp, offset))
 
+#define REG_WR_RELAXED(bp, offset, val)\
+   writel_relaxed((u32)val, REG_ADDR(bp, offset))
+
+#define REG_WR16_RELAXED(bp, offset, val) \
+   writew_relaxed((u16)val, REG_ADDR(bp, offset))
+
 #define REG_WR(bp, offset, val)writel((u32)val, REG_ADDR(bp, 
offset))
 #define REG_WR8(bp, offset, val)   writeb((u8)val, REG_ADDR(bp, offset))
 #define REG_WR16(bp, offset, val)  writew((u16)val, REG_ADDR(bp, offset))
@@ -758,10 +764,8 @@ struct bnx2x_fastpath {
 #if (BNX2X_DB_SHIFT < BNX2X_DB_MIN_SHIFT)
 #error "Min DB doorbell stride is 8"
 #endif
-#define DOORBELL(bp, cid, val) \
-   do { \
-   writel((u32)(val), bp->doorbells + (bp->db_size * (cid))); \
-   } while (0)
+#define DOORBELL_RELAXED(bp, cid, val) \
+   writel_relaxed((u32)(val), (bp)->doorbells + ((bp)->db_size * (cid)))
 
 /* TX CSUM helpers */
 #define SKB_CS_OFF(skb)(offsetof(struct tcphdr, check) - \
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
index d7c98e8..172fd79 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
@@ -4155,7 +4155,7 @@ netdev_tx_t bnx2x_start_xmit(struct sk_buff *skb, struct 
net_device *dev)
txdata->tx_db.data.prod += nbd;
barrier();
 
-   DOORBELL(bp, txdata->cid, txdata->tx_db.raw);
+   DOORBELL_RELAXED(bp, txdata->cid, txdata->tx_db.raw);
 
mmiowb();
 
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h
index a5265e1..a8ce5c5 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h
@@ -522,8 +522,8 @@ static inline void bnx2x_update_rx_prod(struct bnx2x *bp,
wmb();
 
for (i = 0; i < sizeof(rx_prods)/4; i++)
-   REG_WR(bp, fp->ustorm_rx_prods_offset + i*4,
-  ((u32 *)_prods)[i]);
+   REG_WR_RELAXED(bp, fp->ustorm_rx_prods_offset + i * 4,
+  ((u32 *)_prods)[i]);
 
mmiowb(); /* keep prod updates ordered */
 
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c
index 1e33abd..bda5e4f 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c
@@ -2592,7 +2592,7 @@ static int bnx2x_run_loopback(struct bnx2x *bp, int 
loopback_mode)
 
txdata->tx_db.data.prod += 2;
barrier();
-   DOORBELL(bp, txdata->cid, txdata->tx_db.raw);
+   DOORBELL_RELAXED(bp, txdata->cid, txdata->tx_db.raw);
 
mmiowb();
barrier();
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
index 74fc9af..146c40d 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
@@ -3817,8 +3817,8 @@ static void bnx2x_sp_prod_update(struct bnx2x *bp)
 */
mb();
 
-   REG_WR16(bp, BAR_XSTRORM_INTMEM + XSTORM_SPQ_PROD_OFFSET(func),
-bp->spq_prod_idx);
+   REG_WR16_RELAXED(bp, BAR_XSTRORM_INTMEM + XSTORM_SPQ_PROD_OFFSET(func),
+bp->spq_prod_idx);
mmiowb();
 }
 
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c
index ffa7959..40e55d8 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c
@@ -105,7 +105,7 

[PATCH v5 2/5] qlcnic: Eliminate duplicate barriers on weakly-ordered archs

2018-03-22 Thread Sinan Kaya
Code includes wmb() followed by writel(). writel() already has a
barrier on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing
the register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
Acked-by: Manish Chopra <manish.cho...@cavium.com>
---
 drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c 
b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c
index 46b0372..97c146e7 100644
--- a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c
+++ b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c
@@ -478,7 +478,7 @@ irqreturn_t qlcnic_83xx_clear_legacy_intr(struct 
qlcnic_adapter *adapter)
wmb();
 
/* clear the interrupt trigger control register */
-   writel(0, adapter->isr_int_vec);
+   writel_relaxed(0, adapter->isr_int_vec);
intr_val = readl(adapter->isr_int_vec);
do {
intr_val = readl(adapter->tgt_status_reg);
-- 
2.7.4



[PATCH v5 1/5] net: qla3xxx: Eliminate duplicate barriers on weakly-ordered archs

2018-03-22 Thread Sinan Kaya
Code includes wmb() followed by writel(). writel() already has a
barrier on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing
the register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
---
 drivers/net/ethernet/qlogic/qla3xxx.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qla3xxx.c 
b/drivers/net/ethernet/qlogic/qla3xxx.c
index 9e5264d..0e71b74 100644
--- a/drivers/net/ethernet/qlogic/qla3xxx.c
+++ b/drivers/net/ethernet/qlogic/qla3xxx.c
@@ -1858,8 +1858,8 @@ static void ql_update_small_bufq_prod_index(struct 
ql3_adapter *qdev)
qdev->small_buf_release_cnt -= 8;
}
wmb();
-   writel(qdev->small_buf_q_producer_index,
-   _regs->CommonRegs.rxSmallQProducerIndex);
+   writel_relaxed(qdev->small_buf_q_producer_index,
+  _regs->CommonRegs.rxSmallQProducerIndex);
}
 }
 
-- 
2.7.4



[PATCH v5 5/5] bnxt_en: Eliminate duplicate barriers on weakly-ordered archs

2018-03-22 Thread Sinan Kaya
Code includes wmb() followed by writel(). writel() already has a barrier on
some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Create a new wrapper function with relaxed write operator. Use the new
wrapper when a write is following a wmb().

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 2 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt.h | 9 +
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 1500243..befb538 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -1922,7 +1922,7 @@ static int bnxt_poll_work(struct bnxt *bp, struct 
bnxt_napi *bnapi, int budget)
/* Sync BD data before updating doorbell */
wmb();
 
-   bnxt_db_write(bp, db, DB_KEY_TX | prod);
+   bnxt_db_write_relaxed(bp, db, DB_KEY_TX | prod);
}
 
cpr->cp_raw_cons = raw_cons;
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
index 1989c47..5e453b9 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
@@ -1401,6 +1401,15 @@ static inline u32 bnxt_tx_avail(struct bnxt *bp, struct 
bnxt_tx_ring_info *txr)
((txr->tx_prod - txr->tx_cons) & bp->tx_ring_mask);
 }
 
+/* For TX and RX ring doorbells with no ordering guarantee*/
+static inline void bnxt_db_write_relaxed(struct bnxt *bp, void __iomem *db,
+u32 val)
+{
+   writel_relaxed(val, db);
+   if (bp->flags & BNXT_FLAG_DOUBLE_DB)
+   writel_relaxed(val, db);
+}
+
 /* For TX and RX ring doorbells */
 static inline void bnxt_db_write(struct bnxt *bp, void __iomem *db, u32 val)
 {
-- 
2.7.4



[PATCH v5 4/5] net: qlge: Eliminate duplicate barriers on weakly-ordered archs

2018-03-22 Thread Sinan Kaya
Code includes wmb() followed by writel(). writel() already has a barrier on
some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Create a new wrapper function with relaxed write operator. Use the new
wrapper when a write is following a wmb().

Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
---
 drivers/net/ethernet/qlogic/qlge/qlge.h  | 16 
 drivers/net/ethernet/qlogic/qlge/qlge_main.c |  3 ++-
 2 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/qlogic/qlge/qlge.h 
b/drivers/net/ethernet/qlogic/qlge/qlge.h
index 84ac50f..3e71b65 100644
--- a/drivers/net/ethernet/qlogic/qlge/qlge.h
+++ b/drivers/net/ethernet/qlogic/qlge/qlge.h
@@ -2185,6 +2185,22 @@ static inline void ql_write_db_reg(u32 val, void __iomem 
*addr)
 }
 
 /*
+ * Doorbell Registers:
+ * Doorbell registers are virtual registers in the PCI memory space.
+ * The space is allocated by the chip during PCI initialization.  The
+ * device driver finds the doorbell address in BAR 3 in PCI config space.
+ * The registers are used to control outbound and inbound queues. For
+ * example, the producer index for an outbound queue.  Each queue uses
+ * 1 4k chunk of memory.  The lower half of the space is for outbound
+ * queues. The upper half is for inbound queues.
+ * Caller has to guarantee ordering.
+ */
+static inline void ql_write_db_reg_relaxed(u32 val, void __iomem *addr)
+{
+   writel_relaxed(val, addr);
+}
+
+/*
  * Shadow Registers:
  * Outbound queues have a consumer index that is maintained by the chip.
  * Inbound queues have a producer index that is maintained by the chip.
diff --git a/drivers/net/ethernet/qlogic/qlge/qlge_main.c 
b/drivers/net/ethernet/qlogic/qlge/qlge_main.c
index 50038d9..8293c202 100644
--- a/drivers/net/ethernet/qlogic/qlge/qlge_main.c
+++ b/drivers/net/ethernet/qlogic/qlge/qlge_main.c
@@ -2700,7 +2700,8 @@ static netdev_tx_t qlge_send(struct sk_buff *skb, struct 
net_device *ndev)
tx_ring->prod_idx = 0;
wmb();
 
-   ql_write_db_reg(tx_ring->prod_idx, tx_ring->prod_idx_db_reg);
+   ql_write_db_reg_relaxed(tx_ring->prod_idx, tx_ring->prod_idx_db_reg);
+   mmiowb();
netif_printk(qdev, tx_queued, KERN_DEBUG, qdev->ndev,
 "tx queued, slot %d, len %d\n",
 tx_ring->prod_idx, skb->len);
-- 
2.7.4



[PATCH v5 0/5] netdev: Eliminate duplicate barriers on weakly-ordered archs

2018-03-22 Thread Sinan Kaya
Code includes wmb() followed by writel() in multiple places. writel()
already has a barrier on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

I did a regex search for wmb() followed by writel() in each drivers
directory.
I scrubbed the ones I care about in this series.

I considered "ease of change", "popular usage" and "performance critical
path" as the determining criteria for my filtering.

We used relaxed API heavily on ARM for a long time but
it did not exist on other architectures. For this reason, relaxed
architectures have been paying double penalty in order to use the common
drivers.

Now that relaxed API is present on all architectures, we can go and scrub
all drivers to see what needs to change and what can remain.

We start with mostly used ones and hope to increase the coverage over time.
It will take a while to cover all drivers.

Changes since v4:
- dropped bnx2x, ena
- collect reviewed and tested bys
- rename DOORBELL to DOORBELL_RELAXED
- limit changes to hot paths only

Sinan Kaya (5):
  net: qla3xxx: Eliminate duplicate barriers on weakly-ordered archs
  qlcnic: Eliminate duplicate barriers on weakly-ordered archs
  bnx2x: Eliminate duplicate barriers on weakly-ordered archs
  net: qlge: Eliminate duplicate barriers on weakly-ordered archs
  bnxt_en: Eliminate duplicate barriers on weakly-ordered archs

 drivers/net/ethernet/broadcom/bnx2x/bnx2x.h | 12 
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c |  2 +-
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h |  4 ++--
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c |  2 +-
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c|  4 ++--
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c   |  2 +-
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c|  2 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt.c   |  2 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt.h   |  9 +
 drivers/net/ethernet/qlogic/qla3xxx.c   |  4 ++--
 drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c |  2 +-
 drivers/net/ethernet/qlogic/qlge/qlge.h | 16 
 drivers/net/ethernet/qlogic/qlge/qlge_main.c|  3 ++-
 13 files changed, 47 insertions(+), 17 deletions(-)

-- 
2.7.4



[PATCH v5 2/2] ixgbevf: eliminate duplicate barriers on weakly-ordered archs

2018-03-22 Thread Sinan Kaya
Code includes wmb() followed by writel() in multiple places. writel()
already has a barrier on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c 
b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 815cb1a..eaa930e 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -725,7 +725,7 @@ static void ixgbevf_alloc_rx_buffers(struct ixgbevf_ring 
*rx_ring,
 * such as IA-64).
 */
wmb();
-   writel(i, rx_ring->tail);
+   writel_relaxed(i, rx_ring->tail);
}
 }
 
@@ -1232,7 +1232,7 @@ static int ixgbevf_clean_rx_irq(struct ixgbevf_q_vector 
*q_vector,
 * know there are new descriptors to fetch.
 */
wmb();
-   writel(xdp_ring->next_to_use, xdp_ring->tail);
+   writel_relaxed(xdp_ring->next_to_use, xdp_ring->tail);
}
 
u64_stats_update_begin(_ring->syncp);
@@ -4004,7 +4004,7 @@ static void ixgbevf_tx_map(struct ixgbevf_ring *tx_ring,
tx_ring->next_to_use = i;
 
/* notify HW of packet */
-   writel(i, tx_ring->tail);
+   writel_relaxed(i, tx_ring->tail);
 
return;
 dma_error:
-- 
2.7.4



[PATCH v5 1/2] ixgbevf: keep writel() closer to wmb()

2018-03-22 Thread Sinan Kaya
Remove ixgbevf_write_tail() in favor of moving writel() close to
wmb().

Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf.h  | 5 -
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 6 +++---
 2 files changed, 3 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h 
b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
index f712646..f97091d 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
@@ -312,11 +312,6 @@ static inline u16 ixgbevf_desc_unused(struct ixgbevf_ring 
*ring)
return ((ntc > ntu) ? 0 : ring->count) + ntc - ntu - 1;
 }
 
-static inline void ixgbevf_write_tail(struct ixgbevf_ring *ring, u32 value)
-{
-   writel(value, ring->tail);
-}
-
 #define IXGBEVF_RX_DESC(R, i)  \
(&(((union ixgbe_adv_rx_desc *)((R)->desc))[i]))
 #define IXGBEVF_TX_DESC(R, i)  \
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c 
b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 3d9033f..815cb1a 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -725,7 +725,7 @@ static void ixgbevf_alloc_rx_buffers(struct ixgbevf_ring 
*rx_ring,
 * such as IA-64).
 */
wmb();
-   ixgbevf_write_tail(rx_ring, i);
+   writel(i, rx_ring->tail);
}
 }
 
@@ -1232,7 +1232,7 @@ static int ixgbevf_clean_rx_irq(struct ixgbevf_q_vector 
*q_vector,
 * know there are new descriptors to fetch.
 */
wmb();
-   ixgbevf_write_tail(xdp_ring, xdp_ring->next_to_use);
+   writel(xdp_ring->next_to_use, xdp_ring->tail);
}
 
u64_stats_update_begin(_ring->syncp);
@@ -4004,7 +4004,7 @@ static void ixgbevf_tx_map(struct ixgbevf_ring *tx_ring,
tx_ring->next_to_use = i;
 
/* notify HW of packet */
-   ixgbevf_write_tail(tx_ring, i);
+   writel(i, tx_ring->tail);
 
return;
 dma_error:
-- 
2.7.4



Re: [PATCH v4 00/17] netdev: Eliminate duplicate barriers on weakly-ordered archs

2018-03-21 Thread Sinan Kaya
On 3/21/2018 10:56 AM, David Miller wrote:
> From: Sinan Kaya <ok...@codeaurora.org>
> Date: Mon, 19 Mar 2018 22:42:15 -0400
> 
>> Code includes wmb() followed by writel() in multiple places. writel()
>> already has a barrier on some architectures like arm64.
>>
>> This ends up CPU observing two barriers back to back before executing the
>> register write.
>>
>> Since code already has an explicit barrier call, changing writel() to
>> writel_relaxed().
>>
>> I did a regex search for wmb() followed by writel() in each drivers
>> directory.
>> I scrubbed the ones I care about in this series.
>>
>> I considered "ease of change", "popular usage" and "performance critical
>> path" as the determining criteria for my filtering.
> 
> I agree that for performance sensitive operations, specifically writing
> doorbell registers in the hot paths or RX and TX packet processing, this
> is a good change.
> 
> However, in configuration paths and whatnot, it is much less urgent and
> useful.
> 
> Therefore I think it would work better if you concentrated solely on
> hot code path cases.
> 
> You can, on a driver by driver basis, submit the other transformations
> in the slow paths, and let the driver maintainers decide whether to
> take those on or not.
> 
> Also, please stick exactly to the case where we have:
> 
>   wmb/mb/etc.
>   writel()
> 

OK

> Because I see some changes where we have:
> 
>   writel()
> 
>   barrier()
> 
>   writel()
> 

barrier() on ARM is a write barrier. Apparently, it is a compiler barrier
on Intel. I briefly discussed the barrier() behavior in rdma mailing list [1].

Our conclusion is that code should have used wmb() if it really needed
to synchronize memory contents to the device and barrier() is already
wrong. It just guarantees that code doesn't move. writel() already has
a compiler barrier inside. It won't move to begin with.

Like you suggested, we decided to leave these changes alone and even
skip those drivers.

I'll take another look at the patches.

> for exmaple, and you are turning that second writel() into a relaxed
> on as well.  The above is using a compile barrier, not a memory
> barrier, so effectively it is two writel()'s in sequence which is
> not what this patch set is about.
> 
> If anything, that compile barrier() is superfluous and could be
> removed.  But that is also a separate change from what this patch
> series is doing here.
> 

agreed, I'll remove such changes.

> Finally, it makes it that much easier if we can see the preceeding
> memory barrier in the context of the patch that adjusts the writel
> into a writel_relaxed.
> 
> In one case, a macro DOORBELL() is changed to use writel().  This
> makes it so that the patch reviewer has to scan over the entire
> driver in question to see exactly how DOORBELL() is used and whether
> it fits the criteria for the writel_relaxed() transformation.
> 
> I would suggest that you adjust the name of the macro in a situation
> like this, f.e. to DOORBELL_RELAXED().

makes sense.

> 
> Thank you.
> 

[1] https://patchwork.kernel.org/project/LKML/list/?submitter=145491

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


[PATCH REPOST v4 4/7] igb: eliminate duplicate barriers on weakly-ordered archs

2018-03-21 Thread Sinan Kaya
Code includes wmb() followed by writel(). writel() already has a barrier
on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
Reviewed-by: Alexander Duyck <alexander.h.du...@intel.com>
---
 drivers/net/ethernet/intel/igb/igb_main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c 
b/drivers/net/ethernet/intel/igb/igb_main.c
index b88fae7..82aea92 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -5671,7 +5671,7 @@ static int igb_tx_map(struct igb_ring *tx_ring,
igb_maybe_stop_tx(tx_ring, DESC_NEEDED);
 
if (netif_xmit_stopped(txring_txq(tx_ring)) || !skb->xmit_more) {
-   writel(i, tx_ring->tail);
+   writel_relaxed(i, tx_ring->tail);
 
/* we need this if more than one processor can write to our tail
 * at a time, it synchronizes IO on IA64/Altix systems
@@ -8072,7 +8072,7 @@ void igb_alloc_rx_buffers(struct igb_ring *rx_ring, u16 
cleaned_count)
 * such as IA-64).
 */
wmb();
-   writel(i, rx_ring->tail);
+   writel_relaxed(i, rx_ring->tail);
}
 }
 
-- 
2.7.4



[PATCH REPOST v4 1/7] i40e/i40evf: Eliminate duplicate barriers on weakly-ordered archs

2018-03-21 Thread Sinan Kaya
Code includes wmb() followed by writel(). writel() already has a barrier
on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
Reviewed-by: Alexander Duyck <alexander.h.du...@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 8 
 drivers/net/ethernet/intel/i40evf/i40e_txrx.c | 4 ++--
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c 
b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index e554aa6cf..9455869 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -185,7 +185,7 @@ static int i40e_program_fdir_filter(struct i40e_fdir_filter 
*fdir_data,
/* Mark the data descriptor to be watched */
first->next_to_watch = tx_desc;
 
-   writel(tx_ring->next_to_use, tx_ring->tail);
+   writel_relaxed(tx_ring->next_to_use, tx_ring->tail);
return 0;
 
 dma_fail:
@@ -1375,7 +1375,7 @@ static inline void i40e_release_rx_desc(struct i40e_ring 
*rx_ring, u32 val)
 * such as IA-64).
 */
wmb();
-   writel(val, rx_ring->tail);
+   writel_relaxed(val, rx_ring->tail);
 }
 
 /**
@@ -2258,7 +2258,7 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, 
int budget)
 */
wmb();
 
-   writel(xdp_ring->next_to_use, xdp_ring->tail);
+   writel_relaxed(xdp_ring->next_to_use, xdp_ring->tail);
}
 
rx_ring->skb = skb;
@@ -3286,7 +3286,7 @@ static inline int i40e_tx_map(struct i40e_ring *tx_ring, 
struct sk_buff *skb,
 
/* notify HW of packet */
if (netif_xmit_stopped(txring_txq(tx_ring)) || !skb->xmit_more) {
-   writel(i, tx_ring->tail);
+   writel_relaxed(i, tx_ring->tail);
 
/* we need this if more than one processor can write to our tail
 * at a time, it synchronizes IO on IA64/Altix systems
diff --git a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c 
b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
index 357d605..56eea20 100644
--- a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
@@ -667,7 +667,7 @@ static inline void i40e_release_rx_desc(struct i40e_ring 
*rx_ring, u32 val)
 * such as IA-64).
 */
wmb();
-   writel(val, rx_ring->tail);
+   writel_relaxed(val, rx_ring->tail);
 }
 
 /**
@@ -2243,7 +2243,7 @@ static inline void i40evf_tx_map(struct i40e_ring 
*tx_ring, struct sk_buff *skb,
 
/* notify HW of packet */
if (netif_xmit_stopped(txring_txq(tx_ring)) || !skb->xmit_more) {
-   writel(i, tx_ring->tail);
+   writel_relaxed(i, tx_ring->tail);
 
/* we need this if more than one processor can write to our tail
 * at a time, it synchronizes IO on IA64/Altix systems
-- 
2.7.4



[PATCH REPOST v4 2/7] ixgbe: eliminate duplicate barriers on weakly-ordered archs

2018-03-21 Thread Sinan Kaya
Code includes wmb() followed by writel() in multiple places. writel()
already has a barrier on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
Reviewed-by: Alexander Duyck <alexander.h.du...@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 0da5aa2..58ed70f 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1692,7 +1692,7 @@ void ixgbe_alloc_rx_buffers(struct ixgbe_ring *rx_ring, 
u16 cleaned_count)
 * such as IA-64).
 */
wmb();
-   writel(i, rx_ring->tail);
+   writel_relaxed(i, rx_ring->tail);
}
 }
 
@@ -2453,7 +2453,7 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector 
*q_vector,
 * know there are new descriptors to fetch.
 */
wmb();
-   writel(ring->next_to_use, ring->tail);
+   writel_relaxed(ring->next_to_use, ring->tail);
 
xdp_do_flush_map();
}
@@ -8078,7 +8078,7 @@ static int ixgbe_tx_map(struct ixgbe_ring *tx_ring,
ixgbe_maybe_stop_tx(tx_ring, DESC_NEEDED);
 
if (netif_xmit_stopped(txring_txq(tx_ring)) || !skb->xmit_more) {
-   writel(i, tx_ring->tail);
+   writel_relaxed(i, tx_ring->tail);
 
/* we need this if more than one processor can write to our tail
 * at a time, it synchronizes IO on IA64/Altix systems
@@ -10014,7 +10014,7 @@ static void ixgbe_xdp_flush(struct net_device *dev)
 * are new descriptors to fetch.
 */
wmb();
-   writel(ring->next_to_use, ring->tail);
+   writel_relaxed(ring->next_to_use, ring->tail);
 
return;
 }
-- 
2.7.4



[PATCH REPOST v4 5/7] ixgbevf: keep writel() closer to wmb()

2018-03-21 Thread Sinan Kaya
Remove ixgbevf_write_tail() in favor of moving writel() close to
wmb().

Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
Reviewed-by: Alexander Duyck <alexander.h.du...@intel.com>
---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf.h  | 5 -
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 4 ++--
 2 files changed, 2 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h 
b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
index f695242..11e893e 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
@@ -244,11 +244,6 @@ static inline u16 ixgbevf_desc_unused(struct ixgbevf_ring 
*ring)
return ((ntc > ntu) ? 0 : ring->count) + ntc - ntu - 1;
 }
 
-static inline void ixgbevf_write_tail(struct ixgbevf_ring *ring, u32 value)
-{
-   writel(value, ring->tail);
-}
-
 #define IXGBEVF_RX_DESC(R, i)  \
(&(((union ixgbe_adv_rx_desc *)((R)->desc))[i]))
 #define IXGBEVF_TX_DESC(R, i)  \
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c 
b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 9b3d43d..6bf778a 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -659,7 +659,7 @@ static void ixgbevf_alloc_rx_buffers(struct ixgbevf_ring 
*rx_ring,
 * such as IA-64).
 */
wmb();
-   ixgbevf_write_tail(rx_ring, i);
+   writel(i, rx_ring->tail);
}
 }
 
@@ -3644,7 +3644,7 @@ static void ixgbevf_tx_map(struct ixgbevf_ring *tx_ring,
tx_ring->next_to_use = i;
 
/* notify HW of packet */
-   ixgbevf_write_tail(tx_ring, i);
+   writel(i, tx_ring->tail);
 
return;
 dma_error:
-- 
2.7.4



[PATCH REPOST v4 7/7] fm10k: Eliminate duplicate barriers on weakly-ordered archs

2018-03-21 Thread Sinan Kaya
Code includes wmb() followed by writel(). writel() already has a
barrier on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing
the register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
---
 drivers/net/ethernet/intel/fm10k/fm10k_main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_main.c 
b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
index 8e12aae..eebef01 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_main.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
@@ -179,7 +179,7 @@ void fm10k_alloc_rx_buffers(struct fm10k_ring *rx_ring, u16 
cleaned_count)
wmb();
 
/* notify hardware of new descriptors */
-   writel(i, rx_ring->tail);
+   writel_relaxed(i, rx_ring->tail);
}
 }
 
@@ -1054,7 +1054,7 @@ static void fm10k_tx_map(struct fm10k_ring *tx_ring,
 
/* notify HW of packet */
if (netif_xmit_stopped(txring_txq(tx_ring)) || !skb->xmit_more) {
-   writel(i, tx_ring->tail);
+   writel_relaxed(i, tx_ring->tail);
 
/* we need this if more than one processor can write to our tail
 * at a time, it synchronizes IO on IA64/Altix systems
-- 
2.7.4



[PATCH REPOST v4 6/7] ixgbevf: eliminate duplicate barriers on weakly-ordered archs

2018-03-21 Thread Sinan Kaya
Code includes wmb() followed by writel() in multiple places. writel()
already has a barrier on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
Reviewed-by: Alexander Duyck <alexander.h.du...@intel.com>
---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c 
b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 6bf778a..774b2a6 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -659,7 +659,7 @@ static void ixgbevf_alloc_rx_buffers(struct ixgbevf_ring 
*rx_ring,
 * such as IA-64).
 */
wmb();
-   writel(i, rx_ring->tail);
+   writel_relaxed(i, rx_ring->tail);
}
 }
 
@@ -3644,7 +3644,7 @@ static void ixgbevf_tx_map(struct ixgbevf_ring *tx_ring,
tx_ring->next_to_use = i;
 
/* notify HW of packet */
-   writel(i, tx_ring->tail);
+   writel_relaxed(i, tx_ring->tail);
 
return;
 dma_error:
-- 
2.7.4



[PATCH REPOST v4 3/7] igbvf: eliminate duplicate barriers on weakly-ordered archs

2018-03-21 Thread Sinan Kaya
Code includes wmb() followed by writel(). writel() already has a barrier
on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
Reviewed-by: Alexander Duyck <alexander.h.du...@intel.com>
---
 drivers/net/ethernet/intel/igbvf/netdev.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/igbvf/netdev.c 
b/drivers/net/ethernet/intel/igbvf/netdev.c
index 4214c15..edb1c34 100644
--- a/drivers/net/ethernet/intel/igbvf/netdev.c
+++ b/drivers/net/ethernet/intel/igbvf/netdev.c
@@ -251,7 +251,7 @@ static void igbvf_alloc_rx_buffers(struct igbvf_ring 
*rx_ring,
 * such as IA-64).
*/
wmb();
-   writel(i, adapter->hw.hw_addr + rx_ring->tail);
+   writel_relaxed(i, adapter->hw.hw_addr + rx_ring->tail);
}
 }
 
@@ -2297,7 +2297,7 @@ static inline void igbvf_tx_queue_adv(struct 
igbvf_adapter *adapter,
 
tx_ring->buffer_info[first].next_to_watch = tx_desc;
tx_ring->next_to_use = i;
-   writel(i, adapter->hw.hw_addr + tx_ring->tail);
+   writel_relaxed(i, adapter->hw.hw_addr + tx_ring->tail);
/* we need this if more than one processor can write to our tail
 * at a time, it synchronizes IO on IA64/Altix systems
 */
-- 
2.7.4



[PATCH REPOST v4 0/7] netdev: intel: Eliminate duplicate barriers on weakly-ordered archs

2018-03-21 Thread Sinan Kaya
Code includes wmb() followed by writel() in multiple places. writel()
already has a barrier on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

I did a regex search for wmb() followed by writel() in each drivers
directory.
I scrubbed the ones I care about in this series.

I considered "ease of change", "popular usage" and "performance critical
path" as the determining criteria for my filtering.

We used relaxed API heavily on ARM for a long time but
it did not exist on other architectures. For this reason, relaxed
architectures have been paying double penalty in order to use the common
drivers.

Now that relaxed API is present on all architectures, we can go and scrub
all drivers to see what needs to change and what can remain.

We start with mostly used ones and hope to increase the coverage over time.
It will take a while to cover all drivers.

Feel free to apply patches individually.

repost:
- split into intel specific patches

Changes since v3:
- https://www.spinics.net/lists/arm-kernel/msg641851.html
- group patches together into subsystems net:...
- collect reviewed and tested bys
- scrub barrier()

Sinan Kaya (7):
  i40e/i40evf: Eliminate duplicate barriers on weakly-ordered archs
  ixgbe: eliminate duplicate barriers on weakly-ordered archs
  igbvf: eliminate duplicate barriers on weakly-ordered archs
  igb: eliminate duplicate barriers on weakly-ordered archs
  ixgbevf: keep writel() closer to wmb()
  ixgbevf: eliminate duplicate barriers on weakly-ordered archs
  fm10k: Eliminate duplicate barriers on weakly-ordered archs

 drivers/net/ethernet/intel/fm10k/fm10k_main.c | 4 ++--
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 8 
 drivers/net/ethernet/intel/i40evf/i40e_txrx.c | 4 ++--
 drivers/net/ethernet/intel/igb/igb_main.c | 4 ++--
 drivers/net/ethernet/intel/igbvf/netdev.c | 4 ++--
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 8 
 drivers/net/ethernet/intel/ixgbevf/ixgbevf.h  | 5 -
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 4 ++--
 8 files changed, 18 insertions(+), 23 deletions(-)

-- 
2.7.4



[PATCH v4 05/17] ixgbevf: keep writel() closer to wmb()

2018-03-19 Thread Sinan Kaya
Remove ixgbevf_write_tail() in favor of moving writel() close to
wmb().

Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
Reviewed-by: Alexander Duyck <alexander.h.du...@intel.com>
---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf.h  | 5 -
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 4 ++--
 2 files changed, 2 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h 
b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
index f695242..11e893e 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
@@ -244,11 +244,6 @@ static inline u16 ixgbevf_desc_unused(struct ixgbevf_ring 
*ring)
return ((ntc > ntu) ? 0 : ring->count) + ntc - ntu - 1;
 }
 
-static inline void ixgbevf_write_tail(struct ixgbevf_ring *ring, u32 value)
-{
-   writel(value, ring->tail);
-}
-
 #define IXGBEVF_RX_DESC(R, i)  \
(&(((union ixgbe_adv_rx_desc *)((R)->desc))[i]))
 #define IXGBEVF_TX_DESC(R, i)  \
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c 
b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 9b3d43d..6bf778a 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -659,7 +659,7 @@ static void ixgbevf_alloc_rx_buffers(struct ixgbevf_ring 
*rx_ring,
 * such as IA-64).
 */
wmb();
-   ixgbevf_write_tail(rx_ring, i);
+   writel(i, rx_ring->tail);
}
 }
 
@@ -3644,7 +3644,7 @@ static void ixgbevf_tx_map(struct ixgbevf_ring *tx_ring,
tx_ring->next_to_use = i;
 
/* notify HW of packet */
-   ixgbevf_write_tail(tx_ring, i);
+   writel(i, tx_ring->tail);
 
return;
 dma_error:
-- 
2.7.4



[PATCH v4 08/17] drivers: net: cxgb: Eliminate duplicate barriers on weakly-ordered archs

2018-03-19 Thread Sinan Kaya
Code includes wmb() followed by writel(). writel() already has a barrier on
some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
---
 drivers/net/ethernet/chelsio/cxgb/sge.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb/sge.c 
b/drivers/net/ethernet/chelsio/cxgb/sge.c
index 30de26e..57891bd6 100644
--- a/drivers/net/ethernet/chelsio/cxgb/sge.c
+++ b/drivers/net/ethernet/chelsio/cxgb/sge.c
@@ -495,7 +495,7 @@ static struct sk_buff *sched_skb(struct sge *sge, struct 
sk_buff *skb,
 static inline void doorbell_pio(struct adapter *adapter, u32 val)
 {
wmb();
-   writel(val, adapter->regs + A_SG_DOORBELL);
+   writel_relaxed(val, adapter->regs + A_SG_DOORBELL);
 }
 
 /*
-- 
2.7.4



[PATCH v4 09/17] net: qla3xxx: Eliminate duplicate barriers on weakly-ordered archs

2018-03-19 Thread Sinan Kaya
Code includes wmb() followed by writel(). writel() already has a
barrier on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing
the register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
---
 drivers/net/ethernet/qlogic/qla3xxx.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qla3xxx.c 
b/drivers/net/ethernet/qlogic/qla3xxx.c
index 9e5264d..0e71b74 100644
--- a/drivers/net/ethernet/qlogic/qla3xxx.c
+++ b/drivers/net/ethernet/qlogic/qla3xxx.c
@@ -1858,8 +1858,8 @@ static void ql_update_small_bufq_prod_index(struct 
ql3_adapter *qdev)
qdev->small_buf_release_cnt -= 8;
}
wmb();
-   writel(qdev->small_buf_q_producer_index,
-   _regs->CommonRegs.rxSmallQProducerIndex);
+   writel_relaxed(qdev->small_buf_q_producer_index,
+  _regs->CommonRegs.rxSmallQProducerIndex);
}
 }
 
-- 
2.7.4



[PATCH v4 03/17] igbvf: eliminate duplicate barriers on weakly-ordered archs

2018-03-19 Thread Sinan Kaya
Code includes wmb() followed by writel(). writel() already has a barrier
on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
Reviewed-by: Alexander Duyck <alexander.h.du...@intel.com>
---
 drivers/net/ethernet/intel/igbvf/netdev.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/igbvf/netdev.c 
b/drivers/net/ethernet/intel/igbvf/netdev.c
index 4214c15..edb1c34 100644
--- a/drivers/net/ethernet/intel/igbvf/netdev.c
+++ b/drivers/net/ethernet/intel/igbvf/netdev.c
@@ -251,7 +251,7 @@ static void igbvf_alloc_rx_buffers(struct igbvf_ring 
*rx_ring,
 * such as IA-64).
*/
wmb();
-   writel(i, adapter->hw.hw_addr + rx_ring->tail);
+   writel_relaxed(i, adapter->hw.hw_addr + rx_ring->tail);
}
 }
 
@@ -2297,7 +2297,7 @@ static inline void igbvf_tx_queue_adv(struct 
igbvf_adapter *adapter,
 
tx_ring->buffer_info[first].next_to_watch = tx_desc;
tx_ring->next_to_use = i;
-   writel(i, adapter->hw.hw_addr + tx_ring->tail);
+   writel_relaxed(i, adapter->hw.hw_addr + tx_ring->tail);
/* we need this if more than one processor can write to our tail
 * at a time, it synchronizes IO on IA64/Altix systems
 */
-- 
2.7.4



[PATCH v4 06/17] ixgbevf: eliminate duplicate barriers on weakly-ordered archs

2018-03-19 Thread Sinan Kaya
Code includes wmb() followed by writel() in multiple places. writel()
already has a barrier on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
Reviewed-by: Alexander Duyck <alexander.h.du...@intel.com>
---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c 
b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 6bf778a..774b2a6 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -659,7 +659,7 @@ static void ixgbevf_alloc_rx_buffers(struct ixgbevf_ring 
*rx_ring,
 * such as IA-64).
 */
wmb();
-   writel(i, rx_ring->tail);
+   writel_relaxed(i, rx_ring->tail);
}
 }
 
@@ -3644,7 +3644,7 @@ static void ixgbevf_tx_map(struct ixgbevf_ring *tx_ring,
tx_ring->next_to_use = i;
 
/* notify HW of packet */
-   writel(i, tx_ring->tail);
+   writel_relaxed(i, tx_ring->tail);
 
return;
 dma_error:
-- 
2.7.4



[PATCH v4 02/17] ixgbe: eliminate duplicate barriers on weakly-ordered archs

2018-03-19 Thread Sinan Kaya
Code includes wmb() followed by writel() in multiple places. writel()
already has a barrier on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
Reviewed-by: Alexander Duyck <alexander.h.du...@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 0da5aa2..58ed70f 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1692,7 +1692,7 @@ void ixgbe_alloc_rx_buffers(struct ixgbe_ring *rx_ring, 
u16 cleaned_count)
 * such as IA-64).
 */
wmb();
-   writel(i, rx_ring->tail);
+   writel_relaxed(i, rx_ring->tail);
}
 }
 
@@ -2453,7 +2453,7 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector 
*q_vector,
 * know there are new descriptors to fetch.
 */
wmb();
-   writel(ring->next_to_use, ring->tail);
+   writel_relaxed(ring->next_to_use, ring->tail);
 
xdp_do_flush_map();
}
@@ -8078,7 +8078,7 @@ static int ixgbe_tx_map(struct ixgbe_ring *tx_ring,
ixgbe_maybe_stop_tx(tx_ring, DESC_NEEDED);
 
if (netif_xmit_stopped(txring_txq(tx_ring)) || !skb->xmit_more) {
-   writel(i, tx_ring->tail);
+   writel_relaxed(i, tx_ring->tail);
 
/* we need this if more than one processor can write to our tail
 * at a time, it synchronizes IO on IA64/Altix systems
@@ -10014,7 +10014,7 @@ static void ixgbe_xdp_flush(struct net_device *dev)
 * are new descriptors to fetch.
 */
wmb();
-   writel(ring->next_to_use, ring->tail);
+   writel_relaxed(ring->next_to_use, ring->tail);
 
return;
 }
-- 
2.7.4



[PATCH v4 14/17] net: qlge: Eliminate duplicate barriers on weakly-ordered archs

2018-03-19 Thread Sinan Kaya
Code includes wmb() followed by writel(). writel() already has a barrier on
some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Create a new wrapper function with relaxed write operator. Use the new
wrapper when a write is following a wmb().

Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
---
 drivers/net/ethernet/qlogic/qlge/qlge.h  | 18 ++
 drivers/net/ethernet/qlogic/qlge/qlge_main.c |  2 +-
 2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/qlogic/qlge/qlge.h 
b/drivers/net/ethernet/qlogic/qlge/qlge.h
index 84ac50f..1465986 100644
--- a/drivers/net/ethernet/qlogic/qlge/qlge.h
+++ b/drivers/net/ethernet/qlogic/qlge/qlge.h
@@ -2185,6 +2185,24 @@ static inline void ql_write_db_reg(u32 val, void __iomem 
*addr)
 }
 
 /*
+ * Doorbell Registers:
+ * Doorbell registers are virtual registers in the PCI memory space.
+ * The space is allocated by the chip during PCI initialization.  The
+ * device driver finds the doorbell address in BAR 3 in PCI config space.
+ * The registers are used to control outbound and inbound queues. For
+ * example, the producer index for an outbound queue.  Each queue uses
+ * 1 4k chunk of memory.  The lower half of the space is for outbound
+ * queues. The upper half is for inbound queues.
+ * Caller has to guarantee ordering.
+ */
+static inline void ql_write_db_reg_relaxed(u32 val, void __iomem *addr)
+{
+   writel_relaxed(val, addr);
+   mmiowb();
+}
+
+
+/*
  * Shadow Registers:
  * Outbound queues have a consumer index that is maintained by the chip.
  * Inbound queues have a producer index that is maintained by the chip.
diff --git a/drivers/net/ethernet/qlogic/qlge/qlge_main.c 
b/drivers/net/ethernet/qlogic/qlge/qlge_main.c
index 50038d9..c222b7c 100644
--- a/drivers/net/ethernet/qlogic/qlge/qlge_main.c
+++ b/drivers/net/ethernet/qlogic/qlge/qlge_main.c
@@ -2700,7 +2700,7 @@ static netdev_tx_t qlge_send(struct sk_buff *skb, struct 
net_device *ndev)
tx_ring->prod_idx = 0;
wmb();
 
-   ql_write_db_reg(tx_ring->prod_idx, tx_ring->prod_idx_db_reg);
+   ql_write_db_reg_relaxed(tx_ring->prod_idx, tx_ring->prod_idx_db_reg);
netif_printk(qdev, tx_queued, KERN_DEBUG, qdev->ndev,
 "tx queued, slot %d, len %d\n",
 tx_ring->prod_idx, skb->len);
-- 
2.7.4



[PATCH v4 11/17] bnx2x: Eliminate duplicate barriers on weakly-ordered archs

2018-03-19 Thread Sinan Kaya
Code includes wmb() followed by writel(). writel() already has a
barrier on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing
the register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x.h   |  9 -
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h   |  4 ++--
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c  | 21 +++--
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c |  2 +-
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c  |  2 +-
 5 files changed, 23 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
index 352beff..ac38db9 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
@@ -166,6 +166,12 @@ do {   \
 #define REG_RD8(bp, offset)readb(REG_ADDR(bp, offset))
 #define REG_RD16(bp, offset)   readw(REG_ADDR(bp, offset))
 
+#define REG_WR_RELAXED(bp, offset, val)writel_relaxed((u32)val,\
+  REG_ADDR(bp, offset))
+
+#define REG_WR16_RELAXED(bp, offset, val) \
+   writew_relaxed((u16)val, REG_ADDR(bp, offset))
+
 #define REG_WR(bp, offset, val)writel((u32)val, REG_ADDR(bp, 
offset))
 #define REG_WR8(bp, offset, val)   writeb((u8)val, REG_ADDR(bp, offset))
 #define REG_WR16(bp, offset, val)  writew((u16)val, REG_ADDR(bp, offset))
@@ -760,7 +766,8 @@ struct bnx2x_fastpath {
 #endif
 #define DOORBELL(bp, cid, val) \
do { \
-   writel((u32)(val), bp->doorbells + (bp->db_size * (cid))); \
+   writel_relaxed((u32)(val),\
+   bp->doorbells + (bp->db_size * (cid))); \
} while (0)
 
 /* TX CSUM helpers */
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h
index a5265e1..a8ce5c5 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h
@@ -522,8 +522,8 @@ static inline void bnx2x_update_rx_prod(struct bnx2x *bp,
wmb();
 
for (i = 0; i < sizeof(rx_prods)/4; i++)
-   REG_WR(bp, fp->ustorm_rx_prods_offset + i*4,
-  ((u32 *)_prods)[i]);
+   REG_WR_RELAXED(bp, fp->ustorm_rx_prods_offset + i * 4,
+  ((u32 *)_prods)[i]);
 
mmiowb(); /* keep prod updates ordered */
 
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
index 74fc9af..2dea1b6 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
@@ -1608,8 +1608,8 @@ static void bnx2x_hc_int_enable(struct bnx2x *bp)
} else
val = 0x;
 
-   REG_WR(bp, HC_REG_TRAILING_EDGE_0 + port*8, val);
-   REG_WR(bp, HC_REG_LEADING_EDGE_0 + port*8, val);
+   REG_WR_RELAXED(bp, HC_REG_TRAILING_EDGE_0 + port * 8, val);
+   REG_WR_RELAXED(bp, HC_REG_LEADING_EDGE_0 + port * 8, val);
}
 
/* Make sure that interrupts are indeed enabled from here on */
@@ -1672,8 +1672,8 @@ static void bnx2x_igu_int_enable(struct bnx2x *bp)
} else
val = 0x;
 
-   REG_WR(bp, IGU_REG_TRAILING_EDGE_LATCH, val);
-   REG_WR(bp, IGU_REG_LEADING_EDGE_LATCH, val);
+   REG_WR_RELAXED(bp, IGU_REG_TRAILING_EDGE_LATCH, val);
+   REG_WR_RELAXED(bp, IGU_REG_LEADING_EDGE_LATCH, val);
 
/* Make sure that interrupts are indeed enabled from here on */
mmiowb();
@@ -3817,8 +3817,8 @@ static void bnx2x_sp_prod_update(struct bnx2x *bp)
 */
mb();
 
-   REG_WR16(bp, BAR_XSTRORM_INTMEM + XSTORM_SPQ_PROD_OFFSET(func),
-bp->spq_prod_idx);
+   REG_WR16_RELAXED(bp, BAR_XSTRORM_INTMEM + XSTORM_SPQ_PROD_OFFSET(func),
+bp->spq_prod_idx);
mmiowb();
 }
 
@@ -7761,7 +7761,7 @@ void bnx2x_igu_clear_sb_gen(struct bnx2x *bp, u8 func, u8 
idu_sb_id, bool is_pf)
barrier();
DP(NETIF_MSG_HW, "write 0x%08x to IGU(via GRC) addr 0x%x\n",
  ctl, igu_addr_ctl);
-   REG_WR(bp, igu_addr_ctl, ctl);
+   REG_WR_RELAXED(bp, igu_addr_ctl, ctl);
mmiowb();
barrier();
 
@@ -9720,13 +9720,14 @@ static void bnx2x_process_kill_chip_reset(struct bnx2x 
*bp, bool global)
barrier();
mmiowb();
 
-   REG_WR(bp, GRCBASE_MISC + MISC_REGISTERS_RESET_REG_2_SET,
-  reset_mask2 & (~stay_reset2));
+   REG_WR_RELAXED(bp, GRCBASE_MISC + MISC_REGISTERS_RESET_REG_2_SET,
+  r

[PATCH v4 15/17] bnxt_en: Eliminate duplicate barriers on weakly-ordered archs

2018-03-19 Thread Sinan Kaya
Code includes wmb() followed by writel(). writel() already has a barrier on
some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Create a new wrapper function with relaxed write operator. Use the new
wrapper when a write is following a wmb().

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c |  2 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt.h | 11 ++-
 drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c |  2 +-
 3 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 1500243..befb538 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -1922,7 +1922,7 @@ static int bnxt_poll_work(struct bnxt *bp, struct 
bnxt_napi *bnapi, int budget)
/* Sync BD data before updating doorbell */
wmb();
 
-   bnxt_db_write(bp, db, DB_KEY_TX | prod);
+   bnxt_db_write_relaxed(bp, db, DB_KEY_TX | prod);
}
 
cpr->cp_raw_cons = raw_cons;
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
index 1989c47..4c0d048 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
@@ -1402,11 +1402,20 @@ static inline u32 bnxt_tx_avail(struct bnxt *bp, struct 
bnxt_tx_ring_info *txr)
 }
 
 /* For TX and RX ring doorbells */
+static inline void bnxt_db_write_relaxed(struct bnxt *bp, void __iomem *db,
+u32 val)
+{
+   writel_relaxed(val, db);
+   if (bp->flags & BNXT_FLAG_DOUBLE_DB)
+   writel_relaxed(val, db);
+}
+
+/* For TX and RX ring doorbells */
 static inline void bnxt_db_write(struct bnxt *bp, void __iomem *db, u32 val)
 {
writel(val, db);
if (bp->flags & BNXT_FLAG_DOUBLE_DB)
-   writel(val, db);
+   writel_relaxed(val, db);
 }
 
 extern const u16 bnxt_lhint_arr[];
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
index 1801582..a1b1060 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
@@ -2403,7 +2403,7 @@ static int bnxt_run_loopback(struct bnxt *bp)
/* Sync BD data before updating doorbell */
wmb();
 
-   bnxt_db_write(bp, txr->tx_doorbell, DB_KEY_TX | txr->tx_prod);
+   bnxt_db_write_relaxed(bp, txr->tx_doorbell, DB_KEY_TX | txr->tx_prod);
rc = bnxt_poll_loopback(bp, pkt_size);
 
dma_unmap_single(>pdev->dev, map, pkt_size, PCI_DMA_TODEVICE);
-- 
2.7.4



[PATCH v4 10/17] qlcnic: Eliminate duplicate barriers on weakly-ordered archs

2018-03-19 Thread Sinan Kaya
Code includes wmb() followed by writel(). writel() already has a
barrier on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing
the register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
Acked-by: Manish Chopra <manish.cho...@cavium.com>
---
 drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c 
b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c
index 46b0372..97c146e7 100644
--- a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c
+++ b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c
@@ -478,7 +478,7 @@ irqreturn_t qlcnic_83xx_clear_legacy_intr(struct 
qlcnic_adapter *adapter)
wmb();
 
/* clear the interrupt trigger control register */
-   writel(0, adapter->isr_int_vec);
+   writel_relaxed(0, adapter->isr_int_vec);
intr_val = readl(adapter->isr_int_vec);
do {
intr_val = readl(adapter->tgt_status_reg);
-- 
2.7.4



[PATCH v4 12/17] net: cxgb4/cxgb4vf: Eliminate duplicate barriers on weakly-ordered archs

2018-03-19 Thread Sinan Kaya
Code includes wmb() followed by writel(). writel() already has a barrier on
some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Create a new wrapper function with relaxed write operator. Use the new
wrapper when a write is following a wmb().

Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h  |  6 ++
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c | 13 +++--
 drivers/net/ethernet/chelsio/cxgb4/sge.c| 12 ++--
 drivers/net/ethernet/chelsio/cxgb4/t4_hw.c  |  2 +-
 drivers/net/ethernet/chelsio/cxgb4vf/adapter.h  | 14 ++
 drivers/net/ethernet/chelsio/cxgb4vf/sge.c  | 18 ++
 6 files changed, 44 insertions(+), 21 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h 
b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
index 9040e13..6bde0b9 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
@@ -1202,6 +1202,12 @@ static inline void t4_write_reg(struct adapter *adap, 
u32 reg_addr, u32 val)
writel(val, adap->regs + reg_addr);
 }
 
+static inline void t4_write_reg_relaxed(struct adapter *adap, u32 reg_addr,
+   u32 val)
+{
+   writel_relaxed(val, adap->regs + reg_addr);
+}
+
 #ifndef readq
 static inline u64 readq(const volatile void __iomem *addr)
 {
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c 
b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index 7b452e8..276472d 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -1723,8 +1723,8 @@ int cxgb4_sync_txq_pidx(struct net_device *dev, u16 qid, 
u16 pidx,
else
val = PIDX_T5_V(delta);
wmb();
-   t4_write_reg(adap, MYPF_REG(SGE_PF_KDOORBELL_A),
-QID_V(qid) | val);
+   t4_write_reg_relaxed(adap, MYPF_REG(SGE_PF_KDOORBELL_A),
+QID_V(qid) | val);
}
 out:
return ret;
@@ -1902,8 +1902,9 @@ static void enable_txq_db(struct adapter *adap, struct 
sge_txq *q)
 * are committed before we tell HW about them.
 */
wmb();
-   t4_write_reg(adap, MYPF_REG(SGE_PF_KDOORBELL_A),
-QID_V(q->cntxt_id) | PIDX_V(q->db_pidx_inc));
+   t4_write_reg_relaxed(adap, MYPF_REG(SGE_PF_KDOORBELL_A),
+QID_V(q->cntxt_id) |
+   PIDX_V(q->db_pidx_inc));
q->db_pidx_inc = 0;
}
q->db_disabled = 0;
@@ -2003,8 +2004,8 @@ static void sync_txq_pidx(struct adapter *adap, struct 
sge_txq *q)
else
val = PIDX_T5_V(delta);
wmb();
-   t4_write_reg(adap, MYPF_REG(SGE_PF_KDOORBELL_A),
-QID_V(q->cntxt_id) | val);
+   t4_write_reg_relaxed(adap, MYPF_REG(SGE_PF_KDOORBELL_A),
+QID_V(q->cntxt_id) | val);
}
 out:
q->db_disabled = 0;
diff --git a/drivers/net/ethernet/chelsio/cxgb4/sge.c 
b/drivers/net/ethernet/chelsio/cxgb4/sge.c
index 6e310a0..7388aac 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/sge.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/sge.c
@@ -530,11 +530,11 @@ static inline void ring_fl_db(struct adapter *adap, 
struct sge_fl *q)
 * mechanism.
 */
if (unlikely(q->bar2_addr == NULL)) {
-   t4_write_reg(adap, MYPF_REG(SGE_PF_KDOORBELL_A),
-val | QID_V(q->cntxt_id));
+   t4_write_reg_relaxed(adap, MYPF_REG(SGE_PF_KDOORBELL_A),
+val | QID_V(q->cntxt_id));
} else {
-   writel(val | QID_V(q->bar2_qid),
-  q->bar2_addr + SGE_UDB_KDOORBELL);
+   writel_relaxed(val | QID_V(q->bar2_qid),
+  q->bar2_addr + SGE_UDB_KDOORBELL);
 
/* This Write memory Barrier will force the write to
 * the User Doorbell area to be flushed.
@@ -986,8 +986,8 @@ inline void cxgb4_ring_tx_db(struct adapter *adap, struct 
sge_txq *q, int n)
  (q->bar2_addr + SGE_UDB_WCDOORBELL),
  wr);
} else {
-   writel(val | QID_V(q->bar2_qid),
-  q->bar2_addr + SGE_UDB_KDOORBELL);
+   writel_relaxed(val | QID_V(q->bar2_qid),
+  q->bar2_addr + SGE_UDB_KDOORBELL);
  

[PATCH v4 16/17] qed/qede: Eliminate duplicate barriers on weakly-ordered archs

2018-03-19 Thread Sinan Kaya
Code includes wmb() followed by writel(). writel() already has a barrier on
some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Create a new wrapper function with relaxed write operator. Use the new
wrapper when a write is following a wmb().

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
---
 drivers/net/ethernet/qlogic/qed/qed.h   |  5 -
 drivers/net/ethernet/qlogic/qed/qed_hw.c| 12 
 drivers/net/ethernet/qlogic/qed/qed_hw.h| 14 ++
 drivers/net/ethernet/qlogic/qed/qed_int.c   |  2 +-
 drivers/net/ethernet/qlogic/qed/qed_l2.c|  2 +-
 drivers/net/ethernet/qlogic/qed/qed_ll2.c   |  2 +-
 drivers/net/ethernet/qlogic/qed/qed_vf.c|  7 ---
 drivers/net/ethernet/qlogic/qede/qede_ethtool.c |  2 +-
 drivers/net/ethernet/qlogic/qede/qede_fp.c  |  4 ++--
 drivers/net/ethernet/qlogic/qlge/qlge.h |  1 -
 include/linux/qed/qed_if.h  | 17 +
 11 files changed, 53 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qed/qed.h 
b/drivers/net/ethernet/qlogic/qed/qed.h
index 6948855..241077f 100644
--- a/drivers/net/ethernet/qlogic/qed/qed.h
+++ b/drivers/net/ethernet/qlogic/qed/qed.h
@@ -818,12 +818,15 @@ u16 qed_get_cm_pq_idx_vf(struct qed_hwfn *p_hwfn, u16 vf);
(cdev->regview) + \
 (offset))
 
+#define REG_WR_RELAXED(cdev, offset, val)  \
+   writel_relaxed((u32)val, REG_ADDR(cdev, offset))
+
 #define REG_RD(cdev, offset)readl(REG_ADDR(cdev, offset))
 #define REG_WR(cdev, offset, val)   writel((u32)val, REG_ADDR(cdev, 
offset))
 #define REG_WR16(cdev, offset, val) writew((u16)val, REG_ADDR(cdev, 
offset))
 
 #define DOORBELL(cdev, db_addr, val)\
-   writel((u32)val, (void __iomem *)((u8 __iomem *)\
+   writel_relaxed((u32)val, (void __iomem *)((u8 __iomem *)\
  (cdev->doorbells) + (db_addr)))
 
 /* Prototypes */
diff --git a/drivers/net/ethernet/qlogic/qed/qed_hw.c 
b/drivers/net/ethernet/qlogic/qed/qed_hw.c
index fca2dbd..1d76121 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_hw.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_hw.c
@@ -222,6 +222,18 @@ struct qed_ptt *qed_get_reserved_ptt(struct qed_hwfn 
*p_hwfn,
return _hwfn->p_ptt_pool->ptts[ptt_idx];
 }
 
+void qed_wr_relaxed(struct qed_hwfn *p_hwfn,
+   struct qed_ptt *p_ptt,
+   u32 hw_addr, u32 val)
+{
+   u32 bar_addr = qed_set_ptt(p_hwfn, p_ptt, hw_addr);
+
+   REG_WR_RELAXED(p_hwfn, bar_addr, val);
+   DP_VERBOSE(p_hwfn, NETIF_MSG_HW,
+  "bar_addr 0x%x, hw_addr 0x%x, val 0x%x\n",
+  bar_addr, hw_addr, val);
+}
+
 void qed_wr(struct qed_hwfn *p_hwfn,
struct qed_ptt *p_ptt,
u32 hw_addr, u32 val)
diff --git a/drivers/net/ethernet/qlogic/qed/qed_hw.h 
b/drivers/net/ethernet/qlogic/qed/qed_hw.h
index 8db2839..bb4f5ff 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_hw.h
+++ b/drivers/net/ethernet/qlogic/qed/qed_hw.h
@@ -152,6 +152,20 @@ struct qed_ptt *qed_get_reserved_ptt(struct qed_hwfn 
*p_hwfn,
 enum reserved_ptts ptt_idx);
 
 /**
+ * @brief qed_wr_relaxed - Write value to BAR using the given ptt
+ *No ordering guarantee.
+ *
+ * @param p_hwfn
+ * @param p_ptt
+ * @param val
+ * @param hw_addr
+ */
+void qed_wr_relaxed(struct qed_hwfn *p_hwfn,
+   struct qed_ptt *p_ptt,
+   u32 hw_addr,
+   u32 val);
+
+/**
  * @brief qed_wr - Write value to BAR using the given ptt
  *
  * @param p_hwfn
diff --git a/drivers/net/ethernet/qlogic/qed/qed_int.c 
b/drivers/net/ethernet/qlogic/qed/qed_int.c
index d3eabcf..5f09253 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_int.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_int.c
@@ -1747,7 +1747,7 @@ static void qed_int_igu_cleanup_sb(struct qed_hwfn 
*p_hwfn,
 
barrier();
 
-   qed_wr(p_hwfn, p_ptt, IGU_REG_COMMAND_REG_CTRL, cmd_ctrl);
+   qed_wr_relaxed(p_hwfn, p_ptt, IGU_REG_COMMAND_REG_CTRL, cmd_ctrl);
 
/* Flush the write to IGU */
mmiowb();
diff --git a/drivers/net/ethernet/qlogic/qed/qed_l2.c 
b/drivers/net/ethernet/qlogic/qed/qed_l2.c
index 893ef08..7f3f923b 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_l2.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_l2.c
@@ -921,7 +921,7 @@ qed_eth_pf_rx_queue_start(struct qed_hwfn *p_hwfn,
 
/* Init the rcq, rx bd and rx sge (if valid) producers to 0 */
__internal_ram_wr(p_hwfn, *pp_prod, sizeof(u32),
- (u32 *)(_prod_val));
+ (u32 *)(_prod_va

[PATCH v4 17/17] net: ena: Eliminate duplicate barriers on weakly-ordered archs

2018-03-19 Thread Sinan Kaya
Code includes barrier() followed by writel(). writel() already has a
barrier
on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Create a new wrapper function with relaxed write operator. Use the new
wrapper when a write is following a barrier().

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
---
 drivers/net/ethernet/amazon/ena/ena_com.c |  6 --
 drivers/net/ethernet/amazon/ena/ena_eth_com.h | 22 --
 drivers/net/ethernet/amazon/ena/ena_netdev.c  |  4 ++--
 3 files changed, 26 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/amazon/ena/ena_com.c 
b/drivers/net/ethernet/amazon/ena/ena_com.c
index bf2de52..b6e628f 100644
--- a/drivers/net/ethernet/amazon/ena/ena_com.c
+++ b/drivers/net/ethernet/amazon/ena/ena_com.c
@@ -631,7 +631,8 @@ static u32 ena_com_reg_bar_read32(struct ena_com_dev 
*ena_dev, u16 offset)
 */
wmb();
 
-   writel(mmio_read_reg, ena_dev->reg_bar + ENA_REGS_MMIO_REG_READ_OFF);
+   writel_relaxed(mmio_read_reg,
+  ena_dev->reg_bar + ENA_REGS_MMIO_REG_READ_OFF);
 
for (i = 0; i < timeout; i++) {
if (read_resp->req_id == mmio_read->seq_num)
@@ -1826,7 +1827,8 @@ void ena_com_aenq_intr_handler(struct ena_com_dev *dev, 
void *data)
 
/* write the aenq doorbell after all AENQ descriptors were read */
mb();
-   writel((u32)aenq->head, dev->reg_bar + ENA_REGS_AENQ_HEAD_DB_OFF);
+   writel_relaxed((u32)aenq->head,
+  dev->reg_bar + ENA_REGS_AENQ_HEAD_DB_OFF);
 }
 
 int ena_com_dev_reset(struct ena_com_dev *ena_dev,
diff --git a/drivers/net/ethernet/amazon/ena/ena_eth_com.h 
b/drivers/net/ethernet/amazon/ena/ena_eth_com.h
index 2f76572..09ef7cd 100644
--- a/drivers/net/ethernet/amazon/ena/ena_eth_com.h
+++ b/drivers/net/ethernet/amazon/ena/ena_eth_com.h
@@ -107,7 +107,8 @@ static inline int ena_com_sq_empty_space(struct 
ena_com_io_sq *io_sq)
return io_sq->q_depth - 1 - cnt;
 }
 
-static inline int ena_com_write_sq_doorbell(struct ena_com_io_sq *io_sq)
+static inline int ena_com_write_sq_doorbell(struct ena_com_io_sq *io_sq,
+   bool relaxed)
 {
u16 tail;
 
@@ -116,7 +117,24 @@ static inline int ena_com_write_sq_doorbell(struct 
ena_com_io_sq *io_sq)
pr_debug("write submission queue doorbell for queue: %d tail: %d\n",
 io_sq->qid, tail);
 
-   writel(tail, io_sq->db_addr);
+   if (relaxed)
+   writel_relaxed(tail, io_sq->db_addr);
+   else
+   writel(tail, io_sq->db_addr);
+
+   return 0;
+}
+
+static inline int ena_com_write_sq_doorbell_rel(struct ena_com_io_sq *io_sq)
+{
+   u16 tail;
+
+   tail = io_sq->tail;
+
+   pr_debug("write submission queue doorbell for queue: %d tail: %d\n",
+io_sq->qid, tail);
+
+   writel_relaxed(tail, io_sq->db_addr);
 
return 0;
 }
diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c 
b/drivers/net/ethernet/amazon/ena/ena_netdev.c
index 6975150..0530201 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
@@ -556,7 +556,7 @@ static int ena_refill_rx_bufs(struct ena_ring *rx_ring, u32 
num)
 * issue a doorbell
 */
wmb();
-   ena_com_write_sq_doorbell(rx_ring->ena_com_io_sq);
+   ena_com_write_sq_doorbell(rx_ring->ena_com_io_sq, true);
}
 
rx_ring->next_to_use = next_to_use;
@@ -2151,7 +2151,7 @@ static netdev_tx_t ena_start_xmit(struct sk_buff *skb, 
struct net_device *dev)
 
if (netif_xmit_stopped(txq) || !skb->xmit_more) {
/* trigger the dma engine */
-   ena_com_write_sq_doorbell(tx_ring->ena_com_io_sq);
+   ena_com_write_sq_doorbell(tx_ring->ena_com_io_sq, false);
u64_stats_update_begin(_ring->syncp);
tx_ring->tx_stats.doorbells++;
u64_stats_update_end(_ring->syncp);
-- 
2.7.4



[PATCH v4 04/17] igb: eliminate duplicate barriers on weakly-ordered archs

2018-03-19 Thread Sinan Kaya
Code includes wmb() followed by writel(). writel() already has a barrier
on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
Reviewed-by: Alexander Duyck <alexander.h.du...@intel.com>
---
 drivers/net/ethernet/intel/igb/igb_main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c 
b/drivers/net/ethernet/intel/igb/igb_main.c
index b88fae7..82aea92 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -5671,7 +5671,7 @@ static int igb_tx_map(struct igb_ring *tx_ring,
igb_maybe_stop_tx(tx_ring, DESC_NEEDED);
 
if (netif_xmit_stopped(txring_txq(tx_ring)) || !skb->xmit_more) {
-   writel(i, tx_ring->tail);
+   writel_relaxed(i, tx_ring->tail);
 
/* we need this if more than one processor can write to our tail
 * at a time, it synchronizes IO on IA64/Altix systems
@@ -8072,7 +8072,7 @@ void igb_alloc_rx_buffers(struct igb_ring *rx_ring, u16 
cleaned_count)
 * such as IA-64).
 */
wmb();
-   writel(i, rx_ring->tail);
+   writel_relaxed(i, rx_ring->tail);
}
 }
 
-- 
2.7.4



[PATCH v4 13/17] net: cxgb3: Eliminate duplicate barriers on weakly-ordered archs

2018-03-19 Thread Sinan Kaya
Code includes wmb() followed by writel(). writel() already has a barrier on
some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Create a new wrapper function with relaxed write operator. Use the new
wrapper when a write is following a wmb().

Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
---
 drivers/net/ethernet/chelsio/cxgb3/adapter.h |  7 +++
 drivers/net/ethernet/chelsio/cxgb3/sge.c | 19 ++-
 2 files changed, 17 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb3/adapter.h 
b/drivers/net/ethernet/chelsio/cxgb3/adapter.h
index 087ff0f..0e21e66 100644
--- a/drivers/net/ethernet/chelsio/cxgb3/adapter.h
+++ b/drivers/net/ethernet/chelsio/cxgb3/adapter.h
@@ -281,6 +281,13 @@ static inline void t3_write_reg(struct adapter *adapter, 
u32 reg_addr, u32 val)
writel(val, adapter->regs + reg_addr);
 }
 
+static inline void t3_write_reg_relaxed(struct adapter *adapter, u32 reg_addr,
+   u32 val)
+{
+   CH_DBG(adapter, MMIO, "setting register 0x%x to 0x%x\n", reg_addr, val);
+   writel_relaxed(val, adapter->regs + reg_addr);
+}
+
 static inline struct port_info *adap2pinfo(struct adapter *adap, int idx)
 {
return netdev_priv(adap->port[idx]);
diff --git a/drivers/net/ethernet/chelsio/cxgb3/sge.c 
b/drivers/net/ethernet/chelsio/cxgb3/sge.c
index e988caa..0baab06 100644
--- a/drivers/net/ethernet/chelsio/cxgb3/sge.c
+++ b/drivers/net/ethernet/chelsio/cxgb3/sge.c
@@ -487,7 +487,8 @@ static inline void ring_fl_db(struct adapter *adap, struct 
sge_fl *q)
if (q->pend_cred >= q->credits / 4) {
q->pend_cred = 0;
wmb();
-   t3_write_reg(adap, A_SG_KDOORBELL, V_EGRCNTX(q->cntxt_id));
+   t3_write_reg_relaxed(adap, A_SG_KDOORBELL,
+V_EGRCNTX(q->cntxt_id));
}
 }
 
@@ -1058,8 +1059,8 @@ static inline void check_ring_tx_db(struct adapter *adap, 
struct sge_txq *q)
}
 #else
wmb();  /* write descriptors before telling HW */
-   t3_write_reg(adap, A_SG_KDOORBELL,
-F_SELEGRCNTX | V_EGRCNTX(q->cntxt_id));
+   t3_write_reg_relaxed(adap, A_SG_KDOORBELL,
+F_SELEGRCNTX | V_EGRCNTX(q->cntxt_id));
 #endif
 }
 
@@ -1510,8 +1511,8 @@ static int ctrl_xmit(struct adapter *adap, struct sge_txq 
*q,
}
spin_unlock(>lock);
wmb();
-   t3_write_reg(adap, A_SG_KDOORBELL,
-F_SELEGRCNTX | V_EGRCNTX(q->cntxt_id));
+   t3_write_reg_relaxed(adap, A_SG_KDOORBELL,
+F_SELEGRCNTX | V_EGRCNTX(q->cntxt_id));
return NET_XMIT_SUCCESS;
 }
 
@@ -1554,8 +1555,8 @@ static void restart_ctrlq(unsigned long data)
 
spin_unlock(>lock);
wmb();
-   t3_write_reg(qs->adap, A_SG_KDOORBELL,
-F_SELEGRCNTX | V_EGRCNTX(q->cntxt_id));
+   t3_write_reg_relaxed(qs->adap, A_SG_KDOORBELL,
+F_SELEGRCNTX | V_EGRCNTX(q->cntxt_id));
 }
 
 /*
@@ -1793,8 +1794,8 @@ again:reclaim_completed_tx(adap, q, TX_RECLAIM_CHUNK);
 #endif
wmb();
if (likely(written))
-   t3_write_reg(adap, A_SG_KDOORBELL,
-F_SELEGRCNTX | V_EGRCNTX(q->cntxt_id));
+   t3_write_reg_relaxed(adap, A_SG_KDOORBELL,
+F_SELEGRCNTX | V_EGRCNTX(q->cntxt_id));
 }
 
 /**
-- 
2.7.4



[PATCH v4 01/17] i40e/i40evf: Eliminate duplicate barriers on weakly-ordered archs

2018-03-19 Thread Sinan Kaya
Code includes wmb() followed by writel(). writel() already has a barrier
on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
Reviewed-by: Alexander Duyck <alexander.h.du...@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 8 
 drivers/net/ethernet/intel/i40evf/i40e_txrx.c | 4 ++--
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c 
b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index e554aa6cf..9455869 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -185,7 +185,7 @@ static int i40e_program_fdir_filter(struct i40e_fdir_filter 
*fdir_data,
/* Mark the data descriptor to be watched */
first->next_to_watch = tx_desc;
 
-   writel(tx_ring->next_to_use, tx_ring->tail);
+   writel_relaxed(tx_ring->next_to_use, tx_ring->tail);
return 0;
 
 dma_fail:
@@ -1375,7 +1375,7 @@ static inline void i40e_release_rx_desc(struct i40e_ring 
*rx_ring, u32 val)
 * such as IA-64).
 */
wmb();
-   writel(val, rx_ring->tail);
+   writel_relaxed(val, rx_ring->tail);
 }
 
 /**
@@ -2258,7 +2258,7 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, 
int budget)
 */
wmb();
 
-   writel(xdp_ring->next_to_use, xdp_ring->tail);
+   writel_relaxed(xdp_ring->next_to_use, xdp_ring->tail);
}
 
rx_ring->skb = skb;
@@ -3286,7 +3286,7 @@ static inline int i40e_tx_map(struct i40e_ring *tx_ring, 
struct sk_buff *skb,
 
/* notify HW of packet */
if (netif_xmit_stopped(txring_txq(tx_ring)) || !skb->xmit_more) {
-   writel(i, tx_ring->tail);
+   writel_relaxed(i, tx_ring->tail);
 
/* we need this if more than one processor can write to our tail
 * at a time, it synchronizes IO on IA64/Altix systems
diff --git a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c 
b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
index 357d605..56eea20 100644
--- a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
@@ -667,7 +667,7 @@ static inline void i40e_release_rx_desc(struct i40e_ring 
*rx_ring, u32 val)
 * such as IA-64).
 */
wmb();
-   writel(val, rx_ring->tail);
+   writel_relaxed(val, rx_ring->tail);
 }
 
 /**
@@ -2243,7 +2243,7 @@ static inline void i40evf_tx_map(struct i40e_ring 
*tx_ring, struct sk_buff *skb,
 
/* notify HW of packet */
if (netif_xmit_stopped(txring_txq(tx_ring)) || !skb->xmit_more) {
-   writel(i, tx_ring->tail);
+   writel_relaxed(i, tx_ring->tail);
 
/* we need this if more than one processor can write to our tail
 * at a time, it synchronizes IO on IA64/Altix systems
-- 
2.7.4



[PATCH v4 07/17] fm10k: Eliminate duplicate barriers on weakly-ordered archs

2018-03-19 Thread Sinan Kaya
Code includes wmb() followed by writel(). writel() already has a
barrier on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing
the register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <ok...@codeaurora.org>
---
 drivers/net/ethernet/intel/fm10k/fm10k_main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_main.c 
b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
index 8e12aae..eebef01 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_main.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
@@ -179,7 +179,7 @@ void fm10k_alloc_rx_buffers(struct fm10k_ring *rx_ring, u16 
cleaned_count)
wmb();
 
/* notify hardware of new descriptors */
-   writel(i, rx_ring->tail);
+   writel_relaxed(i, rx_ring->tail);
}
 }
 
@@ -1054,7 +1054,7 @@ static void fm10k_tx_map(struct fm10k_ring *tx_ring,
 
/* notify HW of packet */
if (netif_xmit_stopped(txring_txq(tx_ring)) || !skb->xmit_more) {
-   writel(i, tx_ring->tail);
+   writel_relaxed(i, tx_ring->tail);
 
/* we need this if more than one processor can write to our tail
 * at a time, it synchronizes IO on IA64/Altix systems
-- 
2.7.4



[PATCH v4 00/17] netdev: Eliminate duplicate barriers on weakly-ordered archs

2018-03-19 Thread Sinan Kaya
Code includes wmb() followed by writel() in multiple places. writel()
already has a barrier on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

I did a regex search for wmb() followed by writel() in each drivers
directory.
I scrubbed the ones I care about in this series.

I considered "ease of change", "popular usage" and "performance critical
path" as the determining criteria for my filtering.

We used relaxed API heavily on ARM for a long time but
it did not exist on other architectures. For this reason, relaxed
architectures have been paying double penalty in order to use the common
drivers.

Now that relaxed API is present on all architectures, we can go and scrub
all drivers to see what needs to change and what can remain.

We start with mostly used ones and hope to increase the coverage over time.
It will take a while to cover all drivers.

Feel free to apply patches individually.

Changes since v3:
- https://www.spinics.net/lists/arm-kernel/msg641851.html
- group patches together into subsystems net:... 
- collect reviewed and tested bys
- scrub barrier()


Sinan Kaya (17):
  i40e/i40evf: Eliminate duplicate barriers on weakly-ordered archs
  ixgbe: eliminate duplicate barriers on weakly-ordered archs
  igbvf: eliminate duplicate barriers on weakly-ordered archs
  igb: eliminate duplicate barriers on weakly-ordered archs
  ixgbevf: keep writel() closer to wmb()
  ixgbevf: eliminate duplicate barriers on weakly-ordered archs
  fm10k: Eliminate duplicate barriers on weakly-ordered archs
  drivers: net: cxgb: Eliminate duplicate barriers on weakly-ordered
archs
  net: qla3xxx: Eliminate duplicate barriers on weakly-ordered archs
  qlcnic: Eliminate duplicate barriers on weakly-ordered archs
  bnx2x: Eliminate duplicate barriers on weakly-ordered archs
  net: cxgb4/cxgb4vf: Eliminate duplicate barriers on weakly-ordered
archs
  net: cxgb3: Eliminate duplicate barriers on weakly-ordered archs
  net: qlge: Eliminate duplicate barriers on weakly-ordered archs
  bnxt_en: Eliminate duplicate barriers on weakly-ordered archs
  qed/qede: Eliminate duplicate barriers on weakly-ordered archs
  net: ena: Eliminate duplicate barriers on weakly-ordered archs

 drivers/net/ethernet/amazon/ena/ena_com.c  |  6 --
 drivers/net/ethernet/amazon/ena/ena_eth_com.h  | 22 --
 drivers/net/ethernet/amazon/ena/ena_netdev.c   |  4 ++--
 drivers/net/ethernet/broadcom/bnx2x/bnx2x.h|  9 -
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h|  4 ++--
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c   | 21 +++--
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c  |  2 +-
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c   |  2 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt.c  |  2 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt.h  | 11 ++-
 drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c  |  2 +-
 drivers/net/ethernet/chelsio/cxgb/sge.c|  2 +-
 drivers/net/ethernet/chelsio/cxgb3/adapter.h   |  7 +++
 drivers/net/ethernet/chelsio/cxgb3/sge.c   | 19 ++-
 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h |  6 ++
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c| 13 +++--
 drivers/net/ethernet/chelsio/cxgb4/sge.c   | 12 ++--
 drivers/net/ethernet/chelsio/cxgb4/t4_hw.c |  2 +-
 drivers/net/ethernet/chelsio/cxgb4vf/adapter.h | 14 ++
 drivers/net/ethernet/chelsio/cxgb4vf/sge.c | 18 ++
 drivers/net/ethernet/intel/fm10k/fm10k_main.c  |  4 ++--
 drivers/net/ethernet/intel/i40e/i40e_txrx.c|  8 
 drivers/net/ethernet/intel/i40evf/i40e_txrx.c  |  4 ++--
 drivers/net/ethernet/intel/igb/igb_main.c  |  4 ++--
 drivers/net/ethernet/intel/igbvf/netdev.c  |  4 ++--
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c  |  8 
 drivers/net/ethernet/intel/ixgbevf/ixgbevf.h   |  5 -
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c  |  4 ++--
 drivers/net/ethernet/qlogic/qed/qed.h  |  5 -
 drivers/net/ethernet/qlogic/qed/qed_hw.c   | 12 
 drivers/net/ethernet/qlogic/qed/qed_hw.h   | 14 ++
 drivers/net/ethernet/qlogic/qed/qed_int.c  |  2 +-
 drivers/net/ethernet/qlogic/qed/qed_l2.c   |  2 +-
 drivers/net/ethernet/qlogic/qed/qed_ll2.c  |  2 +-
 drivers/net/ethernet/qlogic/qed/qed_vf.c   |  7 ---
 drivers/net/ethernet/qlogic/qede/qede_ethtool.c|  2 +-
 drivers/net/ethernet/qlogic/qede/qede_fp.c |  4 ++--
 drivers/net/ethernet/qlogic/qla3xxx.c  |  4 ++--
 .../net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c|  2 +-
 drivers/net/ethernet/qlogic/qlge/qlge.h| 17 +

Re: [PATCH v3 18/18] infiniband: cxgb4: Eliminate duplicate barriers on weakly-ordered archs

2018-03-17 Thread Sinan Kaya
+linuxppc-...@lists.ozlabs.org

On 3/17/2018 11:05 AM, Jason Gunthorpe wrote:
> On Sat, Mar 17, 2018 at 12:25:14AM -0400, Sinan Kaya wrote:
>> On 3/17/2018 12:03 AM, Sinan Kaya wrote:
>>> On 3/16/2018 11:40 PM, Sinan Kaya wrote:
>>>> I'll change writel_relaxed() with __raw_writel() in the series like you 
>>>> suggested
>>>> and also look at your other comments.
>>>
>>> I spoke too soon.
>>>
>>> Now that I realized, code needs to follow one of the following patterns for 
>>> correctness
>>>
>>> 1)
>>> wmb()
>>> writel()/writel_relaxed()
>>>
>>> or 
>>>
>>> 2)
>>> wmb()
>>> __raw_wrltel()
>>> mmiowb()
>>>
>>> but definitely not
>>>
>>> wmb()
>>> __raw_wrltel()
>>>
>>> Since #1 == #2, I'll stick to my current implementation of writel_relaxed()
>>>
>>> Changing writel() to writel_relaxed() or __raw_writel() isn't enough. 
>>> PowerPC needs mmiowb()
>>> for correctness. ARM's mmiowb() implementation is empty.
>>>
>>> So, there is no one size fits all solution with the current state of 
>>> affairs.
>>>
>>>
>>
>> I think I finally got what you mean.
>>
>> Code seems to have
>>
>> wmb()
>> writel()/writeq()
>> wmb()
>>
>> this can be safely replaced with
>>
>> wmb()
>> __raw_writel()/__raw_writeq()
>> wmb()
>>
>> This will work on all arches. Below is the new version. Let me know if this 
>> is OK.
>>
>> +++ b/drivers/infiniband/hw/cxgb4/t4.h
>> @@ -457,7 +457,7 @@ static inline void pio_copy(u64 __iomem *dst, u64 *src)
>> int count = 8;
>>
>> while (count) {
>> -   writeq(*src, dst);
>> +   __raw_writeq(*src, dst);
>> src++;
>> dst++;
>> count--;
>> @@ -477,15 +477,16 @@ static inline void t4_ring_sq_db(struct t4_wq *wq, u16 
>> inc, union t4_wr *wqe)
>>  (u64 *)wqe);
>> } else {
>> pr_debug("DB wq->sq.pidx = %d\n", wq->sq.pidx);
>> -   writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),
>> -  wq->sq.bar2_va + SGE_UDB_KDOORBELL);
>> +   __raw_writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),
>> +wq->sq.bar2_va + SGE_UDB_KDOORBELL);
>> }
>>
>> /* Flush user doorbell area writes. */
>> wmb();
>> return;
>> }
>> -   writel(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db);
>> +   __raw_writel(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db);
>> +   mmiowmb()
>>  }
>>
>>  static inline void t4_ring_rq_db(struct t4_wq *wq, u16 inc,
>> @@ -502,15 +503,16 @@ static inline void t4_ring_rq_db(struct t4_wq *wq, u16 
>> inc,
>>  (void *)wqe);
>> } else {
>> pr_debug("DB wq->rq.pidx = %d\n", wq->rq.pidx);
>> -   writel(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid),
>> -  wq->rq.bar2_va + SGE_UDB_KDOORBELL);
>> +   __raw_writel(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid),
>> +wq->rq.bar2_va + SGE_UDB_KDOORBELL);
>> }
>>
>> /* Flush user doorbell area writes. */
>> wmb();
>> return;
>> }
>> -   writel(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db);
>> +   __raw_writel(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db);
>> +   mmiowmb();
> 
> No! NAK! Adding *new* barriers to use the *wrong* accessor is crazy!
> 
> Your first patch was right, replacing
>   wmb()
>   writel()
> 
> With wmb(); writel_relaxed() is fine, and helps ARM.
> 
> The problem with PPC has nothing to do with the writel, it is with the
> spinlock, and we can't improve it without adding some new
> infrastructure. Certainly don't hack around the problem by using
> __raw_writel in multi-arch drivers!
> 
> In userspace we settled on something like this as a pattern:
> 
>  mmio_wc_spin_lock()
>  writel_wc_mmio()
>  mmio_wc_spin_unlock()

> 
> Where mmio_wc_spin_unlock incorporates the mmiowmb and is defined to
> fully contain all MMIO acc

Re: [PATCH v3 18/18] infiniband: cxgb4: Eliminate duplicate barriers on weakly-ordered archs

2018-03-16 Thread Sinan Kaya
On 3/17/2018 12:03 AM, Sinan Kaya wrote:
> On 3/16/2018 11:40 PM, Sinan Kaya wrote:
>> I'll change writel_relaxed() with __raw_writel() in the series like you 
>> suggested
>> and also look at your other comments.
> 
> I spoke too soon.
> 
> Now that I realized, code needs to follow one of the following patterns for 
> correctness
> 
> 1)
> wmb()
> writel()/writel_relaxed()
> 
> or 
> 
> 2)
> wmb()
> __raw_wrltel()
> mmiowb()
> 
> but definitely not
> 
> wmb()
> __raw_wrltel()
> 
> Since #1 == #2, I'll stick to my current implementation of writel_relaxed()
> 
> Changing writel() to writel_relaxed() or __raw_writel() isn't enough. PowerPC 
> needs mmiowb()
> for correctness. ARM's mmiowb() implementation is empty.
> 
> So, there is no one size fits all solution with the current state of affairs.
> 
> 

I think I finally got what you mean.

Code seems to have

wmb()
writel()/writeq()
wmb()

this can be safely replaced with

wmb()
__raw_writel()/__raw_writeq()
wmb()

This will work on all arches. Below is the new version. Let me know if this is 
OK.

+++ b/drivers/infiniband/hw/cxgb4/t4.h
@@ -457,7 +457,7 @@ static inline void pio_copy(u64 __iomem *dst, u64 *src)
int count = 8;

while (count) {
-   writeq(*src, dst);
+   __raw_writeq(*src, dst);
src++;
dst++;
count--;
@@ -477,15 +477,16 @@ static inline void t4_ring_sq_db(struct t4_wq *wq, u16 
inc, union t4_wr *wqe)
 (u64 *)wqe);
} else {
pr_debug("DB wq->sq.pidx = %d\n", wq->sq.pidx);
-   writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),
-  wq->sq.bar2_va + SGE_UDB_KDOORBELL);
+   __raw_writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),
+wq->sq.bar2_va + SGE_UDB_KDOORBELL);
}

/* Flush user doorbell area writes. */
wmb();
return;
}
-   writel(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db);
+   __raw_writel(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db);
+   mmiowmb()
 }

 static inline void t4_ring_rq_db(struct t4_wq *wq, u16 inc,
@@ -502,15 +503,16 @@ static inline void t4_ring_rq_db(struct t4_wq *wq, u16 
inc,
 (void *)wqe);
} else {
pr_debug("DB wq->rq.pidx = %d\n", wq->rq.pidx);
-   writel(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid),
-  wq->rq.bar2_va + SGE_UDB_KDOORBELL);
+   __raw_writel(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid),
+wq->rq.bar2_va + SGE_UDB_KDOORBELL);
}

/* Flush user doorbell area writes. */
wmb();
    return;
}
-   writel(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db);
+   __raw_writel(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db);
+   mmiowmb();
 }


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


  1   2   >