Re: clean up and modularize arch dma_mapping interface V2

2017-06-26 Thread tndave



On 06/26/2017 02:47 AM, Christoph Hellwig wrote:

On Sat, Jun 24, 2017 at 10:36:56AM -0500, Benjamin Herrenschmidt wrote:

I think we still need to do it. For example we have a bunch of new "funky"
cases.


I have no plan to do away with the selection - I just want a better
interface than the current one.

I agree we need a better interface than the current one.
Like the powerpc cases Benjamin mentioned, sparc also needs some special
treatment for the ATU IOMMU depending on the device's DMA mask.

For sparc, I am in the process of enabling one or more dedicated IOTSBs (I/O
Translation Storage Buffers) per PCI BDF (contrary to the current design
where all PCI devices under a root complex share a 32-bit and/or 64-bit
IOTSB depending on 32-bit and/or 64-bit DMA). I am planning to use the DMA
set mask APIs as the hook where, based on the device's DMA mask values
(dma_mask and coherent_dma_mask), one or more IOTSB resources will be
allocated (and released [1]).

Without the set_dma_mask ops, I can still rely on HAVE_ARCH_DMA_SET_MASK and
dma_supported(), which let me distinguish whether a device is setting its
streaming dma_mask or its coherent_dma_mask.

-Tushar

[1] By default, every PCI BDF will have one dedicated 32-bit IOTSB. This
supports the default case where some device drivers don't bother to set a
DMA mask at all and are fine with the default 32-bit mask.
A 64-bit IOTSB will be allocated when a device requests a 64-bit dma_mask.
However, if a device wants a 64-bit DMA mask for both coherent and
non-coherent DMA, the default 32-bit IOTSB will be released as well. Wasting
an IOTSB is not a good idea because there is a hard limit on the maximum
number of IOTSBs per guest domain per root complex.
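
To make the policy concrete, here is a rough sketch of the mask-set hook
described above. This is only an illustration: the helper names
(sparc_iotsb_alloc/sparc_iotsb_release) and the IOTSB type constants are
hypothetical, not existing kernel functions.

#include <linux/device.h>
#include <linux/dma-mapping.h>

/* Hypothetical helpers, for illustration only. */
enum { IOTSB_32BIT, IOTSB_64BIT };
int sparc_iotsb_alloc(struct device *dev, int type);
void sparc_iotsb_release(struct device *dev, int type);

/* Sketch: adjust per-BDF IOTSBs when a device sets its DMA masks. */
static int sparc_iotsb_update(struct device *dev, u64 streaming_mask,
			      u64 coherent_mask)
{
	if (streaming_mask > DMA_BIT_MASK(32) &&
	    coherent_mask > DMA_BIT_MASK(32)) {
		/* 64-bit for both: drop the default 32-bit IOTSB */
		sparc_iotsb_release(dev, IOTSB_32BIT);
		return sparc_iotsb_alloc(dev, IOTSB_64BIT);
	}
	if (streaming_mask > DMA_BIT_MASK(32) ||
	    coherent_mask > DMA_BIT_MASK(32))
		/* mixed masks: keep the 32-bit IOTSB, add a 64-bit one */
		return sparc_iotsb_alloc(dev, IOTSB_64BIT);
	return 0;	/* the default 32-bit IOTSB already exists */
}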






Re: clean up and modularize arch dma_mapping interface V2

2017-06-21 Thread tndave



On 06/16/2017 11:10 AM, Christoph Hellwig wrote:

Hi all,

for a while we have had a generic implementation of the dma mapping routines
that call into per-arch or per-device operations.  But right now there
still are various bits in the interfaces where we don't clearly operate
on these ops.  This series tries to clean up a lot of those (not all of
them yet, but the series is big enough as it is).  It gets rid of the
DMA_ERROR_CODE way of signaling failures of the mapping routines from the
implementations to the generic code (and cleans up various drivers that
were incorrectly using it), and gets rid of the ->set_dma_mask routine
in favor of relying on the ->dma_capable method, which can be used in
the same way but requires less code duplication.
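
(For illustration: on the driver side, the pattern the series standardizes
on is checking failures with dma_mapping_error() instead of comparing
against a per-arch DMA_ERROR_CODE. A minimal sketch, with a made-up
function name:)

#include <linux/dma-mapping.h>

/* Sketch: portable mapping-failure check, no arch-specific error constant. */
static int example_map_buf(struct device *dev, void *buf, size_t len,
			   dma_addr_t *dma)
{
	*dma = dma_map_single(dev, buf, len, DMA_TO_DEVICE);
	if (dma_mapping_error(dev, *dma))
		return -ENOMEM;
	return 0;
}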

Chris,

Thanks for doing this.
So archs can still have their own definition for dma_set_mask() if 
HAVE_ARCH_DMA_SET_MASK is y?
(and similarly for dma_set_coherent_mask() when 
CONFIG_ARCH_HAS_DMA_SET_COHERENT_MASK is y)

Any plan to change these?
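
(For reference, the generic fallback that HAVE_ARCH_DMA_SET_MASK lets an
arch replace looked roughly like this in the dma-mapping.h of that era;
reproduced from memory, so treat it as a sketch rather than the exact
source:)

#ifndef HAVE_ARCH_DMA_SET_MASK
static inline int dma_set_mask(struct device *dev, u64 mask)
{
	const struct dma_map_ops *ops = get_dma_ops(dev);

	if (ops->set_dma_mask)
		return ops->set_dma_mask(dev, mask);

	if (!dev->dma_mask || !dma_supported(dev, mask))
		return -EIO;
	*dev->dma_mask = mask;
	return 0;
}
#endif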

I'm in the process of making some changes to the SPARC iommu, so it would
be good to know. Thanks.


-Tushar



I've got a good number of reviews last time, but a few are still missing.
I'd love to not have to re-spam everyone with this patchbomb, so early
ACKs (or complaints) are welcome.

I plan to create a new dma-mapping tree to collect all this work.
Any volunteers for co-maintainers, especially from the iommu gang?

The whole series is also available in git:

 git://git.infradead.org/users/hch/misc.git dma-map

Gitweb:

 http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dma-map

Changes since V1:
  - remove two lines of code from arm dmabounce
  - a few commit message tweaks
  - lots of ACKs



Re: [PATCH] i40e: Fix incorrect pf->flags

2017-05-25 Thread tndave



On 05/25/2017 02:13 AM, Jeff Kirsher wrote:

On Fri, 2017-05-19 at 18:01 -0700, Tushar Dave wrote:

Fix bug introduced by 'commit 47994c119a36e ("i40e: remove
hw_disabled_flags in favor of using separate flag bits")' that
mistakenly wipes out pf->flags.

Signed-off-by: Tushar Dave 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)


:-(
Not even close to applying, I even tried to apply it by hand but the patch
was not near the same as my branch in my tree.  I will forgive this
second instance of a patch generated on a different tree/branch, but I
really must insist that you use my tree & branch if you need/want to
make changes to the Intel wired LAN drivers.

Apologies Jeff. I used Dave's net tree.
From now on, I will use your 'net-queue:dev_queue' for bug fixes and
'next-queue:dev_queue' for development changes to the Intel Ethernet drivers.

Since this patch is a bug fix, I will send v2 based off your git tree
'net-queue' on branch 'dev_queue'. Hope that is okay?

Thanks.

-Tushar




diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index d5c9c9e..6b98d34 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -8821,9 +8821,9 @@ static int i40e_sw_init(struct i40e_pf *pf)
(pf->hw.aq.api_min_ver > 4))) {
/* Supported in FW API version higher than 1.4 */
pf->flags |= I40E_FLAG_GENEVE_OFFLOAD_CAPABLE;
-   pf->flags = I40E_FLAG_HW_ATR_EVICT_CAPABLE;
+   pf->flags |= I40E_FLAG_HW_ATR_EVICT_CAPABLE;
} else {
-   pf->flags = I40E_FLAG_HW_ATR_EVICT_CAPABLE;
+   pf->flags |= I40E_FLAG_HW_ATR_EVICT_CAPABLE;
}

pf->eeprom_version = 0xDEAD;


Re: [PATCH] netpoll: Check for skb->queue_mapping

2017-04-20 Thread tndave



On 04/12/2017 03:37 PM, tndave wrote:



On 04/06/2017 12:14 PM, Eric Dumazet wrote:

On Thu, 2017-04-06 at 12:07 -0700, tndave wrote:


+q_index = q_index % dev->real_num_tx_queues;

cpu interrupted here and dev->real_num_tx_queues has been reduced!

+skb_set_queue_mapping(skb, q_index);
+}
+txq = netdev_get_tx_queue(dev, q_index);

or cpu interrupted here and dev->real_num_tx_queues has been reduced!


If dev->real_num_tx_queues can be changed while this code is running we
are in deep deep trouble.

Better make sure that when the control path does this change, the device
(and/or netpoll) is frozen and no packet can be sent.

When the control path is making a change to real_num_tx_queues, the
underlying device is disabled; the netdev tx queues are also
stopped/disabled, so certainly no transmit is happening.
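
(A rough sketch of that control-path sequence, under assumed
simplifications; real drivers typically do this from their ethtool
set_channels path, and the function name here is made up:)

#include <linux/netdevice.h>
#include <linux/rtnetlink.h>

/* Sketch: shrink the tx queue count with the device quiesced first. */
static int example_resize_tx_queues(struct net_device *dev, unsigned int n)
{
	int err;

	ASSERT_RTNL();			/* control path runs under rtnl */
	netif_tx_stop_all_queues(dev);	/* no transmits while resizing */
	err = netif_set_real_num_tx_queues(dev, n);
	netif_tx_start_all_queues(dev);
	return err;
}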


The corner case I was referring to is if netpoll's queue_process() code is
interrupted and, while it is not running, the control path makes a change to
dev->real_num_tx_queues and exits. Later on, the interrupted queue_process()
resumes execution and can end up with a wrong skb->queue_mapping and txq.
We can prevent this case with the change below:

diff --git a/net/core/netpoll.c b/net/core/netpoll.c
index 9424673..29be246 100644
--- a/net/core/netpoll.c
+++ b/net/core/netpoll.c
@@ -105,15 +105,21 @@ static void queue_process(struct work_struct *work)
while ((skb = skb_dequeue(&npinfo->txq))) {
struct net_device *dev = skb->dev;
struct netdev_queue *txq;
+   unsigned int q_index;

if (!netif_device_present(dev) || !netif_running(dev)) {
kfree_skb(skb);
continue;
}

-   txq = skb_get_tx_queue(dev, skb);
-
local_irq_save(flags);
+   /* check if skb->queue_mapping is still valid */
+   q_index = skb_get_queue_mapping(skb);
+   if (unlikely(q_index >= dev->real_num_tx_queues)) {
+   q_index = q_index % dev->real_num_tx_queues;
+   skb_set_queue_mapping(skb, q_index);
+   }
+   txq = netdev_get_tx_queue(dev, q_index);
HARD_TX_LOCK(dev, txq, smp_processor_id());
if (netif_xmit_frozen_or_stopped(txq) ||
netpoll_start_xmit(skb, dev, txq) != NETDEV_TX_OK) {


Thanks for your patience.

Eric,

Let me know if you are okay with the above changes and with me sending v2.

-Tushar


-Tushar









Re: [PATCH] netpoll: Check for skb->queue_mapping

2017-04-12 Thread tndave



On 04/06/2017 12:14 PM, Eric Dumazet wrote:

On Thu, 2017-04-06 at 12:07 -0700, tndave wrote:


+   q_index = q_index % dev->real_num_tx_queues;

cpu interrupted here and dev->real_num_tx_queues has been reduced!

+   skb_set_queue_mapping(skb, q_index);
+   }
+   txq = netdev_get_tx_queue(dev, q_index);

or cpu interrupted here and dev->real_num_tx_queues has been reduced!


If dev->real_num_tx_queues can be changed while this code is running we
are in deep deep trouble.

Better make sure that when the control path does this change, the device
(and/or netpoll) is frozen and no packet can be sent.

When the control path is making a change to real_num_tx_queues, the
underlying device is disabled; the netdev tx queues are also
stopped/disabled, so certainly no transmit is happening.


The corner case I was referring to is if netpoll's queue_process() code is
interrupted and, while it is not running, the control path makes a change to
dev->real_num_tx_queues and exits. Later on, the interrupted queue_process()
resumes execution and can end up with a wrong skb->queue_mapping and txq.
We can prevent this case with the change below:

diff --git a/net/core/netpoll.c b/net/core/netpoll.c
index 9424673..29be246 100644
--- a/net/core/netpoll.c
+++ b/net/core/netpoll.c
@@ -105,15 +105,21 @@ static void queue_process(struct work_struct *work)
while ((skb = skb_dequeue(&npinfo->txq))) {
struct net_device *dev = skb->dev;
struct netdev_queue *txq;
+   unsigned int q_index;

if (!netif_device_present(dev) || !netif_running(dev)) {
kfree_skb(skb);
continue;
}

-   txq = skb_get_tx_queue(dev, skb);
-
local_irq_save(flags);
+   /* check if skb->queue_mapping is still valid */
+   q_index = skb_get_queue_mapping(skb);
+   if (unlikely(q_index >= dev->real_num_tx_queues)) {
+   q_index = q_index % dev->real_num_tx_queues;
+   skb_set_queue_mapping(skb, q_index);
+   }
+   txq = netdev_get_tx_queue(dev, q_index);
HARD_TX_LOCK(dev, txq, smp_processor_id());
if (netif_xmit_frozen_or_stopped(txq) ||
netpoll_start_xmit(skb, dev, txq) != NETDEV_TX_OK) {


Thanks for your patience.

-Tushar







Re: [PATCH] netpoll: Check for skb->queue_mapping

2017-04-06 Thread tndave



On 04/06/2017 03:26 AM, Eric Dumazet wrote:

On Wed, 2017-04-05 at 19:06 -0700, Tushar Dave wrote:

Reducing real_num_tx_queues needs to be in sync with skb queue_mapping,
otherwise skbs with queue_mapping greater than real_num_tx_queues
can be sent to the underlying driver and can result in a kernel panic.

One such event is running netconsole and enabling VF on the same
device. Or running netconsole and changing number of tx queues via
ethtool on same device.

e.g.




Signed-off-by: Tushar Dave 
---
 net/core/netpoll.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/net/core/netpoll.c b/net/core/netpoll.c
index 9424673..c572e49 100644
--- a/net/core/netpoll.c
+++ b/net/core/netpoll.c
@@ -101,6 +101,7 @@ static void queue_process(struct work_struct *work)
container_of(work, struct netpoll_info, tx_work.work);
struct sk_buff *skb;
unsigned long flags;
+   u16 q_index;

while ((skb = skb_dequeue(&npinfo->txq))) {
struct net_device *dev = skb->dev;
@@ -117,6 +118,12 @@ static void queue_process(struct work_struct *work)
HARD_TX_LOCK(dev, txq, smp_processor_id());
if (netif_xmit_frozen_or_stopped(txq) ||
netpoll_start_xmit(skb, dev, txq) != NETDEV_TX_OK) {
+   /* check if skb->queue_mapping has changed */
+   q_index = skb_get_queue_mapping(skb);
+   if (unlikely(q_index >= dev->real_num_tx_queues)) {
+   q_index = q_index % dev->real_num_tx_queues;
+   skb_set_queue_mapping(skb, q_index);
+   }
skb_queue_head(&npinfo->txq, skb);
HARD_TX_UNLOCK(dev, txq);
local_irq_restore(flags);


Hi Tushar, thank you for working on this issue.

Eric,

Thank you for reviewing my change and your valuable comments.


Where and when skb->queue_mapping has changed ?

Well, I should amend the comment,
"/* check if skb->queue_mapping has changed */" is misleading.
I should say, /* check if dev->real_num_tx_queues has changed */


It looks that the real problem is that dev->real_num_tx_queues has been
changed instead of skb->queue_mapping

Yes.


So maybe the more correct change would be to cap skb->queue_mapping even
before getting skb_get_tx_queue() ?

Otherwise, even after your patch, we might still access an invalid queue
on the device ?

This is the case for direct xmit (netdev_start_xmit). Most of the underlying
mq device drivers retrieve the tx queue index from skb->queue_mapping.
One possibility I see where the device driver still gets an invalid queue
index (skb->queue_mapping) with my patch is when the following occurs:
- after executing "txq = skb_get_tx_queue(dev, skb)" the cpu is interrupted;
- some other cpu reduces dev->real_num_tx_queues;
- the interrupted cpu resumes (queue_process()).
With the above sequence, at the underlying driver's xmit function we can
have skb->queue_mapping >= dev->real_num_tx_queues [1].

I like your suggested patch more than mine, though, because it checks for a
change in dev->real_num_tx_queues before we even retrieve the netdev_queue
txq, so there is less chance of ending up with the wrong txq.

However, even your patch has the same issue as [1]? (see my comment below).



Something like the following :

diff --git a/net/core/netpoll.c b/net/core/netpoll.c
index 9424673009c14e0fb288b8e4041dba596b37ee8d..16702d95f83ab884e605e3868cfef94615dcbc72 100644
--- a/net/core/netpoll.c
+++ b/net/core/netpoll.c
@@ -105,13 +105,20 @@ static void queue_process(struct work_struct *work)
while ((skb = skb_dequeue(&npinfo->txq))) {
struct net_device *dev = skb->dev;
struct netdev_queue *txq;
+   unsigned int q_index;

if (!netif_device_present(dev) || !netif_running(dev)) {
kfree_skb(skb);
continue;
}

-   txq = skb_get_tx_queue(dev, skb);
+   /* check if skb->queue_mapping is still valid */
+   q_index = skb_get_queue_mapping(skb);
+   if (unlikely(q_index >= dev->real_num_tx_queues)) {
+   q_index = q_index % dev->real_num_tx_queues;

cpu interrupted here and dev->real_num_tx_queues has been reduced!

+   skb_set_queue_mapping(skb, q_index);
+   }
+   txq = netdev_get_tx_queue(dev, q_index);

or cpu interrupted here and dev->real_num_tx_queues has been reduced!

In either case we hit the bug, or am I missing something?

The other concern is concurrent execution of netpoll's direct xmit path
and updates to dev->real_num_tx_queues (by another cpu)?


Thanks.

-Tushar


local_irq_save(flags);
HARD_TX_LOCK(dev, txq, smp_processor_id());






Re: [PATCH] ixgbe: Check for skb->queue_mapping

2017-03-29 Thread tndave



On 03/29/2017 05:29 PM, Eric Dumazet wrote:

On Wed, 2017-03-29 at 16:55 -0700, Tushar Dave wrote:

There are events seen where the skb->queue_mapping value is greater than
adapter->num_tx_queues. In such cases, adapter->tx_ring becomes invalid
(null) and xmit results in a kernel panic.

One such event is running netconsole and enabling VF on the same device.
Or running netconsole and changing number of tx queues via ethtool on
same device.

This patch adds a check for skb->queue_mapping and uses simple arithmetic
to select an available tx queue in the driver.


We are not going to fix all multi queue drivers, for something that
seems to be a generic issue higher in the stack.

I have fixed it in the ixgbe and i40e drivers because I see the Intel igb
driver does a similar thing. But I agree that the core issue lies up in the
stack.

For information, I found that this issue only affects skbs that come from
the direct xmit path (netdev_start_xmit) that netconsole, pktgen, af_packet
etc. use. skbs coming from a qdisc already have this fixed in the stack.
(http://lists.openwall.net/netdev/2010/07/01/58)
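
For reference, the driver-side guard amounts to something like the sketch
below; the function name is made up and the actual ixgbe/i40e patches
differ in detail:

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Sketch: clamp a stale queue_mapping before indexing the driver tx rings. */
static u16 example_cap_queue_mapping(struct sk_buff *skb,
				     unsigned int num_tx_queues)
{
	u16 q = skb_get_queue_mapping(skb);

	if (unlikely(q >= num_tx_queues)) {
		q = q % num_tx_queues;
		skb_set_queue_mapping(skb, q);
	}
	return q;
}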

Thanks.

-Tushar








Re: [Intel-wired-lan] Question on ixgbe flow director

2017-03-08 Thread tndave



On 03/08/2017 01:03 PM, Alexander Duyck wrote:

On Wed, Mar 8, 2017 at 11:30 AM, tndave <tushar.n.d...@oracle.com> wrote:



On 03/08/2017 07:39 AM, Alexander Duyck wrote:


On Tue, Mar 7, 2017 at 3:43 PM, tndave <tushar.n.d...@oracle.com>
wrote:


Hi,

I have a few questions regarding the ixgbe flow director. As per my
understanding, the flow director in ixgbe can work in 2 exclusive ways:
 a. Using ATR filters - where the flow director is set up in HW by the
driver identifying transmit traffic. Based on that, receive
traffic of the same flow gets assigned/directed to the same queue.



So ATR for ixgbe uses the Flow Director signature filters. Basically
the signature filters are purely hash based filters.  The easiest way
to think of it is that it is essentially a beefed up version of RSS.


Alex,

Thanks for the info.




b. Perfect filter, where the user can manually program the flow director
using ethtool so that receive packets get directed to a specified
rx queue (depending on the ethtool flow-type and action etc.);
but with perfect filters there is no intelligence involved like
ATR has in identifying transmit traffic, right?



You are right.  The perfect match filters are the ones we configure
via ethtool.  You are correct in that ATR logic is not involved so
there is no matching up an Rx flow to a Tx flow.


A few questions regarding ixgbe ATR: 1. Does ATR work if the
protocol is UDP? (Based on the current ixgbe_atr() it only
supports TCP.)



It could, but the software for doing ATR flow identification,
ixgbe_atr(), only works on TCP.  The reasoning behind it is that Flow
Director doesn't filter fragmented frames and UDP can be fragmented.


So we could just add code in ixgbe_atr() for non-fragmented UDP packets!
However, the question is whether it is going to work w.r.t. ATR: TCP is a
connected stream, whereas UDP is stateless and a single UDP socket can talk
to multiple endpoints.
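
(For what a UDP extension would minimally have to check, a sketch;
ixgbe_atr() itself does much more, so this is illustrative only and the
function name is made up:)

#include <linux/in.h>
#include <linux/ip.h>
#include <linux/skbuff.h>

/* Sketch: only non-fragmented IPv4/UDP packets could be ATR candidates,
 * since Flow Director does not filter fragments.
 */
static bool example_udp_atr_candidate(const struct sk_buff *skb)
{
	const struct iphdr *iph = ip_hdr(skb);

	if (iph->protocol != IPPROTO_UDP)
		return false;
	return (iph->frag_off & htons(IP_MF | IP_OFFSET)) == 0;
}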


Right so the Tx based approach for UDP also probably won't work
because there isn't a symmetric set of endpoints used for most UDP
flows.  For example with TCP you have a fixed set of source and
destination ports that you can swap and get the opposite direction.
The same isn't true for UDP.  You can end up with UDP using a random
source port and sending to the same fixed destination port on both
ends.

You might look at the hardware RFS offload (accelerated RFS) that some
of the other drivers out there support.  Something like that might work
for your needs if you need to steer UDP.

Okay. Thanks Alex.

-Tushar





2. Can the ATR flow director be programmed using ethtool? (As per
my understanding only perfect filters can be programmed from
ethtool, is that so?)



You can enable/disable the ATR filters via the NTUPLE feature flag.
Basically it toggles between ATR mode and NTUPLE mode.  In NTUPLE
mode the perfect filters are enabled and can be configured via
ethtool.


Got it.

Thanks.

-Tushar




Thanks in advance, -Tushar



Hope that helps.

- Alex







Re: [Intel-wired-lan] Question on ixgbe flow director

2017-03-08 Thread tndave



On 03/08/2017 07:39 AM, Alexander Duyck wrote:

On Tue, Mar 7, 2017 at 3:43 PM, tndave <tushar.n.d...@oracle.com>
wrote:

Hi,

I have a few questions regarding the ixgbe flow director. As per my
understanding, the flow director in ixgbe can work in 2 exclusive ways:
 a. Using ATR filters - where the flow director is set up in HW by the
driver identifying transmit traffic. Based on that, receive
traffic of the same flow gets assigned/directed to the same queue.


So ATR for ixgbe uses the Flow Director signature filters. Basically
the signature filters are purely hash based filters.  The easiest way
to think of it is that it is essentially a beefed up version of RSS.

Alex,

Thanks for the info.



b. Perfect filter, where the user can manually program the flow director
using ethtool so that receive packets get directed to a specified
rx queue (depending on the ethtool flow-type and action etc.);
but with perfect filters there is no intelligence involved like
ATR has in identifying transmit traffic, right?


You are right.  The perfect match filters are the ones we configure
via ethtool.  You are correct in that ATR logic is not involved so
there is no matching up an Rx flow to a Tx flow.


A few questions regarding ixgbe ATR: 1. Does ATR work if the
protocol is UDP? (Based on the current ixgbe_atr() it only
supports TCP.)


It could, but the software for doing ATR flow identification,
ixgbe_atr(), only works on TCP.  The reasoning behind it is that Flow
Director doesn't filter fragmented frames and UDP can be fragmented.

So we could just add code in ixgbe_atr() for non-fragmented UDP packets!
However, the question is whether it is going to work w.r.t. ATR: TCP is a
connected stream, whereas UDP is stateless and a single UDP socket can talk
to multiple endpoints.




2. Can the ATR flow director be programmed using ethtool? (As per
my understanding only perfect filters can be programmed from
ethtool, is that so?)


You can enable/disable the ATR filters via the NTUPLE feature flag.
Basically it toggles between ATR mode and NTUPLE mode.  In NTUPLE
mode the perfect filters are enabled and can be configured via
ethtool.

Got it.

Thanks.

-Tushar



Thanks in advance, -Tushar


Hope that helps.

- Alex



Question on ixgbe flow director

2017-03-07 Thread tndave

Hi,

I have a few questions regarding the ixgbe flow director.
As per my understanding, the flow director in ixgbe can work in 2 exclusive
ways:
a. Using ATR filters - where the flow director is set up in HW by the driver
identifying transmit traffic. Based on that, receive traffic of the
same flow gets assigned/directed to the same queue.

b. Perfect filter, where the user can manually program the flow director
using ethtool so that receive packets get directed to a specified rx queue
(depending on the ethtool flow-type and action etc.); but with
perfect filters there is no intelligence involved like ATR has in
identifying transmit traffic, right?

A few questions regarding ixgbe ATR:
1. Does ATR work if the protocol is UDP? (Based on the current
ixgbe_atr() it only supports TCP.)
2. Can the ATR flow director be programmed using ethtool?
(As per my understanding only perfect filters can be programmed from
ethtool, is that so?)

Thanks in advance,
-Tushar



Re: [RFC PATCH 0/2] rx zero copy interface for af_packet

2017-01-31 Thread tndave



On 01/27/2017 01:33 PM, John Fastabend wrote:

This is an experimental implementation of rx zero copy for af_packet.
Its a bit rough and likely has errors but the plan is to clean it up
over the next few months.

And seeing I said I would post it in another thread a few days back
here it is.


This sounds good (believe me, I have been thinking along the same lines :)
On the driver Rx side we always premap RX buffers, so it is best to map them
to shmem for PF_PACKET sockets.
Also, I like the idea that the user can put a selected queue (maybe queues
in future?) into PF_PACKET mode, keeping the rest of the queues as they are.
Zero copy and removing skb setup & processing overhead on RX certainly
makes things faster and helps latency. Zero copy is good on Tx too; however,
without an skb we should figure out how to use the segmentation and checksum
offloading features of the HW. Can this be considered in the tpacket V4
header?


-Tushar


Re: [RFC PATCH] i40e: enable PCIe relax ordering for SPARC

2016-12-28 Thread tndave



On 12/27/2016 04:40 PM, maowenan wrote:




-Original Message-
From: tndave [mailto:tushar.n.d...@oracle.com]
Sent: Wednesday, December 28, 2016 6:28 AM
To: maowenan; jeffrey.t.kirs...@intel.com; intel-wired-...@lists.osuosl.org
Cc: netdev@vger.kernel.org; weiyongjun (A); Dingtianhong
Subject: Re: [RFC PATCH] i40e: enable PCIe relax ordering for SPARC



On 12/26/2016 03:39 AM, maowenan wrote:




-Original Message-
From: netdev-ow...@vger.kernel.org
[mailto:netdev-ow...@vger.kernel.org]
On Behalf Of Tushar Dave
Sent: Tuesday, December 06, 2016 1:07 AM
To: jeffrey.t.kirs...@intel.com; intel-wired-...@lists.osuosl.org
Cc: netdev@vger.kernel.org
Subject: [RFC PATCH] i40e: enable PCIe relax ordering for SPARC

Unlike previous generation NICs (e.g. ixgbe), i40e doesn't seem to have a
standard CSR where PCIe relaxed ordering can be set. Without PCIe
relaxed ordering enabled, i40e performance is significantly low on SPARC.


[Mao Wenan] Hi Tushar, you have said that i40e doesn't seem to have a
standard CSR to set PCIe relaxed ordering; is this CSR like the TX DCA
Control Register in 82599?
Yes.
The i40e datasheet mentions some CSRs that can be used to enable/disable
PCIe relaxed ordering in the device; however, I don't see the exact
definition of those registers in the datasheet.
(https://www.mail-archive.com/netdev@vger.kernel.org/msg117219.html)


Is DMA_ATTR_WEAK_ORDERING the same as the TX control register in
82599?
No.
DMA_ATTR_WEAK_ORDERING applies to the PCIe root complex of the system.
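
(On the driver side it is just a per-mapping attribute; a minimal sketch
using the 4.8+ unsigned-long attrs API, with a made-up function name:)

#include <linux/dma-mapping.h>

/* Sketch: ask the platform (IOMMU/root complex) to allow relaxed ordering
 * for this mapping; architectures that ignore the attribute treat it as a
 * no-op.
 */
static dma_addr_t example_map_rx_buf(struct device *dev, void *buf,
				     size_t len)
{
	return dma_map_single_attrs(dev, buf, len, DMA_FROM_DEVICE,
				    DMA_ATTR_WEAK_ORDERING);
}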

-Tushar


I understand that the PCIe Root Complex is the Host Bridge in the CPU that
connects the CPU and memory to the PCIe architecture. So this attribute
DMA_ATTR_WEAK_ORDERING is only applied on the CPU side (the SPARC in your
system); it can't apply to i40e, is that right?

Yes.

And it is not the same as the 82599 DCA control register's relaxed ordering bits.

It is not the same as the 82599 DCA control register's relaxed ordering bits.

-Tushar


-Mao Wenan



And to enable relaxed ordering mode in 82599 for SPARC, the code below is used:

s32 ixgbe_start_hw_gen2(struct ixgbe_hw *hw)
{
	u32 i;

	/* Clear the rate limiters */
	for (i = 0; i < hw->mac.max_tx_queues; i++) {
		IXGBE_WRITE_REG(hw, IXGBE_RTTDQSEL, i);
		IXGBE_WRITE_REG(hw, IXGBE_RTTBCNRC, 0);
	}
	IXGBE_WRITE_FLUSH(hw);

#ifndef CONFIG_SPARC
	/* Disable relaxed ordering */
	for (i = 0; i < hw->mac.max_tx_queues; i++) {
		u32 regval;

		regval = IXGBE_READ_REG(hw, IXGBE_DCA_TXCTRL_82599(i));
		regval &= ~IXGBE_DCA_TXCTRL_DESC_WRO_EN;
		IXGBE_WRITE_REG(hw, IXGBE_DCA_TXCTRL_82599(i), regval);
	}

	for (i = 0; i < hw->mac.max_rx_queues; i++) {
		u32 regval;

		regval = IXGBE_READ_REG(hw, IXGBE_DCA_RXCTRL(i));
		regval &= ~(IXGBE_DCA_RXCTRL_DATA_WRO_EN |
			    IXGBE_DCA_RXCTRL_HEAD_WRO_EN);
		IXGBE_WRITE_REG(hw, IXGBE_DCA_RXCTRL(i), regval);
	}
#endif
	return 0;
}




This patch sets PCIe relaxed ordering for the SPARC arch by setting the DMA
attribute DMA_ATTR_WEAK_ORDERING for every tx and rx DMA map/unmap.
This has shown a 10x increase in performance numbers.

e.g.
iperf TCP test with 10 threads on SPARC S7

Test 1: Without this patch

[root@brm-snt1-03 net]# iperf -s

Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)

[  4] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 40926
[  5] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 40934
[  6] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 40930
[  7] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 40928
[  8] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 40922
[  9] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 40932
[ 10] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 40920
[ 11] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 40924
[ 14] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 40982
[ 12] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 40980
[ ID] Interval   Transfer Bandwidth
[  4]  0.0-20.0 sec   566 MBytes   237 Mbits/sec
[  5]  0.0-20.0 sec   532 MBytes   223 Mbits/sec
[  6]  0.0-20.0 sec   537 MBytes   225 Mbits/sec
[  8]  0.0-20.0 sec   546 MBytes   229 Mbits/sec
[ 11]  0.0-20.0 sec   592 MBytes   248 Mbits/sec
[  7]  0.0-20.0 sec   539 MBytes   226 Mbits/sec
[  9]  0.0-20.0 sec   572 MBytes   240 Mbits/sec
[ 10]  0.0-20.0 sec   604 MBytes   253 Mbits/sec
[ 14]  0.0-20.0 sec   567 MBytes   238 Mbits/sec
[ 12]  0.0-20.0 sec   511 MBytes   214 Mbits/sec
[SUM]  0.0-20.0 sec  5.44 GBytes  2.33 Gbits/sec

Test 2: with this patch:

[root@brm-snt1-03 net]# iperf -s

Server listening on TCP port 5001
TCP window

Re: [RFC PATCH] i40e: enable PCIe relax ordering for SPARC

2016-12-27 Thread tndave



On 12/26/2016 03:39 AM, maowenan wrote:




-Original Message-
From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org]
On Behalf Of Tushar Dave
Sent: Tuesday, December 06, 2016 1:07 AM
To: jeffrey.t.kirs...@intel.com; intel-wired-...@lists.osuosl.org
Cc: netdev@vger.kernel.org
Subject: [RFC PATCH] i40e: enable PCIe relax ordering for SPARC

Unlike previous generation NICs (e.g. ixgbe), i40e doesn't seem to have a
standard CSR where PCIe relaxed ordering can be set. Without PCIe relaxed
ordering enabled, i40e performance is significantly low on SPARC.


[Mao Wenan] Hi Tushar, you have said that i40e doesn't seem to have a
standard CSR to set PCIe relaxed ordering; is this CSR like the TX DCA
Control Register in 82599?

Yes.
The i40e datasheet mentions some CSRs that can be used to enable/disable
PCIe relaxed ordering in the device; however, I don't see the exact
definition of those registers in the datasheet.
(https://www.mail-archive.com/netdev@vger.kernel.org/msg117219.html)


Is DMA_ATTR_WEAK_ORDERING the same as TX control register in
82599?

No.
DMA_ATTR_WEAK_ORDERING applies to the PCIe root complex of the system.

-Tushar



And to enable relax ordering mode in 82599 for SPARC using below codes:
s32 ixgbe_start_hw_gen2(struct ixgbe_hw *hw)
{
u32 i;

/* Clear the rate limiters */
for (i = 0; i < hw->mac.max_tx_queues; i++) {
IXGBE_WRITE_REG(hw, IXGBE_RTTDQSEL, i);
IXGBE_WRITE_REG(hw, IXGBE_RTTBCNRC, 0);
}
IXGBE_WRITE_FLUSH(hw);

#ifndef CONFIG_SPARC
/* Disable relaxed ordering */
for (i = 0; i < hw->mac.max_tx_queues; i++) {
u32 regval;

regval = IXGBE_READ_REG(hw, IXGBE_DCA_TXCTRL_82599(i));
regval &= ~IXGBE_DCA_TXCTRL_DESC_WRO_EN;
IXGBE_WRITE_REG(hw, IXGBE_DCA_TXCTRL_82599(i), regval);
}

for (i = 0; i < hw->mac.max_rx_queues; i++) {
u32 regval;

regval = IXGBE_READ_REG(hw, IXGBE_DCA_RXCTRL(i));
regval &= ~(IXGBE_DCA_RXCTRL_DATA_WRO_EN |
IXGBE_DCA_RXCTRL_HEAD_WRO_EN);
IXGBE_WRITE_REG(hw, IXGBE_DCA_RXCTRL(i), regval);
}
#endif
return 0;
}




This patch sets PCIe relaxed ordering for the SPARC arch by setting the DMA
attribute DMA_ATTR_WEAK_ORDERING for every tx and rx DMA map/unmap.
This has shown a 10x increase in performance numbers.

e.g.
iperf TCP test with 10 threads on SPARC S7

Test 1: Without this patch

[root@brm-snt1-03 net]# iperf -s

Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)

[  4] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 40926
[  5] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 40934
[  6] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 40930
[  7] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 40928
[  8] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 40922
[  9] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 40932
[ 10] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 40920
[ 11] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 40924
[ 14] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 40982
[ 12] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 40980
[ ID] Interval   Transfer Bandwidth
[  4]  0.0-20.0 sec   566 MBytes   237 Mbits/sec
[  5]  0.0-20.0 sec   532 MBytes   223 Mbits/sec
[  6]  0.0-20.0 sec   537 MBytes   225 Mbits/sec
[  8]  0.0-20.0 sec   546 MBytes   229 Mbits/sec
[ 11]  0.0-20.0 sec   592 MBytes   248 Mbits/sec
[  7]  0.0-20.0 sec   539 MBytes   226 Mbits/sec
[  9]  0.0-20.0 sec   572 MBytes   240 Mbits/sec
[ 10]  0.0-20.0 sec   604 MBytes   253 Mbits/sec
[ 14]  0.0-20.0 sec   567 MBytes   238 Mbits/sec
[ 12]  0.0-20.0 sec   511 MBytes   214 Mbits/sec
[SUM]  0.0-20.0 sec  5.44 GBytes  2.33 Gbits/sec

Test 2: with this patch:

[root@brm-snt1-03 net]# iperf -s

Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)

TCP: request_sock_TCP: Possible SYN flooding on port 5001. Sending cookies.
Check SNMP counters.
[  4] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 46876
[  5] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 46874
[  6] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 46872
[  7] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 46880
[  8] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 46878
[  9] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 46884
[ 10] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 46886
[ 11] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 46890
[ 12] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 46888
[ 13] local 16.0.0.7 port 5001 connected with 16.0.0.1

Re: [Intel-wired-lan] [RFC PATCH] i40e: enable PCIe relax ordering for SPARC

2016-12-08 Thread tndave



On 12/08/2016 04:45 PM, tndave wrote:



On 12/08/2016 08:05 AM, Alexander Duyck wrote:

On Thu, Dec 8, 2016 at 2:43 AM, David Laight
<david.lai...@aculab.com> wrote:

From: Alexander Duyck

Sent: 06 December 2016 17:10

...

I was thinking about it and I realized we can probably simplify
this even further.  In the case of most other architectures the
DMA_ATTR_WEAK_ORDERING has no effect anyway.  So from what I can
tell there is probably no reason not to just always pass that
attribute with the DMA mappings.  From what I can tell the only
other architecture that uses this is the PowerPC Cell
architecture.


And I should have read all the thread :-(


Also I was wondering if you actually needed to enable this
attribute for both Rx and Tx buffers or just Rx buffers?  The
patch that enabled DMA_ATTR_WEAK_ORDERING for Sparc64 seems to
call out writes, but I didn't see anything about reads.  I'm just
wondering if changing the code for Tx has any effect?  If not you
could probably drop those changes and just focus on Rx.


'Weak ordering' only applies to PCIe read transfers, so can only
have an effect on descriptor reads and transmit buffer reads.

Basically PCIe is a comms protocol, and an endpoint (or the host)
can have multiple outstanding read requests (each of which might
generate multiple response messages). The responses for each request
must arrive in order, but responses for different requests can be
interleaved. Setting 'not weak ordering' lets the host interwork
with broken endpoints. (Or, like we did, you fix the fpga's PCIe
implementation.)


I get the basics of relaxed ordering.  The question is how does the
Sparc64 IOMMU translate DMA_ATTR_WEAK_ORDERING into relaxed ordering
messages, and at what level the ordering is relaxed.  Odds are the
wording in the description where this attribute was added to Sparc
is just awkward, but I was wanting to verify if this only applies to
writes, or also read completions.

In Sparc64, passing DMA_ATTR_WEAK_ORDERING in dma map/unmap only affects the
PCIe root complex (Hostbridge). Using DMA_ATTR_WEAK_ORDERING, the requested
DMA transaction can be relaxed ordered within the PCIe root complex.

In Sparc64, memory writes can be held at the PCIe root complex, not letting
other memory writes go through. Passing DMA_ATTR_WEAK_ORDERING in
dma map/unmap allows memory writes to bypass other memory writes in the PCIe
root complex. (This applies only to the PCIe root complex and does not take
effect at any other level of the PCIe hierarchy, e.g. PCIe bridges et al.
Also, the PCIe root complex, when bypassing memory writes, does follow the
PCIe relaxed ordering rules as per the PCIe specification.)

For reference [an old but still relevant write-up]: PCI-Express Relaxed
Ordering and the Sun SPARC Enterprise M-class Servers
https://blogs.oracle.com/olympus/entry/relaxed_ordering




In this case you need the reads of both transmit and receive rings
to 'overtake' reads of transmit data.


Actually that isn't quite right.  With relaxed ordering completions
and writes can pass each other if I recall correctly, but reads will
always force all writes ahead of them to be completed before you can
begin generating the read completions.

That is my understanding as well.




I'm not at all clear how this 'flag' can be set on dma_map(). It is
a property of the PCIe subsystem.

Because in Sparc64, passing the DMA_ATTR_WEAK_ORDERING flag in DMA map/unmap
adds an entry in the IOMMU/ATU table so that accesses to the requested DMA
address from the PCIe root complex can be relaxed ordered.


That was where my original question on this came in.  We can do a
blanket enable of relaxed ordering for Tx and Rx data buffers, but
if we only need it on Rx then there isn't any need for us to make
unnecessary changes.

I ran a quick test and it is likely that we don't need
DMA_ATTR_WEAK_ORDERING for any TX dma buffer (because in the case of TX dma
buffers, it's all memory reads from the device).

in the above line, s/from/by

+ cc sparcli...@vger.kernel.org

-Tushar


-Tushar


- Alex





Re: [Intel-wired-lan] [RFC PATCH] i40e: enable PCIe relax ordering for SPARC

2016-12-08 Thread tndave



On 12/08/2016 08:05 AM, Alexander Duyck wrote:

On Thu, Dec 8, 2016 at 2:43 AM, David Laight
 wrote:

From: Alexander Duyck

Sent: 06 December 2016 17:10

...

I was thinking about it and I realized we can probably simplify
this even further.  In the case of most other architectures the
DMA_ATTR_WEAK_ORDERING has no effect anyway.  So from what I can
tell there is probably no reason not to just always pass that
attribute with the DMA mappings.  From what I can tell the only
other architecture that uses this is the PowerPC Cell
architecture.


And I should have read all the thread :-(


Also I was wondering if you actually needed to enable this
attribute for both Rx and Tx buffers or just Rx buffers?  The
patch that enabled DMA_ATTR_WEAK_ORDERING for Sparc64 seems to
call out writes, but I didn't see anything about reads.  I'm just
wondering if changing the code for Tx has any effect?  If not you
could probably drop those changes and just focus on Rx.


'Weak ordering' only applies to PCIe read transfers, so can only
have an effect on descriptor reads and transmit buffer reads.

Basically PCIe is a comms protocol, and an endpoint (or the host)
can have multiple outstanding read requests (each of which might
generate multiple response messages). The responses for each request
must arrive in order, but responses for different requests can be
interleaved. Setting 'not weak ordering' lets the host interwork
with broken endpoints. (Or, like we did, you fix the fpga's PCIe
implementation.)


I get the basics of relaxed ordering.  The question is how does the
Sparc64 IOMMU translate DMA_ATTR_WEAK_ORDERING into relaxed ordering
messages, and at what level the ordering is relaxed.  Odds are the
wording in the description where this attribute was added to Sparc
is just awkward, but I was wanting to verify if this only applies to
writes, or also read completions.
In Sparc64, passing DMA_ATTR_WEAK_ORDERING in dma map/unmap only affects the
PCIe root complex (Hostbridge). Using DMA_ATTR_WEAK_ORDERING, the requested
DMA transaction can be relaxed ordered within the PCIe root complex.

In Sparc64, memory writes can be held at the PCIe root complex, not letting
other memory writes go through. Passing DMA_ATTR_WEAK_ORDERING in
dma map/unmap allows memory writes to bypass other memory writes in the PCIe
root complex. (This applies only to the PCIe root complex and does not take
effect at any other level of the PCIe hierarchy, e.g. PCIe bridges et al.
Also, the PCIe root complex, when bypassing memory writes, does follow the
PCIe relaxed ordering rules as per the PCIe specification.)


For reference [old but still relevant write-up]: PCI-Express Relaxed 
Ordering and the Sun SPARC Enterprise M-class Servers

https://blogs.oracle.com/olympus/entry/relaxed_ordering




In this case you need the reads of both transmit and receive rings
to 'overtake' reads of transmit data.


Actually that isn't quite right.  With relaxed ordering completions
and writes can pass each other if I recall correctly, but reads will
always force all writes ahead of them to be completed before you can
begin generating the read completions.

That is my understanding as well.




I'm not at all clear how this 'flag' can be set on dma_map(). It is
a property of the PCIe subsystem.

Because in Sparc64, passing the DMA_ATTR_WEAK_ORDERING flag in DMA map/unmap
adds an entry in the IOMMU/ATU table so that accesses to the requested DMA
address from the PCIe root complex can be relaxed ordered.


That was where my original question on this came in.  We can do a
blanket enable of relaxed ordering for Tx and Rx data buffers, but
if we only need it on Rx then there isn't any need for us to make
unnecessary changes.

I ran a quick test and it is likely that we don't need
DMA_ATTR_WEAK_ORDERING for any TX dma buffer (because in the case of TX dma
buffers, it's all memory reads from the device).

-Tushar


- Alex



Re: [Intel-wired-lan] [RFC PATCH] i40e: enable PCIe relax ordering for SPARC

2016-12-06 Thread tndave



On 12/06/2016 09:10 AM, Alexander Duyck wrote:

On Mon, Dec 5, 2016 at 2:23 PM, tndave <tushar.n.d...@oracle.com> wrote:



On 12/05/2016 01:54 PM, Alexander Duyck wrote:


On Mon, Dec 5, 2016 at 9:07 AM, Tushar Dave <tushar.n.d...@oracle.com>
wrote:


Unlike previous generation NICs (e.g. ixgbe), i40e doesn't seem to have
a standard CSR where PCIe relaxed ordering can be set. Without PCIe relaxed
ordering enabled, i40e performance is significantly low on SPARC.

This patch sets PCIe relaxed ordering for the SPARC arch by setting the DMA
attribute DMA_ATTR_WEAK_ORDERING for every tx and rx DMA map/unmap.
This has shown a 10x increase in performance numbers.

e.g.
iperf TCP test with 10 threads on SPARC S7

Test 1: Without this patch

# iperf -s

Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)

[  4] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 40926
[  5] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 40934
[  6] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 40930
[  7] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 40928
[  8] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 40922
[  9] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 40932
[ 10] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 40920
[ 11] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 40924
[ 14] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 40982
[ 12] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 40980
[ ID] Interval   Transfer Bandwidth
[  4]  0.0-20.0 sec   566 MBytes   237 Mbits/sec
[  5]  0.0-20.0 sec   532 MBytes   223 Mbits/sec
[  6]  0.0-20.0 sec   537 MBytes   225 Mbits/sec
[  8]  0.0-20.0 sec   546 MBytes   229 Mbits/sec
[ 11]  0.0-20.0 sec   592 MBytes   248 Mbits/sec
[  7]  0.0-20.0 sec   539 MBytes   226 Mbits/sec
[  9]  0.0-20.0 sec   572 MBytes   240 Mbits/sec
[ 10]  0.0-20.0 sec   604 MBytes   253 Mbits/sec
[ 14]  0.0-20.0 sec   567 MBytes   238 Mbits/sec
[ 12]  0.0-20.0 sec   511 MBytes   214 Mbits/sec
[SUM]  0.0-20.0 sec  5.44 GBytes  2.33 Gbits/sec

Test 2: with this patch:

# iperf -s

Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)

TCP: request_sock_TCP: Possible SYN flooding on port 5001. Sending
cookies.  Check SNMP counters.
[  4] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 46876
[  5] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 46874
[  6] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 46872
[  7] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 46880
[  8] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 46878
[  9] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 46884
[ 10] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 46886
[ 11] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 46890
[ 12] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 46888
[ 13] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 46882
[ ID] Interval   Transfer Bandwidth
[  4]  0.0-20.0 sec  7.45 GBytes  3.19 Gbits/sec
[  5]  0.0-20.0 sec  7.48 GBytes  3.21 Gbits/sec
[  7]  0.0-20.0 sec  7.34 GBytes  3.15 Gbits/sec
[  8]  0.0-20.0 sec  7.42 GBytes  3.18 Gbits/sec
[  9]  0.0-20.0 sec  7.24 GBytes  3.11 Gbits/sec
[ 10]  0.0-20.0 sec  7.40 GBytes  3.17 Gbits/sec
[ 12]  0.0-20.0 sec  7.49 GBytes  3.21 Gbits/sec
[  6]  0.0-20.0 sec  7.30 GBytes  3.13 Gbits/sec
[ 11]  0.0-20.0 sec  7.44 GBytes  3.19 Gbits/sec
[ 13]  0.0-20.0 sec  7.22 GBytes  3.10 Gbits/sec
[SUM]  0.0-20.0 sec  73.8 GBytes  31.6 Gbits/sec

NOTE: In my testing, this patch does _not_ show any harm to i40e
performance numbers on x86.

Signed-off-by: Tushar Dave <tushar.n.d...@oracle.com>



You went through and replaced all of the dma_unmap/map_page calls with
dma_map/unmap_single_attrs.  I would prefer you didn't do that.  I have


Yes, because currently there is no DMA API for dma_map/unmap_page with dma
attrs.


patches to add the ability to map and unmap pages with attributes that
should be available for 4.10-rc1 so if you could wait on this patch
until then it would be preferred.


:-) thanks. I will wait until your patches are out.




---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c | 69
-
 drivers/net/ethernet/intel/i40e/i40e_txrx.h |  1 +
 2 files changed, 49 insertions(+), 21 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 6287bf6..800dca7 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -551,15 +551,17 @@ static void i40e_unmap_and_free_tx_resource(struct
i40e_ring *ring,
else
dev_kfree_skb_any(tx_buffer->skb);
if (dma_unm

Re: [Intel-wired-lan] [RFC PATCH] i40e: enable PCIe relax ordering for SPARC

2016-12-05 Thread tndave



On 12/05/2016 01:54 PM, Alexander Duyck wrote:

On Mon, Dec 5, 2016 at 9:07 AM, Tushar Dave  wrote:

Unlike previous generation NICs (e.g. ixgbe), i40e doesn't seem to have
a standard CSR where PCIe relaxed ordering can be set. Without PCIe relaxed
ordering enabled, i40e performance is significantly low on SPARC.

This patch sets PCIe relaxed ordering for the SPARC arch by setting the DMA
attribute DMA_ATTR_WEAK_ORDERING for every tx and rx DMA map/unmap.
This has shown a 10x increase in performance numbers.

e.g.
iperf TCP test with 10 threads on SPARC S7

Test 1: Without this patch

# iperf -s

Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)

[  4] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 40926
[  5] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 40934
[  6] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 40930
[  7] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 40928
[  8] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 40922
[  9] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 40932
[ 10] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 40920
[ 11] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 40924
[ 14] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 40982
[ 12] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 40980
[ ID] Interval   Transfer Bandwidth
[  4]  0.0-20.0 sec   566 MBytes   237 Mbits/sec
[  5]  0.0-20.0 sec   532 MBytes   223 Mbits/sec
[  6]  0.0-20.0 sec   537 MBytes   225 Mbits/sec
[  8]  0.0-20.0 sec   546 MBytes   229 Mbits/sec
[ 11]  0.0-20.0 sec   592 MBytes   248 Mbits/sec
[  7]  0.0-20.0 sec   539 MBytes   226 Mbits/sec
[  9]  0.0-20.0 sec   572 MBytes   240 Mbits/sec
[ 10]  0.0-20.0 sec   604 MBytes   253 Mbits/sec
[ 14]  0.0-20.0 sec   567 MBytes   238 Mbits/sec
[ 12]  0.0-20.0 sec   511 MBytes   214 Mbits/sec
[SUM]  0.0-20.0 sec  5.44 GBytes  2.33 Gbits/sec

Test 2: with this patch:

# iperf -s

Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)

TCP: request_sock_TCP: Possible SYN flooding on port 5001. Sending
cookies.  Check SNMP counters.
[  4] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 46876
[  5] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 46874
[  6] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 46872
[  7] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 46880
[  8] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 46878
[  9] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 46884
[ 10] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 46886
[ 11] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 46890
[ 12] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 46888
[ 13] local 16.0.0.7 port 5001 connected with 16.0.0.1 port 46882
[ ID] Interval   Transfer Bandwidth
[  4]  0.0-20.0 sec  7.45 GBytes  3.19 Gbits/sec
[  5]  0.0-20.0 sec  7.48 GBytes  3.21 Gbits/sec
[  7]  0.0-20.0 sec  7.34 GBytes  3.15 Gbits/sec
[  8]  0.0-20.0 sec  7.42 GBytes  3.18 Gbits/sec
[  9]  0.0-20.0 sec  7.24 GBytes  3.11 Gbits/sec
[ 10]  0.0-20.0 sec  7.40 GBytes  3.17 Gbits/sec
[ 12]  0.0-20.0 sec  7.49 GBytes  3.21 Gbits/sec
[  6]  0.0-20.0 sec  7.30 GBytes  3.13 Gbits/sec
[ 11]  0.0-20.0 sec  7.44 GBytes  3.19 Gbits/sec
[ 13]  0.0-20.0 sec  7.22 GBytes  3.10 Gbits/sec
[SUM]  0.0-20.0 sec  73.8 GBytes  31.6 Gbits/sec

NOTE: In my testing, this patch does _not_ show any harm to i40e
performance numbers on x86.

Signed-off-by: Tushar Dave 


You went through and replaced all of the dma_unmap/map_page calls with
dma_map/unmap_single_attrs.  I would prefer you didn't do that.  I have
Yes, because currently there is no DMA API for dma_map/unmap_page with
dma attrs.

patches to add the ability to map and unmap pages with attributes that
should be available for 4.10-rc1 so if you could wait on this patch
until then it would be preferred.

:-) thanks. I will wait until your patches are out.



---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c | 69 -
 drivers/net/ethernet/intel/i40e/i40e_txrx.h |  1 +
 2 files changed, 49 insertions(+), 21 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c 
b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 6287bf6..800dca7 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -551,15 +551,17 @@ static void i40e_unmap_and_free_tx_resource(struct 
i40e_ring *ring,
else
dev_kfree_skb_any(tx_buffer->skb);
if (dma_unmap_len(tx_buffer, len))
-   dma_unmap_single(ring->dev,
-dma_unmap_addr(tx_buffer, dma),
-

Re: [PATCH 0/2] net: Fix compiler warnings

2016-10-15 Thread tndave



On 10/15/2016 02:48 PM, David Miller wrote:

From: Tushar Dave 
Date: Fri, 14 Oct 2016 17:06:04 -0700


Recently, ATU (iommu) changes were submitted to linux-sparc that
enable 64-bit DMA on SPARC. However, this change also makes
'incompatible pointer type' compiler warnings inevitable in the sunqe
and sunbmac drivers.

The two patches in this series fix the compiler warnings.


Only the sparc tree has this build problem, so these patches
really ought to be submitted for and applied there.

Okay. I will send these to sparclinux then.

Thanks.

-Tushar


Thanks.



Question on i40e PCIe relaxed ordering (RO)

2016-06-29 Thread tndave

Hi,

Running an iperf TCP test on 2 sparc systems with i40e connected back to
back, I see a huge number of 'port.rx_dropped' events (on the iperf server).
Based on past experience with ixgbe, this could very well be because PCIe RO
(relaxed ordering) is not enabled.

I am trying to confirm RO is enabled. The i40e datasheet mentions RO
settings in 3 different sections:

1. Section 10.2.2.2.38 PCIe Global Config 2 - the GLPCI_CNF2 register
contains global status fields of the PCIe configuration. Bit 0 of the
register is "RO_DIS". If this bit is set to 1, RO is disabled.

RO_DIS in my setup is 0, implying RO is not disabled.

2. Section 12.3.5.5 Device Control Register (0xA8; RW) has bit 4,
which enables/disables RO. This is a PCIe capability register.

In my i40e PCIe config space, the value at offset 0xA8 is 0x2110, i.e.
bit 4 (0x10) is set to 1, implying RO is enabled.

3. Section 3.1.2.7.2 mentions some relaxed ordering rules,
e.g. "The GLLAN_RCTL.RXDESCRDROEN bit (loaded from NVM) enables relaxed
ordering for Rx descriptor reads".

However, the GLLAN_RCTL register definition has no bit like RXDESCRDROEN.
The same goes for GLLAN_TCTL.TXDESCRDROEN.

Am I missing anything? please advise.

Thanks.

-Tushar

(Ref:http://www.intel.com/content/dam/www/public/us/en/documents/datasheets/xl710-10-40-controller-datasheet.pdf)





i40e: Errors while turning off TSO

2016-05-16 Thread tndave

On systems with 128 CPUs, turning off TSO results in errors.

Errors:
i40e :03:00.0: failed to get tracking for 1 vectors for VSI 400, err=-12
i40e :03:00.0: Couldn't create FDir VSI
i40e :03:00.0: i40e_ptp_init: PTP not supported on eth0
i40e :03:00.0: couldn't add VEB, err I40E_ERR_ADMIN_QUEUE_ERROR
aq_err I40E_AQ_RC_ENOENT
i40e :03:00.0: rebuild of switch failed: -1, will try to set up
simple PF connection
i40e :03:00.0 eth0: adding 00:10:e0:8a:24:b6 vid=0


From kernel log:
i40e: Intel(R) Ethernet Connection XL710 Network Driver - version 1.4.8-k
i40e: Copyright (c) 2013 - 2014 Intel Corporation.
i40e :03:00.0: fw 4.40.35115 api 1.4 nvm 4.53 0x80001e8c 0.0.0
i40e :03:00.0: MAC address: 00:10:e0:8a:24:b6
i40e :03:00.0: i40e_ptp_init: PTP not supported on eth0
i40e :03:00.0: PCI-Express: Speed 8.0GT/s Width x8
i40e :03:00.0: Features: PF-id[0] VFs: 32 VSIs: 34 QP: 128 RX: 1BUF
RSS FD_ATR VxLAN VEPA

As per the log above, the FD_SB (sideband flow director) feature is not
enabled, because not enough MSI-X vectors are available.
(The device function caps report 129 MSI-X vectors in this case, and the
driver reserved 1 of them for the misc interrupt and the remaining 128 for
128 QPs. So no vector is left for FD_SB, and therefore the driver disables
FD_SB.)

However, turning off TSO invokes i40e_set_ntuple(), which enables FD_SB and
returns true, which in turn issues a reset in i40e_set_features(),
i.e. i40e_do_reset().

Later, during the reset, the driver fails to find an irq for FD_SB in the
irq pile (and it won't, because no irq vector was assigned for FD_SB).
This results in the very first error,
'i40e :03:00.0: failed to get tracking for 1 vectors for VSI 400,
err=-12'

I believe that before enabling FD_SB in i40e_set_ntuple(), the driver should
check whether an MSI-X vector is available for FD_SB.

I am sending a patch in a separate email.

(FWIW, if the number of CPUs is reduced to 64, I don't see the issue
described above, because in that case only 64 of the 129 MSI-X vectors get
assigned to QPs; the remaining ones are used for features like FD_SB.)
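
(A sketch of the kind of guard being proposed; the names are assumptions,
not the actual i40e code:)

/* Sketch: FD_SB needs one MSI-X vector beyond the misc vector and the
 * per-queue-pair vectors.
 */
static bool example_fd_sb_vector_available(unsigned int msix_total,
					   unsigned int num_qps)
{
	return msix_total > num_qps + 1;
}

With 129 vectors and 128 QPs this returns false, matching the failure
above; with 64 QPs it returns true.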

-Tushar


Re: i40e: Kernel unaligned access due to 'struct i40e_dma_mem' being 'packed'

2016-02-13 Thread tndave



On 01/30/2016 04:17 AM, tndave wrote:



On 01/27/2016 10:56 PM, David Miller wrote:

From: tndave <tushar.n.d...@oracle.com>
Date: Wed, 27 Jan 2016 17:50:14 -0800


Hi,

i40e driver has 'struct i40e_dma_mem' defined with the 'packed' directive,
causing kernel unaligned errors on sparc (when
i40e_allocate_dma_mem_d()
is being called)

log_unaligned: 1031 callbacks suppressed
Kernel unaligned access at TPC[448ae8]
dma_4v_alloc_coherent+0x188/0x2e0
Kernel unaligned access at TPC[448ae8]
dma_4v_alloc_coherent+0x188/0x2e0
Kernel unaligned access at TPC[448ae8]
dma_4v_alloc_coherent+0x188/0x2e0
Kernel unaligned access at TPC[448ae8]
dma_4v_alloc_coherent+0x188/0x2e0

This can be fixed with get_unaligned()/put_unaligned(). However, I don't
see 'struct i40e_dma_mem' being directly shoved into the NIC hardware;
instead, fields of the struct are being read and used for the hardware
(e.g. dma_addr_t pa). As a test, I removed __packed, and the i40e driver
and HW work fine (of course the kernel unaligned errors are gone too).
My question is, is 'struct i40e_dma_mem' required to be __packed?


People get overzealous with __packed.

And even if it doesn't cause unaligned accesses like this, it generates
terrible code (byte at a time accesses to words) on several
architectures.

True. For the same reason, I want to clarify whether __packed is actually
needed, instead of fixing it with get_unaligned()/put_unaligned().
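
To illustrate the trade-off (assuming a layout like i40e's; this is not the
actual i40e_dma_mem definition): a packed struct can leave a u64 member
misaligned, and then every access either traps on sparc or must go through
the unaligned helpers.

#include <asm/unaligned.h>
#include <linux/types.h>

/* Illustration only, not the real i40e_dma_mem layout. */
struct example_packed {
	u8  flag;
	u64 pa;			/* at offset 1 when packed: misaligned */
} __packed;

static u64 example_read_pa(const struct example_packed *p)
{
	return get_unaligned(&p->pa);	/* safe, but slower than an aligned load */
}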


We are having this issue on multiple sparc servers. It would be really
helpful to have feedback from i40e driver folks.

Thanks.

-Tushar


Re: i40e: Kernel unaligned access due to 'struct i40e_dma_mem' being 'packed'

2016-01-29 Thread tndave



On 01/27/2016 10:56 PM, David Miller wrote:

From: tndave <tushar.n.d...@oracle.com>
Date: Wed, 27 Jan 2016 17:50:14 -0800


Hi,

i40e driver has 'struct i40e_dma_mem' defined with the 'packed' directive,
causing kernel unaligned errors on sparc (when
i40e_allocate_dma_mem_d()
is being called)

log_unaligned: 1031 callbacks suppressed
Kernel unaligned access at TPC[448ae8]
dma_4v_alloc_coherent+0x188/0x2e0
Kernel unaligned access at TPC[448ae8]
dma_4v_alloc_coherent+0x188/0x2e0
Kernel unaligned access at TPC[448ae8]
dma_4v_alloc_coherent+0x188/0x2e0
Kernel unaligned access at TPC[448ae8]
dma_4v_alloc_coherent+0x188/0x2e0

This can be fixed with get_unaligned()/put_unaligned(). However, I don't
see 'struct i40e_dma_mem' being directly shoved into the NIC hardware;
instead, fields of the struct are being read and used for the hardware
(e.g. dma_addr_t pa). As a test, I removed __packed, and the i40e driver
and HW work fine (of course the kernel unaligned errors are gone too).
My question is, is 'struct i40e_dma_mem' required to be __packed?


People get overzealous with __packed.

And even if it doesn't cause unaligned accesses like this, it generates
terrible code (byte at a time accesses to words) on several architectures.
True. For the same reason, I want to clarify whether __packed is actually
needed, instead of fixing it with get_unaligned()/put_unaligned().


-Tushar




i40e: Kernel unaligned access due to 'struct i40e_dma_mem' being 'packed'

2016-01-27 Thread tndave

Hi,

i40e driver has 'struct i40e_dma_mem' defined with the 'packed' directive,
causing kernel unaligned errors on sparc (when i40e_allocate_dma_mem_d()
is being called)

log_unaligned: 1031 callbacks suppressed
Kernel unaligned access at TPC[448ae8] dma_4v_alloc_coherent+0x188/0x2e0
Kernel unaligned access at TPC[448ae8] dma_4v_alloc_coherent+0x188/0x2e0
Kernel unaligned access at TPC[448ae8] dma_4v_alloc_coherent+0x188/0x2e0
Kernel unaligned access at TPC[448ae8] dma_4v_alloc_coherent+0x188/0x2e0

This can be fixed with get_unaligned()/put_unaligned(). However, I don't
see 'struct i40e_dma_mem' being directly shoved into the NIC hardware;
instead, fields of the struct are being read and used for the hardware
(e.g. dma_addr_t pa). As a test, I removed __packed, and the i40e driver
and HW work fine (of course the kernel unaligned errors are gone too).
My question is, is 'struct i40e_dma_mem' required to be __packed?

Thanks.

-Tushar