Re: [PATCH net-next v2 4/4] net: mvneta: Spread out the TX queues management on all CPUs

2015-12-05 Thread Marcin Wojtas
Hi Gregory,

> @@ -1824,13 +1835,16 @@ error:
>  static int mvneta_tx(struct sk_buff *skb, struct net_device *dev)
>  {
> struct mvneta_port *pp = netdev_priv(dev);
> -   u16 txq_id = skb_get_queue_mapping(skb);
> +   u16 txq_id = smp_processor_id() % txq_number;

I think it may be OK to bind TXQs to different CPUs, but I don't think
that replacing skb_get_queue_mapping() with what is effectively
smp_processor_id() is the best idea. This way you use only 2 TXQs on an
A385 and 4 TXQs on an AXP. There are HW mechanisms like WRR or EJP that
provide balancing for egress, so it would be better to keep all 8.

As a compromise, I think it's enough to do the mapping; we would
achieve some offload by having TX processing done on different CPUs,
and let BQL do the balancing at a higher level. FYI, I've already
implemented BQL and will submit it ASAP, however I still have some
weird problems after enabling it.

Best regards,
Marcin
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net-next v2 4/4] net: mvneta: Spread out the TX queues management on all CPUs

2015-12-05 Thread David Miller
From: Marcin Wojtas 
Date: Sat, 5 Dec 2015 20:14:31 +0100

> Hi Gregory,
> 
>> @@ -1824,13 +1835,16 @@ error:
>>  static int mvneta_tx(struct sk_buff *skb, struct net_device *dev)
>>  {
>> struct mvneta_port *pp = netdev_priv(dev);
>> -   u16 txq_id = skb_get_queue_mapping(skb);
>> +   u16 txq_id = smp_processor_id() % txq_number;
> 
> I think it may be OK to bind TXQs to different CPUs, but I don't think
> that replacing skb_get_queue_mapping() with what is effectively
> smp_processor_id() is the best idea. This way you use only 2 TXQs on an
> A385 and 4 TXQs on an AXP. There are HW mechanisms like WRR or EJP that
> provide balancing for egress, so it would be better to keep all 8.

Also it is possible for other parts of the stack to set the SKB queue
mapping and you must respect that setting rather than override it.


Re: [PATCH net-next v2 4/4] net: mvneta: Spread out the TX queues management on all CPUs

2015-12-04 Thread Arnd Bergmann
On Friday 04 December 2015 11:12:30 Eric Dumazet wrote:
> On Fri, 2015-12-04 at 19:45 +0100, Gregory CLEMENT wrote:
> > With this patch each CPU is associated with its own set of TX queues.
> > At the same time, the SKB received in mvneta_tx is bound to the queue
> > associated with the CPU sending the data. Thanks to this, the next IRQ
> > will be received on the same CPU, allowing more data to be sent.
> > 
> > It will also allow more predictable behavior regarding throughput and
> > latency when multiple threads send out data on different CPUs.
> > 
> > As an example on Armada XP GP, with an iperf bound to one CPU and a
> > ping bound to another CPU, without this patch the ping round trip was
> > about 2.5ms (and could reach 3s!), whereas with this patch it was
> > around 0.7ms (and sometimes went up to 1.2ms).
> 
> This really looks like you need something smarter than pfifo_fast qdisc,
> and maybe BQL (I did not check if this driver already implements this)

I suggested this change as well as the BQL implementation that Marcin did.
I believe he hasn't posted that yet while he's doing some more testing,
but it should come soon.

Arnd


[PATCH net-next v2 4/4] net: mvneta: Spread out the TX queues management on all CPUs

2015-12-04 Thread Gregory CLEMENT
With this patch each CPU is associated with its own set of TX queues.
At the same time, the SKB received in mvneta_tx is bound to the queue
associated with the CPU sending the data. Thanks to this, the next IRQ
will be received on the same CPU, allowing more data to be sent.

It will also allow more predictable behavior regarding throughput and
latency when multiple threads send out data on different CPUs.

As an example on Armada XP GP, with an iperf bound to one CPU and a
ping bound to another CPU, without this patch the ping round trip was
about 2.5ms (and could reach 3s!), whereas with this patch it was
around 0.7ms (and sometimes went up to 1.2ms).

Suggested-by: Arnd Bergmann 
Signed-off-by: Gregory CLEMENT 
---
 drivers/net/ethernet/marvell/mvneta.c | 48 ++++++++++++++++++++++++++++++++++++------------
 1 file changed, 36 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c 
b/drivers/net/ethernet/marvell/mvneta.c
index e0dba6869605..bb5e29daac0b 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -110,6 +110,7 @@
 #define  MVNETA_CPU_RXQ_ACCESS_ALL_MASK  0x00ff
 #define  MVNETA_CPU_TXQ_ACCESS_ALL_MASK  0xff00
 #define  MVNETA_CPU_RXQ_ACCESS(rxq) BIT(rxq)
+#define  MVNETA_CPU_TXQ_ACCESS(txq) BIT(txq + 8)
 #define MVNETA_RXQ_TIME_COAL_REG(q)  (0x2580 + ((q) << 2))
 
 /* Exception Interrupt Port/Queue Cause register
@@ -1022,20 +1023,30 @@ static void mvneta_defaults_set(struct mvneta_port *pp)
/* Enable MBUS Retry bit16 */
mvreg_write(pp, MVNETA_MBUS_RETRY, 0x20);
 
-   /* Set CPU queue access map. CPUs are assigned to the RX
-* queues modulo their number and all the TX queues are
-* assigned to the CPU associated to the default RX queue.
+   /* Set CPU queue access map. CPUs are assigned to the RX and
+* TX queues modulo their number. If there is only one TX
+* queue then it is assigned to the CPU associated to the
+* default RX queue.
 */
for_each_present_cpu(cpu) {
int rxq_map = 0, txq_map = 0;
-   int rxq;
+   int rxq, txq;
 
for (rxq = 0; rxq < rxq_number; rxq++)
if ((rxq % max_cpu) == cpu)
rxq_map |= MVNETA_CPU_RXQ_ACCESS(rxq);
 
-   if (cpu == pp->rxq_def)
-   txq_map = MVNETA_CPU_TXQ_ACCESS_ALL_MASK;
+   for (txq = 0; txq < txq_number; txq++)
+   if ((txq % max_cpu) == cpu)
+   txq_map |= MVNETA_CPU_TXQ_ACCESS(txq);
+
+   /* With only one TX queue we configure a special case
+* which will allow to get all the irq on a single
+* CPU
+*/
+   if (txq_number == 1)
+   txq_map = (cpu == pp->rxq_def) ?
+   MVNETA_CPU_TXQ_ACCESS(1) : 0;
 
mvreg_write(pp, MVNETA_CPU_MAP(cpu), rxq_map | txq_map);
}
@@ -1824,13 +1835,16 @@ error:
 static int mvneta_tx(struct sk_buff *skb, struct net_device *dev)
 {
struct mvneta_port *pp = netdev_priv(dev);
-   u16 txq_id = skb_get_queue_mapping(skb);
+   u16 txq_id = smp_processor_id() % txq_number;
	struct mvneta_tx_queue *txq = &pp->txqs[txq_id];
struct mvneta_tx_desc *tx_desc;
int len = skb->len;
int frags = 0;
u32 tx_cmd;
 
+   /* Use the tx queue bound to this CPU */
+   skb_set_queue_mapping(skb, txq_id);
+
if (!netif_running(dev))
goto out;
 
@@ -2811,13 +2825,23 @@ static void mvneta_percpu_elect(struct mvneta_port *pp)
if ((rxq % max_cpu) == cpu)
rxq_map |= MVNETA_CPU_RXQ_ACCESS(rxq);
 
-   if (i == online_cpu_idx) {
-   /* Map the default receive queue and transmit
-* queue to the elected CPU
+   if (i == online_cpu_idx)
+   /* Map the default receive queue queue to the
+* elected CPU
 */
rxq_map |= MVNETA_CPU_RXQ_ACCESS(pp->rxq_def);
-   txq_map = MVNETA_CPU_TXQ_ACCESS_ALL_MASK;
-   }
+
+   /* We update the TX queue map only if we have one
+* queue. In this case we associate the TX queue to
+* the CPU bound to the default RX queue
+*/
+   if (txq_number == 1)
+   txq_map = (i == online_cpu_idx) ?
+   MVNETA_CPU_TXQ_ACCESS(1) : 0;
+   else
+   txq_map = mvreg_read(pp, MVNETA_CPU_MAP(cpu)) &
+   MVNETA_CPU_TXQ_ACCESS_ALL_MASK;
+
  

Re: [PATCH net-next v2 4/4] net: mvneta: Spread out the TX queues management on all CPUs

2015-12-04 Thread Eric Dumazet
On Fri, 2015-12-04 at 19:45 +0100, Gregory CLEMENT wrote:
> With this patch each CPU is associated with its own set of TX queues.
> At the same time, the SKB received in mvneta_tx is bound to the queue
> associated with the CPU sending the data. Thanks to this, the next IRQ
> will be received on the same CPU, allowing more data to be sent.
> 
> It will also allow more predictable behavior regarding throughput and
> latency when multiple threads send out data on different CPUs.
> 
> As an example on Armada XP GP, with an iperf bound to one CPU and a
> ping bound to another CPU, without this patch the ping round trip was
> about 2.5ms (and could reach 3s!), whereas with this patch it was
> around 0.7ms (and sometimes went up to 1.2ms).

This really looks like you need something smarter than pfifo_fast qdisc,
and maybe BQL (I did not check if this driver already implements this)
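[Editor's note: a smarter qdisc can be tried from user space without driver changes; a hedged sketch using the standard tc tool, where the device name eth0 is an assumption.]

```shell
# Keep the mq root so each HW TX queue keeps its own child qdisc,
# then replace the default pfifo_fast children with fq_codel.
tc qdisc replace dev eth0 root handle 1: mq
tc qdisc replace dev eth0 parent 1:1 fq_codel
tc qdisc replace dev eth0 parent 1:2 fq_codel
```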

> 
> Suggested-by: Arnd Bergmann 
> Signed-off-by: Gregory CLEMENT 

...

> @@ -1824,13 +1835,16 @@ error:
>  static int mvneta_tx(struct sk_buff *skb, struct net_device *dev)
>  {
>   struct mvneta_port *pp = netdev_priv(dev);
> - u16 txq_id = skb_get_queue_mapping(skb);
> + u16 txq_id = smp_processor_id() % txq_number;
>   struct mvneta_tx_queue *txq = &pp->txqs[txq_id];
>   struct mvneta_tx_desc *tx_desc;
>   int len = skb->len;
>   int frags = 0;
>   u32 tx_cmd;
>  
> + /* Use the tx queue bound to this CPU */
> + skb_set_queue_mapping(skb, txq_id);
> +


We certainly do not want every driver implementing its own hacks.

We have a standard way to handle this: it is called XPS, and
eventually ndo_select_queue().

Documentation/networking/scaling.txt contains some hints.
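[Editor's note: XPS, as described in Documentation/networking/scaling.txt, is configured per TX queue through sysfs; a hedged example in which the interface name and CPU masks are assumptions.]

```shell
# XPS: steer transmissions from CPU 0 to tx-0 and CPU 1 to tx-1.
# The value written is a hex bitmap of CPUs allowed to use that queue.
echo 1 > /sys/class/net/eth0/queues/tx-0/xps_cpus   # CPU 0 -> tx-0
echo 2 > /sys/class/net/eth0/queues/tx-1/xps_cpus   # CPU 1 -> tx-1
```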


