Re: [PATCH net-next v2 4/4] net: mvneta: Spread out the TX queues management on all CPUs
Hi Gregory,

> @@ -1824,13 +1835,16 @@ error:
>  static int mvneta_tx(struct sk_buff *skb, struct net_device *dev)
>  {
>  	struct mvneta_port *pp = netdev_priv(dev);
> -	u16 txq_id = skb_get_queue_mapping(skb);
> +	u16 txq_id = smp_processor_id() % txq_number;

I think it may be OK to bind TXQs to different CPUs, but I don't think
that replacing skb_get_queue_mapping() with what is in effect
smp_processor_id() is the best idea. This way you use only 2 TXQs on
A385 and 4 TXQs on AXP. There are HW mechanisms like WRR or EJP that
provide balancing for egress, so let's better keep all 8.

As a compromise I think it's enough to do the mapping; we would achieve
some offload by having TX processing done on different CPUs, and let BQL
do the balancing at a higher level. FYI, I've already implemented BQL
and will submit it ASAP; however, I still have some weird problems after
enabling it.

Best regards,
Marcin
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-next v2 4/4] net: mvneta: Spread out the TX queues management on all CPUs
From: Marcin Wojtas
Date: Sat, 5 Dec 2015 20:14:31 +0100

> Hi Gregory,
>
>> @@ -1824,13 +1835,16 @@ error:
>>  static int mvneta_tx(struct sk_buff *skb, struct net_device *dev)
>>  {
>>  	struct mvneta_port *pp = netdev_priv(dev);
>> -	u16 txq_id = skb_get_queue_mapping(skb);
>> +	u16 txq_id = smp_processor_id() % txq_number;
>
> I think it may be OK to bind TXQs to different CPUs, but I don't think
> that replacing skb_get_queue_mapping() with what is in effect
> smp_processor_id() is the best idea. This way you use only 2 TXQs on
> A385 and 4 TXQs on AXP. There are HW mechanisms like WRR or EJP that
> provide balancing for egress, so let's better keep all 8.

Also, it is possible for other parts of the stack to set the SKB queue
mapping, and you must respect that setting rather than override it.
Re: [PATCH net-next v2 4/4] net: mvneta: Spread out the TX queues management on all CPUs
On Friday 04 December 2015 11:12:30 Eric Dumazet wrote:
> On Fri, 2015-12-04 at 19:45 +0100, Gregory CLEMENT wrote:
> > With this patch each CPU is associated with its own set of TX queues. At
> > the same time the SKB received in mvneta_tx() is bound to the queue
> > associated with the CPU sending the data. Thanks to this, the next IRQ
> > will be received on the same CPU, allowing more data to be sent.
> >
> > It will also allow for more predictable behavior regarding
> > throughput and latency when multiple threads send out data on
> > different CPUs.
> >
> > As an example on Armada XP GP, with an iperf bound to a CPU and a ping
> > bound to another CPU, without this patch the ping round trip was about
> > 2.5ms (and could reach 3s!), whereas with this patch it was around
> > 0.7ms (and sometimes it went to 1.2ms).
>
> This really looks like you need something smarter than the pfifo_fast
> qdisc, and maybe BQL (I did not check if this driver already implements
> this).

I suggested this change, as well as the BQL implementation that Marcin
did. I believe he hasn't posted that yet while he's doing some more
testing, but it should come soon.

	Arnd
[PATCH net-next v2 4/4] net: mvneta: Spread out the TX queues management on all CPUs
With this patch each CPU is associated with its own set of TX queues. At
the same time the SKB received in mvneta_tx() is bound to the queue
associated with the CPU sending the data. Thanks to this, the next IRQ
will be received on the same CPU, allowing more data to be sent.

It will also allow for more predictable behavior regarding throughput
and latency when multiple threads send out data on different CPUs.

As an example on Armada XP GP, with an iperf bound to a CPU and a ping
bound to another CPU, without this patch the ping round trip was about
2.5ms (and could reach 3s!), whereas with this patch it was around
0.7ms (and sometimes it went to 1.2ms).

Suggested-by: Arnd Bergmann
Signed-off-by: Gregory CLEMENT
---
 drivers/net/ethernet/marvell/mvneta.c | 48 ++++++++++++++++++++++++++++++++++++------------
 1 file changed, 36 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index e0dba6869605..bb5e29daac0b 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -110,6 +110,7 @@
 #define MVNETA_CPU_RXQ_ACCESS_ALL_MASK	0x00ff
 #define MVNETA_CPU_TXQ_ACCESS_ALL_MASK	0xff00
 #define MVNETA_CPU_RXQ_ACCESS(rxq)	BIT(rxq)
+#define MVNETA_CPU_TXQ_ACCESS(txq)	BIT(txq + 8)
 #define MVNETA_RXQ_TIME_COAL_REG(q)	(0x2580 + ((q) << 2))
 
 /* Exception Interrupt Port/Queue Cause register
@@ -1022,20 +1023,30 @@ static void mvneta_defaults_set(struct mvneta_port *pp)
 	/* Enable MBUS Retry bit16 */
 	mvreg_write(pp, MVNETA_MBUS_RETRY, 0x20);
 
-	/* Set CPU queue access map. CPUs are assigned to the RX
-	 * queues modulo their number and all the TX queues are
-	 * assigned to the CPU associated to the default RX queue.
+	/* Set CPU queue access map. CPUs are assigned to the RX and
+	 * TX queues modulo their number. If there is only one TX
+	 * queue then it is assigned to the CPU associated to the
+	 * default RX queue.
 	 */
 	for_each_present_cpu(cpu) {
 		int rxq_map = 0, txq_map = 0;
-		int rxq;
+		int rxq, txq;
 
 		for (rxq = 0; rxq < rxq_number; rxq++)
 			if ((rxq % max_cpu) == cpu)
 				rxq_map |= MVNETA_CPU_RXQ_ACCESS(rxq);
 
-		if (cpu == pp->rxq_def)
-			txq_map = MVNETA_CPU_TXQ_ACCESS_ALL_MASK;
+		for (txq = 0; txq < txq_number; txq++)
+			if ((txq % max_cpu) == cpu)
+				txq_map |= MVNETA_CPU_TXQ_ACCESS(txq);
+
+		/* With only one TX queue we configure a special case
+		 * which will allow to get all the irq on a single
+		 * CPU
+		 */
+		if (txq_number == 1)
+			txq_map = (cpu == pp->rxq_def) ?
+				MVNETA_CPU_TXQ_ACCESS(1) : 0;
 
 		mvreg_write(pp, MVNETA_CPU_MAP(cpu), rxq_map | txq_map);
 	}
@@ -1824,13 +1835,16 @@ error:
 static int mvneta_tx(struct sk_buff *skb, struct net_device *dev)
 {
 	struct mvneta_port *pp = netdev_priv(dev);
-	u16 txq_id = skb_get_queue_mapping(skb);
+	u16 txq_id = smp_processor_id() % txq_number;
 	struct mvneta_tx_queue *txq = &pp->txqs[txq_id];
 	struct mvneta_tx_desc *tx_desc;
 	int len = skb->len;
 	int frags = 0;
 	u32 tx_cmd;
 
+	/* Use the tx queue bound to this CPU */
+	skb_set_queue_mapping(skb, txq_id);
+
 	if (!netif_running(dev))
 		goto out;
 
@@ -2811,13 +2825,23 @@ static void mvneta_percpu_elect(struct mvneta_port *pp)
 			if ((rxq % max_cpu) == cpu)
 				rxq_map |= MVNETA_CPU_RXQ_ACCESS(rxq);
 
-		if (i == online_cpu_idx) {
-			/* Map the default receive queue and transmit
-			 * queue to the elected CPU
+		if (i == online_cpu_idx)
+			/* Map the default receive queue to the
+			 * elected CPU
 			 */
 			rxq_map |= MVNETA_CPU_RXQ_ACCESS(pp->rxq_def);
-			txq_map = MVNETA_CPU_TXQ_ACCESS_ALL_MASK;
-		}
+
+		/* We update the TX queue map only if we have one
+		 * queue. In this case we associate the TX queue to
+		 * the CPU bound to the default RX queue
+		 */
+		if (txq_number == 1)
+			txq_map = (i == online_cpu_idx) ?
+				MVNETA_CPU_TXQ_ACCESS(1) : 0;
+		else
+			txq_map = mvreg_read(pp, MVNETA_CPU_MAP(cpu)) &
+				MVNETA_CPU_TXQ_ACCESS_ALL_MASK;
+
Re: [PATCH net-next v2 4/4] net: mvneta: Spread out the TX queues management on all CPUs
On Fri, 2015-12-04 at 19:45 +0100, Gregory CLEMENT wrote:
> With this patch each CPU is associated with its own set of TX queues. At
> the same time the SKB received in mvneta_tx() is bound to the queue
> associated with the CPU sending the data. Thanks to this, the next IRQ
> will be received on the same CPU, allowing more data to be sent.
>
> It will also allow for more predictable behavior regarding
> throughput and latency when multiple threads send out data on
> different CPUs.
>
> As an example on Armada XP GP, with an iperf bound to a CPU and a ping
> bound to another CPU, without this patch the ping round trip was about
> 2.5ms (and could reach 3s!), whereas with this patch it was around
> 0.7ms (and sometimes it went to 1.2ms).

This really looks like you need something smarter than the pfifo_fast
qdisc, and maybe BQL (I did not check if this driver already implements
this).

> Suggested-by: Arnd Bergmann
> Signed-off-by: Gregory CLEMENT

...

> @@ -1824,13 +1835,16 @@ error:
>  static int mvneta_tx(struct sk_buff *skb, struct net_device *dev)
>  {
>  	struct mvneta_port *pp = netdev_priv(dev);
> -	u16 txq_id = skb_get_queue_mapping(skb);
> +	u16 txq_id = smp_processor_id() % txq_number;
>  	struct mvneta_tx_queue *txq = &pp->txqs[txq_id];
>  	struct mvneta_tx_desc *tx_desc;
>  	int len = skb->len;
>  	int frags = 0;
>  	u32 tx_cmd;
>
> +	/* Use the tx queue bound to this CPU */
> +	skb_set_queue_mapping(skb, txq_id);
> +

We certainly do not want every driver implementing its own hacks. We
have a standard way to handle this: it is called XPS, and eventually
ndo_select_queue(). Documentation/networking/scaling.txt contains some
hints.