[PATCH net] net/dim: Update DIM start sample after each DIM iteration

2018-11-21 Thread Tal Gilboa
On every iteration of net_dim, the algorithm may choose to
check for the system state by comparing current data sample
with previous data sample. After each of these comparison,
regardless of the action taken, the sample used as baseline
is needed to be updated.

This patch fixes a bug that causes DIM to take wrong decisions,
due to never updating the baseline sample for comparison between
iterations. This way, DIM always compares current sample with
zeros.

Although this is a functional fix, it also improves and stabilizes
performance as the algorithm works properly now.

Performance:
Tested single UDP TX stream with pktgen:
samples/pktgen/pktgen_sample03_burst_single_flow.sh -i p4p2 -d 1.1.1.1
-m 24:8a:07:88:26:8b -f 3 -b 128

ConnectX-5 100GbE packet rate improved from 15-19Mpps to 19-20Mpps.
Also, toggling between profiles is less frequent with the fix.

Fixes: 8115b750dbcb ("net/dim: use struct net_dim_sample as arg to net_dim")
Signed-off-by: Tal Gilboa 
Reviewed-by: Tariq Toukan 
---
 include/linux/net_dim.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/net_dim.h b/include/linux/net_dim.h
index c79e859..fd45838 100644
--- a/include/linux/net_dim.h
+++ b/include/linux/net_dim.h
@@ -406,6 +406,8 @@ static inline void net_dim(struct net_dim *dim,
}
/* fall through */
case NET_DIM_START_MEASURE:
+   net_dim_sample(end_sample.event_ctr, end_sample.pkt_ctr, 
end_sample.byte_ctr,
+  >start_sample);
dim->state = NET_DIM_MEASURE_IN_PROGRESS;
break;
case NET_DIM_APPLY_NEW_PROFILE:
-- 
1.8.3.1



Re: [PATCH net-next v2 1/2] net: stmmac: Rework coalesce timer and fix multi-queue races

2018-09-18 Thread Tal Gilboa

On 9/10/2018 10:22 PM, Florian Fainelli wrote:

On 09/10/2018 02:14 AM, Jose Abreu wrote:

This follows David Miller advice and tries to fix coalesce timer in
multi-queue scenarios.

We are now using per-queue coalesce values and per-queue TX timer.

Coalesce timer default values was changed to 1ms and the coalesce frames
to 25.

Tested in B2B setup between XGMAC2 and GMAC5.


Why not revert the entire features for this merge window and work on
getting it to work over the next weeks/merge windows?

The idea of using a timer to coalesce TX path when there is not a HW
timer is a good idea and if this is made robust enough, you could even
promote that as being a network stack library/feature that could be used
by other drivers. In fact, this could be a great addition to the net DIM
library (Tal, what do you think?)


Not sure it would be a natural fit. DIM doesn't know/care which type of 
timer is used. Maybe the two mechanisms can work together with DIM 
optimizes the number of frames to use for optimal coalescing. As the 
timer is set to 1ms, DIM would have issues suggesting meaningful values 
as it is designed for typical values of 10s on us.




Here's a quick drive by review of things that appear wrong in the
current driver (without your patches):

- in stmmac_xmit(), in case we hit the !is_jumbo branch and we fail the
DMA mapping, there is no timer cancellation, don't we want to abort the
whole transmission?

- stmmac_tx_clean() should probably use netif_lock_bh() to guard against
the timer (soft IRQ context) and the the NAPI context (also soft IRQ)
running in parallel on two different CPUs. This may not explain all
problems, but these two things are fundamentally exclusive, because the
timer is meant to emulate the interrupt after N packets, while NAPI
executes when such a thing did actually occur

- stmmac_poll() should cancel pending timer(s) if it was able to reclaim
packets, likewise stmmac_tx_timer() should re-enable TX interrupts if it
reclaimed packets, since TX interrupts could have been left disabled
from a prior NAPI run. These could be considered optimizations, since
you could leave the TX timer running all the time, just adjust the
deadline (based on line rate, MTU, IPG, number of fragments and their
respective length), worst case, both NAPI and the timer clean up your TX
ring, so you should always have room to push more packets



Signed-off-by: Jose Abreu 
Cc: Jerome Brunet 
Cc: Martin Blumenstingl 
Cc: David S. Miller 
Cc: Joao Pinto 
Cc: Giuseppe Cavallaro 
Cc: Alexandre Torgue 
---
Jerome,

Can you please test if this one is okay ?

Thanks and Best Regards,
Jose Miguel Abreu
---
  drivers/net/ethernet/stmicro/stmmac/common.h  |   4 +-
  drivers/net/ethernet/stmicro/stmmac/stmmac.h  |   6 +-
  drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 207 ++
  3 files changed, 135 insertions(+), 82 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/common.h 
b/drivers/net/ethernet/stmicro/stmmac/common.h
index 1854f270ad66..b1b305f8f414 100644
--- a/drivers/net/ethernet/stmicro/stmmac/common.h
+++ b/drivers/net/ethernet/stmicro/stmmac/common.h
@@ -258,10 +258,10 @@ struct stmmac_safety_stats {
  #define MAX_DMA_RIWT  0xff
  #define MIN_DMA_RIWT  0x20
  /* Tx coalesce parameters */
-#define STMMAC_COAL_TX_TIMER   4
+#define STMMAC_COAL_TX_TIMER   1000
  #define STMMAC_MAX_COAL_TX_TICK   10
  #define STMMAC_TX_MAX_FRAMES  256
-#define STMMAC_TX_FRAMES   64
+#define STMMAC_TX_FRAMES   25
  
  /* Packets types */

  enum packets_types {
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac.h 
b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
index c0a855b7ab3b..957030cfb833 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac.h
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
@@ -48,6 +48,9 @@ struct stmmac_tx_info {
  
  /* Frequently used values are kept adjacent for cache effect */

  struct stmmac_tx_queue {
+   u32 tx_count_frames;
+   int tx_timer_active;
+   struct timer_list txtimer;
u32 queue_index;
struct stmmac_priv *priv_data;
struct dma_extended_desc *dma_etx cacheline_aligned_in_smp;
@@ -59,6 +62,7 @@ struct stmmac_tx_queue {
dma_addr_t dma_tx_phy;
u32 tx_tail_addr;
u32 mss;
+   struct napi_struct napi cacheline_aligned_in_smp;
  };
  
  struct stmmac_rx_queue {

@@ -109,14 +113,12 @@ struct stmmac_pps_cfg {
  
  struct stmmac_priv {

/* Frequently used values are kept adjacent for cache effect */
-   u32 tx_count_frames;
u32 tx_coal_frames;
u32 tx_coal_timer;
  
  	int tx_coalesce;

int hwts_tx_en;
bool tx_path_in_lpi_mode;
-   struct timer_list txtimer;
bool tso;
  
  	unsigned int dma_buf_sz;

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c 
b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 9f458bb16f2a..9809c2b319fe 100644
--- 

[PATCH net-next V3 2/3] net/dim: Support adaptive TX moderation

2018-04-24 Thread Tal Gilboa
Interrupt moderation for TX traffic requires different profiles than RX
interrupt moderation. The main goal here is to reduce interrupt rate and
allow better payload aggregation by keeping SKBs in the TX queue a bit
longer. Ping-pong behavior would get a profile with a short timer, so
latency wouldn't increase for these scenarios. There might be a slight
degradation in bandwidth for single stream with large message sizes, since
net.ipv4.tcp_limit_output_bytes is limiting the allowed TX traffic, but
with many streams performance is always improved.

Signed-off-by: Tal Gilboa <ta...@mellanox.com>
Reviewed-by: Tariq Toukan <tar...@mellanox.com>
---
 include/linux/net_dim.h | 63 +++--
 1 file changed, 50 insertions(+), 13 deletions(-)

diff --git a/include/linux/net_dim.h b/include/linux/net_dim.h
index 7ca3c4d..db99240 100644
--- a/include/linux/net_dim.h
+++ b/include/linux/net_dim.h
@@ -103,11 +103,12 @@ enum {
 #define NET_DIM_PARAMS_NUM_PROFILES 5
 /* Adaptive moderation profiles */
 #define NET_DIM_DEFAULT_RX_CQ_MODERATION_PKTS_FROM_EQE 256
+#define NET_DIM_DEFAULT_TX_CQ_MODERATION_PKTS_FROM_EQE 128
 #define NET_DIM_DEF_PROFILE_CQE 1
 #define NET_DIM_DEF_PROFILE_EQE 1
 
 /* All profiles sizes must be NET_PARAMS_DIM_NUM_PROFILES */
-#define NET_DIM_EQE_PROFILES { \
+#define NET_DIM_RX_EQE_PROFILES { \
{1,   NET_DIM_DEFAULT_RX_CQ_MODERATION_PKTS_FROM_EQE}, \
{8,   NET_DIM_DEFAULT_RX_CQ_MODERATION_PKTS_FROM_EQE}, \
{64,  NET_DIM_DEFAULT_RX_CQ_MODERATION_PKTS_FROM_EQE}, \
@@ -115,7 +116,7 @@ enum {
{256, NET_DIM_DEFAULT_RX_CQ_MODERATION_PKTS_FROM_EQE}, \
 }
 
-#define NET_DIM_CQE_PROFILES { \
+#define NET_DIM_RX_CQE_PROFILES { \
{2,  256}, \
{8,  128}, \
{16, 64},  \
@@ -123,32 +124,68 @@ enum {
{64, 64}   \
 }
 
+#define NET_DIM_TX_EQE_PROFILES { \
+   {1,   NET_DIM_DEFAULT_TX_CQ_MODERATION_PKTS_FROM_EQE},  \
+   {8,   NET_DIM_DEFAULT_TX_CQ_MODERATION_PKTS_FROM_EQE},  \
+   {32,  NET_DIM_DEFAULT_TX_CQ_MODERATION_PKTS_FROM_EQE},  \
+   {64,  NET_DIM_DEFAULT_TX_CQ_MODERATION_PKTS_FROM_EQE},  \
+   {128, NET_DIM_DEFAULT_TX_CQ_MODERATION_PKTS_FROM_EQE}   \
+}
+
+#define NET_DIM_TX_CQE_PROFILES { \
+   {5,  128},  \
+   {8,  64},  \
+   {16, 32},  \
+   {32, 32},  \
+   {64, 32}   \
+}
+
 static const struct net_dim_cq_moder
-profile[NET_DIM_CQ_PERIOD_NUM_MODES][NET_DIM_PARAMS_NUM_PROFILES] = {
-   NET_DIM_EQE_PROFILES,
-   NET_DIM_CQE_PROFILES,
+rx_profile[NET_DIM_CQ_PERIOD_NUM_MODES][NET_DIM_PARAMS_NUM_PROFILES] = {
+   NET_DIM_RX_EQE_PROFILES,
+   NET_DIM_RX_CQE_PROFILES,
+};
+
+static const struct net_dim_cq_moder
+tx_profile[NET_DIM_CQ_PERIOD_NUM_MODES][NET_DIM_PARAMS_NUM_PROFILES] = {
+   NET_DIM_TX_EQE_PROFILES,
+   NET_DIM_TX_CQE_PROFILES,
 };
 
 static inline struct net_dim_cq_moder
 net_dim_get_rx_moderation(u8 cq_period_mode, int ix)
 {
-   struct net_dim_cq_moder cq_moder = profile[cq_period_mode][ix];
+   struct net_dim_cq_moder cq_moder = rx_profile[cq_period_mode][ix];
 
cq_moder.cq_period_mode = cq_period_mode;
return cq_moder;
 }
 
 static inline struct net_dim_cq_moder
-net_dim_get_def_rx_moderation(u8 rx_cq_period_mode)
+net_dim_get_def_rx_moderation(u8 cq_period_mode)
+{
+   u8 profile_ix = cq_period_mode == NET_DIM_CQ_PERIOD_MODE_START_FROM_CQE 
?
+   NET_DIM_DEF_PROFILE_CQE : NET_DIM_DEF_PROFILE_EQE;
+
+   return net_dim_get_rx_moderation(cq_period_mode, profile_ix);
+}
+
+static inline struct net_dim_cq_moder
+net_dim_get_tx_moderation(u8 cq_period_mode, int ix)
 {
-   int default_profile_ix;
+   struct net_dim_cq_moder cq_moder = tx_profile[cq_period_mode][ix];
 
-   if (rx_cq_period_mode == NET_DIM_CQ_PERIOD_MODE_START_FROM_CQE)
-   default_profile_ix = NET_DIM_DEF_PROFILE_CQE;
-   else /* NET_DIM_CQ_PERIOD_MODE_START_FROM_EQE */
-   default_profile_ix = NET_DIM_DEF_PROFILE_EQE;
+   cq_moder.cq_period_mode = cq_period_mode;
+   return cq_moder;
+}
+
+static inline struct net_dim_cq_moder
+net_dim_get_def_tx_moderation(u8 cq_period_mode)
+{
+   u8 profile_ix = cq_period_mode == NET_DIM_CQ_PERIOD_MODE_START_FROM_CQE 
?
+   NET_DIM_DEF_PROFILE_CQE : NET_DIM_DEF_PROFILE_EQE;
 
-   return net_dim_get_rx_moderation(rx_cq_period_mode, default_profile_ix);
+   return net_dim_get_tx_moderation(cq_period_mode, profile_ix);
 }
 
 static inline bool net_dim_on_top(struct net_dim *dim)
-- 
1.8.3.1



[PATCH net-next V3 0/3] Introduce adaptive TX interrupt moderation to net DIM

2018-04-24 Thread Tal Gilboa
Net DIM is a library designed for dynamic interrupt moderation. It was
implemented and optimized with receive side interrupts in mind, since these
are usually the CPU expensive ones. This patch-set introduces adaptive transmit
interrupt moderation to net DIM, complete with a usage in the mlx5e driver.
Using adaptive TX behavior would reduce interrupt rate for multiple scenarios.
Furthermore, it is essential for increasing bandwidth on cases where payload
aggregation is required.

v3: Remove "inline" from functions in .c files (requested by DaveM). Revert
adding "enabled" field from struct net_dim and applied mlx5e structural
suggestions (suggested by SaeedM).

v2: Rebase over proper tree.

v1: Fix compilation issues due to missed function renaming.

Tal Gilboa (3):
  net/dim: Rename *_get_profile() functions to *_get_rx_moderation()
  net/dim: Support adaptive TX moderation
  net/mlx5e: Enable adaptive-TX moderation

 drivers/net/ethernet/broadcom/bcmsysport.c |  6 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt_dim.c  |  8 +--
 drivers/net/ethernet/broadcom/genet/bcmgenet.c |  6 +-
 drivers/net/ethernet/mellanox/mlx5/core/en.h   |  4 ++
 drivers/net/ethernet/mellanox/mlx5/core/en_dim.c   | 28 ++--
 .../net/ethernet/mellanox/mlx5/core/en_ethtool.c   | 35 +++---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  | 79 ++
 drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c  | 37 +++---
 include/linux/net_dim.h| 69 ++-
 9 files changed, 190 insertions(+), 82 deletions(-)

-- 
1.8.3.1



[PATCH net-next V3 3/3] net/mlx5e: Enable adaptive-TX moderation

2018-04-24 Thread Tal Gilboa
Add support for adaptive TX moderation. This greatly reduces TX interrupt
rate and increases bandwidth, mostly for TCP bandwidth over ARM
architecture (below). There is a slight single stream TCP with very large
message sizes degradation (x86). In this case if there's any moderation on
transmitted packets the bandwidth would reduce due to hitting TCP output limit.
Since this is a synthetic case, this is still worth doing.

Performance improvement (ConnectX-4Lx 40GbE, ARM)
TCP 64B bandwidth with 1-50 streams increased 6-35%.
TCP 64B bandwidth with 100-500 streams increased 20-70%.

Performance improvement (ConnectX-5 100GbE, x86)
Bandwidth: increased up to 40% (1024B with 10s of streams).
Interrupt rate: reduced up to 50% (1024B with 1000s of streams).

Performance degradation (ConnectX-5 100GbE, x86)
Bandwidth: up to 10% decrease single stream TCP (1MB message size from
51Gb/s to 47Gb/s).

Signed-off-by: Tal Gilboa <ta...@mellanox.com>
Reviewed-by: Tariq Toukan <tar...@mellanox.com>
Acked-by: Saeed Mahameed <sae...@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h   |  4 ++
 drivers/net/ethernet/mellanox/mlx5/core/en_dim.c   | 24 +--
 .../net/ethernet/mellanox/mlx5/core/en_ethtool.c   | 35 +++---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  | 81 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c  | 37 +++---
 5 files changed, 125 insertions(+), 56 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 3317a4d..56c748c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -241,6 +241,7 @@ struct mlx5e_params {
bool vlan_strip_disable;
bool scatter_fcs_en;
bool rx_dim_enabled;
+   bool tx_dim_enabled;
u32 lro_timeout;
u32 pflags;
struct bpf_prog *xdp_prog;
@@ -330,6 +331,7 @@ enum {
MLX5E_SQ_STATE_ENABLED,
MLX5E_SQ_STATE_RECOVERING,
MLX5E_SQ_STATE_IPSEC,
+   MLX5E_SQ_STATE_AM,
 };
 
 struct mlx5e_sq_wqe_info {
@@ -342,6 +344,7 @@ struct mlx5e_txqsq {
/* dirtied @completion */
u16cc;
u32dma_fifo_cc;
+   struct net_dim dim; /* Adaptive Moderation */
 
/* dirtied @xmit */
u16pc cacheline_aligned_in_smp;
@@ -,4 +1114,5 @@ void mlx5e_build_nic_params(struct mlx5_core_dev *mdev,
u16 max_channels, u16 mtu);
 u8 mlx5e_params_calculate_tx_min_inline(struct mlx5_core_dev *mdev);
 void mlx5e_rx_dim_work(struct work_struct *work);
+void mlx5e_tx_dim_work(struct work_struct *work);
 #endif /* __MLX5_EN_H__ */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_dim.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_dim.c
index 1b286e1..d67adf7 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_dim.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_dim.c
@@ -33,16 +33,30 @@
 #include 
 #include "en.h"
 
+static void
+mlx5e_complete_dim_work(struct net_dim *dim, struct net_dim_cq_moder moder,
+   struct mlx5_core_dev *mdev, struct mlx5_core_cq *mcq)
+{
+   mlx5_core_modify_cq_moderation(mdev, mcq, moder.usec, moder.pkts);
+   dim->state = NET_DIM_START_MEASURE;
+}
+
 void mlx5e_rx_dim_work(struct work_struct *work)
 {
-   struct net_dim *dim = container_of(work, struct net_dim,
-  work);
+   struct net_dim *dim = container_of(work, struct net_dim, work);
struct mlx5e_rq *rq = container_of(dim, struct mlx5e_rq, dim);
struct net_dim_cq_moder cur_moder =
net_dim_get_rx_moderation(dim->mode, dim->profile_ix);
 
-   mlx5_core_modify_cq_moderation(rq->mdev, >cq.mcq,
-  cur_moder.usec, cur_moder.pkts);
+   mlx5e_complete_dim_work(dim, cur_moder, rq->mdev, >cq.mcq);
+}
 
-   dim->state = NET_DIM_START_MEASURE;
+void mlx5e_tx_dim_work(struct work_struct *work)
+{
+   struct net_dim *dim = container_of(work, struct net_dim, work);
+   struct mlx5e_txqsq *sq = container_of(dim, struct mlx5e_txqsq, dim);
+   struct net_dim_cq_moder cur_moder =
+   net_dim_get_tx_moderation(dim->mode, dim->profile_ix);
+
+   mlx5e_complete_dim_work(dim, cur_moder, sq->cq.mdev, >cq.mcq);
 }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
index 37fd024..2b786c4 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
@@ -389,14 +389,20 @@ static int mlx5e_set_channels(struct net_device *dev,
 int mlx5e_ethtool_get_coalesce(struct mlx5e_priv *priv,
   struct ethtool_coalesce *coal)
 {
+   struct net_dim_cq_mo

[PATCH net-next V3 1/3] net/dim: Rename *_get_profile() functions to *_get_rx_moderation()

2018-04-24 Thread Tal Gilboa
Preparation for introducing adaptive TX to net DIM.

Signed-off-by: Tal Gilboa <ta...@mellanox.com>
Reviewed-by: Tariq Toukan <tar...@mellanox.com>
---
 drivers/net/ethernet/broadcom/bcmsysport.c|  6 +++---
 drivers/net/ethernet/broadcom/bnxt/bnxt_dim.c |  8 
 drivers/net/ethernet/broadcom/genet/bcmgenet.c|  6 +++---
 drivers/net/ethernet/mellanox/mlx5/core/en_dim.c  |  6 +++---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c |  6 --
 include/linux/net_dim.h   | 12 ++--
 6 files changed, 23 insertions(+), 21 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bcmsysport.c 
b/drivers/net/ethernet/broadcom/bcmsysport.c
index f9a3c1a..effc651 100644
--- a/drivers/net/ethernet/broadcom/bcmsysport.c
+++ b/drivers/net/ethernet/broadcom/bcmsysport.c
@@ -654,7 +654,7 @@ static int bcm_sysport_set_coalesce(struct net_device *dev,
pkts = priv->rx_max_coalesced_frames;
 
if (ec->use_adaptive_rx_coalesce && !priv->dim.use_dim) {
-   moder = net_dim_get_def_profile(priv->dim.dim.mode);
+   moder = net_dim_get_def_rx_moderation(priv->dim.dim.mode);
usecs = moder.usec;
pkts = moder.pkts;
}
@@ -1064,7 +1064,7 @@ static void bcm_sysport_dim_work(struct work_struct *work)
struct bcm_sysport_priv *priv =
container_of(ndim, struct bcm_sysport_priv, dim);
struct net_dim_cq_moder cur_profile =
-   net_dim_get_profile(dim->mode, dim->profile_ix);
+   net_dim_get_rx_moderation(dim->mode, dim->profile_ix);
 
bcm_sysport_set_rx_coalesce(priv, cur_profile.usec, cur_profile.pkts);
dim->state = NET_DIM_START_MEASURE;
@@ -1437,7 +1437,7 @@ static void bcm_sysport_init_rx_coalesce(struct 
bcm_sysport_priv *priv)
 
/* If DIM was enabled, re-apply default parameters */
if (dim->use_dim) {
-   moder = net_dim_get_def_profile(dim->dim.mode);
+   moder = net_dim_get_def_rx_moderation(dim->dim.mode);
usecs = moder.usec;
pkts = moder.pkts;
}
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_dim.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt_dim.c
index 408dd19..afa97c8 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_dim.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_dim.c
@@ -21,11 +21,11 @@ void bnxt_dim_work(struct work_struct *work)
struct bnxt_napi *bnapi = container_of(cpr,
   struct bnxt_napi,
   cp_ring);
-   struct net_dim_cq_moder cur_profile = net_dim_get_profile(dim->mode,
- 
dim->profile_ix);
+   struct net_dim_cq_moder cur_moder =
+   net_dim_get_rx_moderation(dim->mode, dim->profile_ix);
 
-   cpr->rx_ring_coal.coal_ticks = cur_profile.usec;
-   cpr->rx_ring_coal.coal_bufs = cur_profile.pkts;
+   cpr->rx_ring_coal.coal_ticks = cur_moder.usec;
+   cpr->rx_ring_coal.coal_bufs = cur_moder.pkts;
 
bnxt_hwrm_set_ring_coal(bnapi->bp, bnapi);
dim->state = NET_DIM_START_MEASURE;
diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.c 
b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
index 0445f2c..20c1681 100644
--- a/drivers/net/ethernet/broadcom/genet/bcmgenet.c
+++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
@@ -652,7 +652,7 @@ static void bcmgenet_set_ring_rx_coalesce(struct 
bcmgenet_rx_ring *ring,
pkts = ring->rx_max_coalesced_frames;
 
if (ec->use_adaptive_rx_coalesce && !ring->dim.use_dim) {
-   moder = net_dim_get_def_profile(ring->dim.dim.mode);
+   moder = net_dim_get_def_rx_moderation(ring->dim.dim.mode);
usecs = moder.usec;
pkts = moder.pkts;
}
@@ -1925,7 +1925,7 @@ static void bcmgenet_dim_work(struct work_struct *work)
struct bcmgenet_rx_ring *ring =
container_of(ndim, struct bcmgenet_rx_ring, dim);
struct net_dim_cq_moder cur_profile =
-   net_dim_get_profile(dim->mode, dim->profile_ix);
+   net_dim_get_rx_moderation(dim->mode, dim->profile_ix);
 
bcmgenet_set_rx_coalesce(ring, cur_profile.usec, cur_profile.pkts);
dim->state = NET_DIM_START_MEASURE;
@@ -2102,7 +2102,7 @@ static void bcmgenet_init_rx_coalesce(struct 
bcmgenet_rx_ring *ring)
 
/* If DIM was enabled, re-apply default parameters */
if (dim->use_dim) {
-   moder = net_dim_get_def_profile(dim->dim.mode);
+   moder = net_dim_get_def_rx_moderation(dim->dim.mode);
usecs = moder.usec;
pkts = moder.pkts;
}
diff --git a/drivers/net/

Re: net_dim() may use uninitialized data

2018-04-12 Thread Tal Gilboa

On 4/5/2018 4:13 PM, Geert Uytterhoeven wrote:

Hi Tal,

With gcc-4.1.2:

 drivers/net/ethernet/broadcom/bcmsysport.c: In function ‘bcm_sysport_poll’:
 include/linux/net_dim.h:354: warning: ‘curr_stats.ppms’ may be
used uninitialized in this function
 include/linux/net_dim.h:354: warning: ‘curr_stats.bpms’ may be
used uninitialized in this function
 include/linux/net_dim.h:354: warning: ‘curr_stats.epms’ may be
used uninitialized in this function

Indeed, ...

| static inline void net_dim_calc_stats(struct net_dim_sample *start,
|   struct net_dim_sample *end,
|   struct net_dim_stats *curr_stats)
| {
| /* u32 holds up to 71 minutes, should be enough */
| u32 delta_us = ktime_us_delta(end->time, start->time);
| u32 npkts = BIT_GAP(BITS_PER_TYPE(u32), end->pkt_ctr, start->pkt_ctr);
| u32 nbytes = BIT_GAP(BITS_PER_TYPE(u32), end->byte_ctr,
|  start->byte_ctr);
|
| if (!delta_us)
| return;

... if delta_us is zero, none of the below will be initialized ...

| curr_stats->ppms = DIV_ROUND_UP(npkts * USEC_PER_MSEC, delta_us);
| curr_stats->bpms = DIV_ROUND_UP(nbytes * USEC_PER_MSEC, delta_us);
| curr_stats->epms = DIV_ROUND_UP(NET_DIM_NEVENTS * USEC_PER_MSEC,
| delta_us);
| }
|
| static inline void net_dim(struct net_dim *dim,
|struct net_dim_sample end_sample)
| {
| struct net_dim_stats curr_stats;
| u16 nevents;
|
| switch (dim->state) {
| case NET_DIM_MEASURE_IN_PROGRESS:
| nevents = BIT_GAP(BITS_PER_TYPE(u16),
|   end_sample.event_ctr,
|   dim->start_sample.event_ctr);
| if (nevents < NET_DIM_NEVENTS)
| break;
| net_dim_calc_stats(>start_sample, _sample,
|_stats);

... in the output parameter curr_stats ...

| if (net_dim_decision(_stats, dim)) {

... and net_dim_decision will make some funky decisions based on
uninitialized data.

What are the proper values to initialize curr_stats with?
Alternatively, perhaps the call to net_dim_decision() should be made
dependent on delta_us being non-zero?


First, thanks a lot for pointing this out. There are no valid values for 
initializing curr_stats. If we consider the most straightforward (all 
0s) this may result in a (big) negative delta between current and 
previous stats and a wrong decision. Any other value would make very 
little sense.
The case of !delta_us is an error flow (0 time passed or more probably 
issues when setting start and/or end times). I suggest adding a return 
value to net_dim_calc_stats() and abort the net_dim cycle if an error 
occurs.




| dim->state = NET_DIM_APPLY_NEW_PROFILE;
| schedule_work(>work);
| break;
| }
| /* fall through */
| case NET_DIM_START_MEASURE:
| dim->state = NET_DIM_MEASURE_IN_PROGRESS;
| break;
| case NET_DIM_APPLY_NEW_PROFILE:
| break;
| }
| }

Gr{oetje,eeting}s,

 Geert



Re: [PATCH v5 05/14] PCI: Add pcie_print_link_status() to log link speed and whether it's limited

2018-04-02 Thread Tal Gilboa

On 4/2/2018 11:25 PM, Keller, Jacob E wrote:




-Original Message-
From: Bjorn Helgaas [mailto:helg...@kernel.org]
Sent: Monday, April 02, 2018 12:58 PM
To: Keller, Jacob E <jacob.e.kel...@intel.com>
Cc: Tal Gilboa <ta...@mellanox.com>; Tariq Toukan <tar...@mellanox.com>; Ariel
Elior <ariel.el...@cavium.com>; Ganesh Goudar <ganes...@chelsio.com>;
Kirsher, Jeffrey T <jeffrey.t.kirs...@intel.com>; everest-linux...@cavium.com;
intel-wired-...@lists.osuosl.org; netdev@vger.kernel.org; linux-
ker...@vger.kernel.org; linux-...@vger.kernel.org
Subject: Re: [PATCH v5 05/14] PCI: Add pcie_print_link_status() to log link 
speed
and whether it's limited

On Mon, Apr 02, 2018 at 04:25:17PM +, Keller, Jacob E wrote:

-Original Message-
From: Bjorn Helgaas [mailto:helg...@kernel.org]
Sent: Friday, March 30, 2018 2:05 PM
To: Tal Gilboa <ta...@mellanox.com>
Cc: Tariq Toukan <tar...@mellanox.com>; Keller, Jacob E
<jacob.e.kel...@intel.com>; Ariel Elior <ariel.el...@cavium.com>; Ganesh
Goudar <ganes...@chelsio.com>; Kirsher, Jeffrey T
<jeffrey.t.kirs...@intel.com>; everest-linux...@cavium.com; intel-wired-
l...@lists.osuosl.org; netdev@vger.kernel.org; linux-ker...@vger.kernel.org;
linux-...@vger.kernel.org
Subject: [PATCH v5 05/14] PCI: Add pcie_print_link_status() to log link speed

and

whether it's limited

From: Tal Gilboa <ta...@mellanox.com>

Add pcie_print_link_status().  This logs the current settings of the link
(speed, width, and total available bandwidth).

If the device is capable of more bandwidth but is limited by a slower
upstream link, we include information about the link that limits the
device's performance.

The user may be able to move the device to a different slot for better
performance.

This provides a unified method for all PCI devices to report status and
issues, instead of each device reporting in a different way, using
different code.

Signed-off-by: Tal Gilboa <ta...@mellanox.com>
[bhelgaas: changelog, reword log messages, print device capabilities when
not limited]
Signed-off-by: Bjorn Helgaas <bhelg...@google.com>
---
  drivers/pci/pci.c   |   29 +
  include/linux/pci.h |1 +
  2 files changed, 30 insertions(+)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index e00d56b12747..cec7aed09f6b 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -5283,6 +5283,35 @@ u32 pcie_bandwidth_capable(struct pci_dev *dev,
enum pci_bus_speed *speed,
return *width * PCIE_SPEED2MBS_ENC(*speed);
  }

+/**
+ * pcie_print_link_status - Report the PCI device's link speed and width
+ * @dev: PCI device to query
+ *
+ * Report the available bandwidth at the device.  If this is less than the
+ * device is capable of, report the device's maximum possible bandwidth and
+ * the upstream link that limits its performance to less than that.
+ */
+void pcie_print_link_status(struct pci_dev *dev)
+{
+   enum pcie_link_width width, width_cap;
+   enum pci_bus_speed speed, speed_cap;
+   struct pci_dev *limiting_dev = NULL;
+   u32 bw_avail, bw_cap;
+
+   bw_cap = pcie_bandwidth_capable(dev, _cap, _cap);
+   bw_avail = pcie_bandwidth_available(dev, _dev, ,
);
+
+   if (bw_avail >= bw_cap)
+   pci_info(dev, "%d Mb/s available bandwidth (%s x%d link)\n",
+bw_cap, PCIE_SPEED2STR(speed_cap), width_cap);
+   else
+   pci_info(dev, "%d Mb/s available bandwidth, limited by %s x%d
link at %s (capable of %d Mb/s with %s x%d link)\n",
+bw_avail, PCIE_SPEED2STR(speed), width,
+limiting_dev ? pci_name(limiting_dev) : "",
+bw_cap, PCIE_SPEED2STR(speed_cap), width_cap);
+}


Personally, I would make thic last one a pci_warn() to indicate it at a
higher log level, but I'm  ok with the wording, and if consensus is that
this should be at info, I'm ok with that.


Tal's original patch did have a pci_warn() here, and we went back and
forth a bit.  They get bug reports when a device doesn't perform as
expected, which argues for pci_warn().  But they also got feedback
saying warnings are a bit too much, which argues for pci_info() [1]

I don't have a really strong opinion either way.  I have a slight
preference for info because the user may not be able to do anything
about it (there may not be a faster slot available), and I think
distros are usually configured so a warning interrupts the smooth
graphical boot.

It looks like mlx4, fm10k, and ixgbe currently use warnings, while
bnx2x, bnxt_en, and cxgb4 use info.  It's a tie so far :)

[1] https://lkml.kernel.org/r/e47f3628-b56c-4d0a-f18b-
5ffaf261c...@mellanox.com



With that information, I'm fine with the proposal to display this as only an 
info. The message is still printed and can be used for debugging purposes, and 
I think that's really enough.


Here's a proposal for printing the bandwidth as "x.xxx Gb/s":


Nice, I like that also.

Regards,
Jake



Same here for both.


Re: [PATCH net-next V2 0/4] Introduce adaptive TX interrupt moderation to net DIM

2018-04-02 Thread Tal Gilboa

On 4/2/2018 5:27 PM, David Miller wrote:

From: Tal Gilboa <ta...@mellanox.com>
Date: Mon,  2 Apr 2018 16:59:30 +0300


Net DIM is a library designed for dynamic interrupt moderation. It was
implemented and optimized with receive side interrupts in mind, since these
are usually the CPU expensive ones. This patch-set introduces adaptive transmit
interrupt moderation to net DIM, complete with a usage in the mlx5e driver.
Using adaptive TX behavior would reduce interrupt rate for multiple scenarios.
Furthermore, it is essential for increasing bandwidth on cases where payload
aggregation is required.

v2: Rebased over proper tree.

v1: Fix compilation issues due to missed function renaming.


This series still needs fixes, and the net-next tree has closed meanwhile.

And to be honest, handling this series has been very painful for me so far.
The patches either didn't apply or didn't even compile.

Please do not resubmit this until the merge window is over and the net-next
tree opens up again.

Thank you.


Ack.


Re: [PATCH v5 03/14] PCI: Add pcie_bandwidth_capable() to compute max supported link bandwidth

2018-04-02 Thread Tal Gilboa

On 4/2/2018 5:05 PM, Bjorn Helgaas wrote:

On Mon, Apr 02, 2018 at 10:34:58AM +0300, Tal Gilboa wrote:

On 4/2/2018 3:40 AM, Bjorn Helgaas wrote:

On Sun, Apr 01, 2018 at 11:38:53PM +0300, Tal Gilboa wrote:

On 3/31/2018 12:05 AM, Bjorn Helgaas wrote:

From: Tal Gilboa <ta...@mellanox.com>

Add pcie_bandwidth_capable() to compute the max link bandwidth supported by
a device, based on the max link speed and width, adjusted by the encoding
overhead.

The maximum bandwidth of the link is computed as:

 max_link_speed * max_link_width * (1 - encoding_overhead)

The encoding overhead is about 20% for 2.5 and 5.0 GT/s links using 8b/10b
encoding, and about 1.5% for 8 GT/s or higher speed links using 128b/130b
encoding.

Signed-off-by: Tal Gilboa <ta...@mellanox.com>
[bhelgaas: adjust for pcie_get_speed_cap() and pcie_get_width_cap()
signatures, don't export outside drivers/pci]
Signed-off-by: Bjorn Helgaas <bhelg...@google.com>
Reviewed-by: Tariq Toukan <tar...@mellanox.com>
---
drivers/pci/pci.c |   21 +
drivers/pci/pci.h |9 +
2 files changed, 30 insertions(+)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 43075be79388..9ce89e254197 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -5208,6 +5208,27 @@ enum pcie_link_width pcie_get_width_cap(struct pci_dev 
*dev)
return PCIE_LNK_WIDTH_UNKNOWN;
}
+/**
+ * pcie_bandwidth_capable - calculates a PCI device's link bandwidth capability
+ * @dev: PCI device
+ * @speed: storage for link speed
+ * @width: storage for link width
+ *
+ * Calculate a PCI device's link bandwidth by querying for its link speed
+ * and width, multiplying them, and applying encoding overhead.
+ */
+u32 pcie_bandwidth_capable(struct pci_dev *dev, enum pci_bus_speed *speed,
+  enum pcie_link_width *width)
+{
+   *speed = pcie_get_speed_cap(dev);
+   *width = pcie_get_width_cap(dev);
+
+   if (*speed == PCI_SPEED_UNKNOWN || *width == PCIE_LNK_WIDTH_UNKNOWN)
+   return 0;
+
+   return *width * PCIE_SPEED2MBS_ENC(*speed);
+}
+
/**
 * pci_select_bars - Make BAR mask from the type of resource
 * @dev: the PCI device for which BAR mask is made
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 66738f1050c0..2a50172b9803 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -261,8 +261,17 @@ void pci_disable_bridge_window(struct pci_dev *dev);
 (speed) == PCIE_SPEED_2_5GT ? "2.5 GT/s" : \
 "Unknown speed")
+/* PCIe speed to Mb/s with encoding overhead: 20% for gen2, ~1.5% for gen3 */
+#define PCIE_SPEED2MBS_ENC(speed) \


Missing gen4.


I made it "gen3+".  I think that's accurate, isn't it?  The spec
doesn't seem to actually use "gen3" as a specific term, but sec 4.2.2
says rates of 8 GT/s or higher (which I think includes gen3 and gen4)
use 128b/130b encoding.



I meant that PCIE_SPEED_16_0GT will return 0 from this macro since it wasn't
added. Need to return 15754.


Oh, duh, of course!  Sorry for being dense.  What about the following?
I included the calculation as opposed to just the magic numbers to try
to make it clear how they're derived.  This has the disadvantage of
truncating the result instead of rounding, but I doubt that's
significant in this context.  If it is, we could use the magic numbers
and put the computation in a comment.


We can always use DIV_ROUND_UP((speed * enc_nominator), 
enc_denominator). I think this is confusing and since this introduces a 
bandwidth limit I would prefer to give a wider limit than a wrong one, 
even it is by less than 1Mb/s. My vote is for leaving it as you wrote below.




Another question: we currently deal in Mb/s, not MB/s.  Mb/s has the
advantage of sort of corresponding to the GT/s numbers, but using MB/s
would have the advantage of smaller numbers that match the table here:
https://en.wikipedia.org/wiki/PCI_Express#History_and_revisions,
but I don't know what's most typical in user-facing situations.
What's better?


I don't know what's better but for network devices we measure bandwidth 
in Gb/s, so presenting bandwidth in MB/s would mean additional 
calculations. The truth is I would have prefer to use Gb/s instead of 
Mb/s, but again, don't want to loss up to 1Gb/s.





commit 946435491b35b7782157e9a4d1bd73071fba7709
Author: Tal Gilboa <ta...@mellanox.com>
Date:   Fri Mar 30 08:32:03 2018 -0500

 PCI: Add pcie_bandwidth_capable() to compute max supported link bandwidth
 
 Add pcie_bandwidth_capable() to compute the max link bandwidth supported by

 a device, based on the max link speed and width, adjusted by the encoding
 overhead.
 
 The maximum bandwidth of the link is computed as:
 
   max_link_width * max_link_speed * (1 - encoding_overhead)
 
 2.5 and 5.0 GT/s links use 8b/10b encoding, which reduces the raw bandwidth

 available by 20%; 8.0 GT/s and f

[PATCH net-next V2 1/4] net/dim: Rename *_get_profile() functions to *_get_rx_moderation()

2018-04-02 Thread Tal Gilboa
Preparation for introducing adaptive TX to net DIM.

Signed-off-by: Tal Gilboa <ta...@mellanox.com>
---
 drivers/net/ethernet/broadcom/bcmsysport.c|  6 +++---
 drivers/net/ethernet/broadcom/bnxt/bnxt_dim.c |  8 
 drivers/net/ethernet/broadcom/genet/bcmgenet.c|  6 +++---
 drivers/net/ethernet/mellanox/mlx5/core/en_dim.c  |  6 +++---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c |  6 --
 include/linux/net_dim.h   | 12 ++--
 6 files changed, 23 insertions(+), 21 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bcmsysport.c 
b/drivers/net/ethernet/broadcom/bcmsysport.c
index 4a75b1d..98c5183 100644
--- a/drivers/net/ethernet/broadcom/bcmsysport.c
+++ b/drivers/net/ethernet/broadcom/bcmsysport.c
@@ -654,7 +654,7 @@ static int bcm_sysport_set_coalesce(struct net_device *dev,
pkts = priv->rx_max_coalesced_frames;
 
if (ec->use_adaptive_rx_coalesce && !priv->dim.use_dim) {
-   moder = net_dim_get_def_profile(priv->dim.dim.mode);
+   moder = net_dim_get_def_rx_moderation(priv->dim.dim.mode);
usecs = moder.usec;
pkts = moder.pkts;
}
@@ -1064,7 +1064,7 @@ static void bcm_sysport_dim_work(struct work_struct *work)
struct bcm_sysport_priv *priv =
container_of(ndim, struct bcm_sysport_priv, dim);
struct net_dim_cq_moder cur_profile =
-   net_dim_get_profile(dim->mode, dim->profile_ix);
+   net_dim_get_rx_moderation(dim->mode, 
dim->profile_ix);
 
bcm_sysport_set_rx_coalesce(priv, cur_profile.usec, cur_profile.pkts);
dim->state = NET_DIM_START_MEASURE;
@@ -1436,7 +1436,7 @@ static void bcm_sysport_init_rx_coalesce(struct 
bcm_sysport_priv *priv)
 
/* If DIM was enabled, re-apply default parameters */
if (dim->use_dim) {
-   moder = net_dim_get_def_profile(dim->dim.mode);
+   moder = net_dim_get_def_rx_moderation(dim->dim.mode);
usecs = moder.usec;
pkts = moder.pkts;
}
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_dim.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt_dim.c
index 408dd19..afa97c8 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_dim.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_dim.c
@@ -21,11 +21,11 @@ void bnxt_dim_work(struct work_struct *work)
struct bnxt_napi *bnapi = container_of(cpr,
   struct bnxt_napi,
   cp_ring);
-   struct net_dim_cq_moder cur_profile = net_dim_get_profile(dim->mode,
- 
dim->profile_ix);
+   struct net_dim_cq_moder cur_moder =
+   net_dim_get_rx_moderation(dim->mode, dim->profile_ix);
 
-   cpr->rx_ring_coal.coal_ticks = cur_profile.usec;
-   cpr->rx_ring_coal.coal_bufs = cur_profile.pkts;
+   cpr->rx_ring_coal.coal_ticks = cur_moder.usec;
+   cpr->rx_ring_coal.coal_bufs = cur_moder.pkts;
 
bnxt_hwrm_set_ring_coal(bnapi->bp, bnapi);
dim->state = NET_DIM_START_MEASURE;
diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.c 
b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
index 264fb37..3c3b780 100644
--- a/drivers/net/ethernet/broadcom/genet/bcmgenet.c
+++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
@@ -652,7 +652,7 @@ static void bcmgenet_set_ring_rx_coalesce(struct 
bcmgenet_rx_ring *ring,
pkts = ring->rx_max_coalesced_frames;
 
if (ec->use_adaptive_rx_coalesce && !ring->dim.use_dim) {
-   moder = net_dim_get_def_profile(ring->dim.dim.mode);
+   moder = net_dim_get_def_rx_moderation(ring->dim.dim.mode);
usecs = moder.usec;
pkts = moder.pkts;
}
@@ -1924,7 +1924,7 @@ static void bcmgenet_dim_work(struct work_struct *work)
struct bcmgenet_rx_ring *ring =
container_of(ndim, struct bcmgenet_rx_ring, dim);
struct net_dim_cq_moder cur_profile =
-   net_dim_get_profile(dim->mode, dim->profile_ix);
+   net_dim_get_rx_moderation(dim->mode, dim->profile_ix);
 
bcmgenet_set_rx_coalesce(ring, cur_profile.usec, cur_profile.pkts);
dim->state = NET_DIM_START_MEASURE;
@@ -2101,7 +2101,7 @@ static void bcmgenet_init_rx_coalesce(struct 
bcmgenet_rx_ring *ring)
 
/* If DIM was enabled, re-apply default parameters */
if (dim->use_dim) {
-   moder = net_dim_get_def_profile(dim->dim.mode);
+   moder = net_dim_get_def_rx_moderation(dim->dim.mode);
usecs = moder.usec;
pkts = moder.pkts;
}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_dim.c 
b/dri

[PATCH net-next V2 2/4] net/dim: Add "enabled" field to net_dim struct

2018-04-02 Thread Tal Gilboa
Preparation for introducing adaptive TX to net DIM.

Signed-off-by: Tal Gilboa <ta...@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h |  1 -
 drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c | 10 +++---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c|  6 +++---
 drivers/net/ethernet/mellanox/mlx5/core/en_rep.c |  2 +-
 include/linux/net_dim.h  |  2 ++
 5 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 30cad07..2c18d2f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -238,7 +238,6 @@ struct mlx5e_params {
u32 indirection_rqt[MLX5E_INDIR_RQT_SIZE];
bool vlan_strip_disable;
bool scatter_fcs_en;
-   bool rx_dim_enabled;
u32 lro_timeout;
u32 pflags;
struct bpf_prog *xdp_prog;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
index 37fd024..66c71da 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
@@ -394,9 +394,10 @@ int mlx5e_ethtool_get_coalesce(struct mlx5e_priv *priv,
 
coal->rx_coalesce_usecs   = 
priv->channels.params.rx_cq_moderation.usec;
coal->rx_max_coalesced_frames = 
priv->channels.params.rx_cq_moderation.pkts;
+   coal->use_adaptive_rx_coalesce =
+   priv->channels.params.rx_cq_moderation.enabled;
coal->tx_coalesce_usecs   = 
priv->channels.params.tx_cq_moderation.usec;
coal->tx_max_coalesced_frames = 
priv->channels.params.tx_cq_moderation.pkts;
-   coal->use_adaptive_rx_coalesce = priv->channels.params.rx_dim_enabled;
 
return 0;
 }
@@ -467,7 +468,8 @@ int mlx5e_ethtool_set_coalesce(struct mlx5e_priv *priv,
new_channels.params.tx_cq_moderation.pkts = 
coal->tx_max_coalesced_frames;
new_channels.params.rx_cq_moderation.usec = coal->rx_coalesce_usecs;
new_channels.params.rx_cq_moderation.pkts = 
coal->rx_max_coalesced_frames;
-   new_channels.params.rx_dim_enabled= 
!!coal->use_adaptive_rx_coalesce;
+   new_channels.params.rx_cq_moderation.enabled =
+   !!coal->use_adaptive_rx_coalesce;
 
if (!test_bit(MLX5E_STATE_OPENED, >state)) {
priv->channels.params = new_channels.params;
@@ -475,7 +477,9 @@ int mlx5e_ethtool_set_coalesce(struct mlx5e_priv *priv,
}
/* we are opened */
 
-   reset = !!coal->use_adaptive_rx_coalesce != 
priv->channels.params.rx_dim_enabled;
+   reset = !!coal->use_adaptive_rx_coalesce !=
+   priv->channels.params.rx_cq_moderation.enabled;
+
if (!reset) {
mlx5e_set_priv_channels_coalesce(priv, coal);
priv->channels.params = new_channels.params;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 37a89b7..9bcc578 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -781,7 +781,7 @@ static int mlx5e_open_rq(struct mlx5e_channel *c,
if (err)
goto err_destroy_rq;
 
-   if (params->rx_dim_enabled)
+   if (params->rx_cq_moderation.enabled)
c->rq.state |= BIT(MLX5E_RQ_STATE_AM);
 
return 0;
@@ -4103,7 +4103,7 @@ void mlx5e_set_rx_cq_mode_params(struct mlx5e_params 
*params, u8 cq_period_mode)
params->rx_cq_moderation.usec =
MLX5E_PARAMS_DEFAULT_RX_CQ_MODERATION_USEC_FROM_CQE;
 
-   if (params->rx_dim_enabled) {
+   if (params->rx_cq_moderation.enabled) {
switch (cq_period_mode) {
case MLX5_CQ_PERIOD_MODE_START_FROM_CQE:
params->rx_cq_moderation =
@@ -4178,7 +4178,7 @@ void mlx5e_build_nic_params(struct mlx5_core_dev *mdev,
rx_cq_period_mode = MLX5_CAP_GEN(mdev, cq_period_start_from_cqe) ?
MLX5_CQ_PERIOD_MODE_START_FROM_CQE :
MLX5_CQ_PERIOD_MODE_START_FROM_EQE;
-   params->rx_dim_enabled = MLX5_CAP_GEN(mdev, cq_moderation);
+   params->rx_cq_moderation.enabled = MLX5_CAP_GEN(mdev, cq_moderation);
mlx5e_set_rx_cq_mode_params(params, rx_cq_period_mode);
mlx5e_set_tx_cq_mode_params(params, MLX5_CQ_PERIOD_MODE_START_FROM_EQE);
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
index d8f68e4..a2918a5 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
@@ -888,7 +888,7 @@ static void mlx5e_build_rep_params(struct mlx5_c

[PATCH net-next V2 3/4] net/dim: Support adaptive TX moderation

2018-04-02 Thread Tal Gilboa
Interrupt moderation for TX traffic requires different profiles than RX
interrupt moderation. The main goal here is to reduce interrupt rate and
allow better payload aggregation by keeping SKBs in the TX queue a bit
longer. Ping-pong behavior would get a profile with a short timer, so
latency wouldn't increase for these scenarios. There's a slight degradtion
in bandwidth for single stream with large message sizes, since
net.ipv4.tcp_limit_output_bytes is limiting the allowed TX traffic, but
with many streams performance is always improved.

Performance improvements (ConnectX-5 100GbE)
Bandwidth: increased up to 40% (1024B with 10s of streams).
Interrupt rate: reduced up to 50% (1024B with 1000s of streams).

Performance degradation (ConnectX-5 100GbE)
Bandwidth: up to 10% decrease single stream TCP (1MB message size from
51Gb/s to 47Gb/s).

*Both cases with TX EQE based moderation enabled.

Signed-off-by: Tal Gilboa <ta...@mellanox.com>
---
 include/linux/net_dim.h | 64 +++--
 1 file changed, 51 insertions(+), 13 deletions(-)

diff --git a/include/linux/net_dim.h b/include/linux/net_dim.h
index e6623cf..95a7a2f 100644
--- a/include/linux/net_dim.h
+++ b/include/linux/net_dim.h
@@ -104,11 +104,12 @@ enum {
 #define NET_DIM_PARAMS_NUM_PROFILES 5
 /* Adaptive moderation profiles */
 #define NET_DIM_DEFAULT_RX_CQ_MODERATION_PKTS_FROM_EQE 256
+#define NET_DIM_DEFAULT_TX_CQ_MODERATION_PKTS_FROM_EQE 128
 #define NET_DIM_DEF_PROFILE_CQE 1
 #define NET_DIM_DEF_PROFILE_EQE 1
 
 /* All profiles sizes must be NET_PARAMS_DIM_NUM_PROFILES */
-#define NET_DIM_EQE_PROFILES { \
+#define NET_DIM_RX_EQE_PROFILES { \
{1,   NET_DIM_DEFAULT_RX_CQ_MODERATION_PKTS_FROM_EQE}, \
{8,   NET_DIM_DEFAULT_RX_CQ_MODERATION_PKTS_FROM_EQE}, \
{64,  NET_DIM_DEFAULT_RX_CQ_MODERATION_PKTS_FROM_EQE}, \
@@ -116,7 +117,7 @@ enum {
{256, NET_DIM_DEFAULT_RX_CQ_MODERATION_PKTS_FROM_EQE}, \
 }
 
-#define NET_DIM_CQE_PROFILES { \
+#define NET_DIM_RX_CQE_PROFILES { \
{2,  256}, \
{8,  128}, \
{16, 64},  \
@@ -124,16 +125,38 @@ enum {
{64, 64}   \
 }
 
+#define NET_DIM_TX_EQE_PROFILES { \
+   {1,   NET_DIM_DEFAULT_TX_CQ_MODERATION_PKTS_FROM_EQE},  \
+   {8,   NET_DIM_DEFAULT_TX_CQ_MODERATION_PKTS_FROM_EQE},  \
+   {32,  NET_DIM_DEFAULT_TX_CQ_MODERATION_PKTS_FROM_EQE},  \
+   {64,  NET_DIM_DEFAULT_TX_CQ_MODERATION_PKTS_FROM_EQE},  \
+   {128, NET_DIM_DEFAULT_TX_CQ_MODERATION_PKTS_FROM_EQE}   \
+}
+
+#define NET_DIM_TX_CQE_PROFILES { \
+   {5,  128},  \
+   {8,  64},  \
+   {16, 32},  \
+   {32, 32},  \
+   {64, 32}   \
+}
+
+static const struct net_dim_cq_moder
+rx_profile[NET_DIM_CQ_PERIOD_NUM_MODES][NET_DIM_PARAMS_NUM_PROFILES] = {
+   NET_DIM_RX_EQE_PROFILES,
+   NET_DIM_RX_CQE_PROFILES,
+};
+
 static const struct net_dim_cq_moder
-profile[NET_DIM_CQ_PERIOD_NUM_MODES][NET_DIM_PARAMS_NUM_PROFILES] = {
-   NET_DIM_EQE_PROFILES,
-   NET_DIM_CQE_PROFILES,
+tx_profile[NET_DIM_CQ_PERIOD_NUM_MODES][NET_DIM_PARAMS_NUM_PROFILES] = {
+   NET_DIM_TX_EQE_PROFILES,
+   NET_DIM_TX_CQE_PROFILES,
 };
 
 static inline struct net_dim_cq_moder
 net_dim_get_rx_moderation(u8 cq_period_mode, int ix)
 {
-   struct net_dim_cq_moder cq_moder = profile[cq_period_mode][ix];
+   struct net_dim_cq_moder cq_moder = rx_profile[cq_period_mode][ix];
 
cq_moder.cq_period_mode = cq_period_mode;
cq_moder.enabled = true;
@@ -141,16 +164,31 @@ enum {
 }
 
 static inline struct net_dim_cq_moder
-net_dim_get_def_rx_moderation(u8 rx_cq_period_mode)
+net_dim_get_def_rx_moderation(u8 cq_period_mode)
 {
-   int default_profile_ix;
+   u8 profile_ix = cq_period_mode == NET_DIM_CQ_PERIOD_MODE_START_FROM_CQE 
?
+   NET_DIM_DEF_PROFILE_CQE : NET_DIM_DEF_PROFILE_EQE;
 
-   if (rx_cq_period_mode == NET_DIM_CQ_PERIOD_MODE_START_FROM_CQE)
-   default_profile_ix = NET_DIM_DEF_PROFILE_CQE;
-   else /* NET_DIM_CQ_PERIOD_MODE_START_FROM_EQE */
-   default_profile_ix = NET_DIM_DEF_PROFILE_EQE;
+   return net_dim_get_rx_moderation(cq_period_mode, profile_ix);
+}
+
+static inline struct net_dim_cq_moder
+net_dim_get_tx_moderation(u8 cq_period_mode, int ix)
+{
+   struct net_dim_cq_moder cq_moder = tx_profile[cq_period_mode][ix];
+
+   cq_moder.cq_period_mode = cq_period_mode;
+   cq_moder.enabled = true;
+   return cq_moder;
+}
+
+static inline struct net_dim_cq_moder
+net_dim_get_def_tx_moderation(u8 cq_period_mode)
+{
+   u8 profile_ix = cq_period_mode == NET_DIM_CQ_PERIOD_MODE_START_FROM_CQE 
?
+   NET_DIM_DEF_PROFILE_CQE : NET_DIM_DEF_PROFILE_EQE;
 
-   return net_dim_get_rx_moderation(rx_cq_period_mode, default_profile_ix);
+   return net_dim_get_tx_moderation(cq_period_mode, profile_ix);
 }
 
 static inline bool net_dim_on_top(struct net_di

[PATCH net-next V2 4/4] net/mlx5e: Enable adaptive-TX moderation

2018-04-02 Thread Tal Gilboa
Add support for adaptive TX moderation. This greatly reduces TX interrupt
rate and increases bandwidth, mostly for TCP bandwidth over ARM
architecture (below). There is a slight single stream TCP with very large
message sizes degradation (x86). In this case if there's any moderation on
transmitted packets the bandwidth would reduce due to hitting TCP output
limit. Since this is a synthetic case, this is still worth doing.

Performance improvement (ConnectX-4Lx 40GbE, ARM)
TCP 64B bandwidth with 1-50 streams increased 6-35%.
TCP 64B bandwidth with 100-500 streams increased 20-70%.

Performance improvement (ConnectX-5 100GbE, x86)
Bandwidth: increased up to 40% (1024B with 10s of streams).
Interrupt rate: reduced up to 50% (1024B with 1000s of streams).

Signed-off-by: Tal Gilboa <ta...@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h   |  4 +++
 drivers/net/ethernet/mellanox/mlx5/core/en_dim.c   | 24 +++---
 .../net/ethernet/mellanox/mlx5/core/en_ethtool.c   | 37 ++
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  | 22 +
 drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c  | 37 --
 5 files changed, 96 insertions(+), 28 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 2c18d2f..1a05db0 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -327,6 +327,7 @@ enum {
MLX5E_SQ_STATE_ENABLED,
MLX5E_SQ_STATE_RECOVERING,
MLX5E_SQ_STATE_IPSEC,
+   MLX5E_SQ_STATE_AM,
 };
 
 struct mlx5e_sq_wqe_info {
@@ -339,6 +340,7 @@ struct mlx5e_txqsq {
/* dirtied @completion */
u16cc;
u32dma_fifo_cc;
+   struct net_dim dim; /* Adaptive Moderation */
 
/* dirtied @xmit */
u16pc cacheline_aligned_in_smp;
@@ -376,6 +378,7 @@ struct mlx5e_txqsq {
struct work_struct recover_work;
u64last_recover;
} recover;
+
 } cacheline_aligned_in_smp;
 
 struct mlx5e_xdpsq {
@@ -1106,4 +1109,5 @@ void mlx5e_build_nic_params(struct mlx5_core_dev *mdev,
u16 max_channels, u16 mtu);
 u8 mlx5e_params_calculate_tx_min_inline(struct mlx5_core_dev *mdev);
 void mlx5e_rx_dim_work(struct work_struct *work);
+void mlx5e_tx_dim_work(struct work_struct *work);
 #endif /* __MLX5_EN_H__ */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_dim.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_dim.c
index 1b286e1..9cec351 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_dim.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_dim.c
@@ -33,16 +33,30 @@
 #include 
 #include "en.h"
 
+static inline void
+mlx5e_complete_dim_work(struct net_dim *dim, struct net_dim_cq_moder moder,
+   struct mlx5_core_dev *mdev, struct mlx5_core_cq *mcq)
+{
+   mlx5_core_modify_cq_moderation(mdev, mcq, moder.usec, moder.pkts);
+   dim->state = NET_DIM_START_MEASURE;
+}
+
 void mlx5e_rx_dim_work(struct work_struct *work)
 {
-   struct net_dim *dim = container_of(work, struct net_dim,
-  work);
+   struct net_dim *dim = container_of(work, struct net_dim, work);
struct mlx5e_rq *rq = container_of(dim, struct mlx5e_rq, dim);
struct net_dim_cq_moder cur_moder =
net_dim_get_rx_moderation(dim->mode, dim->profile_ix);
 
-   mlx5_core_modify_cq_moderation(rq->mdev, >cq.mcq,
-  cur_moder.usec, cur_moder.pkts);
+   mlx5e_complete_dim_work(dim, cur_moder, rq->mdev, >cq.mcq);
+}
 
-   dim->state = NET_DIM_START_MEASURE;
+void mlx5e_tx_dim_work(struct work_struct *work)
+{
+   struct net_dim *dim = container_of(work, struct net_dim, work);
+   struct mlx5e_txqsq *sq = container_of(dim, struct mlx5e_txqsq, dim);
+   struct net_dim_cq_moder cur_moder =
+   net_dim_get_tx_moderation(dim->mode, dim->profile_ix);
+
+   mlx5e_complete_dim_work(dim, cur_moder, sq->cq.mdev, >cq.mcq);
 }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
index 66c71da..4629bf8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
@@ -389,15 +389,20 @@ static int mlx5e_set_channels(struct net_device *dev,
 int mlx5e_ethtool_get_coalesce(struct mlx5e_priv *priv,
   struct ethtool_coalesce *coal)
 {
+   struct net_dim_cq_moder *rx_moder, *tx_moder;
+
if (!MLX5_CAP_GEN(priv->mdev, cq_moderation))
return -EOPNOTSUPP;
 
-   coal->rx_coalesce_usecs   = 
priv->channels.params.rx_cq_m

[PATCH net-next V2 0/4] Introduce adaptive TX interrupt moderation to net DIM

2018-04-02 Thread Tal Gilboa
Net DIM is a library designed for dynamic interrupt moderation. It was
implemented and optimized with receive side interrupts in mind, since these
are usually the CPU expensive ones. This patch-set introduces adaptive transmit
interrupt moderation to net DIM, complete with a usage in the mlx5e driver.
Using adaptive TX behavior would reduce interrupt rate for multiple scenarios.
Furthermore, it is essential for increasing bandwidth on cases where payload
aggregation is required.

v2: Rebased over proper tree.

v1: Fix compilation issues due to missed function renaming.

Tal Gilboa (4):
  net/dim: Rename *_get_profile() functions to *_get_rx_moderation()
  net/dim: Add "enabled" field to net_dim struct
  net/dim: Support adaptive TX moderation
  net/mlx5e: Enable adaptive-TX moderation

 drivers/net/ethernet/broadcom/bcmsysport.c |  6 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt_dim.c  |  8 +--
 drivers/net/ethernet/broadcom/genet/bcmgenet.c |  6 +-
 drivers/net/ethernet/mellanox/mlx5/core/en.h   |  5 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_dim.c   | 28 ++---
 .../net/ethernet/mellanox/mlx5/core/en_ethtool.c   | 35 +++
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  | 34 --
 drivers/net/ethernet/mellanox/mlx5/core/en_rep.c   |  2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c  | 37 ---
 include/linux/net_dim.h| 72 +-
 10 files changed, 173 insertions(+), 60 deletions(-)

-- 
1.8.3.1



Re: [PATCH v5 03/14] PCI: Add pcie_bandwidth_capable() to compute max supported link bandwidth

2018-04-02 Thread Tal Gilboa

On 4/2/2018 3:40 AM, Bjorn Helgaas wrote:

On Sun, Apr 01, 2018 at 11:38:53PM +0300, Tal Gilboa wrote:

On 3/31/2018 12:05 AM, Bjorn Helgaas wrote:

From: Tal Gilboa <ta...@mellanox.com>

Add pcie_bandwidth_capable() to compute the max link bandwidth supported by
a device, based on the max link speed and width, adjusted by the encoding
overhead.

The maximum bandwidth of the link is computed as:

max_link_speed * max_link_width * (1 - encoding_overhead)

The encoding overhead is about 20% for 2.5 and 5.0 GT/s links using 8b/10b
encoding, and about 1.5% for 8 GT/s or higher speed links using 128b/130b
encoding.

Signed-off-by: Tal Gilboa <ta...@mellanox.com>
[bhelgaas: adjust for pcie_get_speed_cap() and pcie_get_width_cap()
signatures, don't export outside drivers/pci]
Signed-off-by: Bjorn Helgaas <bhelg...@google.com>
Reviewed-by: Tariq Toukan <tar...@mellanox.com>
---
   drivers/pci/pci.c |   21 +
   drivers/pci/pci.h |9 +
   2 files changed, 30 insertions(+)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 43075be79388..9ce89e254197 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -5208,6 +5208,27 @@ enum pcie_link_width pcie_get_width_cap(struct pci_dev 
*dev)
return PCIE_LNK_WIDTH_UNKNOWN;
   }
+/**
+ * pcie_bandwidth_capable - calculates a PCI device's link bandwidth capability
+ * @dev: PCI device
+ * @speed: storage for link speed
+ * @width: storage for link width
+ *
+ * Calculate a PCI device's link bandwidth by querying for its link speed
+ * and width, multiplying them, and applying encoding overhead.
+ */
+u32 pcie_bandwidth_capable(struct pci_dev *dev, enum pci_bus_speed *speed,
+  enum pcie_link_width *width)
+{
+   *speed = pcie_get_speed_cap(dev);
+   *width = pcie_get_width_cap(dev);
+
+   if (*speed == PCI_SPEED_UNKNOWN || *width == PCIE_LNK_WIDTH_UNKNOWN)
+   return 0;
+
+   return *width * PCIE_SPEED2MBS_ENC(*speed);
+}
+
   /**
* pci_select_bars - Make BAR mask from the type of resource
* @dev: the PCI device for which BAR mask is made
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 66738f1050c0..2a50172b9803 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -261,8 +261,17 @@ void pci_disable_bridge_window(struct pci_dev *dev);
 (speed) == PCIE_SPEED_2_5GT ? "2.5 GT/s" : \
 "Unknown speed")
+/* PCIe speed to Mb/s with encoding overhead: 20% for gen2, ~1.5% for gen3 */
+#define PCIE_SPEED2MBS_ENC(speed) \


Missing gen4.


I made it "gen3+".  I think that's accurate, isn't it?  The spec
doesn't seem to actually use "gen3" as a specific term, but sec 4.2.2
says rates of 8 GT/s or higher (which I think includes gen3 and gen4)
use 128b/130b encoding.



I meant that PCIE_SPEED_16_0GT will return 0 from this macro since it 
wasn't added. Need to return 15754.


Re: [PATCH v5 04/14] PCI: Add pcie_bandwidth_available() to compute bandwidth available to device

2018-04-01 Thread Tal Gilboa

On 3/31/2018 12:05 AM, Bjorn Helgaas wrote:

From: Tal Gilboa <ta...@mellanox.com>

Add pcie_bandwidth_available() to compute the bandwidth available to a
device.  This may be limited by the device itself or by a slower upstream
link leading to the device.

The available bandwidth at each link along the path is computed as:

   link_speed * link_width * (1 - encoding_overhead)

The encoding overhead is about 20% for 2.5 and 5.0 GT/s links using 8b/10b
encoding, and about 1.5% for 8 GT/s or higher speed links using 128b/130b
encoding.

Also return the device with the slowest link and the speed and width of
that link.

Signed-off-by: Tal Gilboa <ta...@mellanox.com>
[bhelgaas: changelog, leave pcie_get_minimum_link() alone for now, return
bw directly, use pci_upstream_bridge(), check "next_bw <= bw" to find
uppermost limiting device, return speed/width of the limiting device]
Signed-off-by: Bjorn Helgaas <bhelg...@google.com>
---
  drivers/pci/pci.c   |   54 +++
  include/linux/pci.h |3 +++
  2 files changed, 57 insertions(+)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 9ce89e254197..e00d56b12747 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -5146,6 +5146,60 @@ int pcie_get_minimum_link(struct pci_dev *dev, enum 
pci_bus_speed *speed,
  }
  EXPORT_SYMBOL(pcie_get_minimum_link);
  
+/**

+ * pcie_bandwidth_available - determine minimum link settings of a PCIe
+ *   device and its bandwidth limitation
+ * @dev: PCI device to query
+ * @limiting_dev: storage for device causing the bandwidth limitation
+ * @speed: storage for speed of limiting device
+ * @width: storage for width of limiting device
+ *
+ * Walk up the PCI device chain and find the point where the minimum
+ * bandwidth is available.  Return the bandwidth available there and (if
+ * limiting_dev, speed, and width pointers are supplied) information about
+ * that point.
+ */
+u32 pcie_bandwidth_available(struct pci_dev *dev, struct pci_dev 
**limiting_dev,
+enum pci_bus_speed *speed,
+enum pcie_link_width *width)
+{
+   u16 lnksta;
+   enum pci_bus_speed next_speed;
+   enum pcie_link_width next_width;
+   u32 bw, next_bw;
+
+   *speed = PCI_SPEED_UNKNOWN;
+   *width = PCIE_LNK_WIDTH_UNKNOWN;


This is not safe anymore, now that we allow speed/width=NULL.


+   bw = 0;
+
+   while (dev) {
+   pcie_capability_read_word(dev, PCI_EXP_LNKSTA, );
+
+   next_speed = pcie_link_speed[lnksta & PCI_EXP_LNKSTA_CLS];
+   next_width = (lnksta & PCI_EXP_LNKSTA_NLW) >>
+   PCI_EXP_LNKSTA_NLW_SHIFT;
+
+   next_bw = next_width * PCIE_SPEED2MBS_ENC(next_speed);
+
+   /* Check if current device limits the total bandwidth */
+   if (!bw || next_bw <= bw) {
+   bw = next_bw;
+
+   if (limiting_dev)
+   *limiting_dev = dev;
+   if (speed)
+   *speed = next_speed;
+   if (width)
+   *width = next_width;
+   }
+
+   dev = pci_upstream_bridge(dev);
+   }
+
+   return bw;
+}
+EXPORT_SYMBOL(pcie_bandwidth_available);
+
  /**
   * pcie_get_speed_cap - query for the PCI device's link speed capability
   * @dev: PCI device to query
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 8043a5937ad0..f2bf2b7a66c7 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1083,6 +1083,9 @@ int pcie_get_mps(struct pci_dev *dev);
  int pcie_set_mps(struct pci_dev *dev, int mps);
  int pcie_get_minimum_link(struct pci_dev *dev, enum pci_bus_speed *speed,
  enum pcie_link_width *width);
+u32 pcie_bandwidth_available(struct pci_dev *dev, struct pci_dev 
**limiting_dev,
+enum pci_bus_speed *speed,
+enum pcie_link_width *width);
  void pcie_flr(struct pci_dev *dev);
  int __pci_reset_function_locked(struct pci_dev *dev);
  int pci_reset_function(struct pci_dev *dev);



Re: [PATCH v5 03/14] PCI: Add pcie_bandwidth_capable() to compute max supported link bandwidth

2018-04-01 Thread Tal Gilboa

On 3/31/2018 12:05 AM, Bjorn Helgaas wrote:

From: Tal Gilboa <ta...@mellanox.com>

Add pcie_bandwidth_capable() to compute the max link bandwidth supported by
a device, based on the max link speed and width, adjusted by the encoding
overhead.

The maximum bandwidth of the link is computed as:

   max_link_speed * max_link_width * (1 - encoding_overhead)

The encoding overhead is about 20% for 2.5 and 5.0 GT/s links using 8b/10b
encoding, and about 1.5% for 8 GT/s or higher speed links using 128b/130b
encoding.

Signed-off-by: Tal Gilboa <ta...@mellanox.com>
[bhelgaas: adjust for pcie_get_speed_cap() and pcie_get_width_cap()
signatures, don't export outside drivers/pci]
Signed-off-by: Bjorn Helgaas <bhelg...@google.com>
Reviewed-by: Tariq Toukan <tar...@mellanox.com>
---
  drivers/pci/pci.c |   21 +
  drivers/pci/pci.h |9 +
  2 files changed, 30 insertions(+)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 43075be79388..9ce89e254197 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -5208,6 +5208,27 @@ enum pcie_link_width pcie_get_width_cap(struct pci_dev 
*dev)
return PCIE_LNK_WIDTH_UNKNOWN;
  }
  
+/**

+ * pcie_bandwidth_capable - calculates a PCI device's link bandwidth capability
+ * @dev: PCI device
+ * @speed: storage for link speed
+ * @width: storage for link width
+ *
+ * Calculate a PCI device's link bandwidth by querying for its link speed
+ * and width, multiplying them, and applying encoding overhead.
+ */
+u32 pcie_bandwidth_capable(struct pci_dev *dev, enum pci_bus_speed *speed,
+  enum pcie_link_width *width)
+{
+   *speed = pcie_get_speed_cap(dev);
+   *width = pcie_get_width_cap(dev);
+
+   if (*speed == PCI_SPEED_UNKNOWN || *width == PCIE_LNK_WIDTH_UNKNOWN)
+   return 0;
+
+   return *width * PCIE_SPEED2MBS_ENC(*speed);
+}
+
  /**
   * pci_select_bars - Make BAR mask from the type of resource
   * @dev: the PCI device for which BAR mask is made
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 66738f1050c0..2a50172b9803 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -261,8 +261,17 @@ void pci_disable_bridge_window(struct pci_dev *dev);
 (speed) == PCIE_SPEED_2_5GT ? "2.5 GT/s" : \
 "Unknown speed")
  
+/* PCIe speed to Mb/s with encoding overhead: 20% for gen2, ~1.5% for gen3 */

+#define PCIE_SPEED2MBS_ENC(speed) \


Missing gen4.


+   ((speed) == PCIE_SPEED_8_0GT ? 7877 : \
+(speed) == PCIE_SPEED_5_0GT ? 4000 : \
+(speed) == PCIE_SPEED_2_5GT ? 2000 : \
+0)
+
  enum pci_bus_speed pcie_get_speed_cap(struct pci_dev *dev);
  enum pcie_link_width pcie_get_width_cap(struct pci_dev *dev);
+u32 pcie_bandwidth_capable(struct pci_dev *dev, enum pci_bus_speed *speed,
+  enum pcie_link_width *width);
  
  /* Single Root I/O Virtualization */

  struct pci_sriov {



Re: [PATCH net-next] net/broadcom: Fixup broken build due to function name change

2018-04-01 Thread Tal Gilboa

On 4/1/2018 7:33 PM, Florian Fainelli wrote:

Le 03/31/18 à 23:48, Tal Gilboa a écrit :

Fixes: 8c6d6895bebb ("net/dim: Rename *_get_profile() functions to 
*_get_rx_moderation()")
Signed-off-by: Tal Gilboa <ta...@mellanox.com>


I think David just backed out your entire patch series adding TX DIM so
you would have to incorporate that change into your series and
re-submit. Thanks!


I see. Re-submitted the original series with a fix.
Please ignore this patch.


[PATCH net-next V1 2/4] net/dim: Add "enabled" field to net_dim struct

2018-04-01 Thread Tal Gilboa
Preparation for introducing adaptive TX to net DIM.

Signed-off-by: Tal Gilboa <ta...@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h |  1 -
 drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c | 10 +++---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c|  6 +++---
 drivers/net/ethernet/mellanox/mlx5/core/en_rep.c |  2 +-
 include/linux/net_dim.h  |  2 ++
 5 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 0c2fdf0..3e24864 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -250,7 +250,6 @@ struct mlx5e_params {
u32 indirection_rqt[MLX5E_INDIR_RQT_SIZE];
bool vlan_strip_disable;
bool scatter_fcs_en;
-   bool rx_dim_enabled;
u32 lro_timeout;
u32 pflags;
struct bpf_prog *xdp_prog;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
index c57c929..4b5d022 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
@@ -459,9 +459,10 @@ int mlx5e_ethtool_get_coalesce(struct mlx5e_priv *priv,
 
coal->rx_coalesce_usecs   = 
priv->channels.params.rx_cq_moderation.usec;
coal->rx_max_coalesced_frames = 
priv->channels.params.rx_cq_moderation.pkts;
+   coal->use_adaptive_rx_coalesce =
+   priv->channels.params.rx_cq_moderation.enabled;
coal->tx_coalesce_usecs   = 
priv->channels.params.tx_cq_moderation.usec;
coal->tx_max_coalesced_frames = 
priv->channels.params.tx_cq_moderation.pkts;
-   coal->use_adaptive_rx_coalesce = priv->channels.params.rx_dim_enabled;
 
return 0;
 }
@@ -515,7 +516,8 @@ int mlx5e_ethtool_set_coalesce(struct mlx5e_priv *priv,
new_channels.params.tx_cq_moderation.pkts = 
coal->tx_max_coalesced_frames;
new_channels.params.rx_cq_moderation.usec = coal->rx_coalesce_usecs;
new_channels.params.rx_cq_moderation.pkts = 
coal->rx_max_coalesced_frames;
-   new_channels.params.rx_dim_enabled= 
!!coal->use_adaptive_rx_coalesce;
+   new_channels.params.rx_cq_moderation.enabled =
+   !!coal->use_adaptive_rx_coalesce;
 
if (!test_bit(MLX5E_STATE_OPENED, >state)) {
priv->channels.params = new_channels.params;
@@ -523,7 +525,9 @@ int mlx5e_ethtool_set_coalesce(struct mlx5e_priv *priv,
}
/* we are opened */
 
-   reset = !!coal->use_adaptive_rx_coalesce != 
priv->channels.params.rx_dim_enabled;
+   reset = !!coal->use_adaptive_rx_coalesce !=
+   priv->channels.params.rx_cq_moderation.enabled;
+
if (!reset) {
mlx5e_set_priv_channels_coalesce(priv, coal);
priv->channels.params = new_channels.params;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 8a9670c..e16a705 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -781,7 +781,7 @@ static int mlx5e_open_rq(struct mlx5e_channel *c,
if (err)
goto err_destroy_rq;
 
-   if (params->rx_dim_enabled)
+   if (params->rx_cq_moderation.enabled)
c->rq.state |= BIT(MLX5E_RQ_STATE_AM);
 
return 0;
@@ -4083,7 +4083,7 @@ void mlx5e_set_rx_cq_mode_params(struct mlx5e_params 
*params, u8 cq_period_mode)
params->rx_cq_moderation.usec =
MLX5E_PARAMS_DEFAULT_RX_CQ_MODERATION_USEC_FROM_CQE;
 
-   if (params->rx_dim_enabled) {
+   if (params->rx_cq_moderation.enabled) {
switch (cq_period_mode) {
case MLX5_CQ_PERIOD_MODE_START_FROM_CQE:
params->rx_cq_moderation =
@@ -4155,7 +4155,7 @@ void mlx5e_build_nic_params(struct mlx5_core_dev *mdev,
cq_period_mode = MLX5_CAP_GEN(mdev, cq_period_start_from_cqe) ?
MLX5_CQ_PERIOD_MODE_START_FROM_CQE :
MLX5_CQ_PERIOD_MODE_START_FROM_EQE;
-   params->rx_dim_enabled = MLX5_CAP_GEN(mdev, cq_moderation);
+   params->rx_cq_moderation.enabled = MLX5_CAP_GEN(mdev, cq_moderation);
mlx5e_set_rx_cq_mode_params(params, cq_period_mode);
mlx5e_set_tx_cq_mode_params(params, cq_period_mode);
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
index dd32f3e..f238d4c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
@@ -881,7 +881,7 @@ static void mlx5e_build_rep_params(struct mlx5_core_dev 
*mdev,
param

[PATCH net-next V1 4/4] net/mlx5e: Enable adaptive-TX moderation

2018-04-01 Thread Tal Gilboa
Add support for adaptive TX moderation. This greatly reduces TX interrupt
rate and increases bandwidth, mostly for TCP bandwidth over ARM
architecture (below). There is a slight single stream TCP with very large
message sizes degradation (x86). In this case if there's any moderation on
transmitted packets the bandwidth would reduce due to hitting TCP output
limit. Since this is a synthetic case, this is still worth doing.

Performance improvement (ConnectX-4Lx 40GbE, ARM)
TCP 64B bandwidth with 1-50 streams increased 6-35%.
TCP 64B bandwidth with 100-500 streams increased 20-70%.

Performance improvement (ConnectX-5 100GbE, x86)
Bandwidth: increased up to 40% (1024B with 10s of streams).
Interrupt rate: reduced up to 50% (1024B with 1000s of streams).

Signed-off-by: Tal Gilboa <ta...@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h   |  4 +++
 drivers/net/ethernet/mellanox/mlx5/core/en_dim.c   | 24 +++---
 .../net/ethernet/mellanox/mlx5/core/en_ethtool.c   | 37 ++
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  | 22 +
 drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c  | 37 --
 5 files changed, 96 insertions(+), 28 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 3e24864..e18574d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -338,6 +338,7 @@ enum {
MLX5E_SQ_STATE_IPSEC,
MLX5E_SQ_STATE_TLS,
MLX5E_SQ_STATE_RECOVERING,
+   MLX5E_SQ_STATE_AM,
 };
 
 struct mlx5e_sq_wqe_info {
@@ -350,6 +351,7 @@ struct mlx5e_txqsq {
/* dirtied @completion */
u16cc;
u32dma_fifo_cc;
+   struct net_dim dim; /* Adaptive Moderation */
 
/* dirtied @xmit */
u16pc cacheline_aligned_in_smp;
@@ -387,6 +389,7 @@ struct mlx5e_txqsq {
struct work_struct recover_work;
u64last_recover;
} recover;
+
 } cacheline_aligned_in_smp;
 
 struct mlx5e_xdpsq {
@@ -1136,4 +1139,5 @@ void mlx5e_build_nic_params(struct mlx5_core_dev *mdev,
u16 max_channels);
 u8 mlx5e_params_calculate_tx_min_inline(struct mlx5_core_dev *mdev);
 void mlx5e_rx_dim_work(struct work_struct *work);
+void mlx5e_tx_dim_work(struct work_struct *work);
 #endif /* __MLX5_EN_H__ */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_dim.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_dim.c
index 1b286e1..9cec351 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_dim.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_dim.c
@@ -33,16 +33,30 @@
 #include 
 #include "en.h"
 
+static inline void
+mlx5e_complete_dim_work(struct net_dim *dim, struct net_dim_cq_moder moder,
+   struct mlx5_core_dev *mdev, struct mlx5_core_cq *mcq)
+{
+   mlx5_core_modify_cq_moderation(mdev, mcq, moder.usec, moder.pkts);
+   dim->state = NET_DIM_START_MEASURE;
+}
+
 void mlx5e_rx_dim_work(struct work_struct *work)
 {
-   struct net_dim *dim = container_of(work, struct net_dim,
-  work);
+   struct net_dim *dim = container_of(work, struct net_dim, work);
struct mlx5e_rq *rq = container_of(dim, struct mlx5e_rq, dim);
struct net_dim_cq_moder cur_moder =
net_dim_get_rx_moderation(dim->mode, dim->profile_ix);
 
-   mlx5_core_modify_cq_moderation(rq->mdev, >cq.mcq,
-  cur_moder.usec, cur_moder.pkts);
+   mlx5e_complete_dim_work(dim, cur_moder, rq->mdev, >cq.mcq);
+}
 
-   dim->state = NET_DIM_START_MEASURE;
+void mlx5e_tx_dim_work(struct work_struct *work)
+{
+   struct net_dim *dim = container_of(work, struct net_dim, work);
+   struct mlx5e_txqsq *sq = container_of(dim, struct mlx5e_txqsq, dim);
+   struct net_dim_cq_moder cur_moder =
+   net_dim_get_tx_moderation(dim->mode, dim->profile_ix);
+
+   mlx5e_complete_dim_work(dim, cur_moder, sq->cq.mdev, >cq.mcq);
 }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
index 4b5d022..05faaf7 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
@@ -454,15 +454,20 @@ static int mlx5e_set_channels(struct net_device *dev,
 int mlx5e_ethtool_get_coalesce(struct mlx5e_priv *priv,
   struct ethtool_coalesce *coal)
 {
+   struct net_dim_cq_moder *rx_moder, *tx_moder;
+
if (!MLX5_CAP_GEN(priv->mdev, cq_moderation))
return -EOPNOTSUPP;
 
-   coal->rx_coalesce_usecs   = 
priv->channels.params.rx_cq_moderation.usec;
-   coal->rx_max

[PATCH net-next V1 1/4] net/dim: Rename *_get_profile() functions to *_get_rx_moderation()

2018-04-01 Thread Tal Gilboa
Preparation for introducing adaptive TX to net DIM.

Signed-off-by: Tal Gilboa <ta...@mellanox.com>
---
 drivers/net/ethernet/broadcom/bcmsysport.c|  6 +++---
 drivers/net/ethernet/broadcom/bnxt/bnxt_dim.c |  8 
 drivers/net/ethernet/broadcom/genet/bcmgenet.c|  6 +++---
 drivers/net/ethernet/mellanox/mlx5/core/en_dim.c  |  6 +++---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c |  6 --
 include/linux/net_dim.h   | 12 ++--
 6 files changed, 23 insertions(+), 21 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bcmsysport.c 
b/drivers/net/ethernet/broadcom/bcmsysport.c
index 4a75b1d..98c5183 100644
--- a/drivers/net/ethernet/broadcom/bcmsysport.c
+++ b/drivers/net/ethernet/broadcom/bcmsysport.c
@@ -654,7 +654,7 @@ static int bcm_sysport_set_coalesce(struct net_device *dev,
pkts = priv->rx_max_coalesced_frames;
 
if (ec->use_adaptive_rx_coalesce && !priv->dim.use_dim) {
-   moder = net_dim_get_def_profile(priv->dim.dim.mode);
+   moder = net_dim_get_def_rx_moderation(priv->dim.dim.mode);
usecs = moder.usec;
pkts = moder.pkts;
}
@@ -1064,7 +1064,7 @@ static void bcm_sysport_dim_work(struct work_struct *work)
struct bcm_sysport_priv *priv =
container_of(ndim, struct bcm_sysport_priv, dim);
struct net_dim_cq_moder cur_profile =
-   net_dim_get_profile(dim->mode, dim->profile_ix);
+   net_dim_get_rx_moderation(dim->mode, 
dim->profile_ix);
 
bcm_sysport_set_rx_coalesce(priv, cur_profile.usec, cur_profile.pkts);
dim->state = NET_DIM_START_MEASURE;
@@ -1436,7 +1436,7 @@ static void bcm_sysport_init_rx_coalesce(struct 
bcm_sysport_priv *priv)
 
/* If DIM was enabled, re-apply default parameters */
if (dim->use_dim) {
-   moder = net_dim_get_def_profile(dim->dim.mode);
+   moder = net_dim_get_def_rx_moderation(dim->dim.mode);
usecs = moder.usec;
pkts = moder.pkts;
}
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_dim.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt_dim.c
index 408dd19..afa97c8 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_dim.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_dim.c
@@ -21,11 +21,11 @@ void bnxt_dim_work(struct work_struct *work)
struct bnxt_napi *bnapi = container_of(cpr,
   struct bnxt_napi,
   cp_ring);
-   struct net_dim_cq_moder cur_profile = net_dim_get_profile(dim->mode,
- 
dim->profile_ix);
+   struct net_dim_cq_moder cur_moder =
+   net_dim_get_rx_moderation(dim->mode, dim->profile_ix);
 
-   cpr->rx_ring_coal.coal_ticks = cur_profile.usec;
-   cpr->rx_ring_coal.coal_bufs = cur_profile.pkts;
+   cpr->rx_ring_coal.coal_ticks = cur_moder.usec;
+   cpr->rx_ring_coal.coal_bufs = cur_moder.pkts;
 
bnxt_hwrm_set_ring_coal(bnapi->bp, bnapi);
dim->state = NET_DIM_START_MEASURE;
diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.c 
b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
index f8af472..ccd9205 100644
--- a/drivers/net/ethernet/broadcom/genet/bcmgenet.c
+++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
@@ -652,7 +652,7 @@ static void bcmgenet_set_ring_rx_coalesce(struct 
bcmgenet_rx_ring *ring,
pkts = ring->rx_max_coalesced_frames;
 
if (ec->use_adaptive_rx_coalesce && !ring->dim.use_dim) {
-   moder = net_dim_get_def_profile(ring->dim.dim.mode);
+   moder = net_dim_get_def_rx_moderation(ring->dim.dim.mode);
usecs = moder.usec;
pkts = moder.pkts;
}
@@ -1924,7 +1924,7 @@ static void bcmgenet_dim_work(struct work_struct *work)
struct bcmgenet_rx_ring *ring =
container_of(ndim, struct bcmgenet_rx_ring, dim);
struct net_dim_cq_moder cur_profile =
-   net_dim_get_profile(dim->mode, dim->profile_ix);
+   net_dim_get_rx_moderation(dim->mode, dim->profile_ix);
 
bcmgenet_set_rx_coalesce(ring, cur_profile.usec, cur_profile.pkts);
dim->state = NET_DIM_START_MEASURE;
@@ -2101,7 +2101,7 @@ static void bcmgenet_init_rx_coalesce(struct 
bcmgenet_rx_ring *ring)
 
/* If DIM was enabled, re-apply default parameters */
if (dim->use_dim) {
-   moder = net_dim_get_def_profile(dim->dim.mode);
+   moder = net_dim_get_def_rx_moderation(dim->dim.mode);
usecs = moder.usec;
pkts = moder.pkts;
}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_dim.c 
b/dri

[PATCH net-next V1 0/4] Introduce adaptive TX interrupt moderation to net DIM

2018-04-01 Thread Tal Gilboa
Net DIM is a library designed for dynamic interrupt moderation. It was
implemented and optimized with receive side interrupts in mind, since these
are usually the CPU expensive ones. This patch-set introduces adaptive transmit
interrupt moderation to net DIM, complete with a usage in the mlx5e driver.
Using adaptive TX behavior would reduce interrupt rate for multiple scenarios.
Furthermore, it is essential for increasing bandwidth on cases where payload
aggregation is required.

In order to avoid conflicts, this patch-set is rebased over Florian Fainelli's
"[PATCH net-next v2 3/3] net: bcmgenet: Fix coalescing settings handling" patch.

v1: Fix compilation issues due to missed function renaming.

Tal Gilboa (4):
  net/dim: Rename *_get_profile() functions to *_get_rx_moderation()
  net/dim: Add "enabled" field to net_dim struct
  net/dim: Support adaptive TX moderation
  net/mlx5e: Enable adaptive-TX moderation

 drivers/net/ethernet/broadcom/bcmsysport.c |  6 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt_dim.c  |  8 +--
 drivers/net/ethernet/broadcom/genet/bcmgenet.c |  6 +-
 drivers/net/ethernet/mellanox/mlx5/core/en.h   |  5 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_dim.c   | 28 ++---
 .../net/ethernet/mellanox/mlx5/core/en_ethtool.c   | 35 +++
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  | 34 --
 drivers/net/ethernet/mellanox/mlx5/core/en_rep.c   |  2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c  | 37 ---
 include/linux/net_dim.h| 72 +-
 10 files changed, 173 insertions(+), 60 deletions(-)

-- 
1.8.3.1



[PATCH net-next V1 3/4] net/dim: Support adaptive TX moderation

2018-04-01 Thread Tal Gilboa
Interrupt moderation for TX traffic requires different profiles than RX
interrupt moderation. The main goal here is to reduce interrupt rate and
allow better payload aggregation by keeping SKBs in the TX queue a bit
longer. Ping-pong behavior would get a profile with a short timer, so
latency wouldn't increase for these scenarios. There's a slight degradtion
in bandwidth for single stream with large message sizes, since
net.ipv4.tcp_limit_output_bytes is limiting the allowed TX traffic, but
with many streams performance is always improved.

Performance improvements (ConnectX-5 100GbE)
Bandwidth: increased up to 40% (1024B with 10s of streams).
Interrupt rate: reduced up to 50% (1024B with 1000s of streams).

Performance degradation (ConnectX-5 100GbE)
Bandwidth: up to 10% decrease single stream TCP (1MB message size from
51Gb/s to 47Gb/s).

*Both cases with TX EQE based moderation enabled.

Signed-off-by: Tal Gilboa <ta...@mellanox.com>
---
 include/linux/net_dim.h | 64 +++--
 1 file changed, 51 insertions(+), 13 deletions(-)

diff --git a/include/linux/net_dim.h b/include/linux/net_dim.h
index d34fbfe..9449a61 100644
--- a/include/linux/net_dim.h
+++ b/include/linux/net_dim.h
@@ -104,11 +104,12 @@ enum {
 #define NET_DIM_PARAMS_NUM_PROFILES 5
 /* Adaptive moderation profiles */
 #define NET_DIM_DEFAULT_RX_CQ_MODERATION_PKTS_FROM_EQE 256
+#define NET_DIM_DEFAULT_TX_CQ_MODERATION_PKTS_FROM_EQE 128
 #define NET_DIM_DEF_PROFILE_CQE 1
 #define NET_DIM_DEF_PROFILE_EQE 1
 
 /* All profiles sizes must be NET_PARAMS_DIM_NUM_PROFILES */
-#define NET_DIM_EQE_PROFILES { \
+#define NET_DIM_RX_EQE_PROFILES { \
{1,   NET_DIM_DEFAULT_RX_CQ_MODERATION_PKTS_FROM_EQE}, \
{8,   NET_DIM_DEFAULT_RX_CQ_MODERATION_PKTS_FROM_EQE}, \
{64,  NET_DIM_DEFAULT_RX_CQ_MODERATION_PKTS_FROM_EQE}, \
@@ -116,7 +117,7 @@ enum {
{256, NET_DIM_DEFAULT_RX_CQ_MODERATION_PKTS_FROM_EQE}, \
 }
 
-#define NET_DIM_CQE_PROFILES { \
+#define NET_DIM_RX_CQE_PROFILES { \
{2,  256}, \
{8,  128}, \
{16, 64},  \
@@ -124,16 +125,38 @@ enum {
{64, 64}   \
 }
 
+#define NET_DIM_TX_EQE_PROFILES { \
+   {1,   NET_DIM_DEFAULT_TX_CQ_MODERATION_PKTS_FROM_EQE},  \
+   {8,   NET_DIM_DEFAULT_TX_CQ_MODERATION_PKTS_FROM_EQE},  \
+   {32,  NET_DIM_DEFAULT_TX_CQ_MODERATION_PKTS_FROM_EQE},  \
+   {64,  NET_DIM_DEFAULT_TX_CQ_MODERATION_PKTS_FROM_EQE},  \
+   {128, NET_DIM_DEFAULT_TX_CQ_MODERATION_PKTS_FROM_EQE}   \
+}
+
+#define NET_DIM_TX_CQE_PROFILES { \
+   {5,  128},  \
+   {8,  64},  \
+   {16, 32},  \
+   {32, 32},  \
+   {64, 32}   \
+}
+
+static const struct net_dim_cq_moder
+rx_profile[NET_DIM_CQ_PERIOD_NUM_MODES][NET_DIM_PARAMS_NUM_PROFILES] = {
+   NET_DIM_RX_EQE_PROFILES,
+   NET_DIM_RX_CQE_PROFILES,
+};
+
 static const struct net_dim_cq_moder
-profile[NET_DIM_CQ_PERIOD_NUM_MODES][NET_DIM_PARAMS_NUM_PROFILES] = {
-   NET_DIM_EQE_PROFILES,
-   NET_DIM_CQE_PROFILES,
+tx_profile[NET_DIM_CQ_PERIOD_NUM_MODES][NET_DIM_PARAMS_NUM_PROFILES] = {
+   NET_DIM_TX_EQE_PROFILES,
+   NET_DIM_TX_CQE_PROFILES,
 };
 
 static inline struct net_dim_cq_moder
 net_dim_get_rx_moderation(u8 cq_period_mode, int ix)
 {
-   struct net_dim_cq_moder cq_moder = profile[cq_period_mode][ix];
+   struct net_dim_cq_moder cq_moder = rx_profile[cq_period_mode][ix];
 
cq_moder.cq_period_mode = cq_period_mode;
cq_moder.enabled = true;
@@ -141,16 +164,31 @@ enum {
 }
 
 static inline struct net_dim_cq_moder
-net_dim_get_def_rx_moderation(u8 rx_cq_period_mode)
+net_dim_get_def_rx_moderation(u8 cq_period_mode)
 {
-   int default_profile_ix;
+   u8 profile_ix = cq_period_mode == NET_DIM_CQ_PERIOD_MODE_START_FROM_CQE 
?
+   NET_DIM_DEF_PROFILE_CQE : NET_DIM_DEF_PROFILE_EQE;
 
-   if (rx_cq_period_mode == NET_DIM_CQ_PERIOD_MODE_START_FROM_CQE)
-   default_profile_ix = NET_DIM_DEF_PROFILE_CQE;
-   else /* NET_DIM_CQ_PERIOD_MODE_START_FROM_EQE */
-   default_profile_ix = NET_DIM_DEF_PROFILE_EQE;
+   return net_dim_get_rx_moderation(cq_period_mode, profile_ix);
+}
+
+static inline struct net_dim_cq_moder
+net_dim_get_tx_moderation(u8 cq_period_mode, int ix)
+{
+   struct net_dim_cq_moder cq_moder = tx_profile[cq_period_mode][ix];
+
+   cq_moder.cq_period_mode = cq_period_mode;
+   cq_moder.enabled = true;
+   return cq_moder;
+}
+
+static inline struct net_dim_cq_moder
+net_dim_get_def_tx_moderation(u8 cq_period_mode)
+{
+   u8 profile_ix = cq_period_mode == NET_DIM_CQ_PERIOD_MODE_START_FROM_CQE 
?
+   NET_DIM_DEF_PROFILE_CQE : NET_DIM_DEF_PROFILE_EQE;
 
-   return net_dim_get_rx_moderation(rx_cq_period_mode, default_profile_ix);
+   return net_dim_get_tx_moderation(cq_period_mode, profile_ix);
 }
 
 static inline bool net_dim_on_top(struct net_di

Re: [PATCH net-next 0/4] Introduce adaptive TX interrupt moderation to net DIM

2018-04-01 Thread Tal Gilboa

On 4/1/2018 6:19 AM, David Miller wrote:

From: David Miller <da...@davemloft.net>
Date: Sat, 31 Mar 2018 22:02:55 -0400 (EDT)


From: Tal Gilboa <ta...@mellanox.com>
Date: Fri, 30 Mar 2018 09:37:29 +0300


Net DIM is a library designed for dynamic interrupt moderation. It was
implemented and optimized with receive side interrupts in mind, since these
are usually the CPU expensive ones. This patch-set introduces adaptive transmit
interrupt moderation to net DIM, complete with a usage in the mlx5e driver.
Using adaptive TX behavior would reduce interrupt rate for multiple scenarios.
Furthermore, it is essential for increasing bandwidth on cases where payload
aggregation is required.


Series applied.


This breaks the build:

drivers/net/ethernet/broadcom/bcmsysport.c: In function 
‘bcm_sysport_set_coalesce’:
drivers/net/ethernet/broadcom/bcmsysport.c:657:11: error: implicit declaration 
of function ‘net_dim_get_def_profile’; did you mean ‘net_dim_exit_parking’? 
[-Werror=implicit-function-declaration]
moder = net_dim_get_def_profile(priv->dim.dim.mode);
^~~


Rebase issues, sorry.
Sent a fixup patch.


[PATCH net-next] net/broadcom: Fixup broken build due to function name change

2018-04-01 Thread Tal Gilboa
Fixes: 8c6d6895bebb ("net/dim: Rename *_get_profile() functions to 
*_get_rx_moderation()")
Signed-off-by: Tal Gilboa <ta...@mellanox.com>
---
 drivers/net/ethernet/broadcom/bcmsysport.c | 4 ++--
 drivers/net/ethernet/broadcom/genet/bcmgenet.c | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bcmsysport.c 
b/drivers/net/ethernet/broadcom/bcmsysport.c
index 53c035f..98c5183 100644
--- a/drivers/net/ethernet/broadcom/bcmsysport.c
+++ b/drivers/net/ethernet/broadcom/bcmsysport.c
@@ -654,7 +654,7 @@ static int bcm_sysport_set_coalesce(struct net_device *dev,
pkts = priv->rx_max_coalesced_frames;
 
if (ec->use_adaptive_rx_coalesce && !priv->dim.use_dim) {
-   moder = net_dim_get_def_profile(priv->dim.dim.mode);
+   moder = net_dim_get_def_rx_moderation(priv->dim.dim.mode);
usecs = moder.usec;
pkts = moder.pkts;
}
@@ -1436,7 +1436,7 @@ static void bcm_sysport_init_rx_coalesce(struct 
bcm_sysport_priv *priv)
 
/* If DIM was enabled, re-apply default parameters */
if (dim->use_dim) {
-   moder = net_dim_get_def_profile(dim->dim.mode);
+   moder = net_dim_get_def_rx_moderation(dim->dim.mode);
usecs = moder.usec;
pkts = moder.pkts;
}
diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.c 
b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
index 1c9c75b..ccd9205 100644
--- a/drivers/net/ethernet/broadcom/genet/bcmgenet.c
+++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
@@ -652,7 +652,7 @@ static void bcmgenet_set_ring_rx_coalesce(struct 
bcmgenet_rx_ring *ring,
pkts = ring->rx_max_coalesced_frames;
 
if (ec->use_adaptive_rx_coalesce && !ring->dim.use_dim) {
-   moder = net_dim_get_def_profile(ring->dim.dim.mode);
+   moder = net_dim_get_def_rx_moderation(ring->dim.dim.mode);
usecs = moder.usec;
pkts = moder.pkts;
}
@@ -2101,7 +2101,7 @@ static void bcmgenet_init_rx_coalesce(struct 
bcmgenet_rx_ring *ring)
 
/* If DIM was enabled, re-apply default parameters */
if (dim->use_dim) {
-   moder = net_dim_get_def_profile(dim->dim.mode);
+   moder = net_dim_get_def_rx_moderation(dim->dim.mode);
usecs = moder.usec;
pkts = moder.pkts;
}
-- 
1.8.3.1



Re: [PATCH net-next 0/4] Introduce adaptive TX interrupt moderation to net DIM

2018-03-30 Thread Tal Gilboa

On 3/30/2018 9:37 AM, Tal Gilboa wrote:

Net DIM is a library designed for dynamic interrupt moderation. It was
implemented and optimized with receive side interrupts in mind, since these
are usually the CPU expensive ones. This patch-set introduces adaptive transmit
interrupt moderation to net DIM, complete with a usage in the mlx5e driver.
Using adaptive TX behavior would reduce interrupt rate for multiple scenarios.
Furthermore, it is essential for increasing bandwidth on cases where payload
aggregation is required.


Important note: In order to avoid conflicts, this patch-set is rebased 
over Florian Fainelli's "[PATCH net-next v2 3/3] net: bcmgenet: Fix 
coalescing settings handling" patch.




Tal Gilboa (4):
   net/dim: Rename *_get_profile() functions to *_get_rx_moderation()
   net/dim: Add "enabled" field to net_dim struct
   net/dim: Support adaptive TX moderation
   net/mlx5e: Enable adaptive-TX moderation

  drivers/net/ethernet/broadcom/bcmsysport.c |  2 +-
  drivers/net/ethernet/broadcom/bnxt/bnxt_dim.c  |  8 +--
  drivers/net/ethernet/broadcom/genet/bcmgenet.c |  2 +-
  drivers/net/ethernet/mellanox/mlx5/core/en.h   |  5 +-
  drivers/net/ethernet/mellanox/mlx5/core/en_dim.c   | 28 ++---
  .../net/ethernet/mellanox/mlx5/core/en_ethtool.c   | 35 +++
  drivers/net/ethernet/mellanox/mlx5/core/en_main.c  | 34 --
  drivers/net/ethernet/mellanox/mlx5/core/en_rep.c   |  2 +-
  drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c  | 37 ---
  include/linux/net_dim.h| 72 +-
  10 files changed, 169 insertions(+), 56 deletions(-)



[PATCH net-next 2/4] net/dim: Add "enabled" field to net_dim struct

2018-03-30 Thread Tal Gilboa
Preparation for introducing adaptive TX to net DIM.

Signed-off-by: Tal Gilboa <ta...@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h |  1 -
 drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c | 10 +++---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c|  6 +++---
 drivers/net/ethernet/mellanox/mlx5/core/en_rep.c |  2 +-
 include/linux/net_dim.h  |  2 ++
 5 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 0c2fdf0..3e24864 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -250,7 +250,6 @@ struct mlx5e_params {
u32 indirection_rqt[MLX5E_INDIR_RQT_SIZE];
bool vlan_strip_disable;
bool scatter_fcs_en;
-   bool rx_dim_enabled;
u32 lro_timeout;
u32 pflags;
struct bpf_prog *xdp_prog;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
index c57c929..4b5d022 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
@@ -459,9 +459,10 @@ int mlx5e_ethtool_get_coalesce(struct mlx5e_priv *priv,
 
coal->rx_coalesce_usecs   = 
priv->channels.params.rx_cq_moderation.usec;
coal->rx_max_coalesced_frames = 
priv->channels.params.rx_cq_moderation.pkts;
+   coal->use_adaptive_rx_coalesce =
+   priv->channels.params.rx_cq_moderation.enabled;
coal->tx_coalesce_usecs   = 
priv->channels.params.tx_cq_moderation.usec;
coal->tx_max_coalesced_frames = 
priv->channels.params.tx_cq_moderation.pkts;
-   coal->use_adaptive_rx_coalesce = priv->channels.params.rx_dim_enabled;
 
return 0;
 }
@@ -515,7 +516,8 @@ int mlx5e_ethtool_set_coalesce(struct mlx5e_priv *priv,
new_channels.params.tx_cq_moderation.pkts = 
coal->tx_max_coalesced_frames;
new_channels.params.rx_cq_moderation.usec = coal->rx_coalesce_usecs;
new_channels.params.rx_cq_moderation.pkts = 
coal->rx_max_coalesced_frames;
-   new_channels.params.rx_dim_enabled= 
!!coal->use_adaptive_rx_coalesce;
+   new_channels.params.rx_cq_moderation.enabled =
+   !!coal->use_adaptive_rx_coalesce;
 
if (!test_bit(MLX5E_STATE_OPENED, >state)) {
priv->channels.params = new_channels.params;
@@ -523,7 +525,9 @@ int mlx5e_ethtool_set_coalesce(struct mlx5e_priv *priv,
}
/* we are opened */
 
-   reset = !!coal->use_adaptive_rx_coalesce != 
priv->channels.params.rx_dim_enabled;
+   reset = !!coal->use_adaptive_rx_coalesce !=
+   priv->channels.params.rx_cq_moderation.enabled;
+
if (!reset) {
mlx5e_set_priv_channels_coalesce(priv, coal);
priv->channels.params = new_channels.params;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 8a9670c..e16a705 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -781,7 +781,7 @@ static int mlx5e_open_rq(struct mlx5e_channel *c,
if (err)
goto err_destroy_rq;
 
-   if (params->rx_dim_enabled)
+   if (params->rx_cq_moderation.enabled)
c->rq.state |= BIT(MLX5E_RQ_STATE_AM);
 
return 0;
@@ -4083,7 +4083,7 @@ void mlx5e_set_rx_cq_mode_params(struct mlx5e_params 
*params, u8 cq_period_mode)
params->rx_cq_moderation.usec =
MLX5E_PARAMS_DEFAULT_RX_CQ_MODERATION_USEC_FROM_CQE;
 
-   if (params->rx_dim_enabled) {
+   if (params->rx_cq_moderation.enabled) {
switch (cq_period_mode) {
case MLX5_CQ_PERIOD_MODE_START_FROM_CQE:
params->rx_cq_moderation =
@@ -4155,7 +4155,7 @@ void mlx5e_build_nic_params(struct mlx5_core_dev *mdev,
cq_period_mode = MLX5_CAP_GEN(mdev, cq_period_start_from_cqe) ?
MLX5_CQ_PERIOD_MODE_START_FROM_CQE :
MLX5_CQ_PERIOD_MODE_START_FROM_EQE;
-   params->rx_dim_enabled = MLX5_CAP_GEN(mdev, cq_moderation);
+   params->rx_cq_moderation.enabled = MLX5_CAP_GEN(mdev, cq_moderation);
mlx5e_set_rx_cq_mode_params(params, cq_period_mode);
mlx5e_set_tx_cq_mode_params(params, cq_period_mode);
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
index dd32f3e..f238d4c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
@@ -881,7 +881,7 @@ static void mlx5e_build_rep_params(struct mlx5_core_dev 
*mdev,
param

[PATCH net-next 3/4] net/dim: Support adaptive TX moderation

2018-03-30 Thread Tal Gilboa
Interrupt moderation for TX traffic requires different profiles than RX
interrupt moderation. The main goal here is to reduce interrupt rate and
allow better payload aggregation by keeping SKBs in the TX queue a bit
longer. Ping-pong behavior would get a profile with a short timer, so
latency wouldn't increase for these scenarios. There's a slight degradation
in bandwidth for single stream with large message sizes, since
net.ipv4.tcp_limit_output_bytes is limiting the allowed TX traffic, but
with many streams performance is always improved.

Performance improvements (ConnectX-5 100GbE)
Bandwidth: increased up to 40% (1024B with 10s of streams).
Interrupt rate: reduced up to 50% (1024B with 1000s of streams).

Performance degradation (ConnectX-5 100GbE)
Bandwidth: up to 10% decrease single stream TCP (1MB message size from
51Gb/s to 47Gb/s).

*Both cases with TX EQE based moderation enabled.

Signed-off-by: Tal Gilboa <ta...@mellanox.com>
---
 include/linux/net_dim.h | 64 +++--
 1 file changed, 51 insertions(+), 13 deletions(-)

diff --git a/include/linux/net_dim.h b/include/linux/net_dim.h
index d34fbfe..9449a61 100644
--- a/include/linux/net_dim.h
+++ b/include/linux/net_dim.h
@@ -104,11 +104,12 @@ enum {
 #define NET_DIM_PARAMS_NUM_PROFILES 5
 /* Adaptive moderation profiles */
 #define NET_DIM_DEFAULT_RX_CQ_MODERATION_PKTS_FROM_EQE 256
+#define NET_DIM_DEFAULT_TX_CQ_MODERATION_PKTS_FROM_EQE 128
 #define NET_DIM_DEF_PROFILE_CQE 1
 #define NET_DIM_DEF_PROFILE_EQE 1
 
 /* All profiles sizes must be NET_PARAMS_DIM_NUM_PROFILES */
-#define NET_DIM_EQE_PROFILES { \
+#define NET_DIM_RX_EQE_PROFILES { \
{1,   NET_DIM_DEFAULT_RX_CQ_MODERATION_PKTS_FROM_EQE}, \
{8,   NET_DIM_DEFAULT_RX_CQ_MODERATION_PKTS_FROM_EQE}, \
{64,  NET_DIM_DEFAULT_RX_CQ_MODERATION_PKTS_FROM_EQE}, \
@@ -116,7 +117,7 @@ enum {
{256, NET_DIM_DEFAULT_RX_CQ_MODERATION_PKTS_FROM_EQE}, \
 }
 
-#define NET_DIM_CQE_PROFILES { \
+#define NET_DIM_RX_CQE_PROFILES { \
{2,  256}, \
{8,  128}, \
{16, 64},  \
@@ -124,16 +125,38 @@ enum {
{64, 64}   \
 }
 
+#define NET_DIM_TX_EQE_PROFILES { \
+   {1,   NET_DIM_DEFAULT_TX_CQ_MODERATION_PKTS_FROM_EQE},  \
+   {8,   NET_DIM_DEFAULT_TX_CQ_MODERATION_PKTS_FROM_EQE},  \
+   {32,  NET_DIM_DEFAULT_TX_CQ_MODERATION_PKTS_FROM_EQE},  \
+   {64,  NET_DIM_DEFAULT_TX_CQ_MODERATION_PKTS_FROM_EQE},  \
+   {128, NET_DIM_DEFAULT_TX_CQ_MODERATION_PKTS_FROM_EQE}   \
+}
+
+#define NET_DIM_TX_CQE_PROFILES { \
+   {5,  128},  \
+   {8,  64},  \
+   {16, 32},  \
+   {32, 32},  \
+   {64, 32}   \
+}
+
+static const struct net_dim_cq_moder
+rx_profile[NET_DIM_CQ_PERIOD_NUM_MODES][NET_DIM_PARAMS_NUM_PROFILES] = {
+   NET_DIM_RX_EQE_PROFILES,
+   NET_DIM_RX_CQE_PROFILES,
+};
+
 static const struct net_dim_cq_moder
-profile[NET_DIM_CQ_PERIOD_NUM_MODES][NET_DIM_PARAMS_NUM_PROFILES] = {
-   NET_DIM_EQE_PROFILES,
-   NET_DIM_CQE_PROFILES,
+tx_profile[NET_DIM_CQ_PERIOD_NUM_MODES][NET_DIM_PARAMS_NUM_PROFILES] = {
+   NET_DIM_TX_EQE_PROFILES,
+   NET_DIM_TX_CQE_PROFILES,
 };
 
 static inline struct net_dim_cq_moder
 net_dim_get_rx_moderation(u8 cq_period_mode, int ix)
 {
-   struct net_dim_cq_moder cq_moder = profile[cq_period_mode][ix];
+   struct net_dim_cq_moder cq_moder = rx_profile[cq_period_mode][ix];
 
cq_moder.cq_period_mode = cq_period_mode;
cq_moder.enabled = true;
@@ -141,16 +164,31 @@ enum {
 }
 
 static inline struct net_dim_cq_moder
-net_dim_get_def_rx_moderation(u8 rx_cq_period_mode)
+net_dim_get_def_rx_moderation(u8 cq_period_mode)
 {
-   int default_profile_ix;
+   u8 profile_ix = cq_period_mode == NET_DIM_CQ_PERIOD_MODE_START_FROM_CQE 
?
+   NET_DIM_DEF_PROFILE_CQE : NET_DIM_DEF_PROFILE_EQE;
 
-   if (rx_cq_period_mode == NET_DIM_CQ_PERIOD_MODE_START_FROM_CQE)
-   default_profile_ix = NET_DIM_DEF_PROFILE_CQE;
-   else /* NET_DIM_CQ_PERIOD_MODE_START_FROM_EQE */
-   default_profile_ix = NET_DIM_DEF_PROFILE_EQE;
+   return net_dim_get_rx_moderation(cq_period_mode, profile_ix);
+}
+
+static inline struct net_dim_cq_moder
+net_dim_get_tx_moderation(u8 cq_period_mode, int ix)
+{
+   struct net_dim_cq_moder cq_moder = tx_profile[cq_period_mode][ix];
+
+   cq_moder.cq_period_mode = cq_period_mode;
+   cq_moder.enabled = true;
+   return cq_moder;
+}
+
+static inline struct net_dim_cq_moder
+net_dim_get_def_tx_moderation(u8 cq_period_mode)
+{
+   u8 profile_ix = cq_period_mode == NET_DIM_CQ_PERIOD_MODE_START_FROM_CQE 
?
+   NET_DIM_DEF_PROFILE_CQE : NET_DIM_DEF_PROFILE_EQE;
 
-   return net_dim_get_rx_moderation(rx_cq_period_mode, default_profile_ix);
+   return net_dim_get_tx_moderation(cq_period_mode, profile_ix);
 }
 
 static inline bool net_dim_on_top(struct net_di

[PATCH net-next 4/4] net/mlx5e: Enable adaptive-TX moderation

2018-03-30 Thread Tal Gilboa
Add support for adaptive TX moderation. This greatly reduces TX interrupt
rate and increases bandwidth, mostly for TCP bandwidth over ARM
architecture (below). There is a slight single stream TCP with very large
message sizes degradation (x86). In this case if there's any moderation on
transmitted packets the bandwidth would reduce due to hitting TCP output
limit. Since this is a synthetic case, this is still worth doing.

Performance improvement (ConnectX-4Lx 40GbE, ARM)
TCP 64B bandwidth with 1-50 streams increased 6-35%.
TCP 64B bandwidth with 100-500 streams increased 20-70%.

Performance improvement (ConnectX-5 100GbE, x86)
Bandwidth: increased up to 40% (1024B with 10s of streams).
Interrupt rate: reduced up to 50% (1024B with 1000s of streams).

Signed-off-by: Tal Gilboa <ta...@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h   |  4 +++
 drivers/net/ethernet/mellanox/mlx5/core/en_dim.c   | 24 +++---
 .../net/ethernet/mellanox/mlx5/core/en_ethtool.c   | 37 ++
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  | 22 +
 drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c  | 37 --
 5 files changed, 96 insertions(+), 28 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 3e24864..e18574d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -338,6 +338,7 @@ enum {
MLX5E_SQ_STATE_IPSEC,
MLX5E_SQ_STATE_TLS,
MLX5E_SQ_STATE_RECOVERING,
+   MLX5E_SQ_STATE_AM,
 };
 
 struct mlx5e_sq_wqe_info {
@@ -350,6 +351,7 @@ struct mlx5e_txqsq {
/* dirtied @completion */
u16cc;
u32dma_fifo_cc;
+   struct net_dim dim; /* Adaptive Moderation */
 
/* dirtied @xmit */
u16pc cacheline_aligned_in_smp;
@@ -387,6 +389,7 @@ struct mlx5e_txqsq {
struct work_struct recover_work;
u64last_recover;
} recover;
+
 } cacheline_aligned_in_smp;
 
 struct mlx5e_xdpsq {
@@ -1136,4 +1139,5 @@ void mlx5e_build_nic_params(struct mlx5_core_dev *mdev,
u16 max_channels);
 u8 mlx5e_params_calculate_tx_min_inline(struct mlx5_core_dev *mdev);
 void mlx5e_rx_dim_work(struct work_struct *work);
+void mlx5e_tx_dim_work(struct work_struct *work);
 #endif /* __MLX5_EN_H__ */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_dim.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_dim.c
index 1b286e1..9cec351 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_dim.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_dim.c
@@ -33,16 +33,30 @@
 #include 
 #include "en.h"
 
+static inline void
+mlx5e_complete_dim_work(struct net_dim *dim, struct net_dim_cq_moder moder,
+   struct mlx5_core_dev *mdev, struct mlx5_core_cq *mcq)
+{
+   mlx5_core_modify_cq_moderation(mdev, mcq, moder.usec, moder.pkts);
+   dim->state = NET_DIM_START_MEASURE;
+}
+
 void mlx5e_rx_dim_work(struct work_struct *work)
 {
-   struct net_dim *dim = container_of(work, struct net_dim,
-  work);
+   struct net_dim *dim = container_of(work, struct net_dim, work);
struct mlx5e_rq *rq = container_of(dim, struct mlx5e_rq, dim);
struct net_dim_cq_moder cur_moder =
net_dim_get_rx_moderation(dim->mode, dim->profile_ix);
 
-   mlx5_core_modify_cq_moderation(rq->mdev, >cq.mcq,
-  cur_moder.usec, cur_moder.pkts);
+   mlx5e_complete_dim_work(dim, cur_moder, rq->mdev, >cq.mcq);
+}
 
-   dim->state = NET_DIM_START_MEASURE;
+void mlx5e_tx_dim_work(struct work_struct *work)
+{
+   struct net_dim *dim = container_of(work, struct net_dim, work);
+   struct mlx5e_txqsq *sq = container_of(dim, struct mlx5e_txqsq, dim);
+   struct net_dim_cq_moder cur_moder =
+   net_dim_get_tx_moderation(dim->mode, dim->profile_ix);
+
+   mlx5e_complete_dim_work(dim, cur_moder, sq->cq.mdev, >cq.mcq);
 }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
index 4b5d022..05faaf7 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
@@ -454,15 +454,20 @@ static int mlx5e_set_channels(struct net_device *dev,
 int mlx5e_ethtool_get_coalesce(struct mlx5e_priv *priv,
   struct ethtool_coalesce *coal)
 {
+   struct net_dim_cq_moder *rx_moder, *tx_moder;
+
if (!MLX5_CAP_GEN(priv->mdev, cq_moderation))
return -EOPNOTSUPP;
 
-   coal->rx_coalesce_usecs   = 
priv->channels.params.rx_cq_moderation.usec;
-   coal->rx_max

[PATCH net-next 0/4] Introduce adaptive TX interrupt moderation to net DIM

2018-03-30 Thread Tal Gilboa
Net DIM is a library designed for dynamic interrupt moderation. It was
implemented and optimized with receive side interrupts in mind, since these
are usually the CPU expensive ones. This patch-set introduces adaptive transmit
interrupt moderation to net DIM, complete with a usage in the mlx5e driver.
Using adaptive TX behavior would reduce interrupt rate for multiple scenarios.
Furthermore, it is essential for increasing bandwidth on cases where payload
aggregation is required.

Tal Gilboa (4):
  net/dim: Rename *_get_profile() functions to *_get_rx_moderation()
  net/dim: Add "enabled" field to net_dim struct
  net/dim: Support adaptive TX moderation
  net/mlx5e: Enable adaptive-TX moderation

 drivers/net/ethernet/broadcom/bcmsysport.c |  2 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt_dim.c  |  8 +--
 drivers/net/ethernet/broadcom/genet/bcmgenet.c |  2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en.h   |  5 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_dim.c   | 28 ++---
 .../net/ethernet/mellanox/mlx5/core/en_ethtool.c   | 35 +++
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  | 34 --
 drivers/net/ethernet/mellanox/mlx5/core/en_rep.c   |  2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c  | 37 ---
 include/linux/net_dim.h| 72 +-
 10 files changed, 169 insertions(+), 56 deletions(-)

-- 
1.8.3.1



[PATCH net-next 1/4] net/dim: Rename *_get_profile() functions to *_get_rx_moderation()

2018-03-30 Thread Tal Gilboa
Preparation for introducing adaptive TX to net DIM.

Signed-off-by: Tal Gilboa <ta...@mellanox.com>
---
 drivers/net/ethernet/broadcom/bcmsysport.c|  2 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt_dim.c |  8 
 drivers/net/ethernet/broadcom/genet/bcmgenet.c|  2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_dim.c  |  6 +++---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c |  6 --
 include/linux/net_dim.h   | 12 ++--
 6 files changed, 19 insertions(+), 17 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bcmsysport.c 
b/drivers/net/ethernet/broadcom/bcmsysport.c
index 4a75b1d..53c035f 100644
--- a/drivers/net/ethernet/broadcom/bcmsysport.c
+++ b/drivers/net/ethernet/broadcom/bcmsysport.c
@@ -1064,7 +1064,7 @@ static void bcm_sysport_dim_work(struct work_struct *work)
struct bcm_sysport_priv *priv =
container_of(ndim, struct bcm_sysport_priv, dim);
struct net_dim_cq_moder cur_profile =
-   net_dim_get_profile(dim->mode, dim->profile_ix);
+   net_dim_get_rx_moderation(dim->mode, 
dim->profile_ix);
 
bcm_sysport_set_rx_coalesce(priv, cur_profile.usec, cur_profile.pkts);
dim->state = NET_DIM_START_MEASURE;
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_dim.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt_dim.c
index 408dd19..afa97c8 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_dim.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_dim.c
@@ -21,11 +21,11 @@ void bnxt_dim_work(struct work_struct *work)
struct bnxt_napi *bnapi = container_of(cpr,
   struct bnxt_napi,
   cp_ring);
-   struct net_dim_cq_moder cur_profile = net_dim_get_profile(dim->mode,
- 
dim->profile_ix);
+   struct net_dim_cq_moder cur_moder =
+   net_dim_get_rx_moderation(dim->mode, dim->profile_ix);
 
-   cpr->rx_ring_coal.coal_ticks = cur_profile.usec;
-   cpr->rx_ring_coal.coal_bufs = cur_profile.pkts;
+   cpr->rx_ring_coal.coal_ticks = cur_moder.usec;
+   cpr->rx_ring_coal.coal_bufs = cur_moder.pkts;
 
bnxt_hwrm_set_ring_coal(bnapi->bp, bnapi);
dim->state = NET_DIM_START_MEASURE;
diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.c 
b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
index f8af472..1c9c75b 100644
--- a/drivers/net/ethernet/broadcom/genet/bcmgenet.c
+++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
@@ -1924,7 +1924,7 @@ static void bcmgenet_dim_work(struct work_struct *work)
struct bcmgenet_rx_ring *ring =
container_of(ndim, struct bcmgenet_rx_ring, dim);
struct net_dim_cq_moder cur_profile =
-   net_dim_get_profile(dim->mode, dim->profile_ix);
+   net_dim_get_rx_moderation(dim->mode, dim->profile_ix);
 
bcmgenet_set_rx_coalesce(ring, cur_profile.usec, cur_profile.pkts);
dim->state = NET_DIM_START_MEASURE;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_dim.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_dim.c
index 602851a..1b286e1 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_dim.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_dim.c
@@ -38,11 +38,11 @@ void mlx5e_rx_dim_work(struct work_struct *work)
struct net_dim *dim = container_of(work, struct net_dim,
   work);
struct mlx5e_rq *rq = container_of(dim, struct mlx5e_rq, dim);
-   struct net_dim_cq_moder cur_profile = net_dim_get_profile(dim->mode,
- 
dim->profile_ix);
+   struct net_dim_cq_moder cur_moder =
+   net_dim_get_rx_moderation(dim->mode, dim->profile_ix);
 
mlx5_core_modify_cq_moderation(rq->mdev, >cq.mcq,
-  cur_profile.usec, cur_profile.pkts);
+  cur_moder.usec, cur_moder.pkts);
 
dim->state = NET_DIM_START_MEASURE;
 }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 4a675f1..8a9670c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -4087,12 +4087,14 @@ void mlx5e_set_rx_cq_mode_params(struct mlx5e_params 
*params, u8 cq_period_mode)
switch (cq_period_mode) {
case MLX5_CQ_PERIOD_MODE_START_FROM_CQE:
params->rx_cq_moderation =
-   
net_dim_get_def_profile(NET_DIM_CQ_PERIOD_MODE_START_FROM_CQE);
+   net_dim_get_def_rx_moderation(
+   N

[PATCH net] net/dim: Fix int overflow

2018-03-29 Thread Tal Gilboa
When calculating difference between samples, the values
are multiplied by 100. Large values may cause int overflow
when multiplied (usually on first iteration).
Fixed by forcing 100 to be of type unsigned long.

Fixes: 4c4dbb4a7363 ("net/mlx5e: Move dynamic interrupt coalescing code to 
include/linux")
Signed-off-by: Tal Gilboa <ta...@mellanox.com>
---
 include/linux/net_dim.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/net_dim.h b/include/linux/net_dim.h
index bebeaad..29ed8fd 100644
--- a/include/linux/net_dim.h
+++ b/include/linux/net_dim.h
@@ -231,7 +231,7 @@ static inline void net_dim_exit_parking(struct net_dim *dim)
 }
 
 #define IS_SIGNIFICANT_DIFF(val, ref) \
-   (((100 * abs((val) - (ref))) / (ref)) > 10) /* more than 10% difference 
*/
+   (((100UL * abs((val) - (ref))) / (ref)) > 10) /* more than 10% 
difference */
 
 static inline int net_dim_stats_compare(struct net_dim_stats *curr,
struct net_dim_stats *prev)
-- 
1.8.3.1



Re: [PATCH net-next v2 2/3] net: systemport: Fix coalescing settings handling

2018-03-29 Thread Tal Gilboa
gt;dim;
+   struct net_dim_cq_moder moder;
+   u32 usecs, pkts;
+
+   usecs = priv->rx_coalesce_usecs;
+   pkts = priv->rx_max_coalesced_frames;
+
+   /* If DIM was enabled, re-apply default parameters */
+   if (dim->use_dim) {
+   moder = net_dim_get_def_profile(dim->dim.mode);
+   usecs = moder.usec;
+   pkts = moder.pkts;
+   }
+
+   bcm_sysport_set_rx_coalesce(priv, usecs, pkts);
+}
+
  static int bcm_sysport_init_tx_ring(struct bcm_sysport_priv *priv,
unsigned int index)
  {
@@ -1658,8 +1684,6 @@ static int bcm_sysport_init_rx_ring(struct 
bcm_sysport_priv *priv)
rdma_writel(priv, 0, RDMA_END_ADDR_HI);
rdma_writel(priv, priv->num_rx_desc_words - 1, RDMA_END_ADDR_LO);
  
-	rdma_writel(priv, 1, RDMA_MBDONE_INTR);

-
netif_dbg(priv, hw, priv->netdev,
  "RDMA cfg, num_rx_bds=%d, rx_bds=%p\n",
  priv->num_rx_bds, priv->rx_bds);
@@ -1827,7 +1851,8 @@ static void bcm_sysport_netif_start(struct net_device 
*dev)
struct bcm_sysport_priv *priv = netdev_priv(dev);
  
  	/* Enable NAPI */

-   bcm_sysport_init_dim(>dim, bcm_sysport_dim_work);
+   bcm_sysport_init_dim(priv, bcm_sysport_dim_work);
+   bcm_sysport_init_rx_coalesce(priv);
napi_enable(>napi);
  
  	/* Enable RX interrupt and TX ring full interrupt */

@@ -2333,6 +2358,7 @@ static int bcm_sysport_probe(struct platform_device *pdev)
/* libphy will adjust the link state accordingly */
netif_carrier_off(dev);
  
+	priv->rx_max_coalesced_frames = 1;


Do you still want to keep {usecs,frames}={0,1} as default static 
configuration? Is the latency so important for the static case? I'm 
assuming RT latency is at least 10us so you can somewhat increase the 
timer and counter without causing an increase in latency. This should be 
properly tested of course so maybe in a future patch.



u64_stats_init(>syncp);
  
  	priv->dsa_notifier.notifier_call = bcm_sysport_dsa_notifier;

diff --git a/drivers/net/ethernet/broadcom/bcmsysport.h 
b/drivers/net/ethernet/broadcom/bcmsysport.h
index 57e18ef8f206..d6e5d0cbf3a3 100644
--- a/drivers/net/ethernet/broadcom/bcmsysport.h
+++ b/drivers/net/ethernet/broadcom/bcmsysport.h
@@ -701,8 +701,6 @@ struct bcm_sysport_net_dim {
u16 event_ctr;
unsigned long   packets;
unsigned long   bytes;
-   u32 coal_usecs;
-   u32 coal_pkts;
struct net_dim  dim;
  };
  
@@ -755,6 +753,8 @@ struct bcm_sysport_priv {

unsigned intrx_c_index;
  
  	struct bcm_sysport_net_dim	dim;

+   u32 rx_max_coalesced_frames;
+   u32 rx_coalesce_usecs;
  
  	/* PHY device */

struct device_node  *phy_dn;


Reviewed-by: Tal Gilboa <ta...@mellanox.com>


Re: [PATCH net-next v2 3/3] net: bcmgenet: Fix coalescing settings handling

2018-03-29 Thread Tal Gilboa
net_dim_get_profile(dim->mode, dim->profile_ix);
  
-	ring->dim.coal_usecs = cur_profile.usec;

-   ring->dim.coal_pkts = cur_profile.pkts;
-
-   bcmgenet_set_rx_coalesce(ring);
+   bcmgenet_set_rx_coalesce(ring, cur_profile.usec, cur_profile.pkts);
dim->state = NET_DIM_START_MEASURE;
  }
  
@@ -2079,9 +2078,11 @@ static void init_umac(struct bcmgenet_priv *priv)

dev_dbg(kdev, "done init umac\n");
  }
  
-static void bcmgenet_init_dim(struct bcmgenet_net_dim *dim,

+static void bcmgenet_init_dim(struct bcmgenet_rx_ring *ring,
  void (*cb)(struct work_struct *work))
  {
+   struct bcmgenet_net_dim *dim = >dim;
+
INIT_WORK(>dim.work, cb);
dim->dim.mode = NET_DIM_CQ_PERIOD_MODE_START_FROM_EQE;
dim->event_ctr = 0;
@@ -2089,6 +2090,25 @@ static void bcmgenet_init_dim(struct bcmgenet_net_dim 
*dim,
dim->bytes = 0;
  }
  
+static void bcmgenet_init_rx_coalesce(struct bcmgenet_rx_ring *ring)

+{
+   struct bcmgenet_net_dim *dim = >dim;
+   struct net_dim_cq_moder moder;
+   u32 usecs, pkts;
+
+   usecs = ring->rx_coalesce_usecs;
+   pkts = ring->rx_max_coalesced_frames;
+
+   /* If DIM was enabled, re-apply default parameters */
+   if (dim->use_dim) {
+   moder = net_dim_get_def_profile(dim->dim.mode);
+   usecs = moder.usec;
+   pkts = moder.pkts;
+   }
+
+   bcmgenet_set_rx_coalesce(ring, usecs, pkts);
+}
+
  /* Initialize a Tx ring along with corresponding hardware registers */
  static void bcmgenet_init_tx_ring(struct bcmgenet_priv *priv,
  unsigned int index, unsigned int size,
@@ -2178,7 +2198,8 @@ static int bcmgenet_init_rx_ring(struct bcmgenet_priv 
*priv,
if (ret)
return ret;
  
-	bcmgenet_init_dim(>dim, bcmgenet_dim_work);

+   bcmgenet_init_dim(ring, bcmgenet_dim_work);
+   bcmgenet_init_rx_coalesce(ring);
  
  	/* Initialize Rx NAPI */

netif_napi_add(priv->dev, >napi, bcmgenet_rx_poll,
@@ -2186,7 +2207,6 @@ static int bcmgenet_init_rx_ring(struct bcmgenet_priv 
*priv,
  
  	bcmgenet_rdma_ring_writel(priv, index, 0, RDMA_PROD_INDEX);

bcmgenet_rdma_ring_writel(priv, index, 0, RDMA_CONS_INDEX);
-   bcmgenet_rdma_ring_writel(priv, index, 1, DMA_MBUF_DONE_THRESH);
bcmgenet_rdma_ring_writel(priv, index,
  ((size << DMA_RING_SIZE_SHIFT) |
   RX_BUF_LENGTH), DMA_RING_BUF_SIZE);
@@ -3424,6 +3444,7 @@ static int bcmgenet_probe(struct platform_device *pdev)
struct net_device *dev;
const void *macaddr;
struct resource *r;
+   unsigned int i;
int err = -EIO;
const char *phy_mode_str;
  
@@ -3552,6 +3573,11 @@ static int bcmgenet_probe(struct platform_device *pdev)

netif_set_real_num_tx_queues(priv->dev, priv->hw_params->tx_queues + 1);
netif_set_real_num_rx_queues(priv->dev, priv->hw_params->rx_queues + 1);
  
+	/* Set default coalescing parameters */

+   for (i = 0; i < priv->hw_params->rx_queues; i++)
+   priv->rx_rings[i].rx_max_coalesced_frames = 1;
+   priv->rx_rings[DESC_INDEX].rx_max_coalesced_frames = 1;
+


Do you still want to keep {usecs,frames}={0,1} as default static 
configuration? Is the latency so important for the static case? I'm 
assuming RT latency is at least 10us so you can somewhat increase the 
timer and counter without causing an increase in latency. This should be 
properly tested of course so maybe in a future patch.



/* libphy will determine the link state */
netif_carrier_off(dev);
  
diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.h b/drivers/net/ethernet/broadcom/genet/bcmgenet.h

index 22c41e0430fb..b773bc07edf7 100644
--- a/drivers/net/ethernet/broadcom/genet/bcmgenet.h
+++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.h
@@ -578,8 +578,6 @@ struct bcmgenet_net_dim {
u16 event_ctr;
unsigned long   packets;
unsigned long   bytes;
-   u32 coal_usecs;
-   u32 coal_pkts;
struct net_dim  dim;
  };
  
@@ -598,6 +596,8 @@ struct bcmgenet_rx_ring {

unsigned intend_ptr;/* Rx ring end CB ptr */
unsigned intold_discards;
    struct bcmgenet_net_dim dim;
+   u32 rx_max_coalesced_frames;
+   u32 rx_coalesce_usecs;
void (*int_enable)(struct bcmgenet_rx_ring *);
void (*int_disable)(struct bcmgenet_rx_ring *);
struct bcmgenet_priv *priv;


Reviewed-by: Tal Gilboa <ta...@mellanox.com>


Re: [PATCH net-next 2/3] net: systemport: Fix coalescing settings handling

2018-03-28 Thread Tal Gilboa

On 3/27/2018 10:47 PM, Florian Fainelli wrote:

There were a number of issues with setting the RX coalescing parameters:

- we would not be preserving values that would have been configured
   across close/open calls, instead we would always reset to no timeout
   and 1 interrupt per packet, this would also prevent DIM from setting its
   default usec/pkts values

- when adaptive RX would be turned on, we woud not be fetching the
   default parameters, we would stay with no timeout/1 packet per
   interrupt until the estimator kicks in and changes that

- finally disabling adaptive RX coalescing while providing parameters
   would not be honored, and we would stay with whatever DIM had
   previously determined instead of the user requested parameters

Fixes: b6e0e875421e ("net: systemport: Implement adaptive interrupt coalescing")
Signed-off-by: Florian Fainelli 
---
  drivers/net/ethernet/broadcom/bcmsysport.c | 57 --
  drivers/net/ethernet/broadcom/bcmsysport.h |  4 +--
  2 files changed, 41 insertions(+), 20 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bcmsysport.c 
b/drivers/net/ethernet/broadcom/bcmsysport.c
index 1e52bb7d822e..43ad6300c351 100644
--- a/drivers/net/ethernet/broadcom/bcmsysport.c
+++ b/drivers/net/ethernet/broadcom/bcmsysport.c
@@ -574,16 +574,16 @@ static int bcm_sysport_set_wol(struct net_device *dev,
return 0;
  }
  
-static void bcm_sysport_set_rx_coalesce(struct bcm_sysport_priv *priv)

+static void bcm_sysport_set_rx_coalesce(struct bcm_sysport_priv *priv,
+   u32 usecs, u32 pkts)
  {
u32 reg;
  
  	reg = rdma_readl(priv, RDMA_MBDONE_INTR);

reg &= ~(RDMA_INTR_THRESH_MASK |
 RDMA_TIMEOUT_MASK << RDMA_TIMEOUT_SHIFT);
-   reg |= priv->dim.coal_pkts;
-   reg |= DIV_ROUND_UP(priv->dim.coal_usecs * 1000, 8192) <<
-   RDMA_TIMEOUT_SHIFT;
+   reg |= pkts;
+   reg |= DIV_ROUND_UP(usecs * 1000, 8192) << RDMA_TIMEOUT_SHIFT;
rdma_writel(priv, reg, RDMA_MBDONE_INTR);
  }
  
@@ -626,6 +626,8 @@ static int bcm_sysport_set_coalesce(struct net_device *dev,

struct ethtool_coalesce *ec)
  {
struct bcm_sysport_priv *priv = netdev_priv(dev);
+   struct net_dim_cq_moder moder;
+   u32 usecs, pkts;
unsigned int i;
  
  	/* Base system clock is 125Mhz, DMA timeout is this reference clock

@@ -646,15 +648,22 @@ static int bcm_sysport_set_coalesce(struct net_device 
*dev,
for (i = 0; i < dev->num_tx_queues; i++)
bcm_sysport_set_tx_coalesce(>tx_rings[i], ec);
  
-	priv->dim.coal_usecs = ec->rx_coalesce_usecs;

-   priv->dim.coal_pkts = ec->rx_max_coalesced_frames;
+   priv->rx_coalesce_usecs = ec->rx_coalesce_usecs;
+   priv->rx_max_coalesced_frames = ec->rx_max_coalesced_frames;
+   usecs = priv->rx_coalesce_usecs;
+   pkts = priv->rx_max_coalesced_frames;
  
-	if (!ec->use_adaptive_rx_coalesce && priv->dim.use_dim) {

-   priv->dim.coal_pkts = 1;
-   priv->dim.coal_usecs = 0;
+   /* If DIM is enabled, immediately obtain default parameters */
+   if (ec->use_adaptive_rx_coalesce) {
+   moder = net_dim_get_def_profile(priv->dim.dim.mode);
+   usecs = moder.usec;
+   pkts = moder.pkts;
}


What if DIM is already in use? We don't want to set default values if 
DIM is already working.



+
priv->dim.use_dim = ec->use_adaptive_rx_coalesce;
-   bcm_sysport_set_rx_coalesce(priv);
+
+   /* Apply desired coalescing parameters */
+   bcm_sysport_set_rx_coalesce(priv, usecs, pkts);
  
  	return 0;

  }
@@ -1058,10 +1067,7 @@ static void bcm_sysport_dim_work(struct work_struct 
*work)
struct net_dim_cq_moder cur_profile =
net_dim_get_profile(dim->mode, dim->profile_ix);
  
-	priv->dim.coal_usecs = cur_profile.usec;

-   priv->dim.coal_pkts = cur_profile.pkts;
-
-   bcm_sysport_set_rx_coalesce(priv);
+   bcm_sysport_set_rx_coalesce(priv, cur_profile.usec, cur_profile.pkts);
dim->state = NET_DIM_START_MEASURE;
  }
  
@@ -1408,14 +1414,30 @@ static void bcm_sysport_adj_link(struct net_device *dev)

phy_print_status(phydev);
  }
  
-static void bcm_sysport_init_dim(struct bcm_sysport_net_dim *dim,

+static void bcm_sysport_init_dim(struct bcm_sysport_priv *priv,
 void (*cb)(struct work_struct *work))
  {
+   struct bcm_sysport_net_dim *dim = >dim;
+   struct net_dim_cq_moder moder;
+   u32 usecs, pkts;
+
INIT_WORK(>dim.work, cb);
dim->dim.mode = NET_DIM_CQ_PERIOD_MODE_START_FROM_EQE;
dim->event_ctr = 0;
dim->packets = 0;
dim->bytes = 0;


I would expect only dim related code in a function called init_dim(). I 
think the code below should be elsewhere. Or maybe change the function 

Re: [PATCH net-next 3/3] net: bcmgenet: Fix coalescing settings handling

2018-03-28 Thread Tal Gilboa

On 3/27/2018 10:47 PM, Florian Fainelli wrote:

There were a number of issues with setting the RX coalescing parameters:

- we would not be preserving values that would have been configured
   across close/open calls, instead we would always reset to no timeout
   and 1 interrupt per packet, this would also prevent DIM from setting its
   default usec/pkts values

- when adaptive RX would be turned on, we woud not be fetching the
   default parameters, we would stay with no timeout/1 packet per interrupt
   until the estimator kicks in and changes that

- finally disabling adaptive RX coalescing while providing parameters
   would not be honored, and we would stay with whatever DIM had previously
   determined instead of the user requested parameters

Fixes: 9f4ca05827a2 ("net: bcmgenet: Add support for adaptive RX coalescing")
Signed-off-by: Florian Fainelli 
---
  drivers/net/ethernet/broadcom/genet/bcmgenet.c | 78 ++
  drivers/net/ethernet/broadcom/genet/bcmgenet.h |  4 +-
  2 files changed, 57 insertions(+), 25 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.c 
b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
index 7db8edc643ec..76409debb796 100644
--- a/drivers/net/ethernet/broadcom/genet/bcmgenet.c
+++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
@@ -625,18 +625,18 @@ static int bcmgenet_get_coalesce(struct net_device *dev,
return 0;
  }
  
-static void bcmgenet_set_rx_coalesce(struct bcmgenet_rx_ring *ring)

+static void bcmgenet_set_rx_coalesce(struct bcmgenet_rx_ring *ring,
+u32 usecs, u32 pkts)
  {
struct bcmgenet_priv *priv = ring->priv;
unsigned int i = ring->index;
u32 reg;
  
-	bcmgenet_rdma_ring_writel(priv, i, ring->dim.coal_pkts,

- DMA_MBUF_DONE_THRESH);
+   bcmgenet_rdma_ring_writel(priv, i, pkts, DMA_MBUF_DONE_THRESH);
  
  	reg = bcmgenet_rdma_readl(priv, DMA_RING0_TIMEOUT + i);

reg &= ~DMA_TIMEOUT_MASK;
-   reg |= DIV_ROUND_UP(ring->dim.coal_usecs * 1000, 8192);
+   reg |= DIV_ROUND_UP(usecs * 1000, 8192);
bcmgenet_rdma_writel(priv, reg, DMA_RING0_TIMEOUT + i);
  }
  
@@ -645,6 +645,8 @@ static int bcmgenet_set_coalesce(struct net_device *dev,

  {
struct bcmgenet_priv *priv = netdev_priv(dev);
struct bcmgenet_rx_ring *ring;
+   struct net_dim_cq_moder moder;
+   u32 usecs, pkts;
unsigned int i;
  
  	/* Base system clock is 125Mhz, DMA timeout is this reference clock

@@ -682,25 +684,37 @@ static int bcmgenet_set_coalesce(struct net_device *dev,
  
  	for (i = 0; i < priv->hw_params->rx_queues; i++) {

ring = >rx_rings[i];
-   ring->dim.coal_usecs = ec->rx_coalesce_usecs;
-   ring->dim.coal_pkts = ec->rx_max_coalesced_frames;
-   if (!ec->use_adaptive_rx_coalesce && ring->dim.use_dim) {
-   ring->dim.coal_pkts = 1;
-   ring->dim.coal_usecs = 0;
+
+   ring->rx_coalesce_usecs = ec->rx_coalesce_usecs;
+   ring->rx_max_coalesced_frames = ec->rx_max_coalesced_frames;
+   usecs = ring->rx_coalesce_usecs;
+   pkts = ring->rx_max_coalesced_frames;
+
+   if (ec->use_adaptive_rx_coalesce) {
+   moder = net_dim_get_def_profile(ring->dim.dim.mode);
+   usecs = moder.usec;
+   pkts = moder.pkts;
}


What if DIM is already in use? We don't want to set default values if 
DIM is already working.



+
ring->dim.use_dim = ec->use_adaptive_rx_coalesce;
-   bcmgenet_set_rx_coalesce(ring);
+   bcmgenet_set_rx_coalesce(ring, usecs, pkts);
}
  
  	ring = >rx_rings[DESC_INDEX];

-   ring->dim.coal_usecs = ec->rx_coalesce_usecs;
-   ring->dim.coal_pkts = ec->rx_max_coalesced_frames;
-   if (!ec->use_adaptive_rx_coalesce && ring->dim.use_dim) {
-   ring->dim.coal_pkts = 1;
-   ring->dim.coal_usecs = 0;
+
+   ring->rx_coalesce_usecs = ec->rx_coalesce_usecs;
+   ring->rx_max_coalesced_frames = ec->rx_max_coalesced_frames;
+   usecs = ring->rx_coalesce_usecs;
+   pkts = ring->rx_max_coalesced_frames;
+
+   if (ec->use_adaptive_rx_coalesce) {
+   moder = net_dim_get_def_profile(ring->dim.dim.mode);
+   usecs = moder.usec;
+   pkts = moder.pkts;
}


Same as above.


+
ring->dim.use_dim = ec->use_adaptive_rx_coalesce;
-   bcmgenet_set_rx_coalesce(ring);
+   bcmgenet_set_rx_coalesce(ring, usecs, pkts);
  
  	return 0;

  }
@@ -1924,10 +1938,7 @@ static void bcmgenet_dim_work(struct work_struct *work)
struct net_dim_cq_moder cur_profile =
net_dim_get_profile(dim->mode, dim->profile_ix);
  
-	ring->dim.coal_usecs = cur_profile.usec;

-   ring->dim.coal_pkts = 

Re: [PATCH net-next 1/3] net: systemport: Remove adaptive TX coalescing

2018-03-28 Thread Tal Gilboa
one = 0;
  
  	work_done = bcm_sysport_tx_reclaim(ring->priv, ring);

@@ -1004,12 +986,6 @@ static int bcm_sysport_tx_poll(struct napi_struct *napi, 
int budget)
return 0;
}
  
-	if (ring->dim.use_dim) {

-   net_dim_sample(ring->dim.event_ctr, ring->dim.packets,
-  ring->dim.bytes, _sample);
-   net_dim(>dim.dim, dim_sample);
-   }
-
return budget;
  }
  
@@ -1089,23 +1065,6 @@ static void bcm_sysport_dim_work(struct work_struct *work)

dim->state = NET_DIM_START_MEASURE;
  }
  
-static void bcm_sysport_dim_tx_work(struct work_struct *work)

-{
-   struct net_dim *dim = container_of(work, struct net_dim, work);
-   struct bcm_sysport_net_dim *ndim =
-   container_of(dim, struct bcm_sysport_net_dim, dim);
-   struct bcm_sysport_tx_ring *ring =
-   container_of(ndim, struct bcm_sysport_tx_ring, dim);
-   struct net_dim_cq_moder cur_profile =
-   net_dim_get_profile(dim->mode, dim->profile_ix);
-
-   ring->dim.coal_usecs = cur_profile.usec;
-   ring->dim.coal_pkts = cur_profile.pkts;
-
-   bcm_sysport_set_tx_coalesce(ring);
-   dim->state = NET_DIM_START_MEASURE;
-}
-
  /* RX and misc interrupt routine */
  static irqreturn_t bcm_sysport_rx_isr(int irq, void *dev_id)
  {
@@ -1152,7 +,6 @@ static irqreturn_t bcm_sysport_rx_isr(int irq, void 
*dev_id)
continue;
  
  		txr = >tx_rings[ring];

-   txr->dim.event_ctr++;
  
  		if (likely(napi_schedule_prep(>napi))) {

intrl2_0_mask_set(priv, ring_bit);
@@ -1185,7 +1143,6 @@ static irqreturn_t bcm_sysport_tx_isr(int irq, void 
*dev_id)
continue;
  
  		txr = >tx_rings[ring];

-   txr->dim.event_ctr++;
  
  		if (likely(napi_schedule_prep(>napi))) {

intrl2_1_mask_set(priv, BIT(ring));
@@ -1551,7 +1508,6 @@ static int bcm_sysport_init_tx_ring(struct 
bcm_sysport_priv *priv,
reg |= (1 << index);
tdma_writel(priv, reg, TDMA_TIER1_ARB_0_QUEUE_EN);
  
-	bcm_sysport_init_dim(>dim, bcm_sysport_dim_tx_work);

napi_enable(>napi);
  
  	netif_dbg(priv, hw, priv->netdev,

@@ -1582,7 +1538,6 @@ static void bcm_sysport_fini_tx_ring(struct 
bcm_sysport_priv *priv,
return;
  
  	napi_disable(>napi);

-   cancel_work_sync(>dim.dim.work);
netif_napi_del(>napi);
  
  	bcm_sysport_tx_clean(priv, ring);

diff --git a/drivers/net/ethernet/broadcom/bcmsysport.h 
b/drivers/net/ethernet/broadcom/bcmsysport.h
index e1c97d4a82b4..57e18ef8f206 100644
--- a/drivers/net/ethernet/broadcom/bcmsysport.h
+++ b/drivers/net/ethernet/broadcom/bcmsysport.h
@@ -723,7 +723,6 @@ struct bcm_sysport_tx_ring {
struct bcm_sysport_priv *priv;  /* private context backpointer */
unsigned long   packets;/* packets statistics */
unsigned long   bytes;  /* bytes statistics */
-   struct bcm_sysport_net_dim dim; /* Net DIM context */
unsigned intswitch_queue;   /* switch port queue number */
unsigned intswitch_port;/* switch port queue number */
boolinspect;/* inspect switch port and queue */



Reviewed-by: Tal Gilboa <ta...@mellanox.com>


Re: [PATCH net-next V2] Documentation/networking: Add net DIM documentation

2018-03-26 Thread Tal Gilboa

On 3/22/2018 8:51 PM, David Miller wrote:

From: Tal Gilboa <ta...@mellanox.com>
Date: Wed, 21 Mar 2018 20:33:45 +0200


Net DIM is a generic algorithm, purposed for dynamically
optimizing network devices interrupt moderation. This
document describes how it works and how to use it.

Signed-off-by: Tal Gilboa <ta...@mellanox.com>


Applied, although several improvements have been suggested.

Please handle that as follow-ups.

Thank you.


Will do ASAP. Thanks all for the feedback.


Re: [PATCH net-next 0/2] net: broadcom: Adaptive interrupt coalescing

2018-03-26 Thread Tal Gilboa

On 3/27/2018 1:29 AM, Florian Fainelli wrote:

On 03/26/2018 03:04 PM, Florian Fainelli wrote:

On 03/26/2018 02:16 PM, Tal Gilboa wrote:

On 3/23/2018 4:19 AM, Florian Fainelli wrote:

Hi all,

This patch series adds adaptive interrupt coalescing for the Gigabit
Ethernet
drivers SYSTEMPORT and GENET.

This really helps lower the interrupt count and system load, as
measured by
vmstat for a Gigabit TCP RX session:


I don't see an improvement in system load, the opposite - 42% vs. 100%
for SYSTEMPORT and 85% vs. 100% for GENET. Both with the same bandwidth.


Looks like I did not extract the correct data the load could spike in
both cases (with and without net_dim) up to 100, but averaged over the
transmission I see the following:

GENET without:
  1  0  0 1169568  0  2555600 0 0 130079 62795  2
86 13  0  0

GENET with:
  1  0  0 1169536  0  2555600 0 0 10566 10869  1
21 78  0  0


Am I missing something? Talking about bandwidth, I would expect 941Mb/s
(assuming this is TCP over IPv4). Do you know why the reduced interrupt
rate doesn't improve bandwidth?


I am assuming that this comes down to a latency, still capturing some
pcap files to analyze the TCP session with wireshark and see if that is
indeed what is going on. The test machine is actually not that great


I would expect 1GbE full wire speed on almost any setup. I'll try 
applying your code on my setup and see what I get.





Also, any effect on the client side (you
mentioned enabling TX moderation for SYSTEMPORT)?


Yes, on SYSTEMPORT, being the TCP IPv4 client, I have the following:

SYSTEMPORT without:
  2  0  0 191428  0  2574800 0 0 86254  264  0 41
59  0  0

SYSTEMPORT with:
  3  0  0 190176  0  2574800 0 0 45485 31332  0
100  0  0  0

I don't get top to agree with these load results though but it looks
like we just have the CPU spinning more, does not look like a win.


The problem appears to be the timeout selection on TX, ignoring it
completely allows us to keep the load average down while maintaining the
bandwidth. Looks like NAPI on TX already does a good job, so interrupt
mitigation on TX is not such a great idea actually...


I saw a similar behavior for TX. For me the issue was too many 
outstanding bytes without a completion (defined to be 256KB by sysctl 
net.ipv4.tcp_limit_output_bytes). I tested on a 100GbE connection so 
with reasonable timeout values I already waited too long (4 TSO 
sessions). For the 1GbE case this might have no effect since you need a 
very long timeout. I'm currently working on adding TX support for dim. 
If you don't see a good benefit currently you might want to wait a 
little with TX adaptive interrupt moderation. Maybe only adjust static 
moderation for now?




Also, doing UDP TX tests shows that we can lower the interrupt count by
setting an appropriate tx-frames (as expected), but we won't be lowering
the CPU load since that is inherently a CPU intensive work. Past


Do you see higher TX UDP bandwidth? If you are bounded by CPU on both 
cases I would at least expect higher bandwidth with less interrupts 
since you reduce work from the CPU.



tx-frames=64, the bandwidth completely drops because that would be 1/2
of the ring size.





Re: [PATCH net-next 1/2] net: systemport: Implement adaptive interrupt coalescing

2018-03-26 Thread Tal Gilboa

On 3/27/2018 12:36 AM, Florian Fainelli wrote:

On 03/26/2018 02:22 PM, Tal Gilboa wrote:

On 3/23/2018 4:19 AM, Florian Fainelli wrote:

Implement support for adaptive RX and TX interrupt coalescing using
net_dim. We have each of our TX ring and our single RX ring implement a
bcm_sysport_net_dim structure which holds an interrupt counter, number
of packets, bytes, and a container for a net_dim instance.

Signed-off-by: Florian Fainelli <f.faine...@gmail.com>
---
   drivers/net/ethernet/broadcom/bcmsysport.c | 141
++---
   drivers/net/ethernet/broadcom/bcmsysport.h |  14 +++
   2 files changed, 140 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bcmsysport.c
b/drivers/net/ethernet/broadcom/bcmsysport.c
index f15a8fc6dfc9..5a5a726bafa4 100644
--- a/drivers/net/ethernet/broadcom/bcmsysport.c
+++ b/drivers/net/ethernet/broadcom/bcmsysport.c
@@ -15,6 +15,7 @@
   #include 
   #include 
   #include 
+#include 


I don't think you need this include. You already include net_dim in
bcmsysport.h and include the bcmsysport.h here.


Indeed.




   #include 
   #include 
   #include 
@@ -574,21 +575,55 @@ static int bcm_sysport_set_wol(struct net_device
*dev,
   return 0;
   }
   +static void bcm_sysport_set_rx_coalesce(struct bcm_sysport_priv *priv)
+{
+    u32 reg;
+
+    reg = rdma_readl(priv, RDMA_MBDONE_INTR);
+    reg &= ~(RDMA_INTR_THRESH_MASK |
+ RDMA_TIMEOUT_MASK << RDMA_TIMEOUT_SHIFT);
+    reg |= priv->dim.coal_pkts;
+    reg |= DIV_ROUND_UP(priv->dim.coal_usecs * 1000, 8192) <<
+    RDMA_TIMEOUT_SHIFT;
+    rdma_writel(priv, reg, RDMA_MBDONE_INTR);
+}
+
+static void bcm_sysport_set_tx_coalesce(struct bcm_sysport_tx_ring
*ring)
+{
+    struct bcm_sysport_priv *priv = ring->priv;
+    u32 reg;
+
+    reg = tdma_readl(priv, TDMA_DESC_RING_INTR_CONTROL(ring->index));
+    reg &= ~(RING_INTR_THRESH_MASK |
+ RING_TIMEOUT_MASK << RING_TIMEOUT_SHIFT);
+    reg |= ring->dim.coal_pkts;
+    reg |= DIV_ROUND_UP(ring->dim.coal_usecs * 1000, 8192) <<
+    RING_TIMEOUT_SHIFT;
+    tdma_writel(priv, reg, TDMA_DESC_RING_INTR_CONTROL(ring->index));
+}
+


I wouldn't couple these functions with dim. This implies dim is always
used. IMO, would be more clear to use a generic method which takes usecs
and packets as an argument.


I did not want to create an additional structure for storing coalescing
parameters, but if you prefer I make this function take two parameters,
that sounds entirely reasonable.




   static int bcm_sysport_get_coalesce(struct net_device *dev,
   struct ethtool_coalesce *ec)
   {
   struct bcm_sysport_priv *priv = netdev_priv(dev);
+    struct bcm_sysport_tx_ring *ring;
+    unsigned int i;
   u32 reg;
     reg = tdma_readl(priv, TDMA_DESC_RING_INTR_CONTROL(0));
     ec->tx_coalesce_usecs = (reg >> RING_TIMEOUT_SHIFT) * 8192 /
1000;
   ec->tx_max_coalesced_frames = reg & RING_INTR_THRESH_MASK;
+    for (i = 0; i < dev->num_tx_queues; i++) {
+    ring = >tx_rings[i];
+    ec->use_adaptive_tx_coalesce |= ring->dim.use_dim;
+    }
     reg = rdma_readl(priv, RDMA_MBDONE_INTR);
     ec->rx_coalesce_usecs = (reg >> RDMA_TIMEOUT_SHIFT) * 8192 /
1000;
   ec->rx_max_coalesced_frames = reg & RDMA_INTR_THRESH_MASK;
+    ec->use_adaptive_rx_coalesce = priv->dim.use_dim;
     return 0;
   }
@@ -597,8 +632,8 @@ static int bcm_sysport_set_coalesce(struct
net_device *dev,
   struct ethtool_coalesce *ec)
   {
   struct bcm_sysport_priv *priv = netdev_priv(dev);
+    struct bcm_sysport_tx_ring *ring;
   unsigned int i;
-    u32 reg;
     /* Base system clock is 125Mhz, DMA timeout is this reference
clock
    * divided by 1024, which yield roughly 8.192 us, our maximum
value has
@@ -615,22 +650,26 @@ static int bcm_sysport_set_coalesce(struct
net_device *dev,
   return -EINVAL;
     for (i = 0; i < dev->num_tx_queues; i++) {
-    reg = tdma_readl(priv, TDMA_DESC_RING_INTR_CONTROL(i));
-    reg &= ~(RING_INTR_THRESH_MASK |
- RING_TIMEOUT_MASK << RING_TIMEOUT_SHIFT);
-    reg |= ec->tx_max_coalesced_frames;
-    reg |= DIV_ROUND_UP(ec->tx_coalesce_usecs * 1000, 8192) <<
- RING_TIMEOUT_SHIFT;
-    tdma_writel(priv, reg, TDMA_DESC_RING_INTR_CONTROL(i));
+    ring = >tx_rings[i];
+    ring->dim.coal_pkts = ec->tx_max_coalesced_frames;
+    ring->dim.coal_usecs = ec->tx_coalesce_usecs;
+    if (!ec->use_adaptive_tx_coalesce && ring->dim.use_dim) {
+    ring->dim.coal_pkts = 1;
+    ring->dim.coal_usecs = 0;
+    }
+    ring->dim.use_dim = ec->use_adaptive_tx_coalesce;
+    bcm_sysport_set_tx_coalesce(ring);
   }


If I understand correctly,

Re: [PATCH net-next 2/2] net: bcmgenet: Add support for adaptive RX coalescing

2018-03-26 Thread Tal Gilboa

On 3/23/2018 4:19 AM, Florian Fainelli wrote:

Unlike the moder modern SYSTEMPORT hardware, we do not have a
configurable TDMA timeout, which limits us to implement adaptive RX
interrupt coalescing only. We have each of our RX rings implement a
bcmgenet_net_dim structure which holds an interrupt counter, number of
packets, bytes, and a container for a net_dim instance.

Signed-off-by: Florian Fainelli 
---
  drivers/net/ethernet/broadcom/genet/bcmgenet.c | 109 +
  drivers/net/ethernet/broadcom/genet/bcmgenet.h |  12 +++
  2 files changed, 103 insertions(+), 18 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.c 
b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
index b1e35a9accf1..7db8edc643ec 100644
--- a/drivers/net/ethernet/broadcom/genet/bcmgenet.c
+++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
@@ -603,6 +603,8 @@ static int bcmgenet_get_coalesce(struct net_device *dev,
 struct ethtool_coalesce *ec)
  {
struct bcmgenet_priv *priv = netdev_priv(dev);
+   struct bcmgenet_rx_ring *ring;
+   unsigned int i;
  
  	ec->tx_max_coalesced_frames =

bcmgenet_tdma_ring_readl(priv, DESC_INDEX,
@@ -613,15 +615,37 @@ static int bcmgenet_get_coalesce(struct net_device *dev,
ec->rx_coalesce_usecs =
bcmgenet_rdma_readl(priv, DMA_RING16_TIMEOUT) * 8192 / 1000;
  
+	for (i = 0; i < priv->hw_params->rx_queues; i++) {

+   ring = >rx_rings[i];
+   ec->use_adaptive_rx_coalesce |= ring->dim.use_dim;
+   }
+   ring = >rx_rings[DESC_INDEX];
+   ec->use_adaptive_rx_coalesce |= ring->dim.use_dim;
+
return 0;
  }
  
+static void bcmgenet_set_rx_coalesce(struct bcmgenet_rx_ring *ring)

+{
+   struct bcmgenet_priv *priv = ring->priv;
+   unsigned int i = ring->index;
+   u32 reg;
+
+   bcmgenet_rdma_ring_writel(priv, i, ring->dim.coal_pkts,
+ DMA_MBUF_DONE_THRESH);
+
+   reg = bcmgenet_rdma_readl(priv, DMA_RING0_TIMEOUT + i);
+   reg &= ~DMA_TIMEOUT_MASK;
+   reg |= DIV_ROUND_UP(ring->dim.coal_usecs * 1000, 8192);
+   bcmgenet_rdma_writel(priv, reg, DMA_RING0_TIMEOUT + i);
+}
+


Similar comments from path 1/2 apply here - wouldn't couple the genric 
get_set_coalesce functions with dim.



  static int bcmgenet_set_coalesce(struct net_device *dev,
 struct ethtool_coalesce *ec)
  {
struct bcmgenet_priv *priv = netdev_priv(dev);
+   struct bcmgenet_rx_ring *ring;
unsigned int i;
-   u32 reg;
  
  	/* Base system clock is 125Mhz, DMA timeout is this reference clock

 * divided by 1024, which yields roughly 8.192us, our maximum value
@@ -641,7 +665,8 @@ static int bcmgenet_set_coalesce(struct net_device *dev,
 * transmitted, or when the ring is empty.
 */
if (ec->tx_coalesce_usecs || ec->tx_coalesce_usecs_high ||
-   ec->tx_coalesce_usecs_irq || ec->tx_coalesce_usecs_low)
+   ec->tx_coalesce_usecs_irq || ec->tx_coalesce_usecs_low ||
+   ec->use_adaptive_tx_coalesce)
return -EOPNOTSUPP;
  
  	/* Program all TX queues with the same values, as there is no

@@ -656,24 +681,26 @@ static int bcmgenet_set_coalesce(struct net_device *dev,
  DMA_MBUF_DONE_THRESH);
  
  	for (i = 0; i < priv->hw_params->rx_queues; i++) {

-   bcmgenet_rdma_ring_writel(priv, i,
- ec->rx_max_coalesced_frames,
- DMA_MBUF_DONE_THRESH);
-
-   reg = bcmgenet_rdma_readl(priv, DMA_RING0_TIMEOUT + i);
-   reg &= ~DMA_TIMEOUT_MASK;
-   reg |= DIV_ROUND_UP(ec->rx_coalesce_usecs * 1000, 8192);
-   bcmgenet_rdma_writel(priv, reg, DMA_RING0_TIMEOUT + i);
+   ring = >rx_rings[i];
+   ring->dim.coal_usecs = ec->rx_coalesce_usecs;
+   ring->dim.coal_pkts = ec->rx_max_coalesced_frames;
+   if (!ec->use_adaptive_rx_coalesce && ring->dim.use_dim) {
+   ring->dim.coal_pkts = 1;
+   ring->dim.coal_usecs = 0;
+   }
+   ring->dim.use_dim = ec->use_adaptive_rx_coalesce;
+   bcmgenet_set_rx_coalesce(ring);
}
  
-	bcmgenet_rdma_ring_writel(priv, DESC_INDEX,

- ec->rx_max_coalesced_frames,
- DMA_MBUF_DONE_THRESH);
-
-   reg = bcmgenet_rdma_readl(priv, DMA_RING16_TIMEOUT);
-   reg &= ~DMA_TIMEOUT_MASK;
-   reg |= DIV_ROUND_UP(ec->rx_coalesce_usecs * 1000, 8192);
-   bcmgenet_rdma_writel(priv, reg, DMA_RING16_TIMEOUT);
+   ring = >rx_rings[DESC_INDEX];
+   ring->dim.coal_usecs = ec->rx_coalesce_usecs;
+   ring->dim.coal_pkts = ec->rx_max_coalesced_frames;
+   if (!ec->use_adaptive_rx_coalesce && 

Re: [PATCH net-next 1/2] net: systemport: Implement adaptive interrupt coalescing

2018-03-26 Thread Tal Gilboa

On 3/23/2018 4:19 AM, Florian Fainelli wrote:

Implement support for adaptive RX and TX interrupt coalescing using
net_dim. We have each of our TX ring and our single RX ring implement a
bcm_sysport_net_dim structure which holds an interrupt counter, number
of packets, bytes, and a container for a net_dim instance.

Signed-off-by: Florian Fainelli 
---
  drivers/net/ethernet/broadcom/bcmsysport.c | 141 ++---
  drivers/net/ethernet/broadcom/bcmsysport.h |  14 +++
  2 files changed, 140 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bcmsysport.c 
b/drivers/net/ethernet/broadcom/bcmsysport.c
index f15a8fc6dfc9..5a5a726bafa4 100644
--- a/drivers/net/ethernet/broadcom/bcmsysport.c
+++ b/drivers/net/ethernet/broadcom/bcmsysport.c
@@ -15,6 +15,7 @@
  #include 
  #include 
  #include 
+#include 


I don't think you need this include. You already include net_dim in 
bcmsysport.h and include the bcmsysport.h here.



  #include 
  #include 
  #include 
@@ -574,21 +575,55 @@ static int bcm_sysport_set_wol(struct net_device *dev,
return 0;
  }
  
+static void bcm_sysport_set_rx_coalesce(struct bcm_sysport_priv *priv)

+{
+   u32 reg;
+
+   reg = rdma_readl(priv, RDMA_MBDONE_INTR);
+   reg &= ~(RDMA_INTR_THRESH_MASK |
+RDMA_TIMEOUT_MASK << RDMA_TIMEOUT_SHIFT);
+   reg |= priv->dim.coal_pkts;
+   reg |= DIV_ROUND_UP(priv->dim.coal_usecs * 1000, 8192) <<
+   RDMA_TIMEOUT_SHIFT;
+   rdma_writel(priv, reg, RDMA_MBDONE_INTR);
+}
+
+static void bcm_sysport_set_tx_coalesce(struct bcm_sysport_tx_ring *ring)
+{
+   struct bcm_sysport_priv *priv = ring->priv;
+   u32 reg;
+
+   reg = tdma_readl(priv, TDMA_DESC_RING_INTR_CONTROL(ring->index));
+   reg &= ~(RING_INTR_THRESH_MASK |
+RING_TIMEOUT_MASK << RING_TIMEOUT_SHIFT);
+   reg |= ring->dim.coal_pkts;
+   reg |= DIV_ROUND_UP(ring->dim.coal_usecs * 1000, 8192) <<
+   RING_TIMEOUT_SHIFT;
+   tdma_writel(priv, reg, TDMA_DESC_RING_INTR_CONTROL(ring->index));
+}
+


I wouldn't couple these functions with dim. This implies dim is always 
used. IMO, would be more clear to use a generic method which takes usecs 
and packets as an argument.



  static int bcm_sysport_get_coalesce(struct net_device *dev,
struct ethtool_coalesce *ec)
  {
struct bcm_sysport_priv *priv = netdev_priv(dev);
+   struct bcm_sysport_tx_ring *ring;
+   unsigned int i;
u32 reg;
  
  	reg = tdma_readl(priv, TDMA_DESC_RING_INTR_CONTROL(0));
  
  	ec->tx_coalesce_usecs = (reg >> RING_TIMEOUT_SHIFT) * 8192 / 1000;

ec->tx_max_coalesced_frames = reg & RING_INTR_THRESH_MASK;
+   for (i = 0; i < dev->num_tx_queues; i++) {
+   ring = >tx_rings[i];
+   ec->use_adaptive_tx_coalesce |= ring->dim.use_dim;
+   }
  
  	reg = rdma_readl(priv, RDMA_MBDONE_INTR);
  
  	ec->rx_coalesce_usecs = (reg >> RDMA_TIMEOUT_SHIFT) * 8192 / 1000;

ec->rx_max_coalesced_frames = reg & RDMA_INTR_THRESH_MASK;
+   ec->use_adaptive_rx_coalesce = priv->dim.use_dim;
  
  	return 0;

  }
@@ -597,8 +632,8 @@ static int bcm_sysport_set_coalesce(struct net_device *dev,
struct ethtool_coalesce *ec)
  {
struct bcm_sysport_priv *priv = netdev_priv(dev);
+   struct bcm_sysport_tx_ring *ring;
unsigned int i;
-   u32 reg;
  
  	/* Base system clock is 125Mhz, DMA timeout is this reference clock

 * divided by 1024, which yield roughly 8.192 us, our maximum value has
@@ -615,22 +650,26 @@ static int bcm_sysport_set_coalesce(struct net_device 
*dev,
return -EINVAL;
  
  	for (i = 0; i < dev->num_tx_queues; i++) {

-   reg = tdma_readl(priv, TDMA_DESC_RING_INTR_CONTROL(i));
-   reg &= ~(RING_INTR_THRESH_MASK |
-RING_TIMEOUT_MASK << RING_TIMEOUT_SHIFT);
-   reg |= ec->tx_max_coalesced_frames;
-   reg |= DIV_ROUND_UP(ec->tx_coalesce_usecs * 1000, 8192) <<
-RING_TIMEOUT_SHIFT;
-   tdma_writel(priv, reg, TDMA_DESC_RING_INTR_CONTROL(i));
+   ring = >tx_rings[i];
+   ring->dim.coal_pkts = ec->tx_max_coalesced_frames;
+   ring->dim.coal_usecs = ec->tx_coalesce_usecs;
+   if (!ec->use_adaptive_tx_coalesce && ring->dim.use_dim) {
+   ring->dim.coal_pkts = 1;
+   ring->dim.coal_usecs = 0;
+   }
+   ring->dim.use_dim = ec->use_adaptive_tx_coalesce;
+   bcm_sysport_set_tx_coalesce(ring);
}


If I understand correctly, if I disable dim, moderation is set to 
{usecs,packets}={0,1} regardless of the input from ethtool right? 
Doesn't this break the wanted behavior? As mentioned above, I would 
decouple dim from the set_tx/rx_coalesce() 

Re: [PATCH net-next 0/2] net: broadcom: Adaptive interrupt coalescing

2018-03-26 Thread Tal Gilboa

On 3/23/2018 4:19 AM, Florian Fainelli wrote:

Hi all,

This patch series adds adaptive interrupt coalescing for the Gigabit Ethernet
drivers SYSTEMPORT and GENET.

This really helps lower the interrupt count and system load, as measured by
vmstat for a Gigabit TCP RX session:


I don't see an improvement in system load, the opposite - 42% vs. 100% 
for SYSTEMPORT and 85% vs. 100% for GENET. Both with the same bandwidth. 
Am I missing something? Talking about bandwidth, I would expect 941Mb/s 
(assuming this is TCP over IPv4). Do you know why the reduced interrupt 
rate doesn't improve bandwidth? Also, any effect on the client side (you 
mentioned enabling TX moderation for SYSTEMPORT)?




SYSTEMPORT:

without:

  1  0  0 192188  0  2547200 0 0 122100 38870  1 42 57  
0  0
[ ID] Interval   Transfer Bandwidth
[  4]  0.0-10.0 sec  1.03 GBytes   884 Mbits/sec

with:

  1  0  0 192288  0  2546800 0 0 58806 44401  0 100  0  
0  0
[  5]  0.0-10.0 sec  1.04 GBytes   888 Mbits/sec

GENET:

without:

  1  0  0 1170404  0  2542000 0 0 130785 63402  2 85 12 
 0  0
[ ID] Interval   Transfer Bandwidth
[  4]  0.0-10.0 sec  1.04 GBytes   888 Mbits/sec

with:

  1  0  0 1170560  0  2542000 0 0 50610 48477  0 100  0 
 0  0
[  5]  0.0-10.0 sec  1.05 GBytes   899 Mbits/sec

Please look at the implementation and let me know if you see any problems, this
was largely inspired by bnxt_en.

Thank you!

Florian Fainelli (2):
   net: systemport: Implement adaptive interrupt coalescing
   net: bcmgenet: Add support for adaptive RX coalescing

  drivers/net/ethernet/broadcom/bcmsysport.c | 141 ++---
  drivers/net/ethernet/broadcom/bcmsysport.h |  14 +++
  drivers/net/ethernet/broadcom/genet/bcmgenet.c | 109 +++
  drivers/net/ethernet/broadcom/genet/bcmgenet.h |  12 +++
  4 files changed, 243 insertions(+), 33 deletions(-)



Re: [PATCH net-next] Documentation/networking: Add net DIM documentation

2018-03-21 Thread Tal Gilboa

On 3/21/2018 8:33 PM, Randy Dunlap wrote:

On 03/21/2018 11:20 AM, Marcelo Ricardo Leitner wrote:

On Wed, Mar 21, 2018 at 11:30:29AM +0200, Tal Gilboa wrote:
...

+Dynamic Interrupt Moderation (DIM) (in networking) refers to changing the 
interrupt
+moderation configuration of a channel in order to optimize packet processing.
+The mechanism includes an algorithm which decides if and how to change
+moderation parameters for a channel, usually by performing an analysis on
+runtime data sampled from the system. Net DIM is such a mechanism. In each
+iteration of the algorithm, it analyses a given sample of the data, compares 
it to
+the previous sample and if required, is can decide to change some of the 
interrupt moderation


Some lines are above 80 chars. Please format the text so they don't exceed it.


Agreed.

thanks,


lines' length and spelling mistakes fixed. See V2.
Thanks for the feedback.


[PATCH net-next V2] Documentation/networking: Add net DIM documentation

2018-03-21 Thread Tal Gilboa
Net DIM is a generic algorithm, purposed for dynamically
optimizing network devices interrupt moderation. This
document describes how it works and how to use it.

Signed-off-by: Tal Gilboa <ta...@mellanox.com>
---
 Documentation/networking/net_dim.txt | 174 +++
 1 file changed, 174 insertions(+)
 create mode 100644 Documentation/networking/net_dim.txt

diff --git a/Documentation/networking/net_dim.txt 
b/Documentation/networking/net_dim.txt
new file mode 100644
index 000..9cb31c5
--- /dev/null
+++ b/Documentation/networking/net_dim.txt
@@ -0,0 +1,174 @@
+Net DIM - Generic Network Dynamic Interrupt Moderation
+==
+
+Author:
+       Tal Gilboa <ta...@mellanox.com>
+
+
+Contents
+=
+
+- Assumptions
+- Introduction
+- The Net DIM Algorithm
+- Registering a Network Device to DIM
+- Example
+
+Part 0: Assumptions
+==
+
+This document assumes the reader has basic knowledge in network drivers
+and in general interrupt moderation.
+
+
+Part I: Introduction
+==
+
+Dynamic Interrupt Moderation (DIM) (in networking) refers to changing the
+interrupt moderation configuration of a channel in order to optimize packet
+processing. The mechanism includes an algorithm which decides if and how to
+change moderation parameters for a channel, usually by performing an analysis 
on
+runtime data sampled from the system. Net DIM is such a mechanism. In each
+iteration of the algorithm, it analyses a given sample of the data, compares it
+to the previous sample and if required, it can decide to change some of the
+interrupt moderation configuration fields. The data sample is composed of data
+bandwidth, the number of packets and the number of events. The time between
+samples is also measured. Net DIM compares the current and the previous data 
and
+returns an adjusted interrupt moderation configuration object. In some cases,
+the algorithm might decide not to change anything. The configuration fields are
+the minimum duration (microseconds) allowed between events and the maximum
+number of wanted packets per event. The Net DIM algorithm ascribes importance 
to
+increase bandwidth over reducing interrupt rate.
+
+
+Part II: The Net DIM Algorithm
+===
+
+Each iteration of the Net DIM algorithm follows these steps:
+1. Calculates new data sample.
+2. Compares it to previous sample.
+3. Makes a decision - suggests interrupt moderation configuration fields.
+4. Applies a schedule work function, which applies suggested configuration.
+
+The first two steps are straightforward, both the new and the previous data are
+supplied by the driver registered to Net DIM. The previous data is the new data
+supplied to the previous iteration. The comparison step checks the difference
+between the new and previous data and decides on the result of the last step.
+A step would result as "better" if bandwidth increases and as "worse" if
+bandwidth reduces. If there is no change in bandwidth, the packet rate is
+compared in a similar fashion - increase == "better" and decrease == "worse".
+In case there is no change in the packet rate as well, the interrupt rate is
+compared. Here the algorithm tries to optimize for lower interrupt rate so an
+increase in the interrupt rate is considered "worse" and a decrease is
+considered "better". Step #2 has an optimization for avoiding false results: it
+only considers a difference between samples as valid if it is greater than a
+certain percentage. Also, since Net DIM does not measure anything by itself, it
+assumes the data provided by the driver is valid.
+
+Step #3 decides on the suggested configuration based on the result from step #2
+and the internal state of the algorithm. The states reflect the "direction" of
+the algorithm: is it going left (reducing moderation), right (increasing
+moderation) or standing still. Another optimization is that if a decision
+to stay still is made multiple times, the interval between iterations of the
+algorithm would increase in order to reduce calculation overhead. Also, after
+"parking" on one of the most left or most right decisions, the algorithm may
+decide to verify this decision by taking a step in the other direction. This is
+done in order to avoid getting stuck in a "deep sleep" scenario. Once a
+decision is made, an interrupt moderation configuration is selected from
+the predefined profiles.
+
+The last step is to notify the registered driver that it should apply the
+suggested configuration. This is done by scheduling a work function, defined by
+the Net DIM API and provided by the registered driver.
+
+As you can see, Net DIM itself does not actively interact with the system. It
+would have trouble making the correct decisions if the wrong data is supplied 
to
+it and it would be useless if th

[PATCH net-next V1] Documentation/networking: Add net DIM documentation

2018-03-21 Thread Tal Gilboa
Net DIM is a generic algorithm, purposed for dynamically
optimizing network devices interrupt moderation. This
document describes how it works and how to use it.

Signed-off-by: Tal Gilboa <ta...@mellanox.com>
---
 Documentation/networking/net_dim.txt | 174 +++
 1 file changed, 174 insertions(+)
 create mode 100644 Documentation/networking/net_dim.txt

diff --git a/Documentation/networking/net_dim.txt 
b/Documentation/networking/net_dim.txt
new file mode 100644
index 000..823748b
--- /dev/null
+++ b/Documentation/networking/net_dim.txt
@@ -0,0 +1,174 @@
+Net DIM - Generic Network Dynamic Interrupt Moderation
+==
+
+Author:
+       Tal Gilboa <ta...@mellanox.com>
+
+
+Contents
+=
+
+- Assumptions
+- Introduction
+- The Net DIM Algorithm
+- Registering a Network Device to DIM
+- Example
+
+Part 0: Assumptions
+==
+
+This document assumes the reader has basic knowledge in network drivers
+and in general interrupt moderation.
+
+
+Part I: Introduction
+==
+
+Dynamic Interrupt Moderation (DIM) (in networking) refers to changing the 
interrupt
+moderation configuration of a channel in order to optimize packet processing.
+The mechanism includes an algorithm which decides if and how to change
+moderation parameters for a channel, usually by performing an analysis on
+runtime data sampled from the system. Net DIM is such a mechanism. In each
+iteration of the algorithm, it analyses a given sample of the data, compares 
it to
+the previous sample and if required, it can decide to change some of the 
interrupt moderation
+configuration fields. The data sample is composed of data bandwidth, the 
number of
+packets and the number of events. The time between samples is also measured. 
Net DIM
+compares the current and the previous data and returns an adjusted interrupt
+moderation configuration object. In some cases, the algorithm might decide not
+to change anything. The configuration fields are the minimum duration
+(microseconds) allowed between events and the maximum number of wanted packets
+per event. The Net DIM algorithm ascribes importance to increase bandwidth over
+reducing interrupt rate.
+
+
+Part II: The Net DIM Algorithm
+===
+
+Each iteration of the Net DIM algorithm follows these steps:
+1. Calculates new data sample.
+2. Compares it to previous sample.
+3. Makes a decision - suggests interrupt moderation configuration fields.
+4. Applies a schedule work function, which applies suggested configuration.
+
+The first two steps are straightforward, both the new and the previous data are
+supplied by the driver registered to Net DIM. The previous data is the new data
+supplied to the previous iteration. The comparison step checks the difference
+between the new and previous data and decides on the result of the last step. 
A step
+would result as "better" if bandwidth increases and as "worse" if bandwidth
+reduces. If there is no change in bandwidth, the packet rate is compared in a 
similar
+fashion - increase == "better" and decrease == "worse". In case there is no
+change in the packet rate as well, the interrupt rate is compared. Here the
+algorithm tries to optimize for lower interrupt rate so an increase in the
+interrupt rate is considered "worse" and a decrease is considered "better".
+Step #2 has an optimization for avoiding false results: it only considers a
+difference between samples as valid if it is greater than a certain percentage.
+Also, since Net DIM does not measure anything by itself, it assumes the data
+provided by the driver is valid.
+
+Step #3 decides on the suggested configuration based on the result from step #2
+and the internal state of the algorithm. The states reflect the "direction" of
+the algorithm: is it going left (reducing moderation), right (increasing
+moderation) or standing still. Another optimization is that if a decision
+to stay still is made multiple times, the interval between iterations of the
+algorithm would increase in order to reduce calculation overhead. Also, after
+"parking" on one of the most left or most right decisions, the algorithm may
+decide to verify this decision by taking a step in the other direction. This is
+done in order to avoid getting stuck in a "deep sleep" scenario. Once a
+decision is made, an interrupt moderation configuration is selected from
+the predefined profiles.
+
+The last step is to notify the registered driver that it should apply the 
suggested
+configuration. This is done by scheduling a work function, defined by the Net 
DIM
+API and provided by the registered driver.
+
+As you can see, Net DIM itself does not actively interact with the system. It
+would have trouble making the correct decisions if the wrong data is supplied 
to it
+and it would b

[PATCH net-next] Documentation/networking: Add net DIM documentation

2018-03-21 Thread Tal Gilboa
Net DIM is a generic algorithm, purposed for dynamically
optimizing network devices interrupt moderation. This
document describes how it works and how to use it.

Signed-off-by: Tal Gilboa <ta...@mellanox.com>
---
 Documentation/networking/net_dim.txt | 174 +++
 1 file changed, 174 insertions(+)
 create mode 100644 Documentation/networking/net_dim.txt

diff --git a/Documentation/networking/net_dim.txt 
b/Documentation/networking/net_dim.txt
new file mode 100644
index 000..ef622c8
--- /dev/null
+++ b/Documentation/networking/net_dim.txt
@@ -0,0 +1,174 @@
+Net DIM - Generic Network Dynamic Interrupt Moderation
+==
+
+Author:
+       Tal Gilboa <ta...@mellanox.com>
+
+
+Contents
+=
+
+- Assumptions
+- Introduction
+- The Net DIM Algorithm
+- Registering a Network Device to DIM
+- Example
+
+Part 0: Assumptions
+==
+
+This document assumes the reader has basic knowledge in network drivers
+and in general interrupt moderation.
+
+
+Part I: Introduction
+==
+
+Dynamic Interrupt Moderation (DIM) (in networking) refers to changing the 
interrupt
+moderation configuration of a channel in order to optimize packet processing.
+The mechanism includes an algorithm which decides if and how to change
+moderation parameters for a channel, usually by performing an analysis on
+runtime data sampled from the system. Net DIM is such a mechanism. In each
+iteration of the algorithm, it analyses a given sample of the data, compares 
it to
+the previous sample and if required, is can decide to change some of the 
interrupt moderation
+configuration fields. The data sample is composed of data bandwidth, the 
number of
+packets and the number of events. The time between samples is also measured. 
Net DIM
+compares the current and the previous data and returns an adjusted interrupt
+moderation configuration object. In some cases, the algorithm might decide not
+to change anything. The configuration fields are the minimum duration
+(microseconds) allowed between events and the maximum number of wanted packets
+per event. The Net DIM algorithm ascribes importance to increase bandwidth over
+reducing interrupt rate.
+
+
+Part II: The Net DIM Algorithm
+===
+
+Each iteration of the Net DIM algorithm follows these steps:
+1. Calculates new data sample.
+2. Compares it to previous sample.
+3. Makes a decision - suggests interrupt moderation configuration fields.
+4. Applies a schedule work function, which applies suggested configuration.
+
+The first two steps are straight forward, both the new and the previous data 
are
+supplied by the driver registered to Net DIM. The previous data is the new data
+supplied to the previous iteration. The comparison step checks the difference
+between the new and previous data and decides on the result of the last step. 
A step
+would result as "better" if bandwidth increases and as "worse" if bandwidth
+reduces. If there is no change in bandwidth, the packet rate is compared in a 
similar
+fashion - increase == "better" and decrease == "worse". In case there is no
+change in the packet rate as well, the interrupt rate is compared. Here the
+algorithm tries to optimize for lower interrupt rate so an increase in the
+interrupt rate is considered "worse" and a decrease is considered "better".
+Step #2 has an optimization for avoiding false results, it only considers a
+difference between samples as valid if it is greater than a certain percentage.
+Also, since Net DIM does not measure anything by itself, it assumes the data
+provided by the driver is valid.
+
+Step #3 decides on the suggested configuration based on the result from step #2
+and the internal state of the algorithm. The states reflect the "direction" of
+the algorithm, is it going left (reducing moderation), right (increasing
+moderation) or standing still. Another optimization is that if a decision
+to stay still is made multiple times, the interval between iterations of the
+algorithm would increase in order to reduce calculation overhead. Also, after
+"parking" on one of the most left or most right decisions, the algorithm may
+decide to verify this decision by taking a step on the other direction. This is
+done in order to avoid getting stuck in a "deep sleep" scenario. Once a
+decision is made, an interrupt moderation configuration is selected from
+the predefined profiles.
+
+The last step is to notify the registered driver that it should apply the 
suggested
+configuration. This is done by scheduling a work function, defined by the Net 
DIM
+API and provided by the registered driver.
+
+As you can see, Net DIM itself does not actively interact with the system. It
+would have trouble making the correct decisions if the wrong data is supplied 
to it
+and it would be

Re: [PATCH net-next 0/7] tcp: implement rb-tree based retransmit queue

2018-02-06 Thread Tal Gilboa

On 2/6/2018 5:52 PM, Eric Dumazet wrote:

On Tue, 2018-02-06 at 15:22 +, David Laight wrote:

From: Eric Dumazet

Sent: 06 February 2018 14:20


...

Please give exact details.
Sending 64, 128, 256 or 512 bytes at a time on TCP_STREAM makes little sense.
We are not optimizing stack for pathological cases, sorry.


There are plenty of workloads which are not bulk data and where multiple
small buffers get sent at unknown intervals (which may be back to back).
Such connections have to have Nagle disabled because the Nagle delays
are 'horrid'.
Clearly lost packets can cause delays, but they are rare on local networks.


Auto corking makes sure aggregation happens, even for when Nagle is in
the picture.




netperf -- -m 256will still cook 64KB TSO packets


This is what we would have liked to see, but auto corking isn't forcing 
64KB TSO packets. Under certain conditions, specifically when TX queue 
is empty, it would send the SKB to transmit even if it isn't full:

static bool tcp_should_autocork(struct sock *sk, struct sk_buff *skb,
int size_goal)
{
return skb->len < size_goal &&
   sock_net(sk)->ipv4.sysctl_tcp_autocorking &&
   skb != tcp_write_queue_head(sk) &&
   refcount_read(>sk_wmem_alloc) > skb->truesize;
}
When skb == tcp_write_queue_head(sk) corking is done. This is part of 
the optimization for mlx5 driver I've mentioned. If we can better 
utilize auto corking we shouldn't have an issue.




netperf is not adding delays between each send(), unless it has been
modified.




I ran this command:
./super_netperf 2000 -H  -l 30 -f g -- -m $size
didn't change netperf in any way.


Re: [PATCH net-next 0/7] tcp: implement rb-tree based retransmit queue

2018-02-06 Thread Tal Gilboa

On 1/24/2018 5:09 PM, Eric Dumazet wrote:

On Wed, 2018-01-24 at 16:42 +0200, Tal Gilboa wrote:

Hi Eric,
My choice of words in my comment was misplaced, and I apologies. It
completely missed the point. I understand, of course, the importance of
optimizing real-life scenarios.

We are currently evaluating this patch and if/how it might affect our
customers. We would also evaluate your suggestion below.

We will contact you if and when we have a real concern.
Thanks.


Sure, I am curious how a 50% regression can be possible as a matter of
fact, so please update even if this caused by some specific synthetic
test conditions.

Thanks.



Sorry for the delay in the response. I ran super_netperf for 64B, 128B, 
256B and 512B and 500, 1000, 2000 and 4000 streams and compared these 
(consecutive) commits:

Base - f331981 tcp: pass previous skb to tcp_shifted_skb()
rb_tree - 75c119a tcp: implement rb-tree based retransmit queue

I got lower BW with rb-tree for all cases.
Example - 2000 streams results (in Gb/s):
size | Base | rb-tree | degradation
 64  | 25.6 | 23.3| -9%
 128 | 52.8 | 44.43   | -16%
 256 | 89.8 | 66.1| -26.5%
 512 | 87.7 | 67.8| -22.7%

I'm currently working on improving our CPU utilization in TX flow (by 
better utilizing payload aggregation mechanisms). It somewhat improves 
the rb-tree results when applied on top of it, but not for all cases and 
not to the "base" results.


Re: [PATCH net-next 0/7] tcp: implement rb-tree based retransmit queue

2018-01-24 Thread Tal Gilboa

Hi Eric,
My choice of words in my comment was misplaced, and I apologies. It 
completely missed the point. I understand, of course, the importance of 
optimizing real-life scenarios.


We are currently evaluating this patch and if/how it might affect our 
customers. We would also evaluate your suggestion below.


We will contact you if and when we have a real concern.
Thanks.

On 1/22/2018 1:47 AM, Eric Dumazet wrote:

On Sun, Jan 21, 2018 at 12:52 PM, Tal Gilboa <ta...@mellanox.com> wrote:

Hi Eric,
We have noticed a degradation on both of our drivers (mlx4 and mlx5) when
running TCP. Exact scenario is single stream TCP with 1KB packets. The
degradation is a steady 50% drop.
We tracked the offending commit to be:
75c119a ("tcp: implement rb-tree based retransmit queue")

Since mlx4 and mlx5 code base is completely different and by looking at the
changes in this commit, we believe the issue is external to the mlx4/5
drivers.

I see in the comment below you anticipated some overhead, but this may be a
too common case to ignore.

Can you please review and consider reverting/fixing it?



Hi Tal

You have to provide way more details than a simple mail, asking for a
" revert or a fix " ...

On our GFEs, we got a win, while I was expecting a small overhead,
given the apparent complexity of dealing with RB tree instead of
linear list.

And on the stress scenario described in my patch set, the win was
absolutely abysmal.

A " single strean TCP with 1KB packets"  is not something we need to optimize,
unless there is some really strange setup for one of your customers ?

Here we deal with millions of TCP flows, and this is what we need to
take care of.

Thanks.


Thanks,

Tal G.


On 10/7/2017 2:31 AM, David Miller wrote:


From: Eric Dumazet <eduma...@google.com>
Date: Thu,  5 Oct 2017 22:21:20 -0700


This patch series implement RB-tree based retransmit queue for TCP,
to better match modern BDP.



Indeed, there was a lot of resistence to this due to the overhead
for small retransmit queue sizes, but with today's scale this is
long overdue.

So, series applied, nice work!

Maybe we can look into dynamic schemes where when the queue never
goes over N entries we elide the rbtree and use a list.  I'm not
so sure how practical that would be.

Thanks!





Re: [PATCH net-next 0/7] tcp: implement rb-tree based retransmit queue

2018-01-21 Thread Tal Gilboa

Hi Eric,
We have noticed a degradation on both of our drivers (mlx4 and mlx5) 
when running TCP. Exact scenario is single stream TCP with 1KB packets. 
The degradation is a steady 50% drop.

We tracked the offending commit to be:
75c119a ("tcp: implement rb-tree based retransmit queue")

Since mlx4 and mlx5 code base is completely different and by looking at 
the changes in this commit, we believe the issue is external to the 
mlx4/5 drivers.


I see in the comment below you anticipated some overhead, but this may 
be a too common case to ignore.


Can you please review and consider reverting/fixing it?

Thanks,

Tal G.

On 10/7/2017 2:31 AM, David Miller wrote:

From: Eric Dumazet 
Date: Thu,  5 Oct 2017 22:21:20 -0700


This patch series implement RB-tree based retransmit queue for TCP,
to better match modern BDP.


Indeed, there was a lot of resistence to this due to the overhead
for small retransmit queue sizes, but with today's scale this is
long overdue.

So, series applied, nice work!

Maybe we can look into dynamic schemes where when the queue never
goes over N entries we elide the rbtree and use a list.  I'm not
so sure how practical that would be.

Thanks!



[PATCH net-next] net/dim: Fix fixpoint divide exception in net_dim_stats_compare

2018-01-17 Thread Tal Gilboa
From: Talat Batheesh <tal...@mellanox.com>

Helmut reported a bug about devision by zero while
running traffic and doing physical cable pull test.

When the cable unplugged the ppms become zero, so when
dividing the current ppms by the previous ppms in the
next dim iteration there is devision by zero.

This patch prevent this division for both ppms and epms.

Fixes: c3164d2fc48f ("net/mlx5e: Added BW check for DIM decision mechanism")
Fixes: 4c4dbb4a7363 ("net/mlx5e: Move dynamic interrupt coalescing code to 
include/linux")
Reported-by: Helmut Grauer <helmut.gra...@de.ibm.com>
Signed-off-by: Talat Batheesh <tal...@mellanox.com>
Signed-off-by: Tal Gilboa <ta...@mellanox.com>
---
 include/linux/net_dim.h | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/include/linux/net_dim.h b/include/linux/net_dim.h
index 1c7e450..bebeaad 100644
--- a/include/linux/net_dim.h
+++ b/include/linux/net_dim.h
@@ -244,10 +244,17 @@ static inline int net_dim_stats_compare(struct 
net_dim_stats *curr,
return (curr->bpms > prev->bpms) ? NET_DIM_STATS_BETTER :
   NET_DIM_STATS_WORSE;
 
+   if (!prev->ppms)
+   return curr->ppms ? NET_DIM_STATS_BETTER :
+   NET_DIM_STATS_SAME;
+
if (IS_SIGNIFICANT_DIFF(curr->ppms, prev->ppms))
return (curr->ppms > prev->ppms) ? NET_DIM_STATS_BETTER :
   NET_DIM_STATS_WORSE;
 
+   if (!prev->epms)
+   return NET_DIM_STATS_SAME;
+
if (IS_SIGNIFICANT_DIFF(curr->epms, prev->epms))
return (curr->epms < prev->epms) ? NET_DIM_STATS_BETTER :
   NET_DIM_STATS_WORSE;
-- 
1.8.3.1



Re: [PATCH net-next v4 00/10] net: create dynamic software irq moderation library

2018-01-09 Thread Tal Gilboa

On 1/10/2018 12:46 AM, Florian Fainelli wrote:

Hey Andy,

On 01/09/2018 01:06 PM, Andy Gospodarek wrote:

From: Andy Gospodarek <go...@broadcom.com>

This converts the dynamic interrupt moderation library from the mlx5e
driver into a library so it can be used by any driver.  The penultimate
patch in this set adds support for this new dynamic interrupt moderation
library in the bnxt_en driver and the last patch creates an entry in the
MAINTAINERS file for this library.

The main purpose of this code is to allow an administrator to make sure
that default coalesce settings are optimized for low latency, but
quickly adapt to handle high throughput/bulk traffic by altering how
much time passes before popping an interrupt.

For any new driver the following changes would be needed to use this
library:

- add elements in ring struct to track items needed by this library
- create function that can be called to actually set coalesce settings
   for the driver

Credit to Rob Rice and Lee Reed for doing some of the initial proof of
concept and testing for this patch and Tal Gilboa and Or Gerlitz for
their comments, etc on this set.

v4: Fix build breakage for VF representers noticed by kbuild test robot.
Thanks for being so courteous, kbuild test robot!

v3: bnxt_en fix from Michael Chan, comment suggestion from Vasundhara
Volam, and small mlx5e header file fix from Tal Gilboa.

v2: Spelling fixes from Stephen Hemminger, bnxt_en suggestions from
Michael Chan, spelling and formatting fixes from Or Gerlitz, and
spelling and mlx5e changes suggested by Tal Gilboa.


Certainly not a blocking item for this patch series, but can you
consider a follow up patch adding a small bit of documentation entry
covering how the implementation works as well as possible
limitations/considerations depending on what the networking HW supports
in terms of interrupt moderation capabilities? (e.g: is it necessary to
support generating an interrupt on ring empty, a micro-second resolution
RX/TX timeout etc. etc.).

Thanks for doing this!



Hi Florian, I plan to do so right after these patches would be accepted.


Re: [PATCH net-next v4 00/10] net: create dynamic software irq moderation library

2018-01-09 Thread Tal Gilboa

On 1/9/2018 11:06 PM, Andy Gospodarek wrote:

From: Andy Gospodarek <go...@broadcom.com>

This converts the dynamic interrupt moderation library from the mlx5e
driver into a library so it can be used by any driver.  The penultimate
patch in this set adds support for this new dynamic interrupt moderation
library in the bnxt_en driver and the last patch creates an entry in the
MAINTAINERS file for this library.

The main purpose of this code is to allow an administrator to make sure
that default coalesce settings are optimized for low latency, but
quickly adapt to handle high throughput/bulk traffic by altering how
much time passes before popping an interrupt.

For any new driver the following changes would be needed to use this
library:

- add elements in ring struct to track items needed by this library
- create function that can be called to actually set coalesce settings
   for the driver

Credit to Rob Rice and Lee Reed for doing some of the initial proof of
concept and testing for this patch and Tal Gilboa and Or Gerlitz for
their comments, etc on this set.

v4: Fix build breakage for VF representers noticed by kbuild test robot.
Thanks for being so courteous, kbuild test robot!

v3: bnxt_en fix from Michael Chan, comment suggestion from Vasundhara
Volam, and small mlx5e header file fix from Tal Gilboa.

v2: Spelling fixes from Stephen Hemminger, bnxt_en suggestions from
Michael Chan, spelling and formatting fixes from Or Gerlitz, and
spelling and mlx5e changes suggested by Tal Gilboa.

Andy Gospodarek (10):
   net/mlx5e: Move interrupt moderation structs to new file
   net/mlx5e: Move interrupt moderation forward declarations
   net/mlx5e: Remove rq references in mlx5e_rx_am
   net/mlx5e: Move AM logic enums
   net/mlx5e: Move generic functions to new file
   net/mlx5e: Change Mellanox references in DIM code
   net/mlx5e: Move dynamic interrupt coalescing code to include/linux
   net/dim: use struct net_dim_sample as arg to net_dim
   bnxt_en: add support for software dynamic interrupt moderation
   MAINTAINERS: add entry for Dynamic Interrupt Moderation

  MAINTAINERS|   5 +
  drivers/net/ethernet/broadcom/bnxt/Makefile|   2 +-
  drivers/net/ethernet/broadcom/bnxt/bnxt.c  |  50 +++
  drivers/net/ethernet/broadcom/bnxt/bnxt.h  |  34 +-
  drivers/net/ethernet/broadcom/bnxt/bnxt_dim.c  |  33 ++
  drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c  |  12 +
  drivers/net/ethernet/mellanox/mlx5/core/Makefile   |   2 +-
  drivers/net/ethernet/mellanox/mlx5/core/en.h   |  46 +--
  drivers/net/ethernet/mellanox/mlx5/core/en_dim.c   |  49 +++
  .../net/ethernet/mellanox/mlx5/core/en_ethtool.c   |   6 +-
  drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |  40 ++-
  drivers/net/ethernet/mellanox/mlx5/core/en_rep.c   |   2 +-
  drivers/net/ethernet/mellanox/mlx5/core/en_rx_am.c | 341 ---
  drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c  |  10 +-
  include/linux/net_dim.h| 373 +
  15 files changed, 594 insertions(+), 411 deletions(-)
  create mode 100644 drivers/net/ethernet/broadcom/bnxt/bnxt_dim.c
  create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_dim.c
  delete mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_rx_am.c
  create mode 100644 include/linux/net_dim.h



+1. Great work Andy!


Re: [PATCH net-next v3 06/10] net/mlx5e: Change Mellanox references in DIM code

2018-01-09 Thread Tal Gilboa

On 1/9/2018 6:06 PM, Andy Gospodarek wrote:

On Mon, Jan 08, 2018 at 11:06:28PM -0800, Saeed Mahameed wrote:



On 01/08/2018 10:13 PM, Andy Gospodarek wrote:

From: Andy Gospodarek <go...@broadcom.com>

Change all appropriate mlx5_am* and MLX5_AM* references to net_dim and
NET_DIM, respectively, in code that handles dynamic interrupt
moderation.  Also change all references from 'am' to 'dim' when used as
local variables and add generic profile references.

Signed-off-by: Andy Gospodarek <go...@broadcom.com>
Acked-by: Tal Gilboa <ta...@mellanox.com>
Acked-by: Saeed Mahameed <sae...@mellanox.com>
---
   drivers/net/ethernet/mellanox/mlx5/core/en.h   |   9 +-
   drivers/net/ethernet/mellanox/mlx5/core/en_dim.c   |  14 +-
   .../net/ethernet/mellanox/mlx5/core/en_ethtool.c   |   6 +-
   drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |  40 ++-
   drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c  |   8 +-
   drivers/net/ethernet/mellanox/mlx5/core/net_dim.c  | 286 
++---
   drivers/net/ethernet/mellanox/mlx5/core/net_dim.h  |  63 ++---
   7 files changed, 225 insertions(+), 201 deletions(-)



[...]


   #define IS_SIGNIFICANT_DIFF(val, ref) \
(((100 * abs((val) - (ref))) / (ref)) > 10) /* more than 10% difference 
*/
-static int mlx5e_am_stats_compare(struct mlx5e_rx_am_stats *curr,
- struct mlx5e_rx_am_stats *prev)
+static int net_dim_stats_compare(struct net_dim_stats *curr,
+struct net_dim_stats *prev)
   {
if (!prev->bpms)
-   return curr->bpms ? MLX5E_AM_STATS_BETTER :
-   MLX5E_AM_STATS_SAME;
+   return curr->bpms ? NET_DIM_STATS_BETTER :
+   NET_DIM_STATS_SAME;
if (IS_SIGNIFICANT_DIFF(curr->bpms, prev->bpms))
-   return (curr->bpms > prev->bpms) ? MLX5E_AM_STATS_BETTER :
-  MLX5E_AM_STATS_WORSE;
+   return (curr->bpms > prev->bpms) ? NET_DIM_STATS_BETTER :
+  NET_DIM_STATS_WORSE;


Hey Andy,

I am currently reviewing a patch internally that fixes a bug in this area,
prev->ppms can be 0 and could cause IS_SIGNIFICANT_DIFF ouch !
same goes for prev->eppm, for some reason we had a broken assumption that if
ppms is 0 for some reason then the bpms is 0 and the above condition will
cover us.

Anyway the patch will go to net, which means when this series gets accepted
then net-next will fail to merge with net and we need to manually push the
fix to the new DIM library.

But for now I don't think anything is required for this series other than
bringing this division by 0 issue and the future merge conflict to your
attention.



Thanks for bringing that to everyone's attention.  I agree there is
probably not much that should be done at this point -- hopefully the
merge should go pretty smoothly, since net_dim.h is seen as a rename
from en_rx_am.c.


I talked with Talat, who is submitting the fix. He will apply it over 
these patches after they are accepted.






if (IS_SIGNIFICANT_DIFF(curr->ppms, prev->ppms))
-   return (curr->ppms > prev->ppms) ? MLX5E_AM_STATS_BETTER :
-  MLX5E_AM_STATS_WORSE;
+   return (curr->ppms > prev->ppms) ? NET_DIM_STATS_BETTER :
+  NET_DIM_STATS_WORSE;
if (IS_SIGNIFICANT_DIFF(curr->epms, prev->epms))
-   return (curr->epms < prev->epms) ? MLX5E_AM_STATS_BETTER :
-  MLX5E_AM_STATS_WORSE;
+   return (curr->epms < prev->epms) ? NET_DIM_STATS_BETTER :
+  NET_DIM_STATS_WORSE;
-   return MLX5E_AM_STATS_SAME;
+   return NET_DIM_STATS_SAME;
   }


Re: [PATCH net-next v3 00/10] net: create dynamic software irq moderation library

2018-01-08 Thread Tal Gilboa

On 1/9/2018 8:13 AM, Andy Gospodarek wrote:

From: Andy Gospodarek 

This converts the dynamic interrupt moderation library from the mlx5e
driver into a library so it can be used by any driver.  The penultimate
patch in this set adds support for thiw new dynamic interrupt moderation
library in the bnxt_en driver and the last patch creates an entry in the
MAINTAINERS file for this library.


thiw->this.


Re: [PATCH net-next v2 06/10] net/mlx5e: Change Mellanox references in DIM code

2018-01-07 Thread Tal Gilboa



On 1/6/2018 12:58 AM, Andy Gospodarek wrote:

From: Andy Gospodarek <go...@broadcom.com>

Change all appropriate mlx5_am* and MLX5_AM* references to net_dim and
NET_DIM, respectively, in code that handles dynamic interrupt
moderation.  Also change all references from 'am' to 'dim' when used as
local variables.

Signed-off-by: Andy Gospodarek <go...@broadcom.com>
Acked-by: Tal Gilboa <ta...@mellanox.com>
Acked-by: Saeed Mahameed <sae...@mellanox.com>

---
  drivers/net/ethernet/mellanox/mlx5/core/en.h   |  12 +-
  drivers/net/ethernet/mellanox/mlx5/core/en_dim.c   |  14 +-
  .../net/ethernet/mellanox/mlx5/core/en_ethtool.c   |  12 +-
  drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |  52 ++--
  drivers/net/ethernet/mellanox/mlx5/core/en_rep.c   |   4 +-
  drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c  |   2 +-
  drivers/net/ethernet/mellanox/mlx5/core/net_dim.c  | 284 ++---
  drivers/net/ethernet/mellanox/mlx5/core/net_dim.h  |  63 +++--
  8 files changed, 232 insertions(+), 211 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 121f280..732f275 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -115,6 +115,9 @@
  #define MLX5E_PARAMS_DEFAULT_MIN_RX_WQES0x80
  #define MLX5E_PARAMS_DEFAULT_MIN_RX_WQES_MPW0x2
  
+#define MLX5E_CQ_PERIOD_MODE_START_FROM_EQE		0x0

+#define MLX5E_CQ_PERIOD_MODE_START_FROM_CQE0x1
+


This enum should be defined under include/linux/mlx5/mlx5_ifc.h:
enum {
MLX5_CQ_PERIOD_MODE_START_FROM_EQE = 0x0,
MLX5_CQ_PERIOD_MODE_START_FROM_CQE = 0x1,
MLX5_CQ_PERIOD_NUM_MODES
};

We already include this in the relevant files. This was changed in patch 
01/10, please revert.




Re: [net-next 00/10] net: create dynamic software irq moderation library

2018-01-05 Thread Tal Gilboa

Thanks Andy for your hard work. Looks great overall!

On 1/4/2018 10:21 PM, Andy Gospodarek wrote:

From: Andy Gospodarek <go...@broadcom.com>

This converts the dynamic interrupt moderation library from the mlx5_en driver
into a library so it can be used by any driver.  The penultimatepatch in this

Had to look up "penultimatepatch " :), but aren't these two words?


set adds support for interrupt moderation in the bnxt_en driver and the last
patch creates an entry in the MAINTAINERS file.

The main purpose of this code in the mlx5_en driver is to allow an
administrator to make sure that default coalesce settings are optimized
for low latency, but quickly adapt to handle high throughput traffic and
optimize how many packets are received during each napi poll.

For any new driver the following changes would be needed to use this
library:

- add elements in ring struct to track items needed by this library
- create function that can be called to actually set coalesce settings
   for the driver

Credit to Rob Rice and Lee Reed for doing some of the initial proof of
concept and testing for this patch and Tal Gilboa and Or Gerlitz for their
comments, etc on this set.

Andy Gospodarek (10):
   net/mlx5e: move interrupt moderation structs to new file
   net/mlx5e: move interrupt moderation forward declarations
   net/mlx5e: remove rq references in mlx5e_rx_am
   net/mlx5e: move AM logic enums
   net/mlx5e: move generic functions to new file
   net/mlx5e: change Mellanox references in DIM code
   net: move dynamic interrpt coalescing code to include/linux
interrpt -> interrupt. The topic of the actual patch was fixed, only 
left in the cover.



   net/dim: use struct net_dim_sample as arg to net_dim
   bnxt_en: add support for software dynamic interrupt moderation
   MAINTAINERS: add entry for Dynamic Interrupt Moderation

  MAINTAINERS|   5 +
  drivers/net/ethernet/broadcom/bnxt/Makefile|   2 +-
  drivers/net/ethernet/broadcom/bnxt/bnxt.c  |  52 +++
  drivers/net/ethernet/broadcom/bnxt/bnxt.h  |  34 +-
  drivers/net/ethernet/broadcom/bnxt/bnxt_dim.c  |  32 ++
  drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c  |  12 +
  drivers/net/ethernet/mellanox/mlx5/core/Makefile   |   2 +-
  drivers/net/ethernet/mellanox/mlx5/core/en.h   |  46 +--
  drivers/net/ethernet/mellanox/mlx5/core/en_dim.c   |  49 +++
  .../net/ethernet/mellanox/mlx5/core/en_ethtool.c   |  12 +-
  drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |  32 +-
  drivers/net/ethernet/mellanox/mlx5/core/en_rep.c   |   4 +-
  drivers/net/ethernet/mellanox/mlx5/core/en_rx_am.c | 341 ---
  drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c  |  10 +-
  drivers/net/ethernet/mellanox/mlx5/core/net_dim.h  | 108 ++
  include/linux/mlx5/mlx5_ifc.h  |   6 -
  include/linux/net_dim.h| 372 +
  17 files changed, 693 insertions(+), 426 deletions(-)
  create mode 100644 drivers/net/ethernet/broadcom/bnxt/bnxt_dim.c
  create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_dim.c
  delete mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_rx_am.c
  create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/net_dim.h

mlx5/core/net_dim.h was removed from code. Please fix the cover.

  create mode 100644 include/linux/net_dim.h



Re: [net-next 08/10] net/dim: use struct net_dim_sample as arg to net_dim

2018-01-05 Thread Tal Gilboa

Thanks for doing this, would make future changes easier.

On 1/4/2018 10:21 PM, Andy Gospodarek wrote:

From: Andy Gospodarek 

Simplify the arguments net_dim() by formatting them into a struct
net_dim_sample before calling the function.



Re: [net-next 06/10] net/mlx5e: change Mellanox references in DIM code

2018-01-05 Thread Tal Gilboa

On 1/4/2018 10:21 PM, Andy Gospodarek wrote:

From: Andy Gospodarek 

Change all mlx5_am* and MLX_AM* references to net_dim and NET_DIM,

MLX_AM->MLX5_AM


cq_period_mode = enable ?
-   MLX5_CQ_PERIOD_MODE_START_FROM_CQE :
-   MLX5_CQ_PERIOD_MODE_START_FROM_EQE;
+   NET_DIM_CQ_PERIOD_MODE_START_FROM_CQE :
+   NET_DIM_CQ_PERIOD_MODE_START_FROM_EQE;
I'm not sure about this part. CQE/EQE based moderation is a feature in 
Mellanox's chips, which isn't necessarily coupled with adaptive 
moderation. net_dim lib should know which values to choose according to 
the selected mode, but I don't think mlx5 driver should use an enum from 
net_dim for enabling/disabling HW features. Another issue is that we use 
the enum value as an argument for the command to HW (0=EQE, 1=CQE). If 
someone would change the values it would break the HW feature. I think 
it would be safer to use the NET_DIM_XXX enum only when using functions 
from net_dim lib.



current_cq_period_mode = is_rx_cq ?
priv->channels.params.rx_cq_moderation.cq_period_mode :
priv->channels.params.tx_cq_moderation.cq_period_mode;
mode_changed = cq_period_mode != current_cq_period_mode;
  
-	if (cq_period_mode == MLX5_CQ_PERIOD_MODE_START_FROM_CQE &&

+   if (cq_period_mode == NET_DIM_CQ_PERIOD_MODE_START_FROM_CQE &&
!MLX5_CAP_GEN(mdev, cq_period_start_from_cqe))
return -EOPNOTSUPP;
  
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c

index 3aa1c90..edd4077 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -674,8 +674,8 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
wqe->data.lkey = rq->mkey_be;
}
  
-	INIT_WORK(>am.work, mlx5e_rx_am_work);

-   rq->am.mode = params->rx_cq_moderation.cq_period_mode;
+   INIT_WORK(>dim.work, mlx5e_rx_dim_work);
+   rq->dim.mode = params->rx_cq_moderation.cq_period_mode;
rq->page_cache.head = 0;
rq->page_cache.tail = 0;
  
@@ -919,7 +919,7 @@ static int mlx5e_open_rq(struct mlx5e_channel *c,

if (err)
goto err_destroy_rq;
  
-	if (params->rx_am_enabled)

+   if (params->rx_dim_enabled)
c->rq.state |= BIT(MLX5E_RQ_STATE_AM);
  
  	return 0;

@@ -952,7 +952,7 @@ static void mlx5e_deactivate_rq(struct mlx5e_rq *rq)
  
  static void mlx5e_close_rq(struct mlx5e_rq *rq)

  {
-   cancel_work_sync(>am.work);
+   cancel_work_sync(>dim.work);
mlx5e_destroy_rq(rq);
mlx5e_free_rx_descs(rq);
mlx5e_free_rq(rq);
@@ -1565,7 +1565,7 @@ static void mlx5e_destroy_cq(struct mlx5e_cq *cq)
  }
  
  static int mlx5e_open_cq(struct mlx5e_channel *c,

-struct mlx5e_cq_moder moder,
+struct net_dim_cq_moder moder,
 struct mlx5e_cq_param *param,
 struct mlx5e_cq *cq)
  {
@@ -1747,7 +1747,7 @@ static int mlx5e_open_channel(struct mlx5e_priv *priv, 
int ix,
  struct mlx5e_channel_param *cparam,
  struct mlx5e_channel **cp)
  {
-   struct mlx5e_cq_moder icocq_moder = {0, 0};
+   struct net_dim_cq_moder icocq_moder = {0, 0};
struct net_device *netdev = priv->netdev;
int cpu = mlx5e_get_cpu(priv, ix);
struct mlx5e_channel *c;
@@ -1999,7 +1999,7 @@ static void mlx5e_build_ico_cq_param(struct mlx5e_priv 
*priv,
  
  	mlx5e_build_common_cq_param(priv, param);
  
-	param->cq_period_mode = MLX5_CQ_PERIOD_MODE_START_FROM_EQE;

+   param->cq_period_mode = NET_DIM_CQ_PERIOD_MODE_START_FROM_EQE;
  }
  
  static void mlx5e_build_icosq_param(struct mlx5e_priv *priv,

@@ -4016,13 +4016,13 @@ void mlx5e_set_tx_cq_mode_params(struct mlx5e_params 
*params, u8 cq_period_mode)
params->tx_cq_moderation.usec =
MLX5E_PARAMS_DEFAULT_TX_CQ_MODERATION_USEC;
  
-	if (cq_period_mode == MLX5_CQ_PERIOD_MODE_START_FROM_CQE)

+   if (cq_period_mode == NET_DIM_CQ_PERIOD_MODE_START_FROM_CQE)
params->tx_cq_moderation.usec =
MLX5E_PARAMS_DEFAULT_TX_CQ_MODERATION_USEC_FROM_CQE;
  
  	MLX5E_SET_PFLAG(params, MLX5E_PFLAG_TX_CQE_BASED_MODER,

params->tx_cq_moderation.cq_period_mode ==
-   MLX5_CQ_PERIOD_MODE_START_FROM_CQE);
+   NET_DIM_CQ_PERIOD_MODE_START_FROM_CQE);
  }
  
  void mlx5e_set_rx_cq_mode_params(struct mlx5e_params *params, u8 cq_period_mode)

@@ -4034,17 +4034,17 @@ void mlx5e_set_rx_cq_mode_params(struct mlx5e_params 
*params, u8 cq_period_mode)
params->rx_cq_moderation.usec =
MLX5E_PARAMS_DEFAULT_RX_CQ_MODERATION_USEC;
  
-	if (cq_period_mode == MLX5_CQ_PERIOD_MODE_START_FROM_CQE)

+   if