[PATCH net-next] liquidio: Add the features to show FEC settings and set FEC settings

2018-09-16 Thread Felix Manlunas
From: Weilin Chang 

1. Add functions for get_fecparam and set_fecparam.
2. Modify lio_get_link_ksettings to display FEC setting.

Signed-off-by: Weilin Chang 
Acked-by: Derek Chickles 
Signed-off-by: Felix Manlunas 
---
 drivers/net/ethernet/cavium/liquidio/lio_core.c| 148 +
 drivers/net/ethernet/cavium/liquidio/lio_ethtool.c |  76 ++-
 drivers/net/ethernet/cavium/liquidio/lio_main.c|   8 ++
 .../net/ethernet/cavium/liquidio/liquidio_common.h |   5 +
 .../net/ethernet/cavium/liquidio/octeon_device.h   |   2 +
 .../net/ethernet/cavium/liquidio/octeon_network.h  |   7 +-
 6 files changed, 243 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/cavium/liquidio/lio_core.c 
b/drivers/net/ethernet/cavium/liquidio/lio_core.c
index 52b32b8..eb96b06 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_core.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_core.c
@@ -1654,3 +1654,151 @@ int liquidio_get_speed(struct lio *lio)
 
return retval;
 }
+
+int liquidio_set_fec(struct lio *lio, int on_off)
+{
+   struct oct_nic_seapi_resp *resp;
+   struct octeon_soft_command *sc;
+   struct octeon_device *oct;
+   union octnet_cmd *ncmd;
+   int retval;
+   u32 var;
+
+   oct = lio->oct_dev;
+
+   if (oct->props[lio->ifidx].fec == on_off)
+   return 0;
+
+   if (!OCTEON_CN23XX_PF(oct)) {
+   dev_err(&oct->pci_dev->dev, "%s: SET FEC only for PF\n",
+   __func__);
+   return -1;
+   }
+
+   if (oct->speed_boot != 25)  {
+   dev_err(&oct->pci_dev->dev,
+   "Set FEC only when link speed is 25G during insmod\n");
+   return -1;
+   }
+
+   sc = octeon_alloc_soft_command(oct, OCTNET_CMD_SIZE,
+  sizeof(struct oct_nic_seapi_resp), 0);
+
+   ncmd = sc->virtdptr;
+   resp = sc->virtrptr;
+   memset(resp, 0, sizeof(struct oct_nic_seapi_resp));
+
+   init_completion(&sc->complete);
+   sc->sc_status = OCTEON_REQUEST_PENDING;
+
+   ncmd->u64 = 0;
+   ncmd->s.cmd = SEAPI_CMD_FEC_SET;
+   ncmd->s.param1 = on_off;
+   /* SEAPI_CMD_FEC_DISABLE(0) or SEAPI_CMD_FEC_RS(1) */
+
+   octeon_swap_8B_data((u64 *)ncmd, (OCTNET_CMD_SIZE >> 3));
+
+   sc->iq_no = lio->linfo.txpciq[0].s.q_no;
+
+   octeon_prepare_soft_command(oct, sc, OPCODE_NIC,
+   OPCODE_NIC_UBOOT_CTL, 0, 0, 0);
+
+   retval = octeon_send_soft_command(oct, sc);
+   if (retval == IQ_SEND_FAILED) {
+   dev_info(&oct->pci_dev->dev, "Failed to send soft command\n");
+   octeon_free_soft_command(oct, sc);
+   return -EIO;
+   }
+
+   retval = wait_for_sc_completion_timeout(oct, sc, 0);
+   if (retval)
+   return (-EIO);
+
+   var = be32_to_cpu(resp->fec_setting);
+   resp->fec_setting = var;
+   if (var != on_off) {
+   dev_err(&oct->pci_dev->dev,
+   "Setting failed fec= %x, expect %x\n",
+   var, on_off);
+   oct->props[lio->ifidx].fec = var;
+   if (resp->fec_setting == SEAPI_CMD_FEC_SET_RS)
+   oct->props[lio->ifidx].fec = 1;
+   else
+   oct->props[lio->ifidx].fec = 0;
+   }
+
+   WRITE_ONCE(sc->caller_is_done, true);
+
+   if (oct->props[lio->ifidx].fec !=
+   oct->props[lio->ifidx].fec_boot) {
+   dev_dbg(&oct->pci_dev->dev,
+   "Reloade driver to chang fec to %s\n",
+   oct->props[lio->ifidx].fec ? "on" : "off");
+   }
+
+   return retval;
+}
+
+int liquidio_get_fec(struct lio *lio)
+{
+   struct oct_nic_seapi_resp *resp;
+   struct octeon_soft_command *sc;
+   struct octeon_device *oct;
+   union octnet_cmd *ncmd;
+   int retval;
+   u32 var;
+
+   oct = lio->oct_dev;
+
+   sc = octeon_alloc_soft_command(oct, OCTNET_CMD_SIZE,
+  sizeof(struct oct_nic_seapi_resp), 0);
+   if (!sc)
+   return -ENOMEM;
+
+   ncmd = sc->virtdptr;
+   resp = sc->virtrptr;
+   memset(resp, 0, sizeof(struct oct_nic_seapi_resp));
+
+   init_completion(&sc->complete);
+   sc->sc_status = OCTEON_REQUEST_PENDING;
+
+   ncmd->u64 = 0;
+   ncmd->s.cmd = SEAPI_CMD_FEC_GET;
+
+   octeon_swap_8B_data((u64 *)ncmd, (OCTNET_CMD_SIZE >> 3));
+
+   sc->iq_no = lio->linfo.txpciq[0].s.q_no;
+
+   octeon_prepare_soft_command(oct, sc, OPCODE_NIC,
+   OPCODE_NIC_UBOOT_CTL, 0, 0, 0);
+
+   retval = octeon_send_soft_command(oct, sc);
+   if (retval == IQ_SEND_FAILED) {
+   dev_info(&oct->pci_dev->dev,
+"%s: Failed to send soft command\n", __func__);
+   octeon_free_soft_command(oct, sc);
+   return -

Re: [PATCH net-next v4 18/20] crypto: port ChaCha20 to Zinc

2018-09-16 Thread Jason A. Donenfeld
Hey Martin,

Thanks for running these and pointing this out. I've replicated the
results with tcrypt and fixed some issues, and the next patch series
should be a lot closer to what you'd expect, instead of the regression
you noticed. Most of the slowdown happened as a result of over-eager
XSAVEs, which I've now rectified. I'm still working on a few other
facets of it, but I believe v5 will be more satisfactory when posted.

Regards,
Jason


Re: [PATCH net-next v3 02/17] zinc: introduce minimal cryptography library

2018-09-16 Thread David Miller
From: Andy Lutomirski 
Date: Sun, 16 Sep 2018 21:09:11 -0700

> CRYPTO API
> M:  Herbert Xu 
> M:  "David S. Miller" 
> L:  linux-cry...@vger.kernel.org
> 
> Herbert hasn't replied to any of these submissions.  You're the other
> maintainer :)

Herbert is the primary crypto maintainer, I haven't done a serious
review of crypto code in ages.

So yes, Herbert review is what is important here.


Re: [PATCH net] netfilter: bridge: Don't sabotage nf_hook calls from an l3mdev

2018-09-16 Thread David Ahern
Pablo:

DaveM has this marked as waiting for upstream. Any comment on this patch?

Thanks,
David

On 9/7/18 3:08 PM, dsah...@kernel.org wrote:
> From: David Ahern 
> 
> For starters, the bridge netfilter code registers operations that
> are invoked any time nh_hook is called. Specifically, ip_sabotage_in
> watches for nested calls for NF_INET_PRE_ROUTING when a bridge is in
> the stack.
> 
> Packet wise, the bridge netfilter hook runs first. br_nf_pre_routing
> allocates nf_bridge, sets in_prerouting to 1 and calls NF_HOOK for
> NF_INET_PRE_ROUTING. It's finish function, br_nf_pre_routing_finish,
> then resets in_prerouting flag to 0 and the packet continues up the
> stack. The packet eventually makes it to the VRF driver and it invokes
> nf_hook for NF_INET_PRE_ROUTING in case any rules have been added against
> the vrf device.
> 
> Because of the registered operations the call to nf_hook causes
> ip_sabotage_in to be invoked. That function sees the nf_bridge on the
> skb and that in_prerouting is not set. Thinking it is an invalid nested
> call it steals (drops) the packet.
> 
> Update ip_sabotage_in to recognize that the bridge or one of its upper
> devices (e.g., vlan) can be enslaved to a VRF (L3 master device) and
> allow the packet to go through the nf_hook a second time.
> 
> Fixes: 73e20b761acf ("net: vrf: Add support for PREROUTING rules on vrf 
> device")
> Reported-by: D'Souza, Nelson 
> Signed-off-by: David Ahern 
> ---
>  net/bridge/br_netfilter_hooks.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/net/bridge/br_netfilter_hooks.c b/net/bridge/br_netfilter_hooks.c
> index 6e0dc6bcd32a..37278dc280eb 100644
> --- a/net/bridge/br_netfilter_hooks.c
> +++ b/net/bridge/br_netfilter_hooks.c
> @@ -835,7 +835,8 @@ static unsigned int ip_sabotage_in(void *priv,
>  struct sk_buff *skb,
>  const struct nf_hook_state *state)
>  {
> - if (skb->nf_bridge && !skb->nf_bridge->in_prerouting) {
> + if (skb->nf_bridge && !skb->nf_bridge->in_prerouting &&
> + !netif_is_l3_master(skb->dev)) {
>   state->okfn(state->net, state->sk, skb);
>   return NF_STOLEN;
>   }
> 



Re: [PATCH net-next v3 02/17] zinc: introduce minimal cryptography library

2018-09-16 Thread Andy Lutomirski
On Tue, Sep 11, 2018 at 4:57 PM David Miller  wrote:
>
> From: Andrew Lunn 
> Date: Wed, 12 Sep 2018 01:30:15 +0200
>
> > Just as an FYI:
> >
> > 1) I don't think anybody in netdev has taken a serious look at the
> > network code yet. There is little point until the controversial part
> > of the code, Zinc, has been sorted out.
> >
> > 2) I personally would be surprised if DaveM took this code without
> > having an Acked-by from the crypto subsystem people. In the same way,
> > i doubt the crypto people would take an Ethernet driver without having
> > DaveM's Acked-by.
>
> Both of Andrew's statements are completely true.
>
> I'm not looking at any of the networking bits until the crypto stuff
> is fully sorted and fully supported and Ack'd by crypto folks.

So, as a process question, whom exactly are we waiting for:

CRYPTO API
M:  Herbert Xu 
M:  "David S. Miller" 
L:  linux-cry...@vger.kernel.org

Herbert hasn't replied to any of these submissions.  You're the other
maintainer :)

To the extent that you (DaveM) want my ack, here's what I think of the
series so far:

The new APIs to the crypto primitives are good.  For code that wants
to do a specific known crypto operation, they are much, much more
pleasant to use than the existing crypto API.  The code cleanups in
the big keys patch speak for themselves IMO.  I have no problem with
the addition of a brand-new API to the kernel, especially when it's a
nice one like Zinc's, even if that API starts out with only a couple
of users.

Zinc's arrangement of arch code is just fine.  Sure, there are
arguments that putting arch-specific code in arch/ is better, but this
is mostly bikeshedding IMO.

There has been some discussion of the exact best way to handle
simd_relax() and some other minor nitpicks of API details.  This kind
of stuff doesn't need to block the series -- it can always be reworked
down the road if needed.

There are two things I don't like right now, though:

1. Zinc conflates the addition of a new API with the replacement of
some algorithm implementations.  This is problematic.  Look at the
recent benchmarks of ipsec before and after this series.  Apparently
big packets get faster and small packets get slower.  It would be
really nice to bisect the series to narrow down *where* the regression
came from, but, as currently structured, you can't.

The right way to do this is to rearrange the series.  First, the new
Zinc APIs should be added, and they should be backed with the
*existing* crypto code.  (If the code needs to be moved or copied to a
new location, so be it.  The patch will be messy because somehow the
Zinc API is going to have to dispatch to the arch-specific code, and
the way that the crypto API handles it is not exactly friendly to this
type of use.  So be it.)  Then another patch should switch the crypto
API to use the Zinc interface.  That patch, *by itself*, can be
benchmarked.  If it causes a regression for small ipsec packets, then
it can be tracked down relatively easily.  Once this is all done, the
actual crypto implementation can be changed, and that changed can be
reviewed on its own merits.

As a simplistic example, I wrote this code a while back:

https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=crypto/sha256_bpf&id=e9e12f056f2abed50a30b762db9185799f5864e6

and its two parents.  This series added a more Zinc-like API to
SHA256.  And it did it without replacing the SHA256 implementation.
Doing this for Zinc would be a bit more complication, since the arch
code would need to be invoked too, but it should be doable.

FWIW, Wireguard should not actually depend on the replacement of the
crypto implementation.

2. The new Zinc crypto implementations look like they're brand new.  I
realize that they have some history, some of them are derived from
OpenSSL, etc, but none of this is really apparent in the patches
themselves.  It would be great if the kernel could literally share the
same code as something like OpenSSL, since OpenSSL gets much more
attention than the kernel's crypto.  Failing that, it would be nice if
the patches made it more clear how the code differs from its origin.
At the very least, though, if the replacement of the crypto code were,
as above, a patch that just replaced the crypto code, it would be much
easier to review and benchmark intelligently.

--Andy


Re: [PATCH net-next] net/smc: cast sizeof to int for comparison

2018-09-16 Thread YueHaibing
On 2018/9/15 19:35, Andreas Schwab wrote:
> On Sep 15 2018, YueHaibing  wrote:
> 
>> Comparing an int to a size, which is unsigned, causes the int to become
>> unsigned, giving the wrong result. kernel_sendmsg can return a negative
>> error code.
>>
>> Signed-off-by: YueHaibing 
>> ---
>>  net/smc/smc_clc.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/net/smc/smc_clc.c b/net/smc/smc_clc.c
>> index 83aba9a..fd0f5ce 100644
>> --- a/net/smc/smc_clc.c
>> +++ b/net/smc/smc_clc.c
>> @@ -446,7 +446,7 @@ int smc_clc_send_proposal(struct smc_sock *smc, int 
>> smc_type,
>>  vec[i++].iov_len = sizeof(trl);
>>  /* due to the few bytes needed for clc-handshake this cannot block */
>>  len = kernel_sendmsg(smc->clcsock, &msg, vec, i, plen);
>> -if (len < sizeof(pclc)) {
>> +if (len < (int)sizeof(pclc)) {
>>  if (len >= 0) {
>>  reason_code = -ENETUNREACH;
>>  smc->sk.sk_err = -reason_code;
> 
> It would perhaps be better to handle len < 0 first.

That need refactor the err hangding, is worth doing it?

> 
> Andreas.
> 



Re: [PATCH iproute2] libnetlink: fix leak and using unused memory on error

2018-09-16 Thread David Ahern
On 9/14/18 10:48 AM, Mahesh Bandewar (महेश बंडेवार) wrote:
> On Thu, Sep 13, 2018 at 12:33 PM, Stephen Hemminger
>  wrote:
>> If an error happens in multi-segment message (tc only)
>> then report the error and stop processing further responses.
>> This also fixes refering to the buffer after free.
>>
>> The sequence check is not necessary here because the
>> response message has already been validated to be in
>> the window of the sequence number of the iov.
>>
>> Reported-by: Mahesh Bandewar 
>> Fixes: 7b2ee50c0cd5 ("hv_netvsc: common detach logic")

$ git show 7b2ee50c0cd5
fatal: ambiguous argument '7b2ee50c0cd5': unknown revision or path not
in the working tree.

but we knew that since iproute2 does not have hv_netsvc code. Copy and
paste error?


Re: pull-request: bpf 2018-09-16

2018-09-16 Thread David Miller
From: Daniel Borkmann 
Date: Sun, 16 Sep 2018 02:36:45 +0200

> The following pull-request contains BPF updates for your *net* tree.
> 
> The main changes are:
> 
> 1) Fix end boundary calculation in BTF for the type section, from Martin.
> 
> 2) Fix and revert subtraction of pointers that was accidentally allowed
>for unprivileged programs, from Alexei.
> 
> 3) Fix bpf_msg_pull_data() helper by using __GFP_COMP in order to avoid
>a warning in linearizing sg pages into a single one for large allocs,
>from Tushar.
> 
> Please consider pulling these changes from:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

Pulled, thanks Daniel.


[PATCHv2 net-next 1/1] net: rds: use memset to optimize the recv

2018-09-16 Thread Zhu Yanjun
The function rds_inc_init is in recv process. To use memset can optimize
the function rds_inc_init.
The test result:

 Before:
 1) + 24.950 us   |rds_inc_init [rds]();
 After:
 1) + 10.990 us   |rds_inc_init [rds]();

Acked-by: Santosh Shilimkar 
Signed-off-by: Zhu Yanjun 
---
V1->V2: a new patch for net-next
---
 net/rds/recv.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/net/rds/recv.c b/net/rds/recv.c
index 12719653188a..727639dac8a7 100644
--- a/net/rds/recv.c
+++ b/net/rds/recv.c
@@ -43,8 +43,6 @@
 void rds_inc_init(struct rds_incoming *inc, struct rds_connection *conn,
 struct in6_addr *saddr)
 {
-   int i;
-
refcount_set(&inc->i_refcount, 1);
INIT_LIST_HEAD(&inc->i_item);
inc->i_conn = conn;
@@ -52,8 +50,7 @@ void rds_inc_init(struct rds_incoming *inc, struct 
rds_connection *conn,
inc->i_rdma_cookie = 0;
inc->i_rx_tstamp = ktime_set(0, 0);
 
-   for (i = 0; i < RDS_RX_MAX_TRACES; i++)
-   inc->i_rx_lat_trace[i] = 0;
+   memset(inc->i_rx_lat_trace, 0, sizeof(inc->i_rx_lat_trace));
 }
 EXPORT_SYMBOL_GPL(rds_inc_init);
 
-- 
2.17.1



[net-next:master 13/125] drivers/net//ethernet/ni/nixge.c:254:9: warning: cast to pointer from integer of different size

2018-09-16 Thread kbuild test robot
tree:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git 
master
head:   c3ec8bcceb07ab81e4ff017b4ebbacc137a5a15e
commit: 7e8d5755be0e6c92d3b86a85e54c6a550b1910c5 [13/125] net: nixge: Add 
support for 64-bit platforms
config: i386-randconfig-i1-09170930 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
git checkout 7e8d5755be0e6c92d3b86a85e54c6a550b1910c5
# save the attached .config to linux build tree
make ARCH=i386 

All warnings (new ones prefixed by >>):

   drivers/net//ethernet/ni/nixge.c: In function 'nixge_hw_dma_bd_release':
>> drivers/net//ethernet/ni/nixge.c:254:9: warning: cast to pointer from 
>> integer of different size [-Wint-to-pointer-cast]
  skb = (struct sk_buff *)
^
   In file included from include/linux/skbuff.h:17:0,
from include/linux/if_ether.h:23,
from include/linux/etherdevice.h:25,
from drivers/net//ethernet/ni/nixge.c:7:
   drivers/net//ethernet/ni/nixge.c: In function 'nixge_hw_dma_bd_init':
>> drivers/net//ethernet/ni/nixge.c:130:37: warning: cast from pointer to 
>> integer of different size [-Wpointer-to-int-cast]
  (bd)->field##_lo = lower_32_bits(((u64)addr)); \
^
   include/linux/kernel.h:234:33: note: in definition of macro 'lower_32_bits'
#define lower_32_bits(n) ((u32)(n))
^
   drivers/net//ethernet/ni/nixge.c:145:2: note: in expansion of macro 
'nixge_hw_dma_bd_set_addr'
 nixge_hw_dma_bd_set_addr((bd), sw_id_offset, (addr))
 ^~~~
   drivers/net//ethernet/ni/nixge.c:326:3: note: in expansion of macro 
'nixge_hw_dma_bd_set_offset'
  nixge_hw_dma_bd_set_offset(&priv->rx_bd_v[i], skb);
  ^~
   drivers/net//ethernet/ni/nixge.c:131:37: warning: cast from pointer to 
integer of different size [-Wpointer-to-int-cast]
  (bd)->field##_hi = upper_32_bits(((u64)addr)); \
^
   include/linux/kernel.h:228:35: note: in definition of macro 'upper_32_bits'
#define upper_32_bits(n) ((u32)(((n) >> 16) >> 16))
  ^
   drivers/net//ethernet/ni/nixge.c:145:2: note: in expansion of macro 
'nixge_hw_dma_bd_set_addr'
 nixge_hw_dma_bd_set_addr((bd), sw_id_offset, (addr))
 ^~~~
   drivers/net//ethernet/ni/nixge.c:326:3: note: in expansion of macro 
'nixge_hw_dma_bd_set_offset'
  nixge_hw_dma_bd_set_offset(&priv->rx_bd_v[i], skb);
  ^~
   drivers/net//ethernet/ni/nixge.c: In function 'nixge_recv':
   drivers/net//ethernet/ni/nixge.c:604:9: warning: cast to pointer from 
integer of different size [-Wint-to-pointer-cast]
  skb = (struct sk_buff *)nixge_hw_dma_bd_get_addr(cur_p,
^
   In file included from include/linux/skbuff.h:17:0,
from include/linux/if_ether.h:23,
from include/linux/etherdevice.h:25,
from drivers/net//ethernet/ni/nixge.c:7:
>> drivers/net//ethernet/ni/nixge.c:130:37: warning: cast from pointer to 
>> integer of different size [-Wpointer-to-int-cast]
  (bd)->field##_lo = lower_32_bits(((u64)addr)); \
^
   include/linux/kernel.h:234:33: note: in definition of macro 'lower_32_bits'
#define lower_32_bits(n) ((u32)(n))
^
   drivers/net//ethernet/ni/nixge.c:145:2: note: in expansion of macro 
'nixge_hw_dma_bd_set_addr'
 nixge_hw_dma_bd_set_addr((bd), sw_id_offset, (addr))
 ^~~~
   drivers/net//ethernet/ni/nixge.c:646:3: note: in expansion of macro 
'nixge_hw_dma_bd_set_offset'
  nixge_hw_dma_bd_set_offset(cur_p, new_skb);
  ^~
   drivers/net//ethernet/ni/nixge.c:131:37: warning: cast from pointer to 
integer of different size [-Wpointer-to-int-cast]
  (bd)->field##_hi = upper_32_bits(((u64)addr)); \
^
   include/linux/kernel.h:228:35: note: in definition of macro 'upper_32_bits'
#define upper_32_bits(n) ((u32)(((n) >> 16) >> 16))
  ^
   drivers/net//ethernet/ni/nixge.c:145:2: note: in expansion of macro 
'nixge_hw_dma_bd_set_addr'
 nixge_hw_dma_bd_set_addr((bd), sw_id_offset, (addr))
 ^~~~
   drivers/net//ethernet/ni/nixge.c:646:3: note: in expansion of macro 
'nixge_hw_dma_bd_set_offset'
  nixge_hw_dma_bd_set_offset(cur_p, new_skb);
  ^~

vim +254 drivers/net//ethernet/ni/nixge.c

   126  
   127  #ifdef CONFIG_PHYS_ADDR_T_64BIT
   128  #define nixge_hw_dma_bd_set_addr(bd, field, addr) \
   129  do { \
 > 130  (bd)->field##_lo = lower_32_bits(((u64)addr)); \
   131  (bd)->field##_hi = upper_32_bits(((u64)addr)); \
   132  } while (0)
   133  #else
   134  #defi

Re: [PATCH 1/1] net: rds: use memset to optimize the recv

2018-09-16 Thread zhuyj
 OK. I will send V2 for net-next very soon.

Zhu Yanjun

On Mon, Sep 17, 2018 at 6:38 AM David Miller  wrote:
>
> From: Zhu Yanjun 
> Date: Fri, 14 Sep 2018 04:45:38 -0400
>
> > The function rds_inc_init is in recv process. To use memset can optimize
> > the function rds_inc_init.
> > The test result:
> >
> > Before:
> > 1) + 24.950 us   |rds_inc_init [rds]();
> > After:
> > 1) + 10.990 us   |rds_inc_init [rds]();
> >
> > Signed-off-by: Zhu Yanjun 
>
> This doesn't apply cleanly to net-next, please respin.


Re: [PATCH 1/1] net: rds: use memset to optimize the recv

2018-09-16 Thread David Miller
From: Zhu Yanjun 
Date: Fri, 14 Sep 2018 04:45:38 -0400

> The function rds_inc_init is in recv process. To use memset can optimize
> the function rds_inc_init.
> The test result:
> 
> Before:
> 1) + 24.950 us   |rds_inc_init [rds]();
> After:
> 1) + 10.990 us   |rds_inc_init [rds]();
> 
> Signed-off-by: Zhu Yanjun 

This doesn't apply cleanly to net-next, please respin.


Re: [PATCH net] veth: Orphan skb before GRO

2018-09-16 Thread David Miller
From: Toshiaki Makita 
Date: Fri, 14 Sep 2018 13:33:44 +0900

> GRO expects skbs not to be owned by sockets, but when XDP is enabled veth
> passed skbs owned by sockets. It caused corrupted sk_wmem_alloc.
> 
> Paolo Abeni reported the following splat:
 ...
> In order to avoid this, orphan the skb before entering GRO.
> 
> Fixes: 948d4f214fde ("veth: Add driver XDP")
> Reported-by: Paolo Abeni 
> Signed-off-by: Toshiaki Makita 

Applied, thanks.


Re: [PATCH net-next 0/2] net/sched: act_police: lockless data path

2018-09-16 Thread David Miller
From: Davide Caratti 
Date: Thu, 13 Sep 2018 19:29:11 +0200

> the data path of 'police' action can be faster if we avoid using spinlocks:
>  - patch 1 converts act_police to use per-cpu counters
>  - patch 2 lets act_police use RCU to access its configuration data.
> 
> test procedure (using pktgen from https://github.com/netoptimizer):
>  # ip link add name eth1 type dummy
>  # ip link set dev eth1 up
>  # tc qdisc add dev eth1 clsact
>  # tc filter add dev eth1 egress matchall action police \
>  > rate 2gbit burst 100k conform-exceed pass/pass index 100
>  # for c in 1 2 4; do
>  > ./pktgen_bench_xmit_mode_queue_xmit.sh -v -s 64 -t $c -n 500 -i eth1
>  > done
> 
> test results (avg. pps/thread):
> 
>   $c | before patch |  after patch | improvement
>  +--+--+-
>1 |  3518448 |  3591240 |  irrelevant
>2 |  3070065 |  3383393 | 10%
>4 |  1540969 |  3238385 |110%

Series applied.


Re: [PATCH net 0/2] udp: add missing check on edumx rx path

2018-09-16 Thread David Miller
From: Paolo Abeni 
Date: Thu, 13 Sep 2018 16:27:19 +0200

> The early demux RX path for the UDP protocol is currently missing
> some checks. Both ipv4 and ipv6 implementations lack checksum conversion
> and the ipv6 implementation additionally lack the zero checksum
> validation.
> 
> The first patch takes care of UDPv4 and the second one of UDPv6

Series applied and queued up for -stable.


Oddities with connmark

2018-09-16 Thread Алексей Болдырев
Actually, there is a suricata with the following rules:

#pass tls any any -> any any (pcre: "/play.google.com/i"; 
tls_sni;nfq_set_mark:0x8/0x; sid:2466;)
#pass tls any any -> any any (pcre: "/google.com/i"; 
tls_sni;nfq_set_mark:0x8/0x; sid:2465;)
#pass tls any any -> any any (pcre: "/gstatic.com/i"; 
tls_sni;nfq_set_mark:0x8/0x; sid:2467;)
#pass tls any any -> any any (pcre: "/googleservice.com/i"; 
tls_sni;nfq_set_mark:0x8/0x; sid:2467;)
pass tls any any -> any any (pcre: "/youtube.com/s"; 
tls_sni;nfq_set_mark:0x2/0x; sid:2455;)
pass tls any any -> any any (pcre: "/googlevideo.com/s"; 
tls_sni;nfq_set_mark:0x2/0x; sid:2456;)
pass http any any <> any any (content: "tactical-market.ru"; 
http_header;nfq_set_mark:0x4/0x; sid:2457;)
pass http any any <> any any (content: "voent.org"; 
http_header;nfq_set_mark:0x4/0x; sid:2458;)
pass http any any <> any any (content: "h-mag.ru"; 
http_header;nfq_set_mark:0x4/0x; sid:2459;)
pass tls any any <> any any (content: 
"voent.org";tls_sni;nfq_set_mark:0x4/0x; sid:2460;)
pass tls any any <> any any (content: 
"h-mag.ru";tls_sni;nfq_set_mark:0x4/0x; sid:2461;)
rejectboth tcp any any <> any any (content: "GET http://";content: "Host: "; 
sid:2462;)
pass http any any <> any any (content: "302";http_stat_code;content: 
"ivrn.net";http_header;nfq_set_mark:0x64/0x; sid:2463;)
pass ssh any any <> any any (nfq_set_mark:0x6/0x; sid:2464;)

#reject tls any any <> any any (content:"www.youtube.com"; 
tls_sni;nfq_set_mark:0x2/0x; sid:2456;)

#ytimg.com

iptables:

Chain PREROUTING (policy ACCEPT 228K packets, 138M bytes)
 pkts bytes target prot opt in out source   destination 

   11  3630 RETURN all  --  *  *   0.0.0.0  
255.255.255.255 
 127K  121M RETURN all  --  eth1   *   0.0.0.0/00.0.0.0/0   

  187 11489 RETURN all  --  ppp0   *   0.0.0.0/00.0.0.0/0   

10365 2323K RETURN all  --  vpns0.10 *   0.0.0.0/00.0.0.0/0 
  
0 0 LOGall  --  *  *   0.0.0.0/00.0.0.0/0   
 rpfilter invert LOG flags 0 level 4 prefix "IP SPOOFING: "
0 0 DROP   all  --  *  *   0.0.0.0/00.0.0.0/0   
 rpfilter invert
0 0 DROP   all  --  *  *   0.0.0.0/00.0.0.0/0   
 -m ipv4options --flags 7 
0 0 DROP   all  --  *  *   0.0.0.0/00.0.0.0/0   
 -m ipv4options --flags 3 
0 0 DROP   all  --  *  *   0.0.0.0/00.0.0.0/0   
 -m ipv4options --flags 9 
0 0 MARK   all  --  *  *   0.0.0.0/00.0.0.0/0   
 match-set dpi_detect dst MARK xset 0x40/0xfe
0 0 MARK   all  --  *  *   0.0.0.0/00.0.0.0/0   
 match-set dpi_detect src MARK xset 0x40/0xfe

Chain INPUT (policy ACCEPT 107K packets, 45M bytes)
 pkts bytes target prot opt in out source   destination 


Chain FORWARD (policy ACCEPT 120K packets, 93M bytes)
 pkts bytes target prot opt in out source   destination 

 241K  185M DPIall  --  *  *   0.0.0.0/00.0.0.0/0   

 120K   93M DPI_SH all  --  *  *   0.0.0.0/00.0.0.0/0   

 2063  123K TCPMSS tcp  --  *  *   0.0.0.0/00.0.0.0/0   
 tcp flags:0x06/0x02 TCPMSS clamp to PMTU

Chain OUTPUT (policy ACCEPT 109K packets, 24M bytes)
 pkts bytes target prot opt in out source   destination 


Chain POSTROUTING (policy ACCEPT 229K packets, 116M bytes)
 pkts bytes target prot opt in out source   destination 


Chain DPI (1 references)
 pkts bytes target prot opt in out source   destination 

0 0 RETURN all  --  *  *   198.18.0.0/15
192.168.0.0/15  
0 0 RETURN all  --  *  *   192.168.0.0/16   
198.18.0.0/15   
0 0 RETURN all  --  *  *   192.168.0.0/16   
192.168.0.0/16  
 121K   93M NFQUEUEall  --  *  *   0.0.0.0/00.0.0.0/0   
 mark match ! 0x1/0x1 NFQUEUE num 0

Chain DPI_SH (1 references)
 pkts bytes target prot opt in out source   destination 

 3542 2688K RETURN all  --  *  *   0.0.0.0/00.0.0.0/0   
 connmark match  0x8/0xfe
   53 45450 CONNMARK   all  --  *  *   0.0.0.0/00.0.0.0/0   
 mark match 0x8/0xfe CONNMARK xset 0x8/0xfe
0 0 CONNMARK   all  --  *  *   0.0.0.0/00.0.0.0/0   
 mark match 0x4/0xfe CONNMARK xset 0x4/0xfe
8  9366 CONNMARK   all  --  *  *   0.0.0.0/00.0.0.0/0   

Re: [RFC PATCH iproute2-next] System specification health API

2018-09-16 Thread Andrew Lunn
> Why is this going under iproute rather than using one of the existing sensor 
> API's.
> For example Intel NIC's have thermal sensors etc.

Hi Stephen

These are not that sort of sensors. This is part of the naming problem
here. It is not really to do with health, it is about exceptions and
bugs. And the sensors are more like timeouts and watchdogs.

It is clear that the current names lead to a lot of confusion. Maybe:

health -> exception
sensor -> condition

   Andrew


Re: [RFC PATCH iproute2-next] System specification health API

2018-09-16 Thread Stephen Hemminger
On Thu, 13 Sep 2018 10:36:04 -0700
Jakub Kicinski  wrote:

> On Thu, 13 Sep 2018 11:18:15 +0300, Eran Ben Elisha wrote:
> > The health spec is targeted for Real Time Alerting, in order to know when
> > something bad had happened to a PCI device  
> 
> By spec you mean some standards body spec you implement or this
> proposal is a spec?
> 
> > - Provide alert debug information
> > - Self healing
> > - If problem needs vendor support, provide a way to gather all needed 
> > debugging
> >   information.
> > 
> > The health contains sensors which sense for malfunction. Once sensor 
> > triggered,
> > actions such as logs and correction can be taken.
> > Sensors are sensing the health state and can trigger correction action.
> > 
> > The sensors are divided into the following groups
> > - Hardware sensor - a sensor which is triggered by the device due to
> >   malfunction.
> > - Software sensor - a sensor which is triggered by the software due to
> >   malfunction.
> > Both group of sensors can be triggered due to error event or due to a 
> > periodic check.
> > 
> > Actions are the way to handle sensor events. Action can be in one of the
> > following groups:
> > - Dump -  SW trace, SW dump, HW trace, HW dump
> > - Reset - Surgical correction (e.g. modify Q, flush Q, reset of device, etc)
> > Actions can be performed by SW or HW.
> > 
> > User is allowed to enable or disable sensors and sensor2action mapping.
> > 
> > This RFC man page patch describes the suggested API of devlink-health in 
> > order
> > to control sensors and actions.  
> 
> I like the idea of configuring response to events like this, although
> I'm not sure the name sensor is appropriate here - perhaps exception or
> error would be better?  Are there going to be values reported?
> 
> I'm not so sure about HW sensors in relation to existing HWMON
> infrastructure...  I assume you're targeting things like say some HW
> engine/block reporting it encountered an error?  Sounds good, too.
> 
> Are the actions all envisioned to be performed by the driver?
> Firmware?  Hardware?  I guess that distinction can be added later.
> For FW/HW actions we would go back to the problem of persistence of 
> the setting since it was only implemented for params :S
> 
> Is the dump option going to tie back into region snapshots?

Why is this going under iproute rather than using one of the existing sensor 
API's.
For example Intel NIC's have thermal sensors etc.



Re: [PATH RFC net-next 1/8] net: phy: Move linkmode helpers to somewhere public

2018-09-16 Thread Andrew Lunn
> Good idea, I wonder if we should create a more specific directory within
> include/linux/ that can host a variety of PHYLIB, PHYLINK and what not
> header files, but this could be solved later on.

I'm leaving it for later.

We would also need to figure out a name for this directory.  phy is
already used by the generic phy subsystem. So i guess we would have to
use something like ethernet-phy.

Andrew


Re: [RFC PATCH 2/4] net: enable UDP gro on demand.

2018-09-16 Thread Willem de Bruijn
On Fri, Sep 14, 2018 at 1:16 PM Willem de Bruijn
 wrote:
>
> On Fri, Sep 14, 2018 at 11:47 AM Paolo Abeni  wrote:
> >
> > Currently, the UDP GRO callback is always invoked, regardless of
> > the existence of any actual user (e.g. a UDP tunnel). With retpoline
> > enabled, this causes measurable overhead.
> >
> > This changeset introduces explicit accounting of the sockets requiring
> > UDP GRO and updates the UDP offloads at runtime accordingly, so that
> > the GRO callback is present (and invoked) only when there is at least
> > one socket requiring it.
>
> I have a difference solution both to the UDP socket lookup avoidance
> and configurable GRO in general.
>
> I've been sitting on it for too long. Let me slightly clean it up and
> send it out for discussion sake..

http://patchwork.ozlabs.org/project/netdev/list/?series=65763

That udp gro implementation is clearly less complete than yours in
this patchset. The point I wanted to bring up for discussion is not the
protocol implementation, but the infrastructure for enabling it
conditionally.

Assuming cycle cost is comparable, what do you think of  using the
existing sk offload callbacks to enable this on a per-socket basis?

As for the protocol-wide knob, I do strongly prefer something that can
work for all protocols, not just UDP. I also implemented a version that
atomically swaps the struct ptr instead of the flag based approach I sent
for review. I'm fairly agnostic about that point. One subtle issue is that I
believe we need to keep the gro_complete callbacks enabled, as gro
packets may be queued for completion when gro_receive gets disabled.


Re: [PATCH net-next RFC 5/8] net: deconstify net_offload

2018-09-16 Thread Willem de Bruijn
On Fri, Sep 14, 2018 at 11:30 PM Subash Abhinov Kasiviswanathan
 wrote:
>
> On 2018-09-14 11:59, Willem de Bruijn wrote:
> > From: Willem de Bruijn 
> >
> > With configurable gro, the flags field in net_offloads may be changed.
> >
> > Remove the const keyword. This is a noop otherwise.
> >
> > Signed-off-by: Willem de Bruijn 
> > diff --git a/net/sctp/offload.c b/net/sctp/offload.c
> > index 123e9f2dc226..ad504b83245d 100644
> > --- a/net/sctp/offload.c
> > +++ b/net/sctp/offload.c
> > @@ -90,7 +90,7 @@ static struct sk_buff *sctp_gso_segment(struct
> > sk_buff
> > *skb,
> >   return segs;
> >  }
> >
> > -static const struct net_offload sctp_offload = {
> > +static struct net_offload sctp_offload = {
> >   .callbacks = {
> >   .gso_segment = sctp_gso_segment,
> >   },
>
> Hi Willem
>
> sctp6 also needs to be deconstified.
>
> diff --git a/net/sctp/offload.c b/net/sctp/offload.c
> index ad504b8..4be7794 100644
> --- a/net/sctp/offload.c
> +++ b/net/sctp/offload.c
> @@ -96,7 +96,7 @@ static struct sk_buff *sctp_gso_segment(struct sk_buff
> *skb,
>  },
>   };
>
> -static const struct net_offload sctp6_offload = {
> +static struct net_offload sctp6_offload = {
>  .callbacks = {
>  .gso_segment = sctp_gso_segment,
>  },

Thanks! I'll update that.


Re: [PATCH net-next RFC 7/8] udp: gro behind static key

2018-09-16 Thread Willem de Bruijn
On Fri, Sep 14, 2018 at 11:37 PM Subash Abhinov Kasiviswanathan
 wrote:
>
> On 2018-09-14 11:59, Willem de Bruijn wrote:
> > From: Willem de Bruijn 
> >
> > Avoid the socket lookup cost in udp_gro_receive if no socket has a
> > gro callback configured.
> >
> > Signed-off-by: Willem de Bruijn 
> > diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
> > index 4f6aa95a9b12..f44fe328aa0f 100644
> > --- a/net/ipv4/udp_offload.c
> > +++ b/net/ipv4/udp_offload.c
> > @@ -405,7 +405,7 @@ static struct sk_buff *udp4_gro_receive(struct
> > list_head *head,
> >  {
> >   struct udphdr *uh = udp_gro_udphdr(skb);
> >
> > - if (unlikely(!uh))
> > + if (unlikely(!uh) ||
> > !static_branch_unlikely(&udp_encap_needed_key))
> >   goto flush;
> >
>
> Hi Willem
>
> Does this need to be
>
> diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
> index 6dd3f0a..fcd5589 100644
> --- a/net/ipv4/udp_offload.c
> +++ b/net/ipv4/udp_offload.c
> @@ -407,7 +407,7 @@ static struct sk_buff *udp4_gro_receive(struct
> list_head *head,
>   {
>  struct udphdr *uh = udp_gro_udphdr(skb);
>
> -   if (unlikely(!uh) ||
> !static_branch_unlikely(&udp_encap_needed_key))
> +   if (unlikely(!uh) ||
> static_branch_unlikely(&udp_encap_needed_key))
>  goto flush;
>
>  /* Don't bother verifying checksum if we're going to flush
> anyway. */
>
> I tried setting UDP_GRO socket option and I had to make this change to
> exercise the udp_gro_receive_cb code path.

Thanks for trying it out, Subash.

The static_branch logic is correct as is. It skips the gro logic if
the static key is not enabled.

But there is indeed a bug. the UDP_GRO setsockopt has to enable
the static key. A full patch is more complete, but this should fix it for
now:

--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2506,6 +2506,7 @@ int udp_lib_setsockopt(struct sock *sk, int
level, int optname,
if (!udp_sk(sk)->gro_receive) {
udp_sk(sk)->gro_complete = udp_gro_complete_cb;
udp_sk(sk)->gro_receive = udp_gro_receive_cb;
+   udp_encap_enable();

sorry about that. I had tested the two patches independently, but
apparently not on top of each other.

Independent of udp gro, I tested the static key by running udp_rr
with perf record -a -g and looking for __udp4_lib_lookup. With the
patch, all calls are from __udp4_lib_rcv else udp_gro_receive also
shows up. I had to stress the porttable by opening ~32K ports.

Also, the simple patch to udpgso_bench_rx.c that I used to test
gro:

diff --git a/tools/testing/selftests/net/udpgso_bench_rx.c
b/tools/testing/selftests/net/udpgso_bench_rx.c
index 727cf67a3f75..35ba8567fc56 100644
--- a/tools/testing/selftests/net/udpgso_bench_rx.c
+++ b/tools/testing/selftests/net/udpgso_bench_rx.c
@@ -31,6 +31,11 @@
 #include 
 #include 

+#ifndef UDP_GRO
+#define UDP_GRO104
+#endif
+
+static bool cfg_do_gro;
 static int  cfg_port   = 8000;
 static bool cfg_tcp;
 static bool cfg_verify;
@@ -89,6 +94,11 @@ static int do_socket(bool do_tcp)
if (setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &val, sizeof(val)))
error(1, errno, "setsockopt reuseport");

+   if (cfg_do_gro) {
+   if (setsockopt(fd, SOL_UDP, UDP_GRO, &val, sizeof(val)))
+   error(1, errno, "setsockopt gro");
+   }
+
addr.sin6_family =  PF_INET6;
addr.sin6_port =htons(cfg_port);
addr.sin6_addr =in6addr_any;
@@ -199,8 +209,11 @@ static void parse_opts(int argc, char **argv)
 {
int c;

-   while ((c = getopt(argc, argv, "ptv")) != -1) {
+   while ((c = getopt(argc, argv, "gptv")) != -1) {
switch (c) {
+   case 'g':
+   cfg_do_gro = true;
+   break;
case 'p':
cfg_port = htons(strtoul(optarg, NULL, 0));
break;


Re: [PATH RFC net-next 7/8] net: phy: Replace phy driver features u32 with link_mode bitmap

2018-09-16 Thread Andrew Lunn
> By that you mean having to determine whether you overflow the
> capacity of an unsigned long storage type and having to put the bits
> in either unsigned long [0] or [1]? Being able to eliminate the
> duplication would also be nice, but I cannot think about a smart
> solution at compile time that would avoid doing that.

Hi Florian

I've given up on doing it at compile time. I'm working on a run-time
solution at the moment, which i think looks better. I will probably
split the patchset. Post for merging all but the last two patches, and
then an RFC for replacing this patch.

Andrew


[PATCH iproute2-next] rdma: Fix representation of PortInfo CapabilityMask

2018-09-16 Thread Leon Romanovsky
From: Leon Romanovsky 

The port capability mask represents IBTA PortInfo specification,
but as it is written in description of kernel commit 2f944c0fbf58
("RDMA: Fix storage of PortInfo CapabilityMask in the kernel"),
the bit 26 was mistakenly overwritten.

The rdmatool followed it too and mislead users by presenting wrong
value. Since it never showed proper value, we update the whole
port_cap_mask to comply with IBTA and show real HW values.

Fixes: da990ab40a92 ("rdma: Add link object")
Signed-off-by: Leon Romanovsky 
---
 rdma/link.c | 14 ++
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/rdma/link.c b/rdma/link.c
index 7e914c87..7a6d4b7e 100644
--- a/rdma/link.c
+++ b/rdma/link.c
@@ -20,6 +20,7 @@ static int link_help(struct rd *rd)
 static const char *caps_to_str(uint32_t idx)
 {
 #define RDMA_PORT_FLAGS(x) \
+   x(RESERVED, 0) \
x(SM, 1) \
x(NOTICE, 2) \
x(TRAP, 3) \
@@ -32,7 +33,9 @@ static const char *caps_to_str(uint32_t idx)
x(SM_DISABLED, 10) \
x(SYS_IMAGE_GUID, 11) \
x(PKEY_SW_EXT_PORT_TRAP, 12) \
+   x(CABLE_INFO, 13) \
x(EXTENDED_SPEEDS, 14) \
+   x(CAP_MASK2, 15) \
x(CM, 16) \
x(SNMP_TUNNEL, 17) \
x(REINIT, 18) \
@@ -43,7 +46,12 @@ static const char *caps_to_str(uint32_t idx)
x(BOOT_MGMT, 23) \
x(LINK_LATENCY, 24) \
x(CLIENT_REG, 25) \
-   x(IP_BASED_GIDS, 26)
+   x(OTHER_LOCAL_CHANGES, 26) \
+   x(LINK_SPPED_WIDTH, 27) \
+   x(VENDOR_SPECIFIC_MADS, 28) \
+   x(MULT_PKER_TRAP, 29) \
+   x(MULT_FDB, 30) \
+   x(HIERARCHY_INFO, 31)
 
enum { RDMA_PORT_FLAGS(RDMA_BITMAP_ENUM) };
 
@@ -51,9 +59,7 @@ static const char *caps_to_str(uint32_t idx)
rdma_port_names[] = { RDMA_PORT_FLAGS(RDMA_BITMAP_NAMES) };
#undef RDMA_PORT_FLAGS
 
-   if (idx < ARRAY_SIZE(rdma_port_names) && rdma_port_names[idx])
-   return rdma_port_names[idx];
-   return "UNKNOWN";
+   return rdma_port_names[idx];
 }
 
 static void link_print_caps(struct rd *rd, struct nlattr **tb)
-- 
2.14.4



Re: [PATH RFC net-next 7/8] net: phy: Replace phy driver features u32 with link_mode bitmap

2018-09-16 Thread Florian Fainelli
Hi Andrew,

On September 15, 2018 3:30:53 PM PDT, Andrew Lunn  wrote:
>On Sat, Sep 15, 2018 at 02:31:14PM -0700, Florian Fainelli wrote:
>> 
>> 
>> On 09/14/18 14:38, Andrew Lunn wrote:
>> > This is one step in allowing phylib to make use of link_mode
>bitmaps,
>> > instead of u32 for supported and advertised features. Convert the
>phy
>> > drivers to use bitmaps to indicates the features they support. This
>> > requires some macro magic in order to construct constant bitmaps
>used
>> > to initialise the driver structures.
>> > 
>> > Some new PHY_*_FEATURES are added, to indicate FIBRE is supported,
>and
>> > that all media ports are supported. This is done since bitmaps
>cannot
>> > be ORed together at compile time.
>> > 
>> > Within phylib, the features bitmap is currently turned back into a
>> > u32.  The MAC API to phylib needs to be cleaned up before the core
>of
>> > phylib can be converted to using bitmaps instead of u32.
>> 
>> Nice!
>
>Hi Florian
>
>This is the patch i don't like. I'm hoping somebody can think of a
>better way to initialise a bitmap.

By that you mean having to determine whether you overflow the capacity of an 
unsigned long storage type and having to put the bits in either unsigned long 
[0] or [1]? Being able to eliminate the duplication would also be nice, but I 
cannot think about a smart solution at compile time that would avoid doing that.
-- 
Florian


Re: [RFC PATCH iproute2-next] System specification health API

2018-09-16 Thread Eran Ben Elisha




On 9/13/2018 8:36 PM, Jakub Kicinski wrote:

On Thu, 13 Sep 2018 11:18:15 +0300, Eran Ben Elisha wrote:

The health spec is targeted for Real Time Alerting, in order to know when
something bad had happened to a PCI device


By spec you mean some standards body spec you implement or this
proposal is a spec?


This proposal is a spec




- Provide alert debug information
- Self healing
- If problem needs vendor support, provide a way to gather all needed debugging
   information.

The health contains sensors which sense for malfunction. Once sensor triggered,
actions such as logs and correction can be taken.
Sensors are sensing the health state and can trigger correction action.

The sensors are divided into the following groups
- Hardware sensor - a sensor which is triggered by the device due to
   malfunction.
- Software sensor - a sensor which is triggered by the software due to
   malfunction.
Both group of sensors can be triggered due to error event or due to a periodic 
check.

Actions are the way to handle sensor events. Action can be in one of the
following groups:
- Dump -  SW trace, SW dump, HW trace, HW dump
- Reset - Surgical correction (e.g. modify Q, flush Q, reset of device, etc)
Actions can be performed by SW or HW.

User is allowed to enable or disable sensors and sensor2action mapping.

This RFC man page patch describes the suggested API of devlink-health in order
to control sensors and actions.


I like the idea of configuring response to events like this, although
I'm not sure the name sensor is appropriate here - perhaps exception or
error would be better?


I was trying to avoid the negativity description. Have it called sensor 
to avoid restricting the API for errors / exceptions only. I got the 
same type of comment from Andrew as well devlink-health->devlink-bug.


But if other vendors driver developers don't see it can be expanded to 
sensor which are not errors, then I guess we can refactor the names.


Are there going to be values reported?

It depends on the sensor. If it has data that would help in the debug, 
then I assume yes, via the dumps.




I'm not so sure about HW sensors in relation to existing HWMON
infrastructure...  I assume you're targeting things like say some HW
engine/block reporting it encountered an error?  Sounds good, too.


yes, exactly.



Are the actions all envisioned to be performed by the driver?
Firmware?  Hardware?  I guess that distinction can be added later.
For FW/HW actions we would go back to the problem of persistence of
the setting since it was only implemented for params :S


The problem is not with FW action, the problem is when you try to set 
sensor2action mapping for the FW/HW. this will need persistence 
configuration mode. Sensor2action in SW shall be run-time mode (at least 
as a start).

But it sound as this need some more tuning, to make it clear.



Is the dump option going to tie back into region snapshots?


no necessarily, dumping SW objects as well can be helpful


Re: [RFC PATCH iproute2-next] man: Add devlink health man page

2018-09-16 Thread Eran Ben Elisha




On 9/13/2018 6:12 PM, Andrew Lunn wrote:

devlink health sensor set pci/:01:00.0 name TX_COMP_ERROR action 
reset off action dump on
Sets TX_COMP_ERROR sensor parameters for a specific device.



This is what I had in mind:
1. command interface error
2. command interface timeout
3. stuck TX queue (like tx_timeout)
4. stuck TX completion queue (driver did not process packets in a reasonable
time period)
5. stuck RX queue
6. RX completion error
7. TX completion error
8. HW / FW catastrophic error report
9. completion queue overrun



Such issues do exist in production environment, and need to be handled even
if root cause is a bug which will be fixed in latest release. My feature
should help developers / administrator to control and recover their live
systems, by auto correction and logging support.
Goal is:
- Provide alert debug information
- Self healing
- If problem needs vendor support, provide a way to gather all needed
debugging information.


So maybe you have the wrong name for this. Health is nice in terms of
Marketing, but we are actually talking about bug recovery.


The way I see it, this feature is responsible for the health of the 
system from the pci/ perspective.
I though about devlink-recover for example, but I really wouldn't like 
to limit the feature to be called after one of its actions. The same for 
devlink-bug, which highlights only part of the range of capabilities 
(sensor).


My work is currently focused on error reporting and recovery, but I 
wouldn't like to see the API limited for "bugs" only.


Eran



devlink bug sensor set pci/:01:00.0 name command_interface_error action 
reset off action dump on
devlink bug sensor set pci/:01:00.0 name command_interface_timeout action 
reset off action dump on
devlink bug sensor set pci/:01:00.0 name transmit_completion_error action 
reset off action dump on
devlink bug sensor set pci/:01:00.0 name completion_queue_overrun action 
reset off action dump on

seems a lot more understandable than:

devlink health set pci/:01:00.0 name TX_COMP_ERROR action reset off action 
dump on

Andrew