[PATCH net] sctp: fix pr_warn max_data argument type mismatch

2018-12-05 Thread Jakub Audykowicz
My previous patch introduced a compilation warning regarding a type
mismatch (int vs size_t). This is a one-letter fix for good housekeeping.

Signed-off-by: Jakub Audykowicz 
---
 net/sctp/chunk.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sctp/chunk.c b/net/sctp/chunk.c
index d5b91bc8a377..ee5638358ad5 100644
--- a/net/sctp/chunk.c
+++ b/net/sctp/chunk.c
@@ -194,7 +194,7 @@ struct sctp_datamsg *sctp_datamsg_from_user(struct 
sctp_association *asoc,
if (unlikely(!max_data)) {
max_data = sctp_min_frag_point(sctp_sk(asoc->base.sk),
   sctp_datachk_len(>stream));
-   pr_warn_ratelimited("%s: asoc:%p frag_point is zero, forcing 
max_data to default minimum (%d)",
+   pr_warn_ratelimited("%s: asoc:%p frag_point is zero, forcing 
max_data to default minimum (%ld)",
__func__, asoc, max_data);
}
 
-- 
2.17.1



Re: [PATCH net-next RFC 1/7] devlink: Add devlink_param for port register and unregister

2018-12-05 Thread Vasundhara Volam
On Thu, Dec 6, 2018 at 12:43 PM Jiri Pirko  wrote:
>
> Thu, Dec 06, 2018 at 07:02:59AM CET, vasundhara-v.vo...@broadcom.com wrote:
> >Thank you reviewing the patches.
> >
> >On Wed, Dec 5, 2018 at 5:24 PM Jiri Pirko  wrote:
> >>
> >> Wed, Dec 05, 2018 at 06:56:54AM CET, vasundhara-v.vo...@broadcom.com wrote:
> >> >Add functions to register and unregister for the driver supported
> >> >configuration parameters table per port.
> >> >
> >> >Cc: Jiri Pirko 
> >> >Signed-off-by: Vasundhara Volam 
> >> >---
> >> > include/net/devlink.h |  29 +++
> >> > net/core/devlink.c| 130 
> >> > ++
> >> > 2 files changed, 151 insertions(+), 8 deletions(-)
> >> >
> >> >diff --git a/include/net/devlink.h b/include/net/devlink.h
> >> >index 67f4293..9b4d80b 100644
> >> >--- a/include/net/devlink.h
> >> >+++ b/include/net/devlink.h
> >> >@@ -48,6 +48,7 @@ struct devlink_port_attrs {
> >> >
> >> > struct devlink_port {
> >> >   struct list_head list;
> >> >+  struct list_head param_list;
> >> >   struct devlink *devlink;
> >> >   unsigned index;
> >> >   bool registered;
> >> >@@ -419,6 +420,13 @@ enum devlink_param_generic_id {
> >> >   .validate = _validate,  \
> >> > }
> >> >
> >> >+enum devlink_port_param_generic_id {
> >> >+  /* add new param generic ids above here */
> >> >+  __DEVLINK_PORT_PARAM_GENERIC_ID_MAX,
> >> >+  DEVLINK_PORT_PARAM_GENERIC_ID_MAX =
> >> >+ __DEVLINK_PORT_PARAM_GENERIC_ID_MAX - 1,
> >> >+};
> >>
> >> I don't see the need for enum just for per-port params. The existing
> >> params enum should be enough.
> >In that case there won't be any differentiation between generic device
> >and port params.
>
> Yes, you don't need that.
Okay, then let me modify and refactor the patchset. Thanks.
>
> >Also, if enum is not needed, then separate struct
> >devlink_port_param_generic is also not needed.
>
> Yep.
>
> >
> >Based on this, entire patchset needs a change.
>
> [...]


Re: [PATCH net-next RFC 1/7] devlink: Add devlink_param for port register and unregister

2018-12-05 Thread Jiri Pirko
Thu, Dec 06, 2018 at 07:02:59AM CET, vasundhara-v.vo...@broadcom.com wrote:
>Thank you reviewing the patches.
>
>On Wed, Dec 5, 2018 at 5:24 PM Jiri Pirko  wrote:
>>
>> Wed, Dec 05, 2018 at 06:56:54AM CET, vasundhara-v.vo...@broadcom.com wrote:
>> >Add functions to register and unregister for the driver supported
>> >configuration parameters table per port.
>> >
>> >Cc: Jiri Pirko 
>> >Signed-off-by: Vasundhara Volam 
>> >---
>> > include/net/devlink.h |  29 +++
>> > net/core/devlink.c| 130 
>> > ++
>> > 2 files changed, 151 insertions(+), 8 deletions(-)
>> >
>> >diff --git a/include/net/devlink.h b/include/net/devlink.h
>> >index 67f4293..9b4d80b 100644
>> >--- a/include/net/devlink.h
>> >+++ b/include/net/devlink.h
>> >@@ -48,6 +48,7 @@ struct devlink_port_attrs {
>> >
>> > struct devlink_port {
>> >   struct list_head list;
>> >+  struct list_head param_list;
>> >   struct devlink *devlink;
>> >   unsigned index;
>> >   bool registered;
>> >@@ -419,6 +420,13 @@ enum devlink_param_generic_id {
>> >   .validate = _validate,  \
>> > }
>> >
>> >+enum devlink_port_param_generic_id {
>> >+  /* add new param generic ids above here */
>> >+  __DEVLINK_PORT_PARAM_GENERIC_ID_MAX,
>> >+  DEVLINK_PORT_PARAM_GENERIC_ID_MAX =
>> >+ __DEVLINK_PORT_PARAM_GENERIC_ID_MAX - 1,
>> >+};
>>
>> I don't see the need for enum just for per-port params. The existing
>> params enum should be enough.
>In that case there won't be any differentiation between generic device
>and port params.

Yes, you don't need that.

>Also, if enum is not needed, then separate struct
>devlink_port_param_generic is also not needed.

Yep.

>
>Based on this, entire patchset needs a change.

[...]


Re: [PATCH net-next RFC 7/7] bnxt_en: Add bnxt_en initial port params table and register it

2018-12-05 Thread Jakub Kicinski
On Wed, 5 Dec 2018 22:41:43 -0800, Michael Chan wrote:
> > > We do have a parameter in NVRAM that controls default WoL.  I think
> > > this is to expose that parameter so it can be set one way or the
> > > other. There are scenarios where Linux has not booted yet (and so
> > > there is no opportunity to run ethtool -s or any daemons yet) and this
> > > parameter will control whether the machine will wake up or not.  
> >
> > Isn't that set in BIOS/setup?  The config before any OS boots?  Because
> > the BMC or whatnot has to actually configure the board to power
> > appropriate things up.  Please clarify.  
> 
> It will be in the BIOS only for a LOM, I think.  For a NIC, it should
> be in the NIC's NVRAM.

This is all vague.  Could you please clearly state the use case.

> > And *if* it is proven this config is more than just setting the default
> > IMHO the setting belongs in the ethtool API.  We can't just add devlink
> > params for all existing config APIs just because it has persistence.  
> 
> I'm not sure I understand your point.  I believe the NIC firmware will
> set up the NIC's WoL setting right after power up based on this NVRAM
> parameter.  Similar to how the firmware will setup PCIe Gen2 or Gen3
> right after power up, for example.  

We have no PCIe config interface therefore the crutch of devlink params
was allowed there.  We *do* have an existing interface to configure WoL.

> So why would this belong to ethtool?  I understand the confusion that
> ethtool -s has a similar WoL setting.  But again, that's different.

Perhaps you're looking at this from firmware perspective?  FW NVM knob
== devlink param?

> This one is the power up setting that impacts whether a magic packet
> can or cannot wake up the system right after power up (before booting
> up to Linux or other OS).


Re: [PATCH net] sctp: frag_point sanity check

2018-12-05 Thread kbuild test robot
Hi Jakub,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on net/master]

url:
https://github.com/0day-ci/linux/commits/Jakub-Audykowicz/sctp-frag_point-sanity-check/20181206-011917
config: x86_64-randconfig-x015-12051035 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64 

All warnings (new ones prefixed by >>):

   In file included from include/linux/kernel.h:14:0,
from net/sctp/chunk.c:36:
   net/sctp/chunk.c: In function 'sctp_datamsg_from_user':
   include/linux/kern_levels.h:5:18: warning: format '%d' expects argument of 
type 'int', but argument 4 has type 'size_t {aka long unsigned int}' [-Wformat=]
#define KERN_SOH "\001"  /* ASCII Start Of Header */
 ^
   include/linux/printk.h:429:10: note: in definition of macro 
'printk_ratelimited'
  printk(fmt, ##__VA_ARGS__);\
 ^~~
   include/linux/kern_levels.h:12:22: note: in expansion of macro 'KERN_SOH'
#define KERN_WARNING KERN_SOH "4" /* warning conditions */
 ^~~~
   include/linux/printk.h:445:21: note: in expansion of macro 'KERN_WARNING'
 printk_ratelimited(KERN_WARNING pr_fmt(fmt), ##__VA_ARGS__)
^~~~
>> net/sctp/chunk.c:197:3: note: in expansion of macro 'pr_warn_ratelimited'
  pr_warn_ratelimited("%s: asoc:%p frag_point is zero, forcing max_data to 
default minimum (%d)",
  ^~~

vim +/pr_warn_ratelimited +197 net/sctp/chunk.c

   156  
   157  
   158  /* A data chunk can have a maximum payload of (2^16 - 20).  Break
   159   * down any such message into smaller chunks.  Opportunistically, 
fragment
   160   * the chunks down to the current MTU constraints.  We may get 
refragmented
   161   * later if the PMTU changes, but it is _much better_ to fragment 
immediately
   162   * with a reasonable guess than always doing our fragmentation on the
   163   * soft-interrupt.
   164   */
   165  struct sctp_datamsg *sctp_datamsg_from_user(struct sctp_association 
*asoc,
   166  struct sctp_sndrcvinfo 
*sinfo,
   167  struct iov_iter *from)
   168  {
   169  size_t len, first_len, max_data, remaining;
   170  size_t msg_len = iov_iter_count(from);
   171  struct sctp_shared_key *shkey = NULL;
   172  struct list_head *pos, *temp;
   173  struct sctp_chunk *chunk;
   174  struct sctp_datamsg *msg;
   175  int err;
   176  
   177  msg = sctp_datamsg_new(GFP_KERNEL);
   178  if (!msg)
   179  return ERR_PTR(-ENOMEM);
   180  
   181  /* Note: Calculate this outside of the loop, so that all 
fragments
   182   * have the same expiration.
   183   */
   184  if (asoc->peer.prsctp_capable && sinfo->sinfo_timetolive &&
   185  (SCTP_PR_TTL_ENABLED(sinfo->sinfo_flags) ||
   186   !SCTP_PR_POLICY(sinfo->sinfo_flags)))
   187  msg->expires_at = jiffies +
   188
msecs_to_jiffies(sinfo->sinfo_timetolive);
   189  
   190  /* This is the biggest possible DATA chunk that can fit into
   191   * the packet
   192   */
   193  max_data = asoc->frag_point;
   194  if (unlikely(!max_data)) {
   195  max_data = sctp_min_frag_point(sctp_sk(asoc->base.sk),
   196 
sctp_datachk_len(>stream));
 > 197  pr_warn_ratelimited("%s: asoc:%p frag_point is zero, 
 > forcing max_data to default minimum (%d)",
   198  __func__, asoc, max_data);
   199  }
   200  
   201  /* If the the peer requested that we authenticate DATA chunks
   202   * we need to account for bundling of the AUTH chunks along with
   203   * DATA.
   204   */
   205  if (sctp_auth_send_cid(SCTP_CID_DATA, asoc)) {
   206  struct sctp_hmac *hmac_desc = 
sctp_auth_asoc_get_hmac(asoc);
   207  
   208  if (hmac_desc)
   209  max_data -= SCTP_PAD4(sizeof(struct 
sctp_auth_chunk) +
   210hmac_desc->hmac_len);
   211  
   212  if (sinfo->sinfo_tsn &&
   213  sinfo->sinfo_ssn != asoc->active_key_id) {
   214  shkey = sctp_auth_get_shkey(asoc, 
sinfo->sinfo_ssn);
   215  if (!shkey) {
   216  err = -EINVAL;
   217  goto errout;
   218  }
   219  } else {
   220  shkey = asoc->shkey;
   221  }
   222  }
   223  
   

Re: [PATCH net-next RFC 7/7] bnxt_en: Add bnxt_en initial port params table and register it

2018-12-05 Thread Michael Chan
On Wed, Dec 5, 2018 at 10:00 PM Jakub Kicinski
 wrote:
>
> On Wed, 5 Dec 2018 17:18:52 -0800, Michael Chan wrote:
> > On Wed, Dec 5, 2018 at 4:42 PM Jakub Kicinski wrote:
> > > On Wed, 5 Dec 2018 16:01:08 -0800, Michael Chan wrote:
> > > > On Wed, Dec 5, 2018 at 3:33 PM Jakub Kicinski wrote:
> > > > > On Wed,  5 Dec 2018 11:27:00 +0530, Vasundhara Volam wrote:
> > > > > > Register devlink_port with devlink and create initial port params
> > > > > > table for bnxt_en. The table consists of a generic parameter:
> > > > > >
> > > > > > wake-on-lan: Enables Wake on Lan for this port when magic packet
> > > > > > is received with this port's MAC address using ACPI pattern.
> > > > > > If enabled, the controller asserts a wake pin upon reception of
> > > > > > WoL packet.  ACPI (Advanced Configuration and Power Interface) is
> > > > > > an industry specification for the efficient handling of power
> > > > > > consumption in desktop and mobile computers.
> > > > > >
> > > > > > Cc: Michael Chan 
> > > > > > Signed-off-by: Vasundhara Volam 
> > > > >
> > > > > Why do we need a WoL as a devlink parameter (rather than ethtool -s)?
> > > >
> > > > I believe ethtool -s for WoL is a non-persistent setting, meaning that
> > > > if you power cycle the system, the WoL setting will go back to
> > > > default.
> > > >
> > > > devlink on the other hand is a permanent setting.  ethtool should
> > > > initially report the default WoL setting and it can then be changed
> > > > (in a non permanent way) using ethtool -s.
> > >
> > > All network configuration settings in Linux are non-persistent AFAIK.
> > > That's why network configuration daemons exist:
> > >
> > > https://wiki.debian.org/WakeOnLan
> > >
> > > Perhaps the objective to move more of the network configuration into the
> > > firmware?  That'd be a bleak scenario, so probably not..
> > >
> > > My understanding was the persistent devlink settings are for things
> > > which have to be set at device init time.  Like say PCI endpoint
> > > configuration.  FW loading configuration.
> > >
> > > Besides, the parameter you add is just true/false, when ethtool has
> > > multiple options.
> > >
> > > It feels to me like we moved from ioctls to Netlink, and now even
> > > before ethtool was converted to Netlink we may move to unstructured
> > > strings.  That's not a step forward, if you ask me.
> >
> > We do have a parameter in NVRAM that controls default WoL.  I think
> > this is to expose that parameter so it can be set one way or the
> > other. There are scenarios where Linux has not booted yet (and so
> > there is no opportunity to run ethtool -s or any daemons yet) and this
> > parameter will control whether the machine will wake up or not.
>
> Isn't that set in BIOS/setup?  The config before any OS boots?  Because
> the BMC or whatnot has to actually configure the board to power
> appropriate things up.  Please clarify.

It will be in the BIOS only for a LOM, I think.  For a NIC, it should
be in the NIC's NVRAM.

>
> And *if* it is proven this config is more than just setting the default
> IMHO the setting belongs in the ethtool API.  We can't just add devlink
> params for all existing config APIs just because it has persistence.

I'm not sure I understand your point.  I believe the NIC firmware will
set up the NIC's WoL setting right after power up based on this NVRAM
parameter.  Similar to how the firmware will setup PCIe Gen2 or Gen3
right after power up, for example.  So why would this belong to
ethtool?  I understand the confusion that ethtool -s has a similar WoL
setting.  But again, that's different.  This one is the power up
setting that impacts whether a magic packet can or cannot wake up the
system right after power up (before booting up to Linux or other OS).


Re: [PATCH net-next RFC 1/7] devlink: Add devlink_param for port register and unregister

2018-12-05 Thread Vasundhara Volam
Thank you reviewing the patches.

On Wed, Dec 5, 2018 at 5:24 PM Jiri Pirko  wrote:
>
> Wed, Dec 05, 2018 at 06:56:54AM CET, vasundhara-v.vo...@broadcom.com wrote:
> >Add functions to register and unregister for the driver supported
> >configuration parameters table per port.
> >
> >Cc: Jiri Pirko 
> >Signed-off-by: Vasundhara Volam 
> >---
> > include/net/devlink.h |  29 +++
> > net/core/devlink.c| 130 
> > ++
> > 2 files changed, 151 insertions(+), 8 deletions(-)
> >
> >diff --git a/include/net/devlink.h b/include/net/devlink.h
> >index 67f4293..9b4d80b 100644
> >--- a/include/net/devlink.h
> >+++ b/include/net/devlink.h
> >@@ -48,6 +48,7 @@ struct devlink_port_attrs {
> >
> > struct devlink_port {
> >   struct list_head list;
> >+  struct list_head param_list;
> >   struct devlink *devlink;
> >   unsigned index;
> >   bool registered;
> >@@ -419,6 +420,13 @@ enum devlink_param_generic_id {
> >   .validate = _validate,  \
> > }
> >
> >+enum devlink_port_param_generic_id {
> >+  /* add new param generic ids above here */
> >+  __DEVLINK_PORT_PARAM_GENERIC_ID_MAX,
> >+  DEVLINK_PORT_PARAM_GENERIC_ID_MAX =
> >+ __DEVLINK_PORT_PARAM_GENERIC_ID_MAX - 1,
> >+};
>
> I don't see the need for enum just for per-port params. The existing
> params enum should be enough.
In that case there won't be any differentiation between generic device
and port params.
Also, if enum is not needed, then separate struct
devlink_port_param_generic is also not needed.

Based on this, entire patchset needs a change.
>
>
> >+
> > struct devlink_region;
> >
> > typedef void devlink_snapshot_data_dest_t(const void *data);
> >@@ -574,6 +582,12 @@ int devlink_param_driverinit_value_set(struct devlink 
> >*devlink, u32 param_id,
> > void devlink_param_value_changed(struct devlink *devlink, u32 param_id);
> > void devlink_param_value_str_fill(union devlink_param_value *dst_val,
> > const char *src);
> >+int devlink_port_params_register(struct devlink_port *devlink_port,
> >+   const struct devlink_param *params,
> >+   size_t params_count);
> >+void devlink_port_params_unregister(struct devlink_port *devlink_port,
> >+  const struct devlink_param *params,
> >+  size_t params_count);
> > struct devlink_region *devlink_region_create(struct devlink *devlink,
> >const char *region_name,
> >u32 region_max_snapshots,
> >@@ -816,6 +830,21 @@ static inline bool 
> >devlink_dpipe_table_counter_enabled(struct devlink *devlink,
> > {
> > }
> >
> >+static inline int
> >+devlink_port_params_register(struct devlink_port *devlink_port,
> >+   const struct devlink_param *params,
> >+   size_t params_count)
> >+{
> >+  return 0;
> >+}
> >+
> >+static inline void
> >+devlink_port_params_unregister(struct devlink_port *devlink_port,
> >+ const struct devlink_param *params,
> >+ size_t params_count)
> >+{
> >+}
> >+
> > static inline struct devlink_region *
> > devlink_region_create(struct devlink *devlink,
> > const char *region_name,
> >diff --git a/net/core/devlink.c b/net/core/devlink.c
> >index abb0da9..e194913 100644
> >--- a/net/core/devlink.c
> >+++ b/net/core/devlink.c
> >@@ -3147,12 +3147,12 @@ static int devlink_nl_cmd_param_set_doit(struct 
> >sk_buff *skb,
> > }
> >
> > static int devlink_param_register_one(struct devlink *devlink,
> >+struct list_head *param_list,
> > const struct devlink_param *param)
> > {
> >   struct devlink_param_item *param_item;
> >
> >-  if (devlink_param_find_by_name(>param_list,
> >- param->name))
> >+  if (devlink_param_find_by_name(param_list, param->name))
> >   return -EEXIST;
> >
> >   if (param->supported_cmodes == BIT(DEVLINK_PARAM_CMODE_DRIVERINIT))
> >@@ -3165,24 +3165,54 @@ static int devlink_param_register_one(struct devlink 
> >*devlink,
> >   return -ENOMEM;
> >   param_item->param = param;
> >
> >-  list_add_tail(_item->list, >param_list);
> >+  list_add_tail(_item->list, param_list);
> >   devlink_param_notify(devlink, param_item, DEVLINK_CMD_PARAM_NEW);
> >   return 0;
> > }
> >
> > static void devlink_param_unregister_one(struct devlink *devlink,
> >+   struct list_head *param_list,
> >const struct devlink_param *param)
> > {
> >   struct devlink_param_item *param_item;
> >
> >-  param_item = devlink_param_find_by_name(>param_list,
> >-  

Re: [PATCH net-next RFC 7/7] bnxt_en: Add bnxt_en initial port params table and register it

2018-12-05 Thread Jakub Kicinski
On Wed, 5 Dec 2018 17:18:52 -0800, Michael Chan wrote:
> On Wed, Dec 5, 2018 at 4:42 PM Jakub Kicinski wrote:
> > On Wed, 5 Dec 2018 16:01:08 -0800, Michael Chan wrote:  
> > > On Wed, Dec 5, 2018 at 3:33 PM Jakub Kicinski wrote:
> > > > On Wed,  5 Dec 2018 11:27:00 +0530, Vasundhara Volam wrote:  
> > > > > Register devlink_port with devlink and create initial port params
> > > > > table for bnxt_en. The table consists of a generic parameter:
> > > > >
> > > > > wake-on-lan: Enables Wake on Lan for this port when magic packet
> > > > > is received with this port's MAC address using ACPI pattern.
> > > > > If enabled, the controller asserts a wake pin upon reception of
> > > > > WoL packet.  ACPI (Advanced Configuration and Power Interface) is
> > > > > an industry specification for the efficient handling of power
> > > > > consumption in desktop and mobile computers.
> > > > >
> > > > > Cc: Michael Chan 
> > > > > Signed-off-by: Vasundhara Volam   
> > > >
> > > > Why do we need a WoL as a devlink parameter (rather than ethtool -s)?  
> > >
> > > I believe ethtool -s for WoL is a non-persistent setting, meaning that
> > > if you power cycle the system, the WoL setting will go back to
> > > default.
> > >
> > > devlink on the other hand is a permanent setting.  ethtool should
> > > initially report the default WoL setting and it can then be changed
> > > (in a non permanent way) using ethtool -s.  
> >
> > All network configuration settings in Linux are non-persistent AFAIK.
> > That's why network configuration daemons exist:
> >
> > https://wiki.debian.org/WakeOnLan
> >
> > Perhaps the objective to move more of the network configuration into the
> > firmware?  That'd be a bleak scenario, so probably not..
> >
> > My understanding was the persistent devlink settings are for things
> > which have to be set at device init time.  Like say PCI endpoint
> > configuration.  FW loading configuration.
> >
> > Besides, the parameter you add is just true/false, when ethtool has
> > multiple options.
> >
> > It feels to me like we moved from ioctls to Netlink, and now even
> > before ethtool was converted to Netlink we may move to unstructured
> > strings.  That's not a step forward, if you ask me.  
> 
> We do have a parameter in NVRAM that controls default WoL.  I think
> this is to expose that parameter so it can be set one way or the
> other. There are scenarios where Linux has not booted yet (and so
> there is no opportunity to run ethtool -s or any daemons yet) and this
> parameter will control whether the machine will wake up or not.

Isn't that set in BIOS/setup?  The config before any OS boots?  Because
the BMC or whatnot has to actually configure the board to power
appropriate things up.  Please clarify.

And *if* it is proven this config is more than just setting the default
IMHO the setting belongs in the ethtool API.  We can't just add devlink
params for all existing config APIs just because it has persistence.


Re: [PATCH net 1/3] flex_array: make FLEX_ARRAY_BASE_SIZE the same value of FLEX_ARRAY_PART_SIZE

2018-12-05 Thread Xin Long
On Thu, Dec 6, 2018 at 1:38 PM David Miller  wrote:
>
> From: Xin Long 
> Date: Wed,  5 Dec 2018 14:49:40 +0800
>
> > This patch is to separate the base data memory from struct flex_array and
> > save it into a page. With this change, total_nr_elements of a flex_array
> > can grow or shrink without having the old element's memory changed when
> > the new size of the flex_arry crosses FLEX_ARRAY_BASE_SIZE, which will
> > be added in the next patch.
> >
> > Suggested-by: Neil Horman 
> > Signed-off-by: Xin Long 
>
> This needs to be reviewed by the flex array hackers and lkml.
>
> It can't just get reviewed on netdev alone.
Will repost with CCing lkml and
the author:
  "Dave Hansen "
and two contributors:
  "David Rientjes ", "Eric Paris "

Thanks.


Re: [PATCH net-next] neighbor: Add extack messages for add and delete commands

2018-12-05 Thread David Miller
From: David Ahern 
Date: Wed,  5 Dec 2018 20:02:29 -0800

> From: David Ahern 
> 
> Add extack messages for failures in neigh_add and neigh_delete.
> 
> Signed-off-by: David Ahern 

Looks good, applied, thanks David.


Re: [PATCH v2 bpf-next 2/7] ppc: bpf: implement jitting of BPF_ALU | BPF_ARSH | BPF_*

2018-12-05 Thread Sandipan Das


On 06/12/18 12:22 AM, Jiong Wang wrote:
> This patch implements code-gen for BPF_ALU | BPF_ARSH | BPF_*.
> 
> Cc: Naveen N. Rao 
> Cc: Sandipan Das 
> Signed-off-by: Jiong Wang 
> ---
>  arch/powerpc/include/asm/ppc-opcode.h | 2 ++
>  arch/powerpc/net/bpf_jit.h| 4 
>  arch/powerpc/net/bpf_jit_comp64.c | 6 ++
>  3 files changed, 12 insertions(+)
> 
> [...]

Looks good to me. All tests are passing as well.

Reviewed-by: Sandipan Das 



Re: [PATCH net] ipv4: ipv6: netfilter: Adjust the frag mem limit when truesize changes

2018-12-05 Thread David Miller
From: Jiri Wiesner 
Date: Wed, 5 Dec 2018 16:55:29 +0100

> The *_frag_reasm() functions are susceptible to miscalculating the byte
> count of packet fragments in case the truesize of a head buffer changes.
> The truesize member may be changed by the call to skb_unclone(), leaving
> the fragment memory limit counter unbalanced even if all fragments are
> processed. This miscalculation goes unnoticed as long as the network
> namespace which holds the counter is not destroyed.
> 
> Should an attempt be made to destroy a network namespace that holds an
> unbalanced fragment memory limit counter the cleanup of the namespace
> never finishes. The thread handling the cleanup gets stuck in
> inet_frags_exit_net() waiting for the percpu counter to reach zero. The
> thread is usually in running state with a stacktrace similar to:
> 
>  PID: 1073   TASK: 880626711440  CPU: 1   COMMAND: "kworker/u48:4"
>   #5 [880621563d48] _raw_spin_lock at 815f5480
>   #6 [880621563d48] inet_evict_bucket at 8158020b
>   #7 [880621563d80] inet_frags_exit_net at 8158051c
>   #8 [880621563db0] ops_exit_list at 814f5856
>   #9 [880621563dd8] cleanup_net at 814f67c0
>  #10 [880621563e38] process_one_work at 81096f14
> 
> It is not possible to create new network namespaces, and processes
> that call unshare() end up being stuck in uninterruptible sleep state
> waiting to acquire the net_mutex.
> 
> The bug was observed in the IPv6 netfilter code by Per Sundstrom.
> I thank him for his analysis of the problem. The parts of this patch
> that apply to IPv4 and IPv6 fragment reassembly are preemptive measures.
> 
> Signed-off-by: Jiri Wiesner 
> Reported-by: Per Sundstrom 

Nice catch.

Applied and queued up for -stable, thanks!


[PATCH bpf 1/2] selftests/bpf: use thoff instead of nhoff in BPF flow dissector

2018-12-05 Thread Stanislav Fomichev
We are returning thoff from the flow dissector, not the nhoff. Pass
thoff along with nhoff to the bpf program (initially thoff == nhoff)
and expect flow dissector amend/return thoff, not nhoff.

This avoids confusion, when by the time bpf flow dissector exits,
nhoff == thoff, which doesn't make much sense.

Signed-off-by: Stanislav Fomichev 
---
 net/core/flow_dissector.c  |  1 +
 tools/testing/selftests/bpf/bpf_flow.c | 36 --
 2 files changed, 18 insertions(+), 19 deletions(-)

diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index 588f475019d4..ff5556d80570 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -783,6 +783,7 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
/* Pass parameters to the BPF program */
cb->qdisc_cb.flow_keys = _keys;
flow_keys.nhoff = nhoff;
+   flow_keys.thoff = nhoff;
 
bpf_compute_data_pointers((struct sk_buff *)skb);
result = BPF_PROG_RUN(attached, skb);
diff --git a/tools/testing/selftests/bpf/bpf_flow.c 
b/tools/testing/selftests/bpf/bpf_flow.c
index 107350a7821d..df9d32fd2055 100644
--- a/tools/testing/selftests/bpf/bpf_flow.c
+++ b/tools/testing/selftests/bpf/bpf_flow.c
@@ -70,18 +70,18 @@ static __always_inline void 
*bpf_flow_dissect_get_header(struct __sk_buff *skb,
 {
void *data_end = (void *)(long)skb->data_end;
void *data = (void *)(long)skb->data;
-   __u16 nhoff = skb->flow_keys->nhoff;
+   __u16 thoff = skb->flow_keys->thoff;
__u8 *hdr;
 
/* Verifies this variable offset does not overflow */
-   if (nhoff > (USHRT_MAX - hdr_size))
+   if (thoff > (USHRT_MAX - hdr_size))
return NULL;
 
-   hdr = data + nhoff;
+   hdr = data + thoff;
if (hdr + hdr_size <= data_end)
return hdr;
 
-   if (bpf_skb_load_bytes(skb, nhoff, buffer, hdr_size))
+   if (bpf_skb_load_bytes(skb, thoff, buffer, hdr_size))
return NULL;
 
return buffer;
@@ -158,13 +158,13 @@ static __always_inline int parse_ip_proto(struct 
__sk_buff *skb, __u8 proto)
/* Only inspect standard GRE packets with version 0 */
return BPF_OK;
 
-   keys->nhoff += sizeof(*gre); /* Step over GRE Flags and Proto */
+   keys->thoff += sizeof(*gre); /* Step over GRE Flags and Proto */
if (GRE_IS_CSUM(gre->flags))
-   keys->nhoff += 4; /* Step over chksum and Padding */
+   keys->thoff += 4; /* Step over chksum and Padding */
if (GRE_IS_KEY(gre->flags))
-   keys->nhoff += 4; /* Step over key */
+   keys->thoff += 4; /* Step over key */
if (GRE_IS_SEQ(gre->flags))
-   keys->nhoff += 4; /* Step over sequence number */
+   keys->thoff += 4; /* Step over sequence number */
 
keys->is_encap = true;
 
@@ -174,7 +174,7 @@ static __always_inline int parse_ip_proto(struct __sk_buff 
*skb, __u8 proto)
if (!eth)
return BPF_DROP;
 
-   keys->nhoff += sizeof(*eth);
+   keys->thoff += sizeof(*eth);
 
return parse_eth_proto(skb, eth->h_proto);
} else {
@@ -191,7 +191,6 @@ static __always_inline int parse_ip_proto(struct __sk_buff 
*skb, __u8 proto)
if ((__u8 *)tcp + (tcp->doff << 2) > data_end)
return BPF_DROP;
 
-   keys->thoff = keys->nhoff;
keys->sport = tcp->source;
keys->dport = tcp->dest;
return BPF_OK;
@@ -201,7 +200,6 @@ static __always_inline int parse_ip_proto(struct __sk_buff 
*skb, __u8 proto)
if (!udp)
return BPF_DROP;
 
-   keys->thoff = keys->nhoff;
keys->sport = udp->source;
keys->dport = udp->dest;
return BPF_OK;
@@ -252,8 +250,8 @@ PROG(IP)(struct __sk_buff *skb)
keys->ipv4_src = iph->saddr;
keys->ipv4_dst = iph->daddr;
 
-   keys->nhoff += iph->ihl << 2;
-   if (data + keys->nhoff > data_end)
+   keys->thoff += iph->ihl << 2;
+   if (data + keys->thoff > data_end)
return BPF_DROP;
 
if (iph->frag_off & bpf_htons(IP_MF | IP_OFFSET)) {
@@ -285,7 +283,7 @@ PROG(IPV6)(struct __sk_buff *skb)
keys->addr_proto = ETH_P_IPV6;
memcpy(>ipv6_src, >saddr, 2*sizeof(ip6h->saddr));
 
-   keys->nhoff += sizeof(struct ipv6hdr);
+   keys->thoff += sizeof(struct ipv6hdr);
 
return parse_ipv6_proto(skb, ip6h->nexthdr);
 }
@@ -301,7 +299,7 @@ PROG(IPV6OP)(struct __sk_buff *skb)
/* hlen is in 8-octets and does not include the first 8 bytes
 * of the header
 

[PATCH bpf 2/2] net/flow_dissector: correctly cap nhoff and thoff in case of BPF

2018-12-05 Thread Stanislav Fomichev
We want to make sure that the following condition holds:
0 <= nhoff <= thoff <= skb->len

BPF program can set out-of-bounds nhoff and thoff, which is dangerous, see
recent commit d0c081b49137 ("flow_dissector: properly cap thoff field")'.

Signed-off-by: Stanislav Fomichev 
---
 net/core/flow_dissector.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index ff5556d80570..af68207ee56c 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -791,9 +791,12 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
/* Restore state */
memcpy(cb, _saved, sizeof(cb_saved));
 
+   flow_keys.nhoff = clamp_t(u16, flow_keys.nhoff, 0, skb->len);
+   flow_keys.thoff = clamp_t(u16, flow_keys.thoff,
+ flow_keys.nhoff, skb->len);
+
__skb_flow_bpf_to_target(_keys, flow_dissector,
 target_container);
-   key_control->thoff = min_t(u16, key_control->thoff, skb->len);
rcu_read_unlock();
return result == BPF_OK;
}
-- 
2.20.0.rc1.387.gf8505762e3-goog



Re: [PATCH net 1/3] flex_array: make FLEX_ARRAY_BASE_SIZE the same value of FLEX_ARRAY_PART_SIZE

2018-12-05 Thread David Miller
From: Xin Long 
Date: Wed,  5 Dec 2018 14:49:40 +0800

> This patch is to separate the base data memory from struct flex_array and
> save it into a page. With this change, total_nr_elements of a flex_array
> can grow or shrink without having the old element's memory changed when
> the new size of the flex_arry crosses FLEX_ARRAY_BASE_SIZE, which will
> be added in the next patch.
> 
> Suggested-by: Neil Horman 
> Signed-off-by: Xin Long 

This needs to be reviewed by the flex array hackers and lkml.

It can't just get reviewed on netdev alone.


Re: [PATCH v2 net-next 1/1] net: netem: use a list in addition to rbtree

2018-12-05 Thread David Miller
From: Peter Oskolkov 
Date: Tue,  4 Dec 2018 11:55:56 -0800

> When testing high-bandwidth TCP streams with large windows,
> high latency, and low jitter, netem consumes a lot of CPU cycles
> doing rbtree rebalancing.
> 
> This patch uses a linear list/queue in addition to the rbtree:
> if an incoming packet is past the tail of the linear queue, it is
> added there, otherwise it is inserted into the rbtree.
> 
> Without this patch, perf shows netem_enqueue, netem_dequeue,
> and rb_* functions among the top offenders. With this patch,
> only netem_enqueue is noticeable if jitter is low/absent.
> 
> Suggested-by: Eric Dumazet 
> Signed-off-by: Peter Oskolkov 

Applied, thanks.


Re: [PATCH net] sctp: frag_point sanity check

2018-12-05 Thread David Miller
From: Jakub Audykowicz 
Date: Tue,  4 Dec 2018 20:27:41 +0100

> If for some reason an association's fragmentation point is zero,
> sctp_datamsg_from_user will try to endlessly try to divide a message
> into zero-sized chunks. This eventually causes kernel panic due to
> running out of memory.
> 
> Although this situation is quite unlikely, it has occurred before as
> reported. I propose to add this simple last-ditch sanity check due to
> the severity of the potential consequences.
> 
> Signed-off-by: Jakub Audykowicz 

Applied.


[PATCH net-next] neighbor: Add extack messages for add and delete commands

2018-12-05 Thread David Ahern
From: David Ahern 

Add extack messages for failures in neigh_add and neigh_delete.

Signed-off-by: David Ahern 
---
 net/core/neighbour.c | 55 +---
 1 file changed, 39 insertions(+), 16 deletions(-)

diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 41954e42a2de..6d479b5562be 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -1137,8 +1137,9 @@ static void neigh_update_hhs(struct neighbour *neigh)
Caller MUST hold reference count on the entry.
  */
 
-int neigh_update(struct neighbour *neigh, const u8 *lladdr, u8 new,
-u32 flags, u32 nlmsg_pid)
+static int __neigh_update(struct neighbour *neigh, const u8 *lladdr,
+ u8 new, u32 flags, u32 nlmsg_pid,
+ struct netlink_ext_ack *extack)
 {
u8 old;
int err;
@@ -1155,8 +1156,10 @@ int neigh_update(struct neighbour *neigh, const u8 
*lladdr, u8 new,
if (!(flags & NEIGH_UPDATE_F_ADMIN) &&
(old & (NUD_NOARP | NUD_PERMANENT)))
goto out;
-   if (neigh->dead)
+   if (neigh->dead) {
+   NL_SET_ERR_MSG(extack, "Neighbor entry is now dead");
goto out;
+   }
 
neigh_update_ext_learned(neigh, flags, );
 
@@ -1193,8 +1196,10 @@ int neigh_update(struct neighbour *neigh, const u8 
*lladdr, u8 new,
   use it, otherwise discard the request.
 */
err = -EINVAL;
-   if (!(old & NUD_VALID))
+   if (!(old & NUD_VALID)) {
+   NL_SET_ERR_MSG(extack, "No link layer address given");
goto out;
+   }
lladdr = neigh->ha;
}
 
@@ -1307,6 +1312,12 @@ int neigh_update(struct neighbour *neigh, const u8 
*lladdr, u8 new,
 
return err;
 }
+
+int neigh_update(struct neighbour *neigh, const u8 *lladdr, u8 new,
+u32 flags, u32 nlmsg_pid)
+{
+   return __neigh_update(neigh, lladdr, new, flags, nlmsg_pid, NULL);
+}
 EXPORT_SYMBOL(neigh_update);
 
 /* Update the neigh to listen temporarily for probe responses, even if it is
@@ -1678,8 +1689,10 @@ static int neigh_delete(struct sk_buff *skb, struct 
nlmsghdr *nlh,
goto out;
 
dst_attr = nlmsg_find_attr(nlh, sizeof(*ndm), NDA_DST);
-   if (dst_attr == NULL)
+   if (!dst_attr) {
+   NL_SET_ERR_MSG(extack, "Network address not specified");
goto out;
+   }
 
ndm = nlmsg_data(nlh);
if (ndm->ndm_ifindex) {
@@ -1694,8 +1707,10 @@ static int neigh_delete(struct sk_buff *skb, struct 
nlmsghdr *nlh,
if (tbl == NULL)
return -EAFNOSUPPORT;
 
-   if (nla_len(dst_attr) < (int)tbl->key_len)
+   if (nla_len(dst_attr) < (int)tbl->key_len) {
+   NL_SET_ERR_MSG(extack, "Invalid network address");
goto out;
+   }
 
if (ndm->ndm_flags & NTF_PROXY) {
err = pneigh_delete(tbl, net, nla_data(dst_attr), dev);
@@ -1711,10 +1726,9 @@ static int neigh_delete(struct sk_buff *skb, struct 
nlmsghdr *nlh,
goto out;
}
 
-   err = neigh_update(neigh, NULL, NUD_FAILED,
-  NEIGH_UPDATE_F_OVERRIDE |
-  NEIGH_UPDATE_F_ADMIN,
-  NETLINK_CB(skb).portid);
+   err = __neigh_update(neigh, NULL, NUD_FAILED,
+NEIGH_UPDATE_F_OVERRIDE | NEIGH_UPDATE_F_ADMIN,
+NETLINK_CB(skb).portid, extack);
write_lock_bh(>lock);
neigh_release(neigh);
neigh_remove_one(neigh, tbl);
@@ -1744,8 +1758,10 @@ static int neigh_add(struct sk_buff *skb, struct 
nlmsghdr *nlh,
goto out;
 
err = -EINVAL;
-   if (tb[NDA_DST] == NULL)
+   if (!tb[NDA_DST]) {
+   NL_SET_ERR_MSG(extack, "Network address not specified");
goto out;
+   }
 
ndm = nlmsg_data(nlh);
if (ndm->ndm_ifindex) {
@@ -1755,16 +1771,21 @@ static int neigh_add(struct sk_buff *skb, struct 
nlmsghdr *nlh,
goto out;
}
 
-   if (tb[NDA_LLADDR] && nla_len(tb[NDA_LLADDR]) < dev->addr_len)
+   if (tb[NDA_LLADDR] && nla_len(tb[NDA_LLADDR]) < dev->addr_len) {
+   NL_SET_ERR_MSG(extack, "Invalid link address");
goto out;
+   }
}
 
tbl = neigh_find_table(ndm->ndm_family);
if (tbl == NULL)
return -EAFNOSUPPORT;
 
-   if (nla_len(tb[NDA_DST]) < (int)tbl->key_len)
+   if (nla_len(tb[NDA_DST]) < (int)tbl->key_len) {
+   NL_SET_ERR_MSG(extack, "Invalid network address");
goto out;
+   }
+
dst = nla_data(tb[NDA_DST]);
lladdr = tb[NDA_LLADDR] ? nla_data(tb[NDA_LLADDR]) : NULL;
 
@@ -1780,8 +1801,10 @@ static int neigh_add(struct 

Re: [PATCH bpf-next] tools: bpftool: add a command to dump the trace pipe

2018-12-05 Thread Alexei Starovoitov
On Wed, Dec 05, 2018 at 06:15:23PM +, Quentin Monnet wrote:
> > > +
> > > + /* Allow room for NULL terminating byte and pipe file name */
> > > + snprintf(format, sizeof(format), "%%*s %%%zds %%99s %%*s %%*d %%*d\\n",
> > > +  PATH_MAX - strlen(pipe_name) - 1);
> > 
> > before scanning trace_pipe could you add a check that trace_options are 
> > compatible?
> > Otherwise there will be a lot of garbage printed.
> > afaik default is rarely changed, so the patch is ok as-is.
> > The followup some time in the future would be perfect.
> 
> Sure. What do you mean exactly by compatible options? I can check that
> "trace_printk" is set, is there any other option that would be relevant?

See Documentation/trace/ftrace.rst
a lot of the flags will change the format significantly.
Like 'bin' will make it binary.
I'm not suggesting to support all possible output formats.
Only to check that trace flags match scanf.



Re: [PATCH bpf-next] bpf: relax verifier restriction on BPF_MOV | BPF_ALU

2018-12-05 Thread Alexei Starovoitov
On Wed, Dec 05, 2018 at 03:32:50PM +, Jiong Wang wrote:
> On 05/12/2018 14:52, Edward Cree wrote:
> > On 05/12/18 09:46, Jiong Wang wrote:
> > > There is NO processed instruction number regression, either with or 
> > > without
> > > -mattr=+alu32.
> > 
> > > Cilium bpf
> > > ===
> > > bpf_lb-DLB_L3.o 2110/21101730/1733
> > That looks like a regression of 3 insns in the 32-bit case.
> > May be worth investigating why.
> 
> Will look into this.
> 
> > 
> > > + dst_reg = insn->dst_reg;
> > > + regs[dst_reg] = regs[src_reg];
> > > + if (BPF_CLASS(insn->code) == BPF_ALU) {
> > > + /* Update type and range info. */
> > > + regs[dst_reg].type = SCALAR_VALUE;
> > > + coerce_reg_to_size([dst_reg], 4);
> > Won't this break when handed a pointer (as root, so allowed to leak
> >   it)?  E.g. (pointer + x) gets turned into scalar x, rather than
> >   unknown scalar in range [0, 0x].
> 
> Initially I was gating this to scalar_value only, later was thinking it
> could be extended to ptr case if ptr leak is allowed.
> 
> But, your comment remind me min/max value doesn't mean real min/max value
> for ptr types value, it means the offset only if I am understanding the
> issue correctly. So, it will break pointer case.

correct. In case of is_pointer_value() && root -> mark_reg_unknown() has to be 
called

The explanation of additional 3 steps from another email makes sense to me.

Can you take a look why it helps default (llvm alu64) case too ?
bpf_overlay.o   3096/2898

Thanks


[Patch v2 net-next] call sk_dst_reset when set SO_DONTROUTE

2018-12-05 Thread yupeng
after set SO_DONTROUTE to 1, the IP layer should not route packets if
the dest IP address is not in link scope. But if the socket has cached
the dst_entry, such packets would be routed until the sk_dst_cache
expires. So we should clean the sk_dst_cache when a user set
SO_DONTROUTE option. Below are server/client python scripts which
could reprodue this issue:

server side code:
==
import socket
import struct
import time

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(('0.0.0.0', 9000))
s.listen(1)
sock, addr = s.accept()
sock.setsockopt(socket.SOL_SOCKET, socket.SO_DONTROUTE, struct.pack('i', 1))
while True:
sock.send(b'foo')
time.sleep(1)
==

client side code:
==
import socket
import time

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(('server_address', 9000))
while True:
data = s.recv(1024)
print(data)
==

Signed-off-by: yupeng 
---
 net/core/sock.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/core/sock.c b/net/core/sock.c
index f5bb89785e47..f00902c532cc 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -700,6 +700,7 @@ int sock_setsockopt(struct socket *sock, int level, int 
optname,
break;
case SO_DONTROUTE:
sock_valbool_flag(sk, SOCK_LOCALROUTE, valbool);
+   sk_dst_reset(sk);
break;
case SO_BROADCAST:
sock_valbool_flag(sk, SOCK_BROADCAST, valbool);
-- 
2.17.1



Re: [PATCH bpf-next 0/4] Misc improvements on bpf_func_info

2018-12-05 Thread Alexei Starovoitov
On Wed, Dec 05, 2018 at 05:35:43PM -0800, Martin KaFai Lau wrote:
> The patchset has a few improvements on bpf_func_info:
> 1. Improvements on the behaviors of info.func_info, info.func_info_cnt
>and info.func_info_rec_size.
> 2. Name change: s/insn_offset/insn_off/
> 
> Please see individual patch for details.

Applied, Thanks



Re: [PATCH bpf-next v2 3/5] selftests/bpf: use thoff instead of nhoff in BPF flow dissector

2018-12-05 Thread Alexei Starovoitov
On Tue, Dec 04, 2018 at 03:26:15PM -0800, Stanislav Fomichev wrote:
> On 12/04, Song Liu wrote:
> > On Mon, Dec 3, 2018 at 8:01 PM Stanislav Fomichev  wrote:
> > >
> > > We are returning thoff from the flow dissector, not the nhoff. Pass
> > > thoff along with nhoff to the bpf program (initially thoff == nhoff)
> > > and expect flow dissector amend/return thoff, not nhoff.
> > >
> > > This avoids confusion, when by the time bpf flow dissector exits,
> > > nhoff == thoff, which doesn't make much sense (this is relevant
> > > in the context of the next patch, where I add simple selftest
> > > and manually construct expected flow_keys).
> > >
> > > Signed-off-by: Stanislav Fomichev 
> > > ---
> > >  net/core/flow_dissector.c  |  1 +
> > >  tools/testing/selftests/bpf/bpf_flow.c | 36 --
> > >  2 files changed, 18 insertions(+), 19 deletions(-)
> > >
> > > diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
> > > index 3c8a78decbc0..ac19da6f390b 100644
> > > --- a/net/core/flow_dissector.c
> > > +++ b/net/core/flow_dissector.c
> > > @@ -708,6 +708,7 @@ bool __skb_flow_bpf_dissect(struct bpf_prog *prog,
> > > memset(flow_keys, 0, sizeof(*flow_keys));
> > > cb->qdisc_cb.flow_keys = flow_keys;
> > > flow_keys->nhoff = skb_network_offset(skb);
> > > +   flow_keys->thoff = flow_keys->nhoff;
> > 
> > Do we need this fix without this set? If yes, do we need it for bpf
> > tree as well?
> No, I don't think so. This just changes input to the flow dissector
> slightly (going forward).
> It used to be nhoff in, thoff out. Now it's thoff in (with nhoff for
> backwards compatibility) and thoff out.

That is still an api change.
Also patch 4 is a fix.
I think patches 3 and 4 need to go into bpf tree first.
Then wait for them to merge into bpf-next and resubmit the rest.



[PATCH bpf-next 1/4] bpf: Improve the info.func_info and info.func_info_rec_size behavior

2018-12-05 Thread Martin KaFai Lau
1) When bpf_dump_raw_ok() == false and the kernel can provide >=1
   func_info to the userspace, the current behavior is setting
   the info.func_info_cnt to 0 instead of setting info.func_info
   to 0.

   It is different from the behavior in jited_func_lens/nr_jited_func_lens,
   jited_ksyms/nr_jited_ksyms...etc.

   This patch fixes it. (i.e. set func_info to 0 instead of
   func_info_cnt to 0 when bpf_dump_raw_ok() == false).

2) When the userspace passed in info.func_info_cnt == 0, the kernel
   will set the expected func_info size back to the
   info.func_info_rec_size.  It is a way for the userspace to learn
   the kernel expected func_info_rec_size introduced in
   commit 838e96904ff3 ("bpf: Introduce bpf_func_info").

   An exception is the kernel expected size is not set when
   func_info is not available for a bpf_prog.  This makes the
   returned info.func_info_rec_size has different values
   depending on the returned value of info.func_info_cnt.

   This patch sets the kernel expected size to info.func_info_rec_size
   independent of the info.func_info_cnt.

3) The current logic only rejects invalid func_info_rec_size if
   func_info_cnt is non zero.  This patch also rejects invalid
   nonzero info.func_info_rec_size and not equal to the kernel
   expected size.

4) Set info.btf_id as long as prog->aux->btf != NULL.  That will
   setup the later copy_to_user() codes look the same as others
   which then easier to understand and maintain.

   prog->aux->btf is not NULL only if prog->aux->func_info_cnt > 0.

   Breaking up info.btf_id from prog->aux->func_info_cnt is needed
   for the later line info patch anyway.

   A similar change is made to bpf_get_prog_name().

Fixes: 838e96904ff3 ("bpf: Introduce bpf_func_info")
Signed-off-by: Martin KaFai Lau 
Acked-by: Yonghong Song 
---
 kernel/bpf/core.c|  2 +-
 kernel/bpf/syscall.c | 46 +++-
 2 files changed, 21 insertions(+), 27 deletions(-)

diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index f93ed667546f..2a73fda1db5f 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -410,7 +410,7 @@ static void bpf_get_prog_name(const struct bpf_prog *prog, 
char *sym)
sym  = bin2hex(sym, prog->tag, sizeof(prog->tag));
 
/* prog->aux->name will be ignored if full btf name is available */
-   if (prog->aux->btf) {
+   if (prog->aux->func_info_cnt) {
type = btf_type_by_id(prog->aux->btf,
  
prog->aux->func_info[prog->aux->func_idx].type_id);
func_name = btf_name_by_offset(prog->aux->btf, type->name_off);
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 4445d0d084d8..aa05aa38f4a8 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2083,6 +2083,12 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog,
return -EFAULT;
}
 
+   if ((info.func_info_cnt || info.func_info_rec_size) &&
+   info.func_info_rec_size != sizeof(struct bpf_func_info))
+   return -EINVAL;
+
+   info.func_info_rec_size = sizeof(struct bpf_func_info);
+
if (!capable(CAP_SYS_ADMIN)) {
info.jited_prog_len = 0;
info.xlated_prog_len = 0;
@@ -2226,35 +2232,23 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog 
*prog,
}
}
 
-   if (prog->aux->btf) {
-   u32 krec_size = sizeof(struct bpf_func_info);
-   u32 ucnt, urec_size;
-
+   if (prog->aux->btf)
info.btf_id = btf_id(prog->aux->btf);
 
-   ucnt = info.func_info_cnt;
-   info.func_info_cnt = prog->aux->func_info_cnt;
-   urec_size = info.func_info_rec_size;
-   info.func_info_rec_size = krec_size;
-   if (ucnt) {
-   /* expect passed-in urec_size is what the kernel 
expects */
-   if (urec_size != info.func_info_rec_size)
-   return -EINVAL;
-
-   if (bpf_dump_raw_ok()) {
-   char __user *user_finfo;
-
-   user_finfo = u64_to_user_ptr(info.func_info);
-   ucnt = min_t(u32, info.func_info_cnt, ucnt);
-   if (copy_to_user(user_finfo, 
prog->aux->func_info,
-krec_size * ucnt))
-   return -EFAULT;
-   } else {
-   info.func_info_cnt = 0;
-   }
+   ulen = info.func_info_cnt;
+   info.func_info_cnt = prog->aux->func_info_cnt;
+   if (info.func_info_cnt && ulen) {
+   if (bpf_dump_raw_ok()) {
+   char __user *user_finfo;
+
+   user_finfo = u64_to_user_ptr(info.func_info);
+   ulen = min_t(u32, info.func_info_cnt, 

[PATCH bpf-next 4/4] bpf: Expect !info.func_info and insn_off name changes in test_btf/libbpf/bpftool

2018-12-05 Thread Martin KaFai Lau
Similar to info.jited_*, info.func_info could be 0 if
bpf_dump_raw_ok() == false.

This patch makes changes to test_btf and bpftool to expect info.func_info
could be 0.

This patch also makes the needed changes for s/insn_offset/insn_off/.

Signed-off-by: Martin KaFai Lau 
Acked-by: Yonghong Song 
---
 tools/bpf/bpftool/prog.c   |  7 +++
 tools/bpf/bpftool/xlated_dumper.c  |  4 ++--
 tools/lib/bpf/btf.c| 12 ++--
 tools/testing/selftests/bpf/test_btf.c |  8 +++-
 4 files changed, 22 insertions(+), 9 deletions(-)

diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
index 56db61c5a91f..3148bc0e225b 100644
--- a/tools/bpf/bpftool/prog.c
+++ b/tools/bpf/bpftool/prog.c
@@ -589,6 +589,13 @@ static int do_dump(int argc, char **argv)
goto err_free;
}
 
+   if (func_info && !info.func_info) {
+   /* kernel.kptr_restrict is set.  No func_info available. */
+   free(func_info);
+   func_info = NULL;
+   finfo_cnt = 0;
+   }
+
if ((member_len == _prog_len &&
 info.jited_prog_insns == 0) ||
(member_len == _prog_len &&
diff --git a/tools/bpf/bpftool/xlated_dumper.c 
b/tools/bpf/bpftool/xlated_dumper.c
index e06ac0286a75..131ecd175533 100644
--- a/tools/bpf/bpftool/xlated_dumper.c
+++ b/tools/bpf/bpftool/xlated_dumper.c
@@ -261,7 +261,7 @@ void dump_xlated_json(struct dump_data *dd, void *buf, 
unsigned int len,
jsonw_start_object(json_wtr);
 
if (btf && record) {
-   if (record->insn_offset == i) {
+   if (record->insn_off == i) {
btf_dumper_type_only(btf, record->type_id,
 func_sig,
 sizeof(func_sig));
@@ -330,7 +330,7 @@ void dump_xlated_plain(struct dump_data *dd, void *buf, 
unsigned int len,
}
 
if (btf && record) {
-   if (record->insn_offset == i) {
+   if (record->insn_off == i) {
btf_dumper_type_only(btf, record->type_id,
 func_sig,
 sizeof(func_sig));
diff --git a/tools/lib/bpf/btf.c b/tools/lib/bpf/btf.c
index c2d641f3e16e..85d6446cf832 100644
--- a/tools/lib/bpf/btf.c
+++ b/tools/lib/bpf/btf.c
@@ -45,7 +45,7 @@ struct btf_ext {
 
 /* The minimum bpf_func_info checked by the loader */
 struct bpf_func_info_min {
-   __u32   insn_offset;
+   __u32   insn_off;
__u32   type_id;
 };
 
@@ -670,7 +670,7 @@ int btf_ext__reloc_init(struct btf *btf, struct btf_ext 
*btf_ext,
 
memcpy(data, sinfo->data, records_len);
 
-   /* adjust the insn_offset, the data in .BTF.ext is
+   /* adjust the insn_off, the data in .BTF.ext is
 * the actual byte offset, and the kernel expects
 * the offset in term of bpf_insn.
 *
@@ -681,7 +681,7 @@ int btf_ext__reloc_init(struct btf *btf, struct btf_ext 
*btf_ext,
struct bpf_func_info_min *record;
 
record = data + i * record_size;
-   record->insn_offset /= sizeof(struct bpf_insn);
+   record->insn_off /= sizeof(struct bpf_insn);
}
 
*func_info = data;
@@ -722,15 +722,15 @@ int btf_ext__reloc(struct btf *btf, struct btf_ext 
*btf_ext,
return -ENOMEM;
 
memcpy(data + existing_flen, sinfo->data, records_len);
-   /* adjust insn_offset only, the rest data will be passed
+   /* adjust insn_off only, the rest data will be passed
 * to the kernel.
 */
for (i = 0; i < sinfo->num_func_info; i++) {
struct bpf_func_info_min *record;
 
record = data + existing_flen + i * record_size;
-   record->insn_offset =
-   record->insn_offset / sizeof(struct bpf_insn) +
+   record->insn_off =
+   record->insn_off / sizeof(struct bpf_insn) +
insns_cnt;
}
*func_info = data;
diff --git a/tools/testing/selftests/bpf/test_btf.c 
b/tools/testing/selftests/bpf/test_btf.c
index bae7308b7ec5..ff0952ea757a 100644
--- a/tools/testing/selftests/bpf/test_btf.c
+++ b/tools/testing/selftests/bpf/test_btf.c
@@ -3156,7 +3156,7 @@ static struct btf_func_type_test {
 },
 
 {
-   .descr = "func_type (Incorrect bpf_func_info.insn_offset)",
+   .descr = "func_type (Incorrect bpf_func_info.insn_off)",
.raw_types = {
BTF_TYPE_INT_ENC(NAME_TBD, BTF_INT_SIGNED, 0, 32, 4),   /* [1] 
*/
  

[PATCH bpf-next 2/4] bpf: Change insn_offset to insn_off in bpf_func_info

2018-12-05 Thread Martin KaFai Lau
The later patch will introduce "struct bpf_line_info" which
has member "line_off" and "file_off" referring back to the
string section in btf.  The line_"off" and file_"off"
are more consistent to the naming convention in btf.h that
means "offset" (e.g. name_off in "struct btf_type").

The to-be-added "struct bpf_line_info" also has another
member, "insn_off" which is the same as the "insn_offset"
in "struct bpf_func_info".  Hence, this patch renames "insn_offset"
to "insn_off" for "struct bpf_func_info".

Signed-off-by: Martin KaFai Lau 
Acked-by: Yonghong Song 
---
 include/uapi/linux/bpf.h |  2 +-
 kernel/bpf/verifier.c| 18 +-
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index c8e12c5f..a84fd232d934 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2991,7 +2991,7 @@ struct bpf_flow_keys {
 };
 
 struct bpf_func_info {
-   __u32   insn_offset;
+   __u32   insn_off;
__u32   type_id;
 };
 
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 71988337ac14..7658c61c1a88 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -4707,24 +4707,24 @@ static int check_btf_func(struct bpf_prog *prog, struct 
bpf_verifier_env *env,
goto free_btf;
}
 
-   /* check insn_offset */
+   /* check insn_off */
if (i == 0) {
-   if (krecord[i].insn_offset) {
+   if (krecord[i].insn_off) {
verbose(env,
-   "nonzero insn_offset %u for the first 
func info record",
-   krecord[i].insn_offset);
+   "nonzero insn_off %u for the first func 
info record",
+   krecord[i].insn_off);
ret = -EINVAL;
goto free_btf;
}
-   } else if (krecord[i].insn_offset <= prev_offset) {
+   } else if (krecord[i].insn_off <= prev_offset) {
verbose(env,
"same or smaller insn offset (%u) than previous 
func info record (%u)",
-   krecord[i].insn_offset, prev_offset);
+   krecord[i].insn_off, prev_offset);
ret = -EINVAL;
goto free_btf;
}
 
-   if (env->subprog_info[i].start != krecord[i].insn_offset) {
+   if (env->subprog_info[i].start != krecord[i].insn_off) {
verbose(env, "func_info BTF section doesn't match 
subprog layout in BPF program\n");
ret = -EINVAL;
goto free_btf;
@@ -4739,7 +4739,7 @@ static int check_btf_func(struct bpf_prog *prog, struct 
bpf_verifier_env *env,
goto free_btf;
}
 
-   prev_offset = krecord[i].insn_offset;
+   prev_offset = krecord[i].insn_off;
urecord += urec_size;
}
 
@@ -4762,7 +4762,7 @@ static void adjust_btf_func(struct bpf_verifier_env *env)
return;
 
for (i = 0; i < env->subprog_cnt; i++)
-   env->prog->aux->func_info[i].insn_offset = 
env->subprog_info[i].start;
+   env->prog->aux->func_info[i].insn_off = 
env->subprog_info[i].start;
 }
 
 /* check %cur's range satisfies %old's */
-- 
2.17.1



[PATCH bpf-next 3/4] bpf: tools: Sync uapi bpf.h for the name changes in bpf_func_info

2018-12-05 Thread Martin KaFai Lau
This patch sync the name changes in bpf_func_info to
the tools/.

Signed-off-by: Martin KaFai Lau 
Acked-by: Yonghong Song 
---
 tools/include/uapi/linux/bpf.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 64262890feb2..16263e8827fc 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2991,7 +2991,7 @@ struct bpf_flow_keys {
 };
 
 struct bpf_func_info {
-   __u32   insn_offset;
+   __u32   insn_off;
__u32   type_id;
 };
 
-- 
2.17.1



[PATCH bpf-next 0/4] Misc improvements on bpf_func_info

2018-12-05 Thread Martin KaFai Lau
The patchset has a few improvements on bpf_func_info:
1. Improvements on the behaviors of info.func_info, info.func_info_cnt
   and info.func_info_rec_size.
2. Name change: s/insn_offset/insn_off/

Please see individual patch for details.

Martin KaFai Lau (4):
  bpf: Improve the info.func_info and info.func_info_rec_size behavior
  bpf: Change insn_offset to insn_off in bpf_func_info
  bpf: tools: Sync uapi bpf.h for the name changes in bpf_func_info
  bpf: Expect !info.func_info and insn_off name changes in
test_btf/libbpf/bpftool

 include/uapi/linux/bpf.h   |  2 +-
 kernel/bpf/core.c  |  2 +-
 kernel/bpf/syscall.c   | 46 +++---
 kernel/bpf/verifier.c  | 18 +-
 tools/bpf/bpftool/prog.c   |  7 
 tools/bpf/bpftool/xlated_dumper.c  |  4 +--
 tools/include/uapi/linux/bpf.h |  2 +-
 tools/lib/bpf/btf.c| 12 +++
 tools/testing/selftests/bpf/test_btf.c |  8 -
 9 files changed, 54 insertions(+), 47 deletions(-)

-- 
2.17.1



Re: [PATCH net-next RFC 7/7] bnxt_en: Add bnxt_en initial port params table and register it

2018-12-05 Thread Michael Chan
On Wed, Dec 5, 2018 at 4:42 PM Jakub Kicinski
 wrote:
>
> On Wed, 5 Dec 2018 16:01:08 -0800, Michael Chan wrote:
> > On Wed, Dec 5, 2018 at 3:33 PM Jakub Kicinski
> >  wrote:
> > >
> > > On Wed,  5 Dec 2018 11:27:00 +0530, Vasundhara Volam wrote:
> > > > Register devlink_port with devlink and create initial port params
> > > > table for bnxt_en. The table consists of a generic parameter:
> > > >
> > > > wake-on-lan: Enables Wake on Lan for this port when magic packet
> > > > is received with this port's MAC address using ACPI pattern.
> > > > If enabled, the controller asserts a wake pin upon reception of
> > > > WoL packet.  ACPI (Advanced Configuration and Power Interface) is
> > > > an industry specification for the efficient handling of power
> > > > consumption in desktop and mobile computers.
> > > >
> > > > Cc: Michael Chan 
> > > > Signed-off-by: Vasundhara Volam 
> > >
> > > Why do we need a WoL as a devlink parameter (rather than ethtool -s)?
> >
> > I believe ethtool -s for WoL is a non-persistent setting, meaning that
> > if you power cycle the system, the WoL setting will go back to
> > default.
> >
> > devlink on the other hand is a permanent setting.  ethtool should
> > initially report the default WoL setting and it can then be changed
> > (in a non permanent way) using ethtool -s.
>
> All network configuration settings in Linux are non-persistent AFAIK.
> That's why network configuration daemons exist:
>
> https://wiki.debian.org/WakeOnLan
>
> Perhaps the objective to move more of the network configuration into the
> firmware?  That'd be a bleak scenario, so probably not..
>
> My understanding was the persistent devlink settings are for things
> which have to be set at device init time.  Like say PCI endpoint
> configuration.  FW loading configuration.
>
> Besides, the parameter you add is just true/false, when ethtool has
> multiple options.
>
> It feels to me like we moved from ioctls to Netlink, and now even
> before ethtool was converted to Netlink we may move to unstructured
> strings.  That's not a step forward, if you ask me.

We do have a parameter in NVRAM that controls default WoL.  I think
this is to expose that parameter so it can be set one way or the
other. There are scenarios where Linux has not booted yet (and so
there is no opportunity to run ethtool -s or any daemons yet) and this
parameter will control whether the machine will wake up or not.


Re: [PATCH net-next 0/8] Pass extack to NETDEV_PRE_UP

2018-12-05 Thread Petr Machata
David Miller  writes:

> Your CC list is so huge that vger.kernel.org dropped all of your postings.
>
> That CC list is not reasonable at all, trim it down to the most minimum
> set.  Probably 2 or 3 mailing lists, primarily netdev, and maybe a small
> handful of specific developers.
>
> Nothing more.

Sorry about that. I'll resend tomorrow with a reduced CC.


Re: [PATCH net-next 2/7] neighbor: Fold ___neigh_lookup_noref into __neigh_lookup_noref

2018-12-05 Thread David Miller
From: David Ahern 
Date: Wed, 5 Dec 2018 17:46:37 -0700

> ok. patches 5-7 are not dependent on 1-4. Should I re-send outside of
> this set?

Yes, please respin.

Thanks David.


Re: [pull request][net-next V2 0/7] Mellanox, mlx5e updates 2018-12-04

2018-12-05 Thread David Miller
From: Saeed Mahameed 
Date: Wed,  5 Dec 2018 16:12:58 -0800

> The following series is for mlx5e netdevice driver, it adds ethtool
> support for RX hash fields configuration and some misc updates, please
> see tag log below.
> 
> Please pull and let me know if there's any problem.
> 
> v1->v2:
>  - Move static const array to c file.
>  - Remove unnecessary blank line
>  - Add #include 
>  - Print priv flag name rather than its hex value

Pulled, thanks Saeed.


Re: [PATCH net-next 2/7] neighbor: Fold ___neigh_lookup_noref into __neigh_lookup_noref

2018-12-05 Thread David Ahern
On 12/5/18 5:46 PM, David Ahern wrote:
> ok. patches 5-7 are not dependent on 1-4. Should I re-send outside of
> this set?

bleh. 5 is. I'll re-send.


Re: [PATCH net-next 2/7] neighbor: Fold ___neigh_lookup_noref into __neigh_lookup_noref

2018-12-05 Thread David Ahern
On 12/5/18 5:44 PM, David Miller wrote:
> From: David Ahern 
> Date: Wed,  5 Dec 2018 15:34:09 -0800
> 
>> @@ -270,37 +270,25 @@ static inline bool neigh_key_eq128(const struct 
>> neighbour *n, const void *pkey)
>>  (n32[2] ^ p32[2]) | (n32[3] ^ p32[3])) == 0;
>>  }
>>  
>> -static inline struct neighbour *___neigh_lookup_noref(
>> -struct neigh_table *tbl,
>> -bool (*key_eq)(const struct neighbour *n, const void *pkey),
>> -__u32 (*hash)(const void *pkey,
>> -  const struct net_device *dev,
>> -  __u32 *hash_rnd),
>> -const void *pkey,
>> -struct net_device *dev)
>> +static inline struct neighbour *__neigh_lookup_noref(struct neigh_table 
>> *tbl,
>> + const void *pkey,
>> + struct net_device *dev)
>>  {
> 
> Sorry, we can't do this.
> 
> The whole point of how this is laid out is so that the entire hash traversal,
> including the hash function, is expanded inline.
> 
> This demux is extremely critical on the output side, it must be the
> smallest number of cycles possible.  It was the only way I could justify
> not caching neigh entries in the routes any more when I wrote this code.
> 
> Even before retpoline, putting an indirect call here is painful.  With
> retpoline it is deadly.
> 
> Please avoid removing the full inline expansion of the neigh lookup in the 
> ipv6
> and ipv4 data paths.
> 

ok. patches 5-7 are not dependent on 1-4. Should I re-send outside of
this set?


[Patch v2 net-next] call sk_dst_reset when set SO_DONTROUTE

2018-12-05 Thread yupeng
after set SO_DONTROUTE to 1, the IP layer should not route packets if
the dest IP address is not in link scope. But if the socket has cached
the dst_entry, such packets would be routed until the sk_dst_cache
expires. So we should clean the sk_dst_cache when a user set
SO_DONTROUTE option. Below are server/client python scripts which
could reprodue this issue:

server side code:
==
import socket
import struct
import time

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(('0.0.0.0', 9000))
s.listen(1)
sock, addr = s.accept()
sock.setsockopt(socket.SOL_SOCKET, socket.SO_DONTROUTE, struct.pack('i', 1))
while True:
sock.send(b'foo')
time.sleep(1)
==

client side code:
==
import socket
import time

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(('server_address', 9000))
while True:
data = s.recv(1024)
print(data)
==

Signed-off-by: yupeng 
---
 net/core/sock.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/core/sock.c b/net/core/sock.c
index f5bb89785e47..f00902c532cc 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -700,6 +700,7 @@ int sock_setsockopt(struct socket *sock, int level, int 
optname,
break;
case SO_DONTROUTE:
sock_valbool_flag(sk, SOCK_LOCALROUTE, valbool);
+   sk_dst_reset(sk);
break;
case SO_BROADCAST:
sock_valbool_flag(sk, SOCK_BROADCAST, valbool);
-- 
2.17.1



Re: [PATCH net-next 2/7] neighbor: Fold ___neigh_lookup_noref into __neigh_lookup_noref

2018-12-05 Thread David Miller
From: David Ahern 
Date: Wed,  5 Dec 2018 15:34:09 -0800

> @@ -270,37 +270,25 @@ static inline bool neigh_key_eq128(const struct 
> neighbour *n, const void *pkey)
>   (n32[2] ^ p32[2]) | (n32[3] ^ p32[3])) == 0;
>  }
>  
> -static inline struct neighbour *___neigh_lookup_noref(
> - struct neigh_table *tbl,
> - bool (*key_eq)(const struct neighbour *n, const void *pkey),
> - __u32 (*hash)(const void *pkey,
> -   const struct net_device *dev,
> -   __u32 *hash_rnd),
> - const void *pkey,
> - struct net_device *dev)
> +static inline struct neighbour *__neigh_lookup_noref(struct neigh_table *tbl,
> +  const void *pkey,
> +  struct net_device *dev)
>  {

Sorry, we can't do this.

The whole point of how this is laid out is so that the entire hash traversal,
including the hash function, is expanded inline.

This demux is extremely critical on the output side, it must be the
smallest number of cycles possible.  It was the only way I could justify
not caching neigh entries in the routes any more when I wrote this code.

Even before retpoline, putting an indirect call here is painful.  With
retpoline it is deadly.

Please avoid removing the full inline expansion of the neigh lookup in the ipv6
and ipv4 data paths.

Thank you.


Re: [PATCH net-next RFC 7/7] bnxt_en: Add bnxt_en initial port params table and register it

2018-12-05 Thread Jakub Kicinski
On Wed, 5 Dec 2018 16:01:08 -0800, Michael Chan wrote:
> On Wed, Dec 5, 2018 at 3:33 PM Jakub Kicinski
>  wrote:
> >
> > On Wed,  5 Dec 2018 11:27:00 +0530, Vasundhara Volam wrote:  
> > > Register devlink_port with devlink and create initial port params
> > > table for bnxt_en. The table consists of a generic parameter:
> > >
> > > wake-on-lan: Enables Wake on Lan for this port when magic packet
> > > is received with this port's MAC address using ACPI pattern.
> > > If enabled, the controller asserts a wake pin upon reception of
> > > WoL packet.  ACPI (Advanced Configuration and Power Interface) is
> > > an industry specification for the efficient handling of power
> > > consumption in desktop and mobile computers.
> > >
> > > Cc: Michael Chan 
> > > Signed-off-by: Vasundhara Volam   
> >
> > Why do we need a WoL as a devlink parameter (rather than ethtool -s)?  
> 
> I believe ethtool -s for WoL is a non-persistent setting, meaning that
> if you power cycle the system, the WoL setting will go back to
> default.
> 
> devlink on the other hand is a permanent setting.  ethtool should
> initially report the default WoL setting and it can then be changed
> (in a non permanent way) using ethtool -s.

All network configuration settings in Linux are non-persistent AFAIK.
That's why network configuration daemons exist:

https://wiki.debian.org/WakeOnLan

Perhaps the objective to move more of the network configuration into the
firmware?  That'd be a bleak scenario, so probably not..

My understanding was the persistent devlink settings are for things
which have to be set at device init time.  Like say PCI endpoint
configuration.  FW loading configuration.

Besides, the parameter you add is just true/false, when ethtool has
multiple options.

It feels to me like we moved from ioctls to Netlink, and now even
before ethtool was converted to Netlink we may move to unstructured
strings.  That's not a step forward, if you ask me.


[Patch v2 net-next] call sk_dst_reset when set SO_DONTROUTE

2018-12-05 Thread yupeng
after set SO_DONTROUTE to 1, the IP layer should not route packets if
the dest IP address is not in link scope. But if the socket has cached
the dst_entry, such packets would be routed until the sk_dst_cache
expires. So we should clean the sk_dst_cache when a user set
SO_DONTROUTE option. Below are server/client python scripts which
could reprodue this issue:

server side code:
==
import socket
import struct
import time

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(('0.0.0.0', 9000))
s.listen(1)
sock, addr = s.accept()
sock.setsockopt(socket.SOL_SOCKET, socket.SO_DONTROUTE, struct.pack('i', 1))
while True:
sock.send(b'foo')
time.sleep(1)
==

client side code:
==
import socket
import time

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(('server_address', 9000))
while True:
data = s.recv(1024)
print(data)
==

Signed-off-by: yupeng 
---
 net/core/sock.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/core/sock.c b/net/core/sock.c
index f5bb89785e47..f00902c532cc 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -700,6 +700,7 @@ int sock_setsockopt(struct socket *sock, int level, int 
optname,
break;
case SO_DONTROUTE:
sock_valbool_flag(sk, SOCK_LOCALROUTE, valbool);
+   sk_dst_reset(sk);
break;
case SO_BROADCAST:
sock_valbool_flag(sk, SOCK_BROADCAST, valbool);
-- 
2.17.1



Re: [Patch net-next] call sk_dst_reset when set SO_DONTROUTE

2018-12-05 Thread peng yu
In fack, my customer's issue is that he set SO_DONTROUTE by mistake.
He shouldn't do that. But after he set this flag, the connection has
no problem at first. After the sk_dst_cache expired for some reasons,
the connection stucked. I think the correct behavior is that the
connection should stuck immediately after set SO_DONTROUTE to 1.
On Wed, Dec 5, 2018 at 4:20 PM Eric Dumazet  wrote:
>
>
>
> On 12/05/2018 04:13 PM, peng yu wrote:
> > The SO_DONTROUTE doesn't impact the TCP receiving path, but it should
> > block the ACK of the receiving packet. When there are too many packets
> > which are not ACKed, the client will stop to send packets, so the
> > sock.recv on the server side won't receive data after it received some
> > data. I extracted the test code from my customer's production
> > environment. The test code could reproduce the issue but it is not a
> > good example. I will rewrite a test code and re-submit the patch.
>
> Now I fully understand ;)
>
> Basically your customer is using SO_DONTROUTE to 'pause' incoming TCP traffic
> by not sending ACK.
>
> Interesting trick but quite hacky. I guess that sending ACK with 0 window
> would be less intrusive.
>
> > Wed, Dec 5, 2018 at 3:17 PM Eric Dumazet  wrote:
> >>
> >> On Wed, Dec 5, 2018 at 3:07 PM yupeng  wrote:
> >>>
> >>> after set SO_DONTROUTE to 1, the IP layer should not route packets if
> >>> the dest IP address is not in link scope. But if the socket has cached
> >>> the dst_entry, such packets would be routed until the sk_dst_cache
> >>> expires. So we should clean the sk_dst_cache when a user set
> >>> SO_DONTROUTE option. Below are server/client python scripts which
> >>> could reprodue this issue:
> >>>
> >>> server side code:
> >>> ==
> >>> import socket
> >>> import struct
> >>>
> >>> s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
> >>> s.bind(('0.0.0.0', 9000))
> >>> s.listen(1)
> >>> sock, addr = s.accept()
> >>> sock.setsockopt(socket.SOL_SOCKET, socket.SO_DONTROUTE, struct.pack('i', 
> >>> 1))
> >>> while True:
> >>> data = sock.recv(1024) # here the sock.recv should not return anything
> >>
> >> Why is that so ?
> >>
> >> What is the relation of input path with the SO_DONTROUTE which is for TX ?
> >>
> >> sk_dst_reset(sk) should not impact receive side ?
> >>
> >> Thanks for providing a test !
> >>
> >>> print(data)
> >>> ==
> >>>
> >>> client side code:
> >>> ==
> >>> import socket
> >>> import time
> >>>
> >>> s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
> >>> s.connect(('server_address', 9000))
> >>> while True:
> >>> s.send(b'foo')
> >>> print('send foo')
> >>> time.sleep(1)
> >>> ==
> >>>
> >>> Signed-off-by: yupeng 
> >>> ---
> >>>  net/core/sock.c | 1 +
> >>>  1 file changed, 1 insertion(+)
> >>>
> >>> diff --git a/net/core/sock.c b/net/core/sock.c
> >>> index f5bb89785e47..f00902c532cc 100644
> >>> --- a/net/core/sock.c
> >>> +++ b/net/core/sock.c
> >>> @@ -700,6 +700,7 @@ int sock_setsockopt(struct socket *sock, int level, 
> >>> int optname,
> >>> break;
> >>> case SO_DONTROUTE:
> >>> sock_valbool_flag(sk, SOCK_LOCALROUTE, valbool);
> >>> +   sk_dst_reset(sk);
> >>> break;
> >>> case SO_BROADCAST:
> >>> sock_valbool_flag(sk, SOCK_BROADCAST, valbool);
> >>> --
> >>> 2.17.1
> >>>


Re: [PATCH net] tcp: fix NULL ref in tail loss probe

2018-12-05 Thread David Miller
From: Yuchung Cheng 
Date: Wed,  5 Dec 2018 14:38:38 -0800

> TCP loss probe timer may fire when the retranmission queue is empty but
> has a non-zero tp->packets_out counter. tcp_send_loss_probe will call
> tcp_rearm_rto which triggers NULL pointer reference by fetching the
> retranmission queue head in its sub-routines.
> 
> Add a more detailed warning to help catch the root cause of the inflight
> accounting inconsistency.
> 
> Reported-by: Rafael Tinoco 
> Signed-off-by: Yuchung Cheng 
> Signed-off-by: Eric Dumazet 
> Signed-off-by: Neal Cardwell 

Applied, thanks for working to diagnose this so quickly.


Re: [PATCH 1/2] net: linkwatch: send change uevent on link changes

2018-12-05 Thread David Miller
From: Jouke Witteveen 
Date: Wed, 5 Dec 2018 23:38:17 +0100

> Can you elaborate a bit? I may not be aware of the policy you have in
> mind.

When we have a user facing interface to do something, we don't create
another one unless it is absolutely, positively, unavoidable.


Re: [PATCH net] tcp: Do not underestimate rwnd_limited

2018-12-05 Thread David Miller
From: Eric Dumazet 
Date: Wed,  5 Dec 2018 14:24:31 -0800

> If available rwnd is too small, tcp_tso_should_defer()
> can decide it is worth waiting before splitting a TSO packet.
> 
> This really means we are rwnd limited.
> 
> Fixes: 5615f88614a4 ("tcp: instrument how long TCP is limited by receive 
> window")
> Signed-off-by: Eric Dumazet 

Applied and queued up for -stable, thanks Eric.


Re: pull-request: bpf 2018-12-05

2018-12-05 Thread David Miller
From: Alexei Starovoitov 
Date: Wed, 5 Dec 2018 13:23:22 -0800

> The following pull-request contains BPF updates for your *net* tree.
> 
> The main changes are:
> 
> 1) fix bpf uapi pointers for 32-bit architectures, from Daniel.
> 
> 2) improve verifer ability to handle progs with a lot of branches, from 
> Alexei.
> 
> 3) strict btf checks, from Yonghong.
> 
> 4) bpf_sk_lookup api cleanup, from Joe.
> 
> 5) other misc fixes
> 
> Please consider pulling these changes from:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

Pulled, thank you.


Re: [Patch net-next] call sk_dst_reset when set SO_DONTROUTE

2018-12-05 Thread Eric Dumazet



On 12/05/2018 04:13 PM, peng yu wrote:
> The SO_DONTROUTE doesn't impact the TCP receiving path, but it should
> block the ACK of the receiving packet. When there are too many packets
> which are not ACKed, the client will stop to send packets, so the
> sock.recv on the server side won't receive data after it received some
> data. I extracted the test code from my customer's production
> environment. The test code could reproduce the issue but it is not a
> good example. I will rewrite a test code and re-submit the patch.

Now I fully understand ;)

Basically your customer is using SO_DONTROUTE to 'pause' incoming TCP traffic
by not sending ACK.

Interesting trick but quite hacky. I guess that sending ACK with 0 window
would be less intrusive.

> Wed, Dec 5, 2018 at 3:17 PM Eric Dumazet  wrote:
>>
>> On Wed, Dec 5, 2018 at 3:07 PM yupeng  wrote:
>>>
>>> after set SO_DONTROUTE to 1, the IP layer should not route packets if
>>> the dest IP address is not in link scope. But if the socket has cached
>>> the dst_entry, such packets would be routed until the sk_dst_cache
>>> expires. So we should clean the sk_dst_cache when a user set
>>> SO_DONTROUTE option. Below are server/client python scripts which
>>> could reprodue this issue:
>>>
>>> server side code:
>>> ==
>>> import socket
>>> import struct
>>>
>>> s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
>>> s.bind(('0.0.0.0', 9000))
>>> s.listen(1)
>>> sock, addr = s.accept()
>>> sock.setsockopt(socket.SOL_SOCKET, socket.SO_DONTROUTE, struct.pack('i', 1))
>>> while True:
>>> data = sock.recv(1024) # here the sock.recv should not return anything
>>
>> Why is that so ?
>>
>> What is the relation of input path with the SO_DONTROUTE which is for TX ?
>>
>> sk_dst_reset(sk) should not impact receive side ?
>>
>> Thanks for providing a test !
>>
>>> print(data)
>>> ==
>>>
>>> client side code:
>>> ==
>>> import socket
>>> import time
>>>
>>> s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
>>> s.connect(('server_address', 9000))
>>> while True:
>>> s.send(b'foo')
>>> print('send foo')
>>> time.sleep(1)
>>> ==
>>>
>>> Signed-off-by: yupeng 
>>> ---
>>>  net/core/sock.c | 1 +
>>>  1 file changed, 1 insertion(+)
>>>
>>> diff --git a/net/core/sock.c b/net/core/sock.c
>>> index f5bb89785e47..f00902c532cc 100644
>>> --- a/net/core/sock.c
>>> +++ b/net/core/sock.c
>>> @@ -700,6 +700,7 @@ int sock_setsockopt(struct socket *sock, int level, int 
>>> optname,
>>> break;
>>> case SO_DONTROUTE:
>>> sock_valbool_flag(sk, SOCK_LOCALROUTE, valbool);
>>> +   sk_dst_reset(sk);
>>> break;
>>> case SO_BROADCAST:
>>> sock_valbool_flag(sk, SOCK_BROADCAST, valbool);
>>> --
>>> 2.17.1
>>>


Re: [PATCH net-next 0/6] u32 to linkmode fixes

2018-12-05 Thread David Miller
From: Andrew Lunn 
Date: Wed,  5 Dec 2018 21:49:39 +0100

> This patchset fixes issues found in the last patchset which converted
> the phydev advertise etc, from a u32 to a linux bitmap. Most of the
> issues are the result of clearing bits which should not of been
> cleared. To make the API clearer, the idea from Heiner Kallweit was
> used, with _mod_ to indicate the function modifies just the bits it
> needs to, or _to_ to clear all bits and just set bit that need to be
> set.

Series applied, thanks Andrew.

Please always list the Fixes tag first in the future.  I fixed if up
for you this time.

Thanks again.


Re: [PATCH net] net: use skb_list_del_init() to remove from RX sublists

2018-12-05 Thread David Miller
From: Edward Cree 
Date: Tue, 4 Dec 2018 17:37:57 +

> list_del() leaves the skb->next pointer poisoned, which can then lead to
>  a crash in e.g. OVS forwarding.  For example, setting up an OVS VXLAN
>  forwarding bridge on sfc as per:
 ...
> So, in all listified-receive handling, instead pull skbs off the lists with
>  skb_list_del_init().
> 
> Fixes: 9af86f933894 ("net: core: fix use-after-free in 
> __netif_receive_skb_list_core")
> Fixes: 7da517a3bc52 ("net: core: Another step of skb receive list processing")
> Fixes: a4ca8b7df73c ("net: ipv4: fix drop handling in ip_list_rcv() and 
> ip_list_rcv_finish()")
> Fixes: d8269e2cbf90 ("net: ipv6: listify ipv6_rcv() and ip6_rcv_finish()")
> Signed-off-by: Edward Cree 

Applied and queued up for -stable

> I'm not sure if these are the right Fixes tags, or if I should instead be
>  fingering some commit that made dev_hard_start_xmit() more sensitive to
>  skb->next.
> Also, I only saw a crash from the list_del() in 
> __netif_receive_skb_list_core()
>  but I converted all of them in the listified RX path, in case any others
>  have similar ways to escape into paths that care about skb->next.

I think we should use skb_list_del_init() on in all cases skb->list except
where we immediately queue it onto another list in a trivially auditable
way.

Therefore I think what you did is the way to go.

Thanks.


[net-next V2 6/7] net/mlx5e: ethtool, Support user configuration for RX hash fields

2018-12-05 Thread Saeed Mahameed
From: Aya Levin 

Enable user configuration of RX hash fields that are used for traffic
spreading into RX queues. User can change built-in RSS (Receive Side
Scaling) profiles on the following traffic types: UDP4, UDP6, TCP4 and
TCP6.  This configuration effects both outer and inner headers.  Added
support for ethtool commands: ETHTOOL_SRXFH and ETHTOOL_GRXFH.

Command example respectively:
$ethtool -N eth1 rx-flow-hash tcp4 sdfn
$ethtool -n eth1 rx-flow-hash tcpp4
IP SA
IP DA
L4 bytes 0 & 1 [TCP/UDP src port]
L4 bytes 2 & 3 [TCP/UDP dst port]

Signed-off-by: Aya Levin 
Reviewed-by: Tariq Toukan 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |   1 +
 .../mellanox/mlx5/core/en_fs_ethtool.c| 112 ++
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  27 -
 3 files changed, 134 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 330c476c1df2..c6686256a92a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -650,6 +650,7 @@ enum {
 
 struct mlx5e_rss_params {
u32 indirection_rqt[MLX5E_INDIR_RQT_SIZE];
+   u32 rx_hash_fields[MLX5E_NUM_INDIR_TIRS];
u8  toeplitz_hash_key[40];
u8  hfunc;
 };
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_fs_ethtool.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_fs_ethtool.c
index c18dcebe1462..4421c10f58ae 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_fs_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_fs_ethtool.c
@@ -771,6 +771,112 @@ void mlx5e_ethtool_init_steering(struct mlx5e_priv *priv)
INIT_LIST_HEAD(>fs.ethtool.rules);
 }
 
+static enum mlx5e_traffic_types flow_type_to_traffic_type(u32 flow_type)
+{
+   switch (flow_type) {
+   case TCP_V4_FLOW:
+   return  MLX5E_TT_IPV4_TCP;
+   case TCP_V6_FLOW:
+   return MLX5E_TT_IPV6_TCP;
+   case UDP_V4_FLOW:
+   return MLX5E_TT_IPV4_UDP;
+   case UDP_V6_FLOW:
+   return MLX5E_TT_IPV6_UDP;
+   case AH_V4_FLOW:
+   return MLX5E_TT_IPV4_IPSEC_AH;
+   case AH_V6_FLOW:
+   return MLX5E_TT_IPV6_IPSEC_AH;
+   case ESP_V4_FLOW:
+   return MLX5E_TT_IPV4_IPSEC_ESP;
+   case ESP_V6_FLOW:
+   return MLX5E_TT_IPV6_IPSEC_ESP;
+   case IPV4_FLOW:
+   return MLX5E_TT_IPV4;
+   case IPV6_FLOW:
+   return MLX5E_TT_IPV6;
+   default:
+   return MLX5E_NUM_INDIR_TIRS;
+   }
+}
+
+static int mlx5e_set_rss_hash_opt(struct mlx5e_priv *priv,
+ struct ethtool_rxnfc *nfc)
+{
+   int inlen = MLX5_ST_SZ_BYTES(modify_tir_in);
+   enum mlx5e_traffic_types tt;
+   u8 rx_hash_field = 0;
+   void *in;
+
+   tt = flow_type_to_traffic_type(nfc->flow_type);
+   if (tt == MLX5E_NUM_INDIR_TIRS)
+   return -EINVAL;
+
+   /*  RSS does not support anything other than hashing to queues
+*  on src IP, dest IP, TCP/UDP src port and TCP/UDP dest
+*  port.
+*/
+   if (nfc->flow_type != TCP_V4_FLOW &&
+   nfc->flow_type != TCP_V6_FLOW &&
+   nfc->flow_type != UDP_V4_FLOW &&
+   nfc->flow_type != UDP_V6_FLOW)
+   return -EOPNOTSUPP;
+
+   if (nfc->data & ~(RXH_IP_SRC | RXH_IP_DST |
+ RXH_L4_B_0_1 | RXH_L4_B_2_3))
+   return -EOPNOTSUPP;
+
+   if (nfc->data & RXH_IP_SRC)
+   rx_hash_field |= MLX5_HASH_FIELD_SEL_SRC_IP;
+   if (nfc->data & RXH_IP_DST)
+   rx_hash_field |= MLX5_HASH_FIELD_SEL_DST_IP;
+   if (nfc->data & RXH_L4_B_0_1)
+   rx_hash_field |= MLX5_HASH_FIELD_SEL_L4_SPORT;
+   if (nfc->data & RXH_L4_B_2_3)
+   rx_hash_field |= MLX5_HASH_FIELD_SEL_L4_DPORT;
+
+   in = kvzalloc(inlen, GFP_KERNEL);
+   if (!in)
+   return -ENOMEM;
+
+   mutex_lock(>state_lock);
+
+   if (rx_hash_field == priv->rss_params.rx_hash_fields[tt])
+   goto out;
+
+   priv->rss_params.rx_hash_fields[tt] = rx_hash_field;
+   mlx5e_modify_tirs_hash(priv, in, inlen);
+
+out:
+   mutex_unlock(>state_lock);
+   kvfree(in);
+   return 0;
+}
+
+static int mlx5e_get_rss_hash_opt(struct mlx5e_priv *priv,
+ struct ethtool_rxnfc *nfc)
+{
+   enum mlx5e_traffic_types tt;
+   u32 hash_field = 0;
+
+   tt = flow_type_to_traffic_type(nfc->flow_type);
+   if (tt == MLX5E_NUM_INDIR_TIRS)
+   return -EINVAL;
+
+   hash_field = priv->rss_params.rx_hash_fields[tt];
+   nfc->data = 0;
+
+   if (hash_field & MLX5_HASH_FIELD_SEL_SRC_IP)
+   nfc->data |= RXH_IP_SRC;
+   if (hash_field & MLX5_HASH_FIELD_SEL_DST_IP)
+   nfc->data |= RXH_IP_DST;
+   if 

[net-next V2 5/7] net/mlx5e: Move RSS params to a dedicated struct

2018-12-05 Thread Saeed Mahameed
From: Aya Levin 

Remove RSS params from params struct under channels, and introduce
a new struct with RSS configuration params under priv struct. There is
no functional change here.

Signed-off-by: Aya Levin 
Reviewed-by: Tariq Toukan 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h  | 16 ---
 .../ethernet/mellanox/mlx5/core/en_ethtool.c  | 34 +++
 .../net/ethernet/mellanox/mlx5/core/en_main.c | 43 +++
 .../net/ethernet/mellanox/mlx5/core/en_rep.c  |  9 ++--
 .../net/ethernet/mellanox/mlx5/core/en_tc.c   |  5 ++-
 .../ethernet/mellanox/mlx5/core/ipoib/ipoib.c |  2 +-
 6 files changed, 62 insertions(+), 47 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 8539ea9a67ca..330c476c1df2 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -244,9 +244,6 @@ struct mlx5e_params {
bool lro_en;
u32 lro_wqe_sz;
u8  tx_min_inline_mode;
-   u8  rss_hfunc;
-   u8  toeplitz_hash_key[40];
-   u32 indirection_rqt[MLX5E_INDIR_RQT_SIZE];
bool vlan_strip_disable;
bool scatter_fcs_en;
bool rx_dim_enabled;
@@ -651,6 +648,12 @@ enum {
MLX5E_NIC_PRIO
 };
 
+struct mlx5e_rss_params {
+   u32 indirection_rqt[MLX5E_INDIR_RQT_SIZE];
+   u8  toeplitz_hash_key[40];
+   u8  hfunc;
+};
+
 struct mlx5e_priv {
/* priv data path fields - start */
struct mlx5e_txqsq *txq2sq[MLX5E_MAX_NUM_CHANNELS * MLX5E_MAX_NUM_TC];
@@ -671,6 +674,7 @@ struct mlx5e_priv {
struct mlx5e_tir   indir_tir[MLX5E_NUM_INDIR_TIRS];
struct mlx5e_tir   inner_indir_tir[MLX5E_NUM_INDIR_TIRS];
struct mlx5e_tir   direct_tir[MLX5E_MAX_NUM_CHANNELS];
+   struct mlx5e_rss_paramsrss_params;
u32tx_rates[MLX5E_MAX_NUM_SQS];
 
struct mlx5e_flow_steering fs;
@@ -796,7 +800,7 @@ struct mlx5e_redirect_rqt_param {
 
 int mlx5e_redirect_rqt(struct mlx5e_priv *priv, u32 rqtn, int sz,
   struct mlx5e_redirect_rqt_param rrp);
-void mlx5e_build_indir_tir_ctx_hash(struct mlx5e_params *params,
+void mlx5e_build_indir_tir_ctx_hash(struct mlx5e_rss_params *rss_params,
const struct mlx5e_tirc_config *ttconfig,
void *tirc, bool inner);
 void mlx5e_modify_tirs_hash(struct mlx5e_priv *priv, void *in, int inlen);
@@ -982,11 +986,13 @@ int mlx5e_attach_netdev(struct mlx5e_priv *priv);
 void mlx5e_detach_netdev(struct mlx5e_priv *priv);
 void mlx5e_destroy_netdev(struct mlx5e_priv *priv);
 void mlx5e_build_nic_params(struct mlx5_core_dev *mdev,
+   struct mlx5e_rss_params *rss_params,
struct mlx5e_params *params,
u16 max_channels, u16 mtu);
 void mlx5e_build_rq_params(struct mlx5_core_dev *mdev,
   struct mlx5e_params *params);
-void mlx5e_build_rss_params(struct mlx5e_params *params);
+void mlx5e_build_rss_params(struct mlx5e_rss_params *rss_params,
+   u16 num_channels);
 u8 mlx5e_params_calculate_tx_min_inline(struct mlx5_core_dev *mdev);
 void mlx5e_rx_dim_work(struct work_struct *work);
 void mlx5e_tx_dim_work(struct work_struct *work);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
index 2d5b00751a6d..e868d42c83cb 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
@@ -353,7 +353,7 @@ int mlx5e_ethtool_set_channels(struct mlx5e_priv *priv,
new_channels.params = priv->channels.params;
new_channels.params.num_channels = count;
if (!netif_is_rxfh_configured(priv->netdev))
-   
mlx5e_build_default_indir_rqt(new_channels.params.indirection_rqt,
+   mlx5e_build_default_indir_rqt(priv->rss_params.indirection_rqt,
  MLX5E_INDIR_RQT_SIZE, count);
 
if (!test_bit(MLX5E_STATE_OPENED, >state)) {
@@ -931,7 +931,7 @@ static int mlx5e_set_link_ksettings(struct net_device 
*netdev,
 
 u32 mlx5e_ethtool_get_rxfh_key_size(struct mlx5e_priv *priv)
 {
-   return sizeof(priv->channels.params.toeplitz_hash_key);
+   return sizeof(priv->rss_params.toeplitz_hash_key);
 }
 
 static u32 mlx5e_get_rxfh_key_size(struct net_device *netdev)
@@ -957,17 +957,18 @@ static int mlx5e_get_rxfh(struct net_device *netdev, u32 
*indir, u8 *key,
  u8 *hfunc)
 {
struct mlx5e_priv *priv = netdev_priv(netdev);
+   struct mlx5e_rss_params *rss = >rss_params;
 
if (indir)
-   memcpy(indir, priv->channels.params.indirection_rqt,
-  sizeof(priv->channels.params.indirection_rqt));
+   

[net-next V2 1/7] net/mlx5e: Remove trailing space of tx_pause ethtool counter name

2018-12-05 Thread Saeed Mahameed
tx_pause_storm_warning_events ethtool counter name has a trailing
space, remove it.

Signed-off-by: Saeed Mahameed 
Reviewed-by: Eran Ben Elisha 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_stats.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
index 3e99d0728b2f..40b60e958cfd 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
@@ -936,7 +936,7 @@ static const struct counter_desc 
pport_per_prio_pfc_stats_desc[] = {
 };
 
 static const struct counter_desc pport_pfc_stall_stats_desc[] = {
-   { "tx_pause_storm_warning_events ", 
PPORT_PER_PRIO_OFF(device_stall_minor_watermark_cnt) },
+   { "tx_pause_storm_warning_events", 
PPORT_PER_PRIO_OFF(device_stall_minor_watermark_cnt) },
{ "tx_pause_storm_error_events", 
PPORT_PER_PRIO_OFF(device_stall_critical_watermark_cnt) },
 };
 
-- 
2.19.2



Re: [Patch net-next] call sk_dst_reset when set SO_DONTROUTE

2018-12-05 Thread peng yu
The SO_DONTROUTE doesn't impact the TCP receiving path, but it should
block the ACK of the receiving packet. When there are too many packets
which are not ACKed, the client will stop to send packets, so the
sock.recv on the server side won't receive data after it received some
data. I extracted the test code from my customer's production
environment. The test code could reproduce the issue but it is not a
good example. I will rewrite a test code and re-submit the patch.On
Wed, Dec 5, 2018 at 3:17 PM Eric Dumazet  wrote:
>
> On Wed, Dec 5, 2018 at 3:07 PM yupeng  wrote:
> >
> > after set SO_DONTROUTE to 1, the IP layer should not route packets if
> > the dest IP address is not in link scope. But if the socket has cached
> > the dst_entry, such packets would be routed until the sk_dst_cache
> > expires. So we should clean the sk_dst_cache when a user set
> > SO_DONTROUTE option. Below are server/client python scripts which
> > could reprodue this issue:
> >
> > server side code:
> > ==
> > import socket
> > import struct
> >
> > s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
> > s.bind(('0.0.0.0', 9000))
> > s.listen(1)
> > sock, addr = s.accept()
> > sock.setsockopt(socket.SOL_SOCKET, socket.SO_DONTROUTE, struct.pack('i', 1))
> > while True:
> > data = sock.recv(1024) # here the sock.recv should not return anything
>
> Why is that so ?
>
> What is the relation of input path with the SO_DONTROUTE which is for TX ?
>
> sk_dst_reset(sk) should not impact receive side ?
>
> Thanks for providing a test !
>
> > print(data)
> > ==
> >
> > client side code:
> > ==
> > import socket
> > import time
> >
> > s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
> > s.connect(('server_address', 9000))
> > while True:
> > s.send(b'foo')
> > print('send foo')
> > time.sleep(1)
> > ==
> >
> > Signed-off-by: yupeng 
> > ---
> >  net/core/sock.c | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/net/core/sock.c b/net/core/sock.c
> > index f5bb89785e47..f00902c532cc 100644
> > --- a/net/core/sock.c
> > +++ b/net/core/sock.c
> > @@ -700,6 +700,7 @@ int sock_setsockopt(struct socket *sock, int level, int 
> > optname,
> > break;
> > case SO_DONTROUTE:
> > sock_valbool_flag(sk, SOCK_LOCALROUTE, valbool);
> > +   sk_dst_reset(sk);
> > break;
> > case SO_BROADCAST:
> > sock_valbool_flag(sk, SOCK_BROADCAST, valbool);
> > --
> > 2.17.1
> >


[net-next V2 4/7] net/mlx5e: Refactor TIR configuration function

2018-12-05 Thread Saeed Mahameed
From: Aya Levin 

Refactor mlx5e_build_indir_tir_ctx_hash for better code re-use. TIR
stands for Transport Interface Receive, which is responsible for all
transport related operations on the receive side. Added a
static array with TIR default configuration values. This separates
configuration values from command setting, which is needed for
downstream patch.

Signed-off-by: Aya Levin 
Reviewed-by: Tariq Toukan 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |   3 +-
 .../net/ethernet/mellanox/mlx5/core/en/fs.h   |  16 ++
 .../net/ethernet/mellanox/mlx5/core/en_main.c | 167 +++---
 .../net/ethernet/mellanox/mlx5/core/en_tc.c   |   5 +-
 4 files changed, 87 insertions(+), 104 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 86d60222b3cb..8539ea9a67ca 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -797,9 +797,10 @@ struct mlx5e_redirect_rqt_param {
 int mlx5e_redirect_rqt(struct mlx5e_priv *priv, u32 rqtn, int sz,
   struct mlx5e_redirect_rqt_param rrp);
 void mlx5e_build_indir_tir_ctx_hash(struct mlx5e_params *params,
-   enum mlx5e_traffic_types tt,
+   const struct mlx5e_tirc_config *ttconfig,
void *tirc, bool inner);
 void mlx5e_modify_tirs_hash(struct mlx5e_priv *priv, void *in, int inlen);
+struct mlx5e_tirc_config mlx5e_tirc_get_default_config(enum 
mlx5e_traffic_types tt);
 
 int mlx5e_open_locked(struct net_device *netdev);
 int mlx5e_close_locked(struct net_device *netdev);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/fs.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en/fs.h
index 1431232c9a09..be5961ff24cc 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/fs.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/fs.h
@@ -73,6 +73,22 @@ enum mlx5e_traffic_types {
MLX5E_NUM_INDIR_TIRS = MLX5E_TT_ANY,
 };
 
+struct mlx5e_tirc_config {
+   u8 l3_prot_type;
+   u8 l4_prot_type;
+   u32 rx_hash_fields;
+};
+
+#define MLX5_HASH_IP   (MLX5_HASH_FIELD_SEL_SRC_IP   |\
+MLX5_HASH_FIELD_SEL_DST_IP)
+#define MLX5_HASH_IP_L4PORTS   (MLX5_HASH_FIELD_SEL_SRC_IP   |\
+MLX5_HASH_FIELD_SEL_DST_IP   |\
+MLX5_HASH_FIELD_SEL_L4_SPORT |\
+MLX5_HASH_FIELD_SEL_L4_DPORT)
+#define MLX5_HASH_IP_IPSEC_SPI (MLX5_HASH_FIELD_SEL_SRC_IP   |\
+MLX5_HASH_FIELD_SEL_DST_IP   |\
+MLX5_HASH_FIELD_SEL_IPSEC_SPI)
+
 enum mlx5e_tunnel_types {
MLX5E_TT_IPV4_GRE,
MLX5E_TT_IPV6_GRE,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 85a4633454d0..01828efe905d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -2607,6 +2607,54 @@ static void mlx5e_redirect_rqts_to_drop(struct 
mlx5e_priv *priv)
mlx5e_redirect_rqts(priv, drop_rrp);
 }
 
+static const struct mlx5e_tirc_config 
tirc_default_config[MLX5E_NUM_INDIR_TIRS] = {
+   [MLX5E_TT_IPV4_TCP] = { .l3_prot_type = MLX5_L3_PROT_TYPE_IPV4,
+   .l4_prot_type = MLX5_L4_PROT_TYPE_TCP,
+   .rx_hash_fields = MLX5_HASH_IP_L4PORTS,
+   },
+   [MLX5E_TT_IPV6_TCP] = { .l3_prot_type = MLX5_L3_PROT_TYPE_IPV6,
+   .l4_prot_type = MLX5_L4_PROT_TYPE_TCP,
+   .rx_hash_fields = MLX5_HASH_IP_L4PORTS,
+   },
+   [MLX5E_TT_IPV4_UDP] = { .l3_prot_type = MLX5_L3_PROT_TYPE_IPV4,
+   .l4_prot_type = MLX5_L4_PROT_TYPE_UDP,
+   .rx_hash_fields = MLX5_HASH_IP_L4PORTS,
+   },
+   [MLX5E_TT_IPV6_UDP] = { .l3_prot_type = MLX5_L3_PROT_TYPE_IPV6,
+   .l4_prot_type = MLX5_L4_PROT_TYPE_UDP,
+   .rx_hash_fields = MLX5_HASH_IP_L4PORTS,
+   },
+   [MLX5E_TT_IPV4_IPSEC_AH] = { .l3_prot_type = MLX5_L3_PROT_TYPE_IPV4,
+.l4_prot_type = 0,
+.rx_hash_fields = MLX5_HASH_IP_IPSEC_SPI,
+   },
+   [MLX5E_TT_IPV6_IPSEC_AH] = { .l3_prot_type = MLX5_L3_PROT_TYPE_IPV6,
+.l4_prot_type = 0,
+.rx_hash_fields = MLX5_HASH_IP_IPSEC_SPI,
+   },
+   [MLX5E_TT_IPV4_IPSEC_ESP] = { .l3_prot_type = MLX5_L3_PROT_TYPE_IPV4,
+ .l4_prot_type = 0,
+ .rx_hash_fields = MLX5_HASH_IP_IPSEC_SPI,
+   },
+   [MLX5E_TT_IPV6_IPSEC_ESP] = { .l3_prot_type = MLX5_L3_PROT_TYPE_IPV6,
+

[net-next V2 3/7] net/mlx5e: Move modify tirs hash functionality

2018-12-05 Thread Saeed Mahameed
From: Aya Levin 

Move modify tirs hash functionality (mlx5e_modify_tirs_hash) from
en_ethtool.c to en_main.c. This allows future use of this fuctionality
from en_fs_ethtool.c, while keeping current convention: en_ethtool.c
doesn't have an API.  There is no functional change here.

Signed-off-by: Aya Levin 
Reviewed-by: Tariq Toukan 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |  1 +
 .../ethernet/mellanox/mlx5/core/en_ethtool.c  | 25 -
 .../net/ethernet/mellanox/mlx5/core/en_main.c | 28 +++
 3 files changed, 29 insertions(+), 25 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 2af43465ba2c..86d60222b3cb 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -799,6 +799,7 @@ int mlx5e_redirect_rqt(struct mlx5e_priv *priv, u32 rqtn, 
int sz,
 void mlx5e_build_indir_tir_ctx_hash(struct mlx5e_params *params,
enum mlx5e_traffic_types tt,
void *tirc, bool inner);
+void mlx5e_modify_tirs_hash(struct mlx5e_priv *priv, void *in, int inlen);
 
 int mlx5e_open_locked(struct net_device *netdev);
 int mlx5e_close_locked(struct net_device *netdev);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
index 25c1c4f96841..2d5b00751a6d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
@@ -972,31 +972,6 @@ static int mlx5e_get_rxfh(struct net_device *netdev, u32 
*indir, u8 *key,
return 0;
 }
 
-static void mlx5e_modify_tirs_hash(struct mlx5e_priv *priv, void *in, int 
inlen)
-{
-   void *tirc = MLX5_ADDR_OF(modify_tir_in, in, ctx);
-   struct mlx5_core_dev *mdev = priv->mdev;
-   int ctxlen = MLX5_ST_SZ_BYTES(tirc);
-   int tt;
-
-   MLX5_SET(modify_tir_in, in, bitmask.hash, 1);
-
-   for (tt = 0; tt < MLX5E_NUM_INDIR_TIRS; tt++) {
-   memset(tirc, 0, ctxlen);
-   mlx5e_build_indir_tir_ctx_hash(>channels.params, tt, 
tirc, false);
-   mlx5_core_modify_tir(mdev, priv->indir_tir[tt].tirn, in, inlen);
-   }
-
-   if (!mlx5e_tunnel_inner_ft_supported(priv->mdev))
-   return;
-
-   for (tt = 0; tt < MLX5E_NUM_INDIR_TIRS; tt++) {
-   memset(tirc, 0, ctxlen);
-   mlx5e_build_indir_tir_ctx_hash(>channels.params, tt, 
tirc, true);
-   mlx5_core_modify_tir(mdev, priv->inner_indir_tir[tt].tirn, in, 
inlen);
-   }
-}
-
 static int mlx5e_set_rxfh(struct net_device *dev, const u32 *indir,
  const u8 *key, const u8 hfunc)
 {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 871313d6b34d..85a4633454d0 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -2735,6 +2735,34 @@ void mlx5e_build_indir_tir_ctx_hash(struct mlx5e_params 
*params,
}
 }
 
+void mlx5e_modify_tirs_hash(struct mlx5e_priv *priv, void *in, int inlen)
+{
+   void *tirc = MLX5_ADDR_OF(modify_tir_in, in, ctx);
+   struct mlx5_core_dev *mdev = priv->mdev;
+   int ctxlen = MLX5_ST_SZ_BYTES(tirc);
+   int tt;
+
+   MLX5_SET(modify_tir_in, in, bitmask.hash, 1);
+
+   for (tt = 0; tt < MLX5E_NUM_INDIR_TIRS; tt++) {
+   memset(tirc, 0, ctxlen);
+   mlx5e_build_indir_tir_ctx_hash(>channels.params, tt, tirc,
+  false);
+   mlx5_core_modify_tir(mdev, priv->indir_tir[tt].tirn, in, inlen);
+   }
+
+   if (!mlx5e_tunnel_inner_ft_supported(priv->mdev))
+   return;
+
+   for (tt = 0; tt < MLX5E_NUM_INDIR_TIRS; tt++) {
+   memset(tirc, 0, ctxlen);
+   mlx5e_build_indir_tir_ctx_hash(>channels.params, tt, tirc,
+  true);
+   mlx5_core_modify_tir(mdev, priv->inner_indir_tir[tt].tirn, in,
+inlen);
+   }
+}
+
 static int mlx5e_modify_tirs_lro(struct mlx5e_priv *priv)
 {
struct mlx5_core_dev *mdev = priv->mdev;
-- 
2.19.2



[pull request][net-next V2 0/7] Mellanox, mlx5e updates 2018-12-04

2018-12-05 Thread Saeed Mahameed
Hi Dave,

The following series is for mlx5e netdevice driver, it adds ethtool
support for RX hash fields configuration and some misc updates, please
see tag log below.

Please pull and let me know if there's any problem.

v1->v2:
 - Move static const array to c file.
 - Remove unnecessary blank line
 - Add #include 
 - Print priv flag name rather than its hex value

Thanks,
Saeed.

--- 

The following changes since commit 55827458e0586b1d4e84541bc015e37f507820ee:

  Merge branch 'mlxsw-Add-one-armed-router-support' (2018-12-04 08:36:37 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git 
tags/mlx5e-updates-2018-12-04

for you to fetch changes up to 8ff57c18e9f6b03722070a372cdcc850b13bcbc8:

  net/mlx5e: Improve ethtool private-flags code structure (2018-12-05 16:00:37 
-0800)


mlx5e-updates-2018-12-04

This series includes updates to mlx5e netdevice driver

>From Saeed, Remove trailing space of tx_pause ethtool stat
>From Gal, Cleanup unused defines
>From Aya, ethtool Support for configuring of RX hash fields
>From Tariq, Improve ethtool private-flags code structure


Aya Levin (4):
  net/mlx5e: Move modify tirs hash functionality
  net/mlx5e: Refactor TIR configuration function
  net/mlx5e: Move RSS params to a dedicated struct
  net/mlx5e: ethtool, Support user configuration for RX hash fields

Gal Pressman (1):
  net/mlx5e: Cleanup unused defines

Saeed Mahameed (1):
  net/mlx5e: Remove trailing space of tx_pause ethtool counter name

Tariq Toukan (1):
  net/mlx5e: Improve ethtool private-flags code structure

 drivers/net/ethernet/mellanox/mlx5/core/en.h   |  42 ++--
 drivers/net/ethernet/mellanox/mlx5/core/en/fs.h|  16 ++
 .../net/ethernet/mellanox/mlx5/core/en_ethtool.c   | 135 +
 .../ethernet/mellanox/mlx5/core/en_fs_ethtool.c| 112 +++
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  | 223 +++--
 drivers/net/ethernet/mellanox/mlx5/core/en_rep.c   |   9 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_stats.c |   2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c|   6 +-
 .../net/ethernet/mellanox/mlx5/core/ipoib/ipoib.c  |   2 +-
 9 files changed, 331 insertions(+), 216 deletions(-)


[net-next V2 2/7] net/mlx5e: Cleanup unused defines

2018-12-05 Thread Saeed Mahameed
From: Gal Pressman 

Remove couple of defines that are no longer used.

Signed-off-by: Gal Pressman 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 118324802926..2af43465ba2c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -147,9 +147,6 @@ struct page_pool;
   MLX5_UMR_MTT_ALIGNMENT))
 #define MLX5E_UMR_WQEBBS \
(DIV_ROUND_UP(MLX5E_UMR_WQE_INLINE_SZ, MLX5_SEND_WQE_BB))
-#define MLX5E_ICOSQ_MAX_WQEBBS MLX5E_UMR_WQEBBS
-
-#define MLX5E_NUM_MAIN_GROUPS 9
 
 #define MLX5E_MSG_LEVELNETIF_MSG_LINK
 
-- 
2.19.2



[net-next V2 7/7] net/mlx5e: Improve ethtool private-flags code structure

2018-12-05 Thread Saeed Mahameed
From: Tariq Toukan 

Refactor the code of private-flags setter.
Replace consecutive calls to mlx5e_handle_pflag with a loop
that uses a preset set of parameters.

Signed-off-by: Tariq Toukan 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h  | 18 +++--
 .../ethernet/mellanox/mlx5/core/en_ethtool.c  | 76 ---
 2 files changed, 41 insertions(+), 53 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index c6686256a92a..8760d10ad5a1 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -49,6 +49,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "wq.h"
 #include "mlx5_core.h"
 #include "en_stats.h"
@@ -211,22 +212,23 @@ struct mlx5e_umr_wqe {
 extern const char mlx5e_self_tests[][ETH_GSTRING_LEN];
 
 enum mlx5e_priv_flag {
-   MLX5E_PFLAG_RX_CQE_BASED_MODER = (1 << 0),
-   MLX5E_PFLAG_TX_CQE_BASED_MODER = (1 << 1),
-   MLX5E_PFLAG_RX_CQE_COMPRESS = (1 << 2),
-   MLX5E_PFLAG_RX_STRIDING_RQ = (1 << 3),
-   MLX5E_PFLAG_RX_NO_CSUM_COMPLETE = (1 << 4),
+   MLX5E_PFLAG_RX_CQE_BASED_MODER,
+   MLX5E_PFLAG_TX_CQE_BASED_MODER,
+   MLX5E_PFLAG_RX_CQE_COMPRESS,
+   MLX5E_PFLAG_RX_STRIDING_RQ,
+   MLX5E_PFLAG_RX_NO_CSUM_COMPLETE,
+   MLX5E_NUM_PFLAGS, /* Keep last */
 };
 
 #define MLX5E_SET_PFLAG(params, pflag, enable) \
do {\
if (enable) \
-   (params)->pflags |= (pflag);\
+   (params)->pflags |= BIT(pflag); \
else\
-   (params)->pflags &= ~(pflag);   \
+   (params)->pflags &= ~(BIT(pflag));  \
} while (0)
 
-#define MLX5E_GET_PFLAG(params, pflag) (!!((params)->pflags & (pflag)))
+#define MLX5E_GET_PFLAG(params, pflag) (!!((params)->pflags & (BIT(pflag
 
 #ifdef CONFIG_MLX5_CORE_EN_DCB
 #define MLX5E_MAX_BW_ALLOC 100 /* Max percentage of BW allocation */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
index e868d42c83cb..da4bb62ac528 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
@@ -135,14 +135,15 @@ void mlx5e_build_ptys2ethtool_map(void)
   ETHTOOL_LINK_MODE_5baseKR2_Full_BIT);
 }
 
-static const char mlx5e_priv_flags[][ETH_GSTRING_LEN] = {
-   "rx_cqe_moder",
-   "tx_cqe_moder",
-   "rx_cqe_compress",
-   "rx_striding_rq",
-   "rx_no_csum_complete",
+typedef int (*mlx5e_pflag_handler)(struct net_device *netdev, bool enable);
+
+struct pflag_desc {
+   char name[ETH_GSTRING_LEN];
+   mlx5e_pflag_handler handler;
 };
 
+static const struct pflag_desc mlx5e_priv_flags[MLX5E_NUM_PFLAGS];
+
 int mlx5e_ethtool_get_sset_count(struct mlx5e_priv *priv, int sset)
 {
int i, num_stats = 0;
@@ -153,7 +154,7 @@ int mlx5e_ethtool_get_sset_count(struct mlx5e_priv *priv, 
int sset)
num_stats += mlx5e_stats_grps[i].get_num_stats(priv);
return num_stats;
case ETH_SS_PRIV_FLAGS:
-   return ARRAY_SIZE(mlx5e_priv_flags);
+   return MLX5E_NUM_PFLAGS;
case ETH_SS_TEST:
return mlx5e_self_test_num(priv);
/* fallthrough */
@@ -183,8 +184,9 @@ void mlx5e_ethtool_get_strings(struct mlx5e_priv *priv, u32 
stringset, u8 *data)
 
switch (stringset) {
case ETH_SS_PRIV_FLAGS:
-   for (i = 0; i < ARRAY_SIZE(mlx5e_priv_flags); i++)
-   strcpy(data + i * ETH_GSTRING_LEN, mlx5e_priv_flags[i]);
+   for (i = 0; i < MLX5E_NUM_PFLAGS; i++)
+   strcpy(data + i * ETH_GSTRING_LEN,
+  mlx5e_priv_flags[i].name);
break;
 
case ETH_SS_TEST:
@@ -1485,8 +1487,6 @@ static int mlx5e_get_module_eeprom(struct net_device 
*netdev,
return 0;
 }
 
-typedef int (*mlx5e_pflag_handler)(struct net_device *netdev, bool enable);
-
 static int set_pflag_cqe_based_moder(struct net_device *netdev, bool enable,
 bool is_rx_cq)
 {
@@ -1649,23 +1649,30 @@ static int set_pflag_rx_no_csum_complete(struct 
net_device *netdev, bool enable)
return 0;
 }
 
+static const struct pflag_desc mlx5e_priv_flags[MLX5E_NUM_PFLAGS] = {
+   { "rx_cqe_moder",set_pflag_rx_cqe_based_moder },
+   { "tx_cqe_moder",set_pflag_tx_cqe_based_moder },
+   { "rx_cqe_compress", set_pflag_rx_cqe_compress },
+   { "rx_striding_rq",  set_pflag_rx_striding_rq },
+   { "rx_no_csum_complete", set_pflag_rx_no_csum_complete },
+};
+
 static int 

Re: [PATCH net-next RFC 7/7] bnxt_en: Add bnxt_en initial port params table and register it

2018-12-05 Thread Michael Chan
On Wed, Dec 5, 2018 at 3:33 PM Jakub Kicinski
 wrote:
>
> On Wed,  5 Dec 2018 11:27:00 +0530, Vasundhara Volam wrote:
> > Register devlink_port with devlink and create initial port params
> > table for bnxt_en. The table consists of a generic parameter:
> >
> > wake-on-lan: Enables Wake on Lan for this port when magic packet
> > is received with this port's MAC address using ACPI pattern.
> > If enabled, the controller asserts a wake pin upon reception of
> > WoL packet.  ACPI (Advanced Configuration and Power Interface) is
> > an industry specification for the efficient handling of power
> > consumption in desktop and mobile computers.
> >
> > Cc: Michael Chan 
> > Signed-off-by: Vasundhara Volam 
>
> Why do we need a WoL as a devlink parameter (rather than ethtool -s)?

I believe ethtool -s for WoL is a non-persistent setting, meaning that
if you power cycle the system, the WoL setting will go back to
default.

devlink on the other hand is a permanent setting.  ethtool should
initially report the default WoL setting and it can then be changed
(in a non permanent way) using ethtool -s.


Re: [net-next 7/7] net/mlx5e: Improve ethtool private-flags code structure

2018-12-05 Thread Saeed Mahameed
On Wed, 2018-12-05 at 11:28 -0800, Cong Wang wrote:
> Hello, Saeed
> 
> On Tue, Dec 4, 2018 at 10:27 PM Saeed Mahameed 
> wrote:
> >  static int mlx5e_handle_pflag(struct net_device *netdev,
> >   u32 wanted_flags,
> > - enum mlx5e_priv_flag flag,
> > - mlx5e_pflag_handler pflag_handler)
> > + enum mlx5e_priv_flag flag)
> >  {
> ...
> > if (err) {
> > netdev_err(netdev, "%s private flag 0x%x failed err
> > %d\n",
> >enable ? "Enable" : "Disable", flag,
> > err);
> 
> As flag is now a true enum, you probably don't want to print it out
> as hex
> any more.
> 

Good point, i guess we can print mlx5e_priv_flags[flag].name

thank you Cong !

> Thanks.


Re: [net-next 7/7] net/mlx5e: Improve ethtool private-flags code structure

2018-12-05 Thread Saeed Mahameed
On Wed, 2018-12-05 at 10:36 -0800, Cong Wang wrote:
> On Tue, Dec 4, 2018 at 10:27 PM Saeed Mahameed 
> wrote:
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h
> > b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> > index a429553002a6..49e90ac5dc8b 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> ...
> >  #define MLX5E_SET_PFLAG(params, pflag, enable) \
> > do {\
> > if (enable) \
> > -   (params)->pflags |= (pflag);\
> > +   (params)->pflags |= BIT(pflag); \
> > else\
> > -   (params)->pflags &= ~(pflag);   \
> > +   (params)->pflags &= ~(BIT(pflag));  \
> > } while (0)
> > 
> 
> Please #include  explicitly.

Ok, will fix in V2

> 
> Thanks.


Re: [net-next 6/7] net/mlx5e: ethtool, Support user configuration for RX hash fields

2018-12-05 Thread Saeed Mahameed
On Wed, 2018-12-05 at 11:19 -0800, Cong Wang wrote:
> Hello, Saeed
> 
> 
> On Tue, Dec 4, 2018 at 10:26 PM Saeed Mahameed 
> wrote:
> > +static int mlx5e_get_rss_hash_opt(struct mlx5e_priv *priv,
> > + struct ethtool_rxnfc *nfc)
> ...
> > +   tt = flow_type_to_traffic_type(nfc->flow_type);
> > +
> > +   if (tt == MLX5E_NUM_INDIR_TIRS)
> > +   return -EINVAL;
> 
> The blank line above is unnecessary.
> 
> Thanks.

Sure will fix in V2,

Thank you Cong.


Re: [net-next 4/7] net/mlx5e: Refactor TIR configuration function

2018-12-05 Thread Saeed Mahameed
On Wed, 2018-12-05 at 10:56 -0800, Cong Wang wrote:
> On Tue, Dec 4, 2018 at 10:26 PM Saeed Mahameed 
> wrote:
> > +static const struct mlx5e_tirc_config
> > +tirc_default_config[MLX5E_NUM_INDIR_TIRS] = {
> 
> Is it okay to define an array in a header??? No link error???
> 
> I must be dumb...

Yes it is ok, no link error, each c file scope will get it's own copy.

but this is wasteful, will fix this in v2.

Thanks.


Re: Re: [Bug] net/ipv6: skb_over_panic in mld_newpack

2018-12-05 Thread Benjamin Poirier
On 2018/12/05 16:57, Nicolas Belouin wrote:
[...]
> 
> Thanks for your help, using your debug patch I got the value of 
> needed_headroom:
> USHRT_MAX - 64
> And tracked it down to a legacy out of tree patch of ours I then fixed.
> The patch was increasing/decreasing the needed_headroom without checking
> for bounds...
> 
> Sorry for the noise then.

Thanks for reporting back. Let's consider it fixed.

> 
> Regards,
> 
> Nicolas
> 


[PATCH net-next 2/7] neighbor: Fold ___neigh_lookup_noref into __neigh_lookup_noref

2018-12-05 Thread David Ahern
From: David Ahern 

There are no more direct callers of ___neigh_lookup_noref so no need
for it to be a standalone helper.

Signed-off-by: David Ahern 
---
 include/net/neighbour.h | 22 +-
 1 file changed, 5 insertions(+), 17 deletions(-)

diff --git a/include/net/neighbour.h b/include/net/neighbour.h
index f58b384aa6c9..aac87bc2d96b 100644
--- a/include/net/neighbour.h
+++ b/include/net/neighbour.h
@@ -270,37 +270,25 @@ static inline bool neigh_key_eq128(const struct neighbour 
*n, const void *pkey)
(n32[2] ^ p32[2]) | (n32[3] ^ p32[3])) == 0;
 }
 
-static inline struct neighbour *___neigh_lookup_noref(
-   struct neigh_table *tbl,
-   bool (*key_eq)(const struct neighbour *n, const void *pkey),
-   __u32 (*hash)(const void *pkey,
- const struct net_device *dev,
- __u32 *hash_rnd),
-   const void *pkey,
-   struct net_device *dev)
+static inline struct neighbour *__neigh_lookup_noref(struct neigh_table *tbl,
+const void *pkey,
+struct net_device *dev)
 {
struct neigh_hash_table *nht = rcu_dereference_bh(tbl->nht);
struct neighbour *n;
u32 hash_val;
 
-   hash_val = hash(pkey, dev, nht->hash_rnd) >> (32 - nht->hash_shift);
+   hash_val = tbl->hash(pkey, dev, nht->hash_rnd) >> (32 - 
nht->hash_shift);
for (n = rcu_dereference_bh(nht->hash_buckets[hash_val]);
 n != NULL;
 n = rcu_dereference_bh(n->next)) {
-   if (n->dev == dev && key_eq(n, pkey))
+   if (n->dev == dev && tbl->key_eq(n, pkey))
return n;
}
 
return NULL;
 }
 
-static inline struct neighbour *__neigh_lookup_noref(struct neigh_table *tbl,
-const void *pkey,
-struct net_device *dev)
-{
-   return ___neigh_lookup_noref(tbl, tbl->key_eq, tbl->hash, pkey, dev);
-}
-
 void neigh_table_init(int index, struct neigh_table *tbl);
 int neigh_table_clear(int index, struct neigh_table *tbl);
 struct neighbour *neigh_lookup(struct neigh_table *tbl, const void *pkey,
-- 
2.11.0



Re: [PATCH net-next RFC 7/7] bnxt_en: Add bnxt_en initial port params table and register it

2018-12-05 Thread Jakub Kicinski
On Wed,  5 Dec 2018 11:27:00 +0530, Vasundhara Volam wrote:
> Register devlink_port with devlink and create initial port params
> table for bnxt_en. The table consists of a generic parameter:
> 
> wake-on-lan: Enables Wake on Lan for this port when magic packet
> is received with this port's MAC address using ACPI pattern.
> If enabled, the controller asserts a wake pin upon reception of
> WoL packet.  ACPI (Advanced Configuration and Power Interface) is
> an industry specification for the efficient handling of power
> consumption in desktop and mobile computers.
> 
> Cc: Michael Chan 
> Signed-off-by: Vasundhara Volam 

Why do we need a WoL as a devlink parameter (rather than ethtool -s)?


[PATCH net-next 5/7] neighbor: Create a neigh_hash helper

2018-12-05 Thread David Ahern
From: David Ahern 

Consolidate calculations of the neighbor hash into a single helper.

Signed-off-by: David Ahern 
---
 include/net/neighbour.h | 10 +-
 net/core/neighbour.c| 15 +--
 2 files changed, 14 insertions(+), 11 deletions(-)

diff --git a/include/net/neighbour.h b/include/net/neighbour.h
index aac87bc2d96b..092493a8c91b 100644
--- a/include/net/neighbour.h
+++ b/include/net/neighbour.h
@@ -270,6 +270,14 @@ static inline bool neigh_key_eq128(const struct neighbour 
*n, const void *pkey)
(n32[2] ^ p32[2]) | (n32[3] ^ p32[3])) == 0;
 }
 
+static inline u32 neigh_hash(struct neigh_table *tbl,
+struct neigh_hash_table *nht,
+const void *pkey,
+struct net_device *dev)
+{
+   return tbl->hash(pkey, dev, nht->hash_rnd) >> (32 - nht->hash_shift);
+}
+
 static inline struct neighbour *__neigh_lookup_noref(struct neigh_table *tbl,
 const void *pkey,
 struct net_device *dev)
@@ -278,7 +286,7 @@ static inline struct neighbour *__neigh_lookup_noref(struct 
neigh_table *tbl,
struct neighbour *n;
u32 hash_val;
 
-   hash_val = tbl->hash(pkey, dev, nht->hash_rnd) >> (32 - 
nht->hash_shift);
+   hash_val = neigh_hash(tbl, nht, pkey, dev);
for (n = rcu_dereference_bh(nht->hash_buckets[hash_val]);
 n != NULL;
 n = rcu_dereference_bh(n->next)) {
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 41954e42a2de..53e30c15882d 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -151,9 +151,8 @@ bool neigh_remove_one(struct neighbour *ndel, struct 
neigh_table *tbl)
 
nht = rcu_dereference_protected(tbl->nht,
lockdep_is_held(>lock));
-   hash_val = tbl->hash(pkey, ndel->dev, nht->hash_rnd);
-   hash_val = hash_val >> (32 - nht->hash_shift);
 
+   hash_val = neigh_hash(tbl, nht, pkey, ndel->dev);
np = >hash_buckets[hash_val];
while ((n = rcu_dereference_protected(*np,
  lockdep_is_held(>lock {
@@ -434,10 +433,7 @@ static struct neigh_hash_table *neigh_hash_grow(struct 
neigh_table *tbl,
   lockdep_is_held(>lock));
 n != NULL;
 n = next) {
-   hash = tbl->hash(n->primary_key, n->dev,
-new_nht->hash_rnd);
-
-   hash >>= (32 - new_nht->hash_shift);
+   hash = neigh_hash(tbl, new_nht, n->primary_key, n->dev);
next = rcu_dereference_protected(n->next,
lockdep_is_held(>lock));
 
@@ -485,9 +481,9 @@ struct neighbour *neigh_lookup_nodev(struct neigh_table 
*tbl, struct net *net,
NEIGH_CACHE_STAT_INC(tbl, lookups);
 
rcu_read_lock_bh();
-   nht = rcu_dereference_bh(tbl->nht);
-   hash_val = tbl->hash(pkey, NULL, nht->hash_rnd) >> (32 - 
nht->hash_shift);
 
+   nht = rcu_dereference_bh(tbl->nht);
+   hash_val = neigh_hash(tbl, nht, pkey, NULL);
for (n = rcu_dereference_bh(nht->hash_buckets[hash_val]);
 n != NULL;
 n = rcu_dereference_bh(n->next)) {
@@ -553,13 +549,12 @@ struct neighbour *__neigh_create(struct neigh_table *tbl, 
const void *pkey,
if (atomic_read(>entries) > (1 << nht->hash_shift))
nht = neigh_hash_grow(tbl, nht->hash_shift + 1);
 
-   hash_val = tbl->hash(n->primary_key, dev, nht->hash_rnd) >> (32 - 
nht->hash_shift);
-
if (n->parms->dead) {
rc = ERR_PTR(-EINVAL);
goto out_tbl_unlock;
}
 
+   hash_val = neigh_hash(tbl, nht, n->primary_key, dev);
for (n1 = rcu_dereference_protected(nht->hash_buckets[hash_val],
lockdep_is_held(>lock));
 n1 != NULL;
-- 
2.11.0



[PATCH net-next 6/7] neighbor: Skip the duplicate lookup in neigh_add

2018-12-05 Thread David Ahern
From: David Ahern 

When adding a new neighbor via rtnetlink, neigh_add does a lookup
and if the result is NULL calls __neigh_lookup_errno to create a
new entry if the NLM_F_CREATE flag is set. But, __neigh_lookup_errno
calls neigh_lookup again before neigh_create; the neigh_lookup is
redundant.

Replace the call to __neigh_lookup_errno with a call to __neigh_create
to more efficiently achieve the same result and prepare for the next
patch.

Signed-off-by: David Ahern 
---
 net/core/neighbour.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 53e30c15882d..e324467e9a71 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -1785,7 +1785,7 @@ static int neigh_add(struct sk_buff *skb, struct nlmsghdr 
*nlh,
goto out;
}
 
-   neigh = __neigh_lookup_errno(tbl, dst, dev);
+   neigh = __neigh_create(tbl, dst, dev, true);
if (IS_ERR(neigh)) {
err = PTR_ERR(neigh);
goto out;
-- 
2.11.0



[PATCH net-next 0/7] neighbor: cleanups plus extack for add and delete

2018-12-05 Thread David Ahern
From: David Ahern 

cleanups:
- remove open coding of key and hash functions for ipv4 and ipv6
  and then collapse hash functions
- collapse now unnecessary ___neigh_lookup_noref helper
- create helper for neigh hash computation
- remove duplicate lookup in neigh_add

After that add extack messages for neighbor add and delete.

David Ahern (7):
  neighbor: Remove open coding of key and hash functions
  neighbor: Fold ___neigh_lookup_noref into __neigh_lookup_noref
  net/ipv4: Move arp_hashfn into arp_hash
  net/ipv6: Move ndisc_hashfn to ndisc_hash
  neighbor: Create a neigh_hash helper
  neighbor: Skip the duplicate lookup in neigh_add
  neighbor: Add extack messages for add and delete commands

 include/net/arp.h   | 10 +--
 include/net/ndisc.h | 12 +
 include/net/neighbour.h | 30 +
 net/core/filter.c   |  3 +--
 net/core/neighbour.c| 72 ++---
 net/ipv4/arp.c  |  5 +++-
 net/ipv6/ndisc.c|  7 -
 7 files changed, 71 insertions(+), 68 deletions(-)

-- 
2.11.0



[PATCH net-next 7/7] neighbor: Add extack messages for add and delete commands

2018-12-05 Thread David Ahern
From: David Ahern 

Add extack messages for failures in neigh_add and neigh_delete.

Also, require NDA_DST length to be exactly the key length for the
table otherwise it is an unexpected address and can lead to unexpected
entries. e.g., IPv4 table sent and IPv6 address (using a modified ip):
$ ip neigh add 2001:db8:1::1 dev foo
$ ip neigh ls dev foo
32.1.13.184 dev foo lladdr 72:ed:f1:d9:20:9a PERMANENT

Signed-off-by: David Ahern 
---
 net/core/neighbour.c | 55 +---
 1 file changed, 39 insertions(+), 16 deletions(-)

diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index e324467e9a71..916a99fbb306 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -1132,8 +1132,9 @@ static void neigh_update_hhs(struct neighbour *neigh)
Caller MUST hold reference count on the entry.
  */
 
-int neigh_update(struct neighbour *neigh, const u8 *lladdr, u8 new,
-u32 flags, u32 nlmsg_pid)
+static int __neigh_update(struct neighbour *neigh, const u8 *lladdr,
+ u8 new, u32 flags, u32 nlmsg_pid,
+ struct netlink_ext_ack *extack)
 {
u8 old;
int err;
@@ -1150,8 +1151,10 @@ int neigh_update(struct neighbour *neigh, const u8 
*lladdr, u8 new,
if (!(flags & NEIGH_UPDATE_F_ADMIN) &&
(old & (NUD_NOARP | NUD_PERMANENT)))
goto out;
-   if (neigh->dead)
+   if (neigh->dead) {
+   NL_SET_ERR_MSG(extack, "Neighbor entry is now dead");
goto out;
+   }
 
neigh_update_ext_learned(neigh, flags, );
 
@@ -1188,8 +1191,10 @@ int neigh_update(struct neighbour *neigh, const u8 
*lladdr, u8 new,
   use it, otherwise discard the request.
 */
err = -EINVAL;
-   if (!(old & NUD_VALID))
+   if (!(old & NUD_VALID)) {
+   NL_SET_ERR_MSG(extack, "No link layer address given");
goto out;
+   }
lladdr = neigh->ha;
}
 
@@ -1302,6 +1307,12 @@ int neigh_update(struct neighbour *neigh, const u8 
*lladdr, u8 new,
 
return err;
 }
+
+int neigh_update(struct neighbour *neigh, const u8 *lladdr, u8 new,
+u32 flags, u32 nlmsg_pid)
+{
+   return __neigh_update(neigh, lladdr, new, flags, nlmsg_pid, NULL);
+}
 EXPORT_SYMBOL(neigh_update);
 
 /* Update the neigh to listen temporarily for probe responses, even if it is
@@ -1673,8 +1684,10 @@ static int neigh_delete(struct sk_buff *skb, struct 
nlmsghdr *nlh,
goto out;
 
dst_attr = nlmsg_find_attr(nlh, sizeof(*ndm), NDA_DST);
-   if (dst_attr == NULL)
+   if (!dst_attr) {
+   NL_SET_ERR_MSG(extack, "Network address not specified");
goto out;
+   }
 
ndm = nlmsg_data(nlh);
if (ndm->ndm_ifindex) {
@@ -1689,8 +1702,10 @@ static int neigh_delete(struct sk_buff *skb, struct 
nlmsghdr *nlh,
if (tbl == NULL)
return -EAFNOSUPPORT;
 
-   if (nla_len(dst_attr) < (int)tbl->key_len)
+   if (nla_len(dst_attr) < (int)tbl->key_len) {
+   NL_SET_ERR_MSG(extack, "Invalid network address");
goto out;
+   }
 
if (ndm->ndm_flags & NTF_PROXY) {
err = pneigh_delete(tbl, net, nla_data(dst_attr), dev);
@@ -1706,10 +1721,9 @@ static int neigh_delete(struct sk_buff *skb, struct 
nlmsghdr *nlh,
goto out;
}
 
-   err = neigh_update(neigh, NULL, NUD_FAILED,
-  NEIGH_UPDATE_F_OVERRIDE |
-  NEIGH_UPDATE_F_ADMIN,
-  NETLINK_CB(skb).portid);
+   err = __neigh_update(neigh, NULL, NUD_FAILED,
+NEIGH_UPDATE_F_OVERRIDE | NEIGH_UPDATE_F_ADMIN,
+NETLINK_CB(skb).portid, extack);
write_lock_bh(>lock);
neigh_release(neigh);
neigh_remove_one(neigh, tbl);
@@ -1739,8 +1753,10 @@ static int neigh_add(struct sk_buff *skb, struct 
nlmsghdr *nlh,
goto out;
 
err = -EINVAL;
-   if (tb[NDA_DST] == NULL)
+   if (!tb[NDA_DST]) {
+   NL_SET_ERR_MSG(extack, "Network address not specified");
goto out;
+   }
 
ndm = nlmsg_data(nlh);
if (ndm->ndm_ifindex) {
@@ -1750,16 +1766,21 @@ static int neigh_add(struct sk_buff *skb, struct 
nlmsghdr *nlh,
goto out;
}
 
-   if (tb[NDA_LLADDR] && nla_len(tb[NDA_LLADDR]) < dev->addr_len)
+   if (tb[NDA_LLADDR] && nla_len(tb[NDA_LLADDR]) < dev->addr_len) {
+   NL_SET_ERR_MSG(extack, "Invalid link address");
goto out;
+   }
}
 
tbl = neigh_find_table(ndm->ndm_family);
if (tbl == NULL)
return -EAFNOSUPPORT;
 
-   if 

[PATCH net-next 3/7] net/ipv4: Move arp_hashfn into arp_hash

2018-12-05 Thread David Ahern
From: David Ahern 

There are no more direct references to arp_hashfn so fold it into
arp_hash, the hash callback for arp.

Signed-off-by: David Ahern 
---
 include/net/arp.h | 8 
 net/ipv4/arp.c| 5 -
 2 files changed, 4 insertions(+), 9 deletions(-)

diff --git a/include/net/arp.h b/include/net/arp.h
index a5091f13cd3e..9f433c077b67 100644
--- a/include/net/arp.h
+++ b/include/net/arp.h
@@ -10,14 +10,6 @@
 
 extern struct neigh_table arp_tbl;
 
-static inline u32 arp_hashfn(const void *pkey, const struct net_device *dev, 
u32 *hash_rnd)
-{
-   u32 key = *(const u32 *)pkey;
-   u32 val = key ^ hash32_ptr(dev);
-
-   return val * hash_rnd[0];
-}
-
 static inline struct neighbour *__ipv4_neigh_lookup_noref(struct net_device 
*dev, u32 key)
 {
if (dev->flags & (IFF_LOOPBACK | IFF_POINTOPOINT))
diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c
index 850a6f13a082..6b88211287ae 100644
--- a/net/ipv4/arp.c
+++ b/net/ipv4/arp.c
@@ -213,7 +213,10 @@ static u32 arp_hash(const void *pkey,
const struct net_device *dev,
__u32 *hash_rnd)
 {
-   return arp_hashfn(pkey, dev, hash_rnd);
+   u32 key = *(const u32 *)pkey;
+   u32 val = key ^ hash32_ptr(dev);
+
+   return val * hash_rnd[0];
 }
 
 static bool arp_key_eq(const struct neighbour *neigh, const void *pkey)
-- 
2.11.0



[PATCH net-next 4/7] net/ipv6: Move ndisc_hashfn to ndisc_hash

2018-12-05 Thread David Ahern
From: David Ahern 

There are no more direct references to ndisc_hashfn so fold it into
ndisc_hash, the hash callback for ndisc.

Signed-off-by: David Ahern 
---
 include/net/ndisc.h | 10 --
 net/ipv6/ndisc.c|  7 ++-
 2 files changed, 6 insertions(+), 11 deletions(-)

diff --git a/include/net/ndisc.h b/include/net/ndisc.h
index c354345c679b..83a84f68901b 100644
--- a/include/net/ndisc.h
+++ b/include/net/ndisc.h
@@ -364,16 +364,6 @@ static inline u8 *ndisc_opt_addr_data(struct nd_opt_hdr *p,
 ndisc_addr_option_pad(dev->type));
 }
 
-static inline u32 ndisc_hashfn(const void *pkey, const struct net_device *dev, 
__u32 *hash_rnd)
-{
-   const u32 *p32 = pkey;
-
-   return (((p32[0] ^ hash32_ptr(dev)) * hash_rnd[0]) +
-   (p32[1] * hash_rnd[1]) +
-   (p32[2] * hash_rnd[2]) +
-   (p32[3] * hash_rnd[3]));
-}
-
 static inline struct neighbour *__ipv6_neigh_lookup_noref(struct net_device 
*dev, const void *pkey)
 {
return __neigh_lookup_noref(_tbl, pkey, dev);
diff --git a/net/ipv6/ndisc.c b/net/ipv6/ndisc.c
index 659ecf4e4b3c..304a32b3c3f5 100644
--- a/net/ipv6/ndisc.c
+++ b/net/ipv6/ndisc.c
@@ -311,7 +311,12 @@ static u32 ndisc_hash(const void *pkey,
  const struct net_device *dev,
  __u32 *hash_rnd)
 {
-   return ndisc_hashfn(pkey, dev, hash_rnd);
+   const u32 *p32 = pkey;
+
+   return (((p32[0] ^ hash32_ptr(dev)) * hash_rnd[0]) +
+(p32[1] * hash_rnd[1]) +
+(p32[2] * hash_rnd[2]) +
+(p32[3] * hash_rnd[3]));
 }
 
 static bool ndisc_key_eq(const struct neighbour *n, const void *pkey)
-- 
2.11.0



[PATCH net-next 1/7] neighbor: Remove open coding of key and hash functions

2018-12-05 Thread David Ahern
From: David Ahern 

___neigh_lookup_noref takes the key and hash functions as inputs, yet
those are part of the operations listed in the neigh_table which is
also passed as an arugment. Remove the open coding of these internal
implementations by converting uses of ___neigh_lookup_noref to
__neigh_lookup_noref.

For IPv4, arp_key_eq is essentially a call to neigh_key_eq32 and
arp_hash is a call to arp_hashfn. Similarly for IPv6, ndisc_key_eq
calls neigh_key_eq128 and ndisc_hash calls ndisc_hashfn. So the change
in helpers is a no-op.

Signed-off-by: David Ahern 
---
 include/net/arp.h   | 2 +-
 include/net/ndisc.h | 2 +-
 net/core/filter.c   | 3 +--
 3 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/include/net/arp.h b/include/net/arp.h
index 977aabfcdc03..a5091f13cd3e 100644
--- a/include/net/arp.h
+++ b/include/net/arp.h
@@ -23,7 +23,7 @@ static inline struct neighbour 
*__ipv4_neigh_lookup_noref(struct net_device *dev
if (dev->flags & (IFF_LOOPBACK | IFF_POINTOPOINT))
key = INADDR_ANY;
 
-   return ___neigh_lookup_noref(_tbl, neigh_key_eq32, arp_hashfn, 
, dev);
+   return __neigh_lookup_noref(_tbl, , dev);
 }
 
 static inline struct neighbour *__ipv4_neigh_lookup(struct net_device *dev, 
u32 key)
diff --git a/include/net/ndisc.h b/include/net/ndisc.h
index ddfbb591e2c5..c354345c679b 100644
--- a/include/net/ndisc.h
+++ b/include/net/ndisc.h
@@ -376,7 +376,7 @@ static inline u32 ndisc_hashfn(const void *pkey, const 
struct net_device *dev, _
 
 static inline struct neighbour *__ipv6_neigh_lookup_noref(struct net_device 
*dev, const void *pkey)
 {
-   return ___neigh_lookup_noref(_tbl, neigh_key_eq128, ndisc_hashfn, 
pkey, dev);
+   return __neigh_lookup_noref(_tbl, pkey, dev);
 }
 
 static inline struct neighbour *__ipv6_neigh_lookup(struct net_device *dev, 
const void *pkey)
diff --git a/net/core/filter.c b/net/core/filter.c
index bd0df75dc7b6..f10cc675783c 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4668,8 +4668,7 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct 
bpf_fib_lookup *params,
 * not needed here. Can not use __ipv6_neigh_lookup_noref here
 * because we need to get nd_tbl via the stub
 */
-   neigh = ___neigh_lookup_noref(ipv6_stub->nd_tbl, neigh_key_eq128,
- ndisc_hashfn, dst, dev);
+   neigh = __neigh_lookup_noref(ipv6_stub->nd_tbl, dst, dev);
if (!neigh)
return BPF_FIB_LKUP_RET_NO_NEIGH;
 
-- 
2.11.0



Re: [Patch net-next] call sk_dst_reset when set SO_DONTROUTE

2018-12-05 Thread Eric Dumazet
On Wed, Dec 5, 2018 at 3:07 PM yupeng  wrote:
>
> after set SO_DONTROUTE to 1, the IP layer should not route packets if
> the dest IP address is not in link scope. But if the socket has cached
> the dst_entry, such packets would be routed until the sk_dst_cache
> expires. So we should clean the sk_dst_cache when a user set
> SO_DONTROUTE option. Below are server/client python scripts which
> could reprodue this issue:
>
> server side code:
> ==
> import socket
> import struct
>
> s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
> s.bind(('0.0.0.0', 9000))
> s.listen(1)
> sock, addr = s.accept()
> sock.setsockopt(socket.SOL_SOCKET, socket.SO_DONTROUTE, struct.pack('i', 1))
> while True:
> data = sock.recv(1024) # here the sock.recv should not return anything

Why is that so ?

What is the relation of input path with the SO_DONTROUTE which is for TX ?

sk_dst_reset(sk) should not impact receive side ?

Thanks for providing a test !

> print(data)
> ==
>
> client side code:
> ==
> import socket
> import time
>
> s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
> s.connect(('server_address', 9000))
> while True:
> s.send(b'foo')
> print('send foo')
> time.sleep(1)
> ==
>
> Signed-off-by: yupeng 
> ---
>  net/core/sock.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/net/core/sock.c b/net/core/sock.c
> index f5bb89785e47..f00902c532cc 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -700,6 +700,7 @@ int sock_setsockopt(struct socket *sock, int level, int 
> optname,
> break;
> case SO_DONTROUTE:
> sock_valbool_flag(sk, SOCK_LOCALROUTE, valbool);
> +   sk_dst_reset(sk);
> break;
> case SO_BROADCAST:
> sock_valbool_flag(sk, SOCK_BROADCAST, valbool);
> --
> 2.17.1
>


[PATCH V2 mlx5-next 3/4] IB/mlx5: Use helper to get CQE opcode

2018-12-05 Thread Saeed Mahameed
From: Tariq Toukan 

Use the new helper that extracts the opcode
from a CQE (completion queue entry) structure.

Signed-off-by: Tariq Toukan 
Signed-off-by: Saeed Mahameed 
---
 drivers/infiniband/hw/mlx5/cq.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/cq.c b/drivers/infiniband/hw/mlx5/cq.c
index 0b99f7d0630d..26ab9041f94a 100644
--- a/drivers/infiniband/hw/mlx5/cq.c
+++ b/drivers/infiniband/hw/mlx5/cq.c
@@ -82,7 +82,7 @@ static void *get_sw_cqe(struct mlx5_ib_cq *cq, int n)
 
cqe64 = (cq->mcq.cqe_sz == 64) ? cqe : cqe + 64;
 
-   if (likely((cqe64->op_own) >> 4 != MLX5_CQE_INVALID) &&
+   if (likely(get_cqe_opcode(cqe64) != MLX5_CQE_INVALID) &&
!((cqe64->op_own & MLX5_CQE_OWNER_MASK) ^ !!(n & (cq->ibcq.cqe + 
1 {
return cqe;
} else {
@@ -197,7 +197,7 @@ static void handle_responder(struct ib_wc *wc, struct 
mlx5_cqe64 *cqe,
}
wc->byte_len = be32_to_cpu(cqe->byte_cnt);
 
-   switch (cqe->op_own >> 4) {
+   switch (get_cqe_opcode(cqe)) {
case MLX5_CQE_RESP_WR_IMM:
wc->opcode  = IB_WC_RECV_RDMA_WITH_IMM;
wc->wc_flags= IB_WC_WITH_IMM;
@@ -537,7 +537,7 @@ static int mlx5_poll_one(struct mlx5_ib_cq *cq,
 */
rmb();
 
-   opcode = cqe64->op_own >> 4;
+   opcode = get_cqe_opcode(cqe64);
if (unlikely(opcode == MLX5_CQE_RESIZE_CQ)) {
if (likely(cq->resize_buf)) {
free_cq_buf(dev, >buf);
@@ -1295,7 +1295,7 @@ static int copy_resize_cqes(struct mlx5_ib_cq *cq)
return -EINVAL;
}
 
-   while ((scqe64->op_own >> 4) != MLX5_CQE_RESIZE_CQ) {
+   while (get_cqe_opcode(scqe64) != MLX5_CQE_RESIZE_CQ) {
dcqe = mlx5_frag_buf_get_wqe(>resize_buf->fbc,
 (i + 1) & cq->resize_buf->nent);
dcqe64 = dsize == 64 ? dcqe : dcqe + 64;
-- 
2.19.2



[PATCH V2 mlx5-next 1/4] net/mlx5: When fetching CQEs return CQE instead of void pointer

2018-12-05 Thread Saeed Mahameed
From: Daniel Jurgens 

The function is only used to retrieve CQEs, use the proper type as the
return value.

Signed-off-by: Daniel Jurgens 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/wq.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/wq.h 
b/drivers/net/ethernet/mellanox/mlx5/core/wq.h
index b1293d153a58..9bc2184a46bc 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/wq.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/wq.h
@@ -177,7 +177,7 @@ static inline u32 mlx5_cqwq_get_ci(struct mlx5_cqwq *wq)
return mlx5_cqwq_ctr2ix(wq, wq->cc);
 }
 
-static inline void *mlx5_cqwq_get_wqe(struct mlx5_cqwq *wq, u32 ix)
+static inline struct mlx5_cqe64 *mlx5_cqwq_get_wqe(struct mlx5_cqwq *wq, u32 
ix)
 {
return mlx5_frag_buf_get_wqe(>fbc, ix);
 }
-- 
2.19.2



[PATCH V2 mlx5-next 4/4] net/mlx5: Move flow counters data structures from flow steering header

2018-12-05 Thread Saeed Mahameed
After the following flow counters API refactoring:
("net/mlx5: Use flow counter IDs and not the wrapping cache object")
flow counters private data structures mlx5_fc_cache and mlx5_fc are
redundantly exposed in fs_core.h, they have nothing to do with flow
steering core and they are private to fs_counter.c, this patch moves them
to where they belong and reduces their exposure in the driver.

Signed-off-by: Saeed Mahameed 
---
 .../net/ethernet/mellanox/mlx5/core/fs_core.h | 24 ---
 .../ethernet/mellanox/mlx5/core/fs_counters.c | 24 +++
 2 files changed, 24 insertions(+), 24 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.h 
b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.h
index b51ad217da32..186566b521f9 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.h
@@ -36,7 +36,6 @@
 #include 
 #include 
 #include 
-#include 
 
 /* FS_TYPE_PRIO_CHAINS is a PRIO that will have namespaces only,
  * and those are in parallel to one another when going over them to connect
@@ -145,29 +144,6 @@ struct mlx5_flow_table {
struct rhltable fgs_hash;
 };
 
-struct mlx5_fc_cache {
-   u64 packets;
-   u64 bytes;
-   u64 lastuse;
-};
-
-struct mlx5_fc {
-   struct list_head list;
-   struct llist_node addlist;
-   struct llist_node dellist;
-
-   /* last{packets,bytes} members are used when calculating the delta since
-* last reading
-*/
-   u64 lastpackets;
-   u64 lastbytes;
-
-   u32 id;
-   bool aging;
-
-   struct mlx5_fc_cache cache cacheline_aligned_in_smp;
-};
-
 struct mlx5_ft_underlay_qp {
struct list_head list;
u32 qpn;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_counters.c 
b/drivers/net/ethernet/mellanox/mlx5/core/fs_counters.c
index 32accd6b041b..b2574c28defa 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_counters.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_counters.c
@@ -33,6 +33,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "mlx5_core.h"
 #include "fs_core.h"
 #include "fs_cmd.h"
@@ -41,6 +42,29 @@
 /* Max number of counters to query in bulk read is 32K */
 #define MLX5_SW_MAX_COUNTERS_BULK BIT(15)
 
+struct mlx5_fc_cache {
+   u64 packets;
+   u64 bytes;
+   u64 lastuse;
+};
+
+struct mlx5_fc {
+   struct list_head list;
+   struct llist_node addlist;
+   struct llist_node dellist;
+
+   /* last{packets,bytes} members are used when calculating the delta since
+* last reading
+*/
+   u64 lastpackets;
+   u64 lastbytes;
+
+   u32 id;
+   bool aging;
+
+   struct mlx5_fc_cache cache cacheline_aligned_in_smp;
+};
+
 /* locking scheme:
  *
  * It is the responsibility of the user to prevent concurrent calls or bad
-- 
2.19.2



[PATCH V2 mlx5-next 2/4] net/mlx5: Use helper to get CQE opcode

2018-12-05 Thread Saeed Mahameed
From: Tariq Toukan 

Introduce and use a helper that extracts the opcode
from a CQE (completion queue entry) structure.

Signed-off-by: Tariq Toukan 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 10 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c |  2 +-
 drivers/net/ethernet/mellanox/mlx5/core/fpga/conn.c |  2 +-
 include/linux/mlx5/device.h |  5 +
 4 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 79638dcbae78..31956ddd394e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -554,9 +554,9 @@ static inline void mlx5e_poll_ico_single_cqe(struct 
mlx5e_cq *cq,
 
mlx5_cqwq_pop(>wq);
 
-   if (unlikely((cqe->op_own >> 4) != MLX5_CQE_REQ)) {
+   if (unlikely(get_cqe_opcode(cqe) != MLX5_CQE_REQ)) {
netdev_WARN_ONCE(cq->channel->netdev,
-"Bad OP in ICOSQ CQE: 0x%x\n", cqe->op_own);
+"Bad OP in ICOSQ CQE: 0x%x\n", 
get_cqe_opcode(cqe));
return;
}
 
@@ -898,7 +898,7 @@ mlx5e_skb_from_cqe_linear(struct mlx5e_rq *rq, struct 
mlx5_cqe64 *cqe,
prefetchw(va); /* xdp_frame data area */
prefetch(data);
 
-   if (unlikely((cqe->op_own >> 4) != MLX5_CQE_RESP_SEND)) {
+   if (unlikely(get_cqe_opcode(cqe) != MLX5_CQE_RESP_SEND)) {
rq->stats->wqe_err++;
return NULL;
}
@@ -930,7 +930,7 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct 
mlx5_cqe64 *cqe,
u16 byte_cnt = cqe_bcnt - headlen;
struct sk_buff *skb;
 
-   if (unlikely((cqe->op_own >> 4) != MLX5_CQE_RESP_SEND)) {
+   if (unlikely(get_cqe_opcode(cqe) != MLX5_CQE_RESP_SEND)) {
rq->stats->wqe_err++;
return NULL;
}
@@ -1148,7 +1148,7 @@ void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, 
struct mlx5_cqe64 *cqe)
 
wi->consumed_strides += cstrides;
 
-   if (unlikely((cqe->op_own >> 4) != MLX5_CQE_RESP_SEND)) {
+   if (unlikely(get_cqe_opcode(cqe) != MLX5_CQE_RESP_SEND)) {
rq->stats->wqe_err++;
goto mpwrq_cqe_out;
}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index 6dacaeba2fbf..46b5a6914d71 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -507,7 +507,7 @@ bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget)
 
wqe_counter = be16_to_cpu(cqe->wqe_counter);
 
-   if (unlikely(cqe->op_own >> 4 == MLX5_CQE_REQ_ERR)) {
+   if (unlikely(get_cqe_opcode(cqe) == MLX5_CQE_REQ_ERR)) {
if (!test_and_set_bit(MLX5E_SQ_STATE_RECOVERING,
  >state)) {
mlx5e_dump_error_cqe(sq,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fpga/conn.c 
b/drivers/net/ethernet/mellanox/mlx5/core/fpga/conn.c
index 8ca1d1949d93..873541ef4c1b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fpga/conn.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fpga/conn.c
@@ -334,7 +334,7 @@ static void mlx5_fpga_conn_handle_cqe(struct mlx5_fpga_conn 
*conn,
 {
u8 opcode, status = 0;
 
-   opcode = cqe->op_own >> 4;
+   opcode = get_cqe_opcode(cqe);
 
switch (opcode) {
case MLX5_CQE_REQ_ERR:
diff --git a/include/linux/mlx5/device.h b/include/linux/mlx5/device.h
index f7c8bebfe472..c66867c8fc2f 100644
--- a/include/linux/mlx5/device.h
+++ b/include/linux/mlx5/device.h
@@ -781,6 +781,11 @@ static inline u8 mlx5_get_cqe_format(struct mlx5_cqe64 
*cqe)
return (cqe->op_own >> 2) & 0x3;
 }
 
+static inline u8 get_cqe_opcode(struct mlx5_cqe64 *cqe)
+{
+   return cqe->op_own >> 4;
+}
+
 static inline u8 get_cqe_lro_tcppsh(struct mlx5_cqe64 *cqe)
 {
return (cqe->lro_tcppsh_abort_dupack >> 6) & 1;
-- 
2.19.2



[PATCH V2 mlx5-next 0/4] mlx5 core CQE API and misc updates

2018-12-05 Thread Saeed Mahameed
Hi

This patchset is for mlx5-next shared branch, and will be applied there
once the review is done.

Patches 1,2,3 are trivial improvements to CQE API
1. return CQE pointer instead of void pointer in get_cqe function
2. helper function for retrieving the CQE opcode, used in rdma and netdev

Patch 4, Move flow counters data structures definition to c file.

V1->v2:
- move #include  in patch 4

Thanks,
Saeed.

---

Daniel Jurgens (1):
  net/mlx5: When fetching CQEs return CQE instead of void pointer

Saeed Mahameed (1):
  net/mlx5: Move flow counters data structures from flow steering header

Tariq Toukan (2):
  net/mlx5: Use helper to get CQE opcode
  IB/mlx5: Use helper to get CQE opcode

 drivers/infiniband/hw/mlx5/cq.c   |  8 +++
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   | 10 
 .../net/ethernet/mellanox/mlx5/core/en_tx.c   |  2 +-
 .../ethernet/mellanox/mlx5/core/fpga/conn.c   |  2 +-
 .../net/ethernet/mellanox/mlx5/core/fs_core.h | 24 ---
 .../ethernet/mellanox/mlx5/core/fs_counters.c | 24 +++
 drivers/net/ethernet/mellanox/mlx5/core/wq.h  |  2 +-
 include/linux/mlx5/device.h   |  5 
 8 files changed, 41 insertions(+), 36 deletions(-)

-- 
2.19.2



[Patch net-next] call sk_dst_reset when set SO_DONTROUTE

2018-12-05 Thread yupeng
after set SO_DONTROUTE to 1, the IP layer should not route packets if
the dest IP address is not in link scope. But if the socket has cached
the dst_entry, such packets would be routed until the sk_dst_cache
expires. So we should clean the sk_dst_cache when a user set
SO_DONTROUTE option. Below are server/client python scripts which
could reprodue this issue:

server side code:
==
import socket
import struct

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(('0.0.0.0', 9000))
s.listen(1)
sock, addr = s.accept()
sock.setsockopt(socket.SOL_SOCKET, socket.SO_DONTROUTE, struct.pack('i', 1))
while True:
data = sock.recv(1024) # here the sock.recv should not return anything
print(data)
==

client side code:
==
import socket
import time

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(('server_address', 9000))
while True:
s.send(b'foo')
print('send foo')
time.sleep(1)
==

Signed-off-by: yupeng 
---
 net/core/sock.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/core/sock.c b/net/core/sock.c
index f5bb89785e47..f00902c532cc 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -700,6 +700,7 @@ int sock_setsockopt(struct socket *sock, int level, int 
optname,
break;
case SO_DONTROUTE:
sock_valbool_flag(sk, SOCK_LOCALROUTE, valbool);
+   sk_dst_reset(sk);
break;
case SO_BROADCAST:
sock_valbool_flag(sk, SOCK_BROADCAST, valbool);
-- 
2.17.1



[PATCH net] tcp: fix NULL ref in tail loss probe

2018-12-05 Thread Yuchung Cheng
TCP loss probe timer may fire when the retranmission queue is empty but
has a non-zero tp->packets_out counter. tcp_send_loss_probe will call
tcp_rearm_rto which triggers NULL pointer reference by fetching the
retranmission queue head in its sub-routines.

Add a more detailed warning to help catch the root cause of the inflight
accounting inconsistency.

Reported-by: Rafael Tinoco 
Signed-off-by: Yuchung Cheng 
Signed-off-by: Eric Dumazet 
Signed-off-by: Neal Cardwell 
---
 net/ipv4/tcp_output.c | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 68b5326f7321..9a1101095298 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2494,15 +2494,18 @@ void tcp_send_loss_probe(struct sock *sk)
goto rearm_timer;
}
skb = skb_rb_last(>tcp_rtx_queue);
+   if (unlikely(!skb)) {
+   WARN_ONCE(tp->packets_out,
+ "invalid inflight: %u state %u cwnd %u mss %d\n",
+ tp->packets_out, sk->sk_state, tp->snd_cwnd, mss);
+   inet_csk(sk)->icsk_pending = 0;
+   return;
+   }
 
/* At most one outstanding TLP retransmission. */
if (tp->tlp_high_seq)
goto rearm_timer;
 
-   /* Retransmit last segment. */
-   if (WARN_ON(!skb))
-   goto rearm_timer;
-
if (skb_still_in_host_queue(sk, skb))
goto rearm_timer;
 
-- 
2.20.0.rc1.387.gf8505762e3-goog



Re: [PATCH 1/2] net: linkwatch: send change uevent on link changes

2018-12-05 Thread Jouke Witteveen
On Wed, Dec 5, 2018 at 8:45 PM David Miller  wrote:
>
> From: Jouke Witteveen 
> Date: Wed, 5 Dec 2018 14:50:31 +0100
>
> > For example, I maintain a network manager that delegates the actual
> > networking work to specialized programs.
>
> Basically "I've implemented things using separate programs"

I was trying to give an example of a typical system administrator task.

> > Basically, it is an implementation of network manager logic in shell
> > script. For such a shell script, it is easy to respond to uevents
> > (via udev, or alternatives), but responding to rtnetlink messages
> > would require a separate program.
>
> And "In order to use rtnetlink I'll need a separate program!"
>
> (╯°□°)╯︵ ┻━┻
>
> So it's ok to use the separate program paradigm for dividing up
> the tasks, but not for processing events?
>
> I'm not convinced.

This is about automation. When you want to take some action in
response to an event, you want the event system to be accommodating.
In that sense, it is indeed more ok for actions to require separate
programs than for event listening.

> Either use the facility we have or extend it to fill a valid missing
> need.
>
> I'm not applying these patches, your logic doesn't add up and it's
> inconsistent with our clear goals of not duplicating functionality.

Can you elaborate a bit? I may not be aware of the policy you have in
mind. Furthermore, I fail to see how these "change" events in response
to link state changes are not to be expected. To me, it looks like
uevents are there precisely to inform userspace about this kind of
changes.
As I asked in my conceptual argument, please consider this patch an
extension of uevent coverage, more than an attack on rtnetlink (which
it is definitely not).

Regards,
- Jouke


Re: [PATCH mlx5-next 2/4] net/mlx5: Use helper to get CQE opcode

2018-12-05 Thread Saeed Mahameed
On Tue, 2018-12-04 at 20:55 -0800, Cong Wang wrote:
> On Tue, Dec 4, 2018 at 6:06 PM Saeed Mahameed 
> wrote:
> > +static inline u8 get_cqe_opcode(struct mlx5_cqe64 *cqe)
> 
> Make it const please.

Please be specific

I can do:
static inline u8 get_cqe_opcode(struct mlx5_cqe64 const *cqe)
if that what you mean.

But i will leave it for a later time to do it for all cqe fields getter
functions in this file, i will keep the code uniform for now.


Re: [PATCH net] tcp: Do not underestimate rwnd_limited

2018-12-05 Thread Yuchung Cheng
On Wed, Dec 5, 2018 at 2:28 PM Soheil Hassas Yeganeh  wrote:
>
> On Wed, Dec 5, 2018 at 5:24 PM Eric Dumazet  wrote:
> >
> > If available rwnd is too small, tcp_tso_should_defer()
> > can decide it is worth waiting before splitting a TSO packet.
> >
> > This really means we are rwnd limited.
> >
> > Fixes: 5615f88614a4 ("tcp: instrument how long TCP is limited by receive 
> > window")
> > Signed-off-by: Eric Dumazet 
>
> Acked-by: Soheil Hassas Yeganeh 
Reviewed-by: Yuchung Cheng 
>
> Excellent catch! Thank you for the fix, Eric!
>
> > ---
> >  net/ipv4/tcp_output.c | 5 -
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > index 
> > 68b5326f73212ffe7111dd0f91e0a1246fb0ae25..3186902347584090256467d8679320666aa0257e
> >  100644
> > --- a/net/ipv4/tcp_output.c
> > +++ b/net/ipv4/tcp_output.c
> > @@ -2356,8 +2356,11 @@ static bool tcp_write_xmit(struct sock *sk, unsigned 
> > int mss_now, int nonagle,
> > } else {
> > if (!push_one &&
> > tcp_tso_should_defer(sk, skb, _cwnd_limited,
> > -max_segs))
> > +max_segs)) {
> > +   if (!is_cwnd_limited)
> > +   is_rwnd_limited = true;
> > break;
> > +   }
> > }
> >
> > limit = mss_now;
> > --
> > 2.20.0.rc2.403.gdbc3b29805-goog
> >


Re: [PATCH net] tcp: Do not underestimate rwnd_limited

2018-12-05 Thread Soheil Hassas Yeganeh
On Wed, Dec 5, 2018 at 5:24 PM Eric Dumazet  wrote:
>
> If available rwnd is too small, tcp_tso_should_defer()
> can decide it is worth waiting before splitting a TSO packet.
>
> This really means we are rwnd limited.
>
> Fixes: 5615f88614a4 ("tcp: instrument how long TCP is limited by receive 
> window")
> Signed-off-by: Eric Dumazet 

Acked-by: Soheil Hassas Yeganeh 

Excellent catch! Thank you for the fix, Eric!

> ---
>  net/ipv4/tcp_output.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 
> 68b5326f73212ffe7111dd0f91e0a1246fb0ae25..3186902347584090256467d8679320666aa0257e
>  100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2356,8 +2356,11 @@ static bool tcp_write_xmit(struct sock *sk, unsigned 
> int mss_now, int nonagle,
> } else {
> if (!push_one &&
> tcp_tso_should_defer(sk, skb, _cwnd_limited,
> -max_segs))
> +max_segs)) {
> +   if (!is_cwnd_limited)
> +   is_rwnd_limited = true;
> break;
> +   }
> }
>
> limit = mss_now;
> --
> 2.20.0.rc2.403.gdbc3b29805-goog
>


Re: [PATCH mlx5-next 3/4] IB/mlx5: Use helper to get CQE opcode

2018-12-05 Thread Saeed Mahameed
On Wed, 2018-12-05 at 08:09 +, Leon Romanovsky wrote:
> On Tue, Dec 04, 2018 at 06:03:02PM -0800, Saeed Mahameed wrote:
> > From: Tariq Toukan 
> > 
> > Use the new helper that extracts the opcode
> > from a CQE (completion queue entry) structure.
> > 
> > Signed-off-by: Tariq Toukan 
> > Signed-off-by: Saeed Mahameed 
> > ---
> >  drivers/infiniband/hw/mlx5/cq.c | 8 
> >  1 file changed, 4 insertions(+), 4 deletions(-)
> > 
> 
> I recommend you to squash this patch into the previous one.
> 
> Thanks,
> Acked-by: Leon Romanovsky 

Thanks Leon, I would like to keep netdev and rdma patches separated as
much as possible 


Re: [PATCH mlx5-next 4/4] net/mlx5: Move flow counters data structures from flow steering header

2018-12-05 Thread Saeed Mahameed
On Tue, 2018-12-04 at 21:04 -0800, Cong Wang wrote:
> On Tue, Dec 4, 2018 at 6:05 PM Saeed Mahameed 
> wrote:
> > After the following flow counters API refactoring:
> > ("net/mlx5: Use flow counter IDs and not the wrapping cache
> > object")
> > flow counters private data structures mlx5_fc_cache and mlx5_fc are
> > redundantly exposed in fs_core.h, they have nothing to do with flow
> > steering core and they are private to fs_counter.c, this patch
> > moves them
> > to where they belong and reduces their exposure in the driver.
> > 
> > Signed-off-by: Saeed Mahameed 
> > ---
> >  .../net/ethernet/mellanox/mlx5/core/fs_core.h | 23 -
> > --
> >  .../ethernet/mellanox/mlx5/core/fs_counters.c | 23
> > +++
> 
> The #include  can be moved together too.

sure will do that in V2
thanks Cong!


[PATCH net] tcp: Do not underestimate rwnd_limited

2018-12-05 Thread Eric Dumazet
If available rwnd is too small, tcp_tso_should_defer()
can decide it is worth waiting before splitting a TSO packet.

This really means we are rwnd limited.

Fixes: 5615f88614a4 ("tcp: instrument how long TCP is limited by receive 
window")
Signed-off-by: Eric Dumazet 
---
 net/ipv4/tcp_output.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 
68b5326f73212ffe7111dd0f91e0a1246fb0ae25..3186902347584090256467d8679320666aa0257e
 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2356,8 +2356,11 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int 
mss_now, int nonagle,
} else {
if (!push_one &&
tcp_tso_should_defer(sk, skb, _cwnd_limited,
-max_segs))
+max_segs)) {
+   if (!is_cwnd_limited)
+   is_rwnd_limited = true;
break;
+   }
}
 
limit = mss_now;
-- 
2.20.0.rc2.403.gdbc3b29805-goog



Re: DSA support for Marvell 88e6065 switch

2018-12-05 Thread Florian Fainelli
On 12/5/18 12:46 PM, Pavel Machek wrote:
> Hi!
> 
> But I'm running into problems with tagging code, and I guess I'd like
> some help understanding.
>
> tag_trailer: allocates new skb, then copies data around.
>
> tag_qca: does dev->stats.tx_packets++, and reuses existing skb.
>
> tag_brcm: reuses existing skb.
>>>
>>> Any idea why tag trailer allocates new skb,
>>
>> I wrote this code over 10 years ago, so I don't remember all that
>> well, but I think that it is because you have to do manual checksumming
>> of the packet, as there's no way to pass down the stack that you don't
>> want to checksum all the way down to the end of the data area (and you
>> don't want the tag to be included in the checksum), and so you want to
>> do that before you add the trailer tag, and you'll probably have to
>> reallocate the data area to be able to add the tag, and you probably
>> won't get an exclusive skb here anyway, so you might as well allocate
>> a new one.
> 
> Ok, thanks. Whether tag is checksummed or not is indeed an interesting 
> question.
> 
>>> and what is going on with dev->stats.tx_packets++?
>>
>> trailer_xmit would be the hard_start_xmit function for the virtual
>> (slave) network interface, so this would be the right thing to do?
> 
> Some tag_*.c do this and some do not, so I'm still confused.

This is likely just historical, or we don't have an easy way to
determine whether the SKB transmitted succeeded or not because it was
reallocated, and therefore the tagging code must make sure statistics
are maintained. I suspect it is just historical and that tag module just
has not been updated to leverage the statistics accounting done a layer
above it within DSA.

> 
>>> 6065 indeed has some kind of "egress tagging mode" (with four
>>> options), but I have trouble understanding what it really does.
>>
>> What are the options?
> 
> "transmit unmodified", "transmit untagged", "transmit tagged" and
> "transmit all frames double tagged".

It seems to me that this applies to VLAN headers, not DSA/trailer
headers though not having the data sheet I should be flat out wrong.
-- 
Florian


pull-request: bpf 2018-12-05

2018-12-05 Thread Alexei Starovoitov
Hi David,

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) fix bpf uapi pointers for 32-bit architectures, from Daniel.

2) improve verifer ability to handle progs with a lot of branches, from Alexei.

3) strict btf checks, from Yonghong.

4) bpf_sk_lookup api cleanup, from Joe.

5) other misc fixes

Please consider pulling these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

Thanks a lot!



The following changes since commit d78a5ebd8b18d3809fd9d6bbaeb64d78a332204f:

  Merge branch '1GbE' of 
git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-queue (2018-11-28 
11:33:35 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git 

for you to fetch changes up to a92a72a24d48080f6c49bb514c082fbb1e5bf8fc:

  Merge branch 'bpf-verifier-resilience' (2018-12-04 17:22:03 +0100)


Alexei Starovoitov (4):
  Merge branch 'btf-check-name'
  bpf: check pending signals while verifying programs
  bpf: improve verifier branch analysis
  bpf: add per-insn complexity limit

Daniel Borkmann (2):
  bpf: fix pointer offsets in context for 32 bit
  Merge branch 'bpf-verifier-resilience'

David Miller (1):
  bpf: Fix verifier log string check for bad alignment.

Joe Stringer (2):
  bpf: Support sk lookup in netns with id 0
  bpf: Improve socket lookup reuseport documentation

Martin KaFai Lau (1):
  tools/bpf: fix two test_btf unit test cases

Roman Gushchin (1):
  bpf: refactor bpf_test_run() to separate own failures and test program 
result

Sandipan Das (1):
  bpf: powerpc64: optimize JIT passes for bpf function calls

Yonghong Song (4):
  bpf: btf: implement btf_name_valid_identifier()
  bpf: btf: check name validity for various types
  tools/bpf: add addition type tests to test_btf
  tools: bpftool: fix a bitfield pretty print issue

 arch/powerpc/net/bpf_jit_comp64.c |  66 
 include/linux/filter.h|   7 +
 include/uapi/linux/bpf.h  |  56 ++--
 kernel/bpf/btf.c  |  82 +
 kernel/bpf/verifier.c | 103 +-
 net/bpf/test_run.c|  21 +-
 net/core/filter.c |  27 +-
 tools/bpf/bpftool/btf_dumper.c|   6 +-
 tools/include/uapi/linux/bpf.h|  56 ++--
 tools/testing/selftests/bpf/bpf_helpers.h |   4 +-
 tools/testing/selftests/bpf/test_btf.c| 375 +-
 tools/testing/selftests/bpf/test_sk_lookup_kern.c |  18 +-
 tools/testing/selftests/bpf/test_verifier.c   |   6 +-
 13 files changed, 732 insertions(+), 95 deletions(-)


[PATCH] net-udp: deprioritize cpu match for udp socket lookup

2018-12-05 Thread Maciej Żenczykowski
From: Maciej Żenczykowski 

During udp socket lookup cpu match should be lowest priority,
hence it should increase score by only 1.

The next priority is delivering v4 to v4 sockets, and v6 to v6 sockets.
The v6 code path doesn't have to deal with this so it always gets
a score of '4'.  The v4 code path uses '4' or '2' depending on
whether we're delivering to a v4 socket or a dualstack v6 socket.

This is more important than cpu match, so has to be greater than
the '1' bump in score from cpu match.

All other matches (src/dst ip, src port) are even *more* important,
so need to bump score by 4 for ipv4.

For ipv6 we could simply bump by 2, but let's keep the two code
paths as similar as possible.

(also, while at it, remove two unnecessary unconditional score bumps)

Signed-off-by: Maciej Żenczykowski 
---
 net/ipv4/udp.c | 3 +--
 net/ipv6/udp.c | 9 -
 2 files changed, 5 insertions(+), 7 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index aff2a8e99e01..0c0ab0383cec 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -380,7 +380,7 @@ static int compute_score(struct sock *sk, struct net *net,
ipv6_only_sock(sk))
return -1;
 
-   score = (sk->sk_family == PF_INET) ? 2 : 1;
+   score = (sk->sk_family == PF_INET) ? 4 : 2;
inet = inet_sk(sk);
 
if (inet->inet_rcv_saddr) {
@@ -405,7 +405,6 @@ static int compute_score(struct sock *sk, struct net *net,
dif, sdif);
if (!dev_match)
return -1;
-   score += 4;
 
if (sk->sk_incoming_cpu == raw_smp_processor_id())
score++;
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 09cba4cfe31f..5441062d7d5e 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -125,31 +125,30 @@ static int compute_score(struct sock *sk, struct net *net,
sk->sk_family != PF_INET6)
return -1;
 
-   score = 0;
+   score = 4;
inet = inet_sk(sk);
 
if (inet->inet_dport) {
if (inet->inet_dport != sport)
return -1;
-   score++;
+   score += 4;
}
 
if (!ipv6_addr_any(>sk_v6_rcv_saddr)) {
if (!ipv6_addr_equal(>sk_v6_rcv_saddr, daddr))
return -1;
-   score++;
+   score += 4;
}
 
if (!ipv6_addr_any(>sk_v6_daddr)) {
if (!ipv6_addr_equal(>sk_v6_daddr, saddr))
return -1;
-   score++;
+   score += 4;
}
 
dev_match = udp_sk_bound_dev_eq(net, sk->sk_bound_dev_if, dif, sdif);
if (!dev_match)
return -1;
-   score++;
 
if (sk->sk_incoming_cpu == raw_smp_processor_id())
score++;
-- 
2.20.0.rc1.387.gf8505762e3-goog



[PATCH net-next 4/6] net: mii: Add mii_lpa_mod_linkmode_lpa_t

2018-12-05 Thread Andrew Lunn
Add a _mod_ variant of mii_lpa_to_linkmode_lpa_t. Use this to fix the
genphy_read_status() where the 1G link partner features are getting
lost.

Fixes: c0ec3c273677 ("net: phy: Convert u32 phydev->lp_advertising to linkmode")
Reported-by: Heiner Kallweit 
Signed-off-by: Andrew Lunn 
---
 drivers/net/phy/phy_device.c |  2 +-
 include/linux/mii.h  | 68 +++-
 2 files changed, 53 insertions(+), 17 deletions(-)

diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index c20b5ecc0f4b..7d5d698604aa 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -1748,7 +1748,7 @@ int genphy_read_status(struct phy_device *phydev)
if (lpa < 0)
return lpa;
 
-   mii_lpa_to_linkmode_lpa_t(phydev->lp_advertising, lpa);
+   mii_lpa_mod_linkmode_lpa_t(phydev->lp_advertising, lpa);
 
adv = phy_read(phydev, MII_ADVERTISE);
if (adv < 0)
diff --git a/include/linux/mii.h b/include/linux/mii.h
index b915ef6c3692..e72447778a08 100644
--- a/include/linux/mii.h
+++ b/include/linux/mii.h
@@ -372,6 +372,36 @@ static inline u32 mii_lpa_to_ethtool_lpa_x(u32 lpa)
return result | mii_adv_to_ethtool_adv_x(lpa);
 }
 
+/**
+ * mii_adv_mod_linkmode_adv_t
+ * @advertising:pointer to destination link mode.
+ * @adv: value of the MII_ADVERTISE register
+ *
+ * A small helper function that translates MII_ADVERTISE bits to
+ * linkmode advertisement settings. Leaves other bits unchanged.
+ */
+static inline void mii_adv_mod_linkmode_adv_t(unsigned long *advertising,
+ u32 adv)
+{
+   linkmode_mod_bit(ETHTOOL_LINK_MODE_10baseT_Half_BIT,
+advertising, adv & ADVERTISE_10HALF);
+
+   linkmode_mod_bit(ETHTOOL_LINK_MODE_10baseT_Full_BIT,
+advertising, adv & ADVERTISE_10FULL);
+
+   linkmode_mod_bit(ETHTOOL_LINK_MODE_100baseT_Half_BIT,
+advertising, adv & ADVERTISE_100HALF);
+
+   linkmode_mod_bit(ETHTOOL_LINK_MODE_100baseT_Full_BIT,
+advertising, adv & ADVERTISE_100FULL);
+
+   linkmode_mod_bit(ETHTOOL_LINK_MODE_Pause_BIT, advertising,
+adv & ADVERTISE_PAUSE_CAP);
+
+   linkmode_mod_bit(ETHTOOL_LINK_MODE_Asym_Pause_BIT,
+advertising, adv & ADVERTISE_PAUSE_ASYM);
+}
+
 /**
  * mii_adv_to_linkmode_adv_t
  * @advertising:pointer to destination link mode.
@@ -386,22 +416,7 @@ static inline void mii_adv_to_linkmode_adv_t(unsigned long 
*advertising,
 {
linkmode_zero(advertising);
 
-   if (adv & ADVERTISE_10HALF)
-   linkmode_set_bit(ETHTOOL_LINK_MODE_10baseT_Half_BIT,
-advertising);
-   if (adv & ADVERTISE_10FULL)
-   linkmode_set_bit(ETHTOOL_LINK_MODE_10baseT_Full_BIT,
-advertising);
-   if (adv & ADVERTISE_100HALF)
-   linkmode_set_bit(ETHTOOL_LINK_MODE_100baseT_Half_BIT,
-advertising);
-   if (adv & ADVERTISE_100FULL)
-   linkmode_set_bit(ETHTOOL_LINK_MODE_100baseT_Full_BIT,
-advertising);
-   if (adv & ADVERTISE_PAUSE_CAP)
-   linkmode_set_bit(ETHTOOL_LINK_MODE_Pause_BIT, advertising);
-   if (adv & ADVERTISE_PAUSE_ASYM)
-   linkmode_set_bit(ETHTOOL_LINK_MODE_Asym_Pause_BIT, advertising);
+   mii_adv_mod_linkmode_adv_t(advertising, adv);
 }
 
 /**
@@ -423,6 +438,27 @@ static inline void mii_lpa_to_linkmode_lpa_t(unsigned long 
*lp_advertising,
 
 }
 
+/**
+ * mii_lpa_mod_linkmode_lpa_t
+ * @adv: value of the MII_LPA register
+ *
+ * A small helper function that translates MII_LPA bits, when in
+ * 1000Base-T mode, to linkmode LP advertisement settings. Leaves
+ * other bits unchanged.
+ */
+static inline void mii_lpa_mod_linkmode_lpa_t(unsigned long *lp_advertising,
+ u32 lpa)
+{
+   mii_adv_mod_linkmode_adv_t(lp_advertising, lpa);
+
+   if (lpa & LPA_LPACK)
+   linkmode_set_bit(ETHTOOL_LINK_MODE_Autoneg_BIT,
+lp_advertising);
+   else
+   linkmode_clear_bit(ETHTOOL_LINK_MODE_Autoneg_BIT,
+  lp_advertising);
+}
+
 /**
  * linkmode_adv_to_lcl_adv_t
  * @advertising:pointer to linkmode advertising
-- 
2.19.1



[PATCH net-next 2/6] net: mii: Rename mii_stat1000_to_linkmode_lpa_t

2018-12-05 Thread Andrew Lunn
Rename mii_stat1000_to_linkmode_lpa_t to
mii_stat1000_mod_linkmode_lpa_t to indicate it modifies the passed
linkmode bitmap, without clearing any other bits.

Add a helper to set/clear bits in a linkmode.

Use this helper to ensure bit are clear which the stat1000 indicates
should not be set.

Suggested-by: Heiner Kallweit 
Fixes: c0ec3c273677 ("net: phy: Convert u32 phydev->lp_advertising to linkmode")
Signed-off-by: Andrew Lunn 
---
 drivers/net/phy/marvell.c|  2 +-
 drivers/net/phy/marvell10g.c |  2 +-
 drivers/net/phy/phy_device.c |  4 ++--
 include/linux/linkmode.h |  9 +
 include/linux/mii.h  | 20 ++--
 5 files changed, 23 insertions(+), 14 deletions(-)

diff --git a/drivers/net/phy/marvell.c b/drivers/net/phy/marvell.c
index 6a9881942e53..03dafe0e68a2 100644
--- a/drivers/net/phy/marvell.c
+++ b/drivers/net/phy/marvell.c
@@ -1138,7 +1138,7 @@ static int marvell_read_status_page_an(struct phy_device 
*phydev,
 
if (!fiber) {
mii_lpa_to_linkmode_lpa_t(phydev->lp_advertising, lpa);
-   mii_stat1000_to_linkmode_lpa_t(phydev->lp_advertising, lpagb);
+   mii_stat1000_mod_linkmode_lpa_t(phydev->lp_advertising, lpagb);
 
if (phydev->duplex == DUPLEX_FULL) {
phydev->pause = lpa & LPA_PAUSE_CAP ? 1 : 0;
diff --git a/drivers/net/phy/marvell10g.c b/drivers/net/phy/marvell10g.c
index 6f6e886fc836..82ab6ed3b74e 100644
--- a/drivers/net/phy/marvell10g.c
+++ b/drivers/net/phy/marvell10g.c
@@ -490,7 +490,7 @@ static int mv3310_read_status(struct phy_device *phydev)
if (val < 0)
return val;
 
-   mii_stat1000_to_linkmode_lpa_t(phydev->lp_advertising, val);
+   mii_stat1000_mod_linkmode_lpa_t(phydev->lp_advertising, val);
 
if (phydev->autoneg == AUTONEG_ENABLE)
phy_resolve_aneg_linkmode(phydev);
diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index e6720e2a2da6..c20b5ecc0f4b 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -1739,8 +1739,8 @@ int genphy_read_status(struct phy_device *phydev)
return -ENOLINK;
}
 
-   mii_stat1000_to_linkmode_lpa_t(phydev->lp_advertising,
-  lpagb);
+   mii_stat1000_mod_linkmode_lpa_t(phydev->lp_advertising,
+   lpagb);
common_adv_gb = lpagb & adv << 2;
}
 
diff --git a/include/linux/linkmode.h b/include/linux/linkmode.h
index 22443d7fb5cd..a99c58866860 100644
--- a/include/linux/linkmode.h
+++ b/include/linux/linkmode.h
@@ -57,6 +57,15 @@ static inline void linkmode_clear_bit(int nr, volatile 
unsigned long *addr)
__clear_bit(nr, addr);
 }
 
+static inline void linkmode_mod_bit(int nr, volatile unsigned long *addr,
+   int set)
+{
+   if (set)
+   linkmode_set_bit(nr, addr);
+   else
+   linkmode_clear_bit(nr, addr);
+}
+
 static inline void linkmode_change_bit(int nr, volatile unsigned long *addr)
 {
__change_bit(nr, addr);
diff --git a/include/linux/mii.h b/include/linux/mii.h
index 57365224306c..b915ef6c3692 100644
--- a/include/linux/mii.h
+++ b/include/linux/mii.h
@@ -288,22 +288,22 @@ static inline u32 mii_stat1000_to_ethtool_lpa_t(u32 lpa)
 }
 
 /**
- * mii_stat1000_to_linkmode_lpa_t
+ * mii_stat1000_mod_linkmode_lpa_t
  * @advertising: target the linkmode advertisement settings
  * @adv: value of the MII_STAT1000 register
  *
  * A small helper function that translates MII_STAT1000 bits, when in
- * 1000Base-T mode, to linkmode advertisement settings.
+ * 1000Base-T mode, to linkmode advertisement settings. Other bits in
+ * advertising are not changes.
  */
-static inline void mii_stat1000_to_linkmode_lpa_t(unsigned long *advertising,
- u32 lpa)
+static inline void mii_stat1000_mod_linkmode_lpa_t(unsigned long *advertising,
+  u32 lpa)
 {
-   if (lpa & LPA_1000HALF)
-   linkmode_set_bit(ETHTOOL_LINK_MODE_1000baseT_Half_BIT,
-advertising);
-   if (lpa & LPA_1000FULL)
-   linkmode_set_bit(ETHTOOL_LINK_MODE_1000baseT_Full_BIT,
-advertising);
+   linkmode_mod_bit(ETHTOOL_LINK_MODE_1000baseT_Half_BIT,
+advertising, lpa & LPA_1000HALF);
+
+   linkmode_mod_bit(ETHTOOL_LINK_MODE_1000baseT_Full_BIT,
+advertising, lpa & LPA_1000FULL);
 }
 
 /**
-- 
2.19.1



[PATCH net-next 5/6] net: mii: mii_lpa_mod_linkmode_lpa_t: Make use of linkmode_mod_bit helper

2018-12-05 Thread Andrew Lunn
Replace the if else code structure with a call to the helper
linkmode_mod_bit.

Signed-off-by: Andrew Lunn 
---
 include/linux/mii.h | 8 ++--
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/include/linux/mii.h b/include/linux/mii.h
index e72447778a08..6fee8b1a4400 100644
--- a/include/linux/mii.h
+++ b/include/linux/mii.h
@@ -451,12 +451,8 @@ static inline void mii_lpa_mod_linkmode_lpa_t(unsigned 
long *lp_advertising,
 {
mii_adv_mod_linkmode_adv_t(lp_advertising, lpa);
 
-   if (lpa & LPA_LPACK)
-   linkmode_set_bit(ETHTOOL_LINK_MODE_Autoneg_BIT,
-lp_advertising);
-   else
-   linkmode_clear_bit(ETHTOOL_LINK_MODE_Autoneg_BIT,
-  lp_advertising);
+   linkmode_mod_bit(ETHTOOL_LINK_MODE_Autoneg_BIT,
+lp_advertising, lpa & LPA_LPACK);
 }
 
 /**
-- 
2.19.1



[PATCH net-next 6/6] net: phy: Fix ioctl handler when modifing MII_ADVERTISE

2018-12-05 Thread Andrew Lunn
When the MII_ADVERTISE register is modified by the IOCTL handler,
phydev->advertising needs recalculating. Use the _mod_ variant of
mii_adv_to_linkmode_adv_t so that bits outside of the advertise
registers are not cleared.

Fixes: c0ec3c273677 ("net: phy: Convert u32 phydev->lp_advertising to linkmode")
Reported-by: Heiner Kallweit 
Signed-off-by: Andrew Lunn 
---
 drivers/net/phy/phy.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index e1a1e54baac2..e24708f1fc16 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -437,8 +437,8 @@ int phy_mii_ioctl(struct phy_device *phydev, struct ifreq 
*ifr, int cmd)
}
break;
case MII_ADVERTISE:
-   mii_adv_to_linkmode_adv_t(phydev->advertising,
- val);
+   mii_adv_mod_linkmode_adv_t(phydev->advertising,
+  val);
change_autoneg = true;
break;
default:
-- 
2.19.1



[PATCH net-next 1/6] net: mii: Fix autoneg in mii_lpa_to_linkmode_lpa_t()

2018-12-05 Thread Andrew Lunn
mii_adv_to_linkmode_adv_t() clears all bits before setting it needs to
set. This means the freshly set Autoneg gets cleared.

Change the order, and add comments about it clearing the old content
of the bitmap.

Reported-by: Heiner Kallweit 
Fixes: c0ec3c273677 ("net: phy: Convert u32 phydev->lp_advertising to linkmode")
Signed-off-by: Andrew Lunn 
---
 include/linux/mii.h | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/include/linux/mii.h b/include/linux/mii.h
index fb7ae4ae8ce3..57365224306c 100644
--- a/include/linux/mii.h
+++ b/include/linux/mii.h
@@ -378,7 +378,8 @@ static inline u32 mii_lpa_to_ethtool_lpa_x(u32 lpa)
  * @adv: value of the MII_ADVERTISE register
  *
  * A small helper function that translates MII_ADVERTISE bits
- * to linkmode advertisement settings.
+ * to linkmode advertisement settings. Clears the old value
+ * of advertising.
  */
 static inline void mii_adv_to_linkmode_adv_t(unsigned long *advertising,
 u32 adv)
@@ -408,16 +409,18 @@ static inline void mii_adv_to_linkmode_adv_t(unsigned 
long *advertising,
  * @adv: value of the MII_LPA register
  *
  * A small helper function that translates MII_LPA bits, when in
- * 1000Base-T mode, to linkmode LP advertisement settings.
+ * 1000Base-T mode, to linkmode LP advertisement settings. Clears the
+ * old value of advertising
  */
 static inline void mii_lpa_to_linkmode_lpa_t(unsigned long *lp_advertising,
 u32 lpa)
 {
+   mii_adv_to_linkmode_adv_t(lp_advertising, lpa);
+
if (lpa & LPA_LPACK)
linkmode_set_bit(ETHTOOL_LINK_MODE_Autoneg_BIT,
 lp_advertising);
 
-   mii_adv_to_linkmode_adv_t(lp_advertising, lpa);
 }
 
 /**
-- 
2.19.1



[PATCH net-next 3/6] phy: marvell: Rename mii_lpa_to_linkmode_lpa_t

2018-12-05 Thread Andrew Lunn
Rename mii_lpa_to_linkmode_lpa_t to mii_lpa_mod_linkmode_lpa_t to
indicate it modifies the passed linkmode bitmap, without clearing any
other bits.

Also, ensure bit are clear which the lpa indicates should not be set.

Suggested-by: Heiner Kallweit 
Fixes: c0ec3c273677 ("net: phy: Convert u32 phydev->lp_advertising to linkmode")
Signed-off-by: Andrew Lunn 
---
 drivers/net/phy/marvell.c | 22 +++---
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/drivers/net/phy/marvell.c b/drivers/net/phy/marvell.c
index 03dafe0e68a2..a9c7c7f41b0c 100644
--- a/drivers/net/phy/marvell.c
+++ b/drivers/net/phy/marvell.c
@@ -1047,21 +1047,21 @@ static int m88e1145_config_init(struct phy_device 
*phydev)
 }
 
 /**
- * fiber_lpa_to_linkmode_lpa_t
+ * fiber_lpa_mod_linkmode_lpa_t
  * @advertising: the linkmode advertisement settings
  * @lpa: value of the MII_LPA register for fiber link
  *
- * A small helper function that translates MII_LPA
- * bits to linkmode LP advertisement settings.
+ * A small helper function that translates MII_LPA bits to linkmode LP
+ * advertisement settings. Other bits in advertising are left
+ * unchanged.
  */
-static void fiber_lpa_to_linkmode_lpa_t(unsigned long *advertising, u32 lpa)
+static void fiber_lpa_mod_linkmode_lpa_t(unsigned long *advertising, u32 lpa)
 {
-   if (lpa & LPA_FIBER_1000HALF)
-   linkmode_set_bit(ETHTOOL_LINK_MODE_1000baseT_Half_BIT,
-advertising);
-   if (lpa & LPA_FIBER_1000FULL)
-   linkmode_set_bit(ETHTOOL_LINK_MODE_1000baseT_Full_BIT,
-advertising);
+   linkmode_mod_bit(ETHTOOL_LINK_MODE_1000baseT_Half_BIT,
+advertising, lpa & LPA_FIBER_1000HALF);
+
+   linkmode_mod_bit(ETHTOOL_LINK_MODE_1000baseT_Full_BIT,
+advertising, lpa & LPA_FIBER_1000FULL);
 }
 
 /**
@@ -1146,7 +1146,7 @@ static int marvell_read_status_page_an(struct phy_device 
*phydev,
}
} else {
/* The fiber link is only 1000M capable */
-   fiber_lpa_to_linkmode_lpa_t(phydev->lp_advertising, lpa);
+   fiber_lpa_mod_linkmode_lpa_t(phydev->lp_advertising, lpa);
 
if (phydev->duplex == DUPLEX_FULL) {
if (!(lpa & LPA_PAUSE_FIBER)) {
-- 
2.19.1



[PATCH net-next 0/6] u32 to linkmode fixes

2018-12-05 Thread Andrew Lunn
This patchset fixes issues found in the last patchset which converted
the phydev advertise etc, from a u32 to a linux bitmap. Most of the
issues are the result of clearing bits which should not of been
cleared. To make the API clearer, the idea from Heiner Kallweit was
used, with _mod_ to indicate the function modifies just the bits it
needs to, or _to_ to clear all bits and just set bit that need to be
set.

Andrew Lunn (6):
  net: mii: Fix autoneg in mii_lpa_to_linkmode_lpa_t()
  net: mii: Rename mii_stat1000_to_linkmode_lpa_t
  phy: marvell: Rename mii_lpa_to_linkmode_lpa_t
  net: mii: Add mii_lpa_mod_linkmode_lpa_t
  net: mii: mii_lpa_mod_linkmode_lpa_t: Make use of linkmode_mod_bit
helper
  net: phy: Fix ioctl handler when modifing MII_ADVERTISE

 drivers/net/phy/marvell.c| 24 +-
 drivers/net/phy/marvell10g.c |  2 +-
 drivers/net/phy/phy.c|  4 +-
 drivers/net/phy/phy_device.c |  6 +--
 include/linux/linkmode.h |  9 
 include/linux/mii.h  | 93 +---
 6 files changed, 91 insertions(+), 47 deletions(-)

-- 
2.19.1



Re: DSA support for Marvell 88e6065 switch

2018-12-05 Thread Pavel Machek
Hi!

> > > > But I'm running into problems with tagging code, and I guess I'd like
> > > > some help understanding.
> > > > 
> > > > tag_trailer: allocates new skb, then copies data around.
> > > > 
> > > > tag_qca: does dev->stats.tx_packets++, and reuses existing skb.
> > > > 
> > > > tag_brcm: reuses existing skb.
> > 
> > Any idea why tag trailer allocates new skb,
> 
> I wrote this code over 10 years ago, so I don't remember all that
> well, but I think that it is because you have to do manual checksumming
> of the packet, as there's no way to pass down the stack that you don't
> want to checksum all the way down to the end of the data area (and you
> don't want the tag to be included in the checksum), and so you want to
> do that before you add the trailer tag, and you'll probably have to
> reallocate the data area to be able to add the tag, and you probably
> won't get an exclusive skb here anyway, so you might as well allocate
> a new one.

Ok, thanks. Whether tag is checksummed or not is indeed an interesting question.

> > and what is going on with dev->stats.tx_packets++?
> 
> trailer_xmit would be the hard_start_xmit function for the virtual
> (slave) network interface, so this would be the right thing to do?

Some tag_*.c do this and some do not, so I'm still confused.

> > 6065 indeed has some kind of "egress tagging mode" (with four
> > options), but I have trouble understanding what it really does.
> 
> What are the options?

"transmit unmodified", "transmit untagged", "transmit tagged" and
"transmit all frames double tagged".

Best regards,
Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


signature.asc
Description: Digital signature


Re: [PATCH net] macvlan: remove duplicate check

2018-12-05 Thread David Miller
From: Matteo Croce 
Date: Tue,  4 Dec 2018 18:05:42 +0100

> Following commit 59f997b088d2 ("macvlan: return correct error value"),
> there is a duplicate check for mac addresses both in macvlan_sync_address()
> and macvlan_set_mac_address().
> As the former calls the latter, remove the one in macvlan_set_mac_address()
> and move the one in macvlan_sync_address() before any other check.
> 
> Signed-off-by: Matteo Croce 

Hmmm, doesn't this change behavior?

For the handling of the NETDEV_CHANGEADDR event in macvlan_device_event()
we would make it to macvlan_sync_address(), and if IFF_UP is false,
we would elide the macvlan_addr_busy() check and just copy the MAC addres
over and return.

Now, we would always perform the macvlan_addr_busy() check.

Please, if this is OK, explain and document this behavioral chance in
the commit message.

Thank you.


Re: [PATCH 0/3] net: macb: DMA race condition fixes

2018-12-05 Thread David Miller
From: Anssi Hannula 
Date: Fri, 30 Nov 2018 20:21:34 +0200

> Here are a couple of race condition fixes for the macb driver. The first
> two are issues observed on real HW.

It looks like there is still an active discussion about the memory
barriers in patch #3 being excessive.

Once that is sorted out to everyone's satisfaction, would you
please repost this series with appropriate ACKs, reviewed-by's,
tested-by's, etc. added?

Thank you.


Re: [PATCH 1/3] net: macb: fix random memory corruption on RX with 64-bit DMA

2018-12-05 Thread David Miller
From: Anssi Hannula 
Date: Fri, 30 Nov 2018 20:21:35 +0200

> @@ -682,6 +682,11 @@ static void macb_set_addr(struct macb *bp, struct 
> macb_dma_desc *desc, dma_addr_
>   if (bp->hw_dma_cap & HW_DMA_CAP_64B) {
>   desc_64 = macb_64b_desc(bp, desc);
>   desc_64->addrh = upper_32_bits(addr);
> + /* The low bits of RX address contain the RX_USED bit, clearing
> +  * of which allows packet RX. Make sure the high bits are also
> +  * visible to HW at that point.
> +  */
> + dma_wmb();
>   }

I agree with that dma_wmb() is what should be used here.

We are ordering CPU stores with DMA visibility, which is exactly what
the dma_*() are for.

If it doesn't work properly on some architecture's implementation of dma_*(),
those should be fixed rather than papering over it in the drivers.


Re: [PATCH 1/2] net: linkwatch: send change uevent on link changes

2018-12-05 Thread David Miller
From: Jouke Witteveen 
Date: Wed, 5 Dec 2018 14:50:31 +0100

> For example, I maintain a network manager that delegates the actual
> networking work to specialized programs.

Basically "I've implemented things using separate programs"

> Basically, it is an implementation of network manager logic in shell
> script. For such a shell script, it is easy to respond to uevents
> (via udev, or alternatives), but responding to rtnetlink messages
> would require a separate program.

And "In order to use rtnetlink I'll need a separate program!"

(╯°□°)╯︵ ┻━┻

So it's ok to use the separate program paradigm for dividing up
the tasks, but not for processing events?

I'm not convinced.

Either use the facility we have or extend it to fill a valid missing
need.

I'm not applying these patches, your logic doesn't add up and it's
inconsistent with our clear goals of not duplicating functionality.


  1   2   >