Re: [PATCH net-next,v4 09/12] ethtool: add basic ethtool_rx_flow_spec to flow_rule structure translator

2018-12-05 Thread Michal Kubecek
On Wed, Dec 05, 2018 at 02:10:37PM +0100, Pablo Neira Ayuso wrote:
> On Thu, Nov 29, 2018 at 02:12:14PM +0100, Michal Kubecek wrote:
> > On Thu, Nov 29, 2018 at 03:22:28AM +0100, Pablo Neira Ayuso wrote:
> > ...
> > > + act = >rule->action.entries[0];
> > > + switch (fs->ring_cookie) {
> > > + case RX_CLS_FLOW_DISC:
> > > + act->id = FLOW_ACTION_DROP;
> > > + break;
> > > + case RX_CLS_FLOW_WAKE:
> > > + act->id = FLOW_ACTION_WAKE;
> > > + break;
> > > + default:
> > > + act->id = FLOW_ACTION_QUEUE;
> > > + act->queue_index = fs->ring_cookie;
> > > + break;
> > > + }
> > 
> > IMHO it would be cleaner if rules passing packets to RSS contexts were
> > also implemented using action, e.g. by using FLOW_ACTION_CONTEXT for
> > act->id and reusing act->queue_index for context id (using either an
> > anonymous union or more generic name).
> 
> Are refering to FLOW_RSS specifically? Or just rename in this action
> and field?

My idea is that currently the ethtool interface can apply four types of
action to matching packets

  - put into a specific queue
  - discard
  - distribute accross the queues according to a RSS context
  - use the rule as a wake-on-lan filter

What is confusing is that due to lack of extensibility of the ethtool
ioctl interface, third option is implemented by setting FLOW_RSS flag in
ethtool_rxnfc::fs.flow_type and passing the RSS context id in
ethtool_rxnfc::rss_context, i.e. outside struct ethtool_rx_flow_spec.
The context id would need to be passed to the conversion function in
addition to the struct ethtool_rx_flow_spec pointer.

I believe we should use this opportunity to handle this case in a way
which would match the other three. In particular by introducing another
value for act->id (e.g. FLOW_ACTION_CONTEXT) and passing the target
context id in the same structure (e.g. by renaming queue_index to
something more generic or by using an anonymous union with it).

Michal Kubecek


Re: [PATCH iproute2] iproute2: Installation errors without libnml

2018-12-03 Thread Michal Kubecek
On Mon, Dec 03, 2018 at 11:18:14AM +0100, Emeric Dupont wrote:
> When performing make install in iproute2 (current git master),
>  if $(HAVE_MNL) is not selected, some Makefiles try to call
>  install with an empty target, which causes a non-critical make error.
> 
> Signed-off-by: Emeric Dupont 
> ---

You have a typo in patch subject (should be "libmnl").

Michal Kubecek


Re: [PATCH net-next,v4 09/12] ethtool: add basic ethtool_rx_flow_spec to flow_rule structure translator

2018-11-29 Thread Michal Kubecek
On Thu, Nov 29, 2018 at 03:22:28AM +0100, Pablo Neira Ayuso wrote:
> This patch adds a function to translate the ethtool_rx_flow_spec
> structure to the flow_rule representation.
> 
> This allows us to reuse code from the driver side given that both flower
> and ethtool_rx_flow interfaces use the same representation.
> 
> This patch also includes support for FLOW_EXT and FLOW_MAC_EXT.
> 
> Signed-off-by: Pablo Neira Ayuso 
...
> diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
> index afd9596ce636..c76d1b34c9a2 100644
> --- a/include/linux/ethtool.h
> +++ b/include/linux/ethtool.h
...
> +
> +struct ethtool_rx_flow_key {
> + struct flow_dissector_key_basic basic;
> + union {
> + struct flow_dissector_key_ipv4_addrsipv4;
> + struct flow_dissector_key_ipv6_addrsipv6;
> + };
> + struct flow_dissector_key_ports tp;
> + struct flow_dissector_key_ipip;
> + struct flow_dissector_key_vlan  vlan;
> + struct flow_dissector_key_eth_addrs eth_addrs;
> +} __aligned(BITS_PER_LONG / 8); /* Ensure that we can do comparisons as 
> longs. */

How about __aligned(sizeof(long)) ?

...
> + act = >rule->action.entries[0];
> + switch (fs->ring_cookie) {
> + case RX_CLS_FLOW_DISC:
> + act->id = FLOW_ACTION_DROP;
> + break;
> + case RX_CLS_FLOW_WAKE:
> + act->id = FLOW_ACTION_WAKE;
> + break;
> + default:
> + act->id = FLOW_ACTION_QUEUE;
> + act->queue_index = fs->ring_cookie;
> + break;
> + }

IMHO it would be cleaner if rules passing packets to RSS contexts were
also implemented using action, e.g. by using FLOW_ACTION_CONTEXT for
act->id and reusing act->queue_index for context id (using either an
anonymous union or more generic name).

Another idea: I would prefer moving this translation from NIC drivers to
the generic ethtool code. That would mean providing new ethtool_ops
callbacks (with a plan to deprecate the old ones). But it's probably
something something to leave for later as there would be a lot of pain
for no actual gain right now. (The reason I'm thinking about this is
that it wouldn't feel right to translate netlink messages to the ethtool
structures only to translate those once again in the NIC driver.)

Michal Kubecek


Re: [RFC PATCH ethtool v2 00/23] ethtool netlink interface (userspace side) (WiP)

2018-11-08 Thread Michal Kubecek
On Wed, Nov 07, 2018 at 11:57:03PM +, woojung@microchip.com wrote:
> > -Original Message-
> > From: netdev-ow...@vger.kernel.org  On Behalf 
> > Of Michal
> > Kubecek
> > Sent: Monday, July 30, 2018 8:56 AM
> > To: netdev@vger.kernel.org
> > Cc: linux-ker...@vger.kernel.org; John W. Linville 
> > Subject: [RFC PATCH ethtool v2 00/23] ethtool netlink interface (userspace 
> > side) (WiP)
> > 
> > (The series is based on v4.17)
> Can you help me find exact version of version I can apply patch?
> v4.17 from either net-next.git or linux.git shows some failure while patching.

Based on the subject you quoted, you tried to apply the patch series for
ethtool (userspace utility). Kernel patch series was sent shortly before
this one with subject "[RFC PATCH net-next v2 00/17] ethtool netlink
interface (WiP)":

  http://patchwork.ozlabs.org/project/netdev/list/?series=58290=*

But if you want to try it, I would rather suggest updated version from

  https://github.com/mkubecek/ethnl

It has kernel series based on current net-next (this morning) and
ethtool series based on v4.19.

I would like to send a v3 to the list soon but in the last few weeks
I was quite busy with higher priority work.

Michal Kubecek


Re: [net-next 06/12] i40e/ixgbe/igb: fail on new WoL flag setting WAKE_MAGICSECURE

2018-11-07 Thread Michal Kubecek
On Thu, Nov 08, 2018 at 06:05:26AM +, Kevin Easton wrote:
> On Wed, Nov 07, 2018 at 02:48:24PM -0800, Jeff Kirsher wrote:
> > From: Todd Fujinaka 
> > 
> > There's a new flag for setting WoL filters that is only
> > enabled on one manufacturer's NICs, and it's not ours. Fail
> > with EOPNOTSUPP.
> > 
> > Signed-off-by: Todd Fujinaka 
> > Tested-by: Andrew Bowers 
> > Signed-off-by: Jeff Kirsher 
> > ---
> >  drivers/net/ethernet/intel/i40e/i40e_ethtool.c   | 3 ++-
> >  drivers/net/ethernet/intel/igb/igb_ethtool.c | 2 +-
> >  drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c | 3 ++-
> >  3 files changed, 5 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c 
> > b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
> > index 9f8464f80783..9c1211ad2c6b 100644
> > --- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
> > +++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
> > @@ -2377,7 +2377,8 @@ static int i40e_set_wol(struct net_device *netdev, 
> > struct ethtool_wolinfo *wol)
> > return -EOPNOTSUPP;
> >  
> > /* only magic packet is supported */
> > -   if (wol->wolopts && (wol->wolopts != WAKE_MAGIC))
> > +   if (wol->wolopts && (wol->wolopts != WAKE_MAGIC)
> > + | (wol->wolopts != WAKE_FILTER))
> > return -EOPNOTSUPP;
> 
> This doesn't look right.  WAKE_MAGIC and WAKE_FILTER are distinct, so
> 
> (wol->wolopts != WAKE_MAGIC) | (wol->wolopts != WAKE_FILTER)
> 
> will always be 1.

Right. Also, using "|" with logical values is rather confusing. While
the result works as expected, its priority is higher than priority of
&& (which would not be true for ||), making the code counterintuitive.

BtW, the patch subject is also wrong, the newly added flag it is dealing
with is WAKE_FILTER, not WAKE_MAGICSECURE.

> It looks like the existing test in this driver was fine - it *only*
> accepted wol->wolopts of either 0 or WAKE_MAGIC, it was already
> rejecting everything else including WAKE_FILTER.

Another way to write the check would be

if (wol->wolopts & ~WAKE_MAGIC)

> > diff --git a/drivers/net/ethernet/intel/igb/igb_ethtool.c 
> > b/drivers/net/ethernet/intel/igb/igb_ethtool.c
> > index 5acf3b743876..c57671068245 100644
> > --- a/drivers/net/ethernet/intel/igb/igb_ethtool.c
> > +++ b/drivers/net/ethernet/intel/igb/igb_ethtool.c
> > @@ -2113,7 +2113,7 @@ static int igb_set_wol(struct net_device *netdev, 
> > struct ethtool_wolinfo *wol)
> >  {
> > struct igb_adapter *adapter = netdev_priv(netdev);
> >  
> > -   if (wol->wolopts & (WAKE_ARP | WAKE_MAGICSECURE))
> > +   if (wol->wolopts & (WAKE_ARP | WAKE_MAGICSECURE | WAKE_FILTER))
> > return -EOPNOTSUPP;
> >  
> > if (!(adapter->flags & IGB_FLAG_WOL_SUPPORTED))

I would also suggest taking the opposite approach here, i.e. listing the
flags which _are_ supported so that we don't have to update the code if
another wol flag is added (or userspace sends an invalid one):

#define SUPPORTED_WOL_MODES \
(WAKE_PHY | WAKE_UCAST | WAKE_MCAST | WAKE_BCAST | WAKE_MAGIC)
...
if (wol->wolopts & ~SUPPORTED_WOL_MODES)
return -EOPNOTSUPP;


> > diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c 
> > b/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
> > index 732b1e6ecc43..acba067cc15a 100644
> > --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
> > +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
> > @@ -2206,7 +2206,8 @@ static int ixgbe_set_wol(struct net_device *netdev, 
> > struct ethtool_wolinfo *wol)
> >  {
> > struct ixgbe_adapter *adapter = netdev_priv(netdev);
> >  
> > -   if (wol->wolopts & (WAKE_PHY | WAKE_ARP | WAKE_MAGICSECURE))
> > +   if (wol->wolopts & (WAKE_PHY | WAKE_ARP | WAKE_MAGICSECURE |
> > +   WAKE_FILTER))
> > return -EOPNOTSUPP;
> >  
> > if (ixgbe_wol_exclusion(adapter, wol))

...and here.

Michal Kubecek


Re: [RFC PATCH ethtool] ethtool: better syntax for combinations of FEC modes

2018-10-04 Thread Michal Kubecek
On Thu, Oct 04, 2018 at 10:08:14AM -0400, John W. Linville wrote:
> Ping?
> 
> On Mon, Oct 01, 2018 at 02:59:10PM -0400, John W. Linville wrote:
> > Is this patch still RFC?

As far as I'm concerned, it can be taken as it is.

Michal Kubecek

> > 
> > On Wed, Sep 19, 2018 at 05:06:25PM +0100, Edward Cree wrote:
> > > Instead of commas, just have them as separate argvs.
> > > 
> > > The parsing state machine might look heavyweight, but it makes it easy to 
> > > add
> > >  more parameters later and distinguish parameter names from encoding 
> > > names.
> > > 
> > > Suggested-by: Michal Kubecek 
> > > Signed-off-by: Edward Cree 
> > > ---
> > >  ethtool.8.in   |  6 +++---
> > >  ethtool.c  | 63 
> > > --
> > >  test-cmdline.c | 10 +-
> > >  3 files changed, 25 insertions(+), 54 deletions(-)
> > > 
> > > diff --git a/ethtool.8.in b/ethtool.8.in
> > > index 414eaa1..7ea2cc0 100644
> > > --- a/ethtool.8.in
> > > +++ b/ethtool.8.in
> > > @@ -390,7 +390,7 @@ ethtool \- query or control network driver and 
> > > hardware settings
> > >  .B ethtool \-\-set\-fec
> > >  .I devname
> > >  .B encoding
> > > -.BR auto | off | rs | baser [ , ...]
> > > +.BR auto | off | rs | baser \ [...]
> > >  .
> > >  .\" Adjust lines (i.e. full justification) and hyphenate.
> > >  .ad
> > > @@ -1120,11 +1120,11 @@ current FEC mode, the driver or firmware must 
> > > take the link down
> > >  administratively and report the problem in the system logs for users to 
> > > correct.
> > >  .RS 4
> > >  .TP
> > > -.BR encoding\ auto | off | rs | baser [ , ...]
> > > +.BR encoding\ auto | off | rs | baser \ [...]
> > >  
> > >  Sets the FEC encoding for the device.  Combinations of options are 
> > > specified as
> > >  e.g.
> > > -.B auto,rs
> > > +.B encoding auto rs
> > >  ; the semantics of such combinations vary between drivers.
> > >  .TS
> > >  nokeep;
> > > diff --git a/ethtool.c b/ethtool.c
> > > index 9997930..2f7e96b 100644
> > > --- a/ethtool.c
> > > +++ b/ethtool.c
> > > @@ -4979,39 +4979,6 @@ static int fecmode_str_to_type(const char *str)
> > >   return 0;
> > >  }
> > >  
> > > -/* Takes a comma-separated list of FEC modes, returns the bitwise OR of 
> > > their
> > > - * corresponding ETHTOOL_FEC_* constants.
> > > - * Accepts repetitions (e.g. 'auto,auto') and trailing comma (e.g. 
> > > 'off,').
> > > - */
> > > -static int parse_fecmode(const char *str)
> > > -{
> > > - int fecmode = 0;
> > > - char buf[6];
> > > -
> > > - if (!str)
> > > - return 0;
> > > - while (*str) {
> > > - size_t next;
> > > - int mode;
> > > -
> > > - next = strcspn(str, ",");
> > > - if (next >= 6) /* Bad mode, longest name is 5 chars */
> > > - return 0;
> > > - /* Copy into temp buffer and nul-terminate */
> > > - memcpy(buf, str, next);
> > > - buf[next] = 0;
> > > - mode = fecmode_str_to_type(buf);
> > > - if (!mode) /* Bad mode encountered */
> > > - return 0;
> > > - fecmode |= mode;
> > > - str += next;
> > > - /* Skip over ',' (but not nul) */
> > > - if (*str)
> > > - str++;
> > > - }
> > > - return fecmode;
> > > -}
> > > -
> > >  static int do_gfec(struct cmd_context *ctx)
> > >  {
> > >   struct ethtool_fecparam feccmd = { 0 };
> > > @@ -5041,22 +5008,26 @@ static int do_gfec(struct cmd_context *ctx)
> > >  
> > >  static int do_sfec(struct cmd_context *ctx)
> > >  {
> > > - char *fecmode_str = NULL;
> > > + enum { ARG_NONE, ARG_ENCODING } state = ARG_NONE;
> > >   struct ethtool_fecparam feccmd;
> > > - struct cmdline_info cmdline_fec[] = {
> > > - { "encoding", CMDL_STR,  _str,  },
> > > - };
> > > - int changed;
> > > - int fecmode;
> > > - int rv;
> > > + int fecmode = 0, newmode;
> > > + int rv, i;
> > >  
> > > - parse_generic_cmdline(ctx, , cmdline_fec,
> > &g

Re: [PATCH net-next] geneve: fix ttl inherit type

2018-09-28 Thread Michal Kubecek
On Fri, Sep 28, 2018 at 09:09:58AM +0800, Hangbin Liu wrote:
> Phil pointed out that there is a mismatch between vxlan and geneve ttl
> inherit. We should define it as a flag and use nla_put_flag to export this
> opiton.
> 
> Fixes: 52d0d404d39dd ("geneve: add ttl inherit support")
> Reported-by: Phil Sutter 
> Signed-off-by: Hangbin Liu 
> ---
>  drivers/net/geneve.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
> index 6625fab..09ab2fd 100644
> --- a/drivers/net/geneve.c
> +++ b/drivers/net/geneve.c
> @@ -1100,7 +1100,7 @@ static const struct nla_policy 
> geneve_policy[IFLA_GENEVE_MAX + 1] = {
>   [IFLA_GENEVE_UDP_CSUM]  = { .type = NLA_U8 },
>   [IFLA_GENEVE_UDP_ZERO_CSUM6_TX] = { .type = NLA_U8 },
>   [IFLA_GENEVE_UDP_ZERO_CSUM6_RX] = { .type = NLA_U8 },
> - [IFLA_GENEVE_TTL_INHERIT]   = { .type = NLA_U8 },
> + [IFLA_GENEVE_TTL_INHERIT]   = { .type = NLA_FLAG },
>  };
>  
>  static int geneve_validate(struct nlattr *tb[], struct nlattr *data[],
> @@ -1582,7 +1582,7 @@ static size_t geneve_get_size(const struct net_device 
> *dev)
>   nla_total_size(sizeof(__u8)) + /* IFLA_GENEVE_UDP_CSUM */
>   nla_total_size(sizeof(__u8)) + /* IFLA_GENEVE_UDP_ZERO_CSUM6_TX 
> */
>   nla_total_size(sizeof(__u8)) + /* IFLA_GENEVE_UDP_ZERO_CSUM6_RX 
> */
> - nla_total_size(sizeof(__u8)) + /* IFLA_GENEVE_TTL_INHERIT */
> + nla_total_size(0) + /* IFLA_GENEVE_TTL_INHERIT */
>   0;
>  }
>  
> @@ -1636,7 +1636,7 @@ static int geneve_fill_info(struct sk_buff *skb, const 
> struct net_device *dev)
>   goto nla_put_failure;
>  #endif
>  
> - if (nla_put_u8(skb, IFLA_GENEVE_TTL_INHERIT, ttl_inherit))
> + if (ttl_inherit && nla_put_flag(skb, IFLA_GENEVE_TTL_INHERIT))
>   goto nla_put_failure;
>  
>   return 0;

Is it desirable to switch to a flag? If I read geneve_changelink() and
geneve_nl2info() correctly, it allows you to set the ttl_inherit flag
for an existing device but doesn't allow you to clear it. With NLA_U8,
you could distinguish three cases: set the flag (non-zero value), clear
the flag (zero value) and preserve current state (attribute not
present).

The same problem exists for vxlan but vxlan code intentionally disallows
changing the flag value for an existing device (I'm not sure if it's
because it's really impossible or just due to limits of the interface).
Unfortunately it has been already released with NLA_FLAG in 4.18,
AFAICS, so we have to live with it. But it's not too late for geneve.

Michal Kubecek


Re: [PATCH net v2 1/2] net: core: add member wol_enabled to struct net_device

2018-09-25 Thread Michal Kubecek
(I wrote my comment to v1 because I overlooked there is a v2 already;
duplicating it here.)

On Mon, Sep 24, 2018 at 09:58:59PM +0200, Heiner Kallweit wrote:
> Add flag wol_enabled to struct net_device indicating whether
> Wake-on-LAN is enabled. As first user phy_suspend() will use it to
> decide whether PHY can be suspended or not.
> 
> Fixes: f1e911d5d0df ("r8169: add basic phylib support")
> Fixes: e8cfd9d6c772 ("net: phy: call state machine synchronously in phy_stop")
> Signed-off-by: Heiner Kallweit 
> ---
>  include/linux/netdevice.h | 3 +++
>  net/core/ethtool.c| 9 -
>  2 files changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 1cbbf77a6..f5f1f1450 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1756,6 +1756,8 @@ enum netdev_priv_flags {
>   *   switch driver and used to set the phys state of the
>   *   switch port.
>   *
> + *   @wol_enabled:   Wake-on-LAN is enabled
> + *
>   *   FIXME: cleanup struct net_device such that network protocol info
>   *   moves out.
>   */
> @@ -2039,6 +2041,7 @@ struct net_device {
>   struct lock_class_key   *qdisc_tx_busylock;
>   struct lock_class_key   *qdisc_running_key;
>   boolproto_down;
> + unsignedwol_enabled:1;
>  };
>  #define to_net_dev(d) container_of(d, struct net_device, dev)
>  

As there is no bitfield yet, this would add 4 bytes to struct net_device.
How about using a bit in net_device::priv_flags like IFF_RXFH_CONFIGURED
in ethtool_set_rxfh_indir() and ethtool_set_rxfh()?

Michal Kubecek


Re: [PATCH net 1/2] net: core: add member wol_enabled to struct net_device

2018-09-25 Thread Michal Kubecek
On Mon, Sep 24, 2018 at 08:09:48PM +0200, Heiner Kallweit wrote:
> Add flag wol_enabled to struct net_device indicating whether
> Wake-on-LAN is enabled. As first user phy_suspend() will use it to
> decide whether PHY can be suspended or not.
> 
> Fixes: f1e911d5d0df ("r8169: add basic phylib support")
> Fixes: e8cfd9d6c772 ("net: phy: call state machine synchronously in phy_stop")
> Signed-off-by: Heiner Kallweit 
> ---
>  include/linux/netdevice.h | 3 +++
>  net/core/ethtool.c| 9 -
>  2 files changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 1cbbf77a6..f5f1f1450 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1756,6 +1756,8 @@ enum netdev_priv_flags {
>   *   switch driver and used to set the phys state of the
>   *   switch port.
>   *
> + *   @wol_enabled:   Wake-on-LAN is enabled
> + *
>   *   FIXME: cleanup struct net_device such that network protocol info
>   *   moves out.
>   */
> @@ -2039,6 +2041,7 @@ struct net_device {
>   struct lock_class_key   *qdisc_tx_busylock;
>   struct lock_class_key   *qdisc_running_key;
>   boolproto_down;
> + unsignedwol_enabled:1;
>  };
>  #define to_net_dev(d) container_of(d, struct net_device, dev)
>  

As there is no bitfield yet, this would add 4 bytes to struct net_device.
How about using a bit in net_device::priv_flags like IFF_RXFH_CONFIGURED
in ethtool_set_rxfh_indir() and ethtool_set_rxfh()?

Michal Kubecek


Re: [RFC PATCH ethtool] ethtool: better syntax for combinations of FEC modes

2018-09-20 Thread Michal Kubecek
On Wed, Sep 19, 2018 at 05:06:25PM +0100, Edward Cree wrote:
> Instead of commas, just have them as separate argvs.
> 
> The parsing state machine might look heavyweight, but it makes it easy to add
>  more parameters later and distinguish parameter names from encoding names.
> 
> Suggested-by: Michal Kubecek 
> Signed-off-by: Edward Cree 
> ---

Looks good, thank you.

Reviewed-by: Michal Kubecek 


>  ethtool.8.in   |  6 +++---
>  ethtool.c  | 63 
> --
>  test-cmdline.c | 10 +-
>  3 files changed, 25 insertions(+), 54 deletions(-)
> 
> diff --git a/ethtool.8.in b/ethtool.8.in
> index 414eaa1..7ea2cc0 100644
> --- a/ethtool.8.in
> +++ b/ethtool.8.in
> @@ -390,7 +390,7 @@ ethtool \- query or control network driver and hardware 
> settings
>  .B ethtool \-\-set\-fec
>  .I devname
>  .B encoding
> -.BR auto | off | rs | baser [ , ...]
> +.BR auto | off | rs | baser \ [...]
>  .
>  .\" Adjust lines (i.e. full justification) and hyphenate.
>  .ad
> @@ -1120,11 +1120,11 @@ current FEC mode, the driver or firmware must take 
> the link down
>  administratively and report the problem in the system logs for users to 
> correct.
>  .RS 4
>  .TP
> -.BR encoding\ auto | off | rs | baser [ , ...]
> +.BR encoding\ auto | off | rs | baser \ [...]
>  
>  Sets the FEC encoding for the device.  Combinations of options are specified 
> as
>  e.g.
> -.B auto,rs
> +.B encoding auto rs
>  ; the semantics of such combinations vary between drivers.
>  .TS
>  nokeep;
> diff --git a/ethtool.c b/ethtool.c
> index 9997930..2f7e96b 100644
> --- a/ethtool.c
> +++ b/ethtool.c
> @@ -4979,39 +4979,6 @@ static int fecmode_str_to_type(const char *str)
>   return 0;
>  }
>  
> -/* Takes a comma-separated list of FEC modes, returns the bitwise OR of their
> - * corresponding ETHTOOL_FEC_* constants.
> - * Accepts repetitions (e.g. 'auto,auto') and trailing comma (e.g. 'off,').
> - */
> -static int parse_fecmode(const char *str)
> -{
> - int fecmode = 0;
> - char buf[6];
> -
> - if (!str)
> - return 0;
> - while (*str) {
> - size_t next;
> - int mode;
> -
> - next = strcspn(str, ",");
> - if (next >= 6) /* Bad mode, longest name is 5 chars */
> - return 0;
> - /* Copy into temp buffer and nul-terminate */
> - memcpy(buf, str, next);
> - buf[next] = 0;
> - mode = fecmode_str_to_type(buf);
> - if (!mode) /* Bad mode encountered */
> - return 0;
> - fecmode |= mode;
> - str += next;
> - /* Skip over ',' (but not nul) */
> - if (*str)
> - str++;
> - }
> - return fecmode;
> -}
> -
>  static int do_gfec(struct cmd_context *ctx)
>  {
>   struct ethtool_fecparam feccmd = { 0 };
> @@ -5041,22 +5008,26 @@ static int do_gfec(struct cmd_context *ctx)
>  
>  static int do_sfec(struct cmd_context *ctx)
>  {
> - char *fecmode_str = NULL;
> + enum { ARG_NONE, ARG_ENCODING } state = ARG_NONE;
>   struct ethtool_fecparam feccmd;
> - struct cmdline_info cmdline_fec[] = {
> - { "encoding", CMDL_STR,  _str,  },
> - };
> - int changed;
> - int fecmode;
> - int rv;
> + int fecmode = 0, newmode;
> + int rv, i;
>  
> - parse_generic_cmdline(ctx, , cmdline_fec,
> -   ARRAY_SIZE(cmdline_fec));
> -
> - if (!fecmode_str)
> + for (i = 0; i < ctx->argc; i++) {
> + if (!strcmp(ctx->argp[i], "encoding")) {
> + state = ARG_ENCODING;
> + continue;
> + }
> + if (state == ARG_ENCODING) {
> + newmode = fecmode_str_to_type(ctx->argp[i]);
> + if (!newmode)
> + exit_bad_args();
> + fecmode |= newmode;
> + continue;
> + }
>   exit_bad_args();
> + }
>  
> - fecmode = parse_fecmode(fecmode_str);
>   if (!fecmode)
>   exit_bad_args();
>  
> @@ -5265,7 +5236,7 @@ static const struct option {
> " [ all ]\n"},
>   { "--show-fec", 1, do_gfec, "Show FEC settings"},
>   { "--set-fec", 1, do_sfec, "Set FEC settings",
> -   " [ encoding auto|off|rs|baser ]\n"},
> +   " [ encoding 

Re: [PATCH ethtool] ethtool: support combinations of FEC modes

2018-09-19 Thread Michal Kubecek
On Wed, Sep 19, 2018 at 04:38:27PM +0100, Edward Cree wrote:
> On 19/09/18 15:41, Michal Kubecek wrote:
> > I'm sorry I didn't notice this before the patch was accepted but as it's
> > not in a release yet, maybe it's still not too late.
> >
> > Could I suggest to make the syntax consistent with other options?
> I didn't realise ethtool had any patterns to be consistent with ;)

Way too many, I must say. :-) That is why I wasn't happy about adding
another.

> > I mean rather than a comma separated list to use either
> >
> >   ethtool --set-fec  encoding enc1 enc2 ...
> but yes this looks fine to me, as long as we're reasonably confident that
>  we won't want to add new parameters (that might require determining
>  whether enc2 is an encoding or a parameter name) in the future, because
>  while the parsing wouldn't be impossible it might get ugly.

This problem already exists for "-s ... msglvl". In the parser for the
netlink series I introduced an "end of list" marker (tentatively "--")
for this purpose, perhaps that could be a way.

> I'll rustle up an RFC patch.

Thank you.

Michal Kubecek


Re: [PATCH ethtool] ethtool: support combinations of FEC modes

2018-09-19 Thread Michal Kubecek
On Wed, Sep 19, 2018 at 05:33:38PM +0200, Andrew Lunn wrote:
> > One loosely related question: how sure can we be that we are never going
> > to need more than 32 bits for FEC encodings? Is it something completely
> > hypothetical or is it something that could happen in the future?
> > 
> Hi Michal
> 
> Hopefully we have moved to a netlink socket by that time :-)

Actually, that's why I'm asking. :-)

> I recently found that EEE still uses a u32 for advertise link modes.
> We should fix that in the netlink API.

For EEE it felt obvious that arbitrary length bitmap is the way to go so
I used it (it's still u32 in ethtool_ops but that's easier to change
when needed).

For FEC I wasn't sure if it wouldn't be an overkill.

Michal Kubecek


Re: [PATCH ethtool] ethtool: support combinations of FEC modes

2018-09-19 Thread Michal Kubecek
On Wed, Sep 05, 2018 at 06:54:57PM +0100, Edward Cree wrote:
> Of the three drivers that currently support FEC configuration, two (sfc
>  and cxgb4[vf]) accept configurations with more than one bit set in the
>  feccmd.fec bitmask.  (The precise semantics of these combinations vary.)
> Thus, this patch adds the ability to specify such combinations through a
>  comma-separated list of FEC modes in the 'encoding' argument on the
>  command line.
> 
> Also adds --set-fec tests to test-cmdline.c, and corrects the man page
>  (the encoding argument is not optional) while updating it.
> 
> Signed-off-by: Edward Cree 
...
> +/* Takes a comma-separated list of FEC modes, returns the bitwise OR of their
> + * corresponding ETHTOOL_FEC_* constants.
> + * Accepts repetitions (e.g. 'auto,auto') and trailing comma (e.g. 'off,').
> + */
> +static int parse_fecmode(const char *str)
> +{
>   int fecmode = 0;
> + char buf[6];
>  
>   if (!str)
> - return fecmode;
> -
> - if (!strcasecmp(str, "auto"))
> - fecmode |= ETHTOOL_FEC_AUTO;
> - else if (!strcasecmp(str, "off"))
> - fecmode |= ETHTOOL_FEC_OFF;
> - else if (!strcasecmp(str, "rs"))
> - fecmode |= ETHTOOL_FEC_RS;
> - else if (!strcasecmp(str, "baser"))
> - fecmode |= ETHTOOL_FEC_BASER;
> + return 0;
> + while (*str) {
> + size_t next;
> + int mode;
>  
> + next = strcspn(str, ",");
> + if (next >= 6) /* Bad mode, longest name is 5 chars */
> + return 0;
> + /* Copy into temp buffer and nul-terminate */
> + memcpy(buf, str, next);
> + buf[next] = 0;
> + mode = fecmode_str_to_type(buf);
> + if (!mode) /* Bad mode encountered */
> + return 0;
> + fecmode |= mode;
> + str += next;
> + /* Skip over ',' (but not nul) */
> + if (*str)
> + str++;
> + }
>   return fecmode;
>  }
>  

I'm sorry I didn't notice this before the patch was accepted but as it's
not in a release yet, maybe it's still not too late.

Could I suggest to make the syntax consistent with other options? I mean 
rather than a comma separated list to use either

  ethtool --set-fec  encoding enc1 enc2 ...

(as we have for --reset) or

  ethtool --set-fec  encoding enc1 on|off enc2 on|off ...

(as we have e.g. for msglvl, -K or --set-eee)?

Second option seems more appropriate to me but it would require special
handling of the case when there is only one argument after "encoding" to
maintain backward compatibility with already released versions.

One loosely related question: how sure can we be that we are never going
to need more than 32 bits for FEC encodings? Is it something completely
hypothetical or is it something that could happen in the future?

Michal Kubecek


mlx5_core: null pointer dereference in mlx5_accel_tls_device_caps() (net-next kernel)

2018-09-14 Thread Michal Kubecek
tls.h:
68  return mdev->fpga->tls->caps;
   0x0005824c <+28>:ldr x0, [x19, #30968]

drivers/net/ethernet/mellanox/mlx5/core/accel/tls.c:
70  }
   0x00058250 <+32>:ldr x19, [sp, #16]
   0x00058254 <+36>:ldp x29, x30, [sp], #32

drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.h:
68  return mdev->fpga->tls->caps;
   0x00058258 <+40>:ldr x0, [x0, #80]

drivers/net/ethernet/mellanox/mlx5/core/accel/tls.c:
70  }
   0x0005825c <+44>:ldr w0, [x0, #20]
   0x00058260 <+48>:ret


so IIUC mdev->fpga is null (offset of tls in struct mlx5_fpga_device is
indeed 80 = 0x50).

The NIC is

  Model: "Mellanox MT27700 Family [ConnectX-4]"
  Vendor: pci 0x15b3 "Mellanox Technologies"
  Device: pci 0x1013 "MT27700 Family [ConnectX-4]"
  SubVendor: pci 0x15b3 "Mellanox Technologies"
  SubDevice: pci 0x0003 

Michal Kubecek


Re: misleading comment in netdevice.h?

2018-08-24 Thread Michal Kubecek
On Fri, Aug 24, 2018 at 06:58:38AM -0400, rpj...@crashcourse.ca wrote:
>   just pedantry here ... was perusing include/linux/netdevice.h, and in the
> declaration for struct net_device, the kerneldoc, one reads:
> 
>  *  @flags: Interface flags (a la BSD)
>  *  @priv_flags:Like 'flags' but invisible to userspace,
>  *  see if.h for the definitions
> 
> assuming "if.h" means include/uapi/linux/if.h, there is nothing in that
> file explaining the *private* flags; rather, the private flags appear
> to be explained earlier in netdevice.h -- it's the *normal* flags that
> seem to be defined in if.h.

The flag definitions were moved from include/uapi/linux/if.h to
include/linux/netdevice.h by commit 7aa98047df95 ("net: move net_device
priv_flags out from UAPI"). The comment is outdated - it was actually
written when the definitions were still in include/linux/if.h (before
the UAPI split). You may want to submit a patch fixing (or removing) it.

Michal Kubecek 


[PATCH ethtool] ethtool: document WoL filters option also in help message

2018-08-17 Thread Michal Kubecek
Commit eff0bb337223 ("ethtool: Add support for WAKE_FILTER (WoL using
filters)") added option "f" for wake on lan and documented it in man page
but not in the output of "ethtool --help".

Signed-off-by: Michal Kubecek 
---
 ethtool.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ethtool.c b/ethtool.c
index aa2bbe9e4c65..e8b7703293d2 100644
--- a/ethtool.c
+++ b/ethtool.c
@@ -5069,7 +5069,7 @@ static const struct option {
  " [ advertise %x ]\n"
  " [ phyad %d ]\n"
  " [ xcvr internal|external ]\n"
- " [ wol p|u|m|b|a|g|s|d... ]\n"
+ " [ wol p|u|m|b|a|g|s|f|d... ]\n"
  " [ sopass %x:%x:%x:%x:%x:%x ]\n"
  " [ msglvl %d | msglvl type on|off ... ]\n" },
{ "-a|--show-pause", 1, do_gpause, "Show pause options" },
-- 
2.18.0



Re: [PATCH stable 4.4 0/9] fix SegmentSmack in stable branch (CVE-2018-5390)

2018-08-16 Thread Michal Kubecek
On Thu, Aug 16, 2018 at 05:24:09PM +0200, Greg KH wrote:
> On Thu, Aug 16, 2018 at 02:33:56PM +0200, Michal Kubecek wrote:
> > 
> > Anyway, even at this rate, I only get ~10% of one core (Intel E5-2697).
> > 
> > What I can see, though, is that with current stable 4.4 code, modified
> > testcase which sends something like
> > 
> >   2:3, 3:4, ..., 3001:3002, 3003:3004, 3004:3005, ... 6001:6002, ...
> > 
> > I quickly eat 6 MB of memory for receive queue of one socket while
> > earlier 4.4 kernels only take 200-300 KB. I didn't test latest 4.4 with
> > Takashi's follow-up yet but I'm pretty sure it will help while
> > preserving nice performance when using the original segmentsmack
> > testcase (with increased packet ratio).
> 
> Ok, for now I've applied Takashi's fix to the 4.4 stable queue and will
> push out a new 4.4-rc later tonight.  Can everyone standardize on that
> and test and let me know if it does, or does not, fix the reported
> issues?

I did repeat the tests with Takashi's fix and the CPU utilization is
similar to what we have now, i.e. 3-5% with 10K pkt/s. I could still
saturate one CPU somewhere around 50K pkt/s but that already requires
2.75 MB/s (22 Mb/s) of throughput. (My previous tests with Mao Wenan's
changes in fact used lower speeds as the change from 128 to 1024 would
need to be done in two places.)

Where Takashi's patch does help is that it does not prevent collapsing
of ranges of adjacent segments with total length shorter than ~4KB. It
took more time to verify: it cannot be checked by watching the socket
memory consumption with ss as tcp_collapse_ofo_queue isn't called until
we reach the limits. So I needed to trace when and how tcp_collpse() is
called with both current stable 4.4 code and one with Takashi's fix.
  
> If not, we can go from there and evaluate this much larger patch
> series.  But let's try the simple thing first.

At high packet rates (say 30K pkt/s and more), we can still saturate the
CPU. This is also mentioned in the announcement with claim that switch
to rbtree based queue would be necessary to fully address that. My tests
seem to confirm that but I'm still not sure it is worth backporting
something as intrusive into stable 4.4.

Michal Kubecek


Re: [PATCH stable 4.4 0/9] fix SegmentSmack in stable branch (CVE-2018-5390)

2018-08-16 Thread Michal Kubecek
On Thu, Aug 16, 2018 at 08:05:50PM +0800, maowenan wrote:
> On 2018/8/16 19:39, Michal Kubecek wrote:
> > 
> > I suspect you may be doing something wrong with your tests. I checked
> > the segmentsmack testcase and the CPU utilization on receiving side
> > (with sending 10 times as many packets as default) went down from ~100%
> > to ~3% even when comparing what is in stable 4.4 now against older 4.4
> > kernel.
> 
> There seems no obvious problem when you send packets with default
> parameter in Segmentsmack POC, Which is also very related with your
> server's hardware configuration. Please try with below parameter to
> form OFO packets

I did and even with these (questionable, see below) changes, I did not
get more than 10% (of one core) by receiving ksoftirqd.

>   for (i = 0; i < 1024; i++)  // 128->1024
...
>   usleep(10*1000); // Adjust this and packet count to match the target!, 
> sleep 100ms->10ms

The comment in the testcase source suggests to do _one_ of these two
changes so that you generate 10 times as many packets as the original
testcase. You did both so that you end up sending 102400 packets per
second. With 55 byte long packets, this kind of attack requires at least
5.5 MB/s (44 Mb/s) of throughput. This is no longer a "low packet rate
DoS", I'm afraid.

Anyway, even at this rate, I only get ~10% of one core (Intel E5-2697).

What I can see, though, is that with current stable 4.4 code, modified
testcase which sends something like

  2:3, 3:4, ..., 3001:3002, 3003:3004, 3004:3005, ... 6001:6002, ...

I quickly eat 6 MB of memory for receive queue of one socket while
earlier 4.4 kernels only take 200-300 KB. I didn't test latest 4.4 with
Takashi's follow-up yet but I'm pretty sure it will help while
preserving nice performance when using the original segmentsmack
testcase (with increased packet ratio).

Michal Kubecek


Re: [PATCH stable 4.4 0/9] fix SegmentSmack in stable branch (CVE-2018-5390)

2018-08-16 Thread Michal Kubecek
On Thu, Aug 16, 2018 at 03:55:16PM +0800, maowenan wrote:
> On 2018/8/16 15:44, Michal Kubecek wrote:
> > On Thu, Aug 16, 2018 at 03:39:14PM +0800, maowenan wrote:
> >>
> >>
> >> On 2018/8/16 15:23, Michal Kubecek wrote:
> >>> On Thu, Aug 16, 2018 at 03:19:12PM +0800, maowenan wrote:
> >>>> On 2018/8/16 14:52, Michal Kubecek wrote:
> >>>>>
> >>>>> My point is that backporting all this into stable 4.4 is quite intrusive
> >>>>> so that if we can achieve similar results with a simple fix of an
> >>>>> obvious omission, it would be preferrable.
> >>>>
> >>>> There are five patches in mainline to fix this CVE, only two patches
> >>>> have no effect on stable 4.4, the important reason is 4.4 use simple
> >>>> queue but mainline use RB tree.
> >>>>
> >>>> I have tried my best to use easy way to fix this with dropping packets
> >>>> 12.5%(or other value) based on simple queue, but the result is not
> >>>> very well, so the RB tree is needed and tested result is my desire.
> >>>>
> >>>> If we only back port two patches but they don't fix the issue, I think
> >>>> they don't make any sense.
> >>>
> >>> There is an obvious omission in one of the two patches and Takashi's
> >>> patch fixes it. If his follow-up fix (applied on top of what is in
> >>> stable 4.4 now) addresses the problem, I would certainly prefer using it
> >>> over backporting the whole series.
> >>
> >> Do you mean below codes from Takashi can fix this CVE?
> >> But I have already tested like this two days ago, it is not good effect.
> > 
> > IIRC what you proposed was different, you proposed to replace the "=" in
> > the other branch by "+=".
> 
> No, I think you don't get what I mean, I have already tested stable 4.4,
> based on commit dc6ae4d, and change the codes like Takashi, which didn't
> contain any codes I have sent in this patch series.

I suspect you may be doing something wrong with your tests. I checked
the segmentsmack testcase and the CPU utilization on receiving side
(with sending 10 times as many packets as default) went down from ~100%
to ~3% even when comparing what is in stable 4.4 now against older 4.4
kernel.

This is actually not surprising. the testcase only sends 1-byte segments
starting at even offsets so that receiver never gets two adjacent
segments and the "range_truesize != head->truesize" condition would
never be satisfied, whether you fix the backport or not.

Where the missing "range_truesize += skb->truesize" makes a difference
is in the case when there are some adjacent out of order segments, e.g.
if you omitted 1:10 and then sent 10:11, 11:12, 12:13, 13:14, ...
Then the missing update of range_truesize would prevent collapsing the
queue until the total length of the range would exceed the value of
SKB_WITH_OVERHEAD(SK_MEM_QUANTUM) (i.e. a bit less than 4 KB).

Michal Kubecek



Re: [PATCH stable 4.4 0/9] fix SegmentSmack in stable branch (CVE-2018-5390)

2018-08-16 Thread Michal Kubecek
On Thu, Aug 16, 2018 at 03:39:14PM +0800, maowenan wrote:
> 
> 
> On 2018/8/16 15:23, Michal Kubecek wrote:
> > On Thu, Aug 16, 2018 at 03:19:12PM +0800, maowenan wrote:
> >> On 2018/8/16 14:52, Michal Kubecek wrote:
> >>>
> >>> My point is that backporting all this into stable 4.4 is quite intrusive
> >>> so that if we can achieve similar results with a simple fix of an
> >>> obvious omission, it would be preferrable.
> >>
> >> There are five patches in mainline to fix this CVE, only two patches
> >> have no effect on stable 4.4, the important reason is 4.4 use simple
> >> queue but mainline use RB tree.
> >>
> >> I have tried my best to use easy way to fix this with dropping packets
> >> 12.5%(or other value) based on simple queue, but the result is not
> >> very well, so the RB tree is needed and tested result is my desire.
> >>
> >> If we only back port two patches but they don't fix the issue, I think
> >> they don't make any sense.
> > 
> > There is an obvious omission in one of the two patches and Takashi's
> > patch fixes it. If his follow-up fix (applied on top of what is in
> > stable 4.4 now) addresses the problem, I would certainly prefer using it
> > over backporting the whole series.
> 
> Do you mean below codes from Takashi can fix this CVE?
> But I have already tested like this two days ago, it is not good effect.

IIRC what you proposed was different, you proposed to replace the "=" in
the other branch by "+=".

Michal Kubecek


> 
> Could you try to test with POC programme mentioned previous mail in case I 
> made mistake?
> 
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 4a261e078082..9c4c6cd0316e 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -4835,6 +4835,7 @@ static void tcp_collapse_ofo_queue(struct sock *sk)
>   end = TCP_SKB_CB(skb)->end_seq;
>   range_truesize = skb->truesize;
>   } else {
> +     range_truesize += skb->truesize;
>   if (before(TCP_SKB_CB(skb)->seq, start))
>   start = TCP_SKB_CB(skb)->seq;
>   if (after(TCP_SKB_CB(skb)->end_seq, end))
> -- 
> 
> 
> > 
> > Michal Kubecek
> > 
> > 
> > .
> > 
> 


Re: [PATCH stable 4.4 0/9] fix SegmentSmack in stable branch (CVE-2018-5390)

2018-08-16 Thread Michal Kubecek
On Thu, Aug 16, 2018 at 03:19:12PM +0800, maowenan wrote:
> On 2018/8/16 14:52, Michal Kubecek wrote:
> > 
> > My point is that backporting all this into stable 4.4 is quite intrusive
> > so that if we can achieve similar results with a simple fix of an
> > obvious omission, it would be preferrable.
> 
> There are five patches in mainline to fix this CVE, only two patches
> have no effect on stable 4.4, the important reason is 4.4 use simple
> queue but mainline use RB tree.
> 
> I have tried my best to use easy way to fix this with dropping packets
> 12.5%(or other value) based on simple queue, but the result is not
> very well, so the RB tree is needed and tested result is my desire.
> 
> If we only back port two patches but they don't fix the issue, I think
> they don't make any sense.

There is an obvious omission in one of the two patches and Takashi's
patch fixes it. If his follow-up fix (applied on top of what is in
stable 4.4 now) addresses the problem, I would certainly prefer using it
over backporting the whole series.

Michal Kubecek



Re: [PATCH stable 4.4 0/9] fix SegmentSmack in stable branch (CVE-2018-5390)

2018-08-16 Thread Michal Kubecek
On Thu, Aug 16, 2018 at 02:42:32PM +0800, maowenan wrote:
> On 2018/8/16 14:16, Michal Kubecek wrote:
> > On Thu, Aug 16, 2018 at 10:50:01AM +0800, Mao Wenan wrote:
> >> There are five patches to fix CVE-2018-5390 in latest mainline 
> >> branch, but only two patches exist in stable 4.4 and 3.18: 
> >> dc6ae4d tcp: detect malicious patterns in tcp_collapse_ofo_queue()
> >> 5fbec48 tcp: avoid collapses in tcp_prune_queue() if possible
> >> I have tested with stable 4.4 kernel, and found the cpu usage was very 
> >> high.
> >> So I think only two patches can't fix the CVE-2018-5390.
> >> test results:
> >> with fix patch: 78.2%   ksoftirqd
> >> withoutfix patch:   90% ksoftirqd
> >>
> >> Then I try to imitate 72cd43ba(tcp: free batches of packets in 
> >> tcp_prune_ofo_queue())
> >> to drop at least 12.5 % of sk_rcvbuf to avoid malicious attacks with 
> >> simple queue 
> >> instead of RB tree. The result is not very well.
> >>  
> >> After analysing the codes of stable 4.4, and debuging the 
> >> system, shows that search of ofo_queue(tcp ofo using a simple queue) cost 
> >> more cycles.
> >>
> >> So I try to backport "tcp: use an RB tree for ooo receive queue" using RB 
> >> tree 
> >> instead of simple queue, then backport Eric Dumazet 5 fixed patches in 
> >> mainline,
> >> good news is that ksoftirqd is turn to about 20%, which is the same with 
> >> mainline now.
> >>
> >> Stable 4.4 have already back port two patches, 
> >> f4a3313d(tcp: avoid collapses in tcp_prune_queue() if possible)
> >> 3d4bf93a(tcp: detect malicious patterns in tcp_collapse_ofo_queue())
> >> If we want to change simple queue to RB tree to finally resolve, we should 
> >> apply previous 
> >> patch 9f5afeae(tcp: use an RB tree for ooo receive queue.) firstly, but 
> >> 9f5afeae have many 
> >> conflicts with 3d4bf93a and f4a3313d, which are part of patch series from 
> >> Eric in 
> >> mainline to fix CVE-2018-5390, so I need revert part of patches in stable 
> >> 4.4 firstly, 
> >> then apply 9f5afeae, and reapply five patches from Eric.
> > 
> > There seems to be an obvious mistake in one of the backports. Could you
> > check the results with Takashi's follow-up fix submitted at
> > 
> >   http://lkml.kernel.org/r/20180815095846.7734-1-ti...@suse.de
> > 
> > (I would try myself but you don't mention what test you ran.)
> 
> I have backport RB tree in stable 4.4, function
> tcp_collapse_ofo_queue() has been refined, which keep the same with
> mainline, so it seems no problem when apply Eric's patch 3d4bf93a(tcp:
> detect malicious patterns in tcp_collapse_ofo_queue()).
> 
> I also noticed that range_truesize != head->truesize will be always
> false which mentioned in your URL, but this only based on stable 4.4's
> codes, If I applied RB tree's patch 9f5afeae(tcp: use an RB tree for
> ooo receive queue), and after apply 3d4bf93a,the codes should be
> range_truesize += skb->truesize, and range_truesize != head->truesize
> can be true.

My point is that backporting all this into stable 4.4 is quite intrusive
so that if we can achieve similar results with a simple fix of an
obvious omission, it would be preferrable.

Michal Kubecek


Re: [PATCH stable 4.4 0/9] fix SegmentSmack in stable branch (CVE-2018-5390)

2018-08-16 Thread Michal Kubecek
On Thu, Aug 16, 2018 at 10:50:01AM +0800, Mao Wenan wrote:
> There are five patches to fix CVE-2018-5390 in latest mainline 
> branch, but only two patches exist in stable 4.4 and 3.18: 
> dc6ae4d tcp: detect malicious patterns in tcp_collapse_ofo_queue()
> 5fbec48 tcp: avoid collapses in tcp_prune_queue() if possible
> I have tested with stable 4.4 kernel, and found the cpu usage was very high.
> So I think only two patches can't fix the CVE-2018-5390.
> test results:
> with fix patch: 78.2%   ksoftirqd
> withoutfix patch:   90% ksoftirqd
> 
> Then I try to imitate 72cd43ba(tcp: free batches of packets in 
> tcp_prune_ofo_queue())
> to drop at least 12.5 % of sk_rcvbuf to avoid malicious attacks with simple 
> queue 
> instead of RB tree. The result is not very well.
>  
> After analysing the codes of stable 4.4, and debuging the 
> system, shows that search of ofo_queue(tcp ofo using a simple queue) cost 
> more cycles.
> 
> So I try to backport "tcp: use an RB tree for ooo receive queue" using RB 
> tree 
> instead of simple queue, then backport Eric Dumazet 5 fixed patches in 
> mainline,
> good news is that ksoftirqd is turn to about 20%, which is the same with 
> mainline now.
> 
> Stable 4.4 have already back port two patches, 
> f4a3313d(tcp: avoid collapses in tcp_prune_queue() if possible)
> 3d4bf93a(tcp: detect malicious patterns in tcp_collapse_ofo_queue())
> If we want to change simple queue to RB tree to finally resolve, we should 
> apply previous 
> patch 9f5afeae(tcp: use an RB tree for ooo receive queue.) firstly, but 
> 9f5afeae have many 
> conflicts with 3d4bf93a and f4a3313d, which are part of patch series from 
> Eric in 
> mainline to fix CVE-2018-5390, so I need revert part of patches in stable 4.4 
> firstly, 
> then apply 9f5afeae, and reapply five patches from Eric.

There seems to be an obvious mistake in one of the backports. Could you
check the results with Takashi's follow-up fix submitted at

  http://lkml.kernel.org/r/20180815095846.7734-1-ti...@suse.de

(I would try myself but you don't mention what test you ran.)

Michal Kubecek


Re: Linux kernel error stack

2018-08-05 Thread Michal Kubecek
On Mon, Aug 06, 2018 at 01:15:37AM +0200, Florian Westphal wrote:
> Michal Kubecek  wrote:
> > Oops, exactly this issue was already discussed almost a year ago:
> > 
> >   http://lkml.kernel.org/r/20170824104824.2c318a0...@unicorn.suse.cz
> > 
> > But something more urgent came and I forgot to get back to it. :-(
> 
> I did not even remeber, thanks for the pointer.
> So I think best course of action is to update man page to clearly
> say this only works in postrouting and with udp, and is ONLY
> intended for working around old dhcp software.

As GSO for UDP is on its way to mainline, one might get into trouble
even with UDP if the rule is not specific enough.
 
> From kernel, have checkentry emit a warning
> when this is used for protocols other than udp,
> and make this a no-op for everything else.
> 
> Same for postrouting: warn from checkentry if the hook is something
> else, and have target function not do anything for !postrouting.
> 
> Does that make sense?

It sounds reasonable to me.

> I can work on a patch later this week if needed.

I'll be on vacation until Sunday so that I certainly won't get to it
earlier than next week.

Michal Kubecek


Re: Linux kernel error stack

2018-08-05 Thread Michal Kubecek
On Sun, Aug 05, 2018 at 11:03:42PM +0200, Florian Westphal wrote:
> Satish Patel  wrote:
> > Florian,
> > 
> > It seems those rules coming from here
> > https://github.com/openstack/openstack-ansible-os_neutron/blob/master/files/post-up-metadata-checksum
> 
> This is crazy, and, as you found, it doesn't even do what they seem to
> think it does.
> 
> I see no reason for these rules at all.

Oops, exactly this issue was already discussed almost a year ago:

  http://lkml.kernel.org/r/20170824104824.2c318a0...@unicorn.suse.cz

But something more urgent came and I forgot to get back to it. :-(

Michal Kubecek



Re: Fw: [Bug 199995] New: Ramdomly sent TCP Reset from Kernel with bonding mode "brodcast"

2018-06-08 Thread Michal Kubecek
On Fri, Jun 08, 2018 at 09:59:54AM -0700, Stephen Hemminger wrote:
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=15
> 
> Bug ID: 15
>Summary: Ramdomly sent TCP Reset from Kernel with bonding mode
> "brodcast"
> 
> after a dist upgrade from Ubuntu 17.10 (Kernel 4.13.x) to Ubuntu 18.04 (Kernel
> 4.15.0) I suffer from ramdomly generated TCP RST packets sent (presumably) by
> the Kernel 
> on a bonding device that uses bonding mode "brodcast" with 2 physical NICs.
> 
> With tcpdump/whireshark I can see that the kernel randomly sends TCP-RST
> packets after the SYN/ACK/ACK packet is received (see attached PCAP).
> This only happens if the kernel receives the initial SYN packet on both
> physical NICs (and therefore seeing it twice), before the connection is
> established by sending SYN/ACK.
> It's not happening in 100% of all cases and only, if the system can use two or
> more CPU cores/threads. With only one CPU available to the system, this
> behaviour is not reproducable.

I have seen similar report earlier from one of our customers running
SLE12 SP2 (kernel 4.4). The problem is that if duplicated SYN packet is
received on both slaves, these two copies can be processed by the
lockless listener simultaneously on different CPUs and each can reply by
SYNACK with different sequence number which results in a reset.

I tried to think of a way to prevent this race without losing the
performance gain of lockless listener but couldn't come with anything.
Eventually, I managed to persuade the customer that this setup (where
each packet is received twice under normal circumstances) is not what
broadcast mode was designed for (based on the description in
Documentation/networking/bonding.txt).

However, the lockless listener was introduced in 4.4 so it's not clear
why reporter started encountering this after an upgrade from 4.13 to
4.15.

Michal Kubecek


Re: [RFC net-next] ipv4: Don't promote secondaries when flushing addresses

2018-06-07 Thread Michal Kubecek
On Thu, Jun 07, 2018 at 02:35:39PM +0200, Phil Sutter wrote:
> Yes, I agree with Michal. IIRC, flushing a specific primary along with
> all it's secondaries from an interface is not even supported by
> iproute2, so no need to optimize for that I guess. OTOH, if your
> solution allowed to get rid of that nasty loop in ipaddr_flush(), I owe
> you one extra beer at the next occasion. :)

I'm afraid it will have to stay as fallback for older kernels not
supporting flush requests. But there would be no need to actually use
it. If we know RTM_DELADDR request for zero address is guaranteed to
fail with current and older kernels, we could do

  - use RTM_DELADDR with IFA_F_FLUSH and zero address
  - if it fails, get the list and run the loop

If not, it could still be

  - use RTM_DELADDR with IFA_F_FLUSH and zero address
  - get the list of addresses (empty if first step worked)
  - run the loop

Michal Kubecek


Re: [RFC net-next] ipv4: Don't promote secondaries when flushing addresses

2018-06-07 Thread Michal Kubecek
On Thu, Jun 07, 2018 at 12:13:01PM +0200, Jakub Sitnicki wrote:
> Promoting secondary addresses on address removal makes flushing all
> addresses from a device with 1000's of them slow. This is because we
> cannot take down the secondary addresses when we are removing the
> primary one, which would make it faster.
> 
> However, the userspace, when performing a flush, will in the end remove
> all the addresses regardless of secondary address promotion taking
> place. Unfortunately the kernel currently cannot distinguish between a
> single address removal and a flush of all addresses.
> 
> To help with this case introduce a IFA_F_FLUSH flag that can be used by
> userspace to signal that a removal operation is being done because of a
> flush. When the flag is set, don't bother with secondary address
> promotion as we expect that secondary addresses will be removed soon as
> well.

Unless you intend to use the flag to allow deleting a specific address
with its secondaries (overriding promote_secondaries), maybe it would
be more practical to go even further and delete all addresses on the
interface if IFA_F_FLUSH is set so that userspace could delete all
addresses with one request.

Michal Kubecek


Re: Unable to create ip alias on bridge interface

2018-05-29 Thread Michal Kubecek
On Tue, May 29, 2018 at 03:39:05PM +0530, Akshat Kakkar wrote:
> For following commands,
>   ip addr add 10.10.10.1/24 brd +  dev br0
>   ip addr add 10.10.10.2/24 brd +  dev br0
>   ip addr add 20.20.20.1/24 brd +  dev br0
>   ip addr add 20.20.20.2/24 brd +  dev br0
> 
> Both 10.10.10.1 and 20.20.20.1 becomes primary. Which one will be used
> as source IP?
> 
> Is it nextHop of route that will decide?

Unless you have an unusual routing setup, yes. When unsure, you can use
"ip route get ..." which should tell you which route and which source
address will be used.

> And what about communication in local subnet, say ping to 10.10.10.200
> and 20.20.20.200? Will source for both will change according to
> destination IP?

This is the same. Under "normal" circumstances, 10.10.10.1 will be used
for targets from 10.10.10.0/24 and 20.20.20.1 for targets from
20.20.20.0/24.

Michal Kubecek


Re: Unable to create ip alias on bridge interface

2018-05-29 Thread Michal Kubecek
On Mon, May 28, 2018 at 11:50:05PM +0530, Akshat Kakkar wrote:
> 1. How can this survive across reboots without having a custom script
> on boot up? Like some ifcfg file,etc.

Every reasonable distribution should provide a way to use more than one
address on an interface. But as there is no universal standard for
config files and their format, there is no universal standard for
listing multiple addresses either. You have to check the documentation
of your distribution.

> 2. is there a way to tell to make a given ip as primary, irrespective
> of order?

AFAIK there is no interface allowing to switch the primary address. It
only changes when you delete primary address and have promote_secondaries
enabled for the interface.

Also, don't forget that this primary/secondary distinction is only done
for addresses with the same range (which would create the same automatic
route), i.e. e.g. 10.0.1.42/24 and 10.0.1.43/24 but not if it's e.g.
10.0.1.42/24 and 10.0.2.42/24.

Michal Kubecek


Re: Is it possible to get device information via CMSG?

2018-05-28 Thread Michal Kubecek
On Sat, May 26, 2018 at 05:39:12AM -0400, Eric S. Raymond wrote:
> I'm trying to untangle some nasty code in the Mills implementation of
> NTP.  I could simplify it a lot if there were a way to query a packet
> to find out the name of the network interface it arrived on.  (At the
> moment the code has to iterate over all interfaces checking for
> traffic on each one just so it doesn't lose that information.)
> 
> This seems like the kind of thing the CMSG macros are intended to
> support, but I can't find anywhere a specification of what cmsg_level
> and cmsg_type values are valid and what their semantics are.
> 
> So I have two questions:
> 
> 1. Is there a cmsg_level/cmsg_type combination that will return the
> name of the device the packet arrived through?

Not name directly, AFAIK, but you can set SOL_IP / IP_PKTINFO (or
SOL_IPV6 / IPV6_RECVPKTINFO) socket option and get IP_PKTINFO
(IPV6_PKTINFO) message with recvmsg(). This will tell you incoming
interface index so that you can look the name up. See ip(7) or ipv6(7)
for format of the message (struct ip_pktinfo, struct in6_pktinfo).

However, I suspect that userspace application is not really interested
in incoming interface name but rather in destination address of the
incoming packet which is also provided in IP_PKTINFO / IPV6_PKTINFO
message. 

Michal Kubecek


Re: Unable to create ip alias on bridge interface

2018-05-28 Thread Michal Kubecek
On Mon, May 28, 2018 at 02:35:41PM +0530, Akshat Kakkar wrote:
> I am having a bridge named br0 having ports eno1 and eno2 as members.
> I have given IP to br0 as 10.10.10.1/24
> 
> Now I want to create alias on br0 as br0:1 and give IP as
> 10.10.10.2/24, but I am unable to.
> 
> I know, we can add multiple IPs to br0 using "ip addr" command, but I
> dont want to do it that way as I want all outgoing connections from
> br0 to take src ip as 10.10.10.1. I know by providing option of "src"
> in all routes, things can work but this looks more like a hack and
> less of a solution.

I don't understand. There are no actual aliases since kernel 2.2 and an
attempt to add "br0:1 with address 10.10.10.2/24" using ifconfig should
result in the same configuration as

  ip addr add 10.10.10.2/24 brd + label br0:1 dev br0

where the "label br0:1" part only adds a label which allows ifconfig to
see the new address.

As both addresses share the same range, you don't even have to worry
about source address as primary address (10.10.10.1 - or first one added
in general) will be used unless specified otherwise.

Michal Kubecek


Re: [PATCH net] sctp: fix the issue that flags are ignored when using kernel_connect

2018-05-22 Thread Michal Kubecek
On Sun, May 20, 2018 at 04:39:10PM +0800, Xin Long wrote:
> Now sctp uses inet_dgram_connect as its proto_ops .connect, and the flags
> param can't be passed into its proto .connect where this flags is really
> needed.
> 
> sctp works around it by getting flags from socket file in __sctp_connect.
> It works for connecting from userspace, as inherently the user sock has
> socket file and it passes f_flags as the flags param into the proto_ops
> .connect.
> 
> However, the sock created by sock_create_kern doesn't have a socket file,
> and it passes the flags (like O_NONBLOCK) by using the flags param in
> kernel_connect, which calls proto_ops .connect later.
> 
> So to fix it, this patch defines a new proto_ops .connect for sctp,
> sctp_inet_connect, which calls __sctp_connect() directly with this
> flags param. After this, the sctp's proto .connect can be removed.
> 
> Note that sctp_inet_connect doesn't need to do some checks that are not
> needed for sctp, which makes thing better than with inet_dgram_connect.
> 
> Suggested-by: Marcelo Ricardo Leitner <marcelo.leit...@gmail.com>
> Signed-off-by: Xin Long <lucien@gmail.com>

Reviewed-by: Michal Kubecek <mkube...@suse.cz>


Re: [PATCH] netfilter: properly initialize xt_table_info structure

2018-05-17 Thread Michal Kubecek
On Thu, May 17, 2018 at 10:44:42AM +0200, Greg Kroah-Hartman wrote:
> When allocating a xt_table_info structure, we should be clearing out the
> full amount of memory that was allocated, not just the "header" of the
> structure.  Otherwise odd values could be passed to userspace, which is
> not a good thing.
> 
> Cc: stable <sta...@vger.kernel.org>
> Signed-off-by: Greg Kroah-Hartman <gre...@linuxfoundation.org>
> ---
>  net/netfilter/x_tables.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
> index cb7cb300c3bc..a300e8252bb6 100644
> --- a/net/netfilter/x_tables.c
> +++ b/net/netfilter/x_tables.c
> @@ -1187,7 +1187,7 @@ struct xt_table_info *xt_alloc_table_info(unsigned int 
> size)
>   if (!info)
>   return NULL;
>  
> - memset(info, 0, sizeof(*info));
> + memset(info, 0, sz);
>   info->size = size;
>   return info;
>  }
> -- 
> 2.17.0
> 

Or we can replace kvmalloc() by kvzalloc() and remove the memset().

Michal Kubecek


Re: [iproute2] Bug#898840: Latest update breaks ip6 default gateway cli api

2018-05-16 Thread Michal Kubecek
On Wed, May 16, 2018 at 02:42:24PM +0100, Luca Boccassi wrote:
> Hans reported a regression in v4.16.0, ip route now requires -6 to be
> manually added when using v6 addresses while up to 4.15 it didn't, the
> commands quoted show the problem.
> 
> Bisecting shows that the following commit from Serhey introduced the
> problem:
> 
> 93fa12418dc6f5943692250244be303bb162175b
> utils: Always specify family and ->bytelen in get_prefix_1()
> 
> Could you please have a look when you have a moment? It's very easy to
> reproduce, and it breaks existing scripts and so on.

Fixed already:

--
mike@unicorn:~/work/git/iproute2> git --no-pager log --grep 93fa12418dc6
commit d42c7891d26e4d5616a55aac9fe10813767fcf9c
Author: David Ahern <dsah...@gmail.com>
Date:   Fri Apr 13 09:36:33 2018 -0700

utils: Do not reset family for default, any, all addresses

Thomas reported a change in behavior with respect to autodectecting
address families. Specifically, 'ip ro add default via fe80::1'
syntax was failing to treat fe80::1 as an IPv6 address as it did in
prior releases. The root causes appears to be a change in family when
the default keyword is parsed.

'default', 'any' and 'all' are relevant outside of AF_INET. Leave the
family arg as is for these when setting addr.

Fixes: 93fa12418dc6 ("utils: Always specify family and ->bytelen in 
get_prefix_1()")
Reported-by: Thomas Deutschmann <whi...@gentoo.org>
Signed-off-by: David Ahern <dsah...@gmail.com>
Cc: Serhey Popovych <serhe.popov...@gmail.com>
----------

Michal Kubecek



Re: Kernel panic on kernel-3.10.0-693.21.1.el7 in ndisc.h

2018-05-15 Thread Michal Kubecek
On Mon, May 14, 2018 at 02:06:11PM -0700, Stephen Hemminger wrote:
> On Mon, 14 May 2018 21:29:03 +0300 Roman Makhov <roman.mak...@gmail.com> 
> wrote:
> > Thank you for the answer.
> > Unfortunately CentOS goes with these dinosaurs.
> > So we will try to debug the problem in the current one and try to
> > reproduce on the latest kernel.
> 
> If you are stuck in old kernels, please bug the CentOs maintainers not
> upstream developers.

Actually, it's not even the dinosaur. Kernels from RHEL (and therefore
CentOS) or SLES differ from the original mainline version they are based
on quite a lot. It's possible that this bug didn't really exist in the
old 3.10 and is related to one of the backports added to CentOS (or
rather RHEL in this case) kernel. People outside of the distribution
have little idea what was backported and why so unless the issue can be
reproduced with mainline kernel they cannot be expected to help.

Michal Kubecek


[PATCH ethtool] ethtool: fix stack clash in do_get_phy_tunable and do_set_phy_tunable

2018-05-09 Thread Michal Kubecek
Users reported stack clash detected when using --get-phy-tunable on
ppc64le. Problem is caused by local variable ds of type struct
ethtool_tunable which has last member "void *data[0]". Accessing data[0]
(as do_get_phy_tunable() does) or adding requested value at the end (which
is what kernel ioctl does) writes past allocated space for the variable.

Make ds part of an anonymous structure to make sure there is enough space
for tunable value and drop the (pointless) access to ds.data[0]. The same
problem also exists in do_set_phy_tunable().

Fixes: b0fe96dec90f ("Ethtool: Implements 
ETHTOOL_PHY_GTUNABLE/ETHTOOL_PHY_STUNABLE and PHY downshift")
Signed-off-by: Michal Kubecek <mkube...@suse.cz>
---
 ethtool.c | 39 +--
 1 file changed, 21 insertions(+), 18 deletions(-)

diff --git a/ethtool.c b/ethtool.c
index 3289e0f6e8ec..2e873848eb4e 100644
--- a/ethtool.c
+++ b/ethtool.c
@@ -4740,20 +4740,22 @@ static int do_get_phy_tunable(struct cmd_context *ctx)
}
 
if (downshift_changed) {
-   struct ethtool_tunable ds;
+   struct {
+   struct ethtool_tunable ds;
+   u8 __count;
+   } cont;
u8 count = 0;
 
-   ds.cmd = ETHTOOL_PHY_GTUNABLE;
-   ds.id = ETHTOOL_PHY_DOWNSHIFT;
-   ds.type_id = ETHTOOL_TUNABLE_U8;
-   ds.len = 1;
-   ds.data[0] = 
-   err = send_ioctl(ctx, );
+   cont.ds.cmd = ETHTOOL_PHY_GTUNABLE;
+   cont.ds.id = ETHTOOL_PHY_DOWNSHIFT;
+   cont.ds.type_id = ETHTOOL_TUNABLE_U8;
+   cont.ds.len = 1;
+   err = send_ioctl(ctx, );
if (err < 0) {
perror("Cannot Get PHY downshift count");
return 87;
}
-   count = *((u8 *)[0]);
+   count = *((u8 *)[0]);
if (count)
fprintf(stdout, "Downshift count: %d\n", count);
else
@@ -4931,16 +4933,17 @@ static int do_set_phy_tunable(struct cmd_context *ctx)
 
/* Do it */
if (ds_changed) {
-   struct ethtool_tunable ds;
-   u8 count;
-
-   ds.cmd = ETHTOOL_PHY_STUNABLE;
-   ds.id = ETHTOOL_PHY_DOWNSHIFT;
-   ds.type_id = ETHTOOL_TUNABLE_U8;
-   ds.len = 1;
-   ds.data[0] = 
-   *((u8 *)[0]) = ds_cnt;
-   err = send_ioctl(ctx, );
+   struct {
+   struct ethtool_tunable ds;
+   u8 __count;
+   } cont;
+
+   cont.ds.cmd = ETHTOOL_PHY_STUNABLE;
+   cont.ds.id = ETHTOOL_PHY_DOWNSHIFT;
+   cont.ds.type_id = ETHTOOL_TUNABLE_U8;
+   cont.ds.len = 1;
+   *((u8 *)[0]) = ds_cnt;
+   err = send_ioctl(ctx, );
if (err < 0) {
perror("Cannot Set PHY downshift count");
err = 87;
-- 
2.16.3



Re: Failed to clone net-next.git

2018-05-09 Thread Michal Kubecek
On Tue, May 08, 2018 at 08:04:42PM -0400, David Miller wrote:
> From: Song Liu <songliubrav...@fb.com>
> Date: Tue, 8 May 2018 17:46:23 +
> 
> > We are seeing the following error on multiple different systems while 
> > cloning net-next tree. 
> > 
> >   $ git clone 
> > https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git
> >   Cloning into 'net-next'...
> 
> Regardless of the failure, it is so _insanely_ wasteful to clone my
> trees like this.
> 
> Just simply always have Linus's tree always checked out somewhere:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> 
> Let's say you have it under src/GIT/linux as I do.
> 
> Then go to src/GIT and say:
> 
>   git clone --reference linux/.git 
> https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git
> 
> This way it only downloads the objects that are unique to the net-next
> tree.  Similarly for 'net':
> 
>   git clone --reference linux/.git 
> https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git
> 
> Or any other subsystem tree.
> 
> Periodically "git pull --ff-only" your Linus's tree, and you'll be
> much happier in GIT land :-)
> 
> As subsystem changes make their way into Linus's GIT tree, git will
> notice over time and garbage collect the dups that are in your
> subsystem GIT trees.

With reasonably recent git (>= 2.5), another option worth considering is
using worktrees:

  cd /src/GIT/linux
  git remote add --no-tags net-next 
git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git
  git branch --track net-next net-next/master
  git worktree add ../net-next net-next

It works well for me, the only limitation to keep in mind is that you
cannot have the same branch checked out in two different worktrees at
the same time.

Michal Kubecek



Re: Silently dropped UDP packets on kernel 4.14

2018-05-03 Thread Michal Kubecek
On Thu, May 03, 2018 at 07:03:45AM +0200, Florian Westphal wrote:
> Kristian Evensen <kristian.even...@gmail.com> wrote:
> > I went for the early-insert approached and have patched
> 
> I'm sorry for suggesting that.
> 
> It doesn't work, because of NAT.
> NAT rewrites packet content and changes the reply tuple, but the tuples
> determine the hash insertion location.
> 
> I don't know how to solve this problem.

It's an old problem which surfaces from time to time when some special
conditions make it more visible. When I was facing it in 2015, I found
this thread from as early as 2009:

  https://www.spinics.net/lists/linux-net/msg16712.html

In our case, the customer was using IPVS in "one packet scheduling" mode
(it drops the conntrack entry after each packet) which increased the
probability of insert collisions significantly. Using NFQUEUE 

We were lucky, though, as it turned out the only reason why customer
needed connection tracking was to make sure fragments of long UDP
datagrams are not sent to different real servers. For newer kernels
after commit 6aafeef03b9d ("netfilter: push reasm skb through instead of
original frag skbs"), this was no longer necessary so that they could
disable connection tracking for these packets.

For older kernels without this change, I tried several ideas, each of
which didn't work for some reason. We ended up with rather hacky
workaround, not dropping the packet on collision (so that its conntrack
wasn't inserted into the table and was dropped once the packet was
sent). It worked fine for our customer but like the early insert
approach, it wouldn't work with NAT.

One of the ideas I had was this:

  - keep also unconfirmed conntracks in some data structure
  - check new packets also against unconfirmed conntracks
  - if it matches an unconfirmed conntrack, defer its processing
until that conntrack is either inserted or discarded

But as it would be rather complicated to implement without races and
harming performance, I didn't want to actually try it until I would
run out of other ideas. With NAT coming to the play, there doesn't seem
to be many other options.

Michal Kubecek


Re: non-blocking connect for kernel SCTP sockets

2018-05-02 Thread Michal Kubecek
On Wed, May 02, 2018 at 05:46:23PM +0800, Xin Long wrote:
> On Wed, May 2, 2018 at 5:06 PM, Michal Kubecek <mkube...@suse.cz> wrote:
> > Hello,
> >
> > while investigating a bug, we noticed that DLM tries to connect an SCTP
> > socket in non-blocking mode using
> >
> > result = sock->ops->connect(sock, (struct sockaddr *), 
> > addr_len,
> > O_NONBLOCK);
> >
> > which does not work. The reason is that inet_dgram_connect() cannot pass
> > its flags argument to sctp_connect() so that __sctp_connect() which does
> > the actual waiting resorts to checking sk->sk_socket->file->f_flags
> > instead. As the socket used by DLM is a kernel socket with no associated
> > file, it ends up blocking.
> >
> > TCP doesn't suffer from this problem as for TCP, the waiting is done in
> > inet_stream_connect() which has the flags argument. I also checked other
> > proto::connect handlers and sctp_connect() seems to be the only one with
> > this kind of problem.
> >
> > This could be worked around in DLM and further experiments indicate
> > current DLM code wouldn't actually handle the non-blocking connect
> > properly. But I still feel ignoring the flags argument is rather a trap
> > that should be fixed.
> It is a bug, https://bugzilla.redhat.com/show_bug.cgi?id=1251530

Not authorized. :-)

> We have the fix which also includes some cleanup, and needs to do
> more testing.

OK, I'll wait for your submission.

> > I have prepared a series adding flags argument to proto::connect and
> > using it in sctp_connect() and __sctp_connect(). But I'm not sure if
> > it's not too big hammer to address issue only affecting one handler.
> > So my question is: would such generic approach be preferred or should we,
> > rather make SCTP work the way TCP does, i.e. move the waiting from,
> > proto::connect() to proto_ops::connect()? This would require introducing
> > inet_seqpacket_connect() as inet_dgram_connect() is primarily intended
> > for use with UDP.)
> We don't fix it in the generic proto::connect, which will afftect
> many other places.

That was my concern, too. On the other hand, the TCP specific waiting
code in inet_stream_connect() makes me wonder if it wouldn't be cleaner
to move it into the TCP specific handler as well (which is something
this approach would allow).

> We're replacing only sctp's proto_ops::connect with sctp_connect and
> leave its proto::connect as NULL, so that it can get this flags param
> without touching the generic struct and code.

Yes, that should do the trick (and makes backporting to distribution
kernels with frozen kABI much easier). I guess I was too fixed on the
split between proto_ops::connect and proto::connect to see this
solution.

Michal Kubecek


non-blocking connect for kernel SCTP sockets

2018-05-02 Thread Michal Kubecek
Hello,

while investigating a bug, we noticed that DLM tries to connect an SCTP
socket in non-blocking mode using

result = sock->ops->connect(sock, (struct sockaddr *), addr_len,
O_NONBLOCK);

which does not work. The reason is that inet_dgram_connect() cannot pass
its flags argument to sctp_connect() so that __sctp_connect() which does
the actual waiting resorts to checking sk->sk_socket->file->f_flags
instead. As the socket used by DLM is a kernel socket with no associated
file, it ends up blocking.

TCP doesn't suffer from this problem as for TCP, the waiting is done in
inet_stream_connect() which has the flags argument. I also checked other
proto::connect handlers and sctp_connect() seems to be the only one with
this kind of problem.

This could be worked around in DLM and further experiments indicate
current DLM code wouldn't actually handle the non-blocking connect
properly. But I still feel ignoring the flags argument is rather a trap
that should be fixed.

I have prepared a series adding flags argument to proto::connect and
using it in sctp_connect() and __sctp_connect(). But I'm not sure if
it's not too big hammer to address issue only affecting one handler.
So my question is: would such generic approach be preferred or should we
rather make SCTP work the way TCP does, i.e. move the waiting from
proto::connect() to proto_ops::connect()? This would require introducing
inet_seqpacket_connect() as inet_dgram_connect() is primarily intended
for use with UDP.)

Michal Kubecek


Re: [PATCH net-next] udp: disable gso with no_check_tx

2018-05-02 Thread Michal Kubecek
On Mon, Apr 30, 2018 at 03:58:36PM -0400, Willem de Bruijn wrote:
> From: Willem de Bruijn <will...@google.com>
> 
> Syzbot managed to send a udp gso packet without checksum offload into
> the gso stack by disabling tx checksum (UDP_NO_CHECK6_TX). This
> triggered the skb_warn_bad_offload.
> 
>   RIP: 0010:skb_warn_bad_offload+0x2bc/0x600 net/core/dev.c:2658
>skb_gso_segment include/linux/netdevice.h:4038 [inline]
>validate_xmit_skb+0x54d/0xd90 net/core/dev.c:3120
>__dev_queue_xmit+0xbf8/0x34c0 net/core/dev.c:3577
>dev_queue_xmit+0x17/0x20 net/core/dev.c:3618
> 
> UDP_NO_CHECK6_TX sets skb->ip_summed to CHECKSUM_NONE just after the
> udp gso integrity checks in udp_(v6_)send_skb. Extend those checks to
> catch and fail in this case.

Sounds rather familiar... perhaps we might want to check other
exceptions added to UFO over the years, some might apply here as well.

Michal Kubecek


Re: tcp hang when socket fills up ?

2018-04-17 Thread Michal Kubecek
On Tue, Apr 17, 2018 at 02:34:37PM +0200, Dominique Martinet wrote:
> Michal Kubecek wrote on Tue, Apr 17, 2018:
> > Data (21 bytes) packet in reply direction. And somewhere between the
> > first and second debugging print, we ended up with sender scale=0 and
> > that value is then preserved from now on.
> > 
> > The only place between the two debug prints where we could change only
> > one of the td_sender values are the two calls to tcp_options() but
> > neither should be called now unless I missed something. I'll try to
> > think about it some more.
> 
> Could it have something to do with the way I setup the connection?
> I don't think the "both remotes call connect() with carefully selected
> source/dest port" is a very common case..
> 
> If you look at the tcpdump outputs I attached the sequence usually is
> something like
>  server > client SYN
>  client > server SYN
>  server > client SYNACK
>  client > server ACK

This must be what nf_conntrack_proto_tcp.c calls "simultaneous open".

> ultimately it IS a connection, but with an extra SYN packet in front of
> it (that first SYN opens up the conntrack of the nat so that the
> client's syn can come in, the client's conntrack will be that of a
> normal connection since its first SYN goes in directly after the
> server's (it didn't see the server's SYN))
> 
> 
> Looking at my logs again, I'm seeing the same as you:
> 
> This looks like the actual SYN/SYN/SYNACK/ACK:
>  - 14.364090 seq=505004283 likely SYN coming out of server
>  - 14.661731 seq=1913287797 on next line it says receiver
> end=505004284 so likely the matching SYN from client
> Which this time gets a proper SYNACK from server:
> 14.662020 seq=505004283 ack=1913287798
> And following final dataless ACK:
> 14.687570 seq=1913287798 ack=505004284
> 
> Then as you point out some data ACK, where the scale poofs:
> 14.688762 seq=1913287798 ack=505004284+(0) sack=505004284+(0) win=229 
> end=1913287819
> 14.688793 tcp_in_window: sender end=1913287798 maxend=1913316998 maxwin=29312 
> scale=7 receiver end=505004284 maxend=505033596 maxwin=29200 scale=7
> 14.688824 tcp_in_window: 
> 14.688852 seq=1913287798 ack=505004284+(0) sack=505004284+(0) win=229 
> end=1913287819
> 14.62 tcp_in_window: sender end=1913287819 maxend=1913287819 maxwin=229 
> scale=0 receiver end=505004284 maxend=505033596 maxwin=29200 scale=7
> 
> As you say, only tcp_options() will clear only on side of the scales.
> We don't have sender->td_maxwin == 0 (printed) so I see no other way
> than we are in the last else if:
>  - we have after(end, sender->td_end) (end=1913287819 > sender
> end=1913287798)
>  - I assume the tcp state machine must be confused because of the
> SYN/SYN/SYNACK/ACK pattern and we probably enter the next check, 
> but since this is a data packet it doesn't have the tcp option for scale
> thus scale resets.

I agree that sender->td_maxwin is not zero so that the handshake above
probably left the conntrack in TCP_CONNTRACK_SYN_RECV state for some
reason. I'll try to go through the code with the pattern you mentioned
in mind.

Michal Kubecek


Re: tcp hang when socket fills up ?

2018-04-17 Thread Michal Kubecek
in_window: I=1 II=1 III=1 IV=1
14.688938 tcp_in_window: res=1 sender end=1913287819 maxend=1913287819 
maxwin=229 receiver end=505004284 maxend=505033596 maxwin=29200

Data (21 bytes) packet in reply direction. And somewhere between the
first and second debugging print, we ended up with sender scale=0 and
that value is then preserved from now on.

The only place between the two debug prints where we could change only
one of the td_sender values are the two calls to tcp_options() but
neither should be called now unless I missed something. I'll try to
think about it some more.

Michal Kubecek


Re: tcp hang when socket fills up ?

2018-04-17 Thread Michal Kubecek
On Mon, Apr 16, 2018 at 10:28:11PM -0700, Eric Dumazet wrote:
> > I turned pr_debug on in tcp_in_window() for another try and it's a bit
> > mangled because the information on multiple lines and the function is
> > called in parallel but it looks like I do have some seq > maxend +1
> > 
> > Although it's weird, the maxend was set WAY earlier apparently?
> > Apr 17 11:13:14 res=1 sender end=1913287798 maxend=1913316998 maxwin=29312 
> > receiver end=505004284 maxend=505033596 maxwin=29200
> > then window decreased drastically e.g. previous ack just before refusal:
> > Apr 17 11:13:53 seq=1913292311 ack=505007789+(0) sack=505007789+(0) win=284 
> > end=1913292311
> > Apr 17 11:13:53 sender end=1913292311 maxend=1913331607 maxwin=284 scale=0 
> > receiver end=505020155 maxend=505033596 maxwin=39296 scale=7
> 
> scale=0 is suspect.
> 
> Really if conntrack does not see SYN SYNACK packets, it should not
> make any window check, since windows can be scaled up to 14 :/

Or maybe set the scaling to

  - TCP_MAX_WSCALE (14) by default
  - 0 when SYN or SYNACK without window scale option is seen
  - value of window scale option when SYN or SYNACK with it is seen

Michal Kubecek


Re: tcp hang when socket fills up ?

2018-04-13 Thread Michal Kubecek
P .31872 > .13317: Flags 
> [.], ack 77346, win 1444, options [nop,nop,TS val 1617129822 ecr 
> 1313937714,nop,nop,sack 1 {32004:33378}], length 0
> 16:49:27.371182 IP .31872 > .13317: Flags 
> [P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617130018 
> ecr 1313937714], length 36
> 16:49:27.519862 IP .13317 > .31872: Flags 
> [.], seq 32004:33378, ack 4190, win 307, options [nop,nop,TS val 1313938426 
> ecr 1617129473], length 1374
> 16:49:27.547662 IP .31872 > .13317: Flags 
> [.], ack 77346, win 1444, options [nop,nop,TS val 1617130293 ecr 
> 1313937714,nop,nop,sack 1 {32004:33378}], length 0
> 16:49:27.883372 IP .31872 > .13317: Flags 
> [P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617130530 
> ecr 1313937714], length 36
> 16:49:28.511861 IP .13317 > .31872: Flags 
> [.], seq 32004:33378, ack 4190, win 307, options [nop,nop,TS val 1313939418 
> ecr 1617129473], length 1374
> 16:49:28.538891 IP .31872 > .13317: Flags 
> [.], ack 77346, win 1444, options [nop,nop,TS val 1617131285 ecr 
> 1313937714,nop,nop,sack 1 {32004:33378}], length 0
> 16:49:28.907197 IP .31872 > .13317: Flags 
> [P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617131554 
> ecr 1313937714], length 36
> 16:49:30.431864 IP .13317 > .31872: Flags 
> [.], seq 32004:33378, ack 4190, win 307, options [nop,nop,TS val 1313941338 
> ecr 1617129473], length 1374
> 16:49:30.459127 IP .31872 > .13317: Flags 
> [.], ack 77346, win 1444, options [nop,nop,TS val 1617133204 ecr 
> 1313937714,nop,nop,sack 1 {32004:33378}], length 0
> 16:49:30.955388 IP .31872 > .13317: Flags 
> [P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617133602 
> ecr 1313937714], length 36
> 16:49:34.207879 IP .13317 > .31872: Flags 
> [.], seq 32004:33378, ack 4190, win 307, options [nop,nop,TS val 1313945114 
> ecr 1617129473], length 1374
> 16:49:34.235726 IP .31872 > .13317: Flags 
> [.], ack 77346, win 1444, options [nop,nop,TS val 1617136981 ecr 
> 1313937714,nop,nop,sack 1 {32004:33378}], length 0
> 16:49:35.256285 IP .31872 > .13317: Flags 
> [P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617137954 
> ecr 1313937714], length 36
> 16:49:42.143864 IP .13317 > .31872: Flags 
> [.], seq 32004:33378, ack 4190, win 307, options [nop,nop,TS val 1313953050 
> ecr 1617129473], length 1374
> 16:49:42.171531 IP .31872 > .13317: Flags 
> [.], ack 77346, win 1444, options [nop,nop,TS val 1617144917 ecr 
> 1313937714,nop,nop,sack 1 {32004:33378}], length 0
> 16:49:43.448262 IP .31872 > .13317: Flags 
> [P.], seq 4190:4226, ack 77346, win 1444, options [nop,nop,TS val 1617146146 
> ecr 1313937714], length 36

The way I read this, server doesn't see anything sent by client since
some point shortly before the dump shown here starts (about 5ms). It
keeps sending data until 16:49:26.807754 (seq 77346) and then keeps
resending first (from its point of view) unacknowledged segment
(32004:33378) in exponentially growing intervals and ignores replies
from the client. Client apparently receives these retransmits and
replies with dupack (with D-SACK for 32004:33378) and retransmits of its
own first unacknowledged segment (4190:4226).

As we can see the client packets in the dump (which was taken on
server), it would mean they are dropped after the point where packet
socket would pass them to libpcap. That might be e.g. netfilter
(conntrack?) or the IP/TCP code detecting them to be invalid for some
reason (which is not obvious to me from the dump above).

There are two strange points:

1. While client apparently responds to all server retransmits, it does
so with TSecr=1313937714 (matching server packet from 16:49:26.807754)
rather than TSval of the packets it dupacks (1313937955 through
1313953050). This doesn't seem to follow the rules of RFC 7323
Section 4.3.

2. Window size values in acks from client grow with each acked packet by
22-23 (which might be ~1400 with scaling factor of 64). I would rather
expect advertised receive window to go down by 1374 with each received
segment and to grow by bigger steps with each read()/recv() call from
application.

We might get more insight if we saw the same connection on both sides.
>From what was presented here, my guess is that

  (1) received packets are dropped somewhere on server side (after they
  are cloned for the packet socket)
  (2) there is something wrong either on client side or between the two
  hosts (there is at least a NAT, IIUC)

Michal Kubecek


Re: TCP one-by-one acking - RFC interpretation question

2018-04-11 Thread Michal Kubecek
On Wed, Apr 11, 2018 at 12:58:37PM +0200, Michal Kubecek wrote:
> There is something else I don't understand, though. In the case of
> acking previously sacked and never retransmitted segment,
> tcp_clean_rtx_queue() calculates the parameters for tcp_ack_update_rtt()
> using
> 
> if (sack->first_sackt.v64) {
> sack_rtt_us = skb_mstamp_us_delta(,
> >first_sackt);
> ca_rtt_us = skb_mstamp_us_delta(,
> >last_sackt);
> }
> 
> (in 4.4; mainline code replaces  with tp->tcp_mstamp). If I read the
> code correctly, both sack->first_sackt and sack->last_sackt contain
> timestamps of initial segment transmission. This would mean we use the
> time difference between the initial transmission and now, i.e. including
> the RTO of the lost packet).
> 
> IMHO we should take the actual round trip time instead, i.e. the
> difference between the original transmission and the time the packet
> sacked (first time). It seems we have been doing this before commit
> 31231a8a8730 ("tcp: improve RTT from SACK for CC").

Sorry for the noise, this was my misunderstanding, the first_sackt and
last_sackt values are only taken from segments newly sacked by ack
received right now, not those which were already sacked before.

The actual problem and unrealistic RTT measurements come from another
RFC violation I didn't mention before: the NAS doesn't follow RFC 2018
section 4 rule for ordering of SACK blocks. Rather than sending SACK
blocks three most recently received out-of-order blocks, it simply sends
first three ordered by sequence numbers. In the earlier example (odd
packets were received, even lost)

   ACK SAK SAK SAK
+---+---+---+---+---+---+---+---+---+
|   1   |   2   |   3   |   4   |   5   |   6   |   7   |   8   |   9   |
+---+---+---+---+---+---+---+---+---+
  34273   35701   37129   38557   39985   41413   42841   44269   45697   47125

it responds to retransmitted segment 2 by

  1. ACK 37129, SACK 37129-38557 39985-41413 42841-44269
  2. ACK 38557, SACK 39985-41413 42841-44269 45697-47125

This new SACK block 45697-47125 has not been retransmitted and as it
wasn't sacked before, it is considered newly sacked. Therefore it gets
processed and its deemed RTT (time since its original transmit time)
"poisons" the RTT calculation, leading to RTO spiraling up.

Thus if we want to work around the NAS behaviour, we would need to
recognize such new SACK block as "not really new" and ignore it for
first_sackt/last_sackt. I'm not sure if it's possible without
misinterpreting actually delayed out of order packets. Of course, it is
not clear if it's worth the effort to work around so severely broken TCP
implementations (two obvious RFC violations, even if we don't count the
one-by-one acking).

Michal Kubecek


Re: TCP one-by-one acking - RFC interpretation question

2018-04-11 Thread Michal Kubecek
On Fri, Apr 06, 2018 at 09:49:58AM -0700, Eric Dumazet wrote:
> Cc Neal and Yuchung if they missed this thread.
> 
> On 04/06/2018 08:03 AM, Michal Kubecek wrote:
> > On Fri, Apr 06, 2018 at 05:01:29AM -0700, Eric Dumazet wrote:
> >>
> >>
> >> On 04/06/2018 03:05 AM, Michal Kubecek wrote:
> >>> Hello
> >>>
> >>> I encountered a strange behaviour of some (non-linux) TCP stack which
> >>> I believe is incorrect but support engineers from the company producing
> >>> it claim is OK.
> >>>
> >>> Assume a client (sender, Linux 4.4 kernel) sends a stream of MSS sized
> >>> segments but segments 2, 4 and 6 do not reach the server (receiver):
> >>>
> >>>  ACK SAK SAK SAK
> >>>   +---+---+---+---+---+---+---+
> >>>   |   1   |   2   |   3   |   4   |   5   |   6   |   7   |
> >>>   +---+---+---+---+---+---+---+
> >>> 34273   35701   37129   38557   39985   41413   42841   44269
> >>>
> >>> When segment 2 is retransmitted after RTO timeout, normal response would
> >>> be ACK-ing segment 3 (38557) with SACK for 5 and 7 (39985-41413 and
> >>> 42841-44269).
> >>>
> >>> However, this server stack responds with two separate ACKs:
> >>>
> >>>   - ACK 37129, SACK 37129-38557 39985-41413 42841-44269
> >>>   - ACK 38557, SACK 39985-41413 42841-44269
> >>
> >> Hmmm... Yes this seems very very wrong and lazy.
> >>
> >> Have you verified behavior of more recent linux kernel to such threats ?
> > 
> > No, unfortunately the problem was only encountered by our customer in
> > production environment (they tried to reproduce in a test lab but no
> > luck). They are running backups to NFS server and it happens from time
> > to time (in the order of hours, IIUC). So it would be probably hard to
> > let them try with more recent kernel.
> > 
> > On the other hand, they reported that SLE11 clients (kernel 3.0) do not
> > run into this kind of problem. It was originally reported as a
> > a regression on migration from SLE11-SP4 (3.0 kernel) to SLE12-SP2 (4.4
> > kernel) and the problem was reported as "SLE12-SP2 is ignoring dupacks"
> > (which seems to be mostly caused by the switch to RACK).
> > 
> > It also seems that part of the problem is specific packet loss pattern
> > where at some point, many packets are lost in "every second" pattern.
> > The customer finally started to investigate this problem and it seems it
> > has something to do with their bonding setup (they provided no details,
> > my guess is packets are divided over two paths and one of them fails).
> > 
> >> packetdrill test would be relatively easy to write.
> > 
> > I'll try but I have very little experience with writing packetdrill
> > scripts so it will probably take some time.
> > 
> >> Regardless of this broken alien stack, we might be able to work around
> >> this faster than the vendor is able to fix and deploy a new stack.
> >>
> >> ( https://en.wikipedia.org/wiki/Robustness_principle )
> >> Be conservative in what you do, be liberal in what you accept from
> >> others...
> > 
> > I was thinking about this a bit. "Fixing" the acknowledgment number
> > could do the trick but it doesn't feel correct. We might use the fact
> > that TSecr of both ACKs above matches TSval of the retransmission which
> > triggered them so that RTT calculated from timestamp would be the right
> > one. So perhaps something like "prefer timestamp RTT if measured RTT
> > seems way too off". But I'm not sure if it couldn't break other use
> > cases where (high) measured RTT is actually correct, rather than (low)
> > timestamp RTT.

I stared at the code some more and apparently I was wrong. I put my
tracing right after the check

if (flag & FLAG_SACK_RENEGING) {

in tcp_check_sack_reneging() and completely missed Neal's commit
5ae344c949e7 ("tcp: reduce spurious retransmits due to transient SACK
reneging") which deals with exactly this kind of broken TCP stacks
sending a series of acks for each segment from out of order queue.
Therefore it seems the events in my log weren't actual SACK reneging
events and the SACK scoreboard wasn't cleared.

There is something else I don't understand, though. In the case of
acking previously sacked and never retransmitted segment,
tcp_clean_rtx_queue() calculates the parameters

Re: TCP one-by-one acking - RFC interpretation question

2018-04-06 Thread Michal Kubecek
On Fri, Apr 06, 2018 at 05:01:29AM -0700, Eric Dumazet wrote:
> 
> 
> On 04/06/2018 03:05 AM, Michal Kubecek wrote:
> > Hello,
> > 
> > I encountered a strange behaviour of some (non-linux) TCP stack which
> > I believe is incorrect but support engineers from the company producing
> > it claim is OK.
> > 
> > Assume a client (sender, Linux 4.4 kernel) sends a stream of MSS sized
> > segments but segments 2, 4 and 6 do not reach the server (receiver):
> > 
> >  ACK SAK SAK SAK
> >   +---+---+---+---+---+---+---+
> >   |   1   |   2   |   3   |   4   |   5   |   6   |   7   |
> >   +---+---+---+---+---+---+---+
> > 34273   35701   37129   38557   39985   41413   42841   44269
> > 
> > When segment 2 is retransmitted after RTO timeout, normal response would
> > be ACK-ing segment 3 (38557) with SACK for 5 and 7 (39985-41413 and
> > 42841-44269).
> > 
> > However, this server stack responds with two separate ACKs:
> > 
> >   - ACK 37129, SACK 37129-38557 39985-41413 42841-44269
> >   - ACK 38557, SACK 39985-41413 42841-44269
> 
> Hmmm... Yes this seems very very wrong and lazy.
> 
> Have you verified behavior of more recent linux kernel to such threats ?

No, unfortunately the problem was only encountered by our customer in
production environment (they tried to reproduce in a test lab but no
luck). They are running backups to NFS server and it happens from time
to time (in the order of hours, IIUC). So it would be probably hard to
let them try with more recent kernel.

On the other hand, they reported that SLE11 clients (kernel 3.0) do not
run into this kind of problem. It was originally reported as a
a regression on migration from SLE11-SP4 (3.0 kernel) to SLE12-SP2 (4.4
kernel) and the problem was reported as "SLE12-SP2 is ignoring dupacks"
(which seems to be mostly caused by the switch to RACK).

It also seems that part of the problem is specific packet loss pattern
where at some point, many packets are lost in "every second" pattern.
The customer finally started to investigate this problem and it seems it
has something to do with their bonding setup (they provided no details,
my guess is packets are divided over two paths and one of them fails).

> packetdrill test would be relatively easy to write.

I'll try but I have very little experience with writing packetdrill
scripts so it will probably take some time.

> Regardless of this broken alien stack, we might be able to work around
> this faster than the vendor is able to fix and deploy a new stack.
> 
> ( https://en.wikipedia.org/wiki/Robustness_principle )
> Be conservative in what you do, be liberal in what you accept from
> others...

I was thinking about this a bit. "Fixing" the acknowledgment number
could do the trick but it doesn't feel correct. We might use the fact
that TSecr of both ACKs above matches TSval of the retransmission which
triggered them so that RTT calculated from timestamp would be the right
one. So perhaps something like "prefer timestamp RTT if measured RTT
seems way too off". But I'm not sure if it couldn't break other use
cases where (high) measured RTT is actually correct, rather than (low)
timestamp RTT.

Michal Kubecek


TCP one-by-one acking - RFC interpretation question

2018-04-06 Thread Michal Kubecek
Hello,

I encountered a strange behaviour of some (non-linux) TCP stack which
I believe is incorrect but support engineers from the company producing
it claim is OK.

Assume a client (sender, Linux 4.4 kernel) sends a stream of MSS sized
segments but segments 2, 4 and 6 do not reach the server (receiver):

 ACK SAK SAK SAK
  +---+---+---+---+---+---+---+
  |   1   |   2   |   3   |   4   |   5   |   6   |   7   |
  +---+---+---+---+---+---+---+
34273   35701   37129   38557   39985   41413   42841   44269

When segment 2 is retransmitted after RTO timeout, normal response would
be ACK-ing segment 3 (38557) with SACK for 5 and 7 (39985-41413 and
42841-44269).

However, this server stack responds with two separate ACKs:

  - ACK 37129, SACK 37129-38557 39985-41413 42841-44269
  - ACK 38557, SACK 39985-41413 42841-44269

There is no payload from server, no window update and it happens even if
there is no other packet received by server between those two. The
result is that as segment 3 was never retransmitted, second ACK is
interpreted as acking a newly arrived segment by 4.4 kernel so that the
whole interval between first transmission of segment 3 and this second
ACK is used for RTT estimator; even worse, when the same happens again
for segment 5, both timeouts (for 2 and 4) are counted into its RTT.
The result is RTO growing exponentially until it reaches the maximum
(120 seconds) and the connection is effectively stalled.

In my opinion, server behaviour violates the last paragraph of RFC 5681,
section 4.2:

  A TCP receiver MUST NOT generate more than one ACK for every incoming
  segment, other than to update the offered window as the receiving
  application consumes new data (see [RFC813] and page 42 of [RFC793]).

Server vendor claims that their behaviour is correct as first ACK is
sent in response to segment 2 and second ACK in response to segment 3
(which has just been delayed in the out of order queue).

Note that SACK doesn't really help here. First SACK block in first ACK
(37129-38557) is actually invalid as it violates the "the bytes just
below the block ... have not been received" condition from RFC 2018
section 3. Therefore Linux 4.4 stack ignores this SACK block, detects
(spurious) SACK reneging and unmarks the "previously sacked" flag of
segment 3 so that when second ACK arrives, there is no trace of it
having been sacked before. They already admitted this SACK block is
incorrect but there is still disagreement about the "one-by-one acking"
behaviour in general.

My question is: is my interpretation correct? If so, is there an even
less ambiguous statement somewhere that receiver is supposed to send one
ACK for "everything they got so far" rather than acking the segments one
by one? While reading the RFCs, I always considered this obvious but
apparently some people may think otherwise.

Thanks in advance,
Michal Kubecek



Re: [RFC] ethtool: Support for driver private ioctl's

2018-04-06 Thread Michal Kubecek
On Thu, Apr 05, 2018 at 08:50:49AM -0700, Florian Fainelli wrote:
> On 04/05/2018 03:47 AM, Jose Abreu wrote:
> > Background: Synopsys Ethernet IP's have a certain number of
> > features which can be reconfigured at runtime. Giving you two
> > examples: One of the most recent one is the safety features,
> > which can be enabled/disabled and forced at runtime. Another one
> > is a Flexible RX Parser which can route specific packets to
> > specific RX DMA channels. Given that these are features specific
> > to our IP's it would not be useful to add an uniform API for this
> > because the users would only be one or two drivers ...
> 
> Parsing of packets and directing the matched packets to specific
> queues/channels can be done through ethtool rxnfc API, tc/cls_flower as
> well, so you should really check whether those APIs don't already allow
> you to do what you want.
> 
> ethtool already supports a concept of private  flags, not ioctl() though
> which allows you to toggle boolean values for instance (or technically
> up to how many bits a "flag" is used to represent) is that enough or do
> you need to turn on/off the feature as well as pass configuration
> parameters?

Perhaps introducing "driver/device specific tunables" (i.e. something
like tunables or PHY tunables but specific to a particular device) could
be a way. But it could get out of control quickly and users wouldn't be
happy if they had to set the same (or almost the same) parameter under
five different names for five NIC vendors.

Michal Kubecek


Re: [RFC] ethtool: Support ETHTOOL_GSTATS2 command.

2018-03-20 Thread Michal Kubecek
On Tue, Mar 20, 2018 at 08:39:33AM -0700, Ben Greear wrote:
> On 03/20/2018 03:37 AM, Michal Kubecek wrote:
> > 
> > IMHO it would be more practical to set "0 means same as GSTATS" as a
> > rule and make ethtool_get_stats() a wrapper for ethtool_get_stats2() to
> > avoid code duplication (or perhaps a use fall-through in the switch). It
> > would also allow drivers to provide only one of the callbacks.
> 
> Yes, but that would require changing all drivers at once, and would make 
> backporting
> and out-of-tree drivers harder to manage.  I had low hopes that this feature 
> would
> make it upstream, so I didn't want to propose any large changes up front.

I don't think so. What I mean is:

(a) driver implements ->get_ethtool_stats2() callback; then we use it
for GSTATS2
(b) driver does not implement get_ethtool_stats2() but implements
->get_ethtool_stats(); then we call for GSTATS2 if level is zero,
otherwise GSTATS2 returns -EINVAL

and GSTATS is always translated to GSTATS2 with level 0, either by
defining ethtool_get_stats() as a wrapper or by fall-through in the
switch statement.

This way, most drivers could be left untouched and only those which
would implement non-default levels would provide ->get_ethtool_stats2()
callback instead of ->get_ethtool_stats().

Michal Kubecek


Re: [RFC] ethtool: Support ETHTOOL_GSTATS2 command.

2018-03-20 Thread Michal Kubecek
On Wed, Mar 07, 2018 at 11:51:29AM -0800, gree...@candelatech.com wrote:
> From: Ben Greear <gree...@candelatech.com>
> 
> This is similar to ETHTOOL_GSTATS, but it allows you to specify
> a 'level'.  This level can be used by the driver to decrease the
> amount of stats refreshed.  In particular, this helps with ath10k
> since getting the firmware stats can be slow.
> 
> Signed-off-by: Ben Greear <gree...@candelatech.com>
> ---
> 
> NOTE:  I know to make it upstream I would need to split the patch and
> remove the #define for 'backporting' that I added.  But, is the
> feature in general wanted?  If so, I'll do the patch split and
> other tweaks that might be suggested.

I'm not familiar enough with the technical background of stats
collecting to comment on usefulness and desirability of this feature.
Adding a new command just to add a numeric parameter certainly doesn't
feel right but it's how the ioctl interface works. I take it as
a reminder to find some time to get back to the netlink interface.

> diff --git a/net/core/ethtool.c b/net/core/ethtool.c
> index 674b6c9..d3b709f 100644
> --- a/net/core/ethtool.c
> +++ b/net/core/ethtool.c
> @@ -1947,6 +1947,54 @@ static int ethtool_get_stats(struct net_device *dev, 
> void __user *useraddr)
>   return ret;
>  }
>  
> +static int ethtool_get_stats2(struct net_device *dev, void __user *useraddr)
> +{
> + struct ethtool_stats stats;
> + const struct ethtool_ops *ops = dev->ethtool_ops;
> + u64 *data;
> + int ret, n_stats;
> + u32 stats_level = 0;
> +
> + if (!ops->get_ethtool_stats2 || !ops->get_sset_count)
> + return -EOPNOTSUPP;
> +
> + n_stats = ops->get_sset_count(dev, ETH_SS_STATS);
> + if (n_stats < 0)
> + return n_stats;
> + if (n_stats > S32_MAX / sizeof(u64))
> + return -ENOMEM;
> + WARN_ON_ONCE(!n_stats);
> + if (copy_from_user(, useraddr, sizeof(stats)))
> + return -EFAULT;
> +
> + /* User can specify the level of stats to query.  How the
> +  * level value is used is up to the driver, but in general,
> +  * 0 means 'all', 1 means least, and higher means more.
> +  * The idea is that some stats may be expensive to query, so user
> +  * space could just ask for the cheap ones...
> +  */
> + stats_level = stats.n_stats;
> +
> + stats.n_stats = n_stats;
> + data = vzalloc(n_stats * sizeof(u64));
> + if (n_stats && !data)
> + return -ENOMEM;
> +
> + ops->get_ethtool_stats2(dev, , data, stats_level);
> +
> + ret = -EFAULT;
> + if (copy_to_user(useraddr, , sizeof(stats)))
> + goto out;
> + useraddr += sizeof(stats);
> + if (n_stats && copy_to_user(useraddr, data, n_stats * sizeof(u64)))
> + goto out;
> + ret = 0;
> +
> + out:
> + vfree(data);
> + return ret;
> +}
> +
>  static int ethtool_get_phy_stats(struct net_device *dev, void __user 
> *useraddr)
>  {
>   struct ethtool_stats stats;

IMHO it would be more practical to set "0 means same as GSTATS" as a
rule and make ethtool_get_stats() a wrapper for ethtool_get_stats2() to
avoid code duplication (or perhaps a use fall-through in the switch). It
would also allow drivers to provide only one of the callbacks.

Michal Kubecek


Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-20 Thread Michal Kubecek
On Mon, Feb 19, 2018 at 06:09:39PM +0100, Phil Sutter wrote:
> What puzzles me about your argumentation is that you seem to propose for
> the kernel to cover up flaws in userspace. Spinning this concept further
> would mean that if there would be an old bug in iproute2 we should think
> of adding a workaround to rtnetlink interface in kernel because
> containers will keep the old iproute2 binary? Or am I (hopefully) just
> missing your point?

Actually, that's what we already do. This is from rtnl_dump_ifinfo():

/* A hack to preserve kernel<->userspace interface.
 * The correct header is ifinfomsg. It is consistent with rtnl_getlink.
 * However, before Linux v3.9 the code here assumed rtgenmsg and that's
 * what iproute2 < v3.9.0 used.
 * We can detect the old iproute2. Even including the IFLA_EXT_MASK
 * attribute, its netlink message is shorter than struct ifinfomsg.
 */

(There are, in fact, even current tools using rtgenmsg but that's
another story.)

Michal Kubecek



Re: regression in igb driver from commit 182785335447

2018-02-05 Thread Michal Kubecek
On Fri, Feb 02, 2018 at 12:54:27PM -0600, Aaron Sierra wrote:
> > From: "Michal Kubecek" <mkube...@suse.cz>
> > Sent: Thursday, February 1, 2018 6:47:32 AM
> > [   13.710535] igb: Intel(R) Gigabit Ethernet Network Driver - version 
> > 5.4.0-k
> > [   13.710538] igb: Copyright (c) 2007-2014 Intel Corporation.
> > [   13.710584] igb :08:00.0: PCI->APIC IRQ transform: INT A -> IRQ 56
> > [   13.712126] igb: probe of :08:00.0 failed with error -2
> > [   13.712152] igb :08:00.1: PCI->APIC IRQ transform: INT B -> IRQ 70
> > [   13.904537] igb :08:00.1: Intel(R) Gigabit Ethernet Network 
> > Connection
> > [   13.904545] igb :08:00.1: eth0: (PCIe:2.5Gb/s:Width x4) 
> > 00:30:48:7b:5d:37
> > [   13.904547] igb :08:00.1: eth0: PBA No: Unknown
> > [   13.904556] igb :08:00.1: Using MSI-X interrupts. 4 rx queue(s), 4 tx
> > queue(s)
> > [   13.927029] igb :08:00.1 eth1: renamed from eth0
> 
> Can you share whether (and which versions of) Intel Boot Agent was enabled
> for these ports.

It certainly was, first port is used for PXE boot.

This is what I caught on serial console on boot:


Initializing Intel(R) Boot Agent GE v0.0.13
*** DEVELOPMENT BUILD - NOT FOR PRODUCTION USE!!! ***
*** DEVELOPMENT BUILD - NOT FOR PRODUCTION USE!!! ***
*** DEVELOPMENT BUILD - NOT FOR PRODUCTION USE!!! ***
PXE 2.1 Build 086 (WfM 2.0)  
   
  
Initializing Intel(R) Boot Agent GE v0.0.13
*** DEVELOPMENT BUILD - NOT FOR PRODUCTION USE!!! ***
*** DEVELOPMENT BUILD - NOT FOR PRODUCTION USE!!! ***
*** DEVELOPMENT BUILD - NOT FOR PRODUCTION USE!!! ***
PXE 2.1 Build 086 (WfM 2.0)
Press Ctrl+S to enter the Setup Menu..   


Doesn't look wery trustworthy. :-( I'll check if I can find another
machine with the same card but more reasonable Intel Boot Agent.

>  Does the behavior change if Intel Boot Agent is in the
> opposite state? Was the failing device used for network booting?

This may be a bit tricky, the machine is in a test lab and I have only
remote access to it. I'll check if I can disable the Intel Boot Agent
using the serial console.

Michal Kubecek


regression in igb driver from commit 182785335447

2018-02-01 Thread Michal Kubecek
one of my colleagues observed a regression in recent 4.4.x kernels on
one of test machines with 82575EB NIC (rev 02, 8086:10a7, firmware
version 1.6.5). On boot, first port fails to initialize and only the net
device for second is created. Kernel log looks like

[   13.710535] igb: Intel(R) Gigabit Ethernet Network Driver - version 5.4.0-k
[   13.710538] igb: Copyright (c) 2007-2014 Intel Corporation.
[   13.710584] igb :08:00.0: PCI->APIC IRQ transform: INT A -> IRQ 56
[   13.712126] igb: probe of :08:00.0 failed with error -2
[   13.712152] igb :08:00.1: PCI->APIC IRQ transform: INT B -> IRQ 70
[   13.904537] igb :08:00.1: Intel(R) Gigabit Ethernet Network Connection
[   13.904545] igb :08:00.1: eth0: (PCIe:2.5Gb/s:Width x4) 00:30:48:7b:5d:37
[   13.904547] igb :08:00.1: eth0: PBA No: Unknown
[   13.904556] igb :08:00.1: Using MSI-X interrupts. 4 rx queue(s), 4 tx 
queue(s)
[   13.927029] igb :08:00.1 eth1: renamed from eth0

Checking the changelog led us to a stable-4.4.y backport of mainline
commit 182785335447 ("igb: reset the PHY before reading the PHY ID") as
the most promising suspect and reverting it fixed the issue.

I also reproduced the issue with 4.15 kernel, except this time both
ports of the card failed to probe:

[   16.826649] igb: Intel(R) Gigabit Ethernet Network Driver - version 5.4.0-k
[   16.840784] igb: Copyright (c) 2007-2014 Intel Corporation.
[   16.852176] igb :08:00.0: PCI->APIC IRQ transform: INT A -> IRQ 56
[   16.867919] igb: probe of :08:00.0 failed with error -2
[   16.879254] igb :08:00.1: PCI->APIC IRQ transform: INT B -> IRQ 70
[   16.898178] igb: probe of :08:00.1 failed with error -2

Reverting commit 182785335447 fixed the issue here as well.

Michal Kubecek


Re: r8169 regression: UDP packets dropped intermittantly

2017-12-19 Thread Michal Kubecek
On Tue, Dec 19, 2017 at 04:15:32PM +1030, Jonathan Woithe wrote:
> This clearly indicates that not every card using the r8169 driver is
> vulnerable to the problem.  It also explains why Holger was unable to
> reproduce the result on his system: the PCIe cards do not appear to suffer
> from the problem.  Most likely the PCI RTL-8169 chip is affected, but newer
> PCIe variations do not.  However, obviously more testing will be required
> with a wider variety of cards if this inference is to hold up.

The r8169 driver supports many slightly different variants of the chip.
To identify your variant more precisely, look for a line like

  r8169 :02:00.0 eth0: RTL8168evl/8111evl at 0xc90003135000, 
d4:3d:7e:2a:30:08, XID 0c900800 IRQ 38

in kernel log.

Michal Kubecek


Re: [RFC PATCH 2/9] ethtool: introduce ethtool netlink interface

2017-12-12 Thread Michal Kubecek
On Mon, Dec 11, 2017 at 01:45:47PM -0500, David Miller wrote:
> From: Jiri Pirko <j...@resnulli.us>
> Date: Mon, 11 Dec 2017 19:02:19 +0100
> 
> > The discussion we had before was about flag bitfield that was there
> > *always*. In this case, that is not true. It is either ifindex or
> > ifname. Even rtnetlink has ifname as attribute.
> > 
> > The flags and info_mask is just big mystery. If it is per-command,
> > seems natural to have it as attributes.
> 
> I think flags and info_mask indeed can be moved out of this struct.
> 
> I guess, in this case, I can see your point of view especially if we
> allow ethtool operations on non-netdev entities.
> 
> So, ok, let's move forward without a base command struct and just
> use attributes.

OK, I'll rework the interface to use attributes for all data.

Michal Kubecek


Re: [RFC PATCH 5/9] ethtool: implement GET_DRVINFO message

2017-12-12 Thread Michal Kubecek
On Mon, Dec 11, 2017 at 05:16:01PM +0100, Jiri Pirko wrote:
> Mon, Dec 11, 2017 at 02:54:01PM CET, mkube...@suse.cz wrote:
> >+
> >+ETHA_DRVINFO_DRIVER (string)driver name
> >+ETHA_DRVINFO_VERSION(string)driver version
> 
> You use 3 prefixes:
> ETHTOOL_ for cmd
> ETHA_ for attrs
> ethnl_ for function
> 
> I suggest to sync this, perhaps to:
> ETHNL_CMD_*
> ETHNL_ATTR_*
> ethnl_*
> ?

Switching to ETHNL_CMD_* is certainly a good idea. For attributes, I'm a
bit worried that ETHNL_ATTR_ prefix might make it even harder to fit
into the 80 characters per line limit. I'll try and see what it would
look like.

> >+ETHA_DRVINFO_FWVERSION  (string)firmware version
> >+ETHA_DRVINFO_BUSINFO(string)device bus address
> >+ETHA_DRVINFO_EROM_VER   (string)expansion ROM version
> >+ETHA_DRVINFO_N_PRIV_FLAGS   (u32)   number of private flags
> >+ETHA_DRVINFO_N_STATS(u32)   number of device stats
> >+ETHA_DRVINFO_TESTINFO_LEN   (u32)   number of test results
> >+ETHA_DRVINFO_EEDUMP_LEN (u32)   EEPROM dump size
> >+ETHA_DRVINFO_REGDUMP_LEN(u32)   register dump size
> 
> We are now working on providing various fw memory regions dump in
> devlink. It makes sense to have it in devlink for couple of reasons:
> 1) per-asic, not netdev specific, therefore does not really make sense
>to have netdev as handle, but rather devlink handle.
> 2) snapshot support - we need to provide support for getting snapshot
>(for example on failure), transferring to user and deleting it
>(remove from driver memory).
> 
> Also, driver name, version, fwversion, etc is per-asic. Would make sense
> to have it in devlink as well.
> 
> I think this is great opprotunity to move things where they should be to
> be alligned with the current world and kernel infrastructure.

IMHO this depends on the question whether we are going to rework also
the interface between kernel and NIC drivers (currently ethtool_ops).
If we are, it would make good sense to split the information in both
interfaces. If not, it doesn't seem to make userspace do two queries,
each getting one part of the same information provided by NIC.

Also, I must admit that one thing I dislike about the iw tool is that it
pushes me to distinguish between operations on "phy" and "dev". While
I understand that it makes perfect sense for someone familiar with the
internals, it's quite annoying for an occasional "consumer". It would be
sad if we created a set of perfectly logical and clean tools and
(almost) 20 years later, people still used old ioctl based ethtool
because "it's what they are used to and it works just fine for them"
(like they do with ifconfig, netstat or brctl).

Michal


Re: [PATCH V2] netlink: Add netns check on taps

2017-12-11 Thread Michal Kubecek
On Mon, Dec 11, 2017 at 11:58:37AM -0500, David Miller wrote:
> From: Michal Kubecek <mkube...@suse.cz>
> Date: Mon, 11 Dec 2017 17:13:50 +0100
> 
> > On Wed, Dec 06, 2017 at 03:57:14PM -0500, David Miller wrote:
> >> From: Kevin Cernekee <cerne...@chromium.org>
> >> Date: Wed,  6 Dec 2017 12:12:27 -0800
> >> 
> >> > Currently, a nlmon link inside a child namespace can observe systemwide
> >> > netlink activity.  Filter the traffic so that nlmon can only sniff
> >> > netlink messages from its own netns.
> >> > 
> >> > Test case:
> >> > 
> >> > vpnns -- bash -c "ip link add nlmon0 type nlmon; \
> >> >   ip link set nlmon0 up; \
> >> >   tcpdump -i nlmon0 -q -w /tmp/nlmon.pcap -U" &
> >> > sudo ip xfrm state add src 10.1.1.1 dst 10.1.1.2 proto esp \
> >> > spi 0x1 mode transport \
> >> > auth sha1 0x616263313233 \
> >> > enc aes 0x
> >> > grep --binary abc123 /tmp/nlmon.pcap
> >> > 
> >> > Signed-off-by: Kevin Cernekee <cerne...@chromium.org>
> >> 
> >> Applied and queued up for -stable, thanks Kevin.
> > 
> > David,
> > 
> > this patch is marked as accepted in patchworks and listed in
> > 
> >   http://patchwork.ozlabs.org/bundle/davem/stable/?state=*
> > 
> > but it's not in the net tree. Is there a problem with it?
> 
> Sorry, it should be there now.

Thank you,

Michal Kubecek


Re: [PATCH V2] netlink: Add netns check on taps

2017-12-11 Thread Michal Kubecek
On Wed, Dec 06, 2017 at 03:57:14PM -0500, David Miller wrote:
> From: Kevin Cernekee <cerne...@chromium.org>
> Date: Wed,  6 Dec 2017 12:12:27 -0800
> 
> > Currently, a nlmon link inside a child namespace can observe systemwide
> > netlink activity.  Filter the traffic so that nlmon can only sniff
> > netlink messages from its own netns.
> > 
> > Test case:
> > 
> > vpnns -- bash -c "ip link add nlmon0 type nlmon; \
> >   ip link set nlmon0 up; \
> >   tcpdump -i nlmon0 -q -w /tmp/nlmon.pcap -U" &
> > sudo ip xfrm state add src 10.1.1.1 dst 10.1.1.2 proto esp \
> > spi 0x1 mode transport \
> > auth sha1 0x616263313233 \
> > enc aes 0x
> > grep --binary abc123 /tmp/nlmon.pcap
> > 
> > Signed-off-by: Kevin Cernekee <cerne...@chromium.org>
> 
> Applied and queued up for -stable, thanks Kevin.

David,

this patch is marked as accepted in patchworks and listed in

  http://patchwork.ozlabs.org/bundle/davem/stable/?state=*

but it's not in the net tree. Is there a problem with it?

Michal Kubecek


[RFC PATCH 1/9] netlink: introduce nla_put_bitfield32()

2017-12-11 Thread Michal Kubecek
Similar to other data types, this helper puts NLA_BITFIELD32 attribute into
a netlink message. It takes separate value and selector arguments, if you
already have struct nla_bitfield32, you can use nla_put().

Signed-off-by: Michal Kubecek <mkube...@suse.cz>
---
 include/net/netlink.h | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/include/net/netlink.h b/include/net/netlink.h
index 0c154f98e987..6d4eb6bd9235 100644
--- a/include/net/netlink.h
+++ b/include/net/netlink.h
@@ -1064,6 +1064,21 @@ static inline int nla_put_in6_addr(struct sk_buff *skb, 
int attrtype,
return nla_put(skb, attrtype, sizeof(*addr), addr);
 }
 
+/**
+ * nla_put_bitfield32 - Add a bitfield32 value/selector attribute to
+ * a socket buffer
+ * @skb: socket buffer to add attribute to
+ * @value: 32-bit value bitmap
+ * @selector: 32-bit selector bitmap
+ */
+static inline int nla_put_bitfield32(struct sk_buff *skb, int attrtype,
+u32 value, u32 selector)
+{
+   struct nla_bitfield32 tmp = { .value = value, .selector = selector };
+
+   return nla_put(skb, attrtype, sizeof(tmp), );
+}
+
 /**
  * nla_get_u32 - return payload of u32 attribute
  * @nla: u32 netlink attribute
-- 
2.15.1



[RFC PATCH 3/9] ethtool: helper functions for netlink interface

2017-12-11 Thread Michal Kubecek
Misc helpers used by ethtool netlink code.

Signed-off-by: Michal Kubecek <mkube...@suse.cz>
---
 net/core/ethtool_netlink.c | 177 +
 1 file changed, 177 insertions(+)

diff --git a/net/core/ethtool_netlink.c b/net/core/ethtool_netlink.c
index 46a226bb9a2c..22d23d057623 100644
--- a/net/core/ethtool_netlink.c
+++ b/net/core/ethtool_netlink.c
@@ -8,6 +8,183 @@
 
 static struct genl_family ethtool_genl_family;
 
+/* misc helper functions */
+
+static int ethnl_str_size(const char *s)
+{
+   return nla_total_size(strlen(s) + 1);
+}
+
+static int ethnl_str_ifne_size(const char *s)
+{
+   return s[0] ? ethnl_str_size(s) : 0;
+}
+
+static int ethnl_put_str_ifne(struct sk_buff *skb, int attrtype, const char *s)
+{
+   if (!s[0])
+   return 0;
+   return nla_put_string(skb, attrtype, s);
+}
+
+static struct nlattr *ethnl_nest_start(struct sk_buff *skb, int attrtype)
+{
+   return nla_nest_start(skb, attrtype | NLA_F_NESTED);
+}
+
+/* ethnl_update_* return true if the value is changed */
+static bool ethnl_update_u32(u32 *dst, struct nlattr *attr)
+{
+   u32 val;
+
+   if (!attr)
+   return false;
+   val = nla_get_u32(attr);
+   if (*dst == val)
+   return false;
+
+   *dst = val;
+   return true;
+}
+
+static bool ethnl_update_u8(u8 *dst, struct nlattr *attr)
+{
+   u8 val;
+
+   if (!attr)
+   return false;
+   val = nla_get_u8(attr);
+   if (*dst == val)
+   return false;
+
+   *dst = val;
+   return true;
+}
+
+/* update u32 value used as bool from NLA_U8 */
+static bool ethnl_update_bool32(u32 *dst, struct nlattr *attr)
+{
+   u8 val;
+
+   if (!attr)
+   return false;
+   val = !!nla_get_u8(attr);
+   if (!!*dst == val)
+   return false;
+
+   *dst = val;
+   return true;
+}
+
+static bool ethnl_update_binary(u8 *dst, unsigned int len, struct nlattr *attr)
+{
+   if (!attr)
+   return false;
+   if (nla_len(attr) < len)
+   len = nla_len(attr);
+   if (!memcmp(dst, nla_data(attr), len))
+   return false;
+
+   memcpy(dst, nla_data(attr), len);
+   return true;
+}
+
+static bool ethnl_update_bitfield32(u32 *dst, struct nlattr *attr)
+{
+   struct nla_bitfield32 change;
+   u32 newval;
+
+   if (!attr)
+   return false;
+   change = nla_get_bitfield32(attr);
+   newval = (*dst & ~change.selector) | (change.value & change.selector);
+   if (*dst == newval)
+   return false;
+
+   *dst = newval;
+   return true;
+}
+
+static struct net_device *ethnl_dev_get(struct genl_info *info)
+{
+   const struct ethnlmsghdr *hdr = info->userhdr;
+   struct net *net = genl_info_net(info);
+   struct net_device *dev;
+
+   if (hdr->ifindex) {
+   dev = dev_get_by_index(net, hdr->ifindex);
+   if (!dev)
+   return ERR_PTR(-ENODEV);
+   /* if both ifindex and ifname are passed, both must match */
+   if (hdr->ifname && strncmp(dev->name, hdr->ifname, IFNAMSIZ)) {
+   GENL_SET_ERR_MSG(info,
+"ifindex and ifname do not match");
+   return ERR_PTR(-ENODEV);
+   }
+   return dev;
+   } else if (hdr->ifname[0]) {
+   dev = dev_get_by_name(net, hdr->ifname);
+   return dev ?: ERR_PTR(-ENODEV);
+   }
+
+   return ERR_PTR(-EINVAL);
+}
+
+static void warn_partial_info(struct genl_info *info)
+{
+   GENL_SET_ERR_MSG(info, "not all requested data could be retrieved");
+}
+
+/* Check user privileges explicitly to allow finer access control based on
+ * context of the request or hiding part of the information from unprivileged
+ * users
+ */
+static bool ethnl_is_privileged(struct sk_buff *skb, struct genl_info *info)
+{
+   struct net *net = genl_info_net(info);
+
+   return netlink_ns_capable(skb, net->user_ns, CAP_NET_ADMIN);
+}
+
+/* create skb for a reply
+ * payload: payload length (without netlink, genetlink and ethnl headers)
+ * dev: device the reply is about
+ * cmd: ETHTOOL_CMD_* command for reply
+ * info:   info for the received packet we respond to
+ * ehdrp:   place to store pointer to ethtool specific header (may be null)
+ * returns: skb or null on error
+ */
+static struct sk_buff *ethnl_reply_init(size_t payload, struct net_device *dev,
+   u8 cmd, struct genl_info *info,
+   struct ethnlmsghdr **ehdrp)
+{
+   struct ethnlmsghdr *ehdr;
+   struct sk_buff *rskb;
+
+   rskb = genlmsg_new(ETHNL_HDRLEN + payload, GFP_KERNEL);
+   if (!rskb) {
+   GENL_SET_ERR_MSG(info,
+   

[RFC PATCH 5/9] ethtool: implement GET_DRVINFO message

2017-12-11 Thread Michal Kubecek
Request the same information as ETHTOOL_GDRVINFO. This is read-only so that
corresponding SET_DRVINFO exists but is only used in kernel replies. Rip
the code to query the driver out of the legacy interface and move it to
a new file ethtool_common.c so that both interfaces can use it.

Signed-off-by: Michal Kubecek <mkube...@suse.cz>
---
 Documentation/networking/ethtool-netlink.txt | 30 ++-
 include/uapi/linux/ethtool_netlink.h | 21 
 net/core/Makefile|  2 +-
 net/core/ethtool.c   | 42 +++-
 net/core/ethtool_common.c| 46 +
 net/core/ethtool_common.h| 11 
 net/core/ethtool_netlink.c   | 75 
 7 files changed, 189 insertions(+), 38 deletions(-)
 create mode 100644 net/core/ethtool_common.c
 create mode 100644 net/core/ethtool_common.h

diff --git a/Documentation/networking/ethtool-netlink.txt 
b/Documentation/networking/ethtool-netlink.txt
index 893e5156f6a7..cb992180b211 100644
--- a/Documentation/networking/ethtool-netlink.txt
+++ b/Documentation/networking/ethtool-netlink.txt
@@ -107,6 +107,9 @@ which the request applies.
 List of message types
 -
 
+ETHTOOL_CMD_GET_DRVINFO
+ETHTOOL_CMD_SET_DRVINFOresponse only
+
 All constants use ETHTOOL_CMD_ prefix followed by "GET", "SET" or "ACT" to
 indicate the type.
 
@@ -119,6 +122,31 @@ messages marked as "response only" in the table above.
 Later sections describe the format and semantics of these request messages.
 
 
+GET_DRVINFO
+---
+
+GET_DRVINFO request corresponds to ETHTOOL_GDRVINFO ioctl command and provides
+basic driver information. The request doesn't use any attributes and flags,
+info_mask and index field in request header are ignored. Kernel response
+contents:
+
+ETHA_DRVINFO_DRIVER(string)driver name
+ETHA_DRVINFO_VERSION   (string)driver version
+ETHA_DRVINFO_FWVERSION (string)firmware version
+ETHA_DRVINFO_BUSINFO   (string)device bus address
+ETHA_DRVINFO_EROM_VER  (string)expansion ROM version
+ETHA_DRVINFO_N_PRIV_FLAGS  (u32)   number of private flags
+ETHA_DRVINFO_N_STATS   (u32)   number of device stats
+ETHA_DRVINFO_TESTINFO_LEN  (u32)   number of test results
+ETHA_DRVINFO_EEDUMP_LEN(u32)   EEPROM dump size
+ETHA_DRVINFO_REGDUMP_LEN   (u32)   register dump size
+
+The meaning of these follows the corresponding fields of ETHTOOL_GDRVINFO
+response.
+
+All information is read only, SET_DRVINFO request is not implemented.
+
+
 Request translation
 ---
 
@@ -130,7 +158,7 @@ ioctl command   netlink command
 -
 ETHTOOL_GSET   n/a
 ETHTOOL_SSET   n/a
-ETHTOOL_GDRVINFO   n/a
+ETHTOOL_GDRVINFO   ETHTOOL_CMD_GET_DRVINFO
 ETHTOOL_GREGS  n/a
 ETHTOOL_GWOL   n/a
 ETHTOOL_SWOL   n/a
diff --git a/include/uapi/linux/ethtool_netlink.h 
b/include/uapi/linux/ethtool_netlink.h
index 6db35f00ea32..d6ab1d73d494 100644
--- a/include/uapi/linux/ethtool_netlink.h
+++ b/include/uapi/linux/ethtool_netlink.h
@@ -21,6 +21,8 @@ struct ethnlmsghdr {
 
 enum {
ETHTOOL_CMD_NOOP,
+   ETHTOOL_CMD_GET_DRVINFO,
+   ETHTOOL_CMD_SET_DRVINFO,/* only for reply */
 
__ETHTOOL_CMD_MAX,
ETHTOOL_CMD_MAX = (__ETHTOOL_CMD_MAX - 1),
@@ -57,6 +59,25 @@ enum {
ETHA_BITSET_MAX = (__ETHA_BITSET_MAX - 1),
 };
 
+/* GET_DRVINFO / SET_DRVINFO */
+
+enum {
+   ETHA_DRVINFO_UNSPEC,
+   ETHA_DRVINFO_DRIVER,/* string */
+   ETHA_DRVINFO_VERSION,   /* string */
+   ETHA_DRVINFO_FWVERSION, /* string */
+   ETHA_DRVINFO_BUSINFO,   /* string */
+   ETHA_DRVINFO_EROM_VER,  /* string */
+   ETHA_DRVINFO_N_PRIV_FLAGS,  /* u32 */
+   ETHA_DRVINFO_N_STATS,   /* u32 */
+   ETHA_DRVINFO_TESTINFO_LEN,  /* u32 */
+   ETHA_DRVINFO_EEDUMP_LEN,/* u32 */
+   ETHA_DRVINFO_REGDUMP_LEN,   /* u32 */
+
+   __ETHA_DRVINFO_MAX,
+   ETHA_DRVINFO_MAX = (__ETHA_DRVINFO_MAX - 1),
+};
+
 /* generic netlink info */
 #define ETHTOOL_GENL_NAME "ethtool"
 #define ETHTOOL_GENL_VERSION 1
diff --git a/net/core/Makefile b/net/core/Makefile
index 617ab2abecdf..3196641c0e70 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -11,7 +11,7 @@ obj-$(CONFIG_SYSCTL) += sysctl_net_core.o
 obj-y   += dev.o ethtool.o dev_addr_lists.o dst.o netevent.o \
neighbour.o rtnetlink.o utils.o link_watch.o fi

[RFC PATCH 7/9] ethtool: implement SET_SETTINGS message

2017-12-11 Thread Michal Kubecek
Sets the information provided by ETHTOOL_SLINKSETTINGS, ETHTOOL_SWOL and
ETHTOOL_SMSGLVL. Unlike with ioctl(), userspace can send only some
attributes so that we only need to call ethtool_ops callbacks which we
really need (and the "set" callback is only called when we actually changed
some setting).

Signed-off-by: Michal Kubecek <mkube...@suse.cz>
---
 Documentation/networking/ethtool-netlink.txt |  42 +-
 net/core/ethtool_netlink.c   | 547 +++
 2 files changed, 584 insertions(+), 5 deletions(-)

diff --git a/Documentation/networking/ethtool-netlink.txt 
b/Documentation/networking/ethtool-netlink.txt
index 7aabc87c9f09..187fab33f721 100644
--- a/Documentation/networking/ethtool-netlink.txt
+++ b/Documentation/networking/ethtool-netlink.txt
@@ -110,7 +110,7 @@ List of message types
 ETHTOOL_CMD_GET_DRVINFO
 ETHTOOL_CMD_SET_DRVINFOresponse only
 ETHTOOL_CMD_GET_SETTINGS
-ETHTOOL_CMD_SET_SETTINGS   response only (for now)
+ETHTOOL_CMD_SET_SETTINGS
 
 All constants use ETHTOOL_CMD_ prefix followed by "GET", "SET" or "ACT" to
 indicate the type.
@@ -199,6 +199,38 @@ is only provided by kernel in response to privileged 
(netns CAP_NET_ADMIN)
 requests.
 
 
+SET_SETTINGS
+
+
+SET_SETTINGS request allows setting some of the data reported by GET_SETTINGS.
+Request flags, info_mask and index are ignored. These attributes are allowed
+to be passed with SET_SETTINGS request:
+
+ETHA_SETTINGS_SPEED (u32)   link speed (Mb/s)
+ETHA_SETTINGS_DUPLEX(u8)duplex mode
+ETHA_SETTINGS_PORT  (u8)physical port
+ETHA_SETTINGS_PHYADDR   (u8)MDIO address of phy
+ETHA_SETTINGS_AUTONEG   (u8)autoneotiation status
+ETHA_SETTINGS_TP_MDIX_CTRL  (u8)MDI(-X) control
+ETHA_SETTINGS_WOL_MODES (bitfield32)wake-on-lan modes
+ETHA_SETTINGS_SOPASS(binary)SecureOn(tm) password
+ETHA_SETTINGS_MSGLVL(bitfield32)debug level
+ETHA_SETTINGS_LINK_MODES(bitset)device link modes
+
+For both bitfield32 types, value and selector work the usual way, i.e. bits
+set in selector are set to corresponding bits from value and the rest is
+preserved. In a similar fashion, ETHA_SETTINGS_LINK_MODES allows setting
+advertised link modes.
+
+If autonegotiation is on (either set now or kept from before), advertised
+modes are not changed (no ETHA_SETTINGS_LINK_MODES attribute) and at least one
+of speed and duplex is specified, kernel adjusts advertised modes to all
+supported modes matching speed, duplex or both (whatever is specified). This
+autoselection is done on ethtool side with ioctl interface, netlink interface
+is supposed to allow requesting changes without knowing what exactly kernel
+supports.
+
+
 Request translation
 ---
 
@@ -209,13 +241,13 @@ have their netlink replacement yet.
 ioctl command  netlink command
 -
 ETHTOOL_GSET   ETHTOOL_CMD_GET_SETTINGS
-ETHTOOL_SSET   n/a
+ETHTOOL_SSET   ETHTOOL_CMD_SET_SETTINGS
 ETHTOOL_GDRVINFO   ETHTOOL_CMD_GET_DRVINFO
 ETHTOOL_GREGS  n/a
 ETHTOOL_GWOL   ETHTOOL_CMD_GET_SETTINGS
-ETHTOOL_SWOL   n/a
+ETHTOOL_SWOL   ETHTOOL_CMD_SET_SETTINGS
 ETHTOOL_GMSGLVLETHTOOL_CMD_GET_SETTINGS
-ETHTOOL_SMSGLVLn/a
+ETHTOOL_SMSGLVLETHTOOL_CMD_SET_SETTINGS
 ETHTOOL_NWAY_RST   n/a
 ETHTOOL_GLINK  ETHTOOL_CMD_GET_SETTINGS
 ETHTOOL_GEEPROMn/a
@@ -283,7 +315,7 @@ ETHTOOL_STUNABLEn/a
 ETHTOOL_GPHYSTATS  n/a
 ETHTOOL_PERQUEUE   n/a
 ETHTOOL_GLINKSETTINGS  ETHTOOL_CMD_GET_SETTINGS
-ETHTOOL_SLINKSETTINGS  n/a
+ETHTOOL_SLINKSETTINGS  ETHTOOL_CMD_SET_SETTINGS
 ETHTOOL_PHY_GTUNABLE   n/a
 ETHTOOL_PHY_STUNABLE   n/a
 ETHTOOL_GFECPARAM  n/a
diff --git a/net/core/ethtool_netlink.c b/net/core/ethtool_netlink.c
index 0c2bed7850bc..4b14a02be12a 100644
--- a/net/core/ethtool_netlink.c
+++ b/net/core/ethtool_netlink.c
@@ -67,6 +67,222 @@ static const char *const link_mode_names[] = {
[ETHTOOL_LINK_MODE_FEC_BASER_BIT]   = "BASER",
 };
 
+struct link_mode_info {
+   int speed;
+   u8 duplex;
+};
+
+static const struct link_mode_info link_mode_params[] = {
+   [ETHTOOL_LINK_MODE_10baseT_Half_BIT]= {
+   .speed  = 10,
+   .duplex = DUPLEX_HALF,
+   },
+   [ETHTOOL_LINK_MODE_10baseT_Full_BIT]= {
+   .speed  = 10,
+   .duplex = DUPLEX_FULL,
+   },
+   [ETHTOOL_LINK_MODE_100bas

[RFC PATCH 8/9] ethtool: implement GET_PARAMS message

2017-12-11 Thread Michal Kubecek
Requests the information provide by ETHTOOL_GCOALESCE, ETHTOOL_GRINGPARAM,
ETHTOOL_GPAUSEPARAM, ETHTOOL_GCHANNELS, ETHTOOL_GEEE and ETHTOOL_GFECPARAM.
Flags in info_mask allow selecting only some (or on) of these categories.

Signed-off-by: Michal Kubecek <mkube...@suse.cz>
---
 Documentation/networking/ethtool-netlink.txt |  98 ++-
 include/uapi/linux/ethtool_netlink.h | 118 
 net/core/ethtool_netlink.c   | 411 +++
 3 files changed, 621 insertions(+), 6 deletions(-)

diff --git a/Documentation/networking/ethtool-netlink.txt 
b/Documentation/networking/ethtool-netlink.txt
index 187fab33f721..45ed1801ab50 100644
--- a/Documentation/networking/ethtool-netlink.txt
+++ b/Documentation/networking/ethtool-netlink.txt
@@ -111,6 +111,8 @@ List of message types
 ETHTOOL_CMD_SET_DRVINFOresponse only
 ETHTOOL_CMD_GET_SETTINGS
 ETHTOOL_CMD_SET_SETTINGS
+ETHTOOL_CMD_GET_PARAMS
+ETHTOOL_CMD_SET_PARAMS response only (for now)
 
 All constants use ETHTOOL_CMD_ prefix followed by "GET", "SET" or "ACT" to
 indicate the type.
@@ -231,6 +233,90 @@ is supposed to allow requesting changes without knowing 
what exactly kernel
 supports.
 
 
+GET_PARAMS
+--
+
+GET_PARAMS request retrieves information provided by legacy comands
+ETHTOOL_GCOALESCE (coalescing setting), ETHTOOL_GRINGPARAM (ring parameters),
+ETHTOOL_GPAUSEPARAM (pause parameters), ETHTOOL_GCHANNELS (channel settings),
+ETHTOOL_GEEE (EEE settings) and ETHTOOL_GFECPARAM (FEC parameters). For each
+of these, there is a bit in header info_mask so that only one type of
+information can be requested.
+
+Header flags:
+ETH_PARAMS_RF_COMPACT_BITSETS  bitset form in response (1 is compact)
+
+Header info_mask bits:
+ETH_PARAMS_IM_COALESCE coalescing settings
+ETH_PARAMS_IM_RING ring parameters
+ETH_PARAMS_IM_PAUSEpause parameters
+ETH_PARAMS_IM_CHANNELS channel settings
+ETH_PARAMS_IM_EEE  EEE settings
+ETH_PARAMS_IM_FEC  FEC parameters
+
+On top level, there is one attribute for each of the information categories,
+the information is nested in it.
+
+ETHA_PARAMS_COALESCE   (nest)  coalescing
+ETHA_COALESCE_RX_USECS (u32)
+ETHA_COALESCE_RX_MAXFRM(u32)
+ETHA_COALESCE_RX_USECS_IRQ (u32)
+ETHA_COALESCE_RX_MAXFRM_IRQ(u32)
+ETHA_COALESCE_RX_USECS_LOW (u32)
+ETHA_COALESCE_RX_MAXFRM_LOW(u32)
+ETHA_COALESCE_RX_USECS_HIGH(u32)
+ETHA_COALESCE_RX_MAXFRM_HIGH   (u32)
+ETHA_COALESCE_TX_USECS (u32)
+ETHA_COALESCE_TX_MAXFRM(u32)
+ETHA_COALESCE_TX_USECS_IRQ (u32)
+ETHA_COALESCE_TX_MAXFRM_IRQ(u32)
+ETHA_COALESCE_TX_USECS_LOW (u32)
+ETHA_COALESCE_TX_MAXFRM_LOW(u32)
+ETHA_COALESCE_TX_USECS_HIGH(u32)
+ETHA_COALESCE_TX_MAXFRM_HIGH   (u32)
+ETHA_COALESCE_PKT_RATE_LOW (u32)
+ETHA_COALESCE_PKT_RATE_HIGH(u32)
+ETHA_COALESCE_RX_USE_ADAPTIVE  (bool)
+ETHA_COALESCE_TX_USE_ADAPTIVE  (bool)
+ETHA_COALESCE_RATE_SAMPLE_INTERVAL (u32)
+ETHA_COALESCE_STATS_BLOCK_USECS(u32)
+ETHA_PARAMS_RING   (nest)  ring parameters
+ETHA_RING_RX_MAX_PENDING   (u32)
+ETHA_RING_RX_MINI_MAX_PENDING  (u32)
+ETHA_RING_RX_JUMBO_MAX_PENDING (u32)
+ETHA_RING_TX_MAX_PENDING   (u32)
+ETHA_RING_RX_PENDING   (u32)
+ETHA_RING_RX_MINI_PENDING  (u32)
+ETHA_RING_RX_JUMBO_PENDING (u32)
+ETHA_RING_TX_PENDING   (u32)
+ETHA_PARAMS_PAUSE  (nest)  pause parameters
+ETHA_PAUSE_AUTONEG (bool)
+ETHA_PAUSE_RX  (bool)
+ETHA_PAUSE_TX  (bool)
+ETHA_PARAMS_CHANNELS   (nest)  channel settings
+ETHA_CHANNELS_MAX_RX   (u32)
+ETHA_CHANNELS_MAX_TX   (u32)
+ETHA_CHANNELS_MAX_OTHER(u32)
+ETHA_CHANNELS_MAX_COMBINED (u32)
+ETHA_CHANNELS_RX_COUNT (u32)
+ETHA_CHANNELS_TX_COUNT (u32)
+ETHA_CHANNELS_OTHER_COUNT  (u32)
+ETHA_CHANNELS_COMBINED_COUNT   (u32)
+ETHA_PARAMS_EEE(nest)  EEE settings
+ETHA_EEE_LINK_MODES(bitset)
+   - modes for which EEE is advertised (value) or supported

[RFC PATCH 9/9] ethtool: implement SET_PARAMS message

2017-12-11 Thread Michal Kubecek
Sets the information provided by ETHTOOL_SCOALESCE, ETHTOOL_SRINGPARAM,
ETHTOOL_SPAUSEPARAM, ETHTOOL_SCHANNELS, ETHTOOL_SEEE and ETHTOOL_SFECPARAM.
Each of these has corresponding nesting attribute containing attributes for
its settings. This way, userspace can request changes equivalent to one or
more of the legacy requests (and provide only part of the data to update).

Signed-off-by: Michal Kubecek <mkube...@suse.cz>
---
 Documentation/networking/ethtool-netlink.txt |  71 -
 net/core/ethtool_netlink.c   | 422 +++
 2 files changed, 486 insertions(+), 7 deletions(-)

diff --git a/Documentation/networking/ethtool-netlink.txt 
b/Documentation/networking/ethtool-netlink.txt
index 45ed1801ab50..e96c18f52356 100644
--- a/Documentation/networking/ethtool-netlink.txt
+++ b/Documentation/networking/ethtool-netlink.txt
@@ -112,7 +112,7 @@ List of message types
 ETHTOOL_CMD_GET_SETTINGS
 ETHTOOL_CMD_SET_SETTINGS
 ETHTOOL_CMD_GET_PARAMS
-ETHTOOL_CMD_SET_PARAMS response only (for now)
+ETHTOOL_CMD_SET_PARAMS
 
 All constants use ETHTOOL_CMD_ prefix followed by "GET", "SET" or "ACT" to
 indicate the type.
@@ -316,6 +316,63 @@ the information is nested in it.
- active (value) and configured (selector) FEC encodings
 
 
+SET_PARAMS
+--
+
+SET_PARAMS request modifies the settings retrieved by GET_PARAMS, i.e. it
+replaces ETHTOOL_SCOALESCE, ETHTOOL_SRINGPARAM, ETHTOOL_SPAUSEPARAM,
+ETHTOOL_SCHANNELS, ETHTOOL_SEEE and ETHTOOL_SFECPARAM legacy commands. For
+each of these, relevant data attributes are contained in a corresponding nest
+attribute. Some of the attributes provided by GET_SETPARAMS are read only and
+cannot be set by SET_PARAMS request.
+
+ETHA_PARAMS_COALESCE   (nest)  coalescing
+ETHA_COALESCE_RX_USECS (u32)
+ETHA_COALESCE_RX_MAXFRM(u32)
+ETHA_COALESCE_RX_USECS_IRQ (u32)
+ETHA_COALESCE_RX_MAXFRM_IRQ(u32)
+ETHA_COALESCE_RX_USECS_LOW (u32)
+ETHA_COALESCE_RX_MAXFRM_LOW(u32)
+ETHA_COALESCE_RX_USECS_HIGH(u32)
+ETHA_COALESCE_RX_MAXFRM_HIGH   (u32)
+ETHA_COALESCE_TX_USECS (u32)
+ETHA_COALESCE_TX_MAXFRM(u32)
+ETHA_COALESCE_TX_USECS_IRQ (u32)
+ETHA_COALESCE_TX_MAXFRM_IRQ(u32)
+ETHA_COALESCE_TX_USECS_LOW (u32)
+ETHA_COALESCE_TX_MAXFRM_LOW(u32)
+ETHA_COALESCE_TX_USECS_HIGH(u32)
+ETHA_COALESCE_TX_MAXFRM_HIGH   (u32)
+ETHA_COALESCE_PKT_RATE_LOW (u32)
+ETHA_COALESCE_PKT_RATE_HIGH(u32)
+ETHA_COALESCE_RX_USE_ADAPTIVE  (bool)
+ETHA_COALESCE_TX_USE_ADAPTIVE  (bool)
+ETHA_COALESCE_RATE_SAMPLE_INTERVAL (u32)
+ETHA_COALESCE_STATS_BLOCK_USECS(u32)
+ETHA_PARAMS_RING   (nest)  ring parameters
+ETHA_RING_RX_PENDING   (u32)
+ETHA_RING_RX_MINI_PENDING  (u32)
+ETHA_RING_RX_JUMBO_PENDING (u32)
+ETHA_RING_TX_PENDING   (u32)
+ETHA_PARAMS_PAUSE  (nest)  pause parameters
+ETHA_PAUSE_AUTONEG (bool)
+ETHA_PAUSE_RX  (bool)
+ETHA_PAUSE_TX  (bool)
+ETHA_PARAMS_CHANNELS   (nest)  channel settings
+ETHA_CHANNELS_RX_COUNT (u32)
+ETHA_CHANNELS_TX_COUNT (u32)
+ETHA_CHANNELS_OTHER_COUNT  (u32)
+ETHA_CHANNELS_COMBINED_COUNT   (u32)
+ETHA_PARAMS_EEE(nest)  EEE settings
+ETHA_EEE_LINK_MODES(bitset)
+   - change modes for which EEE is advertised
+ETHA_EEE_ENABLED   (bool)
+ETHA_EEE_TX_LPI_ENABLED(bool)
+ETHA_EEE_TX_LPI_TIMER  (u32)
+ETHA_PARAMS_FEC(nest)  FEC parameters
+ETHA_FEC_MODES (bitfield32)
+   - change configured FEC encodings
+
 
 Request translation
 ---
@@ -339,11 +396,11 @@ ETHTOOL_GLINK ETHTOOL_CMD_GET_SETTINGS
 ETHTOOL_GEEPROMn/a
 ETHTOOL_SEEPROMn/a
 ETHTOOL_GCOALESCE  ETHTOOL_CMD_GET_PARAMS
-ETHTOOL_SCOALESCE  n/a
+ETHTOOL_SCOALESCE  ETHTOOL_CMD_SET_PARAMS
 ETHTOOL_GRINGPARAM ETHTOOL_CMD_GET_PARAMS
-ETHTOOL_SRINGPARAM n/a
+ETHTOOL_SRINGPARAM ETHTOOL_CMD_SET_PARAMS
 ETHTOOL_GPAUSEPARAMETHTOOL_CMD_GET_PARAMS
-ETHTOOL_SPAUSEPARAM

[RFC PATCH 6/9] ethtool: implement GET_SETTINGS message

2017-12-11 Thread Michal Kubecek
Requests the information provided by ETHTOOL_GLINKSETTINGS, ETHTOOL_GWOL
and ETHTOOL_GMSGLVL. The info_mask header field can be used to request only
part of the information. Flag ETH_SETTINGS_RF_COMPACT_BITSETS switches
between flag-by-flag list and compact bitmaps for link modes in the reply.

Signed-off-by: Michal Kubecek <mkube...@suse.cz>
---
 Documentation/networking/ethtool-netlink.txt |  62 +-
 include/linux/ethtool_netlink.h  |   3 +
 include/linux/netdevice.h|   2 +
 include/uapi/linux/ethtool.h |   3 +
 include/uapi/linux/ethtool_netlink.h |  36 
 net/core/ethtool.c   | 108 +-
 net/core/ethtool_common.c| 112 ++
 net/core/ethtool_common.h|   8 +
 net/core/ethtool_netlink.c   | 299 +++
 9 files changed, 528 insertions(+), 105 deletions(-)

diff --git a/Documentation/networking/ethtool-netlink.txt 
b/Documentation/networking/ethtool-netlink.txt
index cb992180b211..7aabc87c9f09 100644
--- a/Documentation/networking/ethtool-netlink.txt
+++ b/Documentation/networking/ethtool-netlink.txt
@@ -109,6 +109,8 @@ List of message types
 
 ETHTOOL_CMD_GET_DRVINFO
 ETHTOOL_CMD_SET_DRVINFOresponse only
+ETHTOOL_CMD_GET_SETTINGS
+ETHTOOL_CMD_SET_SETTINGS   response only (for now)
 
 All constants use ETHTOOL_CMD_ prefix followed by "GET", "SET" or "ACT" to
 indicate the type.
@@ -147,6 +149,56 @@ response.
 All information is read only, SET_DRVINFO request is not implemented.
 
 
+GET_SETTINGS
+
+
+GET_SETTINGS request retrieves information provided by ETHTOOL_GLINKSETTINGS,
+ETHTOOL_GWOL, ETHTOOL_GMSGLVL and ETHTOOL_GLINK ioctl requests. The request
+doesn't use any attributes.
+
+Header flags:
+ETH_SETTINGS_RF_COMPACT_BITSETSbitset form in response (1 is compact)
+
+Header info_mask bits:
+ETH_SETTINGS_IM_LINKINFO   link_ksettings except link modes
+ETH_SETTINGS_IM_LINKMODES  link modes from link_ksettings
+ETH_SETTINGS_IM_MSGLEVEL   msglevel
+ETH_SETTINGS_IM_WOLINFOstruct ethtool_wolinfo
+ETH_SETTINGS_IM_LINK   link state
+
+Zero info_mask
+
+Response contents:
+
+ETHA_SETTINGS_SPEED(u32)   link speed (Mb/s)
+ETHA_SETTINGS_DUPLEX   (u8)duplex mode
+ETHA_SETTINGS_PORT (u8)physical port
+ETHA_SETTINGS_PHYADDR  (u8)MDIO address of phy
+ETHA_SETTINGS_AUTONEG  (u8)autoneotiation status
+ETHA_SETTINGS_MDIO_SUPPORT (bitfield32)MDIO support flags
+ETHA_SETTINGS_TP_MDIX  (u8)MDI(-X) status
+ETHA_SETTINGS_TP_MDIX_CTRL (u8)MDI(-X) control
+ETHA_SETTINGS_TRANSCEIVER  (u8)transceiver
+ETHA_SETTINGS_WOL_MODES(bitfield32)wake-on-lan modes
+ETHA_SETTINGS_SOPASS   (binary)SecureOn(tm) password
+ETHA_SETTINGS_MSGLVL   (bitfield32)debug level
+ETHA_SETTINGS_LINK_MODES   (bitset)device link modes
+ETHA_SETTINGS_PEER_MODES   (bitset)link partner link modes
+ETHA_SETTINGS_LINK (u32)   link state
+
+Most of the attributes have the same meaning (including values) as
+corresponding members of ioctl structures. For ETHA_SETTINGS_MDIO_SUPPORT and
+ETHA_SETTINGS_MSGLVL, selector reports flags supported by kernel. For
+ETHA_SETTINGS_WOL_MODES it reports flags supported by the device. For
+ETHA_SETTINGS_LINK_MODES, value represent advertised modes and mask represents
+supported modes. For ETHA_SETTINGS_PEER_MODES, both value and mask represent
+partner advertised link modes.
+
+GET_SETTINGS request is allowed for unprivileged user but ETHA_SETTINGS_SOPASS
+is only provided by kernel in response to privileged (netns CAP_NET_ADMIN)
+requests.
+
+
 Request translation
 ---
 
@@ -156,16 +208,16 @@ have their netlink replacement yet.
 
 ioctl command  netlink command
 -
-ETHTOOL_GSET   n/a
+ETHTOOL_GSET   ETHTOOL_CMD_GET_SETTINGS
 ETHTOOL_SSET   n/a
 ETHTOOL_GDRVINFO   ETHTOOL_CMD_GET_DRVINFO
 ETHTOOL_GREGS  n/a
-ETHTOOL_GWOL   n/a
+ETHTOOL_GWOL   ETHTOOL_CMD_GET_SETTINGS
 ETHTOOL_SWOL   n/a
-ETHTOOL_GMSGLVLn/a
+ETHTOOL_GMSGLVLETHTOOL_CMD_GET_SETTINGS
 ETHTOOL_SMSGLVLn/a
 ETHTOOL_NWAY_RST   n/a
-ETHTOOL_GLINK  n/a
+ETHTOOL_GLINK  ETHTOOL_CMD_GET_SETTINGS
 ETHTOOL_GEEPROMn/a
 ETHTOOL_SEEPROMn/a
 ETHTOOL_GCOALESCE  n/a
@@ -230,7 +282,7 @@ ETHTOOL_GTUNA

[RFC PATCH 4/9] ethtool: netlink bitset handling

2017-12-11 Thread Michal Kubecek
Declare attribute type constants and add helper functions to handle
arbitrary length bit sets.

Signed-off-by: Michal Kubecek <mkube...@suse.cz>
---
 Documentation/networking/ethtool-netlink.txt |  56 +
 include/uapi/linux/ethtool_netlink.h |  31 +++
 net/core/ethtool_netlink.c   | 346 +++
 3 files changed, 433 insertions(+)

diff --git a/Documentation/networking/ethtool-netlink.txt 
b/Documentation/networking/ethtool-netlink.txt
index c94da66cb5fb..893e5156f6a7 100644
--- a/Documentation/networking/ethtool-netlink.txt
+++ b/Documentation/networking/ethtool-netlink.txt
@@ -48,6 +48,62 @@ in "set" requests). For these attributes, the "true" value 
should be passed as
 number 1 but any non-zero value should be understood as "true" by recipient.
 
 
+Bit sets
+
+
+For short bitmaps of (reasonably) fixed length, standard NLA_BITFIELD32 type
+is used. For arbitrary length bitmaps, ethtool netlink uses a nested attribute
+with contents of one of two forms: compact (two binary bitmaps representing
+bit values and mask of affected bits) and bit-by-bit (list of bits identified
+by either index or name).
+
+Compact form: nested (bitset) atrribute contents:
+
+ETHA_BITSET_SIZE   (u32)   number of significant bits
+ETHA_BITSET_VALUE  (binary)bitmap of bit values
+ETHA_BITSET_MASK   (binary)bitmap of valid bits
+
+Value and mask must have length at least ETHA_BITSET_SIZE bits rounded up to
+a multiple of 32 bits. They consist of 32-bit words in host byt order, words
+ordered from least significant to most significant (i.e. the same way as
+bitmaps are passed with ioctl interface).
+
+For compact form, ETHA_BITSET_SIZE and ETHA_BITSET_VALUE are mandatory.
+Similar to BITFIELD32, a compact form bit set requests to set bits in the mask
+to 1 (if set in value) or 0 (if not) and preserve the rest. If the mask is
+omitted, it is supposed to be "all ones", i.e. set all bits according to
+value.
+
+Kernel bit set length may differ from userspace length if older application is
+used on newer kernel or vice versa. If userspace bitmaps are longer, error is
+only issued if request actually tries to set bits not implemented in kernel.
+
+Bit-by-bit form: nested (bitset) attribute contents:
+
+ETHA_BITSET_SIZE   (u32)   number of significant bits (optional)
+ETHA_BITSET_BITS   (nested)array of bits
+   ETHA_BITSET_BIT
+   ETHA_BIT_INDEX  (u32)   bit index (0 for LSB)
+ETHA_BIT_NAME  (string)bit name
+   ETHA_BIT_VALUE  (flag)  present if bit is set
+ETHA_BITSET_BIT
+   ...
+
+Bit size is optional for bit-by-bit form. ETHA_BITSET_BITS nest can only
+contain ETHA_BITS_BIT attributes but there can be an arbitrary number of them.
+A bit may be identified by its index or by its name. When used in requests,
+listed bits are set to 0 or 1 according to ETHA_BIT_VALUE, the rest is
+preserved. A request fails if index exceeds kernel bit length or if name is
+not recognized.
+
+In requests, application can use either form. Form used by kernel in reply is
+determined by a flag in flags field of request header. Semantics of value and
+mask depends on the attribute. General idea is that flags control request
+processing, info_mask control which parts of the information are returned in
+"get" request and index identifies a particular subcommand or an object to
+which the request applies.
+
+
 List of message types
 -
 
diff --git a/include/uapi/linux/ethtool_netlink.h 
b/include/uapi/linux/ethtool_netlink.h
index 06cff2b52dfe..6db35f00ea32 100644
--- a/include/uapi/linux/ethtool_netlink.h
+++ b/include/uapi/linux/ethtool_netlink.h
@@ -26,6 +26,37 @@ enum {
ETHTOOL_CMD_MAX = (__ETHTOOL_CMD_MAX - 1),
 };
 
+/* bit sets */
+
+enum {
+   ETHA_BIT_UNSPEC,
+   ETHA_BIT_INDEX, /* u32 */
+   ETHA_BIT_NAME,  /* string */
+   ETHA_BIT_VALUE, /* flag */
+
+   __ETHA_BIT_MAX,
+   ETHA_BIT_MAX = (__ETHA_BIT_MAX - 1),
+};
+
+enum {
+   ETHA_BITS_UNSPEC,
+   ETHA_BITS_BIT,
+
+   __ETHA_BITS_MAX,
+   ETHA_BITS_MAX = (__ETHA_BITS_MAX - 1),
+};
+
+enum {
+   ETHA_BITSET_UNSPEC,
+   ETHA_BITSET_SIZE,   /* u32 */
+   ETHA_BITSET_BITS,   /* nest - ETHA_BITS_* */
+   ETHA_BITSET_VALUE,  /* binary */
+   ETHA_BITSET_MASK,   /* binary */
+
+   __ETHA_BITSET_MAX,
+   ETHA_BITSET_MAX = (__ETHA_BITSET_MAX - 1),
+};
+
 /* generic netlink info */
 #define ETHTOOL_GENL_NAME "ethtool"
 #define ETHTOOL_GENL_VERSION 1
diff --git a/net/core/ethtool_netlink.c b/net/core/ethtool_netlink.c
index 22d23d057623..9f287e67f30b 100644
--- a/net/core/ethtool_netlink.c
+++ b/net/core

[RFC PATCH 2/9] ethtool: introduce ethtool netlink interface

2017-12-11 Thread Michal Kubecek
No function implemented yet, only genetlink and module infrastructure.
Register/unregister genetlink family "ethtool" and allow the module to be
autoloaded by genetlink code (if built as a module, distributions would
probably prefer "y").

Signed-off-by: Michal Kubecek <mkube...@suse.cz>
---
 Documentation/networking/ethtool-netlink.txt | 155 +++
 include/linux/ethtool_netlink.h  |   9 ++
 include/uapi/linux/ethtool_netlink.h |  33 ++
 net/Kconfig  |   7 ++
 net/core/Makefile|   1 +
 net/core/ethtool_netlink.c   |  46 
 6 files changed, 251 insertions(+)
 create mode 100644 Documentation/networking/ethtool-netlink.txt
 create mode 100644 include/linux/ethtool_netlink.h
 create mode 100644 include/uapi/linux/ethtool_netlink.h
 create mode 100644 net/core/ethtool_netlink.c

diff --git a/Documentation/networking/ethtool-netlink.txt 
b/Documentation/networking/ethtool-netlink.txt
new file mode 100644
index ..c94da66cb5fb
--- /dev/null
+++ b/Documentation/networking/ethtool-netlink.txt
@@ -0,0 +1,155 @@
+Netlink interface for ethtool
+=
+
+
+Basic information
+-
+
+Netlink interface for ethtool uses generic netlink family "ethtool" (userspace
+application should use macros ETHTOOL_GENL_NAME and ETHTOOL_GENL_VERSION
+defined in  uapi header). This family uses a message
+header of 24 bytes:
+
+struct ethnlmsghdr {
+   __u32   ifindex;
+   __u16   flags;
+   union {
+   __u16   info_mask;
+   __u16   index;
+   }
+   charifname[IFNAMSIZ];
+};
+
+In requests, device can be identified by ifindex or by name; if both are used,
+they must match. In replies, kernel fills both. The meaning of flags,
+info_mask and index fields depends on request type.
+
+The ethtool netlink interface uses extended ACK for error and warning
+reporting, userspace application developers are encouraged to make these
+messages available to user in a suitable way.
+
+Requests can be divided into three categories: "get" (retrieving information),
+"set" (setting parameters) and "action" (invoking an action).
+
+All "set" and "action" type requests require admin privileges (CAP_NET_ADMIN
+in the namespace). Most "get" type request are allowed for anyone but there
+are exceptions (where the response contains sensitive information). In some
+cases, the request as such is allowed for anyone but unprivileged users have
+attributes with sensitive information (e.g. wake-on-lan password) omitted.
+
+
+Conventions
+---
+
+Attributes which represent a boolean value usually use u8 type so that we can
+distinguish three states: "on", "off" and "not present" (meaning the
+information is not available in "get" requests or value is not to be changed
+in "set" requests). For these attributes, the "true" value should be passed as
+number 1 but any non-zero value should be understood as "true" by recipient.
+
+
+List of message types
+-
+
+All constants use ETHTOOL_CMD_ prefix followed by "GET", "SET" or "ACT" to
+indicate the type.
+
+Messages of type "get" are used by userspace to request information and
+usually do not contain any attributes (that may be added later for dump
+filtering). Kernel response is in the form of corresopinding "set" message;
+the same message can be also used to set (some of) the parameters, except for
+messages marked as "response only" in the table above.
+
+Later sections describe the format and semantics of these request messages.
+
+
+Request translation
+---
+
+The following table maps iosctl commands to netlink commands providing their
+functionality. Entries with "n/a" in right column are commands which do not
+have their netlink replacement yet.
+
+ioctl command  netlink command
+-
+ETHTOOL_GSET   n/a
+ETHTOOL_SSET   n/a
+ETHTOOL_GDRVINFO   n/a
+ETHTOOL_GREGS  n/a
+ETHTOOL_GWOL   n/a
+ETHTOOL_SWOL   n/a
+ETHTOOL_GMSGLVLn/a
+ETHTOOL_SMSGLVLn/a
+ETHTOOL_NWAY_RST   n/a
+ETHTOOL_GLINK  n/a
+ETHTOOL_GEEPROMn/a
+ETHTOOL_SEEPROMn/a
+ETHTOOL_GCOALESCE  n/a
+ETHTOOL_SCOALESCE  n/a
+ETHTOOL_GRINGPARAM n/a
+ETHTOOL_SRINGPARAM n/a
+ETHTOOL_GPAUSEPARAMn/a
+ETHTOOL_SPAUSEPARAMn/a
+ETHTOOL_GRXCSUM 

[RFC PATCH 0/9] ethtool netlink interface (WiP)

2017-12-11 Thread Michal Kubecek
This is still work in progress and only a very small part of the ioctl
interface is reimplemented but I would like to get some comments before
the patchset becomes too big and changing things becomes too tedious.

The interface used for communication between ethtool and kernel is based on
ioctl() and suffers from many problems. The most pressing seems the be the
lack of extensibility. While some of the newer commands use structures
designed to allow future extensions (e.g. GFEATURES or TEST), most either
allow no extension at all (GPAUSEPARAM, GCOALESCE) or only limited set of
reserved fields (GDRVINFO, GEEE). Even most of those which support future
extensions limit the data types that can be used.

This series aims to provide an alternative interface based on netlink which
is what other network configuration utilities use. In particular, it uses
generic netlink (family "ethtool"). The goal is to provide an interface
which would be extensible, flexible and practical both for ethtool and for
other network configuration tools (e.g. wicked or systemd-networkd).

The interface is documented in Documentation/networking/ethtool-netlink.txt

I'll post RFC patch series for ethtool in a few days, it still needs some
more polishing.

Basic concepts:

- the interface is based on generic netlink (family name "ethtool")

- provide everything ioctl can do but allow easier future extensions

- inextensibility of ioctl interface resulted in way too many commands,
  many of them obsoleted by newer ones; reduce the number by  ignoring the
  obsolete commands and grouping some together

- for "set" type commands, netlink allows providing only the attributes to
  be changed; therefore we don't need a get-modify-set cycle, userspace can
  simply say what it wants to change and how

- be less dependent on ethtool and kernel being in sync; allow e.g. saying
  "ethtool -s eth0 advertise xyz off" without knowing what "xyz" means;
  it's kernel's job to know what mode "xyz" is and if it exists and is
  supported

Unresolved questions/tasks:

- allow dumps for "get" type requests, e.g. listing EEE settings for all
  interfaces in current netns

- find reasonable format for data transfers (e.g. eeprom dump or flash)

- while the netlink interface allows easy future extensions, ethtool_ops
  interface does not; some settings could be implemented using tunables and
  accessed via relevant netlink messages (as well as tunables) from
  userspace but in the long term, something better will be needed

- it would be nice if driver could provide useful error/warning messages to
  be passed to userspace via extended ACK; example: while testing, I found
  a driver which only allows values 0, 1, 3 and 1 for certain parameter
  but the only way poor user can find out is either by trying all values or
  by checking driver source

Michal Kubecek (9):
  netlink: introduce nla_put_bitfield32()
  ethtool: introduce ethtool netlink interface
  ethtool: helper functions for netlink interface
  ethtool: netlink bitset handling
  ethtool: implement GET_DRVINFO message
  ethtool: implement GET_SETTINGS message
  ethtool: implement SET_SETTINGS message
  ethtool: implement GET_PARAMS message
  ethtool: implement SET_PARAMS message

 Documentation/networking/ethtool-netlink.txt |  466 ++
 include/linux/ethtool_netlink.h  |   12 +
 include/linux/netdevice.h|2 +
 include/net/netlink.h|   15 +
 include/uapi/linux/ethtool.h |3 +
 include/uapi/linux/ethtool_netlink.h |  239 +++
 net/Kconfig  |7 +
 net/core/Makefile|3 +-
 net/core/ethtool.c   |  150 +-
 net/core/ethtool_common.c|  158 ++
 net/core/ethtool_common.h|   19 +
 net/core/ethtool_netlink.c   | 2323 ++
 12 files changed, 3260 insertions(+), 137 deletions(-)
 create mode 100644 Documentation/networking/ethtool-netlink.txt
 create mode 100644 include/linux/ethtool_netlink.h
 create mode 100644 include/uapi/linux/ethtool_netlink.h
 create mode 100644 net/core/ethtool_common.c
 create mode 100644 net/core/ethtool_common.h
 create mode 100644 net/core/ethtool_netlink.c

-- 
2.15.1



Re: Linux 4.14 - regression: broken tun/tap / bridge network with virtio - bisected

2017-12-08 Thread Michal Kubecek
On Fri, Dec 08, 2017 at 01:45:38PM +0100, Andreas Hartmann wrote:
> On 12/08/2017 at 12:40 PM Michal Kubecek wrote:
> > On Fri, Dec 08, 2017 at 11:31:50AM +0100, Andreas Hartmann wrote:
> >>
> >> When will there be a fix for 4.14? It is clearly a regression. Is
> >> it possible / a good idea to just remove the complete patch series
> >> "Remove UDP Fragmentation Offload support"?
> > 
> > I cannot give an exact date but the patch is queued for stable (see
> > http://patchwork.ozlabs.org/bundle/davem/stable/?state=* ) so that
> > it should land in stable-4.14 in near future (weeks at most).
> 
> Which one is it? I couldn't find any patch related to this problem at
> first glance.

"[net,v2] net: accept UFO datagrams from tuntap and packet" - the
subject was mentioned in one of my earlier e-mails (with commit id).

Michal Kubecek



Re: Linux 4.14 - regression: broken tun/tap / bridge network with virtio - bisected

2017-12-08 Thread Michal Kubecek
On Fri, Dec 08, 2017 at 11:31:50AM +0100, Andreas Hartmann wrote:
> On 12/08/2017 at 09:47 AM Michal Kubecek wrote:
> > On Fri, Dec 08, 2017 at 08:21:16AM +0100, Andreas Hartmann wrote:
> >>
> >> All my VMs are using virtio_net. BTW: I couldn't see the problems
> >> (sometimes, the VM couldn't be stopped at all) if all my VMs are using
> >> e1000 as interface instead.
> >>
> >> This finding now matches pretty much the responsible UDP-package which
> >> caused the stall. I already mentioned it here [2].
> >>
> >> To prove it, I reverted from the patch series "[PATCH v2 RFC 0/13]
> >> Remove UDP Fragmentation Offload support" [3]
> >>
> >> 11/13 [v2,RFC,11/13] net: Remove all references to SKB_GSO_UDP. [4]
> >> 12/13 [v2,RFC,12/13] inet: Remove software UFO fragmenting code. [5]
> >> 13/13 [v2,RFC,13/13] net: Kill NETIF_F_UFO and SKB_GSO_UDP. [6]
> >>
> >> and applied it to Linux 4.14.4. It compiled fine and is running fine.
> >> The vnet doesn't die anymore. Yet, I can't say if the qemu stop hangs
> >> are gone, too.
> >>
> >> Obviously, there is something broken with the new UDP handling. Could
> >> you please analyze this problem? I could test some more patches ... .
> > 
> > Any chance your VMs were live migrated from pre-4.14 host kernel?
> 
> No - the VMs are not live migrated. They are always running on the same
> host - either with kernel < 4.14 or with kernel 4.14.x.

This is disturbing... unless I'm mistaken, it shouldn't be possible to
have UFO enabled on a virtio device in a VM booted on a host with 4.14
kernel.

> > If this is the case, you should try commit 0c19f846d582 ("net:
> > accept UFO datagrams from tuntap and packet"). 
> 
> It doesn't apply to 4.14.4
> 
> > Or disabling UFO in the guest should
> > work around the issue.
> 
> ethtool -K ethX ufo off for each device / bridge in VM.
> 
> Yes, this seems to work. I'll wait and see if the non stoppable
> qemu-problem on shutdown will remain.
> 
> When will there be a fix for 4.14? It is clearly a regression. Is it
> possible / a good idea to just remove the complete patch series "Remove
> UDP Fragmentation Offload support"?

I cannot give an exact date but the patch is queued for stable
(see http://patchwork.ozlabs.org/bundle/davem/stable/?state=* ) so that
it should land in stable-4.14 in near future (weeks at most).

 Michal Kubecek



Re: Linux 4.14 - regression: broken tun/tap / bridge network with virtio - bisected

2017-12-08 Thread Michal Kubecek
On Fri, Dec 08, 2017 at 08:21:16AM +0100, Andreas Hartmann wrote:
> 
> Thanks for this hint - I'm not using xdp. Therefore I rechecked my
> bisect and detected a mistake. The rebisect now leads to
> 
> 
> 
> [v2,RFC,11/13] net: Remove all references to SKB_GSO_UDP. [1]
> 
> 
> 
> For the repeated bisect, I switched back to the original qemu 2.6.2
> (instead of 2.10.1), because problems can be seen reliably with 2.6.2.
> 
> All my VMs are using virtio_net. BTW: I couldn't see the problems
> (sometimes, the VM couldn't be stopped at all) if all my VMs are using
> e1000 as interface instead.
> 
> This finding now matches pretty much the responsible UDP-package which
> caused the stall. I already mentioned it here [2].
> 
> To prove it, I reverted from the patch series "[PATCH v2 RFC 0/13]
> Remove UDP Fragmentation Offload support" [3]
> 
> 11/13 [v2,RFC,11/13] net: Remove all references to SKB_GSO_UDP. [4]
> 12/13 [v2,RFC,12/13] inet: Remove software UFO fragmenting code. [5]
> 13/13 [v2,RFC,13/13] net: Kill NETIF_F_UFO and SKB_GSO_UDP. [6]
> 
> and applied it to Linux 4.14.4. It compiled fine and is running fine.
> The vnet doesn't die anymore. Yet, I can't say if the qemu stop hangs
> are gone, too.
> 
> Obviously, there is something broken with the new UDP handling. Could
> you please analyze this problem? I could test some more patches ... .

Any chance your VMs were live migrated from pre-4.14 host kernel? If
this is the case, you should try commit 0c19f846d582 ("net: accept UFO
datagrams from tuntap and packet"). Or disabling UFO in the guest should
work around the issue.

  Michal Kubecek


Re: [PATCH v2 3/3] ethtool: Add ETHTOOL_RESET support via --reset command

2017-12-05 Thread Michal Kubecek
On Tue, Dec 05, 2017 at 02:06:09PM -0800, Scott Branden wrote:
> On 17-12-05 01:30 PM, Michal Kubecek wrote:
> > On Tue, Dec 05, 2017 at 12:53:23PM -0800, Scott Branden wrote:
> > > Add ETHTOOL_RESET support via --reset command.
> > > 
> > > ie.  ethtool --reset DEVNAME <flagname(s)>
> > > 
> > > flagnames currently match the ETH_RESET_xxx names:
> > > mgmt,irq,dma,filter,offload,mac,phy,ram,dedicated,all
> > > 
> > > Alternatively, you can specific component bitfield directly using
> > > ethtool --reset DEVNAME flags %x
> > IMHO it would be more consistent with e.g. msglvl without the keyword
> > "flags".
> I don't see the consistency in ethtool of specifying a number without a
> keyword in front of it.
> I can only find --set-dump specify a number?
> Others have keyword and number.  msglvl is the keyword after specifying -s -
> same as flags is the keyword I use after specifying --reset.

What I meant is that you can write

ethtool -s eth0 msglvl drv on probe off
ethtool -s eth0 msglvl 0x7

i.e. either number or names (with on/off in this case) while your patch
has

ethtool --reset eth0 mgmg,irq
ethtool --reset eth0 flags 0x3

i.e. an extra keyword if a number is used.

But it's not really important, it doesn't seem I would be able to share
a parser for this with any other subcommand or parameter anyway.

> >   It would be also nice to provide a symbolic way to specify the
> > shared flags.
> 
> I'll change to allow -shared to be added to the end of each component
> specified to use the shared bit.
>  IE. mgmt-shared, irq-shared, dma-shared ?

Sounds good to me.

> > > + resetinfo.cmd = ETHTOOL_RESET;
> > > +
> > > + if (send_ioctl(ctx, )) {
> > > + perror("Cannot issue RESET");
> > > + return 1;
> > > + }
> > > + fprintf(stdout, "RESET 0x%x issued\n", resetinfo.data);
> > 
> > According to documentation, driver is supposed to clear the flags
> > corresponding to components which were reset so that what is left are
> > those which were _not_ reset.
> 
> I'll move the print above the send_ioctl.

It might be even more useful if ethtool informed user what actually
happened, i.e. either change the message to saying these are bits for
components not reset (if resetinfo.data is not zero) or save the
original value of resetinfo.data and show  saved_data & ~resetinfo.data

Michal Kubecek



Re: [PATCH v2 3/3] ethtool: Add ETHTOOL_RESET support via --reset command

2017-12-05 Thread Michal Kubecek
On Tue, Dec 05, 2017 at 12:53:23PM -0800, Scott Branden wrote:
> Add ETHTOOL_RESET support via --reset command.
> 
> ie.  ethtool --reset DEVNAME <flagname(s)>
> 
> flagnames currently match the ETH_RESET_xxx names:
> mgmt,irq,dma,filter,offload,mac,phy,ram,dedicated,all
> 
> Alternatively, you can specific component bitfield directly using
> ethtool --reset DEVNAME flags %x

IMHO it would be more consistent with e.g. msglvl without the keyword
"flags". It would be also nice to provide a symbolic way to specify the
shared flags.

> +static int do_reset(struct cmd_context *ctx)
> +{
> + struct ethtool_value resetinfo;
> + int argc = ctx->argc;
> + char **argp = ctx->argp;
> + int i;
> +
> + if (argc == 0)
> + exit_bad_args();
> +
> + resetinfo.data = 0;
> +
> + for (i = 0; i < argc; i++) {
> + if (!strcmp(argp[i], "flags")) {
> + __u32 flags;
> +
> + i++;
> + if (i >= argc)
> + exit_bad_args();
> + flags = strtoul(argp[i], NULL, 0);
> + if (flags == 0)
> + exit_bad_args();
> + else
> + resetinfo.data |= flags;
> + } else if (!strcmp(argp[i], "mgmt")) {
> + resetinfo.data |= ETH_RESET_MGMT;
> + } else if (!strcmp(argp[i], "irq")) {
> + resetinfo.data |= ETH_RESET_IRQ;
> + } else if (!strcmp(argp[i], "dma")) {
> + resetinfo.data |= ETH_RESET_DMA;
> + } else if (!strcmp(argp[i], "filter")) {
> + resetinfo.data |= ETH_RESET_FILTER;
> + } else if (!strcmp(argp[i], "offload")) {
> + resetinfo.data |= ETH_RESET_OFFLOAD;
> + } else if (!strcmp(argp[i], "mac")) {
> + resetinfo.data |= ETH_RESET_MAC;
> + } else if (!strcmp(argp[i], "phy")) {
> + resetinfo.data |= ETH_RESET_PHY;
> + } else if (!strcmp(argp[i], "ram")) {
> + resetinfo.data |= ETH_RESET_RAM;
> + } else if (!strcmp(argp[i], "ap")) {
> + resetinfo.data |= ETH_RESET_AP;
> + } else if (!strcmp(argp[i], "dedicated")) {
> + resetinfo.data |= ETH_RESET_DEDICATED;
> + } else if (!strcmp(argp[i], "all")) {
> + resetinfo.data |= ETH_RESET_ALL;
> + } else {
> + exit_bad_args();
> + }
> + }
> +
> + resetinfo.cmd = ETHTOOL_RESET;
> +
> + if (send_ioctl(ctx, )) {
> + perror("Cannot issue RESET");
> + return 1;
> + }
> + fprintf(stdout, "RESET 0x%x issued\n", resetinfo.data);

According to documentation, driver is supposed to clear the flags
corresponding to components which were reset so that what is left are
those which were _not_ reset.

Michal Kubecek


Re: Fixing CVE-2017-16939 in v4.4.y and possibly v3.18.y

2017-12-02 Thread Michal Kubecek
On Sat, Dec 02, 2017 at 04:20:40PM -0800, Guenter Roeck wrote:
> On 12/01/2017 11:48 AM, Michal Kubecek wrote:
> > On Thu, Nov 30, 2017 at 10:37:40AM -0800, Guenter Roeck wrote:
> > > Hi,
> > > 
> > > The fix for CVE-2017-16939 has been applied to v4.9.y, but not to v4.4.y
> > > and older kernels. However, I confirmed that running the published POC
> > > (see https://blogs.securiteam.com/index.php/archives/3535) does crash a 
> > > 4.4
> > > kernel.
> > > 
> > > I confirmed that the following two patches fix the problem in v4.4.y.
> > > Please consider applying them to v4.4.y (and possibly v3.18.y).
> > > 
> > > fc9e50f5a5a4e ("netlink: add a start callback for starting a netlink 
> > > dump")
> > > 1137b5e2529a8 ("ipsec: Fix aborted xfrm policy dump crash")
> > > 
> > > My apologies for the noise if this is already under consideration.
> > 
> > It's a bit too big hammer. As Nicolai Stange noticed when we were
> 
> The hammer is just as big as the upstream hammer. Personally I prefer the
> upstream patch; I don't see a reason to deviate from upstream just because
> the upstream solution is more complex than necessary.

Comparing that little patch with the combination of the two commits,
I would say we have a very different idea what "as big as" means. :-)

> > handling this for SLE12 (where fc9e50f5a5a4e would break kABI), it's
> 
> I didn't know that this is even a concern for stable releases. Is there
> some guideline that kABI changes should be avoided in stable releases ?

Not to my knowledge, stable updates break kABI quite often. I just
mentioned it to explain why we had stronger motivation to find another
solution.

Michal Kubecek



Re: [PATCH iproute2] iproute2: Fix undeclared __kernel_long_t type build error in RHEL 6.8

2017-12-01 Thread Michal Kubecek
On Fri, Dec 01, 2017 at 08:48:07AM -0800, Stephen Hemminger wrote:
> On Fri,  1 Dec 2017 13:04:51 +0200
> Leon Romanovsky <l...@kernel.org> wrote:
> 
> > From: Leon Romanovsky <leo...@mellanox.com>
> > 
> > Add asm/posix_types.h header file to the list of needed includes,
> > because the headers files in RHEL 6.8 are too old and doesn't
> > have declaration of __kernel_long_t.
> > 
> > In file included from ../include/uapi/linux/kernel.h:5,
> >  from ../include/uapi/linux/netfilter/x_tables.h:4,
> >  from ../include/xtables.h:20,
> >  from em_ipset.c:26:
> > ../include/uapi/linux/sysinfo.h:9: error: expected specifier-qualifier-list 
> > before ‘__kernel_long_t’
> > 
> > Cc: Riad Abo Raed <ri...@mellanox.com>
> > Cc: Guy Ergas <g...@mellanox.com>
> > Signed-off-by: Leon Romanovsky <leo...@mellanox.com>
> 
> I see the problem, but the solution of dragging in posix_types.h
> would be too much of a long term maintenance issue.
> All the headers in uapi are regularly generated from upstream
> kernel headers; I don't want to start making exceptions.
> 
> Is it just the xtables stuff (which has always been problematic)?

Actually, the only place where __kernel_long_t and __kernel_ulong_t
appear is struct sysinfo in include/uapi/linux/sysinfo.h and this
structure isn't even used anywhere in iproute2 source (not even in the
include/uapi/linux/kernel.h file which includes ).

So one could work around the problem by defining _LINUX_SYSINFO_H but
that seems a bit dirty hack.

Michal Kubecek



Re: Fixing CVE-2017-16939 in v4.4.y and possibly v3.18.y

2017-12-01 Thread Michal Kubecek
On Thu, Nov 30, 2017 at 10:37:40AM -0800, Guenter Roeck wrote:
> Hi,
> 
> The fix for CVE-2017-16939 has been applied to v4.9.y, but not to v4.4.y
> and older kernels. However, I confirmed that running the published POC
> (see https://blogs.securiteam.com/index.php/archives/3535) does crash a 4.4
> kernel.
> 
> I confirmed that the following two patches fix the problem in v4.4.y.
> Please consider applying them to v4.4.y (and possibly v3.18.y).
> 
> fc9e50f5a5a4e ("netlink: add a start callback for starting a netlink dump")
> 1137b5e2529a8 ("ipsec: Fix aborted xfrm policy dump crash")
> 
> My apologies for the noise if this is already under consideration.

It's a bit too big hammer. As Nicolai Stange noticed when we were
handling this for SLE12 (where fc9e50f5a5a4e would break kABI), it's
much simpler to use the flag we already have in cb->args[0] to let
xfrm_dump_policy_done() call xfrm_policy_walk_done() only if the walk
structure has been initialized. Thus all you need is the patch below.

Michal Kubecek

diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index 7a5a64e70b4d..c01c7a7eb4d3 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -1655,7 +1655,9 @@ static int xfrm_dump_policy_done(struct netlink_callback 
*cb)
struct xfrm_policy_walk *walk = (struct xfrm_policy_walk *) 
>args[1];
struct net *net = sock_net(cb->skb->sk);
 
-   xfrm_policy_walk_done(walk, net);
+   /* cb->args[0] is set when walk is initialized */
+   if (cb->args[0])
+   xfrm_policy_walk_done(walk, net);
return 0;
 }
 


[PATCH ipsec] xfrm: fix XFRMA_OUTPUT_MARK policy entry

2017-11-29 Thread Michal Kubecek
This seems to be an obvious typo, NLA_U32 is type of the attribute, not its
(minimal) length.

Fixes: 077fbac405bf ("net: xfrm: support setting an output mark.")
Signed-off-by: Michal Kubecek <mkube...@suse.cz>
---
 net/xfrm/xfrm_user.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index c2cfcc6fdb34..ff58c37469d6 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -2485,7 +2485,7 @@ static const struct nla_policy xfrma_policy[XFRMA_MAX+1] 
= {
[XFRMA_PROTO]   = { .type = NLA_U8 },
[XFRMA_ADDRESS_FILTER]  = { .len = sizeof(struct xfrm_address_filter) },
[XFRMA_OFFLOAD_DEV] = { .len = sizeof(struct xfrm_user_offload) },
-   [XFRMA_OUTPUT_MARK] = { .len = NLA_U32 },
+   [XFRMA_OUTPUT_MARK] = { .type = NLA_U32 },
 };
 
 static const struct nla_policy xfrma_spd_policy[XFRMA_SPD_MAX+1] = {
-- 
2.15.0



Re: [PATCH v5 next 1/5] modules:capabilities: add request_module_cap()

2017-11-28 Thread Michal Kubecek
On Tue, Nov 28, 2017 at 11:48:49PM +0100, Luis R. Rodriguez wrote:
> On Tue, Nov 28, 2017 at 02:18:18PM -0800, Kees Cook wrote:
> > On Tue, Nov 28, 2017 at 2:12 PM, Luis R. Rodriguez <mcg...@kernel.org> 
> > wrote:
> > > On Tue, Nov 28, 2017 at 01:39:58PM -0800, Kees Cook wrote:
> > >> On Tue, Nov 28, 2017 at 1:16 PM, Luis R. Rodriguez <mcg...@kernel.org> 
> > >> wrote:
> > >> > And *all* auto-loading uses aliases? What's the difference
> > >> > between auto-loading and direct-loading?
> > >>
> > >> The difference is the process privileges. Unprivilged autoloading
> > >> (e.g. int n_hdlc = N_HDLC; ioctl(fd,
> > >> TIOCSETD, _hdlc)), triggers a privileged call to finit_module()
> > >> under CAP_SYS_MODULE.
> > >
> > > Ah, so system call implicated request_module() calls.
> > 
> > Yup. Unprivileged user does something that ultimately hits a
> > request_module() in the kernel. Then the kernel calls out with the
> > usermode helper (which has CAP_SYS_MODULE) and calls finit_module().
> 
> Thanks, using this terminology is much better to understand than
> auto-loading, given it does make it clear an unprivileged call was one
> that initiated the request_module() call, there are many uses of
> request_module() which *are* privileged.
> 
> > > OK and since CAP_SYS_MODULE is much more restrictive one could
> > > argue, what's the point here?
> > 
> > The goal is to block an unprivileged user from being able to trigger a
> > module load without blocking root from loading modules directly.
> 
> I see now. Do we have an audit of all system calls which implicate a
> request_module() call? Networking is a good example for sure to start
> off with but I was curious if we have a grasp of how wide spread this
> could be.

I'm not sure it makes sense to classify this by syscalls. In networking,
request_module() can be triggered e.g. by a netlink message (genetlink
family lookup is an example not needing any privileges) so that one of
the syscalls would be sendmsg().

Michal Kubecek


Re: Linux 4.14 - regression: broken tun/tap / bridge network with virtio - bisected

2017-11-27 Thread Michal Kubecek
On Mon, Nov 27, 2017 at 05:46:14PM +0100, Andreas Hartmann wrote:
> 
> Using virtio not just breaks the network completely as described above,
> it even leaves a never stoppable or restartable qemu process (even kill
> -9 doesn't work). It's absolutely necessary to *force* a reboot to exit
> or restart the VM.
> 
> I switched back to linux 4.13 as 4.14 virtualization is quite unusable.
> 
> I'm not the only one affected:
> https://bugzilla.kernel.org/show_bug.cgi?id=197861

What does stack trace of that process look like (/proc/$pid/stack)? Is
it similar to the stack trace from kernel.org bugzilla?

Michal Kubecek


DMA coalescing - ethtool vs. kernel

2017-11-23 Thread Michal Kubecek
Hello,

while digging through the interface between ethtool and kernel,
I noticed that ethtool commit 5dd7bfbc5079 ("ethtool: Add DMA Coalescing
support") added new member dmac into struct ethtool_coalesce which is
part of kernel UAPI but there is no kernel counterpart to this change in
master, net or net-next tree.

This doesn't cause any serious trouble as with userspace structure
longer than kernel thinks, kernel would simply ignore the extra member
so that the feature "only" doesn't work. But I doubt such change could
be accepted to kernel side of the interface as new kernel would then
overflow shorter structure passed by older ethtool. Stephen Hemminger
mentioned the ABI compatibility issue when this patch was submitted:

  https://patchwork.ozlabs.org/patch/806049/#1757846

Does it make sense to have ethtool feature without kernel counterpart?

Michal Kubecek


Re: [RFC PATCH net-next 0/2] Configuring PFC stall prevention via ethtool

2017-11-16 Thread Michal Kubecek
On Thu, Nov 16, 2017 at 02:03:21PM +0200, Eran Ben Elisha wrote:
> On Thu, Nov 16, 2017 at 10:44 AM, Michal Kubecek <mkube...@suse.cz> wrote:
> >
> > I don't like adding another ethtool_ops callback tightly tied to the
> > structures passed via ioctl() but when I started to think what to
> > suggest as an alternative, I started to wonder if it is really necessary
> > to add a new ethtool command at all. Couldn't this be handled as
> > a tunable?
> 
> tunable seems as a good infrastructure to PHY tuning, however this
> feature is not a PHY feature.

There are two kinds: tunables (ETHTOOL_{G,S}TUNABLE) and phy tunables
(ETHTOOL_PHY_{G,S}TUNABLE). My understanding is that former is meant as
a generic interface for parameters related to net_device, latter for
parameters related to phydev.

It's only my guess but IMHO the idea behind (both kinds of) tunables was
to add at least some extensibility to the ioctl() interface so that we
don't have to add more and more new one-purpose commands (until we can
switch to netlink).

> To me, it's looks fit to ethtool -a where pause operations are being
> controlled.  Unfortunately set/get_pauseparam is not extensible and
> need a new operation.

Yes, logically it belongs there, no question about that. But the
relation between ethtool subcommands (on command line) and ioctl
commands (cmd member of the structures) is not 1:1 in general (this is
one of the reasons. After all, your patchset uses another command
anyway; my suggestion is to use an existing one (ETHTOOL_{G,S}TUNABLE)
rather than adding a new one (and new callback to ethtool_ops).

Michal Kubecek



Re: [RFC PATCH net-next 0/2] Configuring PFC stall prevention via ethtool

2017-11-16 Thread Michal Kubecek
On Wed, Nov 15, 2017 at 09:00:09PM +0200, Eran Ben Elisha wrote:
> From: Inbar Karmy <inb...@mellanox.com>
> 
> This RFC adds support for configuring PFC stall prevention through ethtool.
> 
> In the event where the device unexpectedly becomes unresponsive for a long
> period of time, flow control mechanism may propagate pause frames which will
> cause congestion spreading to the entire network.
> 
> To prevent this scenario, the device may implement a protection mechanism for
> monitoring and resolving such state.  The following patches allow the user to
> control the stall prevention functionality.
> 
> PFC stall prevention configuration is done via ethtool -a (pause).
> Two modes are introduced:
> Default - current behavior per driver.
> Auto - protection mechanism controlled automatically by the driver.
> Due to lack of extension ability of ethtool_ops set_pauseparam, a new
> ethtool_ops get_pfc_prevention_mode is introduced.

I don't like adding another ethtool_ops callback tightly tied to the
structures passed via ioctl() but when I started to think what to
suggest as an alternative, I started to wonder if it is really necessary
to add a new ethtool command at all. Couldn't this be handled as
a tunable?

Michal Kubecek

> I based this patch set on net-next commit 50895b9de1d3("tcp: highest_sack fix
> ").
> 
> Inbar Karmy (2):
>   ethtool: Add support for configuring PFC stall prevention in ethtool
>   net/mlx5e: PFC stall prevention support
> 
>  .../net/ethernet/mellanox/mlx5/core/en_ethtool.c   | 51 
>  drivers/net/ethernet/mellanox/mlx5/core/port.c | 56 
> ++
>  include/linux/ethtool.h|  8 
>  include/linux/mlx5/mlx5_ifc.h  | 22 +++--
>  include/linux/mlx5/port.h  |  5 ++
>  include/uapi/linux/ethtool.h   | 20 
>  net/core/ethtool.c | 39 +++
>  7 files changed, 188 insertions(+), 13 deletions(-)
> 
> -- 
> 1.8.3.1
> 


Re: [PATCH net] genetlink: fix genlmsg_nlhdr()

2017-11-15 Thread Michal Kubecek
On Wed, Nov 15, 2017 at 01:20:10PM +0100, Johannes Berg wrote:
> On Wed, 2017-11-15 at 13:09 +0100, Michal Kubecek wrote:
> > According to the description, first argument of genlmsg_nlhdr() points to
> > what genlmsg_put() returns, i.e. beginning of user header. Therefore we
> > should only subtract size of genetlink header and netlink message header,
> > not user header.
> > 
> > This also means we don't need to pass the pointer to genetlink family and
> > the same is true for genl_dump_check_consistent() which is the only caller
> > of genlmsg_nlhdr(). (Note that at the moment, these functions are only
> > used for families which do not have user header so that they are not
> > affected.)
> > 
> > Fixes: 670dc2833d14 ("netlink: advertise incomplete dumps")
> > Signed-off-by: Michal Kubecek <mkube...@suse.cz>
> 
> Looks sensible, though I don't think it's really needed in net since it
> really has no effect - as you note, family->hdrsize is 0 for all the
> families calling this right now.
> 
> Reviewed-by: Johannes Berg <johan...@sipsolutions.net>

Well, I'm currently working on one with hdrsize > 0 which is how I found
this. :-) But it will still need some time before it can get even into
net-next.

Michal Kubecek



[PATCH net] genetlink: fix genlmsg_nlhdr()

2017-11-15 Thread Michal Kubecek
According to the description, first argument of genlmsg_nlhdr() points to
what genlmsg_put() returns, i.e. beginning of user header. Therefore we
should only subtract size of genetlink header and netlink message header,
not user header.

This also means we don't need to pass the pointer to genetlink family and
the same is true for genl_dump_check_consistent() which is the only caller
of genlmsg_nlhdr(). (Note that at the moment, these functions are only
used for families which do not have user header so that they are not
affected.)

Fixes: 670dc2833d14 ("netlink: advertise incomplete dumps")
Signed-off-by: Michal Kubecek <mkube...@suse.cz>
---
 drivers/net/macsec.c  |  2 +-
 drivers/net/wireless/mac80211_hwsim.c |  2 +-
 include/net/genetlink.h   | 11 +++
 net/nfc/netlink.c |  6 +++---
 net/wireless/nl80211.c|  4 ++--
 5 files changed, 10 insertions(+), 15 deletions(-)

diff --git a/drivers/net/macsec.c b/drivers/net/macsec.c
index 5ab1b8849c30..da3975c5636d 100644
--- a/drivers/net/macsec.c
+++ b/drivers/net/macsec.c
@@ -2410,7 +2410,7 @@ static int dump_secy(struct macsec_secy *secy, struct 
net_device *dev,
if (!hdr)
return -EMSGSIZE;
 
-   genl_dump_check_consistent(cb, hdr, _fam);
+   genl_dump_check_consistent(cb, hdr);
 
if (nla_put_u32(skb, MACSEC_ATTR_IFINDEX, dev->ifindex))
goto nla_put_failure;
diff --git a/drivers/net/wireless/mac80211_hwsim.c 
b/drivers/net/wireless/mac80211_hwsim.c
index 6467ffac9811..3244d957454b 100644
--- a/drivers/net/wireless/mac80211_hwsim.c
+++ b/drivers/net/wireless/mac80211_hwsim.c
@@ -2805,7 +2805,7 @@ static int mac80211_hwsim_get_radio(struct sk_buff *skb,
return -EMSGSIZE;
 
if (cb)
-   genl_dump_check_consistent(cb, hdr, _genl_family);
+   genl_dump_check_consistent(cb, hdr);
 
if (data->alpha2[0] && data->alpha2[1])
param.reg_alpha2 = data->alpha2;
diff --git a/include/net/genetlink.h b/include/net/genetlink.h
index 5ac169a735f4..decf6012a401 100644
--- a/include/net/genetlink.h
+++ b/include/net/genetlink.h
@@ -154,15 +154,12 @@ void *genlmsg_put(struct sk_buff *skb, u32 portid, u32 
seq,
 /**
  * genlmsg_nlhdr - Obtain netlink header from user specified header
  * @user_hdr: user header as returned from genlmsg_put()
- * @family: generic netlink family
  *
  * Returns pointer to netlink header.
  */
-static inline struct nlmsghdr *
-genlmsg_nlhdr(void *user_hdr, const struct genl_family *family)
+static inline struct nlmsghdr *genlmsg_nlhdr(void *user_hdr)
 {
return (struct nlmsghdr *)((char *)user_hdr -
-  family->hdrsize -
   GENL_HDRLEN -
   NLMSG_HDRLEN);
 }
@@ -190,16 +187,14 @@ static inline int genlmsg_parse(const struct nlmsghdr 
*nlh,
  * genl_dump_check_consistent - check if sequence is consistent and advertise 
if not
  * @cb: netlink callback structure that stores the sequence number
  * @user_hdr: user header as returned from genlmsg_put()
- * @family: generic netlink family
  *
  * Cf. nl_dump_check_consistent(), this just provides a wrapper to make it
  * simpler to use with generic netlink.
  */
 static inline void genl_dump_check_consistent(struct netlink_callback *cb,
- void *user_hdr,
- const struct genl_family *family)
+ void *user_hdr)
 {
-   nl_dump_check_consistent(cb, genlmsg_nlhdr(user_hdr, family));
+   nl_dump_check_consistent(cb, genlmsg_nlhdr(user_hdr));
 }
 
 /**
diff --git a/net/nfc/netlink.c b/net/nfc/netlink.c
index b251fb936a27..371352f4dd60 100644
--- a/net/nfc/netlink.c
+++ b/net/nfc/netlink.c
@@ -75,7 +75,7 @@ static int nfc_genl_send_target(struct sk_buff *msg, struct 
nfc_target *target,
if (!hdr)
return -EMSGSIZE;
 
-   genl_dump_check_consistent(cb, hdr, _genl_family);
+   genl_dump_check_consistent(cb, hdr);
 
if (nla_put_u32(msg, NFC_ATTR_TARGET_INDEX, target->idx) ||
nla_put_u32(msg, NFC_ATTR_PROTOCOLS, target->supported_protocols) ||
@@ -603,7 +603,7 @@ static int nfc_genl_send_device(struct sk_buff *msg, struct 
nfc_dev *dev,
return -EMSGSIZE;
 
if (cb)
-   genl_dump_check_consistent(cb, hdr, _genl_family);
+   genl_dump_check_consistent(cb, hdr);
 
if (nfc_genl_setup_device_added(dev, msg))
goto nla_put_failure;
@@ -1332,7 +1332,7 @@ static int nfc_genl_send_se(struct sk_buff *msg, struct 
nfc_dev *dev,
goto nla_put_failure;
 
if (cb)
-   genl_dump_check_consistent(cb, hdr, _genl_family);
+   genl_dump_check_consistent(cb

regression: UFO removal breaks kvm live migration

2017-11-07 Thread Michal Kubecek
Hello,

I just received this bug report:

  https://bugzilla.suse.com/show_bug.cgi?id=1066757

The reporter runs a live migration of a kvm guest from a host with
kernel supporting UFO (openSUSE 42.2 or 42.3, based on 4.4) to a host
with kernel with UFO support removed (SLE15 or openSUSE 15.0 pre-release
which is based on 4.12 but has the UFO removal series backported).

The migration fails with

  kvm: virtio-net: saved image requires TUN_F_UFO support

because the guest image has a virtio_net device with UFO enabled which
requires TUN_F_UFO on the corresponding host tun device but that is no
longer available on the target host.

This kind of problem already happened once:

  https://www.spinics.net/lists/netdev/msg443821.html

At that time, commit 3d0ad09412ff ("drivers/net: Disable UFO through
virtio") was reverted once the issue it worked around was resolved in
a different way.

I didn't have time to think it through yet but perhaps we could allow
setting TUN_F_UFO and ignore its value.

This is not time critical for SLE15 / openSUSE 15.0 which are still at
early beta stage but 4.14 final is close and once it's out, more users
are going to hit this.

Michal Kubecek



[PATCH iproute2] ip maddr: fix filtering by device

2017-10-19 Thread Michal Kubecek
Commit 530903dd9003 ("ip: fix igmp parsing when iface is long") uses
variable len to keep trailing colon from interface name comparison.  This
variable is local to loop body but we set it in one pass and use it in
following one(s) so that we are actually using (pseudo)random length for
comparison. This became apparent since commit b48a1161f5f9 ("ipmaddr: Avoid
accessing uninitialized data") always initializes len to zero so that the
name comparison is always true. As a result, "ip maddr show dev eth0" shows
IPv4 multicast addresses for all interfaces.

Instead of keeping the length, let's simply replace the trailing colon with
a null byte. The bonus is that we get correct interface name in ma.name.

Fixes: 530903dd9003 ("ip: fix igmp parsing when iface is long")
Signed-off-by: Michal Kubecek <mkube...@suse.cz>
---
 ip/ipmaddr.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/ip/ipmaddr.c b/ip/ipmaddr.c
index 5683f6fa830c..46b86a3a7723 100644
--- a/ip/ipmaddr.c
+++ b/ip/ipmaddr.c
@@ -136,17 +136,18 @@ static void read_igmp(struct ma_info **result_p)
 
while (fgets(buf, sizeof(buf), fp)) {
struct ma_info *ma;
-   size_t len = 0;
 
if (buf[0] != '\t') {
+   size_t len;
+
sscanf(buf, "%d%s", , m.name);
len = strlen(m.name);
if (m.name[len - 1] == ':')
-   len--;
+   m.name[len - 1] = '\0';
continue;
}
 
-   if (filter.dev && strncmp(filter.dev, m.name, len))
+   if (filter.dev && strcmp(filter.dev, m.name))
continue;
 
sscanf(buf, "%08x%d", (__u32 *), );
-- 
2.14.2



Re: [RFC 0/3] Adding config get/set to devlink

2017-10-12 Thread Michal Kubecek
On Thu, Oct 12, 2017 at 12:20:07PM -0700, Florian Fainelli wrote:
> On 10/12/2017 12:06 PM, David Miller wrote:
> > 
> > One suggestion is that devlink is used for getting ethtool stats for
> > objects lacking netdev representor's, and a new genetlink family is
> > used for netdev based ethtool.
> 
> Right, I was also thinking along those lines that we we would have a new
> generic netlink family for ethtool to support ethtool over netlink.

This is what I plan to work on on next SUSE Hackweek in November. But
I'm, of course, open to suggestions and I don't insist on this approach.

Michal Kubecek



Re: [PATCHv4 iproute2 2/2] lib/libnetlink: update rtnl_talk to support malloc buff at run time

2017-10-10 Thread Michal Kubecek
On Mon, Oct 09, 2017 at 10:25:25PM +0200, Phil Sutter wrote:
> Hi Stephen,
> 
> On Mon, Oct 02, 2017 at 10:37:08AM -0700, Stephen Hemminger wrote:
> > On Thu, 28 Sep 2017 21:33:46 +0800
> > Hangbin Liu <ha...@redhat.com> wrote:
> > 
> > > From: Hangbin Liu <liuhang...@gmail.com>
> > > 
> > > This is an update for 460c03f3f3cc ("iplink: double the buffer size also 
> > > in
> > > iplink_get()"). After update, we will not need to double the buffer size
> > > every time when VFs number increased.
> > > 
> > > With call like rtnl_talk(, , NULL, 0), we can simply remove the
> > > length parameter.
> > > 
> > > With call like rtnl_talk(, nlh, nlh, sizeof(req), I add a new variable
> > > answer to avoid overwrite data in nlh, because it may has more info after
> > > nlh. also this will avoid nlh buffer not enough issue.
> > > 
> > > We need to free answer after using.
> > > 
> > > Signed-off-by: Hangbin Liu <liuhang...@gmail.com>
> > > Signed-off-by: Phil Sutter <p...@nwl.cc>
> > > ---
> > 
> > Most of the uses of rtnl_talk() don't need to this peek and dynamic sizing.
> > Can only those places that need that be targeted?
> 
> We could probably do that, by having a buffer on stack in __rtnl_talk()
> which will be used instead of the allocated one if 'answer' is NULL. Or
> maybe even introduce a dedicated API call for the dynamically allocated
> receive buffer. But I really doubt that's feasible: AFAICT, that stack
> buffer still needs to be reasonably sized since the reply might be
> larger than the request (reusing the request buffer would be the most
> simple way to tackle this), also there is support for extack which may
> bloat the response to arbitrary size. Hangbin has shown in his benchmark
> that the overhead of the second syscall is negligible, so why care about
> that and increase code complexity even further?
> 
> Not saying it's not possible, but I just doubt it's worth the effort.

Agreed. Current code is based on the assumption that we can estimate the
maximum reply length in advance and the reason for this series is that
this assumption turned out to be wrong. I'm afraid that if we replace
it by an assumption that we can estimate the maximum reply length for
most requests with only few exceptions, it's only matter of time for us
to be proven wrong again.

Michal Kubecek



Re: [PATCHv4 iproute2 1/2] lib/libnetlink: re malloc buff if size is not enough

2017-09-29 Thread Michal Kubecek
On Fri, Sep 29, 2017 at 10:54:40AM -0700, Stephen Hemminger wrote:
> On Thu, 28 Sep 2017 21:33:45 +0800
> Hangbin Liu <ha...@redhat.com> wrote:
> 
> >  
> > +static int __rtnl_recvmsg(int fd, struct msghdr *msg, int flags)
> > +{
> > +   int len;
> > +
> > +   do {
> > +   len = recvmsg(fd, msg, flags);
> > +   } while (len < 0 && (errno == EINTR || errno == EAGAIN));
> > +
> > +   if (len < 0) {
> > +   fprintf(stderr, "netlink receive error %s (%d)\n",
> > +   strerror(errno), errno);
> > +   return -errno;
> > +   }
> > +
> > +   if (len == 0) {
> > +   fprintf(stderr, "EOF on netlink\n");
> > +   return -ENODATA;
> > +   }
> > +
> > +   return len;
> > +}
> > +
> > +static int rtnl_recvmsg(int fd, struct msghdr *msg, char **answer)
> > +{
> > +   struct iovec *iov = msg->msg_iov;
> > +   char *buf;
> > +   int len;
> > +
> > +   iov->iov_base = NULL;
> > +   iov->iov_len = 0;
> > +
> > +   len = __rtnl_recvmsg(fd, msg, MSG_PEEK | MSG_TRUNC);
> > +   if (len < 0)
> > +   return len;
> > +
> > +   buf = malloc(len);
> > +   if (!buf) {
> > +   fprintf(stderr, "malloc error: not enough buffer\n");
> > +   return -ENOMEM;
> > +   }
> > +
> > +   iov->iov_base = buf;
> > +   iov->iov_len = len;
> > +
> > +   len = __rtnl_recvmsg(fd, msg, 0);
> > +   if (len < 0) {
> > +   free(buf);
> > +   return len;
> > +   }
> > +
> > +   if (answer)
> > +   *answer = buf;
> > +   else
> > +   free(buf);
> > +
> > +   return len;
> > +}
> 
> Doubling the number of system calls per message is not going to make
> users with 5,000,000 routes or 1000 vlans, or 10,000 tunnels happy.
> Please rethink this.

I'm not sure it's possible to avoid this if we want to be able to get
rid of a preset message length limit. If you call recvmsg() without
MSG_PEEK and your buffer isn't sufficiently large, the message is lost.
And once you use MSG_PEEK, you need another syscall to remove the
message from the queue even if you read all data. In other words, to be
sure you don't lose the reply, you have to do two syscalls.

One alternative I can see would be calling recvmsg() without MSG_PEEK
(but with reasonably large buffer) and repeating the request if the
buffer is not large enough (and caller is actually interested in the
answer). But I don't think this is desirable either as that would result
in even worse overhead.

Michal Kubecek



Re: [PATCHv4 iproute2 2/2] lib/libnetlink: update rtnl_talk to support malloc buff at run time

2017-09-29 Thread Michal Kubecek
On Thu, Sep 28, 2017 at 09:33:46PM +0800, Hangbin Liu wrote:
> From: Hangbin Liu <liuhang...@gmail.com>
> 
> This is an update for 460c03f3f3cc ("iplink: double the buffer size also in
> iplink_get()"). After update, we will not need to double the buffer size
> every time when VFs number increased.
> 
> With call like rtnl_talk(, , NULL, 0), we can simply remove the
> length parameter.
> 
> With call like rtnl_talk(, nlh, nlh, sizeof(req), I add a new variable
> answer to avoid overwrite data in nlh, because it may has more info after
> nlh. also this will avoid nlh buffer not enough issue.
> 
> We need to free answer after using.
> 
> Signed-off-by: Hangbin Liu <liuhang...@gmail.com>
> Signed-off-by: Phil Sutter <p...@nwl.cc>
> ---

Reviewed-by: Michal Kubecek <mkube...@suse.cz>

> diff --git a/ip/link_gre.c b/ip/link_gre.c
> index 9ea2970..35782ca 100644
> --- a/ip/link_gre.c
> +++ b/ip/link_gre.c
> @@ -68,7 +68,6 @@ static int gre_parse_opt(struct link_util *lu, int argc, 
> char **argv,
>   struct {
>   struct nlmsghdr n;
>   struct ifinfomsg i;
> - char buf[16384];
>   } req = {
>   .n.nlmsg_len = NLMSG_LENGTH(sizeof(*ifi)),
>   .n.nlmsg_flags = NLM_F_REQUEST,
> @@ -76,6 +75,7 @@ static int gre_parse_opt(struct link_util *lu, int argc, 
> char **argv,
>   .i.ifi_family = preferred_family,
>   .i.ifi_index = ifi->ifi_index,
>   };
> + struct nlmsghdr *answer = NULL;
>   struct rtattr *tb[IFLA_MAX + 1];
>   struct rtattr *linkinfo[IFLA_INFO_MAX+1];
>   struct rtattr *greinfo[IFLA_GRE_MAX + 1];
> @@ -100,19 +100,20 @@ static int gre_parse_opt(struct link_util *lu, int 
> argc, char **argv,
>   __u32 erspan_idx = 0;
>  
>   if (!(n->nlmsg_flags & NLM_F_CREATE)) {
> - if (rtnl_talk(, , , sizeof(req)) < 0) {
> + if (rtnl_talk(, , ) < 0) {
>  get_failed:
>   fprintf(stderr,
>   "Failed to get existing tunnel info.\n");
> + free(answer);
>       return -1;
>   }

Took me a moment to realize answer is still NULL if we get here via
failed rtnl_talk() but non-NULL if we get here via "goto get_failed"
later. Nice trick. :-)

Michal Kubecek



Re: [PATCHv4 iproute2 1/2] lib/libnetlink: re malloc buff if size is not enough

2017-09-29 Thread Michal Kubecek
On Thu, Sep 28, 2017 at 09:33:45PM +0800, Hangbin Liu wrote:
> From: Hangbin Liu <liuhang...@gmail.com>
> 
> With commit 72b365e8e0fd ("libnetlink: Double the dump buffer size")
> we doubled the buffer size to support more VFs. But the VFs number is
> increasing all the time. Some customers even use more than 200 VFs now.
> 
> We could not double it everytime when the buffer is not enough. Let's just
> not hard code the buffer size and malloc the correct number when running.
> 
> Introduce function rtnl_recvmsg() to always return a newly allocated buffer.
> The caller need to free it after using.
> 
> Signed-off-by: Hangbin Liu <liuhang...@gmail.com>
> Signed-off-by: Phil Sutter <p...@nwl.cc>
> ---
>  lib/libnetlink.c | 114 
> ++-----
>  1 file changed, 80 insertions(+), 34 deletions(-)
> 

Reviewed-by: Michal Kubecek <mkube...@suse.cz>


[PATCH iproute2] ip xfrm: use correct key length for netlink message

2017-09-29 Thread Michal Kubecek
When SA is added manually using "ip xfrm state add", xfrm_state_modify()
uses alg_key_len field of struct xfrm_algo for the length of key passed to
kernel in the netlink message. However alg_key_len is bit length of the key
while we need byte length here. This is usually harmless as kernel ignores
the excess data but when the bit length of the key exceeds 512
(XFRM_ALGO_KEY_BUF_SIZE), it can result in buffer overflow.

We can simply divide by 8 here as the only place setting alg_key_len is in
xfrm_algo_parse() where it is always set to a multiple of 8 (and there are
already multiple places using "algo->alg_key_len / 8").

Signed-off-by: Michal Kubecek <mkube...@suse.cz>
---
 ip/xfrm_state.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ip/xfrm_state.c b/ip/xfrm_state.c
index 4483fb8f71d2..99fdec2325ec 100644
--- a/ip/xfrm_state.c
+++ b/ip/xfrm_state.c
@@ -539,7 +539,7 @@ static int xfrm_state_modify(int cmd, unsigned int flags, 
int argc, char **argv)
 
xfrm_algo_parse((void *), type, name, key,
buf, sizeof(alg.buf));
-   len += alg.u.alg.alg_key_len;
+   len += alg.u.alg.alg_key_len / 8;
 
addattr_l(, sizeof(req.buf), type,
  (void *), len);
-- 
2.14.2



Re: [PATCHv3 iproute2 1/2] lib/libnetlink: re malloc buff if size is not enough

2017-09-21 Thread Michal Kubecek
On Thu, Sep 21, 2017 at 03:20:02PM +0800, Hangbin Liu wrote:
> 
> Or maybe we can set buf_len very small first. Then it will force to realloc at
> the second time. And the code would like
> 
>   int buf_len = 16;
>   bufp = realloc(buf, buf_len);
>   /* check bufp and set msg */
> 
>   len = recvmsg(fd, msg, flag);
>   /* check len */
> 
>   buf_len = len;
>   bufp = realloc(buf, buf_len);
>   /* check bufp and set msg */
> 
>   len = recvmsg(fd, msg, flag);
>   /* check len */
> 
> What do you think?

I will have to check but IIRC it might be possible to use zero length
for the peek to only check the length which could help you to avoid both
the reallocation and copying the same data from kernel to userspace
twice.

Michal Kubecek



Re: [PATCHv2 iproute2 1/2] lib/libnetlink: re malloc buff if size is not enough

2017-09-19 Thread Michal Kubecek
On Tue, Sep 19, 2017 at 11:05:20AM +0800, Hangbin Liu wrote:
> On Mon, Sep 18, 2017 at 09:55:05AM +0200, Michal Kubecek wrote:
> > > @@ -471,19 +516,23 @@ int rtnl_dump_filter_l(struct rtnl_handle *rth,
> > >  
> > >   if (h->nlmsg_type == NLMSG_ERROR) {
> > >   rtnl_dump_error(rth, h);
> > > + free(buf);
> > >   return -1;
> > >   }
> > >  
> > >   if (!rth->dump_fp) {
> > >   err = a->filter(, h, a->arg1);
> > > - if (err < 0)
> > > + if (err < 0) {
> > > + free(buf);
> > >   return err;
> > > + }
> > >   }
> > >  
> > >  skip_it:
> > >   h = NLMSG_NEXT(h, msglen);
> > >   }
> > >   }
> > > + free(buf);
> > 
> > We only free the last buffer returned by rtnl_recvmsg() this way.
> > IMHO this free(buf) should be moved inside the loop.
> 
> Do you mean the outside while loop or the for loop? I think we could
> not put it inside the for loop, because we may need the buf multi
> times based on arg.

Sorry for the confusion, you are right, this part is correct. I misread
the indentation.

Michal Kubecek



Re: [PATCHv2 iproute2 1/2] lib/libnetlink: re malloc buff if size is not enough

2017-09-18 Thread Michal Kubecek
On Wed, Sep 13, 2017 at 05:59:39PM +0800, Hangbin Liu wrote:
> With commit 72b365e8e0fd ("libnetlink: Double the dump buffer size")
> we doubled the buffer size to support more VFs. But the VFs number is
> increasing all the time. Some customers even use more than 200 VFs now.
> 
> We could not double it everytime when the buffer is not enough. Let's just
> not hard code the buffer size and malloc the correct number when running.
> 
> Introduce function rtnl_recvmsg() to always return a newly allocated buffer.
> The caller need to free it after using.
> 
> Signed-off-by: Hangbin Liu 
> Signed-off-by: Phil Sutter 
> ---
>  lib/libnetlink.c | 112 
> ++-
>  1 file changed, 78 insertions(+), 34 deletions(-)
> 
> diff --git a/lib/libnetlink.c b/lib/libnetlink.c
> index be7ac86..e3fa7cf 100644
> --- a/lib/libnetlink.c
> +++ b/lib/libnetlink.c
> @@ -402,6 +402,62 @@ static void rtnl_dump_error(const struct rtnl_handle 
> *rth,
>   }
>  }
>  
> +static int rtnl_recvmsg(int fd, struct msghdr *msg, char **answer)
> +{
> + struct iovec *iov;
> + int len = -1, buf_len = 32768;
> + char *bufp, *buf = NULL;
> +
> + int flag = MSG_PEEK | MSG_TRUNC;
> +
> +realloc:
> + bufp = realloc(buf, buf_len);
> +
> + if (bufp == NULL) {
> + fprintf(stderr, "malloc error: not enough buffer\n");
> + free(buf);
> + return -ENOMEM;
> + }
> + buf = bufp;
> + iov = msg->msg_iov;
> + iov->iov_base = buf;
> + iov->iov_len = buf_len;
> +
> +recv:
> + len = recvmsg(fd, msg, flag);
> +
> + if (len < 0) {
> + if (errno == EINTR || errno == EAGAIN)
> + goto recv;
> + fprintf(stderr, "netlink receive error %s (%d)\n",
> + strerror(errno), errno);

free(buf);

> + return len;

Maybe we should return -errno (saved before calling fprintf()) to be
consistent.

> + }
> +
> + if (len == 0) {
> + fprintf(stderr, "EOF on netlink\n");

free(buf);

> + return -ENODATA;
> + }
> +
> + if (len > buf_len) {
> + buf_len = len;
> + flag = 0;
> + goto realloc;
> + }
> +
> + if (flag != 0) {
> + flag = 0;
> + goto recv;
> + }

This means that even if the default buffer size is sufficient (which
should be most of the time) we make the kernel copy the message to
userspace again. Perhaps we could just call recvmsg() with zero length
to discard the message from the queue in this case. But it's not really
a big problem, I guess.

> +
> + if (answer)
> + *answer = buf;
> + else
> + free(buf);
> +
> + return len;
> +}
> +
>  int rtnl_dump_filter_l(struct rtnl_handle *rth,
>  const struct rtnl_dump_filter_arg *arg)
>  {
> @@ -413,31 +469,18 @@ int rtnl_dump_filter_l(struct rtnl_handle *rth,
>   .msg_iov = ,
>   .msg_iovlen = 1,
>   };
> - char buf[32768];
> + char *buf;
>   int dump_intr = 0;
>  
> - iov.iov_base = buf;
>   while (1) {
>   int status;
>   const struct rtnl_dump_filter_arg *a;
>   int found_done = 0;
>   int msglen = 0;
>  
> - iov.iov_len = sizeof(buf);
> - status = recvmsg(rth->fd, , 0);
> -
> - if (status < 0) {
> - if (errno == EINTR || errno == EAGAIN)
> - continue;
> - fprintf(stderr, "netlink receive error %s (%d)\n",
> - strerror(errno), errno);
> - return -1;
> - }
> -
> - if (status == 0) {
> - fprintf(stderr, "EOF on netlink\n");
> - return -1;
> - }
> + status = rtnl_recvmsg(rth->fd, , );
> + if (status < 0)
> + return status;
>  
>   if (rth->dump_fp)
>   fwrite(buf, 1, NLMSG_ALIGN(status), rth->dump_fp);
> @@ -462,8 +505,10 @@ int rtnl_dump_filter_l(struct rtnl_handle *rth,
>  
>   if (h->nlmsg_type == NLMSG_DONE) {
>   err = rtnl_dump_done(h);
> - if (err < 0)
> + if (err < 0) {
> + free(buf);
>   return -1;
> + }
>  
>   found_done = 1;
>   break; /* process next filter */
> @@ -471,19 +516,23 @@ int rtnl_dump_filter_l(struct rtnl_handle *rth,
>  
>   if (h->nlmsg_type == NLMSG_ERROR) {
>   rtnl_dump_error(rth, h);
> +   

Re: Can libpcap filter on vlan tags when vlans are hardware-accelerated?

2017-09-12 Thread Michal Kubecek
On Tue, Sep 12, 2017 at 11:54:43AM -0700, Ben Greear wrote:
> It does not appear to work on Fedora-26, and I'm curious if someone
> knows what needs doing to get this support working?

It's rather complicated. The "vlan" and "vlan " filters didn't
handle the case when vlan information is passed in metadata until commit
04660eb1e561 ("Use BPF extensions in compiled filters"), i.e. libpcap
1.7.0. Unfortunately that commit made libpcap always check only metadata
for the first outermost vlan tag so that it broke the case when vlan
information is passed in packet itself (which is less frequent today).

To handle both cases correctly, you would need libpcap with commits
d739b068ac29 ("Make VLAN filter handle both metadata and inline tags")
and 7c7a19fbd9af ("Fix logic of combined VLAN test") and also the
optimizer fix from

  https://github.com/the-tcpdump-group/libpcap/pull/582/commits/075015a3d17a

(without it the filters generate incorrect BPF in some cases unless the
optimizer is disabled). As far as I can see, these commits are not in
any release yet.

       Michal Kubecek



Re: [PATCH iproute2 1/2] lib/libnetlink: re malloc buff if size is not enough

2017-09-12 Thread Michal Kubecek
On Tue, Sep 12, 2017 at 10:47:48AM +0200, Michal Kubecek wrote:
> On Mon, Sep 11, 2017 at 03:19:55PM +0800, Hangbin Liu wrote:
> > On Fri, Sep 08, 2017 at 04:51:13PM +0200, Phil Sutter wrote:
> > > Regarding Michal's concern about reentrancy, maybe we should go into a
> > > different direction and make rtnl_recvmsg() return a newly allocated
> > > buffer which the caller has to free.
> > 
> > Hmm... But we could not free the buf in __rtnl_talk(). Because in
> > __rtnl_talk() we assign the answer with the buf address and return to 
> > caller.
> > 
> > for (h = (struct nlmsghdr *)buf; status >= sizeof(*h); ) {
> > [...]
> > if (answer) {
> > *answer= h;
> > return 0;
> > }
> > }
> > 
> > And the caller will keep use it in later code. Since there are plenty of
> > functions called rtnl_talk. I think it would be much more complex to free
> > the buffer every time.
> > 
> > 
> > Hi Michal,
> > 
> > Would you like to tell me more about your concern with reentrancy? It's 
> > looks
> > arpd doesn't call rtnl_talk() or rtnl_dump_filter_l().
> 
> I checked again and arpd indeed isn't a problem. It doesn't seem to call
> any of the two functions (directly or indirectly) and while it's linked
> with "-lpthread", it's not really multithreaded.
> 
> But my concern was rather about other potential users of libnetlink
> (i.e. those which are not part of iproute2). I must admit, though, that
> I'm not sure if libnetlink code is reentrant as of now. (And people are
> discouraged from using it in its own manual page.)
> 
> That being said, I still like Phil's idea for a different reason. While
> investigating the issue with "ip link show dev eth ..." which led me to
> commit 6599162b958e ("iplink: check for message truncation in
> iplink_get()"), I quickly peeked at some other callers of rtnl_talk()
> and I'm afraid there may be others which wouldn't handle truncated
> message correctly. I assume the maxlen argument was always chosen to be
> sufficient for any expected messages but as the example of iplink_get()
> shows, messages returned by kernel my grow over time.
> 
> That's why I like the idea of __rtnl_talk() returning a pointer to newly
> allocated buffer (of sufficient size) rather than copying the response
> into a buffer provided by caller and potentially truncating it.

I'm sorry, I managed to forget that your patch 2 does already address
this problem. But the fact that any caller must keep in mind that he
must not call the same function again until the previous response is no
longer needed still feels like a trap. It's something you need to keep
in mind (where "you" in fact means any future contributor) and it's
easy to forget. That's why I prefer the reentrant functions like
strerror_r() or localtime_r() even in code which is not intended to be
multithreaded. Getting an object which is "mine" to do with whatever
I want until I no longer need it feels like a cleaner interface to me.

Michal Kubecek



  1   2   3   >