date:20181130

Re: [PATCH net-next 1/2] ixgbe: register a mdiobus

2018-11-30 Thread Steve Douthit

On 11/30/18 11:34 AM, Andrew Lunn wrote:
>> Yep, registering multiple interfaces is wrong.  The first board I tested
>> against only had a single MAC enabled (they can be disabled/hidden via
>> straps) so it just happened to work.
> 
> Hi Steve
> 
> Can you hide any/all via straps, or is 00.0 always guaranteed to
> exist?

You can hide all the devices, but if function 1 is enabled then function
0 must also be enabled, so not all combinations are valid.

>> The Intel C3xxx family of SoCs have up to four ixgbe MACs.  These are
>> structured as two devices of two functions each on fixed internal root
>> ports.
>>
>> from lspci:
>> 
>>  +-16.0-[05]--+-00.0
>>  |\-00.1
>>  +-17.0-[06]--+-00.0
>>  |\-00.1
>> 
> 
> Is there any other hardware resource which is shared between the MAC
> interfaces? I'm just wondering if the driver has already solved this
> once. Is there an EEPROM per interface for the MAC address, or one
> shared EEPROM?

NVM is shared, it's actually the same SPI flash that holds the BIOS for
this SoC.  Access is serialized by the swfw_sync semaphore.  I think the
device firmware is automagically handling offset translation.

I don't think that helps for this case.

There might be a better match for shared resources, but nothing springs
to mind.

> Ah, how about using the 'cards_found' found variable. It is not
> perfect, in that it is not decremented in ixgb_remove(), and i wonder
> about race conditions since there does not appear to be any lock when
> it is incremented. But if cards_found == 0, register the MDIO bus.

'cards_found' doesn't exist for the ixgbe driver.  I could add it and
fix the race/decrement issues you mention, but it'd have to be a device
type specific count.  It's still possible there are other non-x550em_a
ixgbe devices in external PCIe slots that have different resource
sharing setups.

It's still a lighter weight solution than poking around the parent bus
so I'll add a 'x550em_a_devs_found' counter to v2.

[PATCH] net: tcp: add correct check for tcp_retransmit_skb()

2018-11-30 Thread Sharath Chandra Vurukala

when the tcp_retranmission_timer expires and tcp_retranmsit_skb is
called if the retranmsission fails due to local congestion,
backoff should not incremented.

tcp_retransmit_skb() returns non-zero negative value in some cases of
failure but the caller tcp_retransmission_timer() has a check for
failure which checks if the return value is greater than zero.
The check is corrected to check for non-zero value.

Signed-off-by: Sharath Chandra Vurukala 
---
 net/ipv4/tcp_timer.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 091c5392..c19f371 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -511,7 +511,7 @@ void tcp_retransmit_timer(struct sock *sk)
 
tcp_enter_loss(sk);
 
-   if (tcp_retransmit_skb(sk, tcp_rtx_queue_head(sk), 1) > 0) {
+   if (tcp_retransmit_skb(sk, tcp_rtx_queue_head(sk), 1) != 0) {
/* Retransmission failed because of local congestion,
 * do not backoff.
 */
-- 
1.9.1

Re: [PATCH net] rtnetlink: Refine sanity checks in rtnl_fdb_{add|del}

2018-11-30 Thread Eric Dumazet

On Fri, Nov 30, 2018 at 8:10 AM Dmitry Vyukov  wrote:
>
> On Fri, Nov 30, 2018 at 4:02 PM, Ido Schimmel  wrote:
> > On Fri, Nov 30, 2018 at 08:59:09AM -0700, David Ahern wrote:
> >> This does not repro for me:
> >> # ./a.out
> >> Invalid address length 6 - must be 4 bytes
> >> RTNETLINK answers: No buffer space available
> >> RTNETLINK answers: Operation not supported
> >> Invalid address length 6 - must be 4 bytes
> >> Invalid address length 6 - must be 4 bytes
> >> Invalid address length 6 - must be 4 bytes
> >> Invalid address length 6 - must be 16 bytes
> >> Invalid address length 6 - must be 16 bytes
> >> Invalid address length 6 - must be 16 bytes
> >>
> >> config available>?
> >
> > You need a kernel with kmsan. See:
> > https://github.com/google/kmsan
>
>
> Another option may be to spray memory at the allocation stack with
> some distinctive byte pattern and then check this pattern at the use
> stack. Not 100% sounds, but should be enough for testing.

Well, no need for kmsan here, once you know the problem

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 
a498bb41c9aa9b8976eb8d87f71489695cb019f2..5ce53215d622a670b0dcf06113b01174d49e6efc
100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -3471,6 +3471,7 @@ static int nlmsg_populate_fdb_fill(struct sk_buff *skb,
ndm->ndm_ifindex = dev->ifindex;
ndm->ndm_state   = ndm_state;

+   WARN_ON_ONCE(dev->addr_len != ETH_ALEN);
if (nla_put(skb, NDA_LLADDR, ETH_ALEN, addr))
goto nla_put_failure;
if (vid)

Re: [PATCHv2 net] sctp: check and update stream->out_curr when allocating stream_out

2018-11-30 Thread Neil Horman

On Fri, Nov 30, 2018 at 10:48:10PM +0900, Xin Long wrote:
> On Fri, Nov 30, 2018 at 9:21 PM Neil Horman  wrote:
> >
> > On Fri, Nov 30, 2018 at 03:22:39PM +0900, Xin Long wrote:
> > > On Thu, Nov 29, 2018 at 11:39 PM Neil Horman  
> > > wrote:
> > > >
> > > > On Thu, Nov 29, 2018 at 02:42:56PM +0800, Xin Long wrote:
> > > > > Now when using stream reconfig to add out streams, stream->out
> > > > > will get re-allocated, and all old streams' information will
> > > > > be copied to the new ones and the old ones will be freed.
> > > > >
> > > > > So without stream->out_curr updated, next time when trying to
> > > > > send from stream->out_curr stream, a panic would be caused.
> > > > >
> > > > > This patch is to check and update stream->out_curr when
> > > > > allocating stream_out.
> > > > >
> > > > > v1->v2:
> > > > >   - define fa_index() to get elem index from stream->out_curr.
> > > > >
> > > > > Fixes: 5e32a431 ("sctp: introduce stream scheduler foundations")
> > > > > Reported-by: Ying Xu 
> > > > > Reported-by: syzbot+e33a3a138267ca119...@syzkaller.appspotmail.com
> > > > > Signed-off-by: Xin Long 
> > > > > ---
> > > > >  net/sctp/stream.c | 20 
> > > > >  1 file changed, 20 insertions(+)
> > > > >
> > > > > diff --git a/net/sctp/stream.c b/net/sctp/stream.c
> > > > > index 3892e76..30e7809 100644
> > > > > --- a/net/sctp/stream.c
> > > > > +++ b/net/sctp/stream.c
> > > > > @@ -84,6 +84,19 @@ static void fa_zero(struct flex_array *fa, size_t 
> > > > > index, size_t count)
> > > > >   }
> > > > >  }
> > > > >
> > > > > +static size_t fa_index(struct flex_array *fa, void *elem, size_t 
> > > > > count)
> > > > > +{
> > > > > + size_t index = 0;
> > > > > +
> > > > > + while (count--) {
> > > > > + if (elem == flex_array_get(fa, index))
> > > > > + break;
> > > > > + index++;
> > > > > + }
> > > > > +
> > > > > + return index;
> > > > > +}
> > > > > +
> > > > >  /* Migrates chunks from stream queues to new stream queues if needed,
> > > > >   * but not across associations. Also, removes those chunks to streams
> > > > >   * higher than the new max.
> > > > > @@ -147,6 +160,13 @@ static int sctp_stream_alloc_out(struct 
> > > > > sctp_stream *stream, __u16 outcnt,
> > > > >
> > > > >   if (stream->out) {
> > > > >   fa_copy(out, stream->out, 0, min(outcnt, 
> > > > > stream->outcnt));
> > > > > + if (stream->out_curr) {
> > > > > + size_t index = fa_index(stream->out, 
> > > > > stream->out_curr,
> > > > > + stream->outcnt);
> > > > > +
> > > > > + BUG_ON(index == stream->outcnt);
> > > > > + stream->out_curr = flex_array_get(out, index);
> > > > > + }
> > > > >   fa_free(stream->out);
> > > > >   }
> > > > >
> > > > > --
> > > > > 2.1.0
> > > > >
> > > > >
> > > >
> > > > This is the sort of thing I'm talking about. Its a little more code, 
> > > > but if you
> > > > augment the flex_array api like this, you can preform a resize 
> > > > operation on your
> > > > existing flex array, and you can avoid all the copying, and need to 
> > > > update
> > > > pointers maintained outside the array.  Note this code isn't tested at 
> > > > all, but
> > > > its close to what I think should work.
> > > >
> > > >
> > > > diff --git a/include/linux/flex_array.h b/include/linux/flex_array.h
> > > > index b94fa61b51fb..7fa1f27a91b5 100644
> > > > --- a/include/linux/flex_array.h
> > > > +++ b/include/linux/flex_array.h
> > > > @@ -73,6 +73,8 @@ struct flex_array {
> > > >  struct flex_array *flex_array_alloc(int element_size, unsigned int 
> > > > total,
> > > > gfp_t flags);
> > > >
> > > > +struct flex_array *flex_array_resize(struct flex_array *fa, unsigned 
> > > > int total, gfp_t flags);
> > > > +
> > > >  /**
> > > >   * flex_array_prealloc() - Ensures that memory for the elements 
> > > > indexed in the
> > > >   * range defined by start and nr_elements has been allocated.
> > > > diff --git a/lib/flex_array.c b/lib/flex_array.c
> > > > index 2eed22fa507c..f8d54af3891b 100644
> > > > --- a/lib/flex_array.c
> > > > +++ b/lib/flex_array.c
> > > > @@ -109,6 +109,7 @@ struct flex_array *flex_array_alloc(int 
> > > > element_size, unsigned int total,
> > > > ret->total_nr_elements = total;
> > > > ret->elems_per_part = elems_per_part;
> > > > ret->reciprocal_elems = reciprocal_elems;
> > > > +   ret->elements_used = 0;
> > > > if (elements_fit_in_base(ret) && !(flags & __GFP_ZERO))
> > > > memset(>parts[0], FLEX_ARRAY_FREE,
> > > > 
> > > > FLEX_ARRAY_BASE_BYTES_LEFT);
> > > > @@ -116,6 +117,53 @@ struct flex_array *flex_array_alloc(int 
> > > > element_size, unsigned int total,
> > > >  }
> > > >  EXPORT_SYMBOL(flex_array_alloc);
> > > >
> > > > +static

Re: [PATCH net] rtnetlink: Refine sanity checks in rtnl_fdb_{add|del}

2018-11-30 Thread Eric Dumazet

On Fri, Nov 30, 2018 at 7:36 AM David Ahern  wrote:
>
> On 11/30/18 7:58 AM, Ido Schimmel wrote:
> > Can you please share the reproducer (assuming it exists)? I don't really
> > understand the fix. None of the functions you patched are in the trace.
> > Also, looking at IPv4 GRE code, while GRE device has dev->addr_len set
> > to 4, dev->type is set to ARPHRD_IPGRE.
>
> I had the same reaction ... you can not claim to be an ethernet device
> and have a hw address that is not 6 bytes.

This has been discussed a number of times TUNSETLINK can do that.
(I have not checked what the repro does )

Re: [PATCH net] rtnetlink: Refine sanity checks in rtnl_fdb_{add|del}

2018-11-30 Thread Eric Dumazet

On Fri, Nov 30, 2018 at 6:58 AM Ido Schimmel  wrote:
>
> On Fri, Nov 30, 2018 at 05:35:01AM -0800, 'Eric Dumazet' via syzkaller wrote:
> > Commit da71577545a5 ("rtnetlink: Disallow FDB configuration
> > for non-Ethernet device") added a test against dev->type.
> >
> > kmsan was still able to trigger a kernel-infoleak using a gre device,
> > with a correct device type (ARPHRD_ETHER), but with a not
> > correct dev->addr_len (4 bytes instead of the expected 6 bytes)
>
> Hi,
>
> Can you please share the reproducer (assuming it exists)? I don't really
> understand the fix. None of the functions you patched are in the trace.
> Also, looking at IPv4 GRE code, while GRE device has dev->addr_len set
> to 4, dev->type is set to ARPHRD_IPGRE.
>
> Thanks

// autogenerated by syzkaller (https://github.com/google/syzkaller)

#define _GNU_SOURCE

#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

#include 
#include 
#include 
#include 
#include 

static void vsnprintf_check(char* str, size_t size, const char* format,
va_list args)
{
  int rv;
  rv = vsnprintf(str, size, format, args);
  if (rv < 0)
exit(1);
  if ((size_t)rv >= size)
exit(1);
}

#define COMMAND_MAX_LEN 128
#define PATH_PREFIX\
  "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin "
#define PATH_PREFIX_LEN (sizeof(PATH_PREFIX) - 1)

static void execute_command(bool panic, const char* format, ...)
{
  va_list args;
  char command[PATH_PREFIX_LEN + COMMAND_MAX_LEN];
  int rv;
  va_start(args, format);
  memcpy(command, PATH_PREFIX, PATH_PREFIX_LEN);
  vsnprintf_check(command + PATH_PREFIX_LEN, COMMAND_MAX_LEN, format, args);
  va_end(args);
  rv = system(command);
  if (rv) {
if (panic)
  exit(1);
  }
}

#define DEV_IPV4 "172.20.20.%d"
#define DEV_IPV6 "fe80::%02hx"
#define DEV_MAC "aa:aa:aa:aa:aa:%02hx"

static void snprintf_check(char* str, size_t size, const char* format, ...)
{
  va_list args;
  va_start(args, format);
  vsnprintf_check(str, size, format, args);
  va_end(args);
}
static void initialize_netdevices(void)
{
  unsigned i;
  const char* devtypes[] = {"ip6gretap", "bridge", "vcan", "bond", "team"};
  const char* devnames[] = {"lo",
"sit0",
"bridge0",
"vcan0",
"tunl0",
"gre0",
"gretap0",
"ip_vti0",
"ip6_vti0",
"ip6tnl0",
"ip6gre0",
"ip6gretap0",
"erspan0",
"bond0",
"veth0",
"veth1",
"team0",
"veth0_to_bridge",
"veth1_to_bridge",
"veth0_to_bond",
"veth1_to_bond",
"veth0_to_team",
"veth1_to_team"};
  const char* devmasters[] = {"bridge", "bond", "team"};
  for (i = 0; i < sizeof(devtypes) / (sizeof(devtypes[0])); i++)
execute_command(0, "ip link add dev %s0 type %s", devtypes[i], devtypes[i]);
  execute_command(0, "ip link add type veth");
  for (i = 0; i < sizeof(devmasters) / (sizeof(devmasters[0])); i++) {
execute_command(
0, "ip link add name %s_slave_0 type veth peer name veth0_to_%s",
devmasters[i], devmasters[i]);
execute_command(
0, "ip link add name %s_slave_1 type veth peer name veth1_to_%s",
devmasters[i], devmasters[i]);
execute_command(0, "ip link set %s_slave_0 master %s0", devmasters[i],
devmasters[i]);
execute_command(0, "ip link set %s_slave_1 master %s0", devmasters[i],
devmasters[i]);
execute_command(0, "ip link set veth0_to_%s up", devmasters[i]);
execute_command(0, "ip link set veth1_to_%s up", devmasters[i]);
  }
  execute_command(0, "ip link set bridge_slave_0 up");
  execute_command(0, "ip link set bridge_slave_1 up");
  for (i = 0; i < sizeof(devnames) / (sizeof(devnames[0])); i++) {
char addr[32];
snprintf_check(addr, sizeof(addr), DEV_IPV4, i + 10);
execute_command(0, "ip -4 addr add %s/24 dev %s", addr, devnames[i]);
snprintf_check(addr, sizeof(addr), DEV_IPV6, i + 10);
execute_command(0, "ip -6 addr add %s/120 dev %s", addr, devnames[i]);
snprintf_check(addr, sizeof(addr), DEV_MAC, i + 10);
execute_command(0, "ip link set dev %s address %s", devnames[i], addr);
execute_command(0, "ip link set dev %s up", devnames[i]);
  }
}

static void setup_common()
{
  if (mount(0,

Re: [PATCH net-next 3/3] vxlan: move flag sets to use a helper func vxlan_nl2conf

2018-11-30 Thread kbuild test robot

Hi Roopa,

I love your patch! Yet something to improve:

[auto build test ERROR on net-next/master]

url:
https://github.com/0day-ci/linux/commits/Roopa-Prabhu/vxlan-support-changelink-for-a-few-more-attributes/20181130-030315
config: x86_64-allmodconfig (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64 

All errors (new ones prefixed by >>):

>> drivers/net/vxlan.c:3526:22: error: Expected ; at end of statement
>> drivers/net/vxlan.c:3526:22: error: got {
>> drivers/net/vxlan.c:3534:17: error: Expected ) in function declarator
   drivers/net/vxlan.c:3534:17: error: got [
>> drivers/net/vxlan.c:3534:9: error: Trying to use reserved word 'if' as 
>> identifier
   drivers/net/vxlan.c:3536:25: error: Trying to use reserved word 'if' as 
identifier
>> drivers/net/vxlan.c:3536:25: error: Expected ; at the end of type declaration
   drivers/net/vxlan.c:3536:25: error: got }
>> drivers/net/vxlan.c:3539:25: error: Trying to use reserved word 'return' as 
>> identifier
>> drivers/net/vxlan.c:3539:32: error: Expected ; at end of declaration
   drivers/net/vxlan.c:3539:32: error: got -
   drivers/net/vxlan.c:3540:17: error: Expected ; at the end of type declaration
   drivers/net/vxlan.c:3540:17: error: got }
   drivers/net/vxlan.c:3542:9: error: Expected ; at the end of type declaration
   drivers/net/vxlan.c:3542:9: error: got }
   drivers/net/vxlan.c:3546:25: error: Trying to use reserved word 'if' as 
identifier
   drivers/net/vxlan.c:3546:25: error: Expected ; at the end of type declaration
   drivers/net/vxlan.c:3546:25: error: got }
   drivers/net/vxlan.c:3549:25: error: Trying to use reserved word 'return' as 
identifier
   drivers/net/vxlan.c:3549:32: error: Expected ; at end of declaration
   drivers/net/vxlan.c:3549:32: error: got -
   drivers/net/vxlan.c:3550:17: error: Expected ; at the end of type declaration
   drivers/net/vxlan.c:3550:17: error: got }
   drivers/net/vxlan.c:3553:9: error: Expected ; at the end of type declaration
   drivers/net/vxlan.c:3553:9: error: got }
   drivers/net/vxlan.c:3558:63: error: Expected ) in function declarator
   drivers/net/vxlan.c:3558:63: error: got ,
   drivers/net/vxlan.c:3561:56: error: Expected ) in function declarator
   drivers/net/vxlan.c:3561:56: error: got ,
   drivers/net/vxlan.c:3564:56: error: Expected ) in function declarator
   drivers/net/vxlan.c:3564:56: error: got ,
   drivers/net/vxlan.c:3567:49: error: Expected ) in function declarator
   drivers/net/vxlan.c:3567:49: error: got ,
   drivers/net/vxlan.c:3570:49: error: Expected ) in function declarator
   drivers/net/vxlan.c:3570:49: error: got ,
   drivers/net/vxlan.c:3573:63: error: Expected ) in function declarator
   drivers/net/vxlan.c:3573:63: error: got ,
   drivers/net/vxlan.c:3576:15: error: Expected ) in function declarator
   drivers/net/vxlan.c:3576:15: error: got [
   drivers/net/vxlan.c:3576:9: error: Trying to use reserved word 'if' as 
identifier
   drivers/net/vxlan.c:3578:25: error: Trying to use reserved word 'if' as 
identifier
   drivers/net/vxlan.c:3578:25: error: Expected ; at the end of type declaration
   drivers/net/vxlan.c:3578:25: error: got }
   drivers/net/vxlan.c:3581:25: error: Trying to use reserved word 'return' as 
identifier
   drivers/net/vxlan.c:3581:32: error: Expected ; at end of declaration
   drivers/net/vxlan.c:3581:32: error: got -
   drivers/net/vxlan.c:3582:17: error: Expected ; at the end of type declaration
   drivers/net/vxlan.c:3582:17: error: got }
   drivers/net/vxlan.c:3584:9: error: Expected ; at the end of type declaration
   drivers/net/vxlan.c:3584:9: error: got }
   drivers/net/vxlan.c:3589:9: error: Trying to use reserved word 'return' as 
identifier
   drivers/net/vxlan.c:3589:16: error: Expected ; at end of declaration
   drivers/net/vxlan.c:3589:16: error: got 0
   drivers/net/vxlan.c:3590:1: error: Expected ; at the end of type declaration
   drivers/net/vxlan.c:3590:1: error: got }
   drivers/net/vxlan.c:3600:9: error: Trying to use reserved word 'if' as 
identifier
   drivers/net/vxlan.c:3601:17: error: Expected ; at end of declaration
>> drivers/net/vxlan.c:3601:17: error: got return
   drivers/net/vxlan.c:3603:9: error: Trying to use reserved word 'return' as 
identifier
   drivers/net/vxlan.c:3603:16: error: Expected ; at end of declaration
>> drivers/net/vxlan.c:3603:16: error: got __vxlan_dev_create
   drivers/net/vxlan.c:3604:1: error: Expected ; at the end of type declaration
   drivers/net/vxlan.c:3604:1: error: got }
   drivers/net/vxlan.c:3620:9: error: Trying to use reserved word 'if' as 
identifier
   drivers/net/vxlan.c:3621:17: error: Expected ; at end of declaration
   drivers/net/vxlan.c:3621:17: error: got return
   drivers/net/vxlan.c:3624:9: error: Expected ) in function declarator
   drivers/net/vxlan.c:3624:9: error:

Re: [PATCH net] rtnetlink: Refine sanity checks in rtnl_fdb_{add|del}

2018-11-30 Thread David Ahern

On 11/30/18 7:58 AM, Ido Schimmel wrote:
> Can you please share the reproducer (assuming it exists)? I don't really
> understand the fix. None of the functions you patched are in the trace.
> Also, looking at IPv4 GRE code, while GRE device has dev->addr_len set
> to 4, dev->type is set to ARPHRD_IPGRE.

I had the same reaction ... you can not claim to be an ethernet device
and have a hw address that is not 6 bytes.

Re: [PATCH net] rtnetlink: Refine sanity checks in rtnl_fdb_{add|del}

2018-11-30 Thread David Ahern

On 11/30/18 8:14 AM, Eric Dumazet wrote:
> 
> // autogenerated by syzkaller (https://github.com/google/syzkaller)
> 
> #define _GNU_SOURCE
> 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> 
> #include 
> #include 
> #include 
> #include 
> #include 
> 
> static void vsnprintf_check(char* str, size_t size, const char* format,
> va_list args)
> {
>   int rv;
>   rv = vsnprintf(str, size, format, args);
>   if (rv < 0)
> exit(1);
>   if ((size_t)rv >= size)
> exit(1);
> }
> 
> #define COMMAND_MAX_LEN 128
> #define PATH_PREFIX   
>  \
>   "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin "
> #define PATH_PREFIX_LEN (sizeof(PATH_PREFIX) - 1)
> 
> static void execute_command(bool panic, const char* format, ...)
> {
>   va_list args;
>   char command[PATH_PREFIX_LEN + COMMAND_MAX_LEN];
>   int rv;
>   va_start(args, format);
>   memcpy(command, PATH_PREFIX, PATH_PREFIX_LEN);
>   vsnprintf_check(command + PATH_PREFIX_LEN, COMMAND_MAX_LEN, format, args);
>   va_end(args);
>   rv = system(command);
>   if (rv) {
> if (panic)
>   exit(1);
>   }
> }
> 
> #define DEV_IPV4 "172.20.20.%d"
> #define DEV_IPV6 "fe80::%02hx"
> #define DEV_MAC "aa:aa:aa:aa:aa:%02hx"
> 
> static void snprintf_check(char* str, size_t size, const char* format, ...)
> {
>   va_list args;
>   va_start(args, format);
>   vsnprintf_check(str, size, format, args);
>   va_end(args);
> }
> static void initialize_netdevices(void)
> {
>   unsigned i;
>   const char* devtypes[] = {"ip6gretap", "bridge", "vcan", "bond", "team"};
>   const char* devnames[] = {"lo",
> "sit0",
> "bridge0",
> "vcan0",
> "tunl0",
> "gre0",
> "gretap0",
> "ip_vti0",
> "ip6_vti0",
> "ip6tnl0",
> "ip6gre0",
> "ip6gretap0",
> "erspan0",
> "bond0",
> "veth0",
> "veth1",
> "team0",
> "veth0_to_bridge",
> "veth1_to_bridge",
> "veth0_to_bond",
> "veth1_to_bond",
> "veth0_to_team",
> "veth1_to_team"};
>   const char* devmasters[] = {"bridge", "bond", "team"};
>   for (i = 0; i < sizeof(devtypes) / (sizeof(devtypes[0])); i++)
> execute_command(0, "ip link add dev %s0 type %s", devtypes[i], 
> devtypes[i]);
>   execute_command(0, "ip link add type veth");
>   for (i = 0; i < sizeof(devmasters) / (sizeof(devmasters[0])); i++) {
> execute_command(
> 0, "ip link add name %s_slave_0 type veth peer name veth0_to_%s",
> devmasters[i], devmasters[i]);
> execute_command(
> 0, "ip link add name %s_slave_1 type veth peer name veth1_to_%s",
> devmasters[i], devmasters[i]);
> execute_command(0, "ip link set %s_slave_0 master %s0", devmasters[i],
> devmasters[i]);
> execute_command(0, "ip link set %s_slave_1 master %s0", devmasters[i],
> devmasters[i]);
> execute_command(0, "ip link set veth0_to_%s up", devmasters[i]);
> execute_command(0, "ip link set veth1_to_%s up", devmasters[i]);
>   }
>   execute_command(0, "ip link set bridge_slave_0 up");
>   execute_command(0, "ip link set bridge_slave_1 up");
>   for (i = 0; i < sizeof(devnames) / (sizeof(devnames[0])); i++) {
> char addr[32];
> snprintf_check(addr, sizeof(addr), DEV_IPV4, i + 10);
> execute_command(0, "ip -4 addr add %s/24 dev %s", addr, devnames[i]);
> snprintf_check(addr, sizeof(addr), DEV_IPV6, i + 10);
> execute_command(0, "ip -6 addr add %s/120 dev %s", addr, devnames[i]);
> snprintf_check(addr, sizeof(addr), DEV_MAC, i + 10);
> execute_command(0, "ip link set dev %s address %s", devnames[i], addr);
> execute_command(0, "ip link set dev %s up", devnames[i]);
>   }
> }
> 
> static void setup_common()
> {
>   if (mount(0, "/sys/fs/fuse/connections", "fusectl", 0, 0)) {
>   }
> }
> 
> static void loop();
> 
> static void sandbox_common()
> {
>   prctl(PR_SET_PDEATHSIG, SIGKILL, 0, 0, 0);
>   setpgrp();
>   setsid();
>   struct rlimit rlim;
>   rlim.rlim_cur = rlim.rlim_max = 200 << 20;
>   setrlimit(RLIMIT_AS, );
>   rlim.rlim_cur = rlim.rlim_max = 32 << 20;
>   setrlimit(RLIMIT_MEMLOCK, );
>   rlim.rlim_cur = rlim.rlim_max = 136 << 20;
>

Re: [PATCH net-next 1/2] ixgbe: register a mdiobus

2018-11-30 Thread Andrew Lunn

> Yep, registering multiple interfaces is wrong.  The first board I tested
> against only had a single MAC enabled (they can be disabled/hidden via
> straps) so it just happened to work.

Hi Steve

Can you hide any/all via straps, or is 00.0 always guaranteed to
exist?
 
> The Intel C3xxx family of SoCs have up to four ixgbe MACs.  These are
> structured as two devices of two functions each on fixed internal root
> ports.
> 
> from lspci:
> 
> +-16.0-[05]--+-00.0
> |\-00.1
> +-17.0-[06]--+-00.0
> |\-00.1
> 

Is there any other hardware resource which is shared between the MAC
interfaces? I'm just wondering if the driver has already solved this
once. Is there an EEPROM per interface for the MAC address, or one
shared EEPROM?

Ah, how about using the 'cards_found' found variable. It is not
perfect, in that it is not decremented in ixgb_remove(), and i wonder
about race conditions since there does not appear to be any lock when
it is incremented. But if cards_found == 0, register the MDIO bus.

   Andrew

Re: [PATCH net] rtnetlink: Refine sanity checks in rtnl_fdb_{add|del}

2018-11-30 Thread Ido Schimmel

On Fri, Nov 30, 2018 at 08:17:04AM -0800, Eric Dumazet wrote:
> On Fri, Nov 30, 2018 at 8:10 AM Dmitry Vyukov  wrote:
> >
> > On Fri, Nov 30, 2018 at 4:02 PM, Ido Schimmel  wrote:
> > > On Fri, Nov 30, 2018 at 08:59:09AM -0700, David Ahern wrote:
> > >> This does not repro for me:
> > >> # ./a.out
> > >> Invalid address length 6 - must be 4 bytes
> > >> RTNETLINK answers: No buffer space available
> > >> RTNETLINK answers: Operation not supported
> > >> Invalid address length 6 - must be 4 bytes
> > >> Invalid address length 6 - must be 4 bytes
> > >> Invalid address length 6 - must be 4 bytes
> > >> Invalid address length 6 - must be 16 bytes
> > >> Invalid address length 6 - must be 16 bytes
> > >> Invalid address length 6 - must be 16 bytes
> > >>
> > >> config available>?
> > >
> > > You need a kernel with kmsan. See:
> > > https://github.com/google/kmsan
> >
> >
> > Another option may be to spray memory at the allocation stack with
> > some distinctive byte pattern and then check this pattern at the use
> > stack. Not 100% sounds, but should be enough for testing.
> 
> Well, no need for kmsan here, once you know the problem

Yes, agree. Patch is good. I'll tag your v2.

Thanks!

[PATCH net] sctp: kfree_rcu asoc

2018-11-30 Thread Xin Long

In sctp_hash_transport/sctp_epaddr_lookup_transport, it dereferences
a transport's asoc under rcu_read_lock while asoc is freed not after
a grace period, which leads to a use-after-free panic.

This patch fixes it by calling kfree_rcu to make asoc be freed after
a grace period.

Note that only the asoc's memory is delayed to free in the patch, it
won't cause sk to linger longer.

Thanks Neil and Marcelo to make this clear.

Fixes: 7fda702f9315 ("sctp: use new rhlist interface on sctp transport 
rhashtable")
Fixes: cd2b70875058 ("sctp: check duplicate node before inserting a new 
transport")
Reported-by: syzbot+0b05d8aa7cb185107...@syzkaller.appspotmail.com
Reported-by: syzbot+aad231d51b1923158...@syzkaller.appspotmail.com
Suggested-by: Neil Horman 
Signed-off-by: Xin Long 
---
 include/net/sctp/structs.h | 2 ++
 net/sctp/associola.c   | 2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index a11f937..feada35 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -2075,6 +2075,8 @@ struct sctp_association {
 
__u64 abandoned_unsent[SCTP_PR_INDEX(MAX) + 1];
__u64 abandoned_sent[SCTP_PR_INDEX(MAX) + 1];
+
+   struct rcu_head rcu;
 };
 
 
diff --git a/net/sctp/associola.c b/net/sctp/associola.c
index 6a28b96..3702f48 100644
--- a/net/sctp/associola.c
+++ b/net/sctp/associola.c
@@ -434,7 +434,7 @@ static void sctp_association_destroy(struct 
sctp_association *asoc)
 
WARN_ON(atomic_read(>rmem_alloc));
 
-   kfree(asoc);
+   kfree_rcu(asoc, rcu);
SCTP_DBG_OBJCNT_DEC(assoc);
 }
 
-- 
2.1.0

Re: [PATCHv2 bpf 1/2] bpf: Support sk lookup in netns with id 0

2018-11-30 Thread Joe Stringer

On Thu, 29 Nov 2018 at 16:30, Joe Stringer  wrote:
>
> David Ahern and Nicolas Dichtel report that the handling of the netns id
> 0 is incorrect for the BPF socket lookup helpers: rather than finding
> the netns with id 0, it is resolving to the current netns. This renders
> the netns_id 0 inaccessible.
>
> To fix this, adjust the API for the netns to treat all negative s32
> values as a lookup in the current netns, while any values with a
> positive value in the signed 32-bit integer space would result in a
> lookup for a socket in the netns corresponding to that id. As before, if
> the netns with that ID does not exist, no socket will be found.
> Furthermore, if any bits are set in the upper 32-bits, then no socket
> will be found.

This last sentence is a little misleading, it only applies if the
highest bit in the lower 32 bits is 0.

Re: [PATCH iproute2] stats output

2018-11-30 Thread Roopa Prabhu

On Fri, Nov 30, 2018 at 10:12 AM Stephen Hemminger
 wrote:
>
> On Fri, 30 Nov 2018 14:33:49 +0100
> Alexis Vachette  wrote:
>
> > When using:
> >
> > - ip -s link
> >
> > I think it should be better to print errors stats without adding -s
> > option twice.
> >
> > This option print stats for each network interfaces and we want to see
> > if something is off and especially timers with errors.
> >
> > Man page of ip command is updated accordingly.
> >
> > Here is a patchset:
> >
> > Signed-off-by: Alexis Vachette 
> > ---
> > diff --git a/ip/ipaddress.c b/ip/ipaddress.c
> > index 85f05a2..c70394d 100644
> > --- a/ip/ipaddress.c
> > +++ b/ip/ipaddress.c
> > @@ -323,7 +323,7 @@ int print_linkinfo(const struct sockaddr_nl *who,
> >   fprintf(fp,"\nalias %s",
> >   (const char *) RTA_DATA(tb[IFLA_IFALIAS]));
> >
> > - if (do_link && tb[IFLA_STATS64] && show_stats) {
> > + if (do_link && tb[IFLA_STATS64]) {
> >   struct rtnl_link_stats64 slocal;
> >   struct rtnl_link_stats64 *s = RTA_DATA(tb[IFLA_STATS64]);
> >   if (((unsigned long)s) & (sizeof(unsigned long)-1)) {
> > @@ -343,16 +343,14 @@ int print_linkinfo(const struct sockaddr_nl *who,
> >   if (s->rx_compressed)
> >   fprintf(fp, " %-7llu",
> >   (unsigned long long)s->rx_compressed);
> > - if (show_stats > 1) {
> > - fprintf(fp, "%s", _SL_);
> > - fprintf(fp, "RX errors: length  crc frame   fifomissed%s", 
> > _SL_);
> > - fprintf(fp, "   %-7llu  %-7llu %-7llu %-7llu %-7llu",
> > - (unsigned long long)s->rx_length_errors,
> > - (unsigned long long)s->rx_crc_errors,
> > - (unsigned long long)s->rx_frame_errors,
> > - (unsigned long long)s->rx_fifo_errors,
> > - (unsigned long long)s->rx_missed_errors);
> > - }
> > + fprintf(fp, "%s", _SL_);
> > + fprintf(fp, "RX errors: length  crc frame   fifomissed%s", 
> > _SL_);
> > + fprintf(fp, "   %-7llu  %-7llu %-7llu %-7llu %-7llu",
> > + (unsigned long long)s->rx_length_errors,
> > + (unsigned long long)s->rx_crc_errors,
> > + (unsigned long long)s->rx_frame_errors,
> > + (unsigned long long)s->rx_fifo_errors,
> > + (unsigned long long)s->rx_missed_errors);
> >   fprintf(fp, "%s", _SL_);
> >   fprintf(fp, "TX: bytes  packets  errors  dropped carrier collsns 
> > %s%s",
> >   s->tx_compressed ? "compressed" : "", _SL_);
> > @@ -366,17 +364,15 @@ int print_linkinfo(const struct sockaddr_nl *who,
> >   if (s->tx_compressed)
> >   fprintf(fp, " %-7llu",
> >   (unsigned long long)s->tx_compressed);
> > - if (show_stats > 1) {
> > - fprintf(fp, "%s", _SL_);
> > - fprintf(fp, "TX errors: aborted fifowindow  heartbeat%s", _SL_);
> > - fprintf(fp, "   %-7llu  %-7llu %-7llu %-7llu",
> > - (unsigned long long)s->tx_aborted_errors,
> > - (unsigned long long)s->tx_fifo_errors,
> > - (unsigned long long)s->tx_window_errors,
> > - (unsigned long long)s->tx_heartbeat_errors);
> > - }
> > + fprintf(fp, "%s", _SL_);
> > + fprintf(fp, "TX errors: aborted fifowindow  heartbeat%s", _SL_);
> > + fprintf(fp, "   %-7llu  %-7llu %-7llu %-7llu",
> > + (unsigned long long)s->tx_aborted_errors,
> > + (unsigned long long)s->tx_fifo_errors,
> > + (unsigned long long)s->tx_window_errors,
> > + (unsigned long long)s->tx_heartbeat_errors);
> >   }
> > - if (do_link && !tb[IFLA_STATS64] && tb[IFLA_STATS] && show_stats) {
> > + if (do_link && !tb[IFLA_STATS64] && tb[IFLA_STATS]) {
> >   struct rtnl_link_stats slocal;
> >   struct rtnl_link_stats *s = RTA_DATA(tb[IFLA_STATS]);
> >   if (((unsigned long)s) & (sizeof(unsigned long)-1)) {
> > @@ -393,17 +389,15 @@ int print_linkinfo(const struct sockaddr_nl *who,
> >   );
> >   if (s->rx_compressed)
> >   fprintf(fp, " %-7u", s->rx_compressed);
> > - if (show_stats > 1) {
> > - fprintf(fp, "%s", _SL_);
> > - fprintf(fp, "RX errors: length  crc frame   fifomissed%s", 
> > _SL_);
> > - fprintf(fp, "   %-7u  %-7u %-7u %-7u %-7u",
> > - s->rx_length_errors,
> > - s->rx_crc_errors,
> > - s->rx_frame_errors,
> > - s->rx_fifo_errors,
> > - s->rx_missed_errors
> > - );
> > - }
> > + fprintf(fp, "%s", _SL_);
> > + fprintf(fp, "RX errors: length  crc frame   fifomissed%s", 
> > _SL_);
> > + fprintf(fp, "   %-7u  %-7u %-7u %-7u %-7u",
> > + s->rx_length_errors,
> > + s->rx_crc_errors,
> > + s->rx_frame_errors,
> > + s->rx_fifo_errors,
> > + s->rx_missed_errors
> > + );
> >   fprintf(fp, "%s", _SL_);
> >   fprintf(fp, "TX: bytes  packets  errors  dropped carrier collsns 
> > %s%s",
> >   s->tx_compressed ? "compressed" : "", _SL_);
> > @@ -412,16 +406,14 @@ int print_linkinfo(const struct sockaddr_nl *who,
> >   s->tx_dropped, s->tx_carrier_errors, s->collisions);
> >   if (s->tx_compressed)
> >   fprintf(fp, " %-7u", s->tx_compressed);
> > - if (show_stats > 1) {
> > - fprintf(fp, "%s", _SL_);
> > - fprintf(fp, "TX errors: aborted fifowindow  heartbeat%s", _SL_);
> > - fprintf(fp, "   %-7u  %-7u %-7u %-7u",
> > - s->tx_aborted_errors,
> > -

Re: [PATCH net] rtnetlink: Refine sanity checks in rtnl_fdb_{add|del}

2018-11-30 Thread Dmitry Vyukov

On Fri, Nov 30, 2018 at 4:02 PM, Ido Schimmel  wrote:
> On Fri, Nov 30, 2018 at 08:59:09AM -0700, David Ahern wrote:
>> This does not repro for me:
>> # ./a.out
>> Invalid address length 6 - must be 4 bytes
>> RTNETLINK answers: No buffer space available
>> RTNETLINK answers: Operation not supported
>> Invalid address length 6 - must be 4 bytes
>> Invalid address length 6 - must be 4 bytes
>> Invalid address length 6 - must be 4 bytes
>> Invalid address length 6 - must be 16 bytes
>> Invalid address length 6 - must be 16 bytes
>> Invalid address length 6 - must be 16 bytes
>>
>> config available>?
>
> You need a kernel with kmsan. See:
> https://github.com/google/kmsan


Another option may be to spray memory at the allocation stack with
some distinctive byte pattern and then check this pattern at the use
stack. Not 100% sounds, but should be enough for testing.

Re: [iproute PATCH] man: ip-route.8: Fix ENCAP references in synopsis

2018-11-30 Thread Phil Sutter

Hi Simon,

On Fri, Nov 30, 2018 at 03:39:05PM +0100, Simon Horman wrote:
> On Wed, Nov 28, 2018 at 12:12:32PM +0100, Phil Sutter wrote:
> > The different encapsulation types are described in ENCAP_*
> > non-terminals, but ENCAP definition lists them without the ENCAP_
> > prefix. Fix this for consistency.
> > 
> > Signed-off-by: Phil Sutter 
> > ---
> >  man/man8/ip-route.8.in | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git a/man/man8/ip-route.8.in b/man/man8/ip-route.8.in
> > index 11dd4ca7abf68..26dfe0b06c86b 100644
> > --- a/man/man8/ip-route.8.in
> > +++ b/man/man8/ip-route.8.in
> > @@ -187,7 +187,8 @@ throw " | " unreachable " | " prohibit " | " blackhole 
> > " | " nat " ]"
> >  
> >  .ti -8
> >  .IR ENCAP " := [ "
> > -.IR MPLS " | " IP " | " BPF " | " SEG6 " | " SEG6LOCAL " ] "
> > +.IR ENCAP_MPLS " | " ENCAP_IP " | " ENCAP_BPF " | "
> > +.IR ENCAP_SEG6 " | " ENCAP_SEG6LOCAL " ] "
> >  
> >  .ti -8
> >  .IR ENCAP_MPLS " := "
> 
> This looks good but do we have another inconsistency with regards
> to ENCAP and ENCAPTYPE ENCAPHDR (further down).
> 
> Glancing over the file the following also seem inconsistent:
> 
> * NH / NEXTHOP
> * SCOPE / SCOPE_VAL

Well, strictly speaking yes. If I want to know more about NH listed in
synopsis I can't find it further down. Though its description in
'nexthop' is incomplete, as well.

My personal focus is that non-terminals in synopsis are defined in there
as well, otherwise I can't parse and the thrown exception is hard to
clean off my screen. ;)

My assumption (actually what *I* do) is to search for terminals in order
to get more info. For instance, I would search for 'nexthop' or 'encap'
not 'NH'.

I guess the broader question is about the scope of non-terminals in
synopsis and description sections - I don't necessarily consider them
related.

Of course feel free to fix what you don't like, I'll review and (N)ACK
if you put me in Cc. :)

Cheers, Phil

Re: [PATCH 2/2] arm64: dts: clearfog-gt-8k: 1G eth PHY reset signal

2018-11-30 Thread Gregory CLEMENT

Hi Baruch,
 
 On mar., oct. 16 2018, Baruch Siach  wrote:

> This reset signal controls the Marvell 1512 1G PHY.
>
> Note that current implementation queries the PHY over the MDIO bus
> (get_phy_device() call from of_mdiobus_register_phy()) before reset
> signal deassert. If the PHY reset signal is asserted at boot time, PHY
> registration fails. So current code relies on the bootloader to deassert
> the reset signal.

Applied on mvebu/dt64

Thanks,

Gregory

>
> Signed-off-by: Baruch Siach 
> ---
>  arch/arm64/boot/dts/marvell/armada-8040-clearfog-gt-8k.dts | 4 
>  1 file changed, 4 insertions(+)
>
> diff --git a/arch/arm64/boot/dts/marvell/armada-8040-clearfog-gt-8k.dts 
> b/arch/arm64/boot/dts/marvell/armada-8040-clearfog-gt-8k.dts
> index af1310c53bc8..73df0ef5e0c4 100644
> --- a/arch/arm64/boot/dts/marvell/armada-8040-clearfog-gt-8k.dts
> +++ b/arch/arm64/boot/dts/marvell/armada-8040-clearfog-gt-8k.dts
> @@ -337,6 +337,10 @@
>*/
>   marvell,reg-init = <3 16 0 0x1017>;
>   reg = <0>;
> + pinctrl-names = "default";
> + pinctrl-0 = <_copper_eth_phy_reset>;
> + reset-gpios = <_gpio1 11 GPIO_ACTIVE_LOW>;
> + reset-assert-us = <1>;
>   };
>  
>   switch0: switch0@4 {
> -- 
> 2.19.1
>

-- 
Gregory Clement, Bootlin
Embedded Linux and Kernel engineering
http://bootlin.com

Re: [PATCH 1/2] arm64: dts: clearfog-gt-8k: fix USB regulator gpio polarity

2018-11-30 Thread Gregory CLEMENT

Hi Baruch,
 
 On mar., oct. 16 2018, Baruch Siach  wrote:

> The fixed regulator driver ignores the gpio flags, so this change has
> no practical effect in the current implementation. Fix it anyway to
> correct the hardware description.
>

Applied on mvebu/dt64

Thanks,

Gregory

> Signed-off-by: Baruch Siach 
> ---
>  arch/arm64/boot/dts/marvell/armada-8040-clearfog-gt-8k.dts | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/arm64/boot/dts/marvell/armada-8040-clearfog-gt-8k.dts 
> b/arch/arm64/boot/dts/marvell/armada-8040-clearfog-gt-8k.dts
> index aea9c220ae6a..af1310c53bc8 100644
> --- a/arch/arm64/boot/dts/marvell/armada-8040-clearfog-gt-8k.dts
> +++ b/arch/arm64/boot/dts/marvell/armada-8040-clearfog-gt-8k.dts
> @@ -42,7 +42,7 @@
>  
>   v_5v0_usb3_hst_vbus: regulator-usb3-vbus0 {
>   compatible = "regulator-fixed";
> - gpio = <_gpio2 15 GPIO_ACTIVE_HIGH>;
> + gpio = <_gpio2 15 GPIO_ACTIVE_LOW>;
>   pinctrl-names = "default";
>   pinctrl-0 = <_xhci_vbus_pins>;
>   regulator-name = "v_5v0_usb3_hst_vbus";
> -- 
> 2.19.1
>

-- 
Gregory Clement, Bootlin
Embedded Linux and Kernel engineering
http://bootlin.com

Re: [PATCH] net: tcp: add correct check for tcp_retransmit_skb()

2018-11-30 Thread Yuchung Cheng

On Fri, Nov 30, 2018 at 10:28 AM Sharath Chandra Vurukala
 wrote:
>
> when the tcp_retranmission_timer expires and tcp_retranmsit_skb is
> called if the retranmsission fails due to local congestion,
> backoff should not incremented.
>
> tcp_retransmit_skb() returns non-zero negative value in some cases of
> failure but the caller tcp_retransmission_timer() has a check for
> failure which checks if the return value is greater than zero.
> The check is corrected to check for non-zero value.
>
> Signed-off-by: Sharath Chandra Vurukala 
Perhaps my previous comment was not clear: your bug-fix patch is incorrect.

On local congestion, tcp_retransmit_skb returns positive values
*only*. negative values do not indicate local congestion.

> ---
>  net/ipv4/tcp_timer.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> index 091c5392..c19f371 100644
> --- a/net/ipv4/tcp_timer.c
> +++ b/net/ipv4/tcp_timer.c
> @@ -511,7 +511,7 @@ void tcp_retransmit_timer(struct sock *sk)
>
> tcp_enter_loss(sk);
>
> -   if (tcp_retransmit_skb(sk, tcp_rtx_queue_head(sk), 1) > 0) {
> +   if (tcp_retransmit_skb(sk, tcp_rtx_queue_head(sk), 1) != 0) {
> /* Retransmission failed because of local congestion,
>  * do not backoff.
>  */
> --
> 1.9.1
>

Re: [PATCH net] rtnetlink: Refine sanity checks in rtnl_fdb_{add|del}

2018-11-30 Thread Ido Schimmel

On Fri, Nov 30, 2018 at 08:59:09AM -0700, David Ahern wrote:
> This does not repro for me:
> # ./a.out
> Invalid address length 6 - must be 4 bytes
> RTNETLINK answers: No buffer space available
> RTNETLINK answers: Operation not supported
> Invalid address length 6 - must be 4 bytes
> Invalid address length 6 - must be 4 bytes
> Invalid address length 6 - must be 4 bytes
> Invalid address length 6 - must be 16 bytes
> Invalid address length 6 - must be 16 bytes
> Invalid address length 6 - must be 16 bytes
> 
> config available>?

You need a kernel with kmsan. See:
https://github.com/google/kmsan

[PATCH bpf-next 5/5] tools: bpftool: add owner_prog_type and owner_jited to bpftool output

2018-11-30 Thread Quentin Monnet

For prog array maps, the type of the owner program, and the JIT-ed state
of that program, are available from the file descriptor information
under /proc. Add them to "bpftool map show" output. Example output:

# bpftool map show
158225: prog_array  name jmp_table  flags 0x0
key 4B  value 4B  max_entries 8  memlock 4096B
owner_prog_type flow_dissector  owner jited
# bpftool --json --pretty map show
[{
"id": 1337,
"type": "prog_array",
"name": "jmp_table",
"flags": 0,
"bytes_key": 4,
"bytes_value": 4,
"max_entries": 8,
"bytes_memlock": 4096,
"owner_prog_type": "flow_dissector",
"owner_jited": true
}
]

As we move the table used for associating names to program types,
complete it with the missing types (lwt_seg6local and sk_reuseport).
Also add missing types to the help message for "bpftool prog"
(sk_reuseport and flow_dissector).

Suggested-by: Daniel Borkmann 
Signed-off-by: Quentin Monnet 
Acked-by: Jakub Kicinski 
---
 tools/bpf/bpftool/main.h | 26 +
 tools/bpf/bpftool/map.c  | 50 ++--
 tools/bpf/bpftool/prog.c | 25 +---
 3 files changed, 75 insertions(+), 26 deletions(-)

diff --git a/tools/bpf/bpftool/main.h b/tools/bpf/bpftool/main.h
index 7431669fae0a..2761981669c8 100644
--- a/tools/bpf/bpftool/main.h
+++ b/tools/bpf/bpftool/main.h
@@ -78,6 +78,32 @@
 #define HELP_SPEC_MAP  \
"MAP := { id MAP_ID | pinned FILE }"
 
+static const char * const prog_type_name[] = {
+   [BPF_PROG_TYPE_UNSPEC]  = "unspec",
+   [BPF_PROG_TYPE_SOCKET_FILTER]   = "socket_filter",
+   [BPF_PROG_TYPE_KPROBE]  = "kprobe",
+   [BPF_PROG_TYPE_SCHED_CLS]   = "sched_cls",
+   [BPF_PROG_TYPE_SCHED_ACT]   = "sched_act",
+   [BPF_PROG_TYPE_TRACEPOINT]  = "tracepoint",
+   [BPF_PROG_TYPE_XDP] = "xdp",
+   [BPF_PROG_TYPE_PERF_EVENT]  = "perf_event",
+   [BPF_PROG_TYPE_CGROUP_SKB]  = "cgroup_skb",
+   [BPF_PROG_TYPE_CGROUP_SOCK] = "cgroup_sock",
+   [BPF_PROG_TYPE_LWT_IN]  = "lwt_in",
+   [BPF_PROG_TYPE_LWT_OUT] = "lwt_out",
+   [BPF_PROG_TYPE_LWT_XMIT]= "lwt_xmit",
+   [BPF_PROG_TYPE_SOCK_OPS]= "sock_ops",
+   [BPF_PROG_TYPE_SK_SKB]  = "sk_skb",
+   [BPF_PROG_TYPE_CGROUP_DEVICE]   = "cgroup_device",
+   [BPF_PROG_TYPE_SK_MSG]  = "sk_msg",
+   [BPF_PROG_TYPE_RAW_TRACEPOINT]  = "raw_tracepoint",
+   [BPF_PROG_TYPE_CGROUP_SOCK_ADDR]= "cgroup_sock_addr",
+   [BPF_PROG_TYPE_LWT_SEG6LOCAL]   = "lwt_seg6local",
+   [BPF_PROG_TYPE_LIRC_MODE2]  = "lirc_mode2",
+   [BPF_PROG_TYPE_SK_REUSEPORT]= "sk_reuseport",
+   [BPF_PROG_TYPE_FLOW_DISSECTOR]  = "flow_dissector",
+};
+
 enum bpf_obj_type {
BPF_OBJ_UNKNOWN,
BPF_OBJ_PROG,
diff --git a/tools/bpf/bpftool/map.c b/tools/bpf/bpftool/map.c
index 3850f8d65703..8469ea6cf1c8 100644
--- a/tools/bpf/bpftool/map.c
+++ b/tools/bpf/bpftool/map.c
@@ -487,7 +487,6 @@ static int show_map_close_json(int fd, struct bpf_map_info 
*info)
char *memlock;
 
memlock = get_fdinfo(fd, "memlock");
-   close(fd);
 
jsonw_start_object(json_wtr);
 
@@ -514,6 +513,30 @@ static int show_map_close_json(int fd, struct bpf_map_info 
*info)
jsonw_int_field(json_wtr, "bytes_memlock", atoi(memlock));
free(memlock);
 
+   if (info->type == BPF_MAP_TYPE_PROG_ARRAY) {
+   char *owner_prog_type = get_fdinfo(fd, "owner_prog_type");
+   char *owner_jited = get_fdinfo(fd, "owner_jited");
+
+   if (owner_prog_type) {
+   unsigned int prog_type = atoi(owner_prog_type);
+
+   if (prog_type < ARRAY_SIZE(prog_type_name))
+   jsonw_string_field(json_wtr, "owner_prog_type",
+  prog_type_name[prog_type]);
+   else
+   jsonw_uint_field(json_wtr, "owner_prog_type",
+prog_type);
+   }
+   if (atoi(owner_jited))
+   jsonw_bool_field(json_wtr, "owner_jited", true);
+   else
+   jsonw_bool_field(json_wtr, "owner_jited", false);
+
+   free(owner_prog_type);
+   free(owner_jited);
+   }
+   close(fd);
+
if (!hash_empty(map_table.table)) {
struct pinned_obj *obj;
 
@@ -536,7 +559,6 @@ static int show_map_close_plain(int fd, struct

[PATCH bpf-next 4/5] tools: bpftool: mark offloaded programs more explicitly in plain output

2018-11-30 Thread Quentin Monnet

In bpftool (plain) output for "bpftool prog show" or "bpftool map show",
an offloaded BPF object is simply denoted with "dev ifname", which is
not really explicit. Change it with something that clearly shows the
program is offloaded.

While at it also add an additional space, as done between other
information fields.

Example output, before:

# bpftool prog show
1337: xdp  tag a04f5eef06a7f555 dev foo
loaded_at 2018-10-19T16:40:36+0100  uid 0
xlated 16B  not jited  memlock 4096B

After:

# bpftool prog show
1337: xdp  tag a04f5eef06a7f555  offloaded_to foo
loaded_at 2018-10-19T16:40:36+0100  uid 0
xlated 16B  not jited  memlock 4096B

Suggested-by: Daniel Borkmann 
Signed-off-by: Quentin Monnet 
Acked-by: Jakub Kicinski 
---
 tools/bpf/bpftool/common.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/bpf/bpftool/common.c b/tools/bpf/bpftool/common.c
index 4349b6683ca8..172d3761d9ab 100644
--- a/tools/bpf/bpftool/common.c
+++ b/tools/bpf/bpftool/common.c
@@ -604,7 +604,7 @@ void print_dev_plain(__u32 ifindex, __u64 ns_dev, __u64 
ns_inode)
if (!ifindex)
return;
 
-   printf(" dev ");
+   printf("  offloaded_to ");
if (ifindex_to_name_ns(ifindex, ns_dev, ns_inode, name))
printf("%s", name);
else
-- 
2.7.4

[PATCH bpf-next 2/5] tools: bpftool: fix bash completion for bpftool prog (attach|detach)

2018-11-30 Thread Quentin Monnet

Fix bash completion for "bpftool prog (attach|detach) PROG TYPE MAP" so
that the list of indices proposed for MAP are map indices, and not PROG
indices. Also use variables for map and prog reference types ("id",
"pinned", and "tag" for programs).

Fixes: b7d3826c2ed6 ("bpf: bpftool, add support for attaching programs to maps")
Signed-off-by: Quentin Monnet 
Reviewed-by: Jakub Kicinski 
---
 tools/bpf/bpftool/bash-completion/bpftool | 73 +--
 1 file changed, 49 insertions(+), 24 deletions(-)

diff --git a/tools/bpf/bpftool/bash-completion/bpftool 
b/tools/bpf/bpftool/bash-completion/bpftool
index 45c2db257d2b..b7e6c4f25ad1 100644
--- a/tools/bpf/bpftool/bash-completion/bpftool
+++ b/tools/bpf/bpftool/bash-completion/bpftool
@@ -243,16 +243,20 @@ _bpftool()
 # Completion depends on object and command in use
 case $object in
 prog)
-if [[ $command != "load" && $command != "loadall" ]]; then
-case $prev in
-id)
-_bpftool_get_prog_ids
-return 0
-;;
-esac
-fi
+# Complete id, only for subcommands that use prog (but no map) ids
+case $command in
+show|list|dump|pin)
+case $prev in
+id)
+_bpftool_get_prog_ids
+return 0
+;;
+esac
+;;
+esac
 
 local PROG_TYPE='id pinned tag'
+local MAP_TYPE='id pinned'
 case $command in
 show|list)
 [[ $prev != "$command" ]] && return 0
@@ -293,22 +297,43 @@ _bpftool()
 return 0
 ;;
 attach|detach)
-if [[ ${#words[@]} == 7 ]]; then
-COMPREPLY=( $( compgen -W "id pinned" -- "$cur" ) )
-return 0
-fi
-
-if [[ ${#words[@]} == 6 ]]; then
-COMPREPLY=( $( compgen -W "msg_verdict skb_verdict \
-skb_parse flow_dissector" -- "$cur" ) )
-return 0
-fi
-
-if [[ $prev == "$command" ]]; then
-COMPREPLY=( $( compgen -W "id pinned" -- "$cur" ) )
-return 0
-fi
-return 0
+case $cword in
+3)
+COMPREPLY=( $( compgen -W "$PROG_TYPE" -- "$cur" ) 
)
+return 0
+;;
+4)
+case $prev in
+id)
+_bpftool_get_prog_ids
+;;
+pinned)
+_filedir
+;;
+esac
+return 0
+;;
+5)
+COMPREPLY=( $( compgen -W 'msg_verdict skb_verdict 
\
+skb_parse flow_dissector' -- "$cur" ) )
+return 0
+;;
+6)
+COMPREPLY=( $( compgen -W "$MAP_TYPE" -- "$cur" ) )
+return 0
+;;
+7)
+case $prev in
+id)
+_bpftool_get_map_ids
+;;
+pinned)
+_filedir
+;;
+esac
+return 0
+;;
+esac
 ;;
 load|loadall)
 local obj
-- 
2.7.4

Re: [PATCH 3/3] dt-bindings: net: dsa: add new bindings MT7530

2018-11-30 Thread Florian Fainelli

Hi Greg,

On 11/29/2018 11:57 PM, g...@kernel.org wrote:
> From: Greg Ungerer 
> 
> Add descriptive entries for the new bindings introduced to support the
> MT7530 implementation in the MediaTek MT7621 SoC.
> 
> New bindings added for:
> 
>   mediatek,no-clock-regulator
>   mediatek,mfc-has-cpuport

I don't think any of these properties are necessary, if you can either
use a compatible string, and/or infer the actual model at runtime in the
driver's probe function, then you can assess based on that chip model as
well as the properties being provided in Device Tree whether these
resources must be grabbed and used. See mv88e6xxx and b53 for how these
drivers deal with supporting several distinct models within the same
code base.

As far as the MFC programming goes, this is definitively something that
must be done once you know the chip model you are dealing with.
-- 
Florian

[PATCH 0/3] net: macb: DMA race condition fixes

2018-11-30 Thread Anssi Hannula

Hi all,

Here are a couple of race condition fixes for the macb driver. The first
two are issues observed on real HW.

Anssi Hannula (3):
  net: macb: fix random memory corruption on RX with 64-bit DMA
  net: macb: fix dropped RX frames due to a race
  net: macb: add missing barriers when reading buffers

 drivers/net/ethernet/cadence/macb_main.c | 34 +-
 1 file changed, 28 insertions(+), 6 deletions(-)

-- 
Anssi Hannula / Bitwise Oy

[PATCH 2/3] net: macb: fix dropped RX frames due to a race

2018-11-30 Thread Anssi Hannula

Bit RX_USED set to 0 in the address field allows the controller to write
data to the receive buffer descriptor.

The driver does not ensure the ctrl field is ready (cleared) when the
controller sees the RX_USED=0 written by the driver. The ctrl field might
only be cleared after the controller has already updated it according to
a newly received frame, causing the frame to be discarded in gem_rx() due
to unexpected ctrl field contents.

A message is logged when the above scenario occurs:

  macb ff0b.ethernet eth0: not whole frame pointed by descriptor

Fix the issue by ensuring that when the controller sees RX_USED=0 the
ctrl field is already cleared.

This issue was observed on a ZynqMP based system.

Signed-off-by: Anssi Hannula 
Fixes: 4df95131ea80 ("net/macb: change RX path for GEM")
Cc: Nicolas Ferre 
---
 drivers/net/ethernet/cadence/macb_main.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/cadence/macb_main.c 
b/drivers/net/ethernet/cadence/macb_main.c
index 0bc2aab7be40..430b7a0f5436 100644
--- a/drivers/net/ethernet/cadence/macb_main.c
+++ b/drivers/net/ethernet/cadence/macb_main.c
@@ -935,14 +935,19 @@ static void gem_rx_refill(struct macb_queue *queue)
 
if (entry == bp->rx_ring_size - 1)
paddr |= MACB_BIT(RX_WRAP);
-   macb_set_addr(bp, desc, paddr);
desc->ctrl = 0;
+   /* Setting addr clears RX_USED and allows reception,
+* make sure ctrl is cleared first to avoid a race.
+*/
+   dma_wmb();
+   macb_set_addr(bp, desc, paddr);
 
/* properly align Ethernet header */
skb_reserve(skb, NET_IP_ALIGN);
} else {
-   desc->addr &= ~MACB_BIT(RX_USED);
desc->ctrl = 0;
+   dma_wmb();
+   desc->addr &= ~MACB_BIT(RX_USED);
}
}
 
-- 
2.17.2

[PATCH 1/3] net: macb: fix random memory corruption on RX with 64-bit DMA

2018-11-30 Thread Anssi Hannula

64-bit DMA addresses are split in upper and lower halves that are
written in separate fields on GEM. For RX, bit 0 of the address is used
as the ownership bit (RX_USED). When the RX_USED bit is unset the
controller is allowed to write data to the buffer.

The driver does not guarantee that the controller already sees the upper
half when the RX_USED bit is cleared, possibly resulting in the
controller writing an incoming frame to an address with an incorrect
upper half and therefore possibly corrupting unrelated system memory.

Fix that by adding the necessary DMA memory barrier between the writes.

This corruption was observed on a ZynqMP based system.

Signed-off-by: Anssi Hannula 
Fixes: fff8019a08b6 ("net: macb: Add 64 bit addressing support for GEM")
Cc: Nicolas Ferre 
Cc: Harini Katakam 
Cc: Michal Simek 
---
 drivers/net/ethernet/cadence/macb_main.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/drivers/net/ethernet/cadence/macb_main.c 
b/drivers/net/ethernet/cadence/macb_main.c
index d8c7ca037ae3..0bc2aab7be40 100644
--- a/drivers/net/ethernet/cadence/macb_main.c
+++ b/drivers/net/ethernet/cadence/macb_main.c
@@ -682,6 +682,11 @@ static void macb_set_addr(struct macb *bp, struct 
macb_dma_desc *desc, dma_addr_
if (bp->hw_dma_cap & HW_DMA_CAP_64B) {
desc_64 = macb_64b_desc(bp, desc);
desc_64->addrh = upper_32_bits(addr);
+   /* The low bits of RX address contain the RX_USED bit, clearing
+* of which allows packet RX. Make sure the high bits are also
+* visible to HW at that point.
+*/
+   dma_wmb();
}
 #endif
desc->addr = lower_32_bits(addr);
-- 
2.17.2

[PATCH 3/3] net: macb: add missing barriers when reading buffers

2018-11-30 Thread Anssi Hannula

When reading buffer descriptors on RX or on TX completion, an
RX_USED/TX_USED bit is checked first to ensure that the descriptor has
been populated. However, there are no memory barriers to ensure that the
data protected by the RX_USED/TX_USED bit is up-to-date with respect to
that bit.

Fix that by adding DMA read memory barriers on those paths.

I did not observe any actual issues caused by these being missing,
though.

Tested on a ZynqMP based system.

Signed-off-by: Anssi Hannula 
Fixes: 89e5785fc8a6 ("[PATCH] Atmel MACB ethernet driver")
Cc: Nicolas Ferre 
---
 drivers/net/ethernet/cadence/macb_main.c | 20 
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/cadence/macb_main.c 
b/drivers/net/ethernet/cadence/macb_main.c
index 430b7a0f5436..c93baa8621d5 100644
--- a/drivers/net/ethernet/cadence/macb_main.c
+++ b/drivers/net/ethernet/cadence/macb_main.c
@@ -861,6 +861,11 @@ static void macb_tx_interrupt(struct macb_queue *queue)
 
/* First, update TX stats if needed */
if (skb) {
+   /* Ensure all of desc is at least as up-to-date
+* as ctrl (TX_USED bit)
+*/
+   dma_rmb();
+
if (gem_ptp_do_txstamp(queue, skb, desc) == 0) {
/* skb now belongs to timestamp buffer
 * and will be removed later
@@ -1000,12 +1005,16 @@ static int gem_rx(struct macb_queue *queue, int budget)
rmb();
 
rxused = (desc->addr & MACB_BIT(RX_USED)) ? true : false;
-   addr = macb_get_addr(bp, desc);
-   ctrl = desc->ctrl;
 
if (!rxused)
break;
 
+   /* Ensure other data is at least as up-to-date as rxused */
+   dma_rmb();
+
+   addr = macb_get_addr(bp, desc);
+   ctrl = desc->ctrl;
+
queue->rx_tail++;
count++;
 
@@ -1180,11 +1189,14 @@ static int macb_rx(struct macb_queue *queue, int budget)
/* Make hw descriptor updates visible to CPU */
rmb();
 
-   ctrl = desc->ctrl;
-
if (!(desc->addr & MACB_BIT(RX_USED)))
break;
 
+   /* Ensure other data is at least as up-to-date as addr */
+   dma_rmb();
+
+   ctrl = desc->ctrl;
+
if (ctrl & MACB_BIT(RX_SOF)) {
if (first_frag != -1)
discard_partial_frame(queue, first_frag, tail);
-- 
2.17.2

Re: [iproute PATCH] ssfilter: Fix for inverted last expression

2018-11-30 Thread Eric Dumazet




On 11/29/2018 04:20 AM, Phil Sutter wrote:
> When fixing for shift/reduce conflicts, possibility to invert the last
> expression by prefixing with '!' or 'not' was accidentally removed.
> 
> Fix this by allowing for expr to be an inverted expr so that any
> reference to it in exprlist accepts the inverted prefix.
> 
> Reported-by: Eric Dumazet 
> Fixes: b2038cc0b2403 ("ssfilter: Eliminate shift/reduce conflicts")
> Signed-off-by: Phil Sutter 
> ---


Thanks Phil !

Acked-by: Eric Dumazet

Re: [PATCH] net: tcp: add correct check for tcp_retransmit_skb()

2018-11-30 Thread Eric Dumazet




On 11/30/2018 10:28 AM, Sharath Chandra Vurukala wrote:
> when the tcp_retranmission_timer expires and tcp_retranmsit_skb is
> called if the retranmsission fails due to local congestion,
> backoff should not incremented.
> 
> tcp_retransmit_skb() returns non-zero negative value in some cases of
> failure but the caller tcp_retransmission_timer() has a check for
> failure which checks if the return value is greater than zero.
> The check is corrected to check for non-zero value.
> 
> Signed-off-by: Sharath Chandra Vurukala 
>

This looks wrong.

Yuchung has cooked a patch series to really address issues, please wait for it

We are waiting for David Miller to merge a prior patch series into net tree, 
then
in net-next.

Thanks.

Re: [PATCH net] rtnetlink: Refine sanity checks in rtnl_fdb_{add|del}

2018-11-30 Thread Eric Dumazet

On Fri, Nov 30, 2018 at 7:46 AM Eric Dumazet  wrote:
>
> On Fri, Nov 30, 2018 at 7:40 AM Eric Dumazet  wrote:
> >
> > On Fri, Nov 30, 2018 at 7:36 AM David Ahern  wrote:
> > >
> > > On 11/30/18 7:58 AM, Ido Schimmel wrote:
> > > > Can you please share the reproducer (assuming it exists)? I don't really
> > > > understand the fix. None of the functions you patched are in the trace.
> > > > Also, looking at IPv4 GRE code, while GRE device has dev->addr_len set
> > > > to 4, dev->type is set to ARPHRD_IPGRE.
> > >
> > > I had the same reaction ... you can not claim to be an ethernet device
> > > and have a hw address that is not 6 bytes.
> >
> > This has been discussed a number of times TUNSETLINK can do that.
> > (I have not checked what the repro does )
>
>
> Oh well...ndo_dflt_fdb_dump() seems to be used on a gre device.

What about :

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 
54cfbda0b58be34dcb164028f17cdde2826c857b..b4cfc139c8b05b19564e02f29bd030c5ff85b51b
100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -3800,6 +3800,9 @@ int ndo_dflt_fdb_dump(struct sk_buff *skb,
 {
int err;

+   if (dev->type != ARPHRD_ETHER || dev->addr_len != ETH_ALEN)
+   return -EINVAL;
+
netif_addr_lock_bh(dev);
err = nlmsg_populate_fdb(skb, cb, dev, idx, >uc);
if (err)

Re: [PATCH iproute2] stats output

2018-11-30 Thread David Ahern

On 11/30/18 11:12 AM, Stephen Hemminger wrote:
> I can understand why you would want this, but it is changing the
> behavior of an existing command that might be used in scripts.

+1

[PATCH 2/2] net: phy: ensure autoneg is configured when resuming a phydev

2018-11-30 Thread Anssi Hannula

When a PHY_HALTED phydev is resumed by phy_start(), it is set to
PHY_RESUMING to wait for the link to come up.

However, if the phydev was put to PHY_HALTED (by e.g. phy_stop()) before
autonegotiation was ever started by phy_state_machine(), autonegotiation
remains unconfigured, i.e. phy_config_aneg() is never called.

Fix that by going to PHY_UP instead of PHY_RESUMING if autonegotiation
has never been configured.

Signed-off-by: Anssi Hannula 
---

This doesn't feel as clean is I'd like to, though. Maybe there is a
better way?


 drivers/net/phy/phy.c | 11 ++-
 include/linux/phy.h   |  2 ++
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index d46858694db9..7a650cce7599 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -553,6 +553,8 @@ int phy_start_aneg(struct phy_device *phydev)
if (err < 0)
goto out_unlock;
 
+   phydev->autoneg_configured = 1;
+
if (phydev->state != PHY_HALTED) {
if (AUTONEG_ENABLE == phydev->autoneg) {
err = phy_check_link_status(phydev);
@@ -893,7 +895,14 @@ void phy_start(struct phy_device *phydev)
break;
}
 
-   phydev->state = PHY_RESUMING;
+   /* if autoneg/forcing was never configured, go back to PHY_UP
+* to make that happen
+*/
+   if (!phydev->autoneg_configured)
+   phydev->state = PHY_UP;
+   else
+   phydev->state = PHY_RESUMING;
+
break;
default:
break;
diff --git a/include/linux/phy.h b/include/linux/phy.h
index 8f927246acdb..b306d5ed9d09 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -389,6 +389,8 @@ struct phy_device {
unsigned loopback_enabled:1;
 
unsigned autoneg:1;
+   /* autoneg has been configured at least once */
+   unsigned autoneg_configured:1;
/* The most recently read link state */
unsigned link:1;
 
-- 
2.17.2

[PATCH 0/2] net: phy: fixes to PHY_HALTED handling

2018-11-30 Thread Anssi Hannula

Hi all,

Here are a couple of fixes for PHY_HALTED/PHY_RESUMING handling.


On a related note, it feels a bit strange that AFAICS phydevs will only
be put to powerdown state (via PHY_HALTED) after the network interface
has been brought up and down once. If the ethernet interface is never
brought up, the phydev remains powered on (in PHY_READY).

This is because the phydev is only put to PHY_HALTED from phy_stop() and
phy_error(), and phy_stop() seems to be normally called only from
.ndo_stop().

But I didn't touch that now as I wasn't sure what is the intent there.


Anssi Hannula (2):
  net: phy: suspend phydev on PHY_HALTED even if there is no link
  net: phy: ensure autoneg is configured when resuming a phydev

 drivers/net/phy/phy.c | 13 +++--
 include/linux/phy.h   |  2 ++
 2 files changed, 13 insertions(+), 2 deletions(-)

-- 
Anssi Hannula / Bitwise Oy

[PATCH 1/2] net: phy: suspend phydev on PHY_HALTED even if there is no link

2018-11-30 Thread Anssi Hannula

When the phydev is put to PHY_HALTED (by e.g. phy_stop()) the state
machine suspends the phydev to save power.

However, this is wrongly inside a phydev->link check, causing the phydev
not to be suspended if there was no link at the time of stopping it.

Fix that by setting do_suspend regardless of whether there was a link or
not.

Signed-off-by: Anssi Hannula 
Fixes: be9dad1f9f26 ("net: phy: suspend phydev when going to HALTED")
Cc: Sebastian Hesselbarth 
---
 drivers/net/phy/phy.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index 376a0d8a2b61..d46858694db9 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -956,8 +956,8 @@ void phy_state_machine(struct work_struct *work)
if (phydev->link) {
phydev->link = 0;
phy_link_down(phydev, true);
-   do_suspend = true;
}
+   do_suspend = true;
break;
}
 
-- 
2.17.2

Re: [PATCH net-next 1/2] ixgbe: register a mdiobus

2018-11-30 Thread Steve Douthit

On 11/30/18 8:21 AM, Andrew Lunn wrote:
> Hi Steve
> 
> Cool to see another interface being made DSA capable.
> 
>> +/**
>> + *  ixgbe_msca - Write the command register and poll for completion/timeout
>> + *  @hw: pointer to hardware structure
>> + *  @cmd: command register value to write
>> + **/
>> +static s32 ixgbe_msca_cmd(struct ixgbe_hw *hw, u32 cmd)
>> +{
>> +u32 i;
>> +
>> +IXGBE_WRITE_REG(hw, IXGBE_MSCA, cmd);
>> +
>> +for (i = 0; i < IXGBE_MDIO_COMMAND_TIMEOUT; i++) {
>> +udelay(10);
>> +cmd = IXGBE_READ_REG(hw, IXGBE_MSCA);
>> +if (!(cmd & IXGBE_MSCA_MDI_COMMAND))
>> +return 0;
>> +}
>> +
>> +return -ETIMEDOUT;
> 
> Please consider using readx_poll_timeout()

OK

>> +}
>> +
>> +/**
>> + *  ixgbe_msca - Use device_id to figure out if MDIO bus is shared between 
>> MACs.
>> + *  The embedded x550(s) in the C3000 line of SoCs only have a single 
>> mii_bus
>> + *  shared between all MACs, and that introduces some new mask flags that 
>> need
>> + *  to be passed to the *_swfw_sync callbacks.
>> + *  @hw: pointer to hardware structure
>> + **/
>> +static bool ixgbe_is_shared_mii(struct ixgbe_hw *hw)
>> +{
>> +switch (hw->device_id) {
>> +case IXGBE_DEV_ID_X550EM_A_KR:
>> +case IXGBE_DEV_ID_X550EM_A_KR_L:
>> +case IXGBE_DEV_ID_X550EM_A_SFP_N:
>> +case IXGBE_DEV_ID_X550EM_A_SGMII:
>> +case IXGBE_DEV_ID_X550EM_A_SGMII_L:
>> +case IXGBE_DEV_ID_X550EM_A_10G_T:
>> +case IXGBE_DEV_ID_X550EM_A_SFP:
>> +case IXGBE_DEV_ID_X550EM_A_1G_T:
>> +case IXGBE_DEV_ID_X550EM_A_1G_T_L:
>> +return true;
>> +}
>> +
>> +return false;
>> +}
>> +
>> +/**
>> + *  ixgbe_mii_bus_read - Read a clause 22/45 register
>> + *  @hw: pointer to hardware structure
>> + *  @addr: address
>> + *  @regnum: register number
>> + **/
>> +static s32 ixgbe_mii_bus_read(struct mii_bus *bus, int addr, int regnum)
>> +{
>> +struct ixgbe_adapter *adapter = (struct ixgbe_adapter *)bus->priv;
>> +struct ixgbe_hw *hw = >hw;
>> +u32 hwaddr, cmd, gssr = hw->phy.phy_semaphore_mask;
>> +s32 data;
>> +
>> +if (ixgbe_is_shared_mii(hw)) {
>> +gssr |= IXGBE_GSSR_TOKEN_SM;
>> +if (hw->bus.lan_id)
>> +gssr |= IXGBE_GSSR_PHY1_SM;
>> +else
>> +gssr |= IXGBE_GSSR_PHY0_SM;
>> +}
> 
> Could you explain this shared bus a bit more.  If there is only one
> physical MDIO bus, you should only register one bus with Linux. So i
> would not normally expect to see ixgbe_is_shared_mii() sort of
> statements into the read/write functions. That tends to happen at the
> point you are registering the MDIO bus, and when looking for the MACs
> PHY on the bus.

Yep, registering multiple interfaces is wrong.  The first board I tested
against only had a single MAC enabled (they can be disabled/hidden via
straps) so it just happened to work.  Then I moved to a board with all
four interfaces enabled and found this.

The Intel C3xxx family of SoCs have up to four ixgbe MACs.  These are
structured as two devices of two functions each on fixed internal root
ports.

from lspci:

+-16.0-[05]--+-00.0
|\-00.1
+-17.0-[06]--+-00.0
|\-00.1

Each of those MACs exposes the same set of MDIO control registers as if
there are four interface, but there's really only a single bus in the
hardware for these SoCs.

How about I change:
1) ixgbe_is_shared_mii() -> ixgbe_is_x550em_a()
2) ixgbe_mii_bus_(read|write) -> ixgbe_x550em_a_mii_bus_(read|write) and
update the phy_semaphore_mask unconditionally
3) Only have ixgbe_mii_bus_init() register something for the x550em_a
devices for now, and leave handling other platforms for the future.
4) Have ixgbe_mii_bus_init() register only a single mdiobus for func 0
of the device downstream of the 00:16.0 root port, or if that's been
disabled then func 0 downstream of the 00:17.0 root port.  All of the
devices on bus 0 of the SoC are at fixed device/function locations.

Would that be acceptable?

Re: [PATCH net] sctp: kfree_rcu asoc

2018-11-30 Thread Marcelo Ricardo Leitner

On Sat, Dec 01, 2018 at 01:36:59AM +0800, Xin Long wrote:
> In sctp_hash_transport/sctp_epaddr_lookup_transport, it dereferences
> a transport's asoc under rcu_read_lock while asoc is freed not after
> a grace period, which leads to a use-after-free panic.
> 
> This patch fixes it by calling kfree_rcu to make asoc be freed after
> a grace period.
> 
> Note that only the asoc's memory is delayed to free in the patch, it
> won't cause sk to linger longer.
> 
> Thanks Neil and Marcelo to make this clear.
> 
> Fixes: 7fda702f9315 ("sctp: use new rhlist interface on sctp transport 
> rhashtable")
> Fixes: cd2b70875058 ("sctp: check duplicate node before inserting a new 
> transport")
> Reported-by: syzbot+0b05d8aa7cb185107...@syzkaller.appspotmail.com
> Reported-by: syzbot+aad231d51b1923158...@syzkaller.appspotmail.com
> Suggested-by: Neil Horman 
> Signed-off-by: Xin Long 

Phew :-)

Acked-by: Marcelo Ricardo Leitner 

> ---
>  include/net/sctp/structs.h | 2 ++
>  net/sctp/associola.c   | 2 +-
>  2 files changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
> index a11f937..feada35 100644
> --- a/include/net/sctp/structs.h
> +++ b/include/net/sctp/structs.h
> @@ -2075,6 +2075,8 @@ struct sctp_association {
>  
>   __u64 abandoned_unsent[SCTP_PR_INDEX(MAX) + 1];
>   __u64 abandoned_sent[SCTP_PR_INDEX(MAX) + 1];
> +
> + struct rcu_head rcu;
>  };
>  
>  
> diff --git a/net/sctp/associola.c b/net/sctp/associola.c
> index 6a28b96..3702f48 100644
> --- a/net/sctp/associola.c
> +++ b/net/sctp/associola.c
> @@ -434,7 +434,7 @@ static void sctp_association_destroy(struct 
> sctp_association *asoc)
>  
>   WARN_ON(atomic_read(>rmem_alloc));
>  
> - kfree(asoc);
> + kfree_rcu(asoc, rcu);
>   SCTP_DBG_OBJCNT_DEC(assoc);
>  }
>  
> -- 
> 2.1.0
>

WISHES A UNIQUE DAY.

2018-11-30 Thread Benjamin Haga

My dear I contacted you few days ago but couldn’t get your response; I
like to share very vital information with you so do get back to me for
details.

Re: [PATCH iproute2] stats output

2018-11-30 Thread Stephen Hemminger

On Fri, 30 Nov 2018 14:33:49 +0100
Alexis Vachette  wrote:

> When using:
> 
> - ip -s link
> 
> I think it should be better to print errors stats without adding -s
> option twice.
> 
> This option print stats for each network interfaces and we want to see
> if something is off and especially timers with errors.
> 
> Man page of ip command is updated accordingly.
> 
> Here is a patchset:
> 
> Signed-off-by: Alexis Vachette 
> ---
> diff --git a/ip/ipaddress.c b/ip/ipaddress.c
> index 85f05a2..c70394d 100644
> --- a/ip/ipaddress.c
> +++ b/ip/ipaddress.c
> @@ -323,7 +323,7 @@ int print_linkinfo(const struct sockaddr_nl *who,
>   fprintf(fp,"\nalias %s",
>   (const char *) RTA_DATA(tb[IFLA_IFALIAS]));
> 
> - if (do_link && tb[IFLA_STATS64] && show_stats) {
> + if (do_link && tb[IFLA_STATS64]) {
>   struct rtnl_link_stats64 slocal;
>   struct rtnl_link_stats64 *s = RTA_DATA(tb[IFLA_STATS64]);
>   if (((unsigned long)s) & (sizeof(unsigned long)-1)) {
> @@ -343,16 +343,14 @@ int print_linkinfo(const struct sockaddr_nl *who,
>   if (s->rx_compressed)
>   fprintf(fp, " %-7llu",
>   (unsigned long long)s->rx_compressed);
> - if (show_stats > 1) {
> - fprintf(fp, "%s", _SL_);
> - fprintf(fp, "RX errors: length  crc frame   fifomissed%s", 
> _SL_);
> - fprintf(fp, "   %-7llu  %-7llu %-7llu %-7llu %-7llu",
> - (unsigned long long)s->rx_length_errors,
> - (unsigned long long)s->rx_crc_errors,
> - (unsigned long long)s->rx_frame_errors,
> - (unsigned long long)s->rx_fifo_errors,
> - (unsigned long long)s->rx_missed_errors);
> - }
> + fprintf(fp, "%s", _SL_);
> + fprintf(fp, "RX errors: length  crc frame   fifomissed%s", 
> _SL_);
> + fprintf(fp, "   %-7llu  %-7llu %-7llu %-7llu %-7llu",
> + (unsigned long long)s->rx_length_errors,
> + (unsigned long long)s->rx_crc_errors,
> + (unsigned long long)s->rx_frame_errors,
> + (unsigned long long)s->rx_fifo_errors,
> + (unsigned long long)s->rx_missed_errors);
>   fprintf(fp, "%s", _SL_);
>   fprintf(fp, "TX: bytes  packets  errors  dropped carrier collsns %s%s",
>   s->tx_compressed ? "compressed" : "", _SL_);
> @@ -366,17 +364,15 @@ int print_linkinfo(const struct sockaddr_nl *who,
>   if (s->tx_compressed)
>   fprintf(fp, " %-7llu",
>   (unsigned long long)s->tx_compressed);
> - if (show_stats > 1) {
> - fprintf(fp, "%s", _SL_);
> - fprintf(fp, "TX errors: aborted fifowindow  heartbeat%s", _SL_);
> - fprintf(fp, "   %-7llu  %-7llu %-7llu %-7llu",
> - (unsigned long long)s->tx_aborted_errors,
> - (unsigned long long)s->tx_fifo_errors,
> - (unsigned long long)s->tx_window_errors,
> - (unsigned long long)s->tx_heartbeat_errors);
> - }
> + fprintf(fp, "%s", _SL_);
> + fprintf(fp, "TX errors: aborted fifowindow  heartbeat%s", _SL_);
> + fprintf(fp, "   %-7llu  %-7llu %-7llu %-7llu",
> + (unsigned long long)s->tx_aborted_errors,
> + (unsigned long long)s->tx_fifo_errors,
> + (unsigned long long)s->tx_window_errors,
> + (unsigned long long)s->tx_heartbeat_errors);
>   }
> - if (do_link && !tb[IFLA_STATS64] && tb[IFLA_STATS] && show_stats) {
> + if (do_link && !tb[IFLA_STATS64] && tb[IFLA_STATS]) {
>   struct rtnl_link_stats slocal;
>   struct rtnl_link_stats *s = RTA_DATA(tb[IFLA_STATS]);
>   if (((unsigned long)s) & (sizeof(unsigned long)-1)) {
> @@ -393,17 +389,15 @@ int print_linkinfo(const struct sockaddr_nl *who,
>   );
>   if (s->rx_compressed)
>   fprintf(fp, " %-7u", s->rx_compressed);
> - if (show_stats > 1) {
> - fprintf(fp, "%s", _SL_);
> - fprintf(fp, "RX errors: length  crc frame   fifomissed%s", 
> _SL_);
> - fprintf(fp, "   %-7u  %-7u %-7u %-7u %-7u",
> - s->rx_length_errors,
> - s->rx_crc_errors,
> - s->rx_frame_errors,
> - s->rx_fifo_errors,
> - s->rx_missed_errors
> - );
> - }
> + fprintf(fp, "%s", _SL_);
> + fprintf(fp, "RX errors: length  crc frame   fifomissed%s", 
> _SL_);
> + fprintf(fp, "   %-7u  %-7u %-7u %-7u %-7u",
> + s->rx_length_errors,
> + s->rx_crc_errors,
> + s->rx_frame_errors,
> + s->rx_fifo_errors,
> + s->rx_missed_errors
> + );
>   fprintf(fp, "%s", _SL_);
>   fprintf(fp, "TX: bytes  packets  errors  dropped carrier collsns %s%s",
>   s->tx_compressed ? "compressed" : "", _SL_);
> @@ -412,16 +406,14 @@ int print_linkinfo(const struct sockaddr_nl *who,
>   s->tx_dropped, s->tx_carrier_errors, s->collisions);
>   if (s->tx_compressed)
>   fprintf(fp, " %-7u", s->tx_compressed);
> - if (show_stats > 1) {
> - fprintf(fp, "%s", _SL_);
> - fprintf(fp, "TX errors: aborted fifowindow  heartbeat%s", _SL_);
> - fprintf(fp, "   %-7u  %-7u %-7u %-7u",
> - s->tx_aborted_errors,
> - s->tx_fifo_errors,
> - s->tx_window_errors,
> - s->tx_heartbeat_errors
> - );
> - }
> + fprintf(fp, "%s", _SL_);
> + fprintf(fp, "TX errors: aborted fifowindow  heartbeat%s", _SL_);
> + fprintf(fp, "   %-7u  %-7u %-7u %-7u",
> + s->tx_aborted_errors,
> + s->tx_fifo_errors,
> +

Re: [PATCHv2 net] sctp: check and update stream->out_curr when allocating stream_out

2018-11-30 Thread Xin Long

On Sat, Dec 1, 2018 at 12:23 AM Neil Horman  wrote:
>
> On Fri, Nov 30, 2018 at 10:48:10PM +0900, Xin Long wrote:
> > On Fri, Nov 30, 2018 at 9:21 PM Neil Horman  wrote:
> > >
> > > On Fri, Nov 30, 2018 at 03:22:39PM +0900, Xin Long wrote:
> > > > On Thu, Nov 29, 2018 at 11:39 PM Neil Horman  
> > > > wrote:
> > > > >
> > > > > On Thu, Nov 29, 2018 at 02:42:56PM +0800, Xin Long wrote:
> > > > > > Now when using stream reconfig to add out streams, stream->out
> > > > > > will get re-allocated, and all old streams' information will
> > > > > > be copied to the new ones and the old ones will be freed.
> > > > > >
> > > > > > So without stream->out_curr updated, next time when trying to
> > > > > > send from stream->out_curr stream, a panic would be caused.
> > > > > >
> > > > > > This patch is to check and update stream->out_curr when
> > > > > > allocating stream_out.
> > > > > >
> > > > > > v1->v2:
> > > > > >   - define fa_index() to get elem index from stream->out_curr.
> > > > > >
> > > > > > Fixes: 5e32a431 ("sctp: introduce stream scheduler foundations")
> > > > > > Reported-by: Ying Xu 
> > > > > > Reported-by: syzbot+e33a3a138267ca119...@syzkaller.appspotmail.com
> > > > > > Signed-off-by: Xin Long 
> > > > > > ---
> > > > > >  net/sctp/stream.c | 20 
> > > > > >  1 file changed, 20 insertions(+)
> > > > > >
> > > > > > diff --git a/net/sctp/stream.c b/net/sctp/stream.c
> > > > > > index 3892e76..30e7809 100644
> > > > > > --- a/net/sctp/stream.c
> > > > > > +++ b/net/sctp/stream.c
> > > > > > @@ -84,6 +84,19 @@ static void fa_zero(struct flex_array *fa, 
> > > > > > size_t index, size_t count)
> > > > > >   }
> > > > > >  }
> > > > > >
> > > > > > +static size_t fa_index(struct flex_array *fa, void *elem, size_t 
> > > > > > count)
> > > > > > +{
> > > > > > + size_t index = 0;
> > > > > > +
> > > > > > + while (count--) {
> > > > > > + if (elem == flex_array_get(fa, index))
> > > > > > + break;
> > > > > > + index++;
> > > > > > + }
> > > > > > +
> > > > > > + return index;
> > > > > > +}
> > > > > > +
> > > > > >  /* Migrates chunks from stream queues to new stream queues if 
> > > > > > needed,
> > > > > >   * but not across associations. Also, removes those chunks to 
> > > > > > streams
> > > > > >   * higher than the new max.
> > > > > > @@ -147,6 +160,13 @@ static int sctp_stream_alloc_out(struct 
> > > > > > sctp_stream *stream, __u16 outcnt,
> > > > > >
> > > > > >   if (stream->out) {
> > > > > >   fa_copy(out, stream->out, 0, min(outcnt, 
> > > > > > stream->outcnt));
> > > > > > + if (stream->out_curr) {
> > > > > > + size_t index = fa_index(stream->out, 
> > > > > > stream->out_curr,
> > > > > > + stream->outcnt);
> > > > > > +
> > > > > > + BUG_ON(index == stream->outcnt);
> > > > > > + stream->out_curr = flex_array_get(out, index);
> > > > > > + }
> > > > > >   fa_free(stream->out);
> > > > > >   }
> > > > > >
> > > > > > --
> > > > > > 2.1.0
> > > > > >
> > > > > >
> > > > >
> > > > > This is the sort of thing I'm talking about. Its a little more code, 
> > > > > but if you
> > > > > augment the flex_array api like this, you can preform a resize 
> > > > > operation on your
> > > > > existing flex array, and you can avoid all the copying, and need to 
> > > > > update
> > > > > pointers maintained outside the array.  Note this code isn't tested 
> > > > > at all, but
> > > > > its close to what I think should work.
> > > > >
> > > > >
> > > > > diff --git a/include/linux/flex_array.h b/include/linux/flex_array.h
> > > > > index b94fa61b51fb..7fa1f27a91b5 100644
> > > > > --- a/include/linux/flex_array.h
> > > > > +++ b/include/linux/flex_array.h
> > > > > @@ -73,6 +73,8 @@ struct flex_array {
> > > > >  struct flex_array *flex_array_alloc(int element_size, unsigned int 
> > > > > total,
> > > > > gfp_t flags);
> > > > >
> > > > > +struct flex_array *flex_array_resize(struct flex_array *fa, unsigned 
> > > > > int total, gfp_t flags);
> > > > > +
> > > > >  /**
> > > > >   * flex_array_prealloc() - Ensures that memory for the elements 
> > > > > indexed in the
> > > > >   * range defined by start and nr_elements has been allocated.
> > > > > diff --git a/lib/flex_array.c b/lib/flex_array.c
> > > > > index 2eed22fa507c..f8d54af3891b 100644
> > > > > --- a/lib/flex_array.c
> > > > > +++ b/lib/flex_array.c
> > > > > @@ -109,6 +109,7 @@ struct flex_array *flex_array_alloc(int 
> > > > > element_size, unsigned int total,
> > > > > ret->total_nr_elements = total;
> > > > > ret->elems_per_part = elems_per_part;
> > > > > ret->reciprocal_elems = reciprocal_elems;
> > > > > +   ret->elements_used = 0;
> > > > > if (elements_fit_in_base(ret) && !(flags & __GFP_ZERO))
> > > > >

Re: [PATCHv2 net] sctp: check and update stream->out_curr when allocating stream_out

2018-11-30 Thread Neil Horman

On Sat, Dec 01, 2018 at 03:53:26AM +0900, Xin Long wrote:
> On Sat, Dec 1, 2018 at 12:23 AM Neil Horman  wrote:
> >
> > On Fri, Nov 30, 2018 at 10:48:10PM +0900, Xin Long wrote:
> > > On Fri, Nov 30, 2018 at 9:21 PM Neil Horman  wrote:
> > > >
> > > > On Fri, Nov 30, 2018 at 03:22:39PM +0900, Xin Long wrote:
> > > > > On Thu, Nov 29, 2018 at 11:39 PM Neil Horman  
> > > > > wrote:
> > > > > >
> > > > > > On Thu, Nov 29, 2018 at 02:42:56PM +0800, Xin Long wrote:
> > > > > > > Now when using stream reconfig to add out streams, stream->out
> > > > > > > will get re-allocated, and all old streams' information will
> > > > > > > be copied to the new ones and the old ones will be freed.
> > > > > > >
> > > > > > > So without stream->out_curr updated, next time when trying to
> > > > > > > send from stream->out_curr stream, a panic would be caused.
> > > > > > >
> > > > > > > This patch is to check and update stream->out_curr when
> > > > > > > allocating stream_out.
> > > > > > >
> > > > > > > v1->v2:
> > > > > > >   - define fa_index() to get elem index from stream->out_curr.
> > > > > > >
> > > > > > > Fixes: 5e32a431 ("sctp: introduce stream scheduler 
> > > > > > > foundations")
> > > > > > > Reported-by: Ying Xu 
> > > > > > > Reported-by: syzbot+e33a3a138267ca119...@syzkaller.appspotmail.com
> > > > > > > Signed-off-by: Xin Long 
> > > > > > > ---
> > > > > > >  net/sctp/stream.c | 20 
> > > > > > >  1 file changed, 20 insertions(+)
> > > > > > >
> > > > > > > diff --git a/net/sctp/stream.c b/net/sctp/stream.c
> > > > > > > index 3892e76..30e7809 100644
> > > > > > > --- a/net/sctp/stream.c
> > > > > > > +++ b/net/sctp/stream.c
> > > > > > > @@ -84,6 +84,19 @@ static void fa_zero(struct flex_array *fa, 
> > > > > > > size_t index, size_t count)
> > > > > > >   }
> > > > > > >  }
> > > > > > >
> > > > > > > +static size_t fa_index(struct flex_array *fa, void *elem, size_t 
> > > > > > > count)
> > > > > > > +{
> > > > > > > + size_t index = 0;
> > > > > > > +
> > > > > > > + while (count--) {
> > > > > > > + if (elem == flex_array_get(fa, index))
> > > > > > > + break;
> > > > > > > + index++;
> > > > > > > + }
> > > > > > > +
> > > > > > > + return index;
> > > > > > > +}
> > > > > > > +
> > > > > > >  /* Migrates chunks from stream queues to new stream queues if 
> > > > > > > needed,
> > > > > > >   * but not across associations. Also, removes those chunks to 
> > > > > > > streams
> > > > > > >   * higher than the new max.
> > > > > > > @@ -147,6 +160,13 @@ static int sctp_stream_alloc_out(struct 
> > > > > > > sctp_stream *stream, __u16 outcnt,
> > > > > > >
> > > > > > >   if (stream->out) {
> > > > > > >   fa_copy(out, stream->out, 0, min(outcnt, 
> > > > > > > stream->outcnt));
> > > > > > > + if (stream->out_curr) {
> > > > > > > + size_t index = fa_index(stream->out, 
> > > > > > > stream->out_curr,
> > > > > > > + stream->outcnt);
> > > > > > > +
> > > > > > > + BUG_ON(index == stream->outcnt);
> > > > > > > + stream->out_curr = flex_array_get(out, 
> > > > > > > index);
> > > > > > > + }
> > > > > > >   fa_free(stream->out);
> > > > > > >   }
> > > > > > >
> > > > > > > --
> > > > > > > 2.1.0
> > > > > > >
> > > > > > >
> > > > > >
> > > > > > This is the sort of thing I'm talking about. Its a little more 
> > > > > > code, but if you
> > > > > > augment the flex_array api like this, you can preform a resize 
> > > > > > operation on your
> > > > > > existing flex array, and you can avoid all the copying, and need to 
> > > > > > update
> > > > > > pointers maintained outside the array.  Note this code isn't tested 
> > > > > > at all, but
> > > > > > its close to what I think should work.
> > > > > >
> > > > > >
> > > > > > diff --git a/include/linux/flex_array.h b/include/linux/flex_array.h
> > > > > > index b94fa61b51fb..7fa1f27a91b5 100644
> > > > > > --- a/include/linux/flex_array.h
> > > > > > +++ b/include/linux/flex_array.h
> > > > > > @@ -73,6 +73,8 @@ struct flex_array {
> > > > > >  struct flex_array *flex_array_alloc(int element_size, unsigned int 
> > > > > > total,
> > > > > > gfp_t flags);
> > > > > >
> > > > > > +struct flex_array *flex_array_resize(struct flex_array *fa, 
> > > > > > unsigned int total, gfp_t flags);
> > > > > > +
> > > > > >  /**
> > > > > >   * flex_array_prealloc() - Ensures that memory for the elements 
> > > > > > indexed in the
> > > > > >   * range defined by start and nr_elements has been allocated.
> > > > > > diff --git a/lib/flex_array.c b/lib/flex_array.c
> > > > > > index 2eed22fa507c..f8d54af3891b 100644
> > > > > > --- a/lib/flex_array.c
> > > > > > +++ b/lib/flex_array.c
> > > > > > @@ -109,6 +109,7 @@ struct flex_array *flex_array_alloc(int 
> > > > > > element_size, unsigned int

Re: [PATCH iproute2] stats output

2018-11-30 Thread Stephen Hemminger

On Fri, 30 Nov 2018 20:22:35 +0100
Alexis Vachette  wrote:

> Hi Stephen,
> 
> Thanks for your kind reply.
> 
> It's sad to hear that, I am aware of your concern too.
> 
> But it's not the best behavior for a network engineer.
> 
> Is it possible to add a new option or everything is stone graved ?

You can add new options at will as long as they don't conflict with existing
usage (and you update the man page). JSON and colorized output were added
in the last years.

Note: the iproute options are definitely a hodge-podge collection, the code 
doesn't
follow the typical Linux style of using getopt and getopt_long, and has a bit of
network OS model as well.

What are you thinking of.

[PATCH bpf-next] bpf: allow BPF read access to qdisc pkt_len

2018-11-30 Thread Willem de Bruijn

From: Petar Penkov 

The pkt_len field in qdisc_skb_cb stores the skb length as it will
appear on the wire after segmentation. For byte accounting, this value
is more accurate than skb->len. It is computed on entry to the TC
layer, so only valid there.

Allow read access to this field from BPF tc classifier and action
programs. The implementation is analogous to tc_classid, aside from
restricting to read access.

Signed-off-by: Petar Penkov 
Signed-off-by: Vlad Dumitrescu 
Signed-off-by: Willem de Bruijn 
---
 include/uapi/linux/bpf.h|  1 +
 net/core/filter.c   | 16 +++
 tools/include/uapi/linux/bpf.h  |  1 +
 tools/testing/selftests/bpf/test_verifier.c | 32 +
 4 files changed, 50 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 597afdbc1ab9..ce028482d423 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2483,6 +2483,7 @@ struct __sk_buff {
__u32 data_meta;
struct bpf_flow_keys *flow_keys;
__u64 tstamp;
+   __u32 pkt_len;
 };
 
 struct bpf_tunnel_key {
diff --git a/net/core/filter.c b/net/core/filter.c
index bd0df75dc7b6..af071ef22c0a 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5773,6 +5773,7 @@ static bool sk_filter_is_valid_access(int off, int size,
case bpf_ctx_range(struct __sk_buff, flow_keys):
case bpf_ctx_range_till(struct __sk_buff, family, local_port):
case bpf_ctx_range(struct __sk_buff, tstamp):
+   case bpf_ctx_range(struct __sk_buff, pkt_len):
return false;
}
 
@@ -5797,6 +5798,7 @@ static bool cg_skb_is_valid_access(int off, int size,
case bpf_ctx_range(struct __sk_buff, tc_classid):
case bpf_ctx_range(struct __sk_buff, data_meta):
case bpf_ctx_range(struct __sk_buff, flow_keys):
+   case bpf_ctx_range(struct __sk_buff, pkt_len):
return false;
case bpf_ctx_range(struct __sk_buff, data):
case bpf_ctx_range(struct __sk_buff, data_end):
@@ -5843,6 +5845,7 @@ static bool lwt_is_valid_access(int off, int size,
case bpf_ctx_range(struct __sk_buff, data_meta):
case bpf_ctx_range(struct __sk_buff, flow_keys):
case bpf_ctx_range(struct __sk_buff, tstamp):
+   case bpf_ctx_range(struct __sk_buff, pkt_len):
return false;
}
 
@@ -6273,6 +6276,7 @@ static bool sk_skb_is_valid_access(int off, int size,
case bpf_ctx_range(struct __sk_buff, data_meta):
case bpf_ctx_range(struct __sk_buff, flow_keys):
case bpf_ctx_range(struct __sk_buff, tstamp):
+   case bpf_ctx_range(struct __sk_buff, pkt_len):
return false;
}
 
@@ -6360,6 +6364,7 @@ static bool flow_dissector_is_valid_access(int off, int 
size,
case bpf_ctx_range(struct __sk_buff, data_meta):
case bpf_ctx_range_till(struct __sk_buff, family, local_port):
case bpf_ctx_range(struct __sk_buff, tstamp):
+   case bpf_ctx_range(struct __sk_buff, pkt_len):
return false;
}
 
@@ -6685,6 +6690,17 @@ static u32 bpf_convert_ctx_access(enum bpf_access_type 
type,
  bpf_target_off(struct sk_buff,
 tstamp, 8,
 target_size));
+   break;
+
+   case offsetof(struct __sk_buff, pkt_len):
+   BUILD_BUG_ON(FIELD_SIZEOF(struct qdisc_skb_cb, pkt_len) != 4);
+
+   off = si->off;
+   off -= offsetof(struct __sk_buff, pkt_len);
+   off += offsetof(struct sk_buff, cb);
+   off += offsetof(struct qdisc_skb_cb, pkt_len);
+   *target_size = 4;
+   *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg, off);
}
 
return insn - insn_buf;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 597afdbc1ab9..ce028482d423 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2483,6 +2483,7 @@ struct __sk_buff {
__u32 data_meta;
struct bpf_flow_keys *flow_keys;
__u64 tstamp;
+   __u32 pkt_len;
 };
 
 struct bpf_tunnel_key {
diff --git a/tools/testing/selftests/bpf/test_verifier.c 
b/tools/testing/selftests/bpf/test_verifier.c
index 17021d2b6bfe..0ab37fb4afad 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -13972,6 +13972,38 @@ static struct bpf_test tests[] = {
.result_unpriv = REJECT,
.result = ACCEPT,
},
+   {
+   "check pkt_len is not readable by sockets",
+   .insns = {
+   BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+   offsetof(struct __sk_buff, pkt_len)),
+   BPF_EXIT_INSN(),
+

Re: [PATCH bpf-next 1/2] bpf: add BPF_LWT_ENCAP_IP option to bpf_lwt_push_encap

2018-11-30 Thread David Ahern

On 11/28/18 6:34 PM, Peter Oskolkov wrote:
> On Wed, Nov 28, 2018 at 4:47 PM David Ahern  wrote:
>>
>> On 11/28/18 5:22 PM, Peter Oskolkov wrote:
>>> diff --git a/net/core/filter.c b/net/core/filter.c
>>> index bd0df75dc7b6..17f3c37218e5 100644
>>> --- a/net/core/filter.c
>>> +++ b/net/core/filter.c
>>> @@ -4793,6 +4793,60 @@ static int bpf_push_seg6_encap(struct sk_buff *skb, 
>>> u32 type, void *hdr, u32 len
>>>  }
>>>  #endif /* CONFIG_IPV6_SEG6_BPF */
>>>
>>> +static int bpf_push_ip_encap(struct sk_buff *skb, void *hdr, u32 len)
>>> +{
>>> + struct dst_entry *dst;
>>> + struct rtable *rt;
>>> + struct iphdr *iph;
>>> + struct net *net;
>>> + int err;
>>> +
>>> + if (skb->protocol != htons(ETH_P_IP))
>>> + return -EINVAL;  /* ETH_P_IPV6 not yet supported */
>>> +
>>> + iph = (struct iphdr *)hdr;
>>> +
>>> + if (unlikely(len < sizeof(struct iphdr) || len > 
>>> LWTUNNEL_MAX_ENCAP_HSIZE))
>>> + return -EINVAL;
>>> + if (unlikely(iph->version != 4 || iph->ihl * 4 > len))
>>> + return -EINVAL;
>>> +
>>> + if (skb->sk)
>>> + net = sock_net(skb->sk);
>>> + else {
>>> + net = dev_net(skb_dst(skb)->dev);
>>> + }
>>> + rt = ip_route_output(net, iph->daddr, 0, 0, 0);
>>
>> That is a very limited use case. e.g., oif = 0 means you are not
>> considering any kind of policy routing (e.g., VRF).
> 
> Hi David! Could you be a bit more specific re: what you would like to
> see here? Thanks!
> 

Is the encap happening on ingress or egress? Seems like the current code
does not assume either direction for lwt (BPF_PROG_TYPE_LWT_IN vs
BPF_PROG_TYPE_LWT_OUT), yet your change does - output only. Basically,
you should be filling in a flow struct and doing a proper lookup.

When the potentially custom encap header is pushed on, seems to me skb
marks should still be honored for the route lookup. If not, you should
handle that in the API.

>From there skb->dev at a minimum should be used as either iif (ingress)
or oif (egress).

The iph is already set so you have quick access to the tos.

Also, this should implement IPv6 as well before going in.

Re: consistency for statistics with XDP mode

2018-11-30 Thread David Ahern

On 11/30/18 1:30 PM, Michael S. Tsirkin wrote:
 I would like to see basic packets, bytes, and dropped counters
 tracked
 for Rx and Tx via the standard netdev counters for all devices. 
>>
>> The problem of reporting XDP_DROP in the netedev drop counter is that
>> they don't fit this counter description : "no space in linux buffers"
>> and it will be hard for the user to determine whether these drops are
>> coming from XDP or because no buffer is available, which will make it
>> impossible to estimate packet rate performance without looking at
>> ethtool stats.
>> And reporting XDP_DROP in the netdev rx packets counter is somehow
>> misleading.. since those packets never made it out of this driver.. 
>>
>>
>> And reporting XDP_DROP in the netdev rx packets counter is somehow
>> misleading.. since those packets never made it out of this driver..
> 
> I think I agree. XDP needs minimal overhead - if user wants to do
> counters then user can via maps. And in a sense XDP dropping packet
> is much like e.g. TCP dropping packet - it is not counted
> against the driver since it's not driver's fault.
> 

XDP dropping a packet is completely different.

stats are important. packets disappearing with no counters -- standard
counters visible by standard tools -- is a user nightmare. If the
agreement is for XDP drops to be in driver level (e.g., xdp_drop) that
is fine since it is still retrievable by ethtool -S (existing APIs and
existing tools).

Re: [PATCH v3] ipv4: make DSCP values works with ip rules

2018-11-30 Thread David Miller

From: Pavel Balaev 
Date: Tue, 27 Nov 2018 13:07:10 +0300

> -#define IPTOS_RT_MASK(IPTOS_TOS_MASK & ~3)
> +#define IPTOS_RT_MASK(IPTOS_DSCP_MASK & ~3)

I was hoping my original feedback would have you actually go
investigate why IPTOS_RT_MASK is defined the way that it is.

I will try again.

Please grep around for RTO_ONLINK and see how that specifically
interferes with what you are trying to do here.

I don't think your goal is achievable with how things are defined
currently.

Sorry.

Re: [PATCHv2 net] sctp: update frag_point when stream_interleave is set

2018-11-30 Thread David Miller

From: Xin Long 
Date: Tue, 27 Nov 2018 19:11:50 +0800

> sctp_assoc_update_frag_point() should be called whenever asoc->pathmtu
> changes, but we missed one place in sctp_association_init(). It would
> cause frag_point is zero when sending data.
> 
> As says in Jakub's reproducer, if sp->pathmtu is set by socketopt, the
> new asoc->pathmtu inherits it in sctp_association_init(). Later when
> transports are added and their pmtu >= asoc->pathmtu, it will never
> call sctp_assoc_update_frag_point() to set frag_point.
> 
> This patch is to fix it by updating frag_point after asoc->pathmtu is
> set as sp->pathmtu in sctp_association_init(). Note that it moved them
> after sctp_stream_init(), as stream->si needs to be set first.
> 
> Frag_point's calculation is also related with datachunk's type, so it
> needs to update frag_point when stream->si may be changed in
> sctp_process_init().
> 
> v1->v2:
>   - call sctp_assoc_update_frag_point() separately in sctp_process_init
> and sctp_association_init, per Marcelo's suggestion.
> 
> Fixes: 2f5e3c9df693 ("sctp: introduce sctp_assoc_update_frag_point")
> Reported-by: Jakub Audykowicz 
> Signed-off-by: Xin Long 

Applied and queued up for -stable back to v4.18

Re: [next] bonding: pass link-local packets to bonding master also.

2018-11-30 Thread Vincent Bernat

 ❦ 15 juillet 2018 19:12 -0700, Mahesh Bandewar :

> Commit b89f04c61efe ("bonding: deliver link-local packets with
> skb->dev set to link that packets arrived on") changed the behavior
> of how link-local-multicast packets are processed. The change in
> the behavior broke some legacy use cases where these packets are
> expected to arrive on bonding master device also.

Unfortunately, this doesn't completely restore the previous
functionality as PACKET_ORIGDEV is broken for the copy: the original
interface is lost through the call to netif_rx(). A LLDP daemon
listening to the master interface won't get the original interface like
it was able to before 4.12.

I am a bit lost of what the original patch was trying to achieve. I am
using the following test program:

#v+
#!/usr/bin/env python3

import sys
import socket
import datetime

socket.SOL_PACKET = 263
socket.ETH_P_ALL = 3
socket.PACKET_ORIGDEV = 9

interface = sys.argv[1] if len(sys.argv) > 1 else 'lag1'

s = socket.socket(socket.AF_PACKET,
  socket.SOCK_RAW,
  socket.htons(socket.ETH_P_ALL))
s.bind((interface, 0))
s.setsockopt(socket.SOL_PACKET, socket.PACKET_ORIGDEV, 1)
while True:
data, addrinfo = s.recvfrom(1500)
if addrinfo[2] == socket.PACKET_OUTGOING:
continue
print(f"{datetime.datetime.now().isoformat()}: "
  f"Received {len(data)} bytes from {addrinfo}")
#v-

If I run it with a kernel compiled with the commit before b89f04c61efe
(plus a few more cherry-pick to make it work like ea8ffc0818d8 and
72ccc471e13b), I get:

#v+
2018-11-30T22:20:40.193378: Received 221 bytes from ('eth1', 35020, 2, 1, 
b'RT3\x00\x00\x02')
2018-11-30T22:20:40.194504: Received 221 bytes from ('eth0', 35020, 2, 1, 
b'RT3\x00\x00\x01')
#v-

If I send non link-local packets, I get:

#v+
2018-11-30T22:25:57.300965: Received 98 bytes from ('eth0', 2048, 0, 1, 
b'PT3\x00\x00\x02')
#v-

I am also able to correctly receive link-local packets directly on each
interface. So, it seems everything was working as expected before
b89f04c61efe.
-- 
AWAKE! FEAR! FIRE! FOES! AWAKE!
FEAR! FIRE! FOES!
AWAKE! AWAKE!
-- J. R. R. Tolkien

Re: [PATCH bpf-next 1/4] bpf: Add BPF_F_ANY_ALIGNMENT.

2018-11-30 Thread David Miller

From: Alexei Starovoitov 
Date: Fri, 30 Nov 2018 13:58:20 -0800

> On Thu, Nov 29, 2018 at 07:32:41PM -0800, David Miller wrote:
>> +/* If BPF_F_ANY_ALIGNMENT is used in BPF_PROF_LOAD command, the
>> + * verifier will allow any alignment whatsoever.  This bypasses
>> + * what CONFIG_EFFICIENT_UNALIGNED_ACCESS would cause it to do.
> 
> I think majority of user space folks who read uapi/bpf.h have no idea
> what that kernel config does.
> Could you reword the comment here to say that this flag is only
> effective on architectures and like sparc and mips that don't
> have efficient unaligned access and ignored on x86/arm64 ?

Ok.

>> +if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) &&
>> +(attr->prog_flags & BPF_F_ANY_ALIGNMENT) &&
>> +!capable(CAP_SYS_ADMIN))
>> +return -EPERM;
> 
> I guess we don't want to add:
> if (IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) &&
> (attr->prog_flags & BPF_F_ANY_ALIGNMENT))
> return -EINVAL;
> 
> so that test_verifier.c can just unconditionally pass this flag
> on all archs ?

Right.

>> @@ -247,7 +249,11 @@ int bpf_verify_program(enum bpf_prog_type type, const 
>> struct bpf_insn *insns,
>>  attr.log_level = log_level;
>>  log_buf[0] = 0;
>>  attr.kern_version = kern_version;
>> -attr.prog_flags = strict_alignment ? BPF_F_STRICT_ALIGNMENT : 0;
>> +if (strict_alignment)
>> +prog_flags |= BPF_F_STRICT_ALIGNMENT;
>> +if (any_alignment)
>> +prog_flags |= BPF_F_ANY_ALIGNMENT;
>> +attr.prog_flags = prog_flags;
> 
> instead of adding another argument may be replace 'int strict_alignment'
> with '__u32 prog_flags' ?
> and future flags won't need tweaks to bpf_verify_program() api.

Yeah the number of arguments are getting out of control, I'll do that.

Re: [PATCHv2 bpf 1/2] bpf: Support sk lookup in netns with id 0

2018-11-30 Thread Alexei Starovoitov

On Thu, Nov 29, 2018 at 04:29:33PM -0800, Joe Stringer wrote:
> David Ahern and Nicolas Dichtel report that the handling of the netns id
> 0 is incorrect for the BPF socket lookup helpers: rather than finding
> the netns with id 0, it is resolving to the current netns. This renders
> the netns_id 0 inaccessible.
> 
> To fix this, adjust the API for the netns to treat all negative s32
> values as a lookup in the current netns, while any values with a
> positive value in the signed 32-bit integer space would result in a
> lookup for a socket in the netns corresponding to that id. As before, if
> the netns with that ID does not exist, no socket will be found.
> Furthermore, if any bits are set in the upper 32-bits, then no socket
> will be found.
> 
> Signed-off-by: Joe Stringer 
..
> +/* Current network namespace */
> +#define BPF_CURRENT_NETNS(-1L)

I was about to apply it, but then noticed that the name doesn't match
the rest of the names.
Could you rename it to BPF_F_CURRENT_NETNS ?

Also reword the commit log so it's less misleading.

thx!

Re: [PATCH net] rtnetlink: Refine sanity checks in rtnl_fdb_{add|del}

2018-11-30 Thread Eric Dumazet

On Fri, Nov 30, 2018 at 7:40 AM Eric Dumazet  wrote:
>
> On Fri, Nov 30, 2018 at 7:36 AM David Ahern  wrote:
> >
> > On 11/30/18 7:58 AM, Ido Schimmel wrote:
> > > Can you please share the reproducer (assuming it exists)? I don't really
> > > understand the fix. None of the functions you patched are in the trace.
> > > Also, looking at IPv4 GRE code, while GRE device has dev->addr_len set
> > > to 4, dev->type is set to ARPHRD_IPGRE.
> >
> > I had the same reaction ... you can not claim to be an ethernet device
> > and have a hw address that is not 6 bytes.
>
> This has been discussed a number of times TUNSETLINK can do that.
> (I have not checked what the repro does )


Oh well...ndo_dflt_fdb_dump() seems to be used on a gre device.

Re: [PATCH net] rtnetlink: Refine sanity checks in rtnl_fdb_{add|del}

2018-11-30 Thread Ido Schimmel

On Fri, Nov 30, 2018 at 07:51:34AM -0800, Eric Dumazet wrote:
> On Fri, Nov 30, 2018 at 7:46 AM Eric Dumazet  wrote:
> >
> > On Fri, Nov 30, 2018 at 7:40 AM Eric Dumazet  wrote:
> > >
> > > On Fri, Nov 30, 2018 at 7:36 AM David Ahern  wrote:
> > > >
> > > > On 11/30/18 7:58 AM, Ido Schimmel wrote:
> > > > > Can you please share the reproducer (assuming it exists)? I don't 
> > > > > really
> > > > > understand the fix. None of the functions you patched are in the 
> > > > > trace.
> > > > > Also, looking at IPv4 GRE code, while GRE device has dev->addr_len set
> > > > > to 4, dev->type is set to ARPHRD_IPGRE.
> > > >
> > > > I had the same reaction ... you can not claim to be an ethernet device
> > > > and have a hw address that is not 6 bytes.
> > >
> > > This has been discussed a number of times TUNSETLINK can do that.
> > > (I have not checked what the repro does )
> >
> >
> > Oh well...ndo_dflt_fdb_dump() seems to be used on a gre device.
> 
> What about :
> 
> diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
> index 
> 54cfbda0b58be34dcb164028f17cdde2826c857b..b4cfc139c8b05b19564e02f29bd030c5ff85b51b
> 100644
> --- a/net/core/rtnetlink.c
> +++ b/net/core/rtnetlink.c
> @@ -3800,6 +3800,9 @@ int ndo_dflt_fdb_dump(struct sk_buff *skb,
>  {
> int err;
> 
> +   if (dev->type != ARPHRD_ETHER || dev->addr_len != ETH_ALEN)
> +   return -EINVAL;

This makes more sense. I'm still compiling the kmsan kernel. I'll be
able to test the patch after compilation is done.

> +
> netif_addr_lock_bh(dev);
> err = nlmsg_populate_fdb(skb, cb, dev, idx, >uc);
> if (err)

[PATCH bpf-next 0/5] tools: bpftool: fixes and small improvements

2018-11-30 Thread Quentin Monnet

Hi,
Several items for bpftool are included in this set: the first three patches
are fixes for bpftool itself and bash completion, while the last two
slightly improve the information obtained when dumping programs or maps, on
Daniel's suggestion. Please refer to individual commit logs for more
details.

Quentin Monnet (5):
  tools: bpftool: use "/proc/self/" i.o. crafting links with getpid()
  tools: bpftool: fix bash completion for bpftool prog (attach|detach)
  tools: bpftool: fix bash completion for new map types (queue and
stack)
  tools: bpftool: mark offloaded programs more explicitly in plain
output
  tools: bpftool: add owner_prog_type and owner_jited to bpftool output

 tools/bpf/bpftool/bash-completion/bpftool | 75 ---
 tools/bpf/bpftool/common.c|  7 ++-
 tools/bpf/bpftool/jit_disasm.c| 11 +
 tools/bpf/bpftool/main.h  | 26 +++
 tools/bpf/bpftool/map.c   | 50 -
 tools/bpf/bpftool/prog.c  | 27 +--
 6 files changed, 130 insertions(+), 66 deletions(-)

-- 
2.7.4

[PATCH bpf-next 1/5] tools: bpftool: use "/proc/self/" i.o. crafting links with getpid()

2018-11-30 Thread Quentin Monnet

The getpid() function is called in a couple of places in bpftool to
craft links of the shape "/proc//...". Instead, it is possible to
use the "/proc/self/" shortcut, which makes things a bit easier, in
particular in jit_disasm.c.

Do the replacement, and remove the includes of  from the
relevant files, now we do not use getpid() anymore.

Signed-off-by: Quentin Monnet 
Reviewed-by: Jakub Kicinski 
---
 tools/bpf/bpftool/common.c |  5 ++---
 tools/bpf/bpftool/jit_disasm.c | 11 +--
 2 files changed, 3 insertions(+), 13 deletions(-)

diff --git a/tools/bpf/bpftool/common.c b/tools/bpf/bpftool/common.c
index 4e217d57118e..4349b6683ca8 100644
--- a/tools/bpf/bpftool/common.c
+++ b/tools/bpf/bpftool/common.c
@@ -48,7 +48,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 
 #include 
@@ -276,7 +275,7 @@ int get_fd_type(int fd)
char buf[512];
ssize_t n;
 
-   snprintf(path, sizeof(path), "/proc/%d/fd/%d", getpid(), fd);
+   snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
 
n = readlink(path, buf, sizeof(buf));
if (n < 0) {
@@ -304,7 +303,7 @@ char *get_fdinfo(int fd, const char *key)
ssize_t n;
FILE *fdi;
 
-   snprintf(path, sizeof(path), "/proc/%d/fdinfo/%d", getpid(), fd);
+   snprintf(path, sizeof(path), "/proc/self/fdinfo/%d", fd);
 
fdi = fopen(path, "r");
if (!fdi) {
diff --git a/tools/bpf/bpftool/jit_disasm.c b/tools/bpf/bpftool/jit_disasm.c
index b2ed5ee1af5f..545a92471c33 100644
--- a/tools/bpf/bpftool/jit_disasm.c
+++ b/tools/bpf/bpftool/jit_disasm.c
@@ -19,7 +19,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 
@@ -28,20 +27,12 @@
 
 static void get_exec_path(char *tpath, size_t size)
 {
+   const char *path = "/proc/self/exe";
ssize_t len;
-   char *path;
-
-   snprintf(tpath, size, "/proc/%d/exe", (int) getpid());
-   tpath[size - 1] = 0;
-
-   path = strdup(tpath);
-   assert(path);
 
len = readlink(path, tpath, size - 1);
assert(len > 0);
tpath[len] = 0;
-
-   free(path);
 }
 
 static int oper_count;
-- 
2.7.4

[PATCH bpf-next 3/5] tools: bpftool: fix bash completion for new map types (queue and stack)

2018-11-30 Thread Quentin Monnet

Commit 197c2dac74e4 ("bpf: Add BPF_MAP_TYPE_QUEUE and BPF_MAP_TYPE_STACK
to bpftool-map") added support for queue and stack eBPF map types in
bpftool map handling. Let's update the bash completion accordingly.

Fixes: 197c2dac74e4 ("bpf: Add BPF_MAP_TYPE_QUEUE and BPF_MAP_TYPE_STACK to 
bpftool-map")
Signed-off-by: Quentin Monnet 
---
 tools/bpf/bpftool/bash-completion/bpftool | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/bpf/bpftool/bash-completion/bpftool 
b/tools/bpf/bpftool/bash-completion/bpftool
index b7e6c4f25ad1..9a60080f085f 100644
--- a/tools/bpf/bpftool/bash-completion/bpftool
+++ b/tools/bpf/bpftool/bash-completion/bpftool
@@ -436,7 +436,7 @@ _bpftool()
 lru_percpu_hash lpm_trie array_of_maps \
 hash_of_maps devmap sockmap cpumap xskmap \
 sockhash cgroup_storage reuseport_sockarray \
-percpu_cgroup_storage' -- \
+percpu_cgroup_storage queue stack' -- \
"$cur" ) )
 return 0
 ;;
-- 
2.7.4

[PATCH net] net/ibmvnic: Fix RTNL deadlock during device reset

2018-11-30 Thread Thomas Falcon

Commit a5681e20b541 ("net/ibmnvic: Fix deadlock problem
in reset") made the change to hold the RTNL lock during
driver reset but still calls netdev_notify_peers, which
results in a deadlock. Instead, use call_netdevice_notifiers,
which is functionally the same except that it does not
take the RTNL lock again.

Fixes: a5681e20b541 ("net/ibmnvic: Fix deadlock problem in reset")

Signed-off-by: Thomas Falcon 
---
 drivers/net/ethernet/ibm/ibmvnic.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c 
b/drivers/net/ethernet/ibm/ibmvnic.c
index c0203a0..ed50b8d 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -1859,7 +1859,7 @@ static int do_reset(struct ibmvnic_adapter *adapter,
 
if (adapter->reset_reason != VNIC_RESET_FAILOVER &&
adapter->reset_reason != VNIC_RESET_CHANGE_PARAM)
-   netdev_notify_peers(netdev);
+   call_netdevice_notifiers(NETDEV_NOTIFY_PEERS, netdev);
 
netif_carrier_on(netdev);
 
-- 
1.8.3.1

Re: [PATCH net-next 1/2] ixgbe: register a mdiobus

2018-11-30 Thread Florian Fainelli




On 11/30/2018 9:34 AM, Steve Douthit wrote:
> On 11/30/18 11:34 AM, Andrew Lunn wrote:
>>> Yep, registering multiple interfaces is wrong.  The first board I tested
>>> against only had a single MAC enabled (they can be disabled/hidden via
>>> straps) so it just happened to work.
>>
>> Hi Steve
>>
>> Can you hide any/all via straps, or is 00.0 always guaranteed to
>> exist?
> 
> You can hide all the devices, but if function 1 is enabled then function
> 0 must also be enabled, so not all combinations are valid.
> 
>>> The Intel C3xxx family of SoCs have up to four ixgbe MACs.  These are
>>> structured as two devices of two functions each on fixed internal root
>>> ports.
>>>
>>> from lspci:
>>> 
>>>  +-16.0-[05]--+-00.0
>>>  |\-00.1
>>>  +-17.0-[06]--+-00.0
>>>  |\-00.1
>>> 
>>
>> Is there any other hardware resource which is shared between the MAC
>> interfaces? I'm just wondering if the driver has already solved this
>> once. Is there an EEPROM per interface for the MAC address, or one
>> shared EEPROM?
> 
> NVM is shared, it's actually the same SPI flash that holds the BIOS for
> this SoC.  Access is serialized by the swfw_sync semaphore.  I think the
> device firmware is automagically handling offset translation.
> 
> I don't think that helps for this case.
> 
> There might be a better match for shared resources, but nothing springs
> to mind.
> 
>> Ah, how about using the 'cards_found' found variable. It is not
>> perfect, in that it is not decremented in ixgb_remove(), and i wonder
>> about race conditions since there does not appear to be any lock when
>> it is incremented. But if cards_found == 0, register the MDIO bus.
> 
> 'cards_found' doesn't exist for the ixgbe driver.  I could add it and
> fix the race/decrement issues you mention, but it'd have to be a device
> type specific count.  It's still possible there are other non-x550em_a
> ixgbe devices in external PCIe slots that have different resource
> sharing setups.
> 
> It's still a lighter weight solution than poking around the parent bus
> so I'll add a 'x550em_a_devs_found' counter to v2.
> 

If the MDIO bus block is hardwired to function 0, would it be acceptable
to walk the PCI bus hierarchy from any driver's probe routine where PCI
function != 0 and see if it is claimed/enabled already, and if so, skip
registering the MDIO bus entirely?

There is also possibly a problem if you have a shared MDIO device, and
you turn off/power off the PCI function that does provide it. How can we
make sure it remains alive for other functions to use it?
-- 
Florian

Re: [PATCH net] sctp: kfree_rcu asoc

2018-11-30 Thread Neil Horman

On Sat, Dec 01, 2018 at 01:36:59AM +0800, Xin Long wrote:
> In sctp_hash_transport/sctp_epaddr_lookup_transport, it dereferences
> a transport's asoc under rcu_read_lock while asoc is freed not after
> a grace period, which leads to a use-after-free panic.
> 
> This patch fixes it by calling kfree_rcu to make asoc be freed after
> a grace period.
> 
> Note that only the asoc's memory is delayed to free in the patch, it
> won't cause sk to linger longer.
> 
> Thanks Neil and Marcelo to make this clear.
> 
Thank you for taking the extra time on this

Acked-by: Neil Horman 

> Fixes: 7fda702f9315 ("sctp: use new rhlist interface on sctp transport 
> rhashtable")
> Fixes: cd2b70875058 ("sctp: check duplicate node before inserting a new 
> transport")
> Reported-by: syzbot+0b05d8aa7cb185107...@syzkaller.appspotmail.com
> Reported-by: syzbot+aad231d51b1923158...@syzkaller.appspotmail.com
> Suggested-by: Neil Horman 
> Signed-off-by: Xin Long 
> ---
>  include/net/sctp/structs.h | 2 ++
>  net/sctp/associola.c   | 2 +-
>  2 files changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
> index a11f937..feada35 100644
> --- a/include/net/sctp/structs.h
> +++ b/include/net/sctp/structs.h
> @@ -2075,6 +2075,8 @@ struct sctp_association {
>  
>   __u64 abandoned_unsent[SCTP_PR_INDEX(MAX) + 1];
>   __u64 abandoned_sent[SCTP_PR_INDEX(MAX) + 1];
> +
> + struct rcu_head rcu;
>  };
>  
>  
> diff --git a/net/sctp/associola.c b/net/sctp/associola.c
> index 6a28b96..3702f48 100644
> --- a/net/sctp/associola.c
> +++ b/net/sctp/associola.c
> @@ -434,7 +434,7 @@ static void sctp_association_destroy(struct 
> sctp_association *asoc)
>  
>   WARN_ON(atomic_read(>rmem_alloc));
>  
> - kfree(asoc);
> + kfree_rcu(asoc, rcu);
>   SCTP_DBG_OBJCNT_DEC(assoc);
>  }
>  
> -- 
> 2.1.0
> 
>

Re: [PATCH net-next 1/2] ixgbe: register a mdiobus

2018-11-30 Thread Steve Douthit

On 11/30/18 12:43 PM, Florian Fainelli wrote:
> 
> 
> On 11/30/2018 9:34 AM, Steve Douthit wrote:
>> On 11/30/18 11:34 AM, Andrew Lunn wrote:
 Yep, registering multiple interfaces is wrong.  The first board I tested
 against only had a single MAC enabled (they can be disabled/hidden via
 straps) so it just happened to work.
>>>
>>> Hi Steve
>>>
>>> Can you hide any/all via straps, or is 00.0 always guaranteed to
>>> exist?
>>
>> You can hide all the devices, but if function 1 is enabled then function
>> 0 must also be enabled, so not all combinations are valid.
>>
 The Intel C3xxx family of SoCs have up to four ixgbe MACs.  These are
 structured as two devices of two functions each on fixed internal root
 ports.

 from lspci:
 
   +-16.0-[05]--+-00.0
   |\-00.1
   +-17.0-[06]--+-00.0
   |\-00.1
 
>>>
>>> Is there any other hardware resource which is shared between the MAC
>>> interfaces? I'm just wondering if the driver has already solved this
>>> once. Is there an EEPROM per interface for the MAC address, or one
>>> shared EEPROM?
>>
>> NVM is shared, it's actually the same SPI flash that holds the BIOS for
>> this SoC.  Access is serialized by the swfw_sync semaphore.  I think the
>> device firmware is automagically handling offset translation.
>>
>> I don't think that helps for this case.
>>
>> There might be a better match for shared resources, but nothing springs
>> to mind.
>>
>>> Ah, how about using the 'cards_found' found variable. It is not
>>> perfect, in that it is not decremented in ixgb_remove(), and i wonder
>>> about race conditions since there does not appear to be any lock when
>>> it is incremented. But if cards_found == 0, register the MDIO bus.
>>
>> 'cards_found' doesn't exist for the ixgbe driver.  I could add it and
>> fix the race/decrement issues you mention, but it'd have to be a device
>> type specific count.  It's still possible there are other non-x550em_a
>> ixgbe devices in external PCIe slots that have different resource
>> sharing setups.
>>
>> It's still a lighter weight solution than poking around the parent bus
>> so I'll add a 'x550em_a_devs_found' counter to v2.
>>
> 
> If the MDIO bus block is hardwired to function 0, would it be acceptable
> to walk the PCI bus hierarchy from any driver's probe routine where PCI
> function != 0 and see if it is claimed/enabled already, and if so, skip
> registering the MDIO bus entirely?

It's not really hardwired to any function at the hardware level AFAICT.
Unless by 'hardwired' you mean we always register it to the same devfn?
Walking the bus to find the lowest numbered devfn was my original
suggestion.  I'm good with a PCI walk or a counter to choose which MAC
registers the MDIO bus.  Let me know what the consensus is.

> There is also possibly a problem if you have a shared MDIO device, and
> you turn off/power off the PCI function that does provide it. How can we
> make sure it remains alive for other functions to use it?

I'm not sure this is an issue.  I wouldn't expect the other MACs to lose
the ability to talk to their PHYs.  Those callbacks weren't changed to
use the registered MDIO bus.

Re: [PATCHv2 net] sctp: check and update stream->out_curr when allocating stream_out

2018-11-30 Thread Neil Horman

On Sat, Dec 01, 2018 at 03:53:26AM +0900, Xin Long wrote:
> On Sat, Dec 1, 2018 at 12:23 AM Neil Horman  wrote:
> >
> > On Fri, Nov 30, 2018 at 10:48:10PM +0900, Xin Long wrote:
> > > On Fri, Nov 30, 2018 at 9:21 PM Neil Horman  wrote:
> > > >
> > > > On Fri, Nov 30, 2018 at 03:22:39PM +0900, Xin Long wrote:
> > > > > On Thu, Nov 29, 2018 at 11:39 PM Neil Horman  
> > > > > wrote:
> > > > > >
> > > > > > On Thu, Nov 29, 2018 at 02:42:56PM +0800, Xin Long wrote:
> > > > > > > Now when using stream reconfig to add out streams, stream->out
> > > > > > > will get re-allocated, and all old streams' information will
> > > > > > > be copied to the new ones and the old ones will be freed.
> > > > > > >
> > > > > > > So without stream->out_curr updated, next time when trying to
> > > > > > > send from stream->out_curr stream, a panic would be caused.
> > > > > > >
> > > > > > > This patch is to check and update stream->out_curr when
> > > > > > > allocating stream_out.
> > > > > > >
> > > > > > > v1->v2:
> > > > > > >   - define fa_index() to get elem index from stream->out_curr.
> > > > > > >
> > > > > > > Fixes: 5e32a431 ("sctp: introduce stream scheduler 
> > > > > > > foundations")
> > > > > > > Reported-by: Ying Xu 
> > > > > > > Reported-by: syzbot+e33a3a138267ca119...@syzkaller.appspotmail.com
> > > > > > > Signed-off-by: Xin Long 
> > > > > > > ---
> > > > > > >  net/sctp/stream.c | 20 
> > > > > > >  1 file changed, 20 insertions(+)
> > > > > > >
> > > > > > > diff --git a/net/sctp/stream.c b/net/sctp/stream.c
> > > > > > > index 3892e76..30e7809 100644
> > > > > > > --- a/net/sctp/stream.c
> > > > > > > +++ b/net/sctp/stream.c
> > > > > > > @@ -84,6 +84,19 @@ static void fa_zero(struct flex_array *fa, 
> > > > > > > size_t index, size_t count)
> > > > > > >   }
> > > > > > >  }
> > > > > > >
> > > > > > > +static size_t fa_index(struct flex_array *fa, void *elem, size_t 
> > > > > > > count)
> > > > > > > +{
> > > > > > > + size_t index = 0;
> > > > > > > +
> > > > > > > + while (count--) {
> > > > > > > + if (elem == flex_array_get(fa, index))
> > > > > > > + break;
> > > > > > > + index++;
> > > > > > > + }
> > > > > > > +
> > > > > > > + return index;
> > > > > > > +}
> > > > > > > +
> > > > > > >  /* Migrates chunks from stream queues to new stream queues if 
> > > > > > > needed,
> > > > > > >   * but not across associations. Also, removes those chunks to 
> > > > > > > streams
> > > > > > >   * higher than the new max.
> > > > > > > @@ -147,6 +160,13 @@ static int sctp_stream_alloc_out(struct 
> > > > > > > sctp_stream *stream, __u16 outcnt,
> > > > > > >
> > > > > > >   if (stream->out) {
> > > > > > >   fa_copy(out, stream->out, 0, min(outcnt, 
> > > > > > > stream->outcnt));
> > > > > > > + if (stream->out_curr) {
> > > > > > > + size_t index = fa_index(stream->out, 
> > > > > > > stream->out_curr,
> > > > > > > + stream->outcnt);
> > > > > > > +
> > > > > > > + BUG_ON(index == stream->outcnt);
> > > > > > > + stream->out_curr = flex_array_get(out, 
> > > > > > > index);
> > > > > > > + }
> > > > > > >   fa_free(stream->out);
> > > > > > >   }
> > > > > > >
> > > > > > > --
> > > > > > > 2.1.0
> > > > > > >
> > > > > > >
> > > > > >
> > > > > > This is the sort of thing I'm talking about. Its a little more 
> > > > > > code, but if you
> > > > > > augment the flex_array api like this, you can preform a resize 
> > > > > > operation on your
> > > > > > existing flex array, and you can avoid all the copying, and need to 
> > > > > > update
> > > > > > pointers maintained outside the array.  Note this code isn't tested 
> > > > > > at all, but
> > > > > > its close to what I think should work.
> > > > > >
> > > > > >
> > > > > > diff --git a/include/linux/flex_array.h b/include/linux/flex_array.h
> > > > > > index b94fa61b51fb..7fa1f27a91b5 100644
> > > > > > --- a/include/linux/flex_array.h
> > > > > > +++ b/include/linux/flex_array.h
> > > > > > @@ -73,6 +73,8 @@ struct flex_array {
> > > > > >  struct flex_array *flex_array_alloc(int element_size, unsigned int 
> > > > > > total,
> > > > > > gfp_t flags);
> > > > > >
> > > > > > +struct flex_array *flex_array_resize(struct flex_array *fa, 
> > > > > > unsigned int total, gfp_t flags);
> > > > > > +
> > > > > >  /**
> > > > > >   * flex_array_prealloc() - Ensures that memory for the elements 
> > > > > > indexed in the
> > > > > >   * range defined by start and nr_elements has been allocated.
> > > > > > diff --git a/lib/flex_array.c b/lib/flex_array.c
> > > > > > index 2eed22fa507c..f8d54af3891b 100644
> > > > > > --- a/lib/flex_array.c
> > > > > > +++ b/lib/flex_array.c
> > > > > > @@ -109,6 +109,7 @@ struct flex_array *flex_array_alloc(int 
> > > > > > element_size, unsigned int

Re: [PATCH iproute2] ss: add support for bytes_sent, bytes_retrans, dsack_dups and reord_seen

2018-11-30 Thread Stefano Brivio

Hi David,

On Thu, 29 Nov 2018 11:51:48 -0700
David Ahern  wrote:

> On 11/29/18 11:50 AM, Stephen Hemminger wrote:
> > PS: ss still doesn't support JSON output, given the volume of output it 
> > would be good.  
> 
> I thought Stefano was investigating it as an alternative to the 'display
> selected columns' patches.

Thanks for Cc'ing me, I missed this.

I'm not really investigating it at the moment, I'm convinced it's not
complicated, it's on my to-do list, but I'm not working on it right now
(and maybe also not in a very near future).

I would still say the "display selected columns" patches could be a
stopgap for now (even though it's not related to this one patch).

-- 
Stefano

Re: [PATCH v2 net-next] cxgb4: number of VFs supported is not always 16

2018-11-30 Thread David Miller

From: Ganesh Goudar 
Date: Tue, 27 Nov 2018 14:59:06 +0530

> Total number of VFs supported by PF is used to determine the last
> byte of VF's mac address. Number of VFs supported is not always
> 16, use the variable nvfs to get the number of VFs supported
> rather than hard coding it to 16.
> 
> Signed-off-by: Casey Leedom 
> Signed-off-by: Ganesh Goudar 
> ---
> V2: Fixes typo in commit message

Applied.

Re: [PATCH net-next 00/11] nfp: update TX path to enable repr offloads

2018-11-30 Thread David Miller

From: Jakub Kicinski 
Date: Tue, 27 Nov 2018 22:24:47 -0800

> This set starts with three micro optimizations to the TX path.
> The improvement is measurable, but below 1% of CPU utilization.
> 
> Patches 4 - 9 add basic TX offloads to representor devices, like
> checksum offload or TSO, and remove the unnecessary TX lock and
> Qdisc (our representors are software constructs on top of the PF).
> 
> The last 2 patches add more info to error messages - id of command
> which failed and exact location of incorrect TLVs, very useful for
> debugging.

Series applied, thanks Jakub.

Re: [PATCH net-next 0/2] rtnetlink: avoid a warning in rtnl_newlink()

2018-11-30 Thread David Miller

From: Jakub Kicinski 
Date: Tue, 27 Nov 2018 22:32:29 -0800

> I've been hoping for some time that someone more competent would fix
> the stack frame size warning in rtnl_newlink(), but looks like I'll
> have to take a stab at it myself :)  That's the only warning I see
> in most of my builds.
> 
> First patch refactors away a somewhat surprising if (1) code block.
> Reindentation will most likely cause cherry-pick problems but OTOH
> rtnl_newlink() doesn't seem to be changed often, so perhaps we can
> risk it in the name of cleaner code?
> 
> Second patch fixes the warning in simplest possible way.  I was
> pondering if there is any more clever solution, but I can't see it..
> rtnl_newlink() is quite long with a lot of possible execution paths
> so doing memory allocations half way through leads to very ugly
> results.

Series applied, thanks for tackling this Jakub.

That whole "if (1) {" was probably just a construct used in order
to create an inner basic block for local variables, nothing more.

Re: [PATCH bpf-next 1/4] bpf: Add BPF_F_ANY_ALIGNMENT.

2018-11-30 Thread Alexei Starovoitov

On Thu, Nov 29, 2018 at 07:32:41PM -0800, David Miller wrote:
> 
> Often we want to write tests cases that check things like bad context
> offset accesses.  And one way to do this is to use an odd offset on,
> for example, a 32-bit load.
> 
> This unfortunately triggers the alignment checks first on platforms
> that do not set CONFIG_EFFICIENT_UNALIGNED_ACCESS.  So the test
> case see the alignment failure rather than what it was testing for.
> 
> It is often not completely possible to respect the original intention
> of the test, or even test the same exact thing, while solving the
> alignment issue.
> 
> Another option could have been to check the alignment after the
> context and other validations are performed by the verifier, but
> that is a non-trivial change to the verifier.
> 
> Signed-off-by: David S. Miller 
> ---
>  include/uapi/linux/bpf.h| 10 ++
>  kernel/bpf/syscall.c|  7 ++-
>  kernel/bpf/verifier.c   |  2 ++
>  tools/include/uapi/linux/bpf.h  | 10 ++
>  tools/lib/bpf/bpf.c | 12 +---
>  tools/lib/bpf/bpf.h |  6 +++---
>  tools/testing/selftests/bpf/test_align.c|  2 +-
>  tools/testing/selftests/bpf/test_verifier.c |  1 +
>  8 files changed, 42 insertions(+), 8 deletions(-)
> 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 426b5c8..c9647ea 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -232,6 +232,16 @@ enum bpf_attach_type {
>   */
>  #define BPF_F_STRICT_ALIGNMENT   (1U << 0)
>  
> +/* If BPF_F_ANY_ALIGNMENT is used in BPF_PROF_LOAD command, the
> + * verifier will allow any alignment whatsoever.  This bypasses
> + * what CONFIG_EFFICIENT_UNALIGNED_ACCESS would cause it to do.

I think majority of user space folks who read uapi/bpf.h have no idea
what that kernel config does.
Could you reword the comment here to say that this flag is only
effective on architectures and like sparc and mips that don't
have efficient unaligned access and ignored on x86/arm64 ?

> + * It is mostly used for testing when we want to validate the
> + * context and memory access aspects of the validator, but because
> + * of an unaligned access the alignment check would trigger before
> + * the one we are interested in.
> + */
> +#define BPF_F_ANY_ALIGNMENT  (1U << 1)
> +
>  /* when bpf_ldimm64->src_reg == BPF_PSEUDO_MAP_FD, bpf_ldimm64->imm == fd */
>  #define BPF_PSEUDO_MAP_FD1
>  
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index cf5040f..cae65bb 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -1450,9 +1450,14 @@ static int bpf_prog_load(union bpf_attr *attr)
>   if (CHECK_ATTR(BPF_PROG_LOAD))
>   return -EINVAL;
>  
> - if (attr->prog_flags & ~BPF_F_STRICT_ALIGNMENT)
> + if (attr->prog_flags & ~(BPF_F_STRICT_ALIGNMENT | BPF_F_ANY_ALIGNMENT))
>   return -EINVAL;
>  
> + if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) &&
> + (attr->prog_flags & BPF_F_ANY_ALIGNMENT) &&
> + !capable(CAP_SYS_ADMIN))
> + return -EPERM;

I guess we don't want to add:
if (IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) &&
(attr->prog_flags & BPF_F_ANY_ALIGNMENT))
return -EINVAL;

so that test_verifier.c can just unconditionally pass this flag
on all archs ?
Kinda weird that this knob doesn't do anything on x86.
But I guess it's fine.

> diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
> index 03f9bcc..d4f76d2 100644
> --- a/tools/lib/bpf/bpf.c
> +++ b/tools/lib/bpf/bpf.c
> @@ -232,9 +232,11 @@ int bpf_load_program(enum bpf_prog_type type, const 
> struct bpf_insn *insns,
>  
>  int bpf_verify_program(enum bpf_prog_type type, const struct bpf_insn *insns,
>  size_t insns_cnt, int strict_alignment,
> -const char *license, __u32 kern_version,
> -char *log_buf, size_t log_buf_sz, int log_level)
> +int any_alignment, const char *license,
> +__u32 kern_version, char *log_buf, size_t log_buf_sz,
> +int log_level)
>  {
> + __u32 prog_flags = 0;
>   union bpf_attr attr;
>  
>   bzero(, sizeof(attr));
> @@ -247,7 +249,11 @@ int bpf_verify_program(enum bpf_prog_type type, const 
> struct bpf_insn *insns,
>   attr.log_level = log_level;
>   log_buf[0] = 0;
>   attr.kern_version = kern_version;
> - attr.prog_flags = strict_alignment ? BPF_F_STRICT_ALIGNMENT : 0;
> + if (strict_alignment)
> + prog_flags |= BPF_F_STRICT_ALIGNMENT;
> + if (any_alignment)
> + prog_flags |= BPF_F_ANY_ALIGNMENT;
> + attr.prog_flags = prog_flags;

instead of adding another argument may be replace 'int strict_alignment'
with '__u32 prog_flags' ?
and future flags won't need tweaks to bpf_verify_program() api.

Re: [PATCH bpf-next 0/5] tools: bpftool: fixes and small improvements

2018-11-30 Thread Alexei Starovoitov

On Fri, Nov 30, 2018 at 04:25:43PM +, Quentin Monnet wrote:
> Hi,
> Several items for bpftool are included in this set: the first three patches
> are fixes for bpftool itself and bash completion, while the last two
> slightly improve the information obtained when dumping programs or maps, on
> Daniel's suggestion. Please refer to individual commit logs for more
> details.

Applied, Thanks

Re: [PATCH net-next 1/2] ixgbe: register a mdiobus

2018-11-30 Thread Andrew Lunn

> 'cards_found' doesn't exist for the ixgbe driver.

Agh, sorry, i was looking at ixgb, not ixgbe.

 Andrew

Re: [PATCH net] net: aquantia: fix rx checksum offload bits

2018-11-30 Thread David Miller

From: Igor Russkikh 
Date: Tue, 27 Nov 2018 14:51:17 +

> From: Dmitry Bogdanov 
> 
> The last set of csum offload fixes had a leak:
> 
> Checksum enabled status bits from rx descriptor were incorrectly
> interpreted. Consequently all the other valid logic worked on zero bits.
> That caused rx checksum offloads never to trigger.
> 
> Tested by dumping rx descriptors and validating resulting csum_level.
> 
> Reported-by: Igor Russkikh 
> Signed-off-by: Dmitry Bogdanov 
> Signed-off-by: Igor Russkikh 
> Fixes: ad703c2b9127f ("net: aquantia: invalid checksumm offload 
> implementation")

Applied.

This shows that the feature added was not actually tested.

Thanks.

Re: [PATCH bpf] bpf: Fix verifier log string check for bad alignment.

2018-11-30 Thread Alexei Starovoitov

On Wed, Nov 28, 2018 at 10:33:53PM -0800, David Miller wrote:
> 
> The message got changed a lot time ago.
> 
> This was responsible for 36 test case failures on sparc64.
> 
> Fixes: f1174f77b50c ("bpf/verifier: rework value tracking")
> Signed-off-by: David S. Miller 

Applied to bpf tree. Thanks!

[PATCH net-next v4 0/3] udp msg_zerocopy

2018-11-30 Thread Willem de Bruijn

From: Willem de Bruijn 

Enable MSG_ZEROCOPY for udp sockets

Patch 1/3 is the main patch, a rework of RFC patch
  http://patchwork.ozlabs.org/patch/899630/
  more details in the patch commit message

Patch 2/3 is an optimization to remove a branch from the UDP hot path
  and refcount_inc/refcount_dec_and_test pair when zerocopy is used.
  This used to be included in the first patch in v2.

Patch 3/3 runs the already existing udp zerocopy tests
  as part of kselftest

See also recent Linux Plumbers presentation
  
https://linuxplumbersconf.org/event/2/contributions/106/attachments/104/128/willemdebruijn-lpc2018-udpgso-presentation-20181113.pdf

Changes:
  v1 -> v2
- Fixup reverse christmas tree violation
  v2 -> v3
- Split refcount avoidance optimization into separate patch
  - Fix refcount leak on error in fragmented case
(thanks to Paolo Abeni for pointing this one out!)
  - Fix refcount inc on zero
  v3 -> v4
- Move skb_zcopy_set below the only kfree_skb that might cause
  a premature uarg destroy before skb_zerocopy_put_abort
  - Move the entire skb_shinfo assignment block, to keep that
cacheline access in one place

Willem de Bruijn (3):
  udp: msg_zerocopy
  udp: elide zerocopy operation in hot path
  selftests: extend zerocopy tests to udp

 include/linux/skbuff.h  | 13 +---
 net/core/skbuff.c   | 15 ++---
 net/core/sock.c |  5 ++-
 net/ipv4/ip_output.c| 37 -
 net/ipv4/tcp.c  |  2 +-
 net/ipv6/ip6_output.c   | 37 -
 tools/testing/selftests/net/msg_zerocopy.c  |  3 +-
 tools/testing/selftests/net/msg_zerocopy.sh |  2 ++
 tools/testing/selftests/net/udpgso_bench.sh |  3 ++
 9 files changed, 90 insertions(+), 27 deletions(-)

-- 
2.20.0.rc1.387.gf8505762e3-goog

[PATCH net-next v4 1/3] udp: msg_zerocopy

2018-11-30 Thread Willem de Bruijn

From: Willem de Bruijn 

Extend zerocopy to udp sockets. Allow setting sockopt SO_ZEROCOPY and
interpret flag MSG_ZEROCOPY.

This patch was previously part of the zerocopy RFC patchsets. Zerocopy
is not effective at small MTU. With segmentation offload building
larger datagrams, the benefit of page flipping outweights the cost of
generating a completion notification.

tools/testing/selftests/net/msg_zerocopy.sh after applying follow-on
test patch and making skb_orphan_frags_rx same as skb_orphan_frags:

ipv4 udp -t 1
tx=191312 (11938 MB) txc=0 zc=n
rx=191312 (11938 MB)
ipv4 udp -z -t 1
tx=304507 (19002 MB) txc=304507 zc=y
rx=304507 (19002 MB)
ok
ipv6 udp -t 1
tx=174485 (10888 MB) txc=0 zc=n
rx=174485 (10888 MB)
ipv6 udp -z -t 1
tx=294801 (18396 MB) txc=294801 zc=y
rx=294801 (18396 MB)
ok

Changes
  v1 -> v2
- Fixup reverse christmas tree violation
  v2 -> v3
- Split refcount avoidance optimization into separate patch
  - Fix refcount leak on error in fragmented case
(thanks to Paolo Abeni for pointing this one out!)
  - Fix refcount inc on zero
  - Test sock_flag SOCK_ZEROCOPY directly in __ip_append_data.
This is needed since commit 5cf4a8532c99 ("tcp: really ignore
MSG_ZEROCOPY if no SO_ZEROCOPY") did the same for tcp.

Signed-off-by: Willem de Bruijn 
---
 include/linux/skbuff.h |  1 +
 net/core/skbuff.c  |  6 ++
 net/core/sock.c|  5 -
 net/ipv4/ip_output.c   | 23 ++-
 net/ipv6/ip6_output.c  | 23 ++-
 5 files changed, 55 insertions(+), 3 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 73902acf2b71..04f52e719571 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -485,6 +485,7 @@ void sock_zerocopy_put_abort(struct ubuf_info *uarg);
 
 void sock_zerocopy_callback(struct ubuf_info *uarg, bool success);
 
+int skb_zerocopy_iter_dgram(struct sk_buff *skb, struct msghdr *msg, int len);
 int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb,
 struct msghdr *msg, int len,
 struct ubuf_info *uarg);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 3c814565ed7c..1350901c5cb8 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1105,6 +1105,12 @@ EXPORT_SYMBOL_GPL(sock_zerocopy_put_abort);
 extern int __zerocopy_sg_from_iter(struct sock *sk, struct sk_buff *skb,
   struct iov_iter *from, size_t length);
 
+int skb_zerocopy_iter_dgram(struct sk_buff *skb, struct msghdr *msg, int len)
+{
+   return __zerocopy_sg_from_iter(skb->sk, skb, >msg_iter, len);
+}
+EXPORT_SYMBOL_GPL(skb_zerocopy_iter_dgram);
+
 int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb,
 struct msghdr *msg, int len,
 struct ubuf_info *uarg)
diff --git a/net/core/sock.c b/net/core/sock.c
index 6d7e189e3cd9..f5bb89785e47 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1018,7 +1018,10 @@ int sock_setsockopt(struct socket *sock, int level, int 
optname,
 
case SO_ZEROCOPY:
if (sk->sk_family == PF_INET || sk->sk_family == PF_INET6) {
-   if (sk->sk_protocol != IPPROTO_TCP)
+   if (!((sk->sk_type == SOCK_STREAM &&
+  sk->sk_protocol == IPPROTO_TCP) ||
+ (sk->sk_type == SOCK_DGRAM &&
+  sk->sk_protocol == IPPROTO_UDP)))
ret = -ENOTSUPP;
} else if (sk->sk_family != PF_RDS) {
ret = -ENOTSUPP;
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 5dbec21856f4..6f843aff628c 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -867,6 +867,7 @@ static int __ip_append_data(struct sock *sk,
unsigned int flags)
 {
struct inet_sock *inet = inet_sk(sk);
+   struct ubuf_info *uarg = NULL;
struct sk_buff *skb;
 
struct ip_options *opt = cork->opt;
@@ -916,6 +917,19 @@ static int __ip_append_data(struct sock *sk,
(!exthdrlen || (rt->dst.dev->features & NETIF_F_HW_ESP_TX_CSUM)))
csummode = CHECKSUM_PARTIAL;
 
+   if (flags & MSG_ZEROCOPY && length && sock_flag(sk, SOCK_ZEROCOPY)) {
+   uarg = sock_zerocopy_realloc(sk, length, skb_zcopy(skb));
+   if (!uarg)
+   return -ENOBUFS;
+   if (rt->dst.dev->features & NETIF_F_SG &&
+   csummode == CHECKSUM_PARTIAL) {
+   paged = true;
+   } else {
+   uarg->zerocopy = 0;
+   skb_zcopy_set(skb, uarg);
+   }
+   }
+
cork->length += length;
 
/* So, what's going on in the loop below?
@@ -1006,6 +1020,7 @@ static int __ip_append_data(struct

[PATCH net-next v4 3/3] selftests: extend zerocopy tests to udp

2018-11-30 Thread Willem de Bruijn

From: Willem de Bruijn 

Both msg_zerocopy and udpgso_bench have udp zerocopy variants.
Exercise these as part of the standard kselftest run.

With udp, msg_zerocopy has no control channel. Ensure that the
receiver exits after the sender by accounting for the initial
delay in starting them (in msg_zerocopy.sh).

Signed-off-by: Willem de Bruijn 
---
 tools/testing/selftests/net/msg_zerocopy.c  | 3 ++-
 tools/testing/selftests/net/msg_zerocopy.sh | 2 ++
 tools/testing/selftests/net/udpgso_bench.sh | 3 +++
 3 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/net/msg_zerocopy.c 
b/tools/testing/selftests/net/msg_zerocopy.c
index 406cc70c571d..4b02933cab8a 100644
--- a/tools/testing/selftests/net/msg_zerocopy.c
+++ b/tools/testing/selftests/net/msg_zerocopy.c
@@ -651,12 +651,13 @@ static void do_flush_datagram(int fd, int type)
 
 static void do_rx(int domain, int type, int protocol)
 {
+   const int cfg_receiver_wait_ms = 400;
uint64_t tstop;
int fd;
 
fd = do_setup_rx(domain, type, protocol);
 
-   tstop = gettimeofday_ms() + cfg_runtime_ms;
+   tstop = gettimeofday_ms() + cfg_runtime_ms + cfg_receiver_wait_ms;
do {
if (type == SOCK_STREAM)
do_flush_tcp(fd);
diff --git a/tools/testing/selftests/net/msg_zerocopy.sh 
b/tools/testing/selftests/net/msg_zerocopy.sh
index c43c6debda06..825ffec85cea 100755
--- a/tools/testing/selftests/net/msg_zerocopy.sh
+++ b/tools/testing/selftests/net/msg_zerocopy.sh
@@ -25,6 +25,8 @@ readonly path_sysctl_mem="net.core.optmem_max"
 if [[ "$#" -eq "0" ]]; then
$0 4 tcp -t 1
$0 6 tcp -t 1
+   $0 4 udp -t 1
+   $0 6 udp -t 1
echo "OK. All tests passed"
exit 0
 fi
diff --git a/tools/testing/selftests/net/udpgso_bench.sh 
b/tools/testing/selftests/net/udpgso_bench.sh
index 0f0628613f81..5670a9ffd8eb 100755
--- a/tools/testing/selftests/net/udpgso_bench.sh
+++ b/tools/testing/selftests/net/udpgso_bench.sh
@@ -35,6 +35,9 @@ run_udp() {
 
echo "udp gso"
run_in_netns ${args} -S 0
+
+   echo "udp gso zerocopy"
+   run_in_netns ${args} -S 0 -z
 }
 
 run_tcp() {
-- 
2.20.0.rc1.387.gf8505762e3-goog

[PATCH net-next v4 2/3] udp: elide zerocopy operation in hot path

2018-11-30 Thread Willem de Bruijn

From: Willem de Bruijn 

With MSG_ZEROCOPY, each skb holds a reference to a struct ubuf_info.
Release of its last reference triggers a completion notification.

The TCP stack in tcp_sendmsg_locked holds an extra ref independent of
the skbs, because it can build, send and free skbs within its loop,
possibly reaching refcount zero and freeing the ubuf_info too soon.

The UDP stack currently also takes this extra ref, but does not need
it as all skbs are sent after return from __ip(6)_append_data.

Avoid the extra refcount_inc and refcount_dec_and_test, and generally
the sock_zerocopy_put in the common path, by passing the initial
reference to the first skb.

This approach is taken instead of initializing the refcount to 0, as
that would generate error "refcount_t: increment on 0" on the
next skb_zcopy_set.

Changes
  v3 -> v4
- Move skb_zcopy_set below the only kfree_skb that might cause
  a premature uarg destroy before skb_zerocopy_put_abort
  - Move the entire skb_shinfo assignment block, to keep that
cacheline access in one place

Signed-off-by: Willem de Bruijn 
---
 include/linux/skbuff.h | 12 
 net/core/skbuff.c  |  9 +
 net/ipv4/ip_output.c   | 22 +++---
 net/ipv4/tcp.c |  2 +-
 net/ipv6/ip6_output.c  | 22 +++---
 5 files changed, 36 insertions(+), 31 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 04f52e719571..75d50ab7997c 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -481,7 +481,7 @@ static inline void sock_zerocopy_get(struct ubuf_info *uarg)
 }
 
 void sock_zerocopy_put(struct ubuf_info *uarg);
-void sock_zerocopy_put_abort(struct ubuf_info *uarg);
+void sock_zerocopy_put_abort(struct ubuf_info *uarg, bool have_uref);
 
 void sock_zerocopy_callback(struct ubuf_info *uarg, bool success);
 
@@ -1326,10 +1326,14 @@ static inline struct ubuf_info *skb_zcopy(struct 
sk_buff *skb)
return is_zcopy ? skb_uarg(skb) : NULL;
 }
 
-static inline void skb_zcopy_set(struct sk_buff *skb, struct ubuf_info *uarg)
+static inline void skb_zcopy_set(struct sk_buff *skb, struct ubuf_info *uarg,
+bool *have_ref)
 {
if (skb && uarg && !skb_zcopy(skb)) {
-   sock_zerocopy_get(uarg);
+   if (unlikely(have_ref && *have_ref))
+   *have_ref = false;
+   else
+   sock_zerocopy_get(uarg);
skb_shinfo(skb)->destructor_arg = uarg;
skb_shinfo(skb)->tx_flags |= SKBTX_ZEROCOPY_FRAG;
}
@@ -1374,7 +1378,7 @@ static inline void skb_zcopy_abort(struct sk_buff *skb)
struct ubuf_info *uarg = skb_zcopy(skb);
 
if (uarg) {
-   sock_zerocopy_put_abort(uarg);
+   sock_zerocopy_put_abort(uarg, false);
skb_shinfo(skb)->tx_flags &= ~SKBTX_ZEROCOPY_FRAG;
}
 }
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 1350901c5cb8..c78ce114537e 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1089,7 +1089,7 @@ void sock_zerocopy_put(struct ubuf_info *uarg)
 }
 EXPORT_SYMBOL_GPL(sock_zerocopy_put);
 
-void sock_zerocopy_put_abort(struct ubuf_info *uarg)
+void sock_zerocopy_put_abort(struct ubuf_info *uarg, bool have_uref)
 {
if (uarg) {
struct sock *sk = skb_from_uarg(uarg)->sk;
@@ -1097,7 +1097,8 @@ void sock_zerocopy_put_abort(struct ubuf_info *uarg)
atomic_dec(>sk_zckey);
uarg->len--;
 
-   sock_zerocopy_put(uarg);
+   if (have_uref)
+   sock_zerocopy_put(uarg);
}
 }
 EXPORT_SYMBOL_GPL(sock_zerocopy_put_abort);
@@ -1137,7 +1138,7 @@ int skb_zerocopy_iter_stream(struct sock *sk, struct 
sk_buff *skb,
return err;
}
 
-   skb_zcopy_set(skb, uarg);
+   skb_zcopy_set(skb, uarg, NULL);
return skb->len - orig_len;
 }
 EXPORT_SYMBOL_GPL(skb_zerocopy_iter_stream);
@@ -1157,7 +1158,7 @@ static int skb_zerocopy_clone(struct sk_buff *nskb, 
struct sk_buff *orig,
if (skb_copy_ubufs(nskb, GFP_ATOMIC))
return -EIO;
}
-   skb_zcopy_set(nskb, skb_uarg(orig));
+   skb_zcopy_set(nskb, skb_uarg(orig), NULL);
}
return 0;
 }
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 6f843aff628c..78f028bdad30 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -881,8 +881,8 @@ static int __ip_append_data(struct sock *sk,
int csummode = CHECKSUM_NONE;
struct rtable *rt = (struct rtable *)cork->dst;
unsigned int wmem_alloc_delta = 0;
+   bool paged, extra_uref;
u32 tskey = 0;
-   bool paged;
 
skb = skb_peek_tail(queue);
 
@@ -921,12 +921,13 @@ static int __ip_append_data(struct sock *sk,
uarg = sock_zerocopy_realloc(sk, length, skb_zcopy(skb));
if

Re: consistency for statistics with XDP mode

2018-11-30 Thread Michael S. Tsirkin

On Fri, Nov 30, 2018 at 08:10:58PM +, Saeed Mahameed wrote:
> On Thu, 2018-11-22 at 18:00 +0100, Toke Høiland-Jørgensen wrote:
> > David Ahern  writes:
> > 
> > > On 11/22/18 1:26 AM, Toke Høiland-Jørgensen wrote:
> > > > Saeed Mahameed  writes:
> > > > 
> > > > > > > I'd say it sounds reasonable to include XDP in the normal
> > > > > > > traffic
> > > > > > > counters, but having the detailed XDP-specific counters is
> > > > > > > quite
> > > > > > > useful
> > > > > > > as well... So can't we do both (for all drivers)?
> > > > > > > 
> > > > > 
> > > > > What are you thinking ? 
> > > > > reporting XDP_DROP in interface dropped counter ?
> > > > > and XDP_TX/REDIRECT in the TX counter ?
> > > > > XDP_ABORTED in the  err/drop counter ?
> > > > > 
> > > > > how about having a special XDP command in the .ndo_bpf that
> > > > > would query
> > > > > the standardized XDP stats ?
> > > > the XDP-specific stats are useful to have separately as well :)
> > > > 
> > > 
> > > I would like to see basic packets, bytes, and dropped counters
> > > tracked
> > > for Rx and Tx via the standard netdev counters for all devices. 
> 
> The problem of reporting XDP_DROP in the netedev drop counter is that
> they don't fit this counter description : "no space in linux buffers"
> and it will be hard for the user to determine whether these drops are
> coming from XDP or because no buffer is available, which will make it
> impossible to estimate packet rate performance without looking at
> ethtool stats.
> And reporting XDP_DROP in the netdev rx packets counter is somehow
> misleading.. since those packets never made it out of this driver.. 
> 
> 
> And reporting XDP_DROP in the netdev rx packets counter is somehow
> misleading.. since those packets never made it out of this driver..

I think I agree. XDP needs minimal overhead - if user wants to do
counters then user can via maps. And in a sense XDP dropping packet
is much like e.g. TCP dropping packet - it is not counted
against the driver since it's not driver's fault.


> > > for ease in accounting as well as speed and simplicity for bumping
> > > counters for virtual devices from bpf helpers.
> > > 
> > > From there, the XDP ones can be in the driver private stats as they
> > > are
> > > currently but with some consistency across drivers for redirects,
> > > drops,
> > > any thing else.
> > > 
> > > So not a radical departure from where we are today, just getting
> > > the
> > > agreement for consistency and driver owners to make the changes.
> > 
> > Sounds good to me :)
> > 
> > -Toke

Re: [PATCH net 2/2] nfp: flower: prevent offload if rhashtable insert fails

2018-11-30 Thread David Miller

From: Jakub Kicinski 
Date: Tue, 27 Nov 2018 14:04:12 -0800

> From: John Hurley 
> 
> For flow offload adds, if the rhash insert code fails, the flow will still
> have been offloaded but the reference to it in the driver freed.
> 
> Re-order the offload setup calls to ensure that a flow will only be written
> to FW if a kernel reference is held and stored in the rhashtable. Remove
> this hashtable entry if the offload fails.
> 
> Fixes: c01d0efa5136 ("nfp: flower: use rhashtable for flow caching")
> Signed-off-by: John Hurley 
> Reviewed-by: Pieter Jansen van Vuuren 
> Reviewed-by: Jakub Kicinski 
> ---
> Merge note: there will be a slight merge conflict with net-next
> here:
>  - the first argument of nfp_flower_xmit_flow() changed from 'netdev'
>to 'app';
>  - we only bump the port offload cnt if (port).
> 
> FWIW the net-next version of the patch can be found at:
> 
>  git://git.kernel.org/pub/scm/linux/kernel/git/kuba/linux.git
> 
>  17ed95873d51 nfp: flower: prevent offload if rhashtable insert fails
>  d857fc8f472b nfp: flower: release metadata on offload failure
> 
> CC: Stephen Rothwell  # for linux-next

Applied.

Re: [PATCH net 1/2] nfp: flower: release metadata on offload failure

2018-11-30 Thread David Miller

From: Jakub Kicinski 
Date: Tue, 27 Nov 2018 14:04:11 -0800

> From: John Hurley 
> 
> Calling nfp_compile_flow_metadata both assigns a stats context and
> increments a ref counter on (or allocates) a mask id table entry. These
> are released by the nfp_modify_flow_metadata call on flow deletion,
> however, if a flow add fails after metadata is set then the flow entry
> will be deleted but the metadata assignments leaked.
> 
> Add an error path to the flow add offload function to ensure allocated
> metadata is released in the event of an offload fail.
> 
> Fixes: 81f3ddf2547d ("nfp: add control message passing capabilities to flower 
> offloads")
> Signed-off-by: John Hurley 
> Reviewed-by: Pieter Jansen van Vuuren 
> Reviewed-by: Jakub Kicinski 

Applied.

Re: [PATCH v4] net: Add trace events for all receive exit points

2018-11-30 Thread David Miller

From: Geneviève Bastien 
Date: Tue, 27 Nov 2018 12:52:39 -0500

> Trace events are already present for the receive entry points, to indicate
> how the reception entered the stack.
> 
> This patch adds the corresponding exit trace events that will bound the
> reception such that all events occurring between the entry and the exit
> can be considered as part of the reception context. This greatly helps
> for dependency and root cause analyses.
> 
> Without this, it is not possible with tracepoint instrumentation to
> determine whether a sched_wakeup event following a netif_receive_skb
> event is the result of the packet reception or a simple coincidence after
> further processing by the thread. It is possible using other mechanisms
> like kretprobes, but considering the "entry" points are already present,
> it would be good to add the matching exit events.
> 
> In addition to linking packets with wakeups, the entry/exit event pair
> can also be used to perform network stack latency analyses.
> 
> Signed-off-by: Geneviève Bastien 
> CC: Mathieu Desnoyers 
> CC: Steven Rostedt 
> CC: Ingo Molnar 
> CC: David S. Miller 
> Reviewed-by: Steven Rostedt (VMware)  (tracing
> side)

Applied to net-next, thanks for sticking with this.

Re: [PATCH 2/2] net: phy: ensure autoneg is configured when resuming a phydev

2018-11-30 Thread Heiner Kallweit

On 30.11.2018 19:45, Anssi Hannula wrote:
> When a PHY_HALTED phydev is resumed by phy_start(), it is set to
> PHY_RESUMING to wait for the link to come up.
> 
> However, if the phydev was put to PHY_HALTED (by e.g. phy_stop()) before
> autonegotiation was ever started by phy_state_machine(), autonegotiation
> remains unconfigured, i.e. phy_config_aneg() is never called.
> 
phy_stop() should only be called once the PHY was started. But it's
right that we don't have full control over it because misbehaving
drivers may call phy_stop() even if the PHY hasn't been started yet.

I think phy_error() and phy_stop() should be extended to set the state
to PHY_HALTED only if the PHY is in a started state, means:
don't touch the state if it's DOWN, READY, or HALTED.
In case of phy_error() it's more a precaution, because it's used within
phylib only and AFAICS it can be triggered from a started state only.

So yes, there is a theoretical issue. But as you wrote already, I think
there's a better way to deal with it.

For checking whether PHY is in a started state you could use the helper
which is being discussed here:
https://marc.info/?l=linux-netdev=154360368104946=2


> Fix that by going to PHY_UP instead of PHY_RESUMING if autonegotiation
> has never been configured.
> 
> Signed-off-by: Anssi Hannula 
> ---
> 
> This doesn't feel as clean is I'd like to, though. Maybe there is a
> better way?
> 
> 
>  drivers/net/phy/phy.c | 11 ++-
>  include/linux/phy.h   |  2 ++
>  2 files changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
> index d46858694db9..7a650cce7599 100644
> --- a/drivers/net/phy/phy.c
> +++ b/drivers/net/phy/phy.c
> @@ -553,6 +553,8 @@ int phy_start_aneg(struct phy_device *phydev)
>   if (err < 0)
>   goto out_unlock;
>  
> + phydev->autoneg_configured = 1;
> +
>   if (phydev->state != PHY_HALTED) {
>   if (AUTONEG_ENABLE == phydev->autoneg) {
>   err = phy_check_link_status(phydev);
> @@ -893,7 +895,14 @@ void phy_start(struct phy_device *phydev)
>   break;
>   }
>  
> - phydev->state = PHY_RESUMING;
> + /* if autoneg/forcing was never configured, go back to PHY_UP
> +  * to make that happen
> +  */
> + if (!phydev->autoneg_configured)
> + phydev->state = PHY_UP;
> + else
> + phydev->state = PHY_RESUMING;
> +
>   break;
>   default:
>   break;
> diff --git a/include/linux/phy.h b/include/linux/phy.h
> index 8f927246acdb..b306d5ed9d09 100644
> --- a/include/linux/phy.h
> +++ b/include/linux/phy.h
> @@ -389,6 +389,8 @@ struct phy_device {
>   unsigned loopback_enabled:1;
>  
>   unsigned autoneg:1;
> + /* autoneg has been configured at least once */
> + unsigned autoneg_configured:1;
>   /* The most recently read link state */
>   unsigned link:1;
>  
>

Re: consistency for statistics with XDP mode

2018-11-30 Thread Saeed Mahameed

On Thu, 2018-11-22 at 18:00 +0100, Toke Høiland-Jørgensen wrote:
> David Ahern  writes:
> 
> > On 11/22/18 1:26 AM, Toke Høiland-Jørgensen wrote:
> > > Saeed Mahameed  writes:
> > > 
> > > > > > I'd say it sounds reasonable to include XDP in the normal
> > > > > > traffic
> > > > > > counters, but having the detailed XDP-specific counters is
> > > > > > quite
> > > > > > useful
> > > > > > as well... So can't we do both (for all drivers)?
> > > > > > 
> > > > 
> > > > What are you thinking ? 
> > > > reporting XDP_DROP in interface dropped counter ?
> > > > and XDP_TX/REDIRECT in the TX counter ?
> > > > XDP_ABORTED in the  err/drop counter ?
> > > > 
> > > > how about having a special XDP command in the .ndo_bpf that
> > > > would query
> > > > the standardized XDP stats ?
> > > the XDP-specific stats are useful to have separately as well :)
> > > 
> > 
> > I would like to see basic packets, bytes, and dropped counters
> > tracked
> > for Rx and Tx via the standard netdev counters for all devices. 

The problem of reporting XDP_DROP in the netedev drop counter is that
they don't fit this counter description : "no space in linux buffers"
and it will be hard for the user to determine whether these drops are
coming from XDP or because no buffer is available, which will make it
impossible to estimate packet rate performance without looking at
ethtool stats.
And reporting XDP_DROP in the netdev rx packets counter is somehow
misleading.. since those packets never made it out of this driver.. 


And reporting XDP_DROP in the netdev rx packets counter is somehow
misleading.. since those packets never made it out of this driver..
> > for ease in accounting as well as speed and simplicity for bumping
> > counters for virtual devices from bpf helpers.
> > 
> > From there, the XDP ones can be in the driver private stats as they
> > are
> > currently but with some consistency across drivers for redirects,
> > drops,
> > any thing else.
> > 
> > So not a radical departure from where we are today, just getting
> > the
> > agreement for consistency and driver owners to make the changes.
> 
> Sounds good to me :)
> 
> -Toke

Re: [Patch net-next 2/2] net: dump whole skb data in netdev_rx_csum_fault()

2018-11-30 Thread Saeed Mahameed

On Thu, 2018-11-22 at 17:45 -0800, Cong Wang wrote:
> On Wed, Nov 21, 2018 at 11:33 AM Saeed Mahameed 
> wrote:
> > On Wed, 2018-11-21 at 10:26 -0800, Eric Dumazet wrote:
> > > On Wed, Nov 21, 2018 at 10:17 AM Cong Wang <
> > > xiyou.wangc...@gmail.com>
> > > wrote:
> > > > On Wed, Nov 21, 2018 at 5:05 AM Eric Dumazet <
> > > > eric.duma...@gmail.com> wrote:
> > > > > 
> > > > > On 11/20/2018 06:13 PM, Cong Wang wrote:
> > > > > > Currently, we only dump a few selected skb fields in
> > > > > > netdev_rx_csum_fault(). It is not suffient for debugging
> > > > > > checksum
> > > > > > fault. This patch introduces skb_dump() which dumps skb mac
> > > > > > header,
> > > > > > network header and its whole skb->data too.
> > > > > > 
> > > > > > Cc: Herbert Xu 
> > > > > > Cc: Eric Dumazet 
> > > > > > Cc: David Miller 
> > > > > > Signed-off-by: Cong Wang 
> > > > > > ---
> > > > > > + print_hex_dump(level, "skb data: ",
> > > > > > DUMP_PREFIX_OFFSET,
> > > > > > 16, 1,
> > > > > > +skb->data, skb->len, false);
> > > > > 
> > > > > As I mentioned to David, we want all the bytes that were
> > > > > maybe
> > > > > already pulled
> > > > > 
> > > > > (skb->head starting point, not skb->data)
> > > > 
> > > > Hmm, with mac header and network header, it is effectively from
> > > > skb->head, no?
> > > > Is there anything between skb->head and mac header?
> > > 
> > > Oh, I guess we wanted a single hex dump, or we need some user
> > > program
> > > to be able to
> > > rebuild from different memory zones the original
> > > CHECKSUM_COMPLETE
> > > value.
> > > 
> > 
> > Normally the driver keeps some headroom @skb->head, so the actual
> > mac
> > header starts @ skb->head + driver_specific_headroom
> 
> Good to know, but this headroom isn't covered by skb->csum, so
> not useful here, right? The skb->csum for mlx5 only covers network
> header and its payload.

correct

Re: [PATCH net-next] net/flow_dissector: correct comments on enum flow_dissector_key_id

2018-11-30 Thread David Miller

From: Edward Cree 
Date: Tue, 27 Nov 2018 15:40:59 +

> There are no such structs flow_dissector_key_flow_vlan or
>  flow_dissector_key_flow_tags, the actual structs used are struct
>  flow_dissector_key_vlan and struct flow_dissector_key_tags.  So correct the
>  comments against FLOW_DISSECTOR_KEY_VLAN, FLOW_DISSECTOR_KEY_FLOW_LABEL and
>  FLOW_DISSECTOR_KEY_CVLAN to refer to those.
> 
> Signed-off-by: Edward Cree 

Applied, thanks.

Re: [PATCH v4 3/4] libbpf: add bpf_prog_test_run_xattr

2018-11-30 Thread Alexei Starovoitov

On Wed, Nov 28, 2018 at 04:53:11PM +, Lorenz Bauer wrote:
> Add a new function, which encourages safe usage of the test interface.
> bpf_prog_test_run continues to work as before, but should be considered
> unsafe.
> 
> Signed-off-by: Lorenz Bauer 
..
> +
> +LIBBPF_API int bpf_prog_test_run_xattr(struct bpf_prog_test_run_attr 
> *test_attr);

it fails to be build due to:
  LINK tools/testing/selftests/bpf/test_libbpf
(117) does NOT match with num of versioned symbols in 
testing/selftests/bpf/libbpf.so (116).
Please make sure all LIBBPF_API symbols are versioned in libbpf.map.

Please fixup and respin.
The rest looks good.

Re: [PATCH v2 net] bonding: fix 802.3ad state sent to partner when unbinding slave

2018-11-30 Thread David Miller

From: Toni Peltonen 
Date: Tue, 27 Nov 2018 16:56:57 +0200

> Previously when unbinding a slave the 802.3ad implementation only told
> partner that the port is not suitable for aggregation by setting the port
> aggregation state from aggregatable to individual. This is not enough. If the
> physical layer still stays up and we only unbinded this port from the bond 
> there
> is nothing in the aggregation status alone to prevent the partner from sending
> traffic towards us. To ensure that the partner doesn't consider this
> port at all anymore we should also disable collecting and distributing to
> signal that this actor is going away. Also clear AD_STATE_SYNCHRONIZATION to
> ensure partner exits collecting + distributing state.
> 
> I have tested this behaviour againts Arista EOS switches with mlx5 cards
> (physical link stays up even when interface is down) and simulated
> the same situation virtually Linux <-> Linux with two network namespaces
> running two veth device pairs. In both cases setting aggregation to
> individual doesn't alone prevent traffic from being to sent towards this
> port given that the link stays up in partners end. Partner still keeps
> it's end in collecting + distributing state and continues until timeout is
> reached. In most cases this means we are losing the traffic partner sends
> towards our port while we wait for timeout. This is most visible with slow
> periodic time (LACP rate slow).
> 
> Other open source implementations like Open VSwitch and libreswitch, and
> vendor implementations like Arista EOS, seem to disable collecting +
> distributing to when doing similar port disabling/detaching/removing change.
> With this patch kernel implementation would behave the same way and ensure
> partner doesn't consider our actor viable anymore.
> 
> Signed-off-by: Toni Peltonen 
> ---
> v2 changes:
> * Fix typo in commit message
> * Also clear AD_STATE_SYNCHRONIZATION

Applied, thank you.

Re: [PATCH bpf-next] bpf: allow BPF read access to qdisc pkt_len

2018-11-30 Thread Song Liu

On Fri, Nov 30, 2018 at 12:09 PM Willem de Bruijn
 wrote:
>
> From: Petar Penkov 
>
> The pkt_len field in qdisc_skb_cb stores the skb length as it will
> appear on the wire after segmentation. For byte accounting, this value
> is more accurate than skb->len. It is computed on entry to the TC
> layer, so only valid there.
>
> Allow read access to this field from BPF tc classifier and action
> programs. The implementation is analogous to tc_classid, aside from
> restricting to read access.
>
> Signed-off-by: Petar Penkov 
> Signed-off-by: Vlad Dumitrescu 
> Signed-off-by: Willem de Bruijn 
> ---
>  include/uapi/linux/bpf.h|  1 +
>  net/core/filter.c   | 16 +++
>  tools/include/uapi/linux/bpf.h  |  1 +
>  tools/testing/selftests/bpf/test_verifier.c | 32 +
>  4 files changed, 50 insertions(+)

Please split this into 3 patches:
1 for include/uapi/linux/bpf.h and filter.c
1 for tools/include/uapi/linux/bpf.h
1 for tools/testing/selftests/bpf/test_verifier.c

Other than this
Acked-by: Song Liu 


>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 597afdbc1ab9..ce028482d423 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -2483,6 +2483,7 @@ struct __sk_buff {
> __u32 data_meta;
> struct bpf_flow_keys *flow_keys;
> __u64 tstamp;
> +   __u32 pkt_len;
>  };
>
>  struct bpf_tunnel_key {
> diff --git a/net/core/filter.c b/net/core/filter.c
> index bd0df75dc7b6..af071ef22c0a 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -5773,6 +5773,7 @@ static bool sk_filter_is_valid_access(int off, int size,
> case bpf_ctx_range(struct __sk_buff, flow_keys):
> case bpf_ctx_range_till(struct __sk_buff, family, local_port):
> case bpf_ctx_range(struct __sk_buff, tstamp):
> +   case bpf_ctx_range(struct __sk_buff, pkt_len):
> return false;
> }
>
> @@ -5797,6 +5798,7 @@ static bool cg_skb_is_valid_access(int off, int size,
> case bpf_ctx_range(struct __sk_buff, tc_classid):
> case bpf_ctx_range(struct __sk_buff, data_meta):
> case bpf_ctx_range(struct __sk_buff, flow_keys):
> +   case bpf_ctx_range(struct __sk_buff, pkt_len):
> return false;
> case bpf_ctx_range(struct __sk_buff, data):
> case bpf_ctx_range(struct __sk_buff, data_end):
> @@ -5843,6 +5845,7 @@ static bool lwt_is_valid_access(int off, int size,
> case bpf_ctx_range(struct __sk_buff, data_meta):
> case bpf_ctx_range(struct __sk_buff, flow_keys):
> case bpf_ctx_range(struct __sk_buff, tstamp):
> +   case bpf_ctx_range(struct __sk_buff, pkt_len):
> return false;
> }
>
> @@ -6273,6 +6276,7 @@ static bool sk_skb_is_valid_access(int off, int size,
> case bpf_ctx_range(struct __sk_buff, data_meta):
> case bpf_ctx_range(struct __sk_buff, flow_keys):
> case bpf_ctx_range(struct __sk_buff, tstamp):
> +   case bpf_ctx_range(struct __sk_buff, pkt_len):
> return false;
> }
>
> @@ -6360,6 +6364,7 @@ static bool flow_dissector_is_valid_access(int off, int 
> size,
> case bpf_ctx_range(struct __sk_buff, data_meta):
> case bpf_ctx_range_till(struct __sk_buff, family, local_port):
> case bpf_ctx_range(struct __sk_buff, tstamp):
> +   case bpf_ctx_range(struct __sk_buff, pkt_len):
> return false;
> }
>
> @@ -6685,6 +6690,17 @@ static u32 bpf_convert_ctx_access(enum bpf_access_type 
> type,
>   bpf_target_off(struct sk_buff,
>  tstamp, 8,
>  target_size));
> +   break;
> +
> +   case offsetof(struct __sk_buff, pkt_len):
> +   BUILD_BUG_ON(FIELD_SIZEOF(struct qdisc_skb_cb, pkt_len) != 4);
> +
> +   off = si->off;
> +   off -= offsetof(struct __sk_buff, pkt_len);
> +   off += offsetof(struct sk_buff, cb);
> +   off += offsetof(struct qdisc_skb_cb, pkt_len);
> +   *target_size = 4;
> +   *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg, off);
> }
>
> return insn - insn_buf;
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 597afdbc1ab9..ce028482d423 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -2483,6 +2483,7 @@ struct __sk_buff {
> __u32 data_meta;
> struct bpf_flow_keys *flow_keys;
> __u64 tstamp;
> +   __u32 pkt_len;
>  };
>
>  struct bpf_tunnel_key {
> diff --git a/tools/testing/selftests/bpf/test_verifier.c 
> b/tools/testing/selftests/bpf/test_verifier.c
> index 17021d2b6bfe..0ab37fb4afad 100644
> --- a/tools/testing/selftests/bpf/test_verifier.c
> +++

Re: [PATCH net] bpf: uninitialized variables in test code

2018-11-30 Thread Song Liu

On Thu, Nov 29, 2018 at 2:28 AM Dan Carpenter  wrote:
>
> Smatch complains that if bpf_test_run() fails with -ENOMEM at the
> begining then the "duration" is uninitialized.  We then copy the
> unintialized variables to the user inside the bpf_test_finish()
> function.  The functions require CAP_SYS_ADMIN so it's not really an
> information leak.
>
> Fixes: 1cf1cae963c2 ("bpf: introduce BPF_PROG_TEST_RUN command")
> Signed-off-by: Dan Carpenter 
Acked-by: Song Liu 

> ---
>  net/bpf/test_run.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
> index c89c22c49015..49304192a031 100644
> --- a/net/bpf/test_run.c
> +++ b/net/bpf/test_run.c
> @@ -114,7 +114,7 @@ int bpf_prog_test_run_skb(struct bpf_prog *prog, const 
> union bpf_attr *kattr,
> bool is_l2 = false, is_direct_pkt_access = false;
> u32 size = kattr->test.data_size_in;
> u32 repeat = kattr->test.repeat;
> -   u32 retval, duration;
> +   u32 retval, duration = 0;
> int hh_len = ETH_HLEN;
> struct sk_buff *skb;
> struct sock *sk;
> @@ -196,7 +196,7 @@ int bpf_prog_test_run_xdp(struct bpf_prog *prog, const 
> union bpf_attr *kattr,
> u32 repeat = kattr->test.repeat;
> struct netdev_rx_queue *rxqueue;
> struct xdp_buff xdp = {};
> -   u32 retval, duration;
> +   u32 retval, duration = 0;
> void *data;
> int ret;
>
> --
> 2.11.0
>

Re: [PATCHv2 bpf 1/2] bpf: Support sk lookup in netns with id 0

2018-11-30 Thread Joe Stringer

On Fri, 30 Nov 2018 at 15:27, Alexei Starovoitov
 wrote:
>
> On Fri, Nov 30, 2018 at 03:18:25PM -0800, Joe Stringer wrote:
> > On Fri, 30 Nov 2018 at 14:42, Alexei Starovoitov
> >  wrote:
> > >
> > > On Thu, Nov 29, 2018 at 04:29:33PM -0800, Joe Stringer wrote:
> > > > David Ahern and Nicolas Dichtel report that the handling of the netns id
> > > > 0 is incorrect for the BPF socket lookup helpers: rather than finding
> > > > the netns with id 0, it is resolving to the current netns. This renders
> > > > the netns_id 0 inaccessible.
> > > >
> > > > To fix this, adjust the API for the netns to treat all negative s32
> > > > values as a lookup in the current netns, while any values with a
> > > > positive value in the signed 32-bit integer space would result in a
> > > > lookup for a socket in the netns corresponding to that id. As before, if
> > > > the netns with that ID does not exist, no socket will be found.
> > > > Furthermore, if any bits are set in the upper 32-bits, then no socket
> > > > will be found.
> > > >
> > > > Signed-off-by: Joe Stringer 
> > > ..
> > > > +/* Current network namespace */
> > > > +#define BPF_CURRENT_NETNS(-1L)
> > >
> > > I was about to apply it, but then noticed that the name doesn't match
> > > the rest of the names.
> > > Could you rename it to BPF_F_CURRENT_NETNS ?
> >
> > I skipped the F_ part since it's not really a flag, it's a value. I
> > can put it back though.
>
> BPF_F_ prefix has smaller chance of conflicts.
> I wish we did that sooner.
> In retrospect BPF_ANY, BPF_EXIST were poorly picked names.

OK, I'll send out a v3 shortly.

Re: [PATCHv2 bpf 1/2] bpf: Support sk lookup in netns with id 0

2018-11-30 Thread Alexei Starovoitov

On Fri, Nov 30, 2018 at 03:18:25PM -0800, Joe Stringer wrote:
> On Fri, 30 Nov 2018 at 14:42, Alexei Starovoitov
>  wrote:
> >
> > On Thu, Nov 29, 2018 at 04:29:33PM -0800, Joe Stringer wrote:
> > > David Ahern and Nicolas Dichtel report that the handling of the netns id
> > > 0 is incorrect for the BPF socket lookup helpers: rather than finding
> > > the netns with id 0, it is resolving to the current netns. This renders
> > > the netns_id 0 inaccessible.
> > >
> > > To fix this, adjust the API for the netns to treat all negative s32
> > > values as a lookup in the current netns, while any values with a
> > > positive value in the signed 32-bit integer space would result in a
> > > lookup for a socket in the netns corresponding to that id. As before, if
> > > the netns with that ID does not exist, no socket will be found.
> > > Furthermore, if any bits are set in the upper 32-bits, then no socket
> > > will be found.
> > >
> > > Signed-off-by: Joe Stringer 
> > ..
> > > +/* Current network namespace */
> > > +#define BPF_CURRENT_NETNS(-1L)
> >
> > I was about to apply it, but then noticed that the name doesn't match
> > the rest of the names.
> > Could you rename it to BPF_F_CURRENT_NETNS ?
> 
> I skipped the F_ part since it's not really a flag, it's a value. I
> can put it back though.

BPF_F_ prefix has smaller chance of conflicts.
I wish we did that sooner.
In retrospect BPF_ANY, BPF_EXIST were poorly picked names.

Re: [PATCH bpf-next 1/2] bpf: add BPF_LWT_ENCAP_IP option to bpf_lwt_push_encap

2018-11-30 Thread Peter Oskolkov

Thanks, David! This is for egress only, so I'll add an appropriate
check. I'll also address your other comments/concerns in a v2 version
of this patchset.
On Fri, Nov 30, 2018 at 12:08 PM David Ahern  wrote:
>
> On 11/28/18 6:34 PM, Peter Oskolkov wrote:
> > On Wed, Nov 28, 2018 at 4:47 PM David Ahern  wrote:
> >>
> >> On 11/28/18 5:22 PM, Peter Oskolkov wrote:
> >>> diff --git a/net/core/filter.c b/net/core/filter.c
> >>> index bd0df75dc7b6..17f3c37218e5 100644
> >>> --- a/net/core/filter.c
> >>> +++ b/net/core/filter.c
> >>> @@ -4793,6 +4793,60 @@ static int bpf_push_seg6_encap(struct sk_buff 
> >>> *skb, u32 type, void *hdr, u32 len
> >>>  }
> >>>  #endif /* CONFIG_IPV6_SEG6_BPF */
> >>>
> >>> +static int bpf_push_ip_encap(struct sk_buff *skb, void *hdr, u32 len)
> >>> +{
> >>> + struct dst_entry *dst;
> >>> + struct rtable *rt;
> >>> + struct iphdr *iph;
> >>> + struct net *net;
> >>> + int err;
> >>> +
> >>> + if (skb->protocol != htons(ETH_P_IP))
> >>> + return -EINVAL;  /* ETH_P_IPV6 not yet supported */
> >>> +
> >>> + iph = (struct iphdr *)hdr;
> >>> +
> >>> + if (unlikely(len < sizeof(struct iphdr) || len > 
> >>> LWTUNNEL_MAX_ENCAP_HSIZE))
> >>> + return -EINVAL;
> >>> + if (unlikely(iph->version != 4 || iph->ihl * 4 > len))
> >>> + return -EINVAL;
> >>> +
> >>> + if (skb->sk)
> >>> + net = sock_net(skb->sk);
> >>> + else {
> >>> + net = dev_net(skb_dst(skb)->dev);
> >>> + }
> >>> + rt = ip_route_output(net, iph->daddr, 0, 0, 0);
> >>
> >> That is a very limited use case. e.g., oif = 0 means you are not
> >> considering any kind of policy routing (e.g., VRF).
> >
> > Hi David! Could you be a bit more specific re: what you would like to
> > see here? Thanks!
> >
>
> Is the encap happening on ingress or egress? Seems like the current code
> does not assume either direction for lwt (BPF_PROG_TYPE_LWT_IN vs
> BPF_PROG_TYPE_LWT_OUT), yet your change does - output only. Basically,
> you should be filling in a flow struct and doing a proper lookup.
>
> When the potentially custom encap header is pushed on, seems to me skb
> marks should still be honored for the route lookup. If not, you should
> handle that in the API.
>
> From there skb->dev at a minimum should be used as either iif (ingress)
> or oif (egress).
>
> The iph is already set so you have quick access to the tos.
>
> Also, this should implement IPv6 as well before going in.

Re: [PATCH bpf-next] bpf: allow BPF read access to qdisc pkt_len

2018-11-30 Thread Willem de Bruijn

On Fri, Nov 30, 2018 at 5:48 PM Song Liu  wrote:
>
> On Fri, Nov 30, 2018 at 12:09 PM Willem de Bruijn
>  wrote:
> >
> > From: Petar Penkov 
> >
> > The pkt_len field in qdisc_skb_cb stores the skb length as it will
> > appear on the wire after segmentation. For byte accounting, this value
> > is more accurate than skb->len. It is computed on entry to the TC
> > layer, so only valid there.
> >
> > Allow read access to this field from BPF tc classifier and action
> > programs. The implementation is analogous to tc_classid, aside from
> > restricting to read access.
> >
> > Signed-off-by: Petar Penkov 
> > Signed-off-by: Vlad Dumitrescu 
> > Signed-off-by: Willem de Bruijn 
> > ---
> >  include/uapi/linux/bpf.h|  1 +
> >  net/core/filter.c   | 16 +++
> >  tools/include/uapi/linux/bpf.h  |  1 +
> >  tools/testing/selftests/bpf/test_verifier.c | 32 +
> >  4 files changed, 50 insertions(+)
>
> Please split this into 3 patches:
> 1 for include/uapi/linux/bpf.h and filter.c
> 1 for tools/include/uapi/linux/bpf.h
> 1 for tools/testing/selftests/bpf/test_verifier.c
>
> Other than this
> Acked-by: Song Liu 

Thanks for the fast review.

I'm happy to resend in three parts, of course, but am curious: what is
the rationale for splitting this up?

This patch follows the process for commit  f11216b24219 ("bpf: add
skb->tstamp r/w access from tc clsact and cg skb progs"), which went
in as a single patch just last week.

Re: consistency for statistics with XDP mode

2018-11-30 Thread Saeed Mahameed

On Fri, 2018-11-30 at 15:30 -0500, Michael S. Tsirkin wrote:
> On Fri, Nov 30, 2018 at 08:10:58PM +, Saeed Mahameed wrote:
> > On Thu, 2018-11-22 at 18:00 +0100, Toke Høiland-Jørgensen wrote:
> > > David Ahern  writes:
> > > 
> > > > On 11/22/18 1:26 AM, Toke Høiland-Jørgensen wrote:
> > > > > Saeed Mahameed  writes:
> > > > > 
> > > > > > > > I'd say it sounds reasonable to include XDP in the
> > > > > > > > normal
> > > > > > > > traffic
> > > > > > > > counters, but having the detailed XDP-specific counters
> > > > > > > > is
> > > > > > > > quite
> > > > > > > > useful
> > > > > > > > as well... So can't we do both (for all drivers)?
> > > > > > > > 
> > > > > > 
> > > > > > What are you thinking ? 
> > > > > > reporting XDP_DROP in interface dropped counter ?
> > > > > > and XDP_TX/REDIRECT in the TX counter ?
> > > > > > XDP_ABORTED in the  err/drop counter ?
> > > > > > 
> > > > > > how about having a special XDP command in the .ndo_bpf that
> > > > > > would query
> > > > > > the standardized XDP stats ?
> > > > > the XDP-specific stats are useful to have separately as well
> > > > > :)
> > > > > 
> > > > 
> > > > I would like to see basic packets, bytes, and dropped counters
> > > > tracked
> > > > for Rx and Tx via the standard netdev counters for all
> > > > devices. 
> > 
> > The problem of reporting XDP_DROP in the netedev drop counter is
> > that
> > they don't fit this counter description : "no space in linux
> > buffers"
> > and it will be hard for the user to determine whether these drops
> > are
> > coming from XDP or because no buffer is available, which will make
> > it
> > impossible to estimate packet rate performance without looking at
> > ethtool stats.
> > And reporting XDP_DROP in the netdev rx packets counter is somehow
> > misleading.. since those packets never made it out of this
> > driver.. 
> > 
> > 
> > And reporting XDP_DROP in the netdev rx packets counter is somehow
> > misleading.. since those packets never made it out of this driver..
> 
> I think I agree. XDP needs minimal overhead - if user wants to do
> counters then user can via maps. And in a sense XDP dropping packet
> is much like e.g. TCP dropping packet - it is not counted
> against the driver since it's not driver's fault.

So we should count all XDP RX packets as successful rx packets i.e
netdev->stats.rx_packets++; regardless of the XDP program decision ? 

this implies that XDP_TX packets will be counted twice once in 
netdev->stats.rx_packets and once in netdev->stats.tx_packets

I think this is the only valid option if we are going to use standard
netdev stats for XDP use cases.

Re: [PATCH net-next] net: reorder flowi_common fields to avoid holes

2018-11-30 Thread David Miller

From: Paolo Abeni 
Date: Wed, 28 Nov 2018 17:37:53 +0100

> the flowi* structures are used and memsetted by server functions
> in critical path. Currently flowi_common has a couple of holes that
> we can eliminate reordering the struct fields. As a side effect,
> both flowi4 and flowi6 shrink by 8 bytes.
 ...
> Signed-off-by: Paolo Abeni 

Applied, thanks.

Re: [PATCHv3 bpf 1/2] bpf: Support sk lookup in netns with id 0

2018-11-30 Thread Joey Pabalinas

On Fri, Nov 30, 2018 at 02:36:57PM -1000, Joey Pabalinas wrote:
> On Fri, Nov 30, 2018 at 03:32:20PM -0800, Joe Stringer wrote:
> > + * the netns associated with the *ctx*. *netns* values beyond the
> > + * range of 32-bit integers are reserved for future use.
> 
> Would adding a word or two before "*netns*" here be helpful to placate
> the english pedants among us (such as myself)? Starting a sentence with
> a lowercase word, even if it's a variable name, just never really sits
> right with me.
> 
>   Any *netns* values ...
> 
> Doesn't that kind of flow a bit better anyway?

Hah,

> Anyway, now with that unimportant nitpick if off my chest, the rest
^
|
I make an absolutely terrible pedant. --

-- 
Cheers,
Joey Pabalinas


signature.asc
Description: PGP signature

Re: [Patch net] mlx5: fix get_ip_proto()

2018-11-30 Thread David Miller

From: Cong Wang 
Date: Wed, 28 Nov 2018 15:04:05 -0800

> IP header is not necessarily located right after struct ethhdr,
> there could be multiple 802.1Q headers in between, this is why
> we call __vlan_get_protocol().
> 
> Fixes: fe1dc069990c ("net/mlx5e: don't set CHECKSUM_COMPLETE on SCTP packets")
> Cc: Alaa Hleihel 
> Cc: Or Gerlitz 
> Cc: Saeed Mahameed 
> Signed-off-by: Cong Wang 

Applied.

Re: [PATCH net] tun: forbid iface creation with rtnl ops

2018-11-30 Thread David Miller

From: Nicolas Dichtel 
Date: Thu, 29 Nov 2018 14:45:39 +0100

> It's not supported right now (the goal of the initial patch was to support
> 'ip link del' only).
> 
> Before the patch:
> $ ip link add foo type tun
> [  239.632660] BUG: unable to handle kernel NULL pointer dereference at 
> 
> [snip]
> [  239.636410] RIP: 0010:register_netdevice+0x8e/0x3a0
> 
> This panic occurs because dev->netdev_ops is not set by tun_setup(). But to
> have something usable, it will require more than just setting
> netdev_ops.
> 
> Fixes: f019a7a594d9 ("tun: Implement ip link del tunXXX")
> CC: Eric W. Biederman 
> Signed-off-by: Nicolas Dichtel 

Super old bug, scary.

Applied, thanks.

[PATCHv3 bpf 1/2] bpf: Support sk lookup in netns with id 0

2018-11-30 Thread Joe Stringer

David Ahern and Nicolas Dichtel report that the handling of the netns id
0 is incorrect for the BPF socket lookup helpers: rather than finding
the netns with id 0, it is resolving to the current netns. This renders
the netns_id 0 inaccessible.

To fix this, adjust the API for the netns to treat all negative s32
values as a lookup in the current netns (including u64 values which when
truncated to s32 become negative), while any values with a positive
value in the signed 32-bit integer space would result in a lookup for a
socket in the netns corresponding to that id. As before, if the netns
with that ID does not exist, no socket will be found. Any netns outside
of these ranges will fail to find a corresponding socket, as those
values are reserved for future usage.

Signed-off-by: Joe Stringer 
Acked-by: Nicolas Dichtel 
---
 include/uapi/linux/bpf.h  | 35 ++---
 net/core/filter.c | 11 +++---
 tools/include/uapi/linux/bpf.h| 39 ---
 tools/testing/selftests/bpf/bpf_helpers.h |  4 +-
 .../selftests/bpf/test_sk_lookup_kern.c   | 18 -
 5 files changed, 63 insertions(+), 44 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 852dc17ab47a..ad68b472dad2 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2170,7 +2170,7 @@ union bpf_attr {
  * Return
  * 0 on success, or a negative error in case of failure.
  *
- * struct bpf_sock *bpf_sk_lookup_tcp(void *ctx, struct bpf_sock_tuple *tuple, 
u32 tuple_size, u32 netns, u64 flags)
+ * struct bpf_sock *bpf_sk_lookup_tcp(void *ctx, struct bpf_sock_tuple *tuple, 
u32 tuple_size, u64 netns, u64 flags)
  * Description
  * Look for TCP socket matching *tuple*, optionally in a child
  * network namespace *netns*. The return value must be checked,
@@ -2187,12 +2187,14 @@ union bpf_attr {
  * **sizeof**\ (*tuple*\ **->ipv6**)
  * Look for an IPv6 socket.
  *
- * If the *netns* is zero, then the socket lookup table in the
- * netns associated with the *ctx* will be used. For the TC hooks,
- * this in the netns of the device in the skb. For socket hooks,
- * this in the netns of the socket. If *netns* is non-zero, then
- * it specifies the ID of the netns relative to the netns
- * associated with the *ctx*.
+ * If the *netns* is a negative signed 32-bit integer, then the
+ * socket lookup table in the netns associated with the *ctx* will
+ * will be used. For the TC hooks, this is the netns of the device
+ * in the skb. For socket hooks, this is the netns of the socket.
+ * If *netns* is any other signed 32-bit value greater than or
+ * equal to zero then it specifies the ID of the netns relative to
+ * the netns associated with the *ctx*. *netns* values beyond the
+ * range of 32-bit integers are reserved for future use.
  *
  * All values for *flags* are reserved for future usage, and must
  * be left at zero.
@@ -2202,7 +2204,7 @@ union bpf_attr {
  * Return
  * Pointer to *struct bpf_sock*, or NULL in case of failure.
  *
- * struct bpf_sock *bpf_sk_lookup_udp(void *ctx, struct bpf_sock_tuple *tuple, 
u32 tuple_size, u32 netns, u64 flags)
+ * struct bpf_sock *bpf_sk_lookup_udp(void *ctx, struct bpf_sock_tuple *tuple, 
u32 tuple_size, u64 netns, u64 flags)
  * Description
  * Look for UDP socket matching *tuple*, optionally in a child
  * network namespace *netns*. The return value must be checked,
@@ -2219,12 +2221,14 @@ union bpf_attr {
  * **sizeof**\ (*tuple*\ **->ipv6**)
  * Look for an IPv6 socket.
  *
- * If the *netns* is zero, then the socket lookup table in the
- * netns associated with the *ctx* will be used. For the TC hooks,
- * this in the netns of the device in the skb. For socket hooks,
- * this in the netns of the socket. If *netns* is non-zero, then
- * it specifies the ID of the netns relative to the netns
- * associated with the *ctx*.
+ * If the *netns* is a negative signed 32-bit integer, then the
+ * socket lookup table in the netns associated with the *ctx* will
+ * will be used. For the TC hooks, this is the netns of the device
+ * in the skb. For socket hooks, this is the netns of the socket.
+ * If *netns* is any other signed 32-bit value greater than or
+ * equal to zero then it specifies the ID of the netns relative to
+ * the netns associated with the *ctx*. *netns* values beyond the
+ * range of 32-bit integers are reserved for future use.
  *
  * All values for *flags* are

[PATCHv3 bpf 2/2] bpf: Improve socket lookup reuseport documentation

2018-11-30 Thread Joe Stringer

Improve the wording around socket lookup for reuseport sockets, and
ensure that both bpf.h headers are in sync.

Signed-off-by: Joe Stringer 
---
 include/uapi/linux/bpf.h   | 4 
 tools/include/uapi/linux/bpf.h | 8 
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index ad68b472dad2..47f620b5cc5c 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2203,6 +2203,8 @@ union bpf_attr {
  * **CONFIG_NET** configuration option.
  * Return
  * Pointer to *struct bpf_sock*, or NULL in case of failure.
+ * For sockets with reuseport option, the *struct bpf_sock*
+ * result is from reuse->socks[] using the hash of the tuple.
  *
  * struct bpf_sock *bpf_sk_lookup_udp(void *ctx, struct bpf_sock_tuple *tuple, 
u32 tuple_size, u64 netns, u64 flags)
  * Description
@@ -2237,6 +2239,8 @@ union bpf_attr {
  * **CONFIG_NET** configuration option.
  * Return
  * Pointer to *struct bpf_sock*, or NULL in case of failure.
+ * For sockets with reuseport option, the *struct bpf_sock*
+ * result is from reuse->socks[] using the hash of the tuple.
  *
  * int bpf_sk_release(struct bpf_sock *sk)
  * Description
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index de2072ef475b..47f620b5cc5c 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2203,8 +2203,8 @@ union bpf_attr {
  * **CONFIG_NET** configuration option.
  * Return
  * Pointer to *struct bpf_sock*, or NULL in case of failure.
- * For sockets with reuseport option, *struct bpf_sock*
- * return is from reuse->socks[] using hash of the packet.
+ * For sockets with reuseport option, the *struct bpf_sock*
+ * result is from reuse->socks[] using the hash of the tuple.
  *
  * struct bpf_sock *bpf_sk_lookup_udp(void *ctx, struct bpf_sock_tuple *tuple, 
u32 tuple_size, u64 netns, u64 flags)
  * Description
@@ -2239,8 +2239,8 @@ union bpf_attr {
  * **CONFIG_NET** configuration option.
  * Return
  * Pointer to *struct bpf_sock*, or NULL in case of failure.
- * For sockets with reuseport option, *struct bpf_sock*
- * return is from reuse->socks[] using hash of the packet.
+ * For sockets with reuseport option, the *struct bpf_sock*
+ * result is from reuse->socks[] using the hash of the tuple.
  *
  * int bpf_sk_release(struct bpf_sock *sk)
  * Description
-- 
2.19.1

Re: BPF uapi structures and 32-bit

2018-11-30 Thread Daniel Borkmann

On 12/01/2018 12:33 AM, Alexei Starovoitov wrote:
> On Wed, Nov 28, 2018 at 11:02:00AM -0800, David Miller wrote:
>> From: Daniel Borkmann 
>> Date: Wed, 28 Nov 2018 11:34:55 +0100
>>
>>> Yeah fully agree. Thinking diff below should address it, do you
>>> have a chance to give this a spin for sparc / 32 bit to check if
>>> test_verifier still explodes?
>>
>> Great, let me play with this.
>>
>> I did something simpler yesterday, just changing the data pointers to
>> "u64" and that made at least one test pass that didn't before :-)
>>
>> I'll get back to you with results.
> 
> Did you have a chance to test it ?

David got back to me and mentioned it worked fine on sparc.

> I'd like to add a tested-by before I apply Daniel's patch
> which looks good to me. btw.

Yeah, wanted to cook an official patch today, let me do that now.

Re: [PATCH bpf-next 1/2] bpf: add BPF_LWT_ENCAP_IP option to bpf_lwt_push_encap

2018-11-30 Thread Peter Oskolkov

On Fri, Nov 30, 2018 at 3:52 PM David Ahern  wrote:
>
> On 11/30/18 4:35 PM, Peter Oskolkov wrote:
> > Thanks, David! This is for egress only, so I'll add an appropriate
> > check. I'll also address your other comments/concerns in a v2 version
> > of this patchset.
>
> Why are you limiting this to egress only?

I'm focusing on egress because this is a use case that we have and
understand well, and would like to have a solution for sooner rather
than later.

Without understanding why anybody would want to do lwt-bpf encap on
ingress, I don't know, for example, what a good test would look like;
but I'd be happy to look into a specific ingress use case if you have
one.

Re: [PATCH net-next] tun: implement carrier change

2018-11-30 Thread David Miller

From: Nicolas Dichtel 
Date: Wed, 28 Nov 2018 19:12:56 +0100

> The userspace may need to control the carrier state.
> 
> Signed-off-by: Nicolas Dichtel 
> Signed-off-by: Didier Pallard 

Applied.

Re: [PATCH net 0/3] fixes in timeout and retransmission accounting

2018-11-30 Thread David Miller

From: Yuchung Cheng 
Date: Wed, 28 Nov 2018 16:06:42 -0800

> This patch set has assorted fixes of minor accounting issues in
> timeout, window probe, and retransmission stats.

Looks good, series applied.

1 2 >

1 - 100 of 181 matches

Mail list logo